QA/TDAI/Crash URL Analysis
The Goal
We need a better way to analyze the myriad of crash reports coming in. Some of these reports arrive with handy URLs, and we have a semi-automated method for trying to reproduce those crashes. This proposal is to completely automate that process.
We need a way to take these URLs and analyze them to find out whether they are reproducible and exploitable. Once we have that analysis, we need a way to log which crashes are reproducible and exploitable and make that information available so that some number of trusted eyes can see it. This way we get a bit more redundancy in crash detection and an early jump on fixing these bugs. The goal is ultimately to reduce our security exposure to crashes that are found in the wild.
The Plan
Step 1: Gathering
We will take a daily dump of URLs and put them into a queue. We will then potentially filter the queue to remove URLs that are known to be invalid (such as "about:blank"). We will also filter the queue into OS-specific lists so that each URL is targeted at the proper OS.
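A minimal sketch of what this filtering pass could look like, assuming a plain-text dump with one URL per line; the dump file name, the allowed schemes, and the per-OS fan-out below are placeholders, not a description of the real Socorro dump format:

    # Hypothetical Step 1 filter: drop unloadable URLs, fan out per OS.
    from urllib.parse import urlparse

    ALLOWED_SCHEMES = ("http", "https")

    def filter_urls(raw_urls):
        """Drop URLs we know we cannot usefully reload (about:blank, chrome://, etc.)."""
        for url in raw_urls:
            url = url.strip()
            if not url or urlparse(url).scheme.lower() not in ALLOWED_SCHEMES:
                continue
            yield url

    def build_queues(raw_urls, os_names=("windows", "linux", "mac")):
        """Fan the filtered list out into one queue per target OS."""
        urls = list(filter_urls(raw_urls))
        return {os_name: list(urls) for os_name in os_names}

    if __name__ == "__main__":
        with open("daily_url_dump.txt") as dump:   # placeholder file name
            queues = build_queues(dump)
        for os_name, queue in queues.items():
            print(os_name, len(queue))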
(Phase 2 Addition:) We will also want to filter out URLs for crashes that we have already reproduced and that have bugs filed as NEW. Once such a bug is verified, we want to start running the analysis on that URL again (to ensure the bug remains fixed afterward).
(Phase 2 Addition:) There should also be a way to feed in a manually generated queue, both for things that are reported outside the normal channels and for bugs like the ones above.
Step 2: Reproduction
The daemon on each VM/machine will take the queue and run through each URL. For each URL it will do the following (a rough sketch of this loop appears after the list):
- Open a browser
- Load the URL
- Wait a moment
- Try to manipulate the page with actions like resizing the window and other interactions (future)
- Possibly also spider out to links off the page, to catch the next click a user might have taken that resulted in the crash (future)
- Report whether or not a crash occurred and log that fact
- Capture and upload a minidump/core dump from the crash
- Save the page locally using wget -p
- Attempt to run !exploitable/CrashWrangler on the crash and log that output
- Upload the log of results, the zipped page files, the !exploitable/CrashWrangler output, and the minidump to the Reporting Service
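A rough sketch of that per-URL loop, under several explicit assumptions: the browser binary path, the minidump location, the "triage-tool" command standing in for !exploitable/CrashWrangler, and the omission of the final upload are all placeholders for whatever each platform really uses:

    # Hypothetical reproduction step for one URL; paths and tool names are
    # assumptions, not the real deployment configuration.
    import glob, os, subprocess, time, zipfile

    BROWSER = "/usr/bin/firefox"                       # assumed browser binary
    DUMP_DIR = os.path.expanduser("~/.mozilla/firefox/Crash Reports/pending")
    WAIT_SECONDS = 30

    def run_one_url(url, workdir):
        os.makedirs(workdir, exist_ok=True)
        dumps_before = set(glob.glob(os.path.join(DUMP_DIR, "*.dmp")))

        # Open a browser, load the URL, wait a moment, then shut it down.
        proc = subprocess.Popen([BROWSER, url])
        time.sleep(WAIT_SECONDS)
        crashed = proc.poll() is not None              # process died on its own
        if not crashed:
            proc.terminate()
            proc.wait()

        # A new minidump appearing is the other signal that we crashed.
        new_dumps = set(glob.glob(os.path.join(DUMP_DIR, "*.dmp"))) - dumps_before
        crashed = crashed or bool(new_dumps)

        # Save the page locally for later analysis/reduction.
        subprocess.call(["wget", "-p", "-P", workdir, url])

        # Hand any minidump to the triage tool; "triage-tool" is a stand-in
        # for the platform's !exploitable or CrashWrangler wrapper.
        for dump in new_dumps:
            with open(os.path.join(workdir, "triage.log"), "a") as log:
                subprocess.call(["triage-tool", dump], stdout=log, stderr=log)

        # Zip everything up for the upload to the Reporting Service
        # (the upload itself is omitted here).
        archive = os.path.join(workdir, "results.zip")
        with zipfile.ZipFile(archive, "w") as zipped:
            for root, _, files in os.walk(workdir):
                for name in files:
                    if name != "results.zip":
                        zipped.write(os.path.join(root, name))
        return crashed, archive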
The daemon that runs the analysis should have the ability to run a configurable hook that does more testing with the URL loaded. There are many different things the hook can do, such as the following (a sketch of one possible hook interface appears after the Page Manipulation list):
Page Manipulation
- Zooming
- Resizing
- Following some number of random links from the page (the page reported in the crash reporter is usually the one the user was viewing immediately prior to the crash, so by clicking some links on that page we might find the page the user actually crashed on)
- Print preview
- Bubbling events through the page (mouse-overs, random clicks, etc.)
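One way the configurable hook could be structured is as a registry of named manipulations, with the active set chosen from configuration (which could eventually come from the database, per the TODO in the Proposed Design section). The hook names and the browser driver object below are purely illustrative and do not assume any particular automation framework:

    # Hypothetical hook registry; `browser` is whatever driver object the
    # daemon hands to hooks, and its methods below are placeholders.
    HOOKS = {}

    def hook(name):
        """Decorator that registers a page-manipulation hook under a name."""
        def register(fn):
            HOOKS[name] = fn
            return fn
        return register

    @hook("resize")
    def resize_window(browser):
        for width, height in [(1024, 768), (640, 480), (1600, 1200)]:
            browser.set_window_size(width, height)     # placeholder method

    @hook("zoom")
    def zoom_page(browser):
        for level in (0.5, 1.0, 2.0):
            browser.set_zoom(level)                    # placeholder method

    def run_hooks(browser, enabled_names):
        """Run whichever hooks the queue entry (or web app config) asked for."""
        for name in enabled_names:
            HOOKS[name](browser)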
Web Content Manipulation
(Phase 2 items, in order of priority from highest to lowest)
- Test accounts on popular web apps (Google, Facebook, MySpace, Orkut, etc.). The hook would ensure that the guest user was logged into the needed account before attempting to load the URL.
- Testing with un-clean profiles, large profiles, extensions installed, etc.
- Testing with plugins installed
- Testing with virus scanners active on the VM/machine (out of band; will require automation of VM snapshots, and possibly a manual step or a dedicated VM for this)
- Using VM settings to vary network, RAM, and graphics card configurations
- Checking for known exploits to determine whether specific crash URLs are malware installers
(Phase 3 items, in order of priority from highest to lowest)
- Once we have a reproducible crash, we could try running an automatic miniaturization tool on it to generate a possible testcase (a rough sketch of such a reducer follows this list).
- Potentially use Valgrind builds for crashes we see over and over again, and/or for bugs that appear to be only semi-reproducible or that hang
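For illustration only, a very small chunk-removal reducer in the spirit of existing tools such as Lithium; still_crashes is a stand-in for re-running the reproduction step on a candidate version of the saved page:

    # Hypothetical miniaturization pass: repeatedly remove chunks of the saved
    # page and keep any removal after which the crash still reproduces.
    def minimize(lines, still_crashes):
        chunk = max(len(lines) // 2, 1)
        while chunk >= 1:
            i = 0
            while i < len(lines):
                candidate = lines[:i] + lines[i + chunk:]
                if candidate and still_crashes(candidate):
                    lines = candidate      # chunk was irrelevant; keep removal
                else:
                    i += chunk             # chunk was needed; move past it
            chunk //= 2
        return lines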
Proposed Design
To users of this service, it will be a web app.
The web app will populate a database of crashes on a predetermined schedule by grabbing a dump from Socorro (possibly by finding the Socorro dumps in a well-known location).
Users will also be able to enter crash URLs by hand, filling out a form that submits those URLs into the database queue.
Once a URL goes into a queue, it is displayed on the results screen as "in progress".
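As a concrete illustration, here is a minimal sketch of the queue table and the two ways rows could get into it (the scheduled Socorro import and the manual form). SQLite and all table and column names below are assumptions made for this sketch only:

    # Hypothetical queue schema shared by the web app and the daemons.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS crash_urls (
        id        INTEGER PRIMARY KEY,
        url       TEXT NOT NULL,
        target_os TEXT,
        source    TEXT,                      -- 'socorro' or 'manual'
        status    TEXT DEFAULT 'in progress',
        result    TEXT
    );
    """

    def enqueue(conn, url, target_os, source):
        """Add one URL to the queue (used by both intake paths)."""
        conn.execute(
            "INSERT INTO crash_urls (url, target_os, source) VALUES (?, ?, ?)",
            (url, target_os, source),
        )
        conn.commit()

    def import_socorro_dump(conn, dump_path, target_os):
        """Scheduled import, assuming a dump with one URL per line."""
        with open(dump_path) as dump:
            for line in dump:
                if line.strip():
                    enqueue(conn, line.strip(), target_os, "socorro")

    if __name__ == "__main__":
        conn = sqlite3.connect("crash_urls.db")
        conn.executescript(SCHEMA)
        enqueue(conn, "http://example.com/", "linux", "manual")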
The daemon described above runs on a Windows VM, a Linux VM, and a Mac machine in the colo. It will query the database queue to get the next URL, and run it as above.
- TODO: Will we have the ability to specify how the daemon runs the tests? We could do that through the web app, and have the hook be configured by reading that information from the database.
- TODO: We will need to snapshot the machine and VM so that we can roll back to a snapshot and prevent these machines from being exploited by malware.
Once the daemon finishes a run, it zips up all log output, the minidump, etc., and posts a result message back into the database.
The user will see the URL change from "in progress" to having a result displayed.
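A matching sketch of how a daemon could claim work from that queue and post its result back, again using the hypothetical crash_urls table from the previous sketch rather than any real schema:

    # Hypothetical daemon-side queue interaction.
    import sqlite3

    def claim_next_url(conn, target_os):
        """Fetch the oldest still-pending URL for this platform, or None."""
        return conn.execute(
            "SELECT id, url FROM crash_urls"
            " WHERE status = 'in progress' AND target_os = ?"
            " ORDER BY id LIMIT 1",
            (target_os,),
        ).fetchone()

    def post_result(conn, url_id, crashed, archive_path):
        """Record the outcome so the web app can replace 'in progress'."""
        conn.execute(
            "UPDATE crash_urls SET status = ?, result = ? WHERE id = ?",
            ("crashed" if crashed else "no crash", archive_path, url_id),
        )
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect("crash_urls.db")
        row = claim_next_url(conn, "linux")
        if row is not None:
            url_id, url = row
            # run_one_url() from the reproduction sketch would run here
            post_result(conn, url_id, crashed=False, archive_path="")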
We're going to keep the design as simple as possible, and bring up only what is absolutely necessary at first.