Socorro/Hang Processing Proposal
The current system of processing plugin hang reports in crash-stats is not producing especially useful data, and with the introduction of Flash protected mode it is almost useless. This is a proposal by bsmedberg to radically change how hangs are processed in Socorro/crash-stats.
The current procedure when a plugin hang occurs involves submitting two linked reports:
- One report processtype="" (browser) hangid="UUID"
- One report processtype="plugin" hangid="UUID"
These reports are submitted separately and are linked only by the client-generated hang UUID. They are processed separately, have separate signatures, and are cross-linked only minimally. Correlating data across both reports cannot be done directly in the SQL database and typically requires external processing.
Submitting All Minidumps In One Report
Instead of submitting separate reports, I would like to submit all of the information in one hang report. This may include two or more minidumps as well as a single metadata blob.
- Each hang report will generate a single signature (exact algorithm TBD, but probably focusing on the plugin-side stack at first).
- Each hang report will generate a single report ID in both SQL and hbase.
- Each of the minidumps will be stored in the same hbase row in a separate column:
  - browser_dump:dump
  - plugin_dump:dump
  - flash_dump1:dump (optional)
  - flash_dump2:dump (optional)
Since the existing data is generally poor, we shouldn't worry about it too much; it's mainly used for large aggregate counting.
- Add processor support for submitting and retrieving all minidumps in a single report.
- Stop processing separate browser-side hang reports
- (optional) Do hbase magic to migrate existing hang pairs for non-release builds into the single hbase row using the child report ID
- (optional) Remove existing plugin-browser reports
- Add Firefox support for submitting both browser/p-c minidumps in a single report.
- Add Firefox support for submitting minidumps for the Flash processes. NOTE: this means that most hang reports will end up with *four* minidumps needing to be processed and stored, instead of the current two.
- Add crash-stats frontend support for displaying multiple minidump stacks
Socorro Implementation Ramifications
This is a major change to how crashes are collected, processed and displayed. Changes will have to be made in the backend, middleware and UI components.
Currently the collector reads the binary minidump as a stream. The collector code will have to be modified to read as many streams as are offered in the crash. The minidumps will have to either be named or enumerated so that they can be stored and retrieved individually.
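As a rough illustration of the "named" option, the sketch below separates an already-parsed submission into metadata and named minidumps. It is not the collector's actual code: the function name `split_submission`, the assumption that dump payloads arrive as bytes while metadata arrives as strings, and the use of the proposed column names as form field names are all hypothetical.

```python
def split_submission(fields):
    """Separate named minidumps from plain metadata fields.

    `fields` maps form field names to values. In this sketch, any
    bytes value is assumed to be a minidump payload and anything else
    is assumed to be metadata (a hypothetical convention).
    """
    metadata = {}
    dumps = {}
    for name, value in fields.items():
        if isinstance(value, (bytes, bytearray)):
            # Keep the field name so each dump can be stored and
            # retrieved individually later.
            dumps[name] = bytes(value)
        else:
            metadata[name] = value
    return metadata, dumps


metadata, dumps = split_submission({
    "ProductName": "Firefox",
    "browser_dump": b"MDMP-browser",
    "plugin_dump": b"MDMP-plugin",
})
```

Keying the dumps by name, rather than by position, means a report with an optional Flash dump missing does not shift the meaning of the remaining dumps.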
The size of a given crash will increase. In 2009, 8% of crash submissions failed because of timeouts in the crash data upload. We theorized that these were crashes from people on dial-up lines. We need to see what that upload failure rate is in 2012 and project the ramifications of the increased crash size.
The interface for the crash storage sub-system will need to be extended to allow for multiple minidumps. For filesystem storage (collector storage), this will mean a new naming scheme to differentiate the minidumps. For HBase this will have to be coordinated with metrics on storage. We need to look at total storage ramifications: how much will this increase the average size of a crash? Minidump compression has been mentioned.
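One possible filesystem naming scheme, sketched here purely for discussion, embeds the dump name in the file name. The `<crash_id>.<dump_name>.dump` pattern and the helper `dump_path` are assumptions made for this example, not an existing Socorro convention.

```python
import os


def dump_path(root, crash_id, dump_name):
    """Build a per-dump file path under `root`.

    Hypothetical scheme: <crash_id>.<dump_name>.dump, so that all
    dumps of one crash sort together and stay individually addressable.
    """
    # Restrict dump names to letters, digits and underscores so a
    # client-supplied name cannot escape the storage directory.
    if not dump_name or not dump_name.replace("_", "").isalnum():
        raise ValueError("unsafe dump name: %r" % (dump_name,))
    return os.path.join(root, "%s.%s.dump" % (crash_id, dump_name))


paths = [
    dump_path("/var/socorro", "0bba929f", name)
    for name in ("browser_dump", "plugin_dump")
]
```

Validating the name matters because, under this proposal, the dump names would come from the client submission.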
The processor will be responsible for feeding all minidumps to MDSW. Processing multiple minidumps serially will significantly increase the amount of time that it takes to process a crash. However, since MDSW is invoked as a separate process, it would not be too difficult to run the invocations in parallel.
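A minimal sketch of the parallel approach follows, assuming the stackwalker can be started as `command_prefix + [dump_path]`. The function names and the shape of `command_prefix` are hypothetical; the real MDSW command line and the processor's existing invocation code are not shown here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command_prefix, dump_path):
    """Run one external stackwalk process and capture its output."""
    result = subprocess.run(
        list(command_prefix) + [dump_path],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


def run_all(command_prefix, dump_paths):
    """Run the command once per dump, all at the same time.

    `dump_paths` maps a dump name to its file path. Returns a dict of
    dump name -> (return code, stdout).
    """
    if not dump_paths:
        return {}
    # Threads are enough here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=len(dump_paths)) as pool:
        futures = {
            name: pool.submit(run_one, command_prefix, path)
            for name, path in dump_paths.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

With four dumps per hang, the wall-clock cost would then be roughly that of the slowest single invocation rather than the sum of all four, at the price of higher peak CPU and memory use on the processor host.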
The existing Hang Pair code will have to be excised from the processor.
Signature generation: a signature is generated from the output of MDSW. With multiple minidumps in a single crash, does the processor make a collection of signatures?
Answer from bsmedberg: The goal is to end up with one signature. At first this will just be the signature of the "plugin" minidump/MDSW. In future iterations we will probably want to use information from the other output to modify the signature.
The API will need extensions to take into account multiple minidumps. The get crash service is the first example that comes to mind.
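The "one signature, taken from the plugin dump" rule for the first iteration could look roughly like the sketch below. The function name, the `plugin_dump` key, and the choice to return `None` when the plugin dump produced no signature are assumptions for illustration only.

```python
def overall_signature(signatures, primary="plugin_dump"):
    """Pick the single signature for a multi-dump hang report.

    `signatures` maps a dump name to the signature generated from that
    dump's MDSW output. In this first-iteration sketch only the primary
    (plugin) dump is consulted; the other dumps are ignored.
    """
    return signatures.get(primary)


signature = overall_signature({
    "browser_dump": "hang | mozilla::plugins::PPluginInstanceParent::CallNPP_Destroy",
    "plugin_dump": "hang | F_1152915508",
})
```

Keeping the per-dump signatures in a dict leaves room for later iterations to fold information from the other dumps into the final signature.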
Displaying the output of multiple MDSW runs will require revamping the UI. bsmedberg says: At first this can be minimal: the main "Details" page can just display the plugin minidump; the other MDSW outputs should all be listed in the "Raw Dump" tab.