Socorro/2011-Q2-Crashkill-Workflow-Improvements

Crashkill - Socorro meetup

There are several workflows for analyzing crashes. Part of today's focus is on what is missing from those workflows: features in the UI, etc. Kairo has been retriaging Socorro bugs, filing new ones, and prioritizing things for the roadmap.

Data retention

  • Old data: regressions
  • Signatures are very rough
    • Even stacks can be rough, multiple causes of same crash
  • Would like data about when a signature / stack first appeared (becomes less valuable over time)
    • Same signature can be caused by different things, might want to know when crash started appearing with greater frequency
  • Interesting questions are mostly about nightlies, not releases
    • Maybe keep nightly crashes forever, throw away most of release crashes
  • Answering different kinds of questions: code bugs vs. malware spikes
  • How much does data become less useful over time?
    • Can we keep part of the data? Remove minidumps?
    • Can we keep only derived data?
    • Perhaps keep dumps linked from bugs, as they're presumably important
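The retention ideas above could be combined into a single rule, roughly as sketched below. This is only an illustration of the discussion, not an agreed policy: the `CrashRecord` fields, the channel names, and the `keep_raw_dump` helper are hypothetical, and the six-month cutoff is borrowed from the action items.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: raw data older than about six months is a deletion candidate.
RAW_RETENTION = timedelta(days=182)


@dataclass
class CrashRecord:
    """Hypothetical minimal view of a crash report for retention decisions."""
    channel: str        # e.g. "nightly", "beta", "release"
    crash_date: date
    linked_bug: bool    # True if a bug links to this crash's dump


def keep_raw_dump(crash: CrashRecord, today: date) -> bool:
    """Decide whether the raw minidump should be kept."""
    # Nightly crashes answer most of the interesting questions: keep forever.
    if crash.channel == "nightly":
        return True
    # Dumps linked from bugs are presumably important: keep them too.
    if crash.linked_bug:
        return True
    # Everything else keeps its raw data only inside the retention window;
    # derived data (signature, metadata) could still be kept elsewhere.
    return today - crash.crash_date <= RAW_RETENTION
```

A rule like this only covers the minidump; whether derived data is kept indefinitely is a separate decision.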

Workflows

  • PDF docs from Sheila, she will also draw on the whiteboard

Browser Crashes

  • top crash lists
    • looking for trends
    • 7d, 3d
      • would like 1d
    • nightly
    • betas
    • new channels (aurora)
  • stability reports
    • crashes per 100 users
    • release comparisons
    • comp to nightly
    • distribution of crashes/build
    • explosive crashes
  • monitoring, figuring out which crashes are important (crash classification)
    • signature not a very good classification metric
  • investigation, finding data to fix a specific crash
  • verification that a crash is fixed
    • may not have unique signature
  • The way we generate signatures/crash bucketization needs to improve. dbaron, bsmedberg, xstevens, chofmann all have ideas on this problem. (see blog posts)
    • multiple sigs for one crash is one of these issues
    • componentization is hard to do from code/stack, can take a crack at it from bugzilla component
  • really important to present crashes by buildid as well as calendar time
  • 3rd party correlation
  • for a given signature, do we have many crashes versus one stack
  • signatures are a little bit better than useless
    • need to come up with a better way to identify crashes
    • kick off metrics project, need someone with a solid stats background
  • both better UI tools (for reclassifying on the fly) and automated tools are useful
    • faceted search should help for reclassifying on the fly
      • investigating ElasticSearch to provide this
    • looking at enabling developers to write and run Pig scripts (not M/R jobs, same goal)
    • user tagging / classification
  • need statistical summary of all metadata
    • right now users are looking at many reports and doing the correlation manually
    • how close to startup
    • do people have a certain version of Flash
    • surfacing version information is critical to reproducibility
  • daily jobs (for nightlies) are good candidates for nightly M/R jobs
  • user notes / discussion forums associated with a crash
  • isolate a regression window
    • way to link from a signature to a suspected piece of code
    • who, what, when, link to bug
    • dbaron reads through list of commit messages
    • having a link to commit messages for nightly would help
    • show with fewer clicks who has worked on code lately
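The "statistical summary of all metadata" item above amounts to comparing how often a metadata value shows up for one signature against how often it shows up across all crashes. A minimal sketch of that comparison follows. It assumes reports are plain dicts with a `signature` key; the `flash_version` field name and the `metadata_summary` helper are hypothetical, and this is not how any existing correlation report is implemented.

```python
from collections import Counter


def metadata_summary(reports, signature, field):
    """Map each value of `field` to (share within signature, share overall).

    A value whose share within the signature is much higher than its
    overall share is a candidate for manual follow-up.
    """
    overall = Counter(r.get(field) for r in reports)
    within = Counter(
        r.get(field) for r in reports if r.get("signature") == signature
    )
    n_all = sum(overall.values())
    n_sig = sum(within.values())
    if n_sig == 0:
        return {}  # no reports carry this signature
    return {
        value: (count / n_sig, overall[value] / n_all)
        for value, count in within.items()
    }


# Illustrative data only.
reports = [
    {"signature": "A", "flash_version": "10.2"},
    {"signature": "A", "flash_version": "10.2"},
    {"signature": "B", "flash_version": "10.2"},
    {"signature": "B", "flash_version": "10.1"},
]
summary = metadata_summary(reports, "A", "flash_version")
```

Here every "A" crash has Flash 10.2 while only three quarters of all crashes do, which is the kind of skew people currently spot by reading many reports by hand.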

Plugin Crashes

  • top crash list
    • plugin and browser crashes currently are listed independently
      • these are really report pairs
      • throttling means we don't always process both halves
  • "browser hang" terminology is confusing
    • it means the state of the browser when a plugin hangs

Hangs

  • isolate report pairs

Crashes

  • get a plugin report
    • outreach to plugin vendor

Short Release Cycle

  • need more than 3 "current versions" on crash-stats homepage
    • would rather have more links on the homepage than keep the long crash list
  • release-to-release comparison is missing
  • should rely less on ranks and more on crashes per ADU
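To make the crashes-per-ADU point concrete, here is a small sketch of the rate calculation, normalized per 100 active daily users as in the stability reports. The function name and the `throttle_rate` parameter are assumptions for illustration; scaling processed counts back up by the throttle rate is one plausible way to estimate the true crash count, not a description of what Socorro does.

```python
def crashes_per_100_adu(processed_crashes, adu, throttle_rate=1.0):
    """Estimate crashes per 100 active daily users.

    processed_crashes: number of reports actually processed
    adu:               active daily users for the same build/day
    throttle_rate:     fraction of submitted reports that get processed
    """
    if adu <= 0:
        raise ValueError("adu must be positive")
    if not 0 < throttle_rate <= 1:
        raise ValueError("throttle_rate must be in (0, 1]")
    # Scale the processed count back up to an estimate of all crashes.
    estimated_crashes = processed_crashes / throttle_rate
    return 100.0 * estimated_crashes / adu


# Illustrative numbers only: two releases with very different user counts.
old_release = crashes_per_100_adu(500, 1_000_000, throttle_rate=0.10)
new_release = crashes_per_100_adu(300, 200_000, throttle_rate=0.10)
```

With these made-up numbers the newer release has fewer raw crashes but a higher rate per user, which is the effect a rank-based list hides.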

Action items

  • Copy this to wiki.mozilla.org
  • data retention issue is critical; running out of disk space (laura)
    • need data size for each step of the process
      • identify places we are duplicating the information
    • throw away raw data after 6 months
  • Use ADU predictions from Blake Cutler's awesome ADU model until we have solid blocklist data
  • determine better method than signature to identify crashes (xstevens, metrics)
  • integrate dbaron's correlation reports into socorro (bug filed, #?)
    • also, separate reports for browser, plugin, plugin hang
  • look at Firefox 4.0 unthrottled data to determine a statistically significant throttle size; is 10% enough? (gilbert/daniel, metrics)
  • laura to follow up with smooney and kairo on UI changes and priorities
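For the throttle-size question in the action items, one rough way to frame "is 10% enough" is to ask how precisely a signature's share of crashes can be estimated from the sampled reports. The sketch below uses the normal approximation to the binomial and ignores the finite-population correction, so it is a back-of-the-envelope bound and not the analysis the metrics team would run. The function name and parameters are hypothetical.

```python
import math


def relative_margin(total_crashes, throttle_rate, signature_share, z=1.96):
    """Approximate relative margin of error for a signature's share.

    total_crashes:   crashes submitted in the period of interest
    throttle_rate:   fraction of submitted crashes that are processed
    signature_share: true fraction of crashes with this signature
    z:               normal quantile (1.96 for roughly 95% confidence)
    """
    expected_hits = total_crashes * throttle_rate * signature_share
    if expected_hits <= 0:
        raise ValueError("expected number of sampled crashes must be positive")
    # SE of a proportion is sqrt(p(1-p)/n); dividing by p gives the
    # relative error sqrt((1-p)/(n*p)), where n*p is expected_hits.
    return z * math.sqrt((1.0 - signature_share) / expected_hits)


# Illustrative: a signature that is 0.1% of one million crashes.
at_10_percent = relative_margin(1_000_000, 0.10, 0.001)
unthrottled = relative_margin(1_000_000, 1.00, 0.001)
```

Under these made-up numbers a 10% throttle leaves about 100 sampled reports for the signature, so its measured share is only good to within roughly 20% either way; rarer signatures or smaller populations such as nightlies fare worse.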