CrashKill/WorkWeek2012/Analysis

From MozillaWiki

Desktop crash analysis - reports, steps we do

  • Crashkill - looks at new crashes appearing in the last week
  • Better metrics on how crashy things are - we currently only have crashes per 100 ADU (active daily users)
    • Release management - interested in mean time between failures.
    • What % of our users are crashing more than x times a day?
    • Bucket of people that crash a lot.
    • Can we collect this via Telemetry?
    • Ben - used to have a user id with crash reports. Difficult to go back there.
    • Look at getting the data from Telemetry.
    • Histogram - of how users crash.
    • Laura - comparison - what does a healthy browser look like
    • Add-ons - what does the histogram look like?
  • Used correlation reports a lot - correlations only available for the top 200 reports
  • STR (steps to reproduce), correlation reports, explosive reports
  • Top mac crashes - breakdown by OS versions
  • Would like a top list of crashes per OS
  • Mobile crashes - top list per device - info in the app notes
  • Could put this info in some other field - now we are using the app notes
  • Client side - can we make it easier to add a field?
  • Signature summary - highly valuable
  • Breakpad - turns a minidump into a stack. There is a new version of the tool that emits JSON and can carry more data. We never had the time to integrate it: changing the format of the process dump would make us incompatible with the rest of the data. The payoff would be much more searchable, flexible process dumps.
  • Input perspective - would like an easier way to get emails - how to write your own queries?
  • Explosive crashes - integrated into the UI yet? Part way.
  • Dups - feedback from us on whether the detection seems right or not. Reports where a bunch of fields are identical.
  • Regression ranges - how do we isolate good regression ranges?
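
The metrics raised in the list above (crashes per 100 ADU, the share of users crashing more than x times a day, and a histogram of how users crash) can be sketched in a few lines. This is only an illustration: it assumes per-user daily crash counts are available (for example via Telemetry, which the notes only raise as a question), and every function name and sample value below is hypothetical.

```python
from collections import Counter


def crashes_per_100_adu(total_crashes, adu):
    """Crash rate normalized to 100 active daily users (ADU)."""
    if adu <= 0:
        raise ValueError("adu must be positive")
    return 100.0 * total_crashes / adu


def share_crashing_more_than(crashes_by_user, x):
    """Fraction of users with more than x crashes in the day."""
    if not crashes_by_user:
        return 0.0
    heavy = sum(1 for n in crashes_by_user.values() if n > x)
    return heavy / len(crashes_by_user)


def crash_histogram(crashes_by_user):
    """Map 'crashes per day' -> number of users with that count."""
    return Counter(crashes_by_user.values())


# Hypothetical sample: user id -> crashes in one day (illustrative only).
sample = {"u1": 0, "u2": 0, "u3": 1, "u4": 5, "u5": 0}

rate = crashes_per_100_adu(sum(sample.values()), len(sample))
heavy_share = share_crashing_more_than(sample, 2)
histogram = crash_histogram(sample)
```

A single aggregate rate hides the "bucket of people that crash a lot": in the sample, one user accounts for most of the crashes, which only the per-user share and the histogram make visible.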

Mobile walk-through of how we use Socorro

  • Which OS version.
  • Mostly searches by build id
  • Device report - combines signatures and devices - yes there is a bug on this.
  • ADUs per OS and ADUs per device - is there a bug on this? No…probably not. They likely don't get the data from the client.
  • Put some information in Socorro relating devices to CPUs - bug 763629
  • Adblock Plus - add-on where we have had crashes
  • Crashes we cannot catch in Socorro - OOM crashes
  • We use the URL reports, build id reports, search by dates.
  • Naoki whiteboard to describe pieces of crash analysis for mobile
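
A device report of the kind mentioned above, combining signatures and devices, amounts to counting crashes per (signature, device) pair. The sketch below assumes the device has already been extracted into its own field rather than living in the app notes; the field names `signature` and `device`, the function name, and the sample reports are hypothetical and do not reflect Socorro's actual schema.

```python
from collections import Counter


def device_report(reports, top_n=10):
    """Count crashes per (signature, device) pair, most frequent first."""
    counts = Counter((r["signature"], r["device"]) for r in reports)
    return counts.most_common(top_n)


# Hypothetical reports; field names and values are illustrative only.
reports = [
    {"signature": "sig_a", "device": "device_1"},
    {"signature": "sig_a", "device": "device_1"},
    {"signature": "sig_b", "device": "device_2"},
]

top = device_report(reports)
```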

Summary of crash activities

  • Mondays 10:00am PT
  • Engineering Meeting Tues 11:00am PT
  • Channel Meetings (Tues/Thurs 2:00pm PT)
  • Release signoff (Wed before release)
  • Crash triage (Thurs 9:00am PT)
  • Topcrash triage (ad-hoc)
  • General crash triage (ad-hoc)

Finding Crashes

  • Bob Clary - load crash URLs
  • Fuzz testing
  • Harvesting support data and input
  • Triaging crash bugs
  • Topcrash reports and explosive reports
  • WinQual reports

Tools

  • Address sanitizer
  • Valgrind
  • Lithium
  • Stack-blame

Security Bug?

  • 0x0 - 0x0ffff (safe)
  • Out of stack space (safe)
  • 0x??? (unsafe)
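
The rule of thumb above can be written as a small first-pass triage helper. This is a sketch of the heuristic as listed, not a definitive security assessment: the function name, the constant name, and the `out_of_stack` flag are hypothetical, and an "unsafe" result only means the crash deserves a closer look.

```python
NEAR_NULL_LIMIT = 0xFFFF  # upper bound of the "safe" range in the list above


def triage_crash_address(address, out_of_stack=False):
    """Return 'safe' or 'unsafe' following the rule of thumb above."""
    if out_of_stack:
        # Running out of stack space is listed as safe.
        return "safe"
    if 0 <= address <= NEAR_NULL_LIMIT:
        # Crash address in the near-null range.
        return "safe"
    # Any other crash address is treated as unsafe.
    return "unsafe"
```

For example, `triage_crash_address(0x8)` falls in the near-null range and returns "safe", while `triage_crash_address(0x10000)` returns "unsafe".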

What breakpad misses

Steps of crash analysis