CrashKill/Plan/Explosive

From MozillaWiki

Notes on the work on a set of criteria for finding explosive crash reports - bug 629049 is the tracker bug, bug 629062 is detection. The PRD doc has some surrounding info, but no criteria yet.

Personal Notes

  • Sharp/significant increase at certain wall-clock time across versions
  • Sharp/significant increase at certain build ID (date?) on single version/series (possibly ignoring everything in version string starting with first letter if the version ends in "pre", to have e.g. 5.0a3pre->5.0b1pre or 4.0b11pre->4.0b12pre not disturb the analysis)
  • Ignore (suspected) duplicates
  • Frequency weighted by ADU more important than bare count (from something chofmann has said)
  • I'm not fond of topcrash rank comparisons: 20 crashes with similar frequency changing places looks overvalued there, while e.g. #1 having 10,000 crashes and #3 having 500 fully mask #2 exploding from 600 to 5,000 in a day.
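
The version-series idea from the notes above could be sketched roughly as follows. This is only an illustration: the function name version_series is hypothetical, and it assumes that "first letter" means the first ASCII letter in the version string.

```python
import re


def version_series(version: str) -> str:
    """Collapse a "pre" version to its series, e.g. "5.0a3pre" -> "5.0".

    Hypothetical helper: versions not ending in "pre" are returned unchanged.
    """
    if not version.endswith("pre"):
        return version
    # A string ending in "pre" always contains a letter, so the search succeeds.
    first_letter = re.search(r"[A-Za-z]", version)
    return version[:first_letter.start()]
```

With this, 5.0a3pre and 5.0b1pre both map to the series "5.0", and 4.0b11pre and 4.0b12pre both map to "4.0", so the switch from one to the other does not split the data.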

Criteria Proposal

This is quite a rough proposal right now.

  1. Get two sets of numbers per signature:
    • non-duplicate crashes that occurred per day and total ADU for the last 10 days
    • non-duplicate crashes and ADU per combination of version series (see personal notes) and date of build ID, for the last 10 available build ID dates in the version series
  2. For each set, calculate (if there are at least 4 values in the set):
    • average crashes per ADU over 7 values before recent value ("base")
    • average ADU over those values ("avgADU")
    • distance of that average to the highest value in set ("dist"), clamped to a minimum of (30 crashes/avgADU)
    • recent value per ADU ("data")
    • (total|version)_explosiveness_1 = (data-base)/dist
  3. For each set, calculate (if there are at least 6 values in the set):
    • average crashes per ADU over 7 values before recent 3 values ("base")
    • average ADU over those values ("avgADU")
    • standard deviation of those values around that average ("dist"), clamped to a minimum of (15 crashes/avgADU)
    • average of recent 3 values per ADU ("data")
    • (total|version)_explosiveness_3 = (data-base)/dist
  4. Mark as explosive in UI if *_explosiveness_1 > 2 or *_explosiveness_3 > 2.

Note: the parameters that can be tweaked are the minimum values for "dist" and the limits on the resulting explosiveness values that trigger marking/alarms/etc.
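
A minimal sketch of steps 2 to 4, to make the proposal concrete. It rests on several assumptions that the proposal does not spell out: the function names are hypothetical, "7 values" is read as "up to 7" when fewer are available, the "highest value in set" is taken among the base values (not including the recent value), "base" is the average of the per-day crashes/ADU ratios, and the standard deviation is the population one.

```python
from statistics import mean, pstdev


def explosiveness_1(crashes, adu, min_crashes=30):
    """Step 2: most recent value against up to 7 values before it."""
    if len(crashes) < 4:
        return None  # not enough data for a useful number
    rates = [c / a for c, a in zip(crashes, adu)]
    prior = rates[-8:-1]
    base = mean(prior)
    avg_adu = mean(adu[-8:-1])
    # Assumption: "highest value" means the highest of the base values.
    dist = max(max(prior) - base, min_crashes / avg_adu)
    return (rates[-1] - base) / dist


def explosiveness_3(crashes, adu, min_crashes=15):
    """Step 3: average of the recent 3 values against up to 7 values before them."""
    if len(crashes) < 6:
        return None
    rates = [c / a for c, a in zip(crashes, adu)]
    prior = rates[-10:-3]
    base = mean(prior)
    avg_adu = mean(adu[-10:-3])
    # Assumption: population standard deviation of the base values.
    dist = max(pstdev(prior), min_crashes / avg_adu)
    return (mean(rates[-3:]) - base) / dist


def is_explosive(crashes, adu, limit=2):
    """Step 4: mark as explosive if either measure exceeds the limit."""
    scores = (explosiveness_1(crashes, adu), explosiveness_3(crashes, adu))
    return any(score is not None and score > limit for score in scores)
```

Both lists are ordered oldest to newest and hold one entry per day (or per build ID date). The tweakable parameters from the note above show up as min_crashes and limit.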

Problems with this proposal

  • Completely arbitrary numbers for explosiveness marking limits and "dist" clamping; we need to see if they catch all explosives and/or catch too much.
  • If there's no large enough set of numbers to work with, there's no useful explosiveness.
  • It's unclear if the version-based numbers really give useful additional value; they also create a multitude of explosiveness numbers to store (2 per version series).
  • There might be an argument for only calculating the second (*_explosiveness_3) measure, as it's fine-grained enough to catch highly explosive crashes on the first day of explosion.
  • This algorithm is too complex to be calculated efficiently across a large volume of crashes on an hourly basis. It needs to be simplified.

Upsides of this proposal

  • Recognizes that dupes and ADU changes can make base values fluctuate and gets rid of those problems.
  • The clamping of "dist" doesn't just prevent divisions by zero, but also deals with potential skew due to tiny fluctuations in small numbers.
  • Having explosiveness numbers available to UI enables flexibility in marking, sorting and changing limits.

Examples

  • bug 554660 (see below) has an interesting example of numbers to look at for this: totals of 54, 72, 86, 83, 67, 46, 47, 123, 131 for 2010-03-08 through 2010-03-16. Here's how this algorithm does on them, ignoring ADU (which is not given there) and therefore also the ADU-based clamping:
    • On 2010-03-15, total_explosiveness_1 would have been 2.7 without clamping, which is above the limit of 2; with "dist" clamped to 30 crashes it would be about 1.9, not yet triggering. total_explosiveness_3 would have been slightly negative, also not triggering.
    • On 2010-03-16, total_explosiveness_1 would have been 1.2, not triggering, but total_explosiveness_3 would have been 2.2, triggering the warning.
    • On later days, it should have triggered easily on both values, even with clamping of "dist".
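
The numbers for bug 554660 can be re-derived with a few lines of Python. This is a sketch under the same simplifications as the example (no ADU, no clamping); it additionally assumes that the "highest value" is taken among the base values and that the standard deviation is the population one, which is the reading that matches the figures quoted above. The names raw_1 and raw_3 are made up for this illustration.

```python
from statistics import mean, pstdev

# Daily totals for 2010-03-08 through 2010-03-16, from bug 554660.
totals = [54, 72, 86, 83, 67, 46, 47, 123, 131]


def raw_1(values):
    """*_explosiveness_1 on bare counts, without clamping."""
    prior = values[-8:-1]  # up to 7 values before the recent one
    base = mean(prior)
    return (values[-1] - base) / (max(prior) - base)


def raw_3(values):
    """*_explosiveness_3 on bare counts, without clamping."""
    prior = values[-10:-3]  # up to 7 values before the recent 3
    base = mean(prior)
    return (mean(values[-3:]) - base) / pstdev(prior)


for label, series in (("2010-03-15", totals[:8]), ("2010-03-16", totals)):
    print(label, round(raw_1(series), 2), round(raw_3(series), 2))
# prints:
# 2010-03-15 2.76 -0.03
# 2010-03-16 1.17 2.24
```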

A larger number of examples, including "dist" clamping (but no ADU), is available as an ODF spreadsheet (PDF version).

http://test.kairo.at/socorro/ has reports of explosiveness on total data for March 8.

User Comments

From Socorro:PRD_Interviews

damon:
 * (initial) growth of more than 25 positions in the ranking
 * upwards change in rank and no related bugzilla id
 * time since startup < 1 minute
 * highlight these crashes in red or something

From bug 525316

morgamic:
 My suggestion for a delta to watch is an increase in crash frequency of more
 than 50-75% and new crashes in the top 20 overall signatures by version.

Data From Previous Explosive Crash Bugs

These bugs were found with the explosive bug query; the notes below try to pull out in what way each of them was explosive.

  • bug 503946
    • #16 tc (2-week) for 3.6 on 2010-01-26, #2 in 1-day, #3 in 3-day
    • crash numbers with that signature: 2010-01-24 145 (days before similar), 2010-01-25 1950, 2010-01-26 11731
    • percentage of total crashes on 2010-01-25 was 4 times as high on 3.6 as on 3.5
  • bug 528798
    • Rise from 7-18 null signature crashes with comments to >100 (with some days of 30s or 50s) within a week or less after the 3.5.5 release.
    • Increase in total null-signature crashes from <5000 to >6000 within 2-3 days
  • bug 530074
    • crashes with that signature jumped from 45-65 to 570 and higher within 3 days
  • bug 536974
    • jumped up 41 ranks in top crasher analysis to #15 on 3.6b5 in 3 days
    • two signatures, both from <30 total crashes per day to >300 within a week
  • bug 538687
    • uptick from 45-60 (2009-12-16 to 2010-01-02, with some single-day spikes above) to multiple consecutive days with 80-100 (2010-01-03 to -07) total crashes per day
    • percentage of total crashes on 3.6 is about a factor 50-100 higher than on 3.5
    • up to 416 crashes on 2010-03-10, 600-900 on 2010-03-12 to -14, >1600 on 2010-03-15, >1000 until -19
    • #45 topcrash in 3.6b5 (2010-01-08), #8 in early 3.6.2 top crash data (2010-03-23)
  • bug 538998
    • From 0 to >100 in two days, from 94-115 to >3700 in one day
    • #10 tc in early 3.6rc1 reports
    • from 3-12 total crashes per hour to 100-600 with a sharp cutoff hour
  • bug 543646
    • from 0 to >1000 crashes in two days, staying there at least 3 days
  • bug 546632
    • new crash in top 30 tc on 3.6
    • from 0 to >100 in a day, from 260 to >1000 in 7 days
    • roughly factor 5 between 3.6 and other versions in percentage of total crashes
  • bug 547210
    • 0 to >3000 in a day, stayed >1000 for 3 days at least
  • bug 547622
    • Top 50 Crash for 3.6 (+149!) Firefox 3.6 Crash Report
    • started showing up around the first of November with 1-10 crashes per day, then 10-40 crashes per day in December, 100-150 in January, and in the last few days of February it was running at 400-712 crashes per day
  • bug 553581
    • +224 positions up to Top 33 Crash for 3.6
  • bug 554660
    • from 45-90 to >100 in a day, >400 in 3 days and rising further (>600 in 6 days etc.)
  • bug 558955
    • 200-500 crashes per day in first half of April '10, up from 0-5 crashes per day in Nov '09 through early March '10
    • ~5 to >100 in 3 days, somewhat back down, then ~80 to >200 in 3 days, 150-180 to >500 in 3 days
  • bug 570722
    • from 7-18 to >600 in a day
  • bug 595957
    • from 30-90 to >3000 in 3 days