NightlyCrashTriage

From MozillaWiki

We aim to analyze the crashes for every Nightly build.

Roster

Nightly builds are produced at 3am each day (California time).

  • Monday (Netherlands time): gsvelto analyzes Friday's build.
  • Monday (US East time): marcia analyzes Saturday's build.
  • Tuesday (US Pacific time): lizzard analyzes Sunday's build.
  • Wednesday (US Pacific time): TBD analyzes Monday's build.
  • Wednesday (US East time): TBD analyzes Tuesday's build.
  • Thursday (US Pacific time): mccr8 analyzes Wednesday's build.
  • Friday (German time): jseward analyzes Thursday's build.

A live calendar is also available for Mozilla employees. Please use it to schedule deviations from the usual roster, e.g. for PTO.

Notes

Triage notes are kept in the following pages.

Use the date you are doing the triage, rather than the date of the build, to decide which page to put your notes in. The reason for this is that the triage date has a heading, which makes it more prominent in the notes than the build date.

Data sources, tools, documentation, communication

Crucial links

Other links

Communication

Documentation

Triage HOWTO

This is a rough guide on how to analyze the crash reports for a particular day's Nightly build.

Crash Report Basics

The first thing you should do is watch David Baron's talk about crash reports. Watch it all the way through. It's full of useful information.

Also, see the documentation about reading individual crash reports on MDN.

Finally, you should install the Crash-stats: State of the bug extension. It adds extra information to bug links in crash-stats, which is extremely useful.

Crash Report Inspection

The starting point is this page. You should look at the Nightly crashes for all four platforms (Windows, Mac, Linux, Android) for the build you are analyzing.

We keep notes in the triage notes pages (see above). Please add your notes to the top of the appropriate page. You don't need to write a lot. The logs serve two main purposes:

  • To show which builds have been analyzed (and any that might have been missed).
  • To communicate anything that might be useful to the people coming up in the roster (e.g. if crash reports are currently broken on one platform, or something like that).

Nightly builds are created at 3am California time. It's generally best to wait at least 24 hours before doing a full analysis of a build, because that gives enough time for a decent number of crashes to come in. Having said that, it can be useful to check in earlier than that to see if there are any explosive new crashes that might require action (see below).
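The 24-hour wait can be worked out from a build id. The following is a minimal sketch, assuming that a build id such as "20160506052823" encodes the build timestamp as YYYYMMDDHHMMSS. The function names are hypothetical, and no timezone conversion is attempted.

```python
from datetime import datetime, timedelta

# Assumption: a build id is a timestamp in YYYYMMDDHHMMSS form.
BUILD_ID_FORMAT = "%Y%m%d%H%M%S"


def build_time(build_id: str) -> datetime:
    """Parse a build id into a naive datetime (no timezone attached)."""
    return datetime.strptime(build_id, BUILD_ID_FORMAT)


def earliest_full_analysis(build_id: str, wait_hours: int = 24) -> datetime:
    """Return the earliest time at which a full analysis is sensible."""
    return build_time(build_id) + timedelta(hours=wait_hours)


print(earliest_full_analysis("20160506052823"))  # 2016-05-07 05:28:23
```

The result is in whatever timezone the build id itself uses, so convert it to your local time before planning your triage.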

Windows crash numbers are much higher than those of the other platforms, and you will likely spend most of your time looking at them. The #1 crash signature on Windows typically has anywhere from 10 to 50 crash reports, though there is plenty of variation from day to day.

The basic goal of triage is to get a sense of the current crash situation and take steps to improve it.

  • Look closely at most/all of the top crash signatures (e.g. top 10--15 on Windows, top 5 or so on other platforms). (There are certain ones that are hard to act on, e.g. "ShutdownKill", "OOM | Small", Flash crashes.)
  • When appropriate, file bugs for signatures that lack them (more details about this are below). If a signature only has an old bug, filing a new one might be appropriate.
  • Check bugs for signatures that have them to see if they need additional information.
  • If a bug has been closed but the crash is still happening, the bug might need to be reopened.
  • If a bug is open but stalled, adding a NEEDINFO request might help, and adding information about crash frequency can be helpful. (Polite nagging can be effective!)
  • Look for distinct crash signatures that might be related, e.g. multiple signatures relating to a11y, or to the JavaScript JITs.
  • The lower-ranked signatures are less important, though filing bugs for signatures that lack them can still be useful.

One caveat is that sometimes multiple crashes (possibly even tens or hundreds) with a particular signature all come from a single installation. Those are usually best ignored, because it's difficult to tell whether it's a real problem or a problem with the user's machine or installation, and there are enough other crashes to deal with. Likewise, if a crash appears on one Nightly build but none before or after, it's probably not worth filing.

To determine how many installations a particular crash signature has affected:

  • Look at the "Product*" box, which has an "installations" column. Note that this box gives results for the past 7 days, unlike the other boxes.
  • Look at the "install time" fields in a signature report. If the install times are all identical it's almost certainly a single installation.
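The install-time check can also be done mechanically if you have the reports for a signature as data. The following is a rough sketch, assuming each report is a dict with an `install_time` key. The data layout and function names are hypothetical and are not the crash-stats API.

```python
from collections import Counter


def reports_per_install_time(reports):
    """Count crash reports for each distinct install time."""
    return Counter(report["install_time"] for report in reports)


def looks_like_single_installation(reports):
    """True if every report shares one install time (and there is at least one report)."""
    return len(reports_per_install_time(reports)) == 1


# Hypothetical sample data for two signatures.
same_install = [{"install_time": 1462500000} for _ in range(40)]
mixed_installs = [
    {"install_time": 1462500000},
    {"install_time": 1462511111},
    {"install_time": 1462500000},
]

print(looks_like_single_installation(same_install))    # True
print(looks_like_single_installation(mixed_installs))  # False
```

The number of distinct install times is only a lower-bound hint at the number of installations, since two installations could in principle share an install time.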

JS engine crashes can be difficult to deal with, especially those involving the JITs and the GC, because crashes with different causes can get lumped into the same signature.

Filing Bug Reports

To file a bug report about a particular crash signature, view one crash report and use the relevant "Report this bug in" link. This pre-populates the bug report with the "crash" keyword and a link to the particular crash report.

When you file a bug report, things to mention include the following.

  • The signature and the crash reason (or pattern of crash reasons)
  • The Nightly build id, which looks like "20160506052823".
  • How many crashes have occurred with this signature.
  • Anything unusual (crash only on specific platforms, only in certain locales or on certain sites, extension or dll correlations of interest, whether it's a startup crash, etc.) The correlations tab in crash-stats is great for this.
  • The rank (e.g. "this is the #1 top crash for Nightly 20160506052823").
  • How many installations are involved.
  • If the signature is new, and if so, which Nightly build it started in. One way to determine this is to modify the search parameters (by modifying the URL, or by modifying the search in the crash-stats UI) to broaden the search, e.g. by removing the particular Nightly build identifier and changing the date parameters.
  • If possible, a regression window for that Nightly. You can get regression windows by using the "Choose regression window" button on the start page.
  • If possible, an indication of which bug's patches may have caused the crash.
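Broadening a search by editing the URL, as suggested above for finding when a signature first appeared, can be sketched with the standard library. The host and the parameter names (`build_id`, `date`) in the example are assumptions for illustration only; check them against the URL you actually see in crash-stats.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def broaden_search(url, drop=("build_id",), replace=None):
    """Return url with the `drop` parameters removed and the `replace` parameters overridden."""
    replace = replace or {}
    parts = urlsplit(url)
    # Keep every parameter that is neither dropped nor about to be replaced.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop and key not in replace
    ]
    params.extend(replace.items())
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical search URL restricted to one Nightly build.
url = (
    "https://crash-stats.example/search/"
    "?product=Firefox&build_id=20160506052823&date=%3E%3D2016-05-06"
)
print(broaden_search(url, replace={"date": ">=2016-04-29"}))
```

Stepping the date back a week at a time and noting the earliest build id that still shows the signature gives a first estimate of where it started.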

Don't worry too much if you can't get all of these things. Getting the right person to look at a crash report is the most important thing. Therefore, you should always NEEDINFO someone in a bug report. If you look at the top few frames of one or more reports, try to see if there are any lines that changed recently. If so, that's a good clue as to what change caused the crash and you can NEEDINFO the author. Failing that, NEEDINFO someone appropriate based on your knowledge of the codebase, or by looking at who has modified nearby lines, or by consulting the module owners or triage leads lists.

Sometimes a single crash cause will result in multiple signatures. It's useful to link all the relevant signatures to the same bug. Bug 1229252 is a good example, where the "crash signature" field grew steadily over a period of weeks.

A crash that occurs more than 10 times a day, across multiple installations, is probably impactful enough to be worth tracking. Set the appropriate tracking flag to '?' in the bug report, so that release drivers will be aware of it.
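The rule of thumb above can be written as a small check. This is a sketch only: the function name and parameters are hypothetical, and the threshold is a guideline rather than a hard rule.

```python
def worth_tracking(crashes_per_day: int, installations: int, min_crashes: int = 10) -> bool:
    """Apply the rule of thumb: more than `min_crashes` a day, across multiple installations."""
    return crashes_per_day > min_crashes and installations > 1


print(worth_tracking(crashes_per_day=35, installations=20))  # True
print(worth_tracking(crashes_per_day=35, installations=1))   # False: single installation
print(worth_tracking(crashes_per_day=10, installations=5))   # False: not more than 10
```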

When you file a new bug for a crash signature, that bug won't show up in the "Bugs" columns of crash-stats for a few hours. This is a shame, because it would be nice to double-check the linking immediately, but it's understandable that crash-stats doesn't ping Bugzilla all the time.

Sharing crash dumps

Occasionally, we run into a difficult-to-analyze crash where we need help from engineers at an external company. In those cases we have two options for getting help: sharing a crash dump with user permission from a verified owner of that crash data, or debugging with a privacy agreement in tandem with a Mozilla employee. This is described in more detail in a bug and on the Data Collection page.

Miscellaneous

At the end of each development cycle, it's worth paying attention to which crashes affect both Nightly and Beta, and therefore have fixes that need uplift to Beta.

What if a Nightly Build is Super Crashy?

Occasionally a Nightly build is super crashy. When this happens, there are a few things to be done.

  • First, tell an appropriate person (ping in #releng or maybe #moc) so that updates can be stopped as soon as possible.
  • Then, get a tracking bug (Release Engineering :: General) filed, either by you or the person stopping the updates.
  • Then, if the regressing patch can be identified, get it backed out from mozilla-central.
  • Then, consider asking an appropriate person (via #sheriffs, #taskcluster, #releng, or the sheriffs mailing list) to "respin" Nightly, i.e. rebuild and re-release it. This takes a couple of hours, so is only worthwhile if the next day's Nightly build is still some time off.
  • Finally, once the issue has been resolved and new Nightly builds are ready to go out to users, contact the person who froze updates in step 1 so that updates can be unfrozen.