Buildbot/Talos/Sheriffing

From MozillaWiki
Revision as of 18:27, 7 November 2014 by Jmaher

Overview

The sheriff team does a great job of finding regressions in unit tests and either getting them fixed or backing out the offending changes. This keeps our trees green and usable while thousands of checkins a month take place!

For Talos, we run about 50 jobs per push (out of ~400) to measure the performance of desktop and Android builds. These jobs are normally green, so the sheriffs have little to do.

Enter the role of a Performance Sheriff. This role looks at the data produced by these test jobs, finds regressions and their root causes, and files bugs to track all issues and make interested parties aware of what is going on.

Understanding an alert

As of 2014, alerts are sent from [graph server] to [dev.tree-management]. They are generated by programmatically verifying that a regression is sustained over time (the original data point plus 12 future data points).
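
The sustained-change check described above can be sketched roughly as follows. This is a simplified illustration, not the graph server's actual algorithm: the function name `sustained_change`, the comparison of window means, and the 2% threshold are all assumptions made for this example. Only the "original data point + 12 future data points" window comes from the description above.

```python
from statistics import mean


def sustained_change(values, index, window=12, threshold_pct=2.0):
    """Return the percent change at `index` if it looks sustained, else None.

    Assumption: the baseline is the mean of up to `window` points before
    `index`; it is compared with the mean of the point at `index` plus the
    `window` points that follow it.
    """
    before = values[max(0, index - window):index]
    after = values[index:index + window + 1]
    # Without the full set of future points the change cannot be confirmed.
    if not before or len(after) < window + 1:
        return None
    baseline = mean(before)
    if baseline == 0:
        return None
    change_pct = (mean(after) - baseline) / baseline * 100.0
    # Small shifts are treated as noise (the threshold is an assumed value).
    return change_pct if abs(change_pct) >= threshold_pct else None


# Hypothetical series: 12 stable runs at 100, then a step up to 110.
series = [100.0] * 12 + [110.0] * 13
print(sustained_change(series, 12))
```

For a test where lower values are better, a positive result would be a regression and a negative one an improvement. A data point that does not yet have 12 later points is left undecided rather than flagged.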

The alert will reference:

  • platform
  • test name
  • % change / values
  • suspected changeset [range], including commit summary
  • link to graph server

Keep in mind that alerts mention both improvements and regressions, which helps us track the system as a whole. When filing bugs, we focus mostly on the regressions.


Finding the root cause

Here we need to determine which branch the regression started on and check whether tests ran for all the changesets surrounding the suspected push.

Verifying an alert

Once we have identified a suspected push, it is good practice to retrigger the job a few times on that push and the surrounding pushes. Many tests are noisy, and a single oddball result could lead us to misidentify the changeset.
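
As a rough illustration of why retriggers help, the sketch below compares the median of several runs per push instead of a single value, so one oddball result does not move the verdict. The function name `first_shifted_push`, the revisions, the values, and the 2% threshold are hypothetical; this is not part of any Talos tooling.

```python
from statistics import median


def first_shifted_push(results, threshold_pct=2.0):
    """Find the first push whose median differs from the previous push.

    `results` is an ordered list of (revision, [values from retriggers]).
    Returns (revision, percent_change) or None if no push stands out.
    """
    previous = None
    for revision, values in results:
        current = median(values)
        if previous is not None and previous != 0:
            change = (current - previous) / previous * 100.0
            if abs(change) >= threshold_pct:
                return revision, change
        previous = current
    return None


# Hypothetical data: the second push has one noisy run (130.0), but its
# median barely moves; the third push shifts on every run.
results = [
    ("aaa111", [100.0, 101.0, 99.0, 100.0]),
    ("bbb222", [100.0, 130.0, 99.0, 101.0]),
    ("ccc333", [108.0, 109.0, 107.0, 108.0]),
]
print(first_shifted_push(results))
```

With a single run per push, the 130.0 outlier could have pointed at the second push. With four runs each, the median identifies the third push instead.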

Likewise, we should check all the other platforms to see the scope of the regression.

Filing a bug

The bug summary should follow this pattern: %xx <platform> <test> regression on <branch> (v.<xx>) Date, from push <revision>
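
A small helper can keep summaries consistent with this pattern. The sketch below fills in the pattern literally, as written above; the function name `bug_summary` and all example values are hypothetical.

```python
def bug_summary(percent, platform, test, branch, version, date, revision):
    """Build a bug summary following the pattern given above."""
    return (
        f"%{percent:g} {platform} {test} regression on {branch} "
        f"(v.{version}) {date}, from push {revision}"
    )


# Hypothetical values, for illustration only.
print(bug_summary(3.5, "Win7", "tp5o", "mozilla-inbound", 36,
                  "Nov 7", "abcdef123456"))
```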