Buildbot/Talos/Sheriffing
Overview
The sheriff team does a great job of finding regressions in unit tests and either getting fixes for them or backing out the offending changes. This keeps our trees green and usable while thousands of checkins a month take place!
For Talos, we run about 50 jobs per push (out of ~400) to measure the performance of desktop and Android builds. These jobs are normally green, so the sheriffs have little to do with them.
Enter the role of a Performance Sheriff. This role looks at the data produced by these test jobs, finds regressions, identifies root causes, and gets bugs on file to track all issues and make interested parties aware of what is going on.
Understand an alert
As of 2014, alerts come in from [graph server] to [dev.tree-management]. These are generated by programmatically verifying that there is a sustained regression over time (original data point + 12 future data points).
The alert will reference:
- platform
- test name
- % change / values
- suspected changeset [range] including commit summary
- link to graph server
Keep in mind that alerts mention both improvements and regressions, which is valuable for tracking the system as a whole. For filing bugs, we focus mostly on the regressions.
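To make the "original data point + 12 future data points" idea concrete, here is a minimal sketch of one way a sustained change could be checked. It is not the graph server's actual algorithm: the function name, the 12-point look-back window, and the 2% threshold are all assumptions chosen for illustration.

```python
from statistics import mean

# From the text: the original data point plus 12 future data points.
FUTURE_POINTS = 12


def sustained_change_pct(values, index, back_window=12, threshold_pct=2.0):
    """Return the percent change at `index` if it is sustained, else None.

    `back_window` and `threshold_pct` are illustrative assumptions,
    not values taken from the graph server.
    """
    before = values[max(0, index - back_window):index]
    after = values[index:index + 1 + FUTURE_POINTS]
    # Wait until the original point and all 12 future points exist.
    if not before or len(after) < 1 + FUTURE_POINTS:
        return None
    base = mean(before)
    if base == 0:
        return None
    change = (mean(after) - base) * 100.0 / base
    return change if abs(change) >= threshold_pct else None
```

With this sketch, a single noisy spike followed by normal values averages out over the 13 points and produces no alert, while a shift that persists across all 13 points does. A positive or negative result still has to be read against the test's direction (for some tests higher is better), which the sketch does not model.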
Finding the root cause
Here we need to determine which branch the regression started on and check whether tests ran for all the changesets surrounding the suspected push.
Verifying an alert
Once we have identified a suspected push, it is good practice to retrigger the job a few times on that push and the surrounding pushes. Many tests are noisy, and a single oddball result could lead us to misidentify the changeset.
Likewise we should verify all the other platforms to see what the scope of this regression is.
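As a rough illustration of why retriggers help, the sketch below compares the median of several results per push against the first push in the range. The function name, the data layout, and the 2% threshold are assumptions for this example; the revisions are made up.

```python
from statistics import median


def first_shifted_push(results_by_push, threshold_pct=2.0):
    """Find the first push whose median differs from the baseline push.

    `results_by_push` is a list of (revision, [values]) in push order;
    the first entry is treated as the baseline. Returns a tuple of
    (revision, percent change), or None if no push crosses the threshold.
    """
    _, baseline_values = results_by_push[0]
    base = median(baseline_values)
    if base == 0:
        return None
    for revision, values in results_by_push[1:]:
        change = (median(values) - base) * 100.0 / base
        if abs(change) >= threshold_pct:
            return revision, change
    return None


# Hypothetical retrigger results for three consecutive pushes.
pushes = [
    ("aaa111", [100, 101, 99, 100]),
    ("bbb222", [100, 108, 100, 101]),  # one noisy result, median barely moves
    ("ccc333", [110, 109, 111, 110]),  # every retrigger is higher
]
print(first_shifted_push(pushes))
```

Using the median means the single 108 on the middle push does not get it blamed; only the push where every retrigger moved is reported. The same comparison can be repeated per platform to see the scope of the regression.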
Filing a bug
The bug summary should follow this pattern: %xx <platform> <test> regression on <branch> (v.<xx>) Date, from push <revision>
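A small helper can keep summaries consistent. This is only a sketch: the function and parameter names are hypothetical, it assumes "%xx" stands for the percentage change written as a number followed by "%", and the values in the example are invented.

```python
def bug_summary(pct, platform, test, branch, version, date, revision):
    """Build a bug summary string following the pattern above."""
    return (
        f"{pct:g}% {platform} {test} regression on {branch} "
        f"(v.{version}) {date}, from push {revision}"
    )


# Example with made-up values.
print(bug_summary(3.5, "Win7", "tp5o", "mozilla-inbound", 34,
                  "July 28", "abcdef123456"))
```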