Buildbot/Talos/Sheriffing
Overview
The sheriff team does a great job of finding regressions in unittests and getting fixes for them or backing stuff out. This keeps our trees green and usable while thousands of checkins a month take place!
For talos, we run about 50 jobs per push (out of ~400) to measure the performance of desktop and android builds. These jobs almost always run green, so the sheriffs have little to do with them.
Enter the role of a Performance Sheriff. This role looks at the data produced by these test jobs, finds regressions and their root causes, and gets bugs on file to track all issues and make interested parties aware of what is going on.
Understand an alert
As of 2014, alerts come in from [graph server] to [dev.tree-management]. These are generated by programmatically verifying there is a sustained regression over time (original data point + 12 future data points).
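As an illustration of the sustained-regression idea, here is a minimal Python sketch that flags a push when the average of the original data point plus the 12 points that follow it moves away from the preceding points by more than a threshold. Only the 12-point forward window comes from this page; the back window, the 5% threshold, and the function itself are assumptions for illustration, not the actual graph server analysis code.

<pre>
# Illustrative sketch of the "sustained regression" idea: compare the original
# data point plus the 12 pushes that follow it against the pushes before it.
# Back window size and threshold are assumptions, not the real analysis values.

def find_sustained_regressions(values, back_window=12, fore_window=12, threshold=0.05):
    """values: per-push test results in push order.
    Returns a list of (index, percent_change) suspected regression points."""
    suspects = []
    for i in range(back_window, len(values) - fore_window):
        before = values[i - back_window:i]
        after = values[i:i + fore_window + 1]   # original point + 12 future points
        old_avg = sum(before) / len(before)
        new_avg = sum(after) / len(after)
        change = (new_avg - old_avg) / old_avg
        if abs(change) > threshold:
            suspects.append((i, change * 100.0))
    return suspects
</pre>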
The alert will reference (see the sketch after this list):
- platform
- test name
- % change / values
- suspected changeset [range] including commit summary
- link to graph server
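For illustration, the fields above map naturally onto a small record. The class and field names below are hypothetical; they are not the actual alert email or AlertManager schema.

<pre>
# A minimal sketch of the fields an alert carries, as listed above.
# Field names are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class TalosAlert:
    platform: str          # e.g. "Windows 8"
    test_name: str         # the talos test that changed
    percent_change: float  # positive = regression, negative = improvement
    old_value: float
    new_value: float
    changeset_range: str   # suspected push, or a range when jobs were coalesced
    commit_summary: str
    graph_url: str         # link to graph server for this platform/test/branch
</pre>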
Keep in mind that alerts cover both improvements and regressions, which is valuable for tracking the entire system as a whole. For filing bugs, we focus mostly on the regressions.
Finding the root cause
There are many reasons for an alert and different scenarios to be aware of:
- backout (usually within 1 week, causing a similar regression/improvement in the opposite direction; see the sketch after this list)
- pgo/nonpgo (some regressions are PGO-only and might be a side effect of PGO). We only ship PGO builds, so these are the most important.
- test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests
- Coalesced - we don't run every job on every platform on every push, so an alert can point at a set of coalesced changes (a range) rather than a single push
- Regular regression - the normal case where we get an alert and we see it merge from branch to branch
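As referenced in the backout bullet above, a backout usually shows up within about a week as an alert of similar size in the opposite direction on the same platform/test. Here is a hedged sketch of that pairing heuristic; the alert attributes (platform, test_name, percent_change, date) and the thresholds are assumptions for illustration.

<pre>
# Sketch of the backout heuristic: a backout tends to show up within about a
# week as an alert of similar size in the opposite direction on the same
# platform/test.  Attribute names and thresholds are assumptions.

from datetime import timedelta

def looks_like_backout_pair(alert, later_alert, window_days=7, tolerance=0.25):
    """Return True if later_alert plausibly undoes alert (opposite direction,
    similar magnitude, same platform/test, within the time window).
    Both objects are assumed to carry platform, test_name, percent_change
    and a datetime-valued date attribute."""
    same_series = (alert.platform == later_alert.platform and
                   alert.test_name == later_alert.test_name)
    opposite = alert.percent_change * later_alert.percent_change < 0
    similar_size = (abs(abs(alert.percent_change) - abs(later_alert.percent_change))
                    <= tolerance * abs(alert.percent_change))
    within_window = timedelta(0) <= (later_alert.date - alert.date) <= timedelta(days=window_days)
    return same_series and opposite and similar_size and within_window
</pre>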
Backout
Backouts happen every day, but backouts that generate performance regressions are what add noise to the system.
Here is an example of a backout which affected many tests: AlertManager (http://alertmanager.allizom.org:8080/alerts.html?rev=d4c4897f9ffb) and the related coalesced alert (http://alertmanager.allizom.org:8080/alerts.html?rev=5cc96a763c3f).
This example is interesting because we see one change which was quickly identified as the correct change, but one job was coalesced. The coalescing is easy to detect: looking at the suspected changeset (http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f4eea7e2f94b&tochange=5cc96a763c3f), it is a range. That range includes our backed-out changeset, and the graph shows the backout pattern. In addition, this is on Windows 8, which is the platform that showed a regression on the backout. We have high confidence in mapping the backout as the root cause of this coalesced alert.
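A quick way to confirm that a suspected changeset is really a coalesced range is to list the pushes between the two revisions. The sketch below uses the JSON flavour of the pushlog linked above (json-pushes); treat the exact endpoint and response shape as an assumption to verify. The revisions are the ones from this example.

<pre>
# Sketch: checking whether a suspected "changeset" is actually a coalesced
# range by listing every push between two revisions via the pushlog.
# The json-pushes endpoint and its response shape are assumptions to verify.

import json
import urllib.request

def pushes_in_range(repo_url, fromchange, tochange):
    url = ("%s/json-pushes?fromchange=%s&tochange=%s"
           % (repo_url, fromchange, tochange))
    with urllib.request.urlopen(url) as resp:
        pushes = json.load(resp)
    # One entry per push; more than one push means the alert covers a
    # coalesced range rather than a single change.
    return pushes

# Example usage (the 2014-era server may no longer respond):
# pushes = pushes_in_range("http://hg.mozilla.org/integration/mozilla-inbound",
#                          "f4eea7e2f94b", "5cc96a763c3f")
# print("%d pushes in the suspected range" % len(pushes))
</pre>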
Verifying an alert
Once we have identified a suspected push, it is good practice to retrigger the job a few times on that push and the surrounding pushes. Many tests are noisy, and an oddball result could end up misidentifying the changeset.
Likewise we should verify all the other platforms to see what the scope of this regression is.
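Once the retriggers are in, a simple comparison of the suspected push against its parent can show whether the difference stands out from the noise. This is only an illustrative check, not the analysis the alerting system performs; the 2-sigma rule is an assumption.

<pre>
# Sketch of sanity-checking retrigger results: compare the retriggered values
# on the suspected push against those on its parent push and see whether the
# difference exceeds the observed noise.  Purely illustrative.

from statistics import mean, stdev

def regression_confirmed(parent_runs, suspect_runs, sigma=2.0):
    """parent_runs / suspect_runs: lists of retriggered test values.
    Returns (percent_change, confirmed)."""
    old_avg, new_avg = mean(parent_runs), mean(suspect_runs)
    noise = stdev(parent_runs) if len(parent_runs) > 1 else 0.0
    change = (new_avg - old_avg) / old_avg * 100.0
    confirmed = abs(new_avg - old_avg) > sigma * noise
    return change, confirmed

# e.g. regression_confirmed([250, 252, 249, 251], [262, 265, 261, 263])
</pre>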
Filing a bug
The bug summary should follow this pattern: %xx <platform> <test> regression on <branch> (v.<xx>) Date, from push <revision>
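For illustration, the pattern can be filled in mechanically from an alert's fields. The function and argument names below are hypothetical; only the pattern itself comes from this page.

<pre>
# Sketch: filling in the bug summary pattern above from an alert's fields.
# Argument names are hypothetical.

def bug_summary(percent, platform, test, branch, version, date, revision):
    return "%.0f%% %s %s regression on %s (v.%s) %s, from push %s" % (
        percent, platform, test, branch, version, date, revision)

# e.g. bug_summary(4, "Windows 8", "tp5o", "Mozilla-Inbound", "36",
#                  "Nov 18, 2014", "d4c4897f9ffb")
</pre>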