Buildbot/Talos/Sheriffing: Difference between revisions
Revision as of 18:09, 3 February 2015
Overview
The sheriff team does a great job of finding regressions in unittests and getting fixes for them or backing stuff out. This keeps our trees green and usable while thousands of checkins a month take place!
For Talos, we run about 50 jobs per push (out of ~400) to measure the performance of desktop and Android builds. These jobs are normally green, so the sheriffs have little to do with them.
Enter the role of a Performance Sheriff. This role looks at the data produced by these test jobs, finds regressions and their root causes, and gets bugs on file to track all issues and make interested parties aware of what is going on.
What is an alert
As of January 2015, alerts come in from [graph server] to [dev.tree-alerts]. They are generated by programmatically verifying that there is a sustained regression over time (original data point + 12 future data points).
The alert will reference:
- [branch]
- platform
- test name
- [% change / values]
- offending changeset [range] including commit summary
- link to [graph server]
Keep in mind that alerts mention improvements and regressions, which is valuable for us to track the entire system as a whole. For filing bugs, we focus mostly on the regressions.
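The "original data point + 12 future data points" check described above can be illustrated with a small sketch. This is not the graph server's actual algorithm, which is not documented here; the function name, the comparison of window means, and the 2% threshold are all assumptions made for illustration.

```python
from statistics import mean


def is_sustained_change(values, index, window=12, threshold_pct=2.0):
    """Return (flagged, pct_change) for the data point at `index`.

    Hypothetical sketch: compares the mean of up to `window` points before
    `index` with the mean of the point at `index` plus the `window` points
    that follow it. `threshold_pct` is an assumed cutoff, not a real setting.
    """
    before = values[max(0, index - window):index]
    after = values[index:index + window + 1]
    # Without history, or without all 12 future points, no decision is made.
    if not before or len(after) < window + 1:
        return False, 0.0
    base = mean(before)
    if base == 0:
        return False, 0.0
    pct = (mean(after) - base) / base * 100.0
    # Both directions are flagged, since alerts cover improvements too.
    return abs(pct) >= threshold_pct, pct
```

Whether a positive change is a regression or an improvement depends on the test, so the sketch reports the signed percentage and leaves that interpretation to the caller.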
Investigating the alert
This is a manual process that needs to be done for every alert. We need to:
- Look at the graph and determine the original branch, date, revision where the alert occurred
- Look at the graph for all platforms related to the given branch/test, to see the scope of the regression
- Look at TreeHerder and determine if we have all the data.
- Retrigger jobs if needed (more [noise], more retriggers)
Determining the root cause from the graph server
Luckily in Alert Manager, we automatically show multiple branches for the given test/platform. This helps you determine the root branch. It is best to [zoom] in and out to verify where the regression is.
Determining the scope of the regression from graph server
Once you have the spot, you can validate the other platforms by [adding additional data sets] to the graph. It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [coalescing].
Determining if we have all the data from TreeHerder
Finally, with the scope of the regression known, we need to open the link from Alert Manager to TreeHerder. It is set up to show 5 revisions before and after the revision in the alert, filtered on the test and platform. Here we are looking for a few things:
- Do we have data for the revision before / after the revision we have identified as regressing? If not, we should consider filling in the missing data.
- Is our revision, or the revision before / after it, a merge? If so, we should retrigger to make sure we are not investigating a merged changeset. If we are on a merged changeset, we need to go to the original branch and bisect.
- Does it look like most of the other platforms/talos tests have completed in this range? If not, then we could have other alerts for tests/platforms arriving in the future.
Retriggering jobs
In the case where we have:
- missing data
- an alert on a merge
- an alert on a pgo build (with no alert on a non-pgo)
- an alert where the range of the regression overlaps with the regular range (a small alert (<5%) or a noisy test)
We need to do some retriggers. I usually find it useful to retrigger 3 times on 5 revisions:
- target revision-2
- target revision-1
- target revision
- target revision+1
- target revision+2
In the case where there is missing data, target revision becomes a range of: [target revision, revisions with missing data]
This is important because we then have enough evidence to show that the regression is sustained through retriggers and over time. If you suspect alerts on other tests/platforms, please retrigger those as well.
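The retrigger pattern above (3 retriggers on the target revision and the two revisions on either side, widened when data is missing) can be written down as a small planning helper. This is only a sketch: the function name, its parameters, and the reading of "range" as "widen the window to cover every revision with missing data" are assumptions, and it does not talk to TreeHerder.

```python
def retrigger_plan(pushes, target, missing=(), radius=2, retriggers=3):
    """Map each revision to retrigger onto the number of retriggers.

    `pushes` lists revisions in push order, oldest first. `missing` holds
    revisions with missing data; the window is widened to cover them.
    All names here are hypothetical.
    """
    if target not in pushes:
        raise ValueError("target revision is not in the push list")
    wanted = {target, *missing}
    positions = [i for i, rev in enumerate(pushes) if rev in wanted]
    # Clamp the window to the pushes we actually know about.
    lo = max(0, min(positions) - radius)
    hi = min(len(pushes) - 1, max(positions) + radius)
    return {rev: retriggers for rev in pushes[lo:hi + 1]}
```

For a push list of a..g with target d, the plan covers b, c, d, e and f with 3 retriggers each. If e has missing data, the window grows to include g as well.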
Cases to watch out for
There are many reasons for an alert and different scenarios to be aware of:
- [backout] (usually within 1 week causing a similar regression/improvement)
- [pgo/nonpgo] (some errors are pgo only and might be a side effect of pgo). We only ship PGO, so these are the most important.
- test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests
- [Merged] - sometimes the root cause looks to be a merge; this is normally a side effect of [Coalescing].
- [Coalesced] - this is when we don't run every job on every platform on every push, so a single job sometimes covers a set of changes
- Regular regression - the normal case where we get an alert and we see it merge from branch to branch
Filing a bug
A lot of work is being done inside Alert Manager to make filing a bug easier. As each bug has unique attributes, it is hard to handle this programmatically, but we can do our best. There is a 'File bug' link underneath each revision in Alert Manager; clicking it brings up a popup with a suggested summary and description for the bug.
Here are some guidelines for filing a bug:
- Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in "[Testing :: Talos]"
- Dependent/Block bugs - For a new bug, add the tracking bug and root cause bug(s) as blocking this bug
- CC list - cc :jmaher, :avih, patch author(s) and reviewer(s), and owner of the tests as documented on the [talos tests wiki]
- Summary of bug should follow this pattern (should be suggested correctly):
%xx <platform> <test> regression on <branch> (v.<xx>) Date, from push <revision>
- The description is auto-suggested as well. It should be good, but do make sure it makes sense.
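If the suggested summary ever needs to be rebuilt by hand, the pattern above can be filled in with a short helper. This is a sketch, not Alert Manager's code: the function and parameter names are hypothetical, and placing the number before the percent sign is an assumption about how the "%xx" placeholder is meant to be rendered.

```python
def bug_summary(pct, platform, test, branch, version, date, revision):
    """Build a bug summary following the pattern in the guidelines above.

    Hypothetical helper; "N.N%" is an assumed rendering of the "%xx"
    placeholder in the documented pattern.
    """
    return (
        f"{pct:.1f}% {platform} {test} regression on {branch} "
        f"(v.{version}) {date}, from push {revision}"
    )
```

All argument values are supplied by the caller, typically copied from the alert itself (platform, test, branch, and revision).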
Additional Resources
- [Alert FAQ]
- [Noise FAQ]
- [GraphServer FAQ]
- [Tree FAQ]