m (→Verifying an alert: - removed) |
m (→Investigating the alert: - filled in details) |
* Retrigger jobs if needed (more [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]], more retriggers)
== Determining the root cause from the graph server ==
Alert Manager automatically shows multiple branches for the given test/platform, which helps you determine the root branch. It is best to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom]] in and out to verify where the regression is.
== Determining the scope of the regression from graph server ==
Once you have the spot, you can validate the other platforms by [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets]] to the graph. It is best to zoom out a bit here, as the regression might be a few revisions off on different platforms due to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]].
== Determining if we have all the data from TreeHerder ==
Finally, with the scope of the regression established, we need to open the link from Alert Manager to TreeHerder. This is set up to show 5 revisions before and after the revision in the alert, filtered on the test and platform. Here we are looking for a few things:
* Do we have data for the revision before / after the revision we have identified as regressing? If not, we should consider filling in the missing data.
* Is our revision, or the revision before / after it, a merge? If so, we should retrigger to ensure that we are not investigating a merged changeset. If we are on a merged changeset, we need to go to the original branch and bisect.
* Does it look like most of the other platforms/talos tests have completed in this range? If not, then we could have other alerts for tests/platforms arriving in the future.
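
The first two checks above can be sketched in a few lines of Python. This is only an illustration: the <code>Push</code> record and its fields are hypothetical and do not correspond to any real TreeHerder API, and the third check (whether other platforms/tests have finished) still has to be done by eye.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Push:
    """Hypothetical record for one revision in the TreeHerder view."""
    revision: str
    has_data: bool   # a result exists for the alerting test/platform
    is_merge: bool   # the push is a merge from another branch


def check_range(pushes: List[Push], target: str) -> List[str]:
    """Return follow-up notes for the target revision and its two neighbours.

    pushes is assumed to be in push order, oldest first.
    """
    idx = next(i for i, p in enumerate(pushes) if p.revision == target)
    notes = []
    # Look at the revision before, the target itself, and the revision after.
    for p in pushes[max(0, idx - 1): idx + 2]:
        if not p.has_data:
            notes.append(f"{p.revision}: missing data, consider filling it in")
        if p.is_merge:
            notes.append(f"{p.revision}: merge, retrigger and check the original branch")
    return notes
```

An empty list means neither of the first two checks raised a concern for that range.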
== Retriggering jobs ==
In the case where we have:
* missing data
* an alert on a merge
* an alert on a pgo build (with no alert on a non-pgo)
* an alert where the range of the regression overlaps with the regular range (a small alert (<5%) or a noisy test)
We need to do some retriggers. I usually find it useful to retrigger 3 times on 5 revisions:
* target revision-2
* target revision-1
* target revision
* target revision+1
* target revision+2
In the case where there is missing data, the target revision becomes a range: [target revision, revisions with missing data]
This is important because we then have enough evidence to show that the regression is sustained through retriggers and over time. If you suspect alerts on other tests/platforms, please retrigger those as well.
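
The retrigger window described above can be written down as a small Python sketch. The function name, its inputs, and the reading that only revisions with missing data directly adjacent to the target widen the range are all assumptions made for illustration; this is not part of Alert Manager or TreeHerder.

```python
from typing import AbstractSet, Dict, List

RETRIGGERS_PER_REVISION = 3  # "retrigger 3 times" from the text above
WINDOW = 2                   # two revisions on each side of the target


def plan_retriggers(pushes: List[str], target: str,
                    missing: AbstractSet[str] = frozenset()) -> Dict[str, int]:
    """Return {revision: retrigger count} for the window around the target.

    pushes  -- revisions in push order, oldest first (hypothetical input)
    target  -- the revision named in the alert
    missing -- revisions with no data for this test/platform
    """
    idx = pushes.index(target)
    lo = hi = idx
    # Assumption: missing-data revisions next to the target join the target range.
    while lo > 0 and pushes[lo - 1] in missing:
        lo -= 1
    while hi < len(pushes) - 1 and pushes[hi + 1] in missing:
        hi += 1
    # Add two revisions on each side, clamped to the available pushes.
    start = max(0, lo - WINDOW)
    end = min(len(pushes) - 1, hi + WINDOW)
    return {rev: RETRIGGERS_PER_REVISION for rev in pushes[start:end + 1]}
```

For example, with pushes <code>a</code> through <code>g</code> and target <code>d</code>, the plan covers <code>b</code> through <code>f</code>; if <code>c</code> has no data, the window widens to start at <code>a</code>.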
== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout]] (usually within 1 week causing a similar regression/improvement)