m (→Understand an alert: - new title, more links) |
m (→Finding the root cause: - some edits) |
||
| Line 19: | Line 19: | ||
Keep in mind that alerts mention improvements and regressions, which is valuable for us to track the entire system as a whole. For filing bugs, we focus mostly on the regressions. | Keep in mind that alerts mention improvements and regressions, which is valuable for us to track the entire system as a whole. For filing bugs, we focus mostly on the regressions. | ||
= | = Investigating the alert = | ||
This is a manual process that needs to be done for every alert. We need to: | |||
* Look at the graph and determine the original branch, date, revision where the alert occurred | |||
* Look at the graph for all platforms related to the given branch/test, to see the scope of the regression | |||
* Look at TreeHerder and determine if we have all the data. | |||
* Retrigger jobs if needed (more [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]], more retriggers) | |||
Luckily in Alert Manager, we automatically show multiple branches for the given test/platform. This helps you determine the root branch. It is best to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom]] in and out to verify where the regression is. | |||
Once you have the spot, you can validate the other platforms by [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets]] to the graph. It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]]. | |||
Finally, with the scope of the regression known, we need to open the link from Alert Manager to TreeHerder. This is set up to show 5 revisions before and after the revision in the alert, filtered on the test and platform. Here we are looking for a few things: | |||
* | |||
* | |||
* | |||
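The "5 revisions before/after" window can be pictured with a small sketch. This is only an illustration: the function name `revision_window`, the `radius` parameter, and the list of revision strings are hypothetical and are not part of Alert Manager or TreeHerder.

```python
def revision_window(pushes, alert_revision, radius=5):
    """Return the pushes within `radius` positions of the alert revision.

    `pushes` is assumed to be a list of revision ids in push order
    (hypothetical input, not a TreeHerder data structure).
    """
    idx = pushes.index(alert_revision)  # raises ValueError if the revision is unknown
    start = max(0, idx - radius)        # clamp at the start of the list
    return pushes[start:idx + radius + 1]


# Made-up revision ids, oldest first.
pushes = [f"rev{i:02d}" for i in range(20)]
window = revision_window(pushes, "rev08")
early_window = revision_window(pushes, "rev02")
```

When the alert revision is close to the start or end of the list, the window is simply shorter on that side.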
There are many reasons for an alert and different scenarios to be aware of: | There are many reasons for an alert and different scenarios to be aware of: | ||
* backout (usually within 1 week causing a similar regression/improvement) | * [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout]] (usually within 1 week causing a similar regression/improvement) | ||
* pgo/nonpgo (some errors are pgo only and might be a side effect of pgo). We only ship PGO, so these are the most important. | * [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo]] (some errors are pgo only and might be a side effect of pgo). We only ship PGO, so these are the most important. | ||
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests | * test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests | ||
* Coalesced - this is when we don't run every job on every platform on every push and sometimes we have a set of changes | * [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged]] - sometimes the root cause looks to be a merge; this is normally a side effect of [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalescing]]. | ||
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced]] - this is when we don't run every job on every platform on every push and sometimes we have a set of changes | |||
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch | * Regular regression - the normal case where we get an alert and we see it merge from branch to branch | ||
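To make the coalescing scenario concrete, here is a minimal sketch of how missing data points widen the set of suspect pushes. It assumes a hypothetical list of `(revision, value)` pairs in push order, where `None` means the job did not run on that push, and a test where higher values are worse. The function name and the threshold are illustrative only, not how Alert Manager detects regressions.

```python
def suspect_pushes(results, threshold):
    """Return the revisions that could have introduced a regression.

    `results` is a list of (revision, value) pairs in push order; a value of
    None means the job was coalesced away and there is no data for that push.
    Assumes higher values are worse (hypothetical convention).
    """
    last_good = -1
    for i, (_rev, value) in enumerate(results):
        if value is None:
            continue  # no data for this push, so it stays a suspect
        if value > threshold:
            # Everything after the last good data point, up to and
            # including the first bad one, is a candidate.
            return [r for r, _ in results[last_good + 1:i + 1]]
        last_good = i
    return []


# Made-up data: pushes "b" and "c" have no results.
results = [("a", 100), ("b", None), ("c", None), ("d", 120), ("e", 121)]
suspects = suspect_pushes(results, threshold=110)
```

Here three pushes remain suspects until jobs are retriggered on the pushes with no data, which is why retriggers narrow down the root cause.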
= Verifying an alert = | = Verifying an alert = | ||