Buildbot/Talos/Sheriffing: Difference between revisions

* Retrigger jobs if needed (more [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]], more retriggers)


== Determining the root cause from the graph server ==
Luckily in Alert Manager, we automatically show multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom]] in and out to verify where the regression is.


== Determining the scope of the regression from graph server ==
Once you have the spot, you can validate the other platforms by [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets]] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]].


== Determining if we have all the data from tree herder ==
Finally, with the scope of the regression established, open the link from Alert Manager to TreeHerder.  This link is set up to show 5 revisions before and after the revision in the alert, filtered on the test and platform.  Here we are looking for a few things:
* Do we have data for the revision before / after the revision we have identified as regressing?  If not, we should consider filling in the missing data.
* Is our revision, or the revision before / after it, a merge?  If so, we should retrigger to confirm that we are not investigating a merged changeset.  If we are on a merged changeset, we need to go to the original branch and bisect.
* Does it look like most of the other platforms/talos tests have completed in this range?  If not, then we could have other alerts for tests/platforms arriving in the future.
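
The first two checks above can be sketched in a few lines of Python. This is only an illustration, not part of Alert Manager or TreeHerder: the <code>Push</code> record, its fields, and the <code>review_neighbours</code> helper are hypothetical names, and the list of pushes is assumed to have been collected by hand from the TreeHerder view, ordered oldest to newest.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Push:
    """Hypothetical record for one push in the TreeHerder view."""
    revision: str
    has_data: bool          # True if the filtered test/platform job reported a result
    is_merge: bool = False  # True if the push is a merge from another branch


def review_neighbours(pushes: List[Push], target: str) -> Dict[str, List[str]]:
    """Check the target revision and the revisions directly before and after it.

    `pushes` is assumed to be ordered oldest to newest.
    """
    revisions = [p.revision for p in pushes]
    if target not in revisions:
        raise ValueError(f"{target} is not in the list of pushes")
    idx = revisions.index(target)
    # Slice covers target-1, target, target+1 (clipped at the start of the list).
    neighbours = pushes[max(0, idx - 1): idx + 2]
    return {
        "missing_data": [p.revision for p in neighbours if not p.has_data],
        "merges": [p.revision for p in neighbours if p.is_merge],
    }
```

A non-empty <code>missing_data</code> list suggests filling in data before going further, and a non-empty <code>merges</code> list suggests retriggering and, if needed, bisecting on the original branch.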


== Retriggering jobs ==
In the case where we have:
* missing data
* an alert on a merge
* an alert on a pgo build (with no alert on a non-pgo)
* an alert where the range of the regression overlaps with the regular range (a small alert (<5%) or a noisy test)
We need to do some retriggers.  I usually find it useful to retrigger 3 times on each of 5 revisions:
* target revision-2
* target revision-1
* target revision
* target revision+1
* target revision+2
In the case where there is missing data, the target revision becomes a range: [target revision, revisions with missing data].
This is important because we then have enough evidence to show that the regression is sustained through retriggers and over time.  If you suspect alerts on other tests/platforms, please retrigger those as well.
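
As a rough sketch of the retrigger plan described above, the following Python helper lists which revisions to retrigger and how many times. The function name <code>retrigger_plan</code> and its parameters are hypothetical, and it assumes one reading of the missing-data rule: the target range spans the target revision plus any revisions with missing data, and the two-revision margin is applied on each side of that range.

```python
from typing import Dict, Sequence


def retrigger_plan(
    revisions: Sequence[str],
    target: str,
    missing: Sequence[str] = (),
    retriggers: int = 3,
    margin: int = 2,
) -> Dict[str, int]:
    """Map each revision to retrigger to the number of retriggers to request.

    `revisions` is assumed to be ordered oldest to newest; an unknown
    revision in `target` or `missing` raises KeyError.
    """
    position = {rev: i for i, rev in enumerate(revisions)}
    # The "target" is a range when some revisions have missing data (assumption).
    core = [position[target]] + [position[rev] for rev in missing]
    first = max(0, min(core) - margin)
    last = min(len(revisions) - 1, max(core) + margin)
    return {rev: retriggers for rev in revisions[first:last + 1]}


if __name__ == "__main__":
    revs = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"]
    for rev, count in retrigger_plan(revs, "r5").items():
        print(f"retrigger {rev} x{count}")
```

With no missing data this yields the five revisions from target-2 to target+2, each retriggered 3 times. Noisier tests may need a larger <code>retriggers</code> value.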


== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout]] (usually within 1 week causing a similar regression/improvement)