Buildbot/Talos/Sheriffing

Keep in mind that alerts cover both improvements and regressions, which is valuable for tracking the system as a whole.  For filing bugs, we focus mostly on the regressions.


= Investigating the alert =
This is a manual process that needs to be done for every alert.  We need to:
* Look at the graph and determine the original branch, date, and revision where the alert occurred
* Look at the graph for all platforms related to the given branch/test, to see the scope of the regression
* Look at TreeHerder and determine if we have all the data.
* Retrigger jobs if needed (more [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]], more retriggers; a rough noise check is sketched after this list)
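To help decide whether a change stands out from the [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise]] or whether more retriggers are needed, here is a minimal sketch (not part of the sheriffing tools; the numbers are made up) that compares the median results before and after a suspect revision against the spread of the "before" window.

<pre>
# Minimal sketch, assuming you have pasted per-push Talos results from the
# graph into two lists.  The values below are made up for illustration.
from statistics import median, stdev

before = [251.2, 249.8, 252.0, 250.5, 251.7]   # pushes before the suspect revision
after = [263.4, 262.1, 264.0, 262.8]           # pushes at/after the suspect revision

delta = median(after) - median(before)
noise = stdev(before)

print("median delta: %.1f, noise (stdev of before): %.1f" % (delta, noise))
if abs(delta) < 2 * noise:
    print("within the noise -- retrigger more jobs before drawing conclusions")
else:
    print("stands out from the noise -- likely a real regression or improvement")
</pre>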
 
Luckily in Alert Manager, we automatically show multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Zooming zoom]] in and out to verify where the regression is.
 
Once you have the spot, you can validate the other platforms by [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/GraphServer_FAQ#Adding_additional_Data_Points adding additional data sets]] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]].
 
Finally, with the scope of the regression known, we need to open the link from Alert Manager to TreeHerder.  This is set up to show 5 revisions before/after the revision in the alert, filtered on the test and platform.  Here we are looking for a few things:
* whether we have data for all revisions in the range, or whether we need to retrigger jobs
* jobs that were [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalesced]], so the alert may map to a range of changes
* which revision the values actually change on, across the affected platforms
 
There are many reasons for an alert and different scenarios to be aware of:
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout]] (usually within 1 week causing a similar regression/improvement)
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo]] (some regressions are PGO-only and might be a side effect of PGO).  We only ship PGO, so these are the most important.
* test/infrastructure change - once in a while we make a big change to our tests or infrastructure, and it affects the results
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged]] - sometimes the root cause appears to be a merge; this is normally a side effect of [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing]].
* [[https://wiki.mozilla.org/Buildbot/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced]] - we don't run every job on every platform for every push, so an alert can map to a range of changes rather than a single one
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
== Backout ==
Backouts happen every day, but backouts that generate performance regressions are what add noise to the system.
Here is an example of a backout which affected many tests.
[[http://alertmanager.allizom.org:8080/alerts.html?rev=d4c4897f9ffb AlertManager]] [[http://alertmanager.allizom.org:8080/alerts.html?rev=5cc96a763c3f related coalesced]]
This example is interesting: one change was quickly identified as the correct one, but one job was coalesced.  The coalescing is easy to detect because the suspected [[http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=f4eea7e2f94b&tochange=5cc96a763c3f changeset]] is actually a range.  That range includes the backed-out changeset, and the graph shows the backout pattern.  In addition, this is on Windows 8, the platform which showed a regression on the backout.  That gives us high confidence in mapping this coalesced alert to the backout as its root cause.
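A hedged sketch (not an official sheriffing tool) of inspecting a coalesced range programmatically: it assumes hg.mozilla.org exposes the pushlog as JSON via a json-pushes URL accepting the same fromchange/tochange parameters as the pushloghtml link above, returning a mapping of push id to changesets; if the endpoint differs, adjust accordingly.

<pre>
# Hedged sketch: list every changeset inside the coalesced push range so you
# can check whether the changeset you care about (e.g. the one that was
# backed out) falls inside it.  Assumes the pushlog JSON endpoint
# (json-pushes) takes the same fromchange/tochange parameters as the
# pushloghtml URL referenced in the text.
import json
from urllib.request import urlopen

URL = ("https://hg.mozilla.org/integration/mozilla-inbound/json-pushes"
       "?fromchange=f4eea7e2f94b&tochange=5cc96a763c3f")

pushes = json.loads(urlopen(URL).read())
for push_id, push in sorted(pushes.items(), key=lambda kv: int(kv[0])):
    for cset in push["changesets"]:
        print("push %s: %s" % (push_id, cset[:12]))
</pre>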


= Verifying an alert =