m (Davehunt moved page Performance sheriffing to TestEngineering/Performance/Sheriffing)
= What is an alert =
As of January 2016, alerts are generated in [https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 Perfherder]. These are generated by programmatically verifying that there is a sustained regression over time ([[/Noise_FAQ#Why_do_we_need_12_future_data_points|original data point + 12 future data points]]).

There is an alert summary that groups the alerts matching the same set of revisions. The summary contains a few pieces of information:
* Title (a good bug title if you file a bug for the regression), made up of:
** [[/Tree_FAQ#Branch_names_and_confusion|branch]]
** % regressed: the range of the regressions (improvements are not included)
** the [[TestEngineering/Performance/Talos/Tests|tests]] which have regressed
** the platforms we see this regression on
* date of the suspect revision push
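The way a summary title is composed from its alerts can be sketched in a few lines of Python. This is only an illustration: the function name <code>summary_title</code>, the alert dictionary keys and the exact title layout are assumptions made for this example, not Perfherder's actual format.

```python
def summary_title(branch, alerts):
    """Build a hypothetical summary title from a list of alert dicts.

    Each alert is assumed (for this sketch only) to look like:
    {"test": "tp5o", "platform": "linux64",
     "pct_change": 4.2, "is_regression": True}
    """
    # Only regressions contribute to the title; improvements are skipped.
    regressions = [a for a in alerts if a["is_regression"]]
    if not regressions:
        return f"{branch}: no regressions"

    # The "% regressed" part is the range over all regressions.
    pcts = [abs(a["pct_change"]) for a in regressions]
    low, high = min(pcts), max(pcts)
    pct_range = f"{low:.2f}%" if low == high else f"{low:.2f} - {high:.2f}%"

    # De-duplicate and sort test and platform names for a stable title.
    tests = " / ".join(sorted({a["test"] for a in regressions}))
    platforms = ", ".join(sorted({a["platform"] for a in regressions}))
    return f"{pct_range} {tests} ({platforms}) regression on {branch}"
```

For example, two regressions of 2.5% and 10.25% on different tests and platforms, plus one improvement, would yield a title that names both tests and both platforms with the range "2.50 - 10.25%", and the improvement would not appear.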
Below the summary will be a list of alerts. Each alert will reference:
* [[TestEngineering/Performance/Talos/Tests|Test name]]
* platform (including build type, such as opt, pgo)
* old score (median score of the previous 12 commits)
* new score (median score of the future 12 commits)
* [[/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25|% change / values]]
* bar chart to show severity: green = improvement, red = regression
* confidence value (from the t-test code)
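As a rough illustration of how the old score, new score, % change and a t-test style confidence relate to each other, here is a minimal Python sketch. It assumes a window of 12 data points on each side, treats the suspect data point as the first point of the "new" window, and uses Welch's t statistic. The function name <code>compare_windows</code> and these choices are assumptions for this example; the actual Perfherder t-test code may differ.

```python
import math
import statistics

WINDOW = 12  # data points on each side of the suspect push, per the list above


def compare_windows(values, index, window=WINDOW):
    """Compare the points before `index` with the points from `index` onward.

    `values` is an ordered list of test results, `index` is the position of
    the suspect data point. Returns (old_score, new_score, pct_change, t_value).
    """
    before = values[max(0, index - window):index]
    after = values[index:index + window]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two data points on each side")

    # Scores are medians, which are less sensitive to single noisy runs.
    old_score = statistics.median(before)
    new_score = statistics.median(after)
    pct_change = 100.0 * (new_score - old_score) / old_score

    # Welch's t statistic (an assumption here): a larger value means the
    # two windows are less likely to differ by chance alone.
    mean_diff = abs(statistics.mean(after) - statistics.mean(before))
    std_err = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    if std_err == 0:
        t_value = math.inf if mean_diff > 0 else 0.0
    else:
        t_value = mean_diff / std_err
    return old_score, new_score, pct_change, t_value
```

With noisy data, more retriggers add data points to each window, which shrinks the standard error and raises the t value for a real change.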
* Look at the graph and determine the original branch, date, and revision where the alert occurred
* Look at Treeherder and determine if we have all the data
* Retrigger jobs if needed (more [[/Noise_FAQ#What_is_Noise|noise]], more retriggers)
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Add all related alerts you see to the summary with the reassign button
== Determining the root cause from Perfherder ==
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform. This helps you determine the root branch. It is best to [[/Perfherder_FAQ#Zooming|zoom]] in and out to verify where the regression is.

While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
== Determining the scope of the regression from Perfherder ==
Once you have the spot, you can validate the other platforms by [[/Perfherder_FAQ#Adding_additional_data_points|adding additional data sets]] to the graph. It is best here to zoom out a bit, as the regression might be a few revisions off on different platforms due to [[/Tree_FAQ#What_is_coalescing|coalescing]].
== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
* [[/Tree_FAQ#What_is_a_backout|backout]] (usually within 1 week, causing a similar regression/improvement)
* [[/Tree_FAQ#What_is_PGO|pgo/nonpgo]] (some errors are pgo only and might be a side effect of pgo). We only ship PGO, so these are the most important.
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* [[/Tree_FAQ#What_is_a_merge|Merged]] - sometimes the root cause looks to be a merge; this is normally a side effect of [[/Tree_FAQ#What_is_coalescing|Coalescing]].
* [[/Tree_FAQ#What_is_coalescing|Coalesced]] - this is when we don't run every job on every platform on every push, so sometimes an alert points at a set of changes rather than a single push
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
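The effect of coalescing on the suspect range can be sketched as follows. This is a simplified illustration, not Perfherder code: the function name <code>suspect_pushes</code>, the <code>(revision, value)</code> pairs and the fixed <code>threshold</code> are assumptions for this example, and it assumes a test where a higher value is worse.

```python
def suspect_pushes(pushes, threshold):
    """Return the revisions that could have introduced a regression.

    `pushes` is an ordered list of (revision, value) pairs, where value is
    None if the job was coalesced (did not run) on that push. A value above
    `threshold` is treated as regressed (higher-is-worse is assumed).
    """
    last_good = None
    for i, (_revision, value) in enumerate(pushes):
        if value is None:
            continue  # no data for this push, so it stays a suspect
        if value > threshold:
            # Everything after the last good data point, up to and
            # including the first bad one, is a suspect.
            start = 0 if last_good is None else last_good + 1
            return [rev for rev, _ in pushes[start:i + 1]]
        last_good = i
    return []
```

If pushes <code>b</code> and <code>c</code> have no data between a good push <code>a</code> and a bad push <code>d</code>, the suspects are <code>b</code>, <code>c</code> and <code>d</code>. Retriggering the missing jobs fills in the gaps and narrows the range, ideally to a single push.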
Every release of Firefox we create a tracking bug (e.g. {{bug|1386631}} - Firefox 57) which we use to associate all regressions found during that release. The reason for this is twofold:
* We can go to one spot and see what regressions we have, for reference on new bugs or to follow up.
* When we [[/Tree_FAQ#What_is_an_uplift|uplift]], it is important to see which alerts we are expecting.
These bugs just contain a set of links to other bugs; no conversation is needed.

Here are some things to check/verify when filing a bug:
* Product/Component - this should be the same as the bug which is the root cause; if there is more than one bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Dependent/Block bugs - for a new bug, add the [[#Tracking_bugs|tracking bug]] (for the current version) and root cause bug(s) as blocking this bug
* CC list - cc patch author(s), reviewer(s) and owner of the tests as documented on the [[TestEngineering/Performance/Talos/Tests|Talos tests wiki]]; if we have more than one bug, we should cc everyone who worked on those bugs so we can all pitch in and answer questions faster
* Summary - check the bug summary to make sure the revision is accurate
* Description - this is auto-suggested as well; please verify the revision here
As a note, the generated description refers the patch author to [[TestEngineering/Performance/Talos/RegressionBugsHandling|guidelines and expectations]] about how and when to respond.

Once a bug is filed, it is a good idea to do a few things in another comment:
* provide a link to compare view to show you have done retriggers and believe this is valid
* needinfo the patch author (if there are many patch authors, needinfo one of :davehunt, :igoldan or :rwood)
* mention how confident you are in the regression (more confident if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)
== Merge Day - Uplifts ==
Every 6 weeks we do an [[/Tree_FAQ#What_is_an_uplift|uplift]]. Each uplift typically results in [https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts].

The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts. One difference is that we have alerts from the upstream branch. Take for example when we uplift Mozilla-Central to Mozilla-Beta. We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix). In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Beta with the bugs that have already been filed. The job here is to verify and add bugs to keep track of what is there.
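The matching step described above can be pictured as a lookup from each new alert to the bugs already filed against the tracking bug. The sketch below is hypothetical: the function name <code>match_uplift_alerts</code>, the use of <code>(test, platform)</code> pairs as the matching key, and the bug ids in the usage note are all assumptions for illustration, and in practice this matching is a manual triage task.

```python
def match_uplift_alerts(beta_alerts, tracked_bugs):
    """Split uplift alerts into those with a known bug and those without.

    `beta_alerts` is a list of (test, platform) pairs seen after the uplift.
    `tracked_bugs` maps a bug id to the set of (test, platform) pairs that
    bug already covers (both structures are assumptions for this sketch).
    Returns (matched, unmatched): matched maps each alert to its bug ids,
    unmatched lists the alerts that still need investigation.
    """
    # Invert the mapping so each (test, platform) pair points at its bugs.
    known = {}
    for bug_id, pairs in tracked_bugs.items():
        for pair in pairs:
            known.setdefault(pair, []).append(bug_id)

    matched = {a: sorted(known[a]) for a in beta_alerts if a in known}
    unmatched = [a for a in beta_alerts if a not in known]
    return matched, unmatched
```

Alerts that end up in <code>unmatched</code> are the ones that need a closer look and possibly a new bug. Matched alerts still deserve a quick check, since the linked bug may have been resolved as wontfix.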
= Additional Resources =
* [[/Alert_FAQ|Alert FAQ]]
* [[/Noise_FAQ|Noise FAQ]]
* [[/Perfherder_FAQ|Perfherder FAQ]]
* [[/Tree_FAQ|Tree FAQ]]