TestEngineering/Performance/Sheriffing: Difference between revisions

no edit summary
No edit summary
Line 5: Line 5:


= What is an alert =
= What is an alert =
As of January 2016, alerts are generated in [https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 Perfherder].  These are generated by programmatically verifying there is a sustained regression over time ([https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Noise_FAQ#Why_do_we_need_12_future_data_points original data point + 12 future data points]).
As of January 2016, alerts are generated in [https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 Perfherder].  These are generated by programmatically verifying there is a sustained regression over time ([[/Noise_FAQ#Why_do_we_need_12_future_data_points|original data point + 12 future data points]]).


There is an alert summary outlining the alerts which match the same set of revisions.  For the summary there are a few pieces of information:
There is an alert summary outlining the alerts which match the same set of revisions.  For the summary there are a few pieces of information:
* Title (which is a good bug title if filing one for a regression):
* Title (which is a good bug title if filing one for a regression):
** [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#Branch_names_and_confusion branch]
** [[/Tree_FAQ#Branch_names_and_confusion|branch]]
** % regressed, this is a range of the regressions (not improvements)
** % regressed, this is a range of the regressions (not improvements)
** the [https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests tests] which have regressed
** the [[TestEngineering/Performance/Talos/Tests|tests]] which have regressed
** the platforms we see this regression on
** the platforms we see this regression on
* date of the suspect revision push
* date of the suspect revision push
Line 19: Line 19:


Below the summary is a list of alerts; each alert references:
Below the summary is a list of alerts; each alert references:
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests Test name]
* [[TestEngineering/Performance/Talos/Tests|Test name]]
* platform (including build type, such as opt, pgo)
* platform (including build type, such as opt, pgo)
* old score (median score of the previous 12 commits)
* old score (median score of the previous 12 commits)
* new score (median score of the future 12 commits)
* new score (median score of the future 12 commits)
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25 % change / values]
* [[/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25|% change / values]]
* bar chart to show severity, green = improvement, red = regression
* bar chart to show severity, green = improvement, red = regression
* Confidence value (from the t-test code)
* Confidence value (from the t-test code)
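
The old score, new score and % change fields above can be illustrated with a minimal sketch. This is not Perfherder's code: the function name <code>summarize_alert</code>, the input layout (one score per push, in push order) and the choice to count the suspect push toward the new window are assumptions made for illustration only.

```python
from statistics import median


def summarize_alert(values, index, window=12):
    """Return (old_score, new_score, pct_change) around a suspect push.

    values: one score per push, in push order (hypothetical input layout).
    index:  position of the suspect push in `values`.
    window: number of data points on each side (12, per the text above).
    """
    before = values[max(0, index - window):index]
    # Assumption: the suspect push itself is counted in the "new" window.
    after = values[index:index + window]
    if not before or not after:
        raise ValueError("need data on both sides of the suspect push")
    old_score = median(before)
    new_score = median(after)
    if old_score == 0:
        raise ValueError("old score is zero, percent change is undefined")
    pct_change = 100.0 * (new_score - old_score) / old_score
    return old_score, new_score, pct_change


# Made-up data: twelve pushes at 100, then twelve pushes at 110.
scores = [100] * 12 + [110] * 12
old, new, pct = summarize_alert(scores, 12)
print(old, new, pct)
```

Whether a positive change is a regression or an improvement depends on the test, so the sign alone does not classify the alert. The medians are used here because the fields above are described as median scores; the confidence value from the t-test code is not reproduced in this sketch.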
Line 33: Line 33:
* Look at the graph and determine the original branch, date, revision where the alert occurred
* Look at the graph and determine the original branch, date, revision where the alert occurred
* Look at Treeherder and determine if we have all the data.
* Look at Treeherder and determine if we have all the data.
* Retrigger jobs if needed (more [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise], more retriggers)
* Retrigger jobs if needed (more [[/Noise_FAQ#What_is_Noise|noise]], more retriggers)
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Add all related alerts you see to the summary with the reassign button
* Add all related alerts you see to the summary with the reassign button


== Determining the root cause from Perfherder ==
== Determining the root cause from Perfherder ==
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Perfherder_FAQ#Zooming zoom] in and out to verify where the regression is.
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform.  This helps you determine the root branch.  It is best to [[/Perfherder_FAQ#Zooming|zoom]] in and out to verify where the regression is.


While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
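
As a rough illustration of the root/downstream idea above, the sketch below treats the branch with the earliest push date as the likely root and every later one as a downstream candidate. This is a simplifying assumption, not how Perfherder decides anything: the function name <code>split_root_and_downstream</code>, the tuple layout and the dates are all hypothetical, and a sheriff still needs to confirm the result on the graph.

```python
from datetime import date


def split_root_and_downstream(alerts):
    """Split alerts for one test/platform into (root, downstream).

    alerts: list of (branch, push_date) tuples (hypothetical layout).
    Assumption: the earliest push date marks the root branch.
    """
    if not alerts:
        raise ValueError("no alerts to compare")
    ordered = sorted(alerts, key=lambda alert: alert[1])
    return ordered[0], ordered[1:]


# Made-up dates for the same regression seen on two branches.
alerts = [
    ("mozilla-beta", date(2017, 9, 21)),
    ("mozilla-central", date(2017, 8, 3)),
]
root, downstream = split_root_and_downstream(alerts)
print(root[0], [branch for branch, _ in downstream])
```

Sorting by date alone ignores coalescing and backouts, which is why the result is only a starting point for marking an alert as downstream.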
Line 72: Line 72:


== Determining the scope of the regression from Perfherder ==
== Determining the scope of the regression from Perfherder ==
Once you have the spot, you can validate the other platforms by [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Perfherder_FAQ#Adding_additional_data_points adding additional data sets] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing].
Once you have the spot, you can validate the other platforms by [[/Perfherder_FAQ#Adding_additional_data_points|adding additional data sets]] to the graph.  It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [[/Tree_FAQ#What_is_coalescing|coalescing]].


== Cases to watch out for ==
== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
There are many reasons for an alert and different scenarios to be aware of:
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout] (usually within 1 week causing a similar regression/improvement)
* [[/Tree_FAQ#What_is_a_backout|backout]] (usually within 1 week causing a similar regression/improvement)
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo] (some errors are pgo only and might be a side effect of pgo).  We only ship PGO, so these are the most important.
* [[/Tree_FAQ#What_is_PGO|pgo/nonpgo]] (some errors are pgo only and might be a side effect of pgo).  We only ship PGO, so these are the most important.
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged] - sometimes the root cause looks to be a merge; this is normally a side effect of [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalescing].
* [[/Tree_FAQ#What_is_a_merge|Merged]] - sometimes the root cause looks to be a merge; this is normally a side effect of [[/Tree_FAQ#What_is_coalescing|Coalescing]].
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced] - this is when we don't run every job on every platform on every push, so sometimes we have a set of changes
* [[/Tree_FAQ#What_is_coalescing|Coalesced]] - this is when we don't run every job on every platform on every push, so sometimes we have a set of changes
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch


Line 86: Line 86:
Every release of Firefox we create a tracking bug (e.g. {{bug|1386631}} - Firefox 57) which we use to associate all regressions found during that release.  The reason for this is twofold:
Every release of Firefox we create a tracking bug (e.g. {{bug|1386631}} - Firefox 57) which we use to associate all regressions found during that release.  The reason for this is twofold:
* We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
* We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
* When we [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift] it is important to see which alerts we are expecting
* When we [[/Tree_FAQ#What_is_an_uplift|uplift]] it is important to see which alerts we are expecting


These bugs just contain a set of links to other bugs, no conversation is needed.
These bugs just contain a set of links to other bugs, no conversation is needed.
Line 97: Line 97:
Here are some things to check/verify when filing a bug:
Here are some things to check/verify when filing a bug:
* Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Dependent/Block bugs - For a new bug, add the [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing#Tracking_bugs tracking bug] (for the current version) and root cause bug(s) as blocking this bug
* Dependent/Block bugs - For a new bug, add the [[#Tracking_bugs|tracking bug]] (for the current version) and root cause bug(s) as blocking this bug
* CC list - cc patch author(s), reviewer(s) and owner of the tests as documented on the [https://wiki.mozilla.org/Performance_sheriffing/Talos/Tests Talos tests wiki]; if we have >1 bug, we should cc everyone who worked on those bugs so we can all pitch in and answer questions faster
* CC list - cc patch author(s), reviewer(s) and owner of the tests as documented on the [[TestEngineering/Performance/Talos/Tests|Talos tests wiki]]; if we have >1 bug, we should cc everyone who worked on those bugs so we can all pitch in and answer questions faster
* Summary of bug should have a check to make sure the revision is accurate
* Summary of bug should have a check to make sure the revision is accurate
* The description is auto suggested as well, please verify the revision here
* The description is auto suggested as well, please verify the revision here


As a note, the generated description refers the patch author to [https://wiki.mozilla.org/Performance_sheriffing/Talos/RegressionBugsHandling guidelines and expectations] for them about how and when to respond.
As a note, the generated description refers the patch author to [[TestEngineering/Performance/Talos/RegressionBugsHandling|guidelines and expectations]] for them about how and when to respond.


Once a bug is filed it is a good idea to do a few things in another comment:
Once a bug is filed it is a good idea to do a few things in another comment:
* provide a link to compare view to show you have done retriggers and believe this is valid
* provide a link to compare view to show you have done retriggers and believe this is valid
* needinfo the patch author (if many patch authors, needinfo one of :jmaher, :igoldan or :rwood)
* needinfo the patch author (if many patch authors, needinfo one of :davehunt, :igoldan or :rwood)
* mention how confident you are in the regression (more confidence if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)
* mention how confident you are in the regression (more confidence if you have a lot of retriggers and there is only one patch, less confident if you are waiting on backfilling data, retriggers, try runs, etc.)


Line 113: Line 113:


== Merge Day - Uplifts ==
== Merge Day - Uplifts ==
Every 6 weeks we do an [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift].  These typically result in [https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts] for each uplift.
Every 6 weeks we do an [[/Tree_FAQ#What_is_an_uplift|uplift]].  These typically result in [https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts] for each uplift.


The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts.  One thing here is we have alerts from the upstream branch.  Take for example when we uplift Mozilla-Central to Mozilla-Beta.  We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix).  In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Beta with the bugs that have already been filed.  The job here is to verify and add bugs to keep track of what is there.
The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts.  One thing here is we have alerts from the upstream branch.  Take for example when we uplift Mozilla-Central to Mozilla-Beta.  We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix).  In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Beta with the bugs that have already been filed.  The job here is to verify and add bugs to keep track of what is there.
Line 125: Line 125:


= Additional Resources =
= Additional Resources =
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Alert_FAQ Alert FAQ]
* [[/Alert_FAQ|Alert FAQ]]
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Noise_FAQ Noise FAQ]
* [[/Noise_FAQ|Noise FAQ]]
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Perfherder_FAQ Perfherder FAQ]
* [[/Perfherder_FAQ|Perfherder FAQ]]
* [https://wiki.mozilla.org/Performance_sheriffing/Talos/Sheriffing/Tree_FAQ Tree FAQ]
* [[/Tree_FAQ|Tree FAQ]]
* [https://wiki.mozilla.org/Buildbot/Talos/Sheriffing duplicated & updated from old page]