TestEngineering/Performance/Sheriffing

336 bytes added, 12:40, 22 January 2018
update links
= What is an alert =
As of January 2016, alerts are generated in [https://treeherder.mozilla.org/perf.html#/alerts?status=0&framework=1 Perfherder]. These are generated by programmatically verifying there is a sustained regression over time ([https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Noise_FAQ#Why_do_we_need_12_future_data_points original data point + 12 future data points]).
There is an alert summary outlining the alerts which match the same set of revisions. For the summary there are a few pieces of information:
* Title (which is a good bug title if filing one for a regression):
** [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#Branch_names_and_confusion branch]
** % regressed, this is a range of the regressions (not improvements)
** the [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Tests tests] which have regressed
** the platforms we see this regression on
* date of the suspect revision push
Below the summary will be a list of alerts, each alert will reference:
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Tests Test name]
* platform (including build type, such as opt, pgo)
* old score (median score of the previous 12 commits)
* new score (median score of the future 12 commits)
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Alert_FAQ#Why_does_Alert_Manager_print_-xx.25 % change / values]
* bar chart to show severity, green = improvement, red = regression
* Confidence value (from the t-test code)
* Look at the graph and determine the original branch, date, revision where the alert occurred
* Look at Treeherder and determine if we have all the data.
* Retrigger jobs if needed (more [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Noise_FAQ#What_is_Noise noise], more retriggers)
* Once you have more data, look at the data in [https://treeherder.mozilla.org/perf.html#/comparechooser compare view] to see if other tests/platforms have changed
* Add all related alerts you see to the summary with the reassign button
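The old score, new score and % change listed for each alert can be reproduced by hand when double-checking an alert. The following is a minimal sketch, not Perfherder's actual code: the function name <code>summarize_alert</code>, the input format (a plain chronological list of results), and the use of Welch's t statistic as the confidence value are all assumptions made for illustration. The page only states that the confidence comes from "the t-test code".

```python
from math import sqrt
from statistics import mean, median, variance

WINDOW = 12  # data points considered on each side of the suspect push


def summarize_alert(values, index, window=WINDOW, lower_is_better=True):
    """Summarize the change at values[index] (hypothetical helper).

    values: chronological test results.
    index:  position of the first data point after the suspected change.
    """
    before = values[max(0, index - window):index]
    after = values[index:index + window]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("need at least two data points on each side")

    old_score = median(before)
    new_score = median(after)
    if old_score == 0:
        raise ValueError("old score is zero; % change is undefined")
    pct_change = (new_score - old_score) / old_score * 100.0

    # Assumption: Welch's t statistic stands in for the "confidence value".
    std_err = sqrt(variance(before) / len(before) + variance(after) / len(after))
    diff = abs(mean(after) - mean(before))
    if std_err > 0:
        confidence = diff / std_err
    else:
        confidence = float("inf") if diff > 0 else 0.0

    # Assumption: for timing-style tests a higher value is worse.
    is_regression = pct_change > 0 if lower_is_better else pct_change < 0
    return {
        "old_score": old_score,
        "new_score": new_score,
        "pct_change": pct_change,
        "confidence": confidence,
        "is_regression": is_regression,
    }
```

For tests where a higher score is better, pass <code>lower_is_better=False</code> so that a drop in the score is reported as the regression.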
== Determining the root cause from Perfherder ==
When viewing a single alert and clicking on the graph link, Perfherder automatically shows multiple branches for the given test/platform. This helps you determine the root branch. It is best to [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Perfherder_FAQ#Zooming zoom] in and out to verify where the regression is.
While this isn't always clear, most of the time it is easy to see another alert on a different branch and mark the current one as a downstream if needed.
== Determining the scope of the regression from Perfherder ==
Once you have the spot, you can validate the other platforms by [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Perfherder_FAQ#Adding_additional_data_points adding additional data sets] to the graph. It is best here to zoom out a bit as the regression might be a few revisions off on different platforms due to [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing coalescing].
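As an illustration of why zooming out helps, the sketch below looks for the largest shift in each platform's data and accepts it if it lands within a few pushes of the suspect push. Everything here is hypothetical: the function names, the <code>(push_position, value)</code> input format, the window size and the tolerance are assumptions for the example, and Perfherder does not expose this as an API. Coalesced platforms simply have no data point for some pushes, so their shift shows up at a later push position.

```python
from statistics import mean


def largest_shift_push(points, window=3):
    """Return the push position where the mean shifts the most, or None.

    points: list of (push_position, value), sorted by push position.
    """
    best_push, best_shift = None, 0.0
    for i in range(window, len(points) - window + 1):
        before = [value for _, value in points[i - window:i]]
        after = [value for _, value in points[i:i + window]]
        shift = abs(mean(after) - mean(before))
        if shift > best_shift:
            best_push, best_shift = points[i][0], shift
    return best_push


def platforms_in_scope(series_by_platform, suspect_push, tolerance=2, window=3):
    """Map each platform whose largest shift is near suspect_push to that push."""
    in_scope = {}
    for platform, points in series_by_platform.items():
        push = largest_shift_push(points, window)
        if push is not None and abs(push - suspect_push) <= tolerance:
            in_scope[platform] = push
    return in_scope
```

A flat series has no shift at all and is left out of the result, which corresponds to a platform that did not regress.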
== Cases to watch out for ==
There are many reasons for an alert and different scenarios to be aware of:
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_a_backout backout] (usually within 1 week causing a similar regression/improvement)
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_PGO pgo/nonpgo] (some errors are pgo only and might be a side effect of pgo). We only ship PGO, so these are the most important.
* test/infrastructure change - once in a while we change big things about our tests or infrastructure and it affects our tests (we need bugs to document these)
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_a_merge Merged] - sometimes the root cause looks to be a merge; this is normally a side effect of [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalescing].
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_coalescing Coalesced] - this is when we don't run every job on every platform on every push and sometimes we have a set of changes
* Regular regression - the normal case where we get an alert and we see it merge from branch to branch
For every release of Firefox we create a tracking bug (e.g. {{bug|1386631}} - Firefox 57) which we use to associate all regressions found during that release. The reason for this is twofold:
* We can go to one spot and see what regressions we have for reference on new bugs or to follow up.
* When we [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift] it is important to see which alerts we are expecting
These bugs just contain a set of links to other bugs, no conversation is needed.
Here are some things to check/verify when filing a bug:
* Product/Component - this should be the same as the bug which is the root cause, if >1 bug, file in [https://bugzilla.mozilla.org/enter_bug.cgi?product=Testing&component=Talos Talos]
* Dependent/Block bugs - For a new bug, add the [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing#Tracking_bugs tracking bug] (for the current version) and root cause bug(s) as blocking this bug
* CC list - cc patch author(s), reviewer(s) and owner of the tests as documented on the [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Tests Talos tests wiki]; if we have >1 bug, we should cc everyone who worked on those bugs so we can all pitch in and answer questions faster
* Summary - check the bug summary to make sure the revision is accurate
* Description - this is auto-suggested as well; please verify the revision here
As a note, the generated description refers the patch author to [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/RegressionBugsHandling guidelines and expectations] for them about how and when to respond.
Once a bug is filed it is a good idea to do a few things in another comment:
== Merge Day - Uplifts ==
Every 6 weeks we do an [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ#What_is_an_uplift uplift]. These typically result in [https://elvis314.wordpress.com/2014/12/12/tracking-firefox-performance-as-we-uplift-the-volume-of-alerts-we-get/ dozens of alerts] for each uplift.
The job here is to triage alerts as we usually do, except in this case we have a much larger volume of alerts. One thing here is we have alerts from the upstream branch. Take for example when we uplift Mozilla-Central to Mozilla-Beta. We have a tracking bug for each release, and there is a list of bugs (keep in mind some are resolved as wontfix). In a perfect world (half the time) we can match up the alerts that are showing up on Mozilla-Beta with the bugs that have already been filed. The job here is to verify and add bugs to keep track of what is there.
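The matching step described above can be sketched as a simple lookup. This is only an illustration: the function name <code>match_uplift_alerts</code> and the dictionary fields (<code>test</code>, <code>platform</code>, <code>bug</code>) are hypothetical, and in practice the matching is done by reading the alerts and the tracking bug rather than through an API.

```python
def match_uplift_alerts(beta_alerts, tracked_regressions):
    """Split uplift alerts into those with a known bug and those without.

    beta_alerts:         list of {"test": ..., "platform": ...}
    tracked_regressions: list of {"test": ..., "platform": ..., "bug": ...}
                         collected from the release's tracking bug.
    """
    # Index the already-filed bugs by (test, platform).
    known = {}
    for entry in tracked_regressions:
        key = (entry["test"], entry["platform"])
        known.setdefault(key, []).append(entry["bug"])

    matched, unmatched = [], []
    for alert in beta_alerts:
        key = (alert["test"], alert["platform"])
        if key in known:
            matched.append((alert, known[key]))
        else:
            unmatched.append(alert)
    return matched, unmatched
```

Alerts in the unmatched list are the ones that still need investigation, and possibly a new bug linked to the tracking bug.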
= Additional Resources =
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Alert_FAQ Alert FAQ]
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Noise_FAQ Noise FAQ]
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Perfherder_FAQ Perfherder FAQ]
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing/Tree_FAQ Tree FAQ]
* [https://wiki.mozilla.org/BuildbotPerformance_sheriffing/Talos/Sheriffing duplicated & updated from old page]