- 1 Context
- 2 Flowchart
- 3 Filtering and reading the alerts
- 4 How to investigate an alert
- 5 Handling alerts
- 5.1 Thresholds
- 5.2 Regressions vs improvements
- 5.3 Identifying the culprit
- 5.4 Handling Regressions
- 5.5 Handling improvements
- 6 Updating alerts’ status
- 7 Follow-up on regressions
- 8 Special Cases
- 9 Resources
In addition to the standard by-the-book definition of the role, performance sheriffs are responsible for checking for performance regressions on a daily basis. This is done by reviewing alerts from our performance tests on Perfherder. Whenever one or more tests exceed the threshold set for their framework, one or more alerts are generated. The goal of the sheriff is to identify the commit or revision responsible for the regression and to file a bug with a :needinfo flag for the author(s) of that commit or revision.
app.diagrams.net link: https://drive.google.com/file/d/1Hpg9AjKTA2jx413Ly4imJWwM_gQHtLU3/view
Filtering and reading the alerts
The first thing to do after opening the Perfherder alerts page is to make sure the filter is set to show the alerts you need to sheriff. New alerts can be found under untriaged.
The Hide downstream/reassigned/invalid button reduces visual clutter when you don’t want to see the downstream/reassigned/invalid alerts. The My alerts button shows only the alerts that are assigned to you.
The alerts are grouped by Summaries*. The alerts in a summary:
- may run on different platforms (e.g. Windows, Ubuntu, Android, etc.)
- can share the same suite (e.g. tp6m)
- share the same framework (e.g. raptor, talos): if a particular commit triggers alerts from multiple frameworks, there will be a different summary for every framework.
- measure various metrics (e.g. FCP, loadtime), but not all of the metrics trigger alerts
*By the book, an alert is one item of the summary, but depending on the context we can also refer to a summary as an alert.
You can still see references to those names in the summary items, like below.
Ideally, every patch landed in the Mozilla repositories is intended to cause improvements, but in practice that is not always what happens. An alert summary can contain improvements, regressions or both.
- the improvements are marked with green
- and the regressions are marked with red.
How to investigate an alert
As golden rules, when you are not sure about the culprit in the graph you're investigating:
- zoom in until you can easily distinguish the datapoints
- retrigger more if the graph is too unstable for the data you have
- when the graph is too noisy and zooming in only makes it harder: zoom out, choose a larger timeframe (30 days, 60 days, etc.), and then zoom back in slowly, following the limit line of the values in the "before" half of the graph until you find the datapoint that changes the trend.
Reading the graph
To read the graph of a certain alert, hover the mouse over it and click on the graph link that appears:
Starring the alert helps you remember which alert you were reading when you come back to the summary.
The graph will show with a thin vertical line all the alerts associated with the test, so you need to make sure you’re looking at the right one by hovering or clicking on the datapoint. If the datapoint of the improvement/regression is not clear you might want to:
- zoom by drawing a rectangle over the desired area
- zoom out by clicking on the top graph
- extend the timeframe of the graph using the dropdown at the top of the page.
If the commit of the improvement/regression is not clear, take the desired action (usually Retrigger/Backfill) and make sure you write down your name and what you did in the notes of the alert (Add/Edit notes), so you or another colleague know what’s happening the next time the alert is sheriffed. The pattern is: [yourname] comments. We usually put the most recent comments first so we can easily read them when we come back.
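The note convention above can be sketched as a small helper. This is only an illustration of the "[yourname] comments, newest first" pattern; the function name and the sample notes are hypothetical, and it does not talk to Perfherder in any way.

```python
def prepend_note(existing_notes: str, name: str, comment: str) -> str:
    """Return the notes text with a new '[name] comment' entry on top.

    Hypothetical helper: it only formats text the way the sheriffs'
    convention describes, newest entry first.
    """
    entry = f"[{name}] {comment.strip()}"
    existing = existing_notes.strip()
    return entry if not existing else f"{entry}\n{existing}"


# Made-up example: two sheriffs leaving notes on the same alert.
notes = prepend_note("", "alice", "backfilled 5 revisions")
notes = prepend_note(notes, "bob", "retriggered x3, still noisy")
print(notes)
```

The text produced this way can then be pasted into the Add/Edit notes dialog by hand.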
TODO: list categories of graph - stable, noisy (noisy with bigger or smaller trend change), modal, invalid.
Depending on the test, the jobs are run either once every several revisions or on every revision. The vast majority run once every several revisions, so you almost always need to do a backfill between the last good and the first bad job in order to determine the culprit with certainty. Start by clicking on the job link from the tooltip of the regression datapoint.
A new page with the list of jobs corresponding to the datapoint will appear. Click on the link next to Job in the lower left section of the page to narrow down the jobs you see to just the one that caused the regression.
You're now seeing just the job from the revision that caused the regression. Now you need to see the previous jobs in order to identify how many revisions you need to backfill, by clicking on the number in the image below; usually the lowest is chosen.
Having the top job selected, you now need to trigger the backfill action. Usually, the sheriffs choose the quick one (Backfill button), but with Custom action... you can choose how many revisions deep you want to go or how many retriggers per revision you want. Other types of actions are also available, but the sheriffs team doesn't use them.
After triggering the action you should see the confirmation message at the top left of the screen, which shouldn't take longer than 5-7 seconds to appear. If you don't see it, you risk coming back and finding that the desired jobs were not triggered, meaning you lose investigation time.
After the jobs are finished running, you have jobs triggered on successive revisions and you can continue looking for the culprit.
Finding the culprit
A clear improvement/regression usually shows up as an easily noticeable difference between two adjacent datapoints:
There are cases when the difference is much less noticeable and the data of the test is more unstable, and some retriggers are necessary in order to determine the interval for the test data and compare it between several adjacent tests:
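One way to picture "determine the interval and compare it" is the sketch below. It is a simplification under stated assumptions: each revision is reduced to the min-max range of its retriggered values, and the first revision whose range no longer overlaps the previous one is flagged. The revision ids and values are made up, and Perfherder's own alert detection works differently.

```python
def value_range(values):
    """Return the (min, max) interval covered by a revision's results."""
    return (min(values), max(values))


def ranges_overlap(a, b):
    """True if the two (min, max) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def first_shift(revisions):
    """Return the id of the first revision whose interval does not overlap
    the interval of the revision before it, or None if there is no such one.

    `revisions` is an ordered list of (revision_id, [values]) pairs.
    """
    for (_, prev_vals), (rev, vals) in zip(revisions, revisions[1:]):
        if not ranges_overlap(value_range(prev_vals), value_range(vals)):
            return rev
    return None


# Made-up data: three retriggers per revision.
history = [
    ("rev_a", [100, 104, 98]),
    ("rev_b", [101, 97, 103]),
    ("rev_c", [120, 118, 123]),
    ("rev_d", [119, 122, 121]),
]
suspect = first_shift(history)
print(suspect)
```

When the intervals of adjacent revisions still overlap, more retriggers are usually the next step rather than a conclusion.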
A less fortunate situation is when the test is unstable, there are gaps in the graph where the tests didn’t run for various reasons, and the regression/improvement is almost impossible to determine. If the investigation takes more than 5 business days, it’s recommended to ask for help, if you haven’t already:
- from the other sheriffs in your team
- see if there were situations like this in the past and find out how they were treated (there’s a high chance there were)
- search for the test framework in the individual module ownership page or just search for what you want in the Mozilla wiki
- try to reach the framework owner you find in the individual module ownership page via Mozilla's chat app, email or whatever other method is handy for you
- Mozilla dashboard might also be helpful
- If you still don’t figure it out, ask your team lead
The investigation might end with opening a bug without knowing the specific commit that caused the regression and asking for help from the most relevant people you found.
A less common case of regression/improvement is when the graph is pretty clear about the culprit but the patch contains changes unrelated to the platform(s) targeted by the alert. For example, if the patch only modifies configuration for mobile platforms and the alert targets only desktop platforms, there might be an error somewhere. The recommendation is to ask in the culprit bug about what might be missing before opening the regression/linking the improvement to it.
When an alert is created, it is not necessarily caused by a bug. It can be also caused by the instability/noise of the test or by other causes that are unrelated to the repo code, like CI setup.
There are different thresholds above which the alerts are considered regressions and they vary depending on the framework:
- AWSY >= 0.25%
- Build metrics installer size >= 100kb
- talos, raptor, Build metrics >= 2%
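The thresholds listed above can be expressed as a small lookup. This is a minimal sketch, not how Perfherder implements alerting: the function names are hypothetical, "100kb" is read here as 100 * 1024 bytes (an assumption), and the direction of the change (regression vs improvement) is deliberately not modelled.

```python
# Percentage thresholds per framework, taken from the list above.
THRESHOLDS_PCT = {
    "awsy": 0.25,
    "talos": 2.0,
    "raptor": 2.0,
    "build_metrics": 2.0,
}

# Assumption: "100kb" is interpreted as 100 KiB.
INSTALLER_SIZE_THRESHOLD_BYTES = 100 * 1024


def percent_change(before, after):
    """Magnitude of the change relative to the previous value, in percent."""
    return abs(after - before) / before * 100.0


def exceeds_threshold(framework, before, after, installer_size=False):
    """True if the change is large enough to be considered for an alert."""
    if framework == "build_metrics" and installer_size:
        # Installer size uses an absolute threshold, not a percentage.
        return abs(after - before) >= INSTALLER_SIZE_THRESHOLD_BYTES
    return percent_change(before, after) >= THRESHOLDS_PCT[framework]


# Made-up values.
print(exceeds_threshold("awsy", 1000, 1003))
print(exceeds_threshold("talos", 1000, 1010))
```

Keeping the thresholds in one table makes it easy to see that AWSY is far more sensitive than the other frameworks.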
Regressions vs improvements
Whilst the difference between regressions and improvements is self-explanatory, after acknowledging the alert:
- the regressions go through multiple statuses until the final resolution
- the improvements have the only status of "improvement"
Identifying the culprit
If the revision that caused the regression is clear, then what is left to do is identify the culprit bug. There are several situations here:
- The revision contains changes from only one bug, and you need to open a regression bug for it
- The revision contains changes from several bugs, but you are familiar with the test and know which of the bugs caused the regression, so you open a bug for that one
- The revision contains changes from several bugs (usually a merge from one of the other repos), and you need to do a bisection in order to identify the causing bug.
Harness alerts are usually caused by re-recordings or by changes to code in the testing/raptor component that change the baseline. If the regression is assumed, then link the culprit directly to the alert, close as wontfix and add the "harness" tag. Otherwise, open a regression bug.
These alerts are caused by backouts or fixes of regressions, and the associated tags are regression-backedout and regression-fix, respectively. If they are regressions, they get the WONTFIX status.
Infra regressions, which are caused by infra changes, are probably the most difficult to identify. Except when the infra change is announced and known, an infra regression is usually detected by the sheriff only after all the suspect commits/bugs have been ruled out.
Anything that doesn’t depend on the repo code is considered to be part of the CI infrastructure, so it’s not dependent on the state of the code at a certain point in the history/graph. For example, if the farm devices were updated with changed OS images, no matter which datapoint from history is (re)triggered, it will run on the current image. So, if the changes of the OS image don’t have the desired effect (improvement - this is always the intent), the retriggers will reveal a regression between the old datapoint(s) and the new ones of the same commit.
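The reasoning above can be sketched as a comparison between the original results of a revision and fresh retriggers of that same revision. This is a rough illustration only: the function name, the use of the median, and the 2% default are assumptions, and the numbers are made up.

```python
from statistics import median


def looks_like_infra_change(original_values, retrigger_values, threshold_pct=2.0):
    """Compare results of the SAME revision before and after retriggering.

    The code under test is identical in both sets, so a shift between them
    points at something outside the repository (for example a new OS image).
    The median and the threshold are illustrative choices, not a rule.
    """
    old = median(original_values)
    new = median(retrigger_values)
    return abs(new - old) / old * 100.0 >= threshold_pct


# Made-up values for one old revision, then fresh retriggers of it.
original = [1850, 1900, 1950]
retriggered = [1250, 1300, 1350]
print(looks_like_infra_change(original, retriggered))
```

A positive result is a hint to go and look for a matching infra change, not a confirmation on its own.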
Looking at the graph below, it is obvious that until Apr 17 the graph is constantly around 1800-2000 and after that it dropped to 1200-1400. The backfill didn't reveal the culprit so we retriggered a while back. The retriggers were in the interval of the improvement and we can also see that there are many yellow vertical lines that mark the infra changes.
For an easy follow-up, there’s a changelog containing the changes related to the infrastructure, which is very useful when the investigation is leading to this kind of regression: https://changelog.dev.mozaws.net/
Invalid regressions usually (but not only) happen when the test results are very unstable. A useful tip for finding invalid regressions is to look at the graph’s history for a pattern in the evolution of the datapoints.
In the graph below, the regression appeared around Dec 9 and, as you can see, there is a pattern of values varying predominantly between 0.7 and 1. If you click on the first highlighted datapoint (around Dec 3) you’ll see that its alert is marked as invalid.
Note, though, that although this graph has a wide varying interval, most of the datapoints are concentrated around the value 1. The alert around Dec 9 is different: after it, the graph stabilized around the regression’s value (0.75 - 0.8). This is a real regression!
A particular case is the sccache hit rate tests. Most often, those alerts are invalid, but if the hit rate drops and stays low for at least 12-24h, then a regression bug should be opened.
There are two different approaches to handling regressions:
- filing a regression bug for actual regressions
- letting the author of the culprit know that their patch(es) caused a regression when we know that it will be accepted (backout, regression-fix, harness)
Filing a regression bug
Note: To file a bug from Perfherder you are required to be authenticated.
Steps to file regression bug:
1. Click on the Untriaged status in the upper right side of the alert summary.
2. Next select the File bug option and enter the bug number from the bug that caused the regression. You will be redirected to bugzilla.mozilla.org.
3. In Bugzilla, scroll down, click on Set bug flags, and set the last status-firefox version that appears in the list to affected.
4. Click on the Submit Bug button from the bottom of the page.
Manually filing a regression bug
Now that you know the causing bug, you need to make sure that there isn’t already a bug open for this. Search for regressions the same way as when following up on regressions, but clear the Search by People > Reporter field.
If there is no regression bug open, you need to open one:
The new page should look like this:
Most important fields when filing a regression bug are:
- Type: Defect
- Keywords: perf, perf-alert, regression - will be automatically filled
- Blocks: indicates the next Firefox release; it is a meta bug used to keep track of the regressions associated with a specific release
- Regressed by: the number of the bug that caused the regression (note: the bug number appears struck through if it is closed)
- Request information from: the assignee of the bug
- CC: usually at least the assignee, reporter and triage owner go here
- tracking-flag: appears at the bottom of the page after you click on Set bug flags; set the last status-firefox version that appears in the list to affected, in this case status-firefox80
- Product and Component: are automatically filled in the Enter bug page, so you need to save the bug with the details so far and click edit to modify them. They have to be the same as in the original bug
After you finished with the regression bug you need to link it to the summary and change the status of the alert to acknowledged.
Next you have to follow the comments of the bug so you make sure it’s closed, ideally before the next Firefox release.
Unlike for regressions, when you have identified an improvement there's no need to open a bug. You just need to notify the bug assignee via a comment and add the 'perf-alert' keyword, which indicates that there is a performance alert associated with this bug.
Attention! The summary could contain alerts reassigned from other alerts. You have to tick the box next to each untriaged alert and change its status to Acknowledge.
Ticking the box next to the alert summary and resetting it will UNLINK the reassigned alerts and you don’t want to do that!
Improvements treated as regression
Depending on the test, a high magnitude improvement (> 80%) should be treated more carefully. While a 100% improvement for a pageload test is impossible (the site never loads in an instant), over 80% is very rare and might imply that the test isn't loading what it should (an error page which is likely to contain much less code than the actual website).
Here you can apply the same logic as for invalid regressions, the difference is that the unstable graph evolution triggered an alert while the value changed in the sense of an improvement.
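The 80% rule of thumb above can be sketched as a quick check. This is an illustrative sketch for lower-is-better metrics such as load time; the function names are hypothetical and the values are made up.

```python
SUSPICIOUS_IMPROVEMENT_PCT = 80.0  # rule of thumb from the text above


def improvement_pct(before, after):
    """Improvement in percent for a lower-is-better metric (e.g. load time)."""
    return (before - after) / before * 100.0


def needs_manual_check(before, after):
    """True if the improvement is large enough to suspect a broken test,
    for example an error page being loaded instead of the real site."""
    return improvement_pct(before, after) > SUSPICIOUS_IMPROVEMENT_PCT


# Made-up load times in milliseconds.
print(needs_manual_check(2000, 300))
print(needs_manual_check(2000, 1500))
```

A flagged improvement still needs a human to check what the test actually loaded before it is treated as a regression.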
Updating alerts’ status
After finding the culprit (and taking the necessary actions for the improvement/regression), you have to:
Add notes tag
- harness - patches that updated the harness and caused improvements
- regression-backedout - patches backed out due to causing regressions
- infra - improvements caused by infra changes (changes not related to repository code)
- regression-fix - patches fixing a reported regression bug
- improvements - patches causing improvements
- harness - patches that updated the harness and caused regressions
- infra - regressions caused by infra changes (changes not related to repository code)
Move alerts from untriaged queue
After the revision that caused the alert has been identified, move the alerts from the untriaged queue as follows:
- if the alert is a valid improvement/regression and you linked a bug for it, change to
- if the alert is an invalid improvement/regression, change to
- if the alert is a downstream of an improvement/regression, change to
- if the improvement/regression happened earlier/later on the same repo, change to
Follow-up on regressions
TODO: suggested email filter, followup on comments and needinfos, update alerts status when the regression bug closes.
The regression identified from the graph can be inaccurate for several reasons. When the author of the culprit patch/bug doesn't agree that their code caused the regression, we usually take another look at the alert. What the sheriff can do includes (but is not necessarily limited to) the following:
- retrigger/backfill the jobs around the regression, especially when the graph is noisy and the regression is not very clear. The sheriff is used to reading graphs and may see something clearly that the author of the patch, lacking that practice, does not.
- it is possible that the alert contains 2 very close or neighboring regressions. If the alert contains tests with different names, it is possible that they are caused by different revisions, which needs to be confirmed or ruled out with some retriggers/backfills
- sometimes, although the graph is clear to the sheriff, the patch contains only "static" changes (comments, documentation updates). In this case, the cause might be an infra change. But be careful, the developer doesn't have to know what the patch does, so sometimes those are caught only if the developer questions the regression
Regressions with no activity
Regression bugs with no activity for three days should be handled by:
- responding to open questions addressed to sheriffs, or
- adding the [qf] whiteboard entry
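The three-day rule above can be sketched as a filter over the bugs you are following. This is a minimal sketch under stated assumptions: the bug ids and timestamps are made up, "three days" is treated as calendar days, and the last-activity times are assumed to have been collected by hand or by some other means; nothing here queries Bugzilla.

```python
from datetime import datetime, timedelta, timezone


def stale_bugs(last_activity, now, max_idle=timedelta(days=3)):
    """Return the ids of bugs whose last activity is at least `max_idle` old.

    `last_activity` maps a bug id to the datetime of its latest comment or
    change. Ids are returned sorted so the output is stable.
    """
    return sorted(
        bug_id for bug_id, last in last_activity.items() if now - last >= max_idle
    )


# Made-up bug ids and timestamps.
now = datetime(2020, 7, 10, 12, 0, tzinfo=timezone.utc)
bugs = {
    1000001: datetime(2020, 7, 9, 9, 0, tzinfo=timezone.utc),   # active yesterday
    1000002: datetime(2020, 7, 6, 8, 0, tzinfo=timezone.utc),   # idle for 4+ days
    1000003: datetime(2020, 7, 7, 12, 0, tzinfo=timezone.utc),  # idle for exactly 3 days
}
print(stale_bugs(bugs, now))
```

Each id returned is a candidate for one of the two actions listed above.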
You can follow up on all the open regression bugs created by you.
Multiple Bug IDs on the same push
If the push that has been identified as the culprit has multiple revisions and bug ids linked to it, the sheriff can fill in the "Regressed By" field for each of the bugs with the value of the filed bug for the associated alert.
TODO: add screenshot for example alert