TestEngineering/Performance/Sheriffing/Workflow


Context

Performance sheriffs, beyond the standard by-the-book definition, make sure that no performance regression goes unnoticed by watching, on a daily basis, the alerts triggered by oscillations in the tests’ metrics. Every time an oscillation gets past the set threshold, an alert is created and root-cause-analyzed by the sheriff, whose ultimate purpose is to identify the commit/revision responsible for it.
If you have been needinfo?ed in a regression bug, it means your commit caused a regression alert and the issue needs to be addressed.
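As a toy illustration of that mechanism (a sketch only; Perfherder’s real detection code is more sophisticated), flagging a change that gets past a threshold could look like this:

  # Toy sketch: compare the averages of the datapoints before and after a
  # push and flag the change if it gets past a threshold. All values are
  # hypothetical.
  def percent_change(before, after):
      old = sum(before) / len(before)
      new = sum(after) / len(after)
      return (new - old) / old * 100

  before = [510, 505, 512, 508]  # e.g. loadtime in ms before the push
  after = [542, 538, 545, 540]   # and after it

  THRESHOLD = 2.0  # percent; see the Thresholds section below
  change = percent_change(before, after)
  if abs(change) >= THRESHOLD:
      print(f"alert: {change:+.1f}% change past the {THRESHOLD}% threshold")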

Filtering and reading the alerts

The first thing to do after accessing the Perfherder alerts page is to make sure the filter is set to show the alerts you need to sheriff. New alerts can be found under untriaged.
Alerts toolbar
The Hide… buttons are meant to reduce visual clutter in case you don’t want to see the improvements or the downstream/reassigned/invalid alerts.

Regression summary
The alerts are grouped into summaries*. The tests:

  • may run on different platforms (e.g. Windows, Ubuntu, Android, etc.)
  • can share the same suite (e.g. tp6m)
  • share the same framework (e.g. raptor, talos): if a particular commit triggers alerts from multiple frameworks, a separate summary is created for each framework
  • measure various metrics (e.g. FCP, loadtime), but not all of the metrics trigger alerts

*By the book, an alert is one item of the summary, but depending on the context we may also refer to a summary as an alert.
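
As a rough sketch of this grouping logic (the data layout and field names below are made up for illustration, not Perfherder’s actual model):

  from collections import defaultdict

  # Hypothetical alerts triggered by the same push: one summary is created
  # per (push, framework) pair.
  alerts = [
      {"push": "abc123", "framework": "raptor", "test": "tp6m loadtime"},
      {"push": "abc123", "framework": "raptor", "test": "tp6m fcp"},
      {"push": "abc123", "framework": "talos", "test": "tsvgx"},
  ]

  summaries = defaultdict(list)
  for alert in alerts:
      summaries[(alert["push"], alert["framework"])].append(alert)

  for (push, framework), items in summaries.items():
      print(f"{framework} summary for {push}: {len(items)} alert(s)")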

You can see references to these names in the summary items, like below.
Summary item
Ideally, every patch landed in the Mozilla repositories would cause an improvement, but in the real world it doesn’t happen like that. An alert summary can contain improvements, regressions or both.

  • the improvements are marked with green

Summary item - improvement

  • and the regressions are marked with red.

Summary item - regression
In terms of organizing the workflow, we should always prioritise regressions over improvements.

How to read the graph

Before doing any investigation, you should assign the alert to yourself by clicking the Take button and pressing Enter. You should see your username appear in place.

To read the graph of a certain alert, you just need to put the mouse over it and click on the graph link that appears:
Item with graph link
Starring it (the Alert star) ensures you know which alert you were reading when you come back to the summary.
Graph view alert tooltip
The graph marks all the alerts associated with the test with thin vertical lines, so you need to make sure you’re looking at the right one by hovering over or clicking on the datapoint. If the datapoint of the improvement/regression is not clear you might want to:

  • zoom by drawing a rectangle over the desired area
  • zoom out by clicking on the top graph
  • extend the timeframe of the graph using the dropdown at the top of the page.


If the commit of the improvement/regression is not clear, take the desired action (usually Retrigger/Backfill) and make sure you write down in the notes of the alert (Add/Edit notes) your name and what you did, so you or another colleague know what’s happening next time the alert is sheriffed. The pattern is: [yourname] comments, e.g. [jdoe] retriggered the suspect datapoints x5, waiting for results. We usually leave the most recent comments first so we can easily read them when we come back.
Alert summary add notes
A clear improvement/regression usually appears when there is an easily noticeable difference between two adjacent data points:
Clear improvement graph
There are cases when the difference is much less noticeable and the test data is noisier; some retriggers are then necessary in order to determine the interval the test data varies in and compare it across several adjacent data points:
Unstable improvement graph
A less fortunate situation is when the test is unstable, there are gaps in the graph where the tests didn’t run for various reasons, and the regression/improvement is almost impossible to determine. If the investigation takes more than 5 business days it’s recommended to ask for help, if you haven’t already:

The investigation might end up with opening a bug without knowing the specific commit that caused the regression and asking for help from the most relevant people you have found.
Graph uncertain culprit
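
When the data is this noisy, the judgement above boils down to comparing the intervals the data varies in before and after the suspect push. A minimal sketch of that reasoning, assuming hypothetical retriggered values:

  from statistics import mean, stdev

  def interval(samples, k=2.0):
      # Approximate the variation interval as mean ± k standard deviations.
      m, s = mean(samples), stdev(samples)
      return (m - k * s, m + k * s)

  before = [0.98, 1.02, 0.97, 1.01, 0.99, 1.03]  # hypothetical values
  after = [1.10, 1.14, 1.09, 1.13, 1.12, 1.11]

  lo_b, hi_b = interval(before)
  lo_a, hi_a = interval(after)
  if lo_a <= hi_b and lo_b <= hi_a:
      print("intervals overlap -> likely noise, retrigger more")
  else:
      print("intervals are disjoint -> likely a real change")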

Handling regressions

When a regression happens, it is not necessarily caused by a bug. It can also be caused by the instability/noise of the test or by other causes unrelated to the repo code, such as the CI setup.

Performance regressions

If the commit that caused the regression is clear, then what is left to do is get into the commit’s content. There are several situations here:

  • The commit contains changes from only one bug, so you open a regression bug for it
  • The commit contains changes from several bugs, but you are familiar with the test and know which of the bugs caused the regression, so you open a bug for that one
  • The commit contains changes from several bugs (usually a merge from one of the other repos), in which case you need to do a bisection in order to identify the causing bug.

Thresholds

There are different thresholds above which the alerts are considered regressions and they vary depending on the framework:

  • AWSY >= 0.25%
  • Build metrics installer size >= 100 KB
  • talos, raptor, Build metrics >= 2%
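
Expressed as code (values taken from the list above; the function and dictionary names are ours, not a Perfherder API):

  # Per-framework thresholds from the list above. Note that the installer
  # size threshold is absolute (KB), while the others are percentages.
  THRESHOLDS = {
      "awsy": 0.25,                         # percent
      "build_metrics_installer_size": 100,  # KB
      "talos": 2.0,                         # percent
      "raptor": 2.0,                        # percent
      "build_metrics": 2.0,                 # percent
  }

  def is_past_threshold(framework, magnitude):
      return magnitude >= THRESHOLDS[framework]

  print(is_past_threshold("raptor", 3.1))  # True: 3.1% >= 2%
  print(is_past_threshold("awsy", 0.1))    # False: below 0.25%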

Now that you know the causing bug, you need to make sure that there isn’t already a bug open for this regression, by searching the same way as when following up on regressions but clearing the Search by People > Reporter field.
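This check can also be scripted against the Bugzilla REST API; the sketch below uses the standard BMO endpoint, but the exact searchable query parameters are an assumption on our part, and the bug number is made up:

  import requests

  CAUSING_BUG = 1234567  # hypothetical number of the bug that caused it

  resp = requests.get(
      "https://bugzilla.mozilla.org/rest/bug",
      params={
          "regressed_by": CAUSING_BUG,  # bugs it regressed
          "keywords": "perf-alert",
          "include_fields": "id,summary,status",
      },
      timeout=30,
  )
  resp.raise_for_status()
  for bug in resp.json().get("bugs", []):
      print(bug["id"], bug["status"], bug["summary"])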
If there is no regression bug open, you need to open one:
Open regression bug
The new page should look like this:
File regression page
Most important fields when filing a regression bug are:

  • Type: Defect

New bug type - defect

  • Keywords: perf, perf-alert, regression - will be automatically filled


New bug keywords

  • Target: the very next release of Firefox

New bug target

  • Blocks: indicates the next release of Firefox; this is a meta bug used to keep track of the regressions associated with a specific release

New bug blocks

  • Regressed by: the number of the bug that caused the regression (note: the bug number appears struck through if that bug is closed)

New bug regressed by

  • Request information from: the assignee of the bug

New bug assignee New bug needinfo

  • CC: usually at least the assignee, the reporter and the triage owner

New bug people-right >>> New bug cc

  • Product and Component: are automatically filled in on the Enter bug page, so you need to save the bug with the details so far and click Edit to modify them. They have to be the same as in the original bug

Bug status >>> Bug edit status
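
For reference, the same fields map onto the Bugzilla REST bug-creation endpoint (POST /rest/bug) roughly as below; all values are hypothetical placeholders, and sheriffs normally file the bug through the UI as described above:

  import requests

  payload = {
      "summary": "2.5% tp6m loadtime regression on ...",  # placeholder
      "type": "defect",
      "keywords": ["perf", "perf-alert", "regression"],
      "target_milestone": "Firefox 110",  # the very next release (example)
      "blocks": [1111111],                # release meta bug (placeholder)
      "regressed_by": [1234567],          # bug that caused the regression
      "cc": ["assignee@example.com", "triage.owner@example.com"],
      "product": "Core",        # must match the original bug
      "component": "DOM",       # must match the original bug
      "version": "unspecified",
  }
  resp = requests.post(
      "https://bugzilla.mozilla.org/rest/bug",
      json=payload,
      headers={"X-BUGZILLA-API-KEY": "..."},  # an API key is required
      timeout=30,
  )
  print(resp.json())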
After creating the bug, you should see something like this in the Firefox release meta bug [meta] Firefox <release number> - Perfherder Regression Tracking Bug:
Bug depends on comment
After you have finished with the regression bug, you need to link it to the summary and change the status of the alert to acknowledged.
Summary menu link to bug Bug - link to bug
Summary bottom menu acknowledge
Next, you have to follow the comments of the bug to make sure it gets closed, ideally before the next Firefox release.

Infra regressions

Infra regressions, i.e. those caused by infrastructure changes, are probably the most difficult to identify. Except for the case when the infra change is announced and known about, an infra regression is usually detected by the sheriff only after all the suspect commits/bugs have been ruled out.

Anything that doesn’t depend on the repo code is considered part of the CI infrastructure, so it doesn’t depend on the state of the code at a certain point in history/graph. For example, if the farm devices were updated with changed OS images, no matter which datapoint from history is (re)triggered, it will run on the current image. So, if the changes of the OS image don’t have the desired effect (an improvement, which is always the intent), the retriggers will reveal a regression between the old datapoint(s) and the new ones of the same commit.
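
A sketch of that signal in code, with hypothetical numbers: retriggers of an old datapoint run on the current infrastructure, so if they track the new baseline instead of the original value, the code is innocent and the infrastructure is the suspect:

  original_value = 0.80            # recorded before the infra change
  retriggers = [1.01, 0.99, 1.02]  # same commit, re-run after the change
  new_baseline = 1.00              # level of the datapoints after the change

  retrigger_mean = sum(retriggers) / len(retriggers)
  if abs(retrigger_mean - new_baseline) < abs(retrigger_mean - original_value):
      print("retriggers track the new baseline -> suspect an infra change")
  else:
      print("retriggers reproduce the old value -> suspect the code")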

Looking at the graph below, it is obvious that the datapoints highlighted around Sep 10 vary in the same interval as the datapoints after Sep 13: they were retriggered after that date, when the infra change happened. The datapoints around Sep 10 that are not highlighted were triggered before the infra change, and you can see that the interval they vary in is consistently lower.
Graph infra regression

For an easy follow-up, there’s a changelog containing the changes related to the infrastructure, which is very useful when the investigation is leading to this kind of regression: https://changelog.dev.mozaws.net/

Invalid regressions

Invalid regressions usually (but not only) happen when the test results are very unstable. A useful tip for finding invalid regressions is to look at the graph’s history for a pattern in the evolution of the datapoints.

In the graph below, the regression appeared around Dec 9 and, as you can see, there is a pattern of variation predominantly between 0.7 and 1. If you click on the first highlighted datapoint (around Dec 3) you’ll see that its alert is marked as invalid.

Graph invalid regression

Be careful, though: although this graph has a wide variation interval, most of the datapoints are concentrated around the value 1. This is the case for the alert around Dec 9, after which the data stabilized around the regression’s value (0.75 - 0.8). This is a real regression!
Graph sccache real regression

A particular case are the sccache hit rate tests. Most often, those alerts are invalid, but if the hit rate drops and stays low for at least 12-24h, then a regression bug should be opened.
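
A minimal sketch of that rule of thumb; the hit-rate values, the "low" cutoff and the timestamps below are all illustrative:

  from datetime import datetime, timedelta

  LOW = 0.50                          # hypothetical "low" hit-rate cutoff
  MIN_DURATION = timedelta(hours=12)

  # Hypothetical (timestamp, hit_rate) datapoints, oldest first.
  points = [
      (datetime(2023, 12, 9, 0, 0), 0.85),
      (datetime(2023, 12, 9, 6, 0), 0.45),
      (datetime(2023, 12, 9, 12, 0), 0.42),
      (datetime(2023, 12, 9, 20, 0), 0.44),
  ]

  low_points = [(t, v) for t, v in points if v < LOW]
  sustained = low_points and low_points[-1][0] - low_points[0][0] >= MIN_DURATION
  print("file a regression bug" if sustained else "likely invalid; keep watching")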

Handling improvements

Unlike for regressions, when you identify an improvement there’s no need to open a bug; you just need to notify the bug’s assignee via a comment.

Valid improvements

This time you just need to copy the summary, paste it as a “Congrats” comment on the bug causing it and update the status of the summary:
Improvement menu copy summary
Add improvement comment
In the notes section of the alert summary, add the most appropriate of the tags below:

  • #improvement - patches that caused improvements
  • #harness-update - patches that updated the harness and caused improvements
  • #regression-backedout - patches backed out due to causing regressions
  • #infra - improvements caused by infra changes (changes not related to repository code)
  • #regression-fix - patches fixing a reported regression bug

Add improvement note tag
Add improvement note #tag
Acknowledge improvement summary

Attention! The summary could contain alerts reassigned from other alerts. You have to tick the box next to each untriaged alert and change its status to Acknowledge.
Ticking the box next to the alert summary and resetting it will UNLINK the reassigned alerts and you don’t want to do that!

Invalid improvements

Here you can apply the same logic as for invalid regressions, the difference being that the unstable graph triggered an alert whose value change looks like an improvement.

Updating alerts’ status

After finding the culprit (and taking the necessary action for the improvement/regression), you have to change the status of the alert to one of the following:

  • if the alert is a valid improvement/regression and you linked a bug for it, change to Alert bottom toolbar acknowledge
  • if the alert is an invalid improvement/regression, change to Alert bottom toolbar invalid
  • if the alert is a downstream of an improvement/regression, change to Alert bottom toolbar downstream
  • if the improvement/regression happened earlier/later on the same repo, change to Alert bottom toolbar reassign

Follow-up on regressions

For regression bugs with no activity for three days, sheriffs should either:

  • respond to the open questions addressed to them, or
  • add the [qf] whiteboard entry

Edit bug - defails section

You can follow up on all the open regression bugs created by you.

Resources