Performance sheriffing/Talos/RegressionBugsHandling

From MozillaWiki
Jump to: navigation, search

Handling of Performance Regression Bugs

We use different policies for handling large and small regressions.

Policy #1: Backouts within 48-hours for large regressions

The perf team and the A-Team are testing out a new policy: backing out patches that cause significant Talos regressions on Windows builds within 48 hours.

Only regressions of 10% or more, on reliable Talos tests, on Windows, will face automatic backouts.

List of reliable Talos tests:

  • Startup tests: ts_paint, sessionstore
  • Scrolling smoothness tests: tp5o_scroll, tscrollx
  • Other graphics performance tests: tresize, TART, tsvgx
  • Page load test: tp5o
Why

We want to test a 48-hour backout policy because we noticed that patch authors tend not to address Talos regression bugs for days. If a regression sits in the tree for days, it becomes difficult to back it out, and it becomes much more likely the regression will end up riding the trains to release by default.

This new policy is more aggressive. We think a patch that regresses performance significantly should be backed out quickly, and re-landed when its performance is acceptable.

How do the 48-hour backouts work?

The A-Team perf sheriffs will create a Talos regression bug as soon as a regression is confirmed using Talos re-triggers. The patch author and reviewer will be CC’ed, and if they don’t provide an explanation for why the regression is acceptable, the patch will be backed out. The goal is to back out unjustified regressions within 48 hours of them landing. We’d like to give the patch author about 24 hours to reply after the regression bug is filed.

The A-Team has been working hard on improving the tools for understanding Talos regressions (e.g. Perfherder), and we think debugging a Talos regression is a much less painful process these days. For example, there is now a highly useable view to visualize the comparison between a proposed patch against a baseline revision at https://treeherder.mozilla.org/perf.html#/comparechooser.

Policy #2: Policy for smaller regressions

When a Talos performance regression is detected and the patch which caused it is identified
  • Whoever landed the patch is responsible to drive a decision on whether or not the regression is acceptable.
  • Such decision should preferably involve a module owner/peer of the regressed module.
  • The owner of the offending patch must respond at the bug within 3 business days, and a decision has to be made within two weeks, otherwise the offending patch may be backed out.
If the decision was to try and fix the regression, preferably a time frame and an owner would be provided as well.
It's OK to decide to accept a Talos regression for valid reasons, including
  • There is other higher priority work, the patch's functionality is absolutely required, and no one else is available to investigate in the foreseeable future.
  • The magnitude of the regression is too small to warrant the effort required for an investigation.
  • The performance improvements from the patch outweigh the regression on the regressing test
  • The functionality provided by the patch is absolutely required (e.g. a security patch) or otherwise out-weighs the regression.
  • etc

Obviously, if someone still wants to fix the regression or otherwise improve the performance even if a valid reason to not do so exists, that would be welcome.

Invalid reasons for closing a Talos regression bug
  • "I don't understand how my patch could regress this test"
    • If you think your patch is not to blame, we can re-trigger Talos test runs to confirm the before-and-after behavior. Talk to the person who filed the bug, they would be happy to help you figure out if the regression is real and if it is indeed your patch that is responsible.
    • It's up to the patch author to contact the right people (test authors, another team, etc) to figure out why the test regressed.
    • As a general rule, we're all responsible for the performance of the code we write, even if the regression is in an unrelated part of the codebase.
  • "I don't have time to deal with this right now"
    • This one can be valid (see "valid reasons" above). However, if the regression is quite large and you're pressed for time, consider backing the patch out and re-landing it when you have time to study its performance impact.
    • If the patch can't be backed out and you don't have time to work on it at the moment, make sure to file a follow-up bug to fix the regression, assign yourself and provide a timeline for the investigation & fixing. We should fix regressions before they reach Beta/Release users.

The bottom line is that we want a visible decision on each case where the cause for the regression is known.

See also