

Handling of Performance Regression Bugs

Below you'll find references on how to investigate a potential regression for each framework.

Raptor regressions

More information about Raptor can be found here.

A guide to handling Raptor regressions can be found here.

Talos regressions

For more information about Talos tests please visit the test documentation, and for more information about running Talos tests please go to Running Talos.

A guide to handling Talos regressions can be found here.

AWSY regressions

More information about AWSY results can be found here.

Build metrics regressions

More information about build metrics results can be found here.

Policies

Policy #1: Backouts within 48-hours for large regressions

The perf team and the A-Team are testing out a new policy: backing out, within 48 hours, patches that cause significant performance regressions.

Only regressions of 10% or more will face backout.
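
For a concrete sense of the arithmetic: the threshold is a relative change against the baseline measurement. Below is a minimal sketch in Python (the function names and the lower-is-better assumption are ours, for illustration only; the actual determination is made by the perf sheriffs via Perfherder):

  def regression_percent(baseline, new):
      # Percentage change relative to the baseline measurement.
      return (new - baseline) / baseline * 100.0

  def faces_backout(baseline, new, threshold=10.0):
      # Assumes a lower-is-better metric (e.g. page load time in ms);
      # for a higher-is-better metric the sign would be inverted.
      return regression_percent(baseline, new) >= threshold

  # Example: a page load going from 500 ms to 560 ms is a 12% regression,
  # which crosses the 10% backout threshold.
  print(faces_backout(500.0, 560.0))  # True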

Why

We want to test a 48-hour backout policy because we noticed that patch authors tend not to address performance regression bugs for days. If a regression sits in the tree for days, it becomes difficult to back it out, and it becomes much more likely the regression will end up riding the trains to release by default.

This new policy is more aggressive. We think a patch that regresses performance significantly should be backed out quickly, and re-landed when its performance is acceptable.

How do the 48-hour backouts work?

The A-Team perf sheriffs will create a performance regression bug as soon as a regression is confirmed via the sheriffing workflow. The patch author and reviewer will be CC’ed, and if they don’t provide an explanation for why the regression is acceptable, the patch will be backed out. The goal is to back out unjustified regressions within 48 hours of them landing. We’d like to give the patch author about 24 hours to reply after the regression bug is filed.

The A-Team has been working hard on improving the tools for understanding performance regressions (e.g. Perfherder), and we think debugging a performance regression is a much less painful process these days. For example, there is now a highly usable view to visualize the comparison between a proposed patch against a baseline revision at https://treeherder.mozilla.org/perf.html#/comparechooser.
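
If you want to link a specific before/after comparison (for example, from a regression bug comment), the comparechooser page will build the URL for you; the sketch below assembles one programmatically. Note that the query parameter names (originalProject, originalRevision, newProject, newRevision) are our assumption about the compare view's URL scheme and should be verified against a URL produced by comparechooser itself:

  from urllib.parse import urlencode

  def compare_url(base_rev, new_rev,
                  base_project="mozilla-central", new_project="try"):
      # Assembles a Perfherder compare-view link; the parameter names are
      # assumptions, so cross-check against a comparechooser-generated URL.
      params = urlencode({
          "originalProject": base_project,
          "originalRevision": base_rev,
          "newProject": new_project,
          "newRevision": new_rev,
      })
      return "https://treeherder.mozilla.org/perf.html#/compare?" + params

  # e.g. compare a try push against the mozilla-central revision it is based on
  print(compare_url("<base revision hash>", "<try revision hash>"))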

Policy #2: Policy for smaller regressions

When a performance regression is detected and the patch that caused it is identified:
  • Whoever landed the patch is responsible for driving a decision on whether or not the regression is acceptable.
  • Such a decision should preferably involve a module owner or peer of the regressed module.
  • The owner of the offending patch must respond in the bug within 3 business days, and a decision has to be made within two weeks; otherwise the offending patch may be backed out.
If the decision is to try to fix the regression, a time frame and an owner should preferably be provided as well.
It's OK to decide to accept a performance regression for valid reasons, including:
  • There is other higher-priority work, the patch's functionality is absolutely required, and no one else is available to investigate in the foreseeable future.
  • The magnitude of the regression is too small to warrant the effort required for an investigation.
  • The performance improvements from the patch outweigh the regression on the regressing test.
  • The functionality provided by the patch is absolutely required (e.g. a security patch) or otherwise outweighs the regression.
  • etc.

Obviously, if someone still wants to fix the regression or otherwise improve performance even when a valid reason not to do so exists, that would be welcome.

Invalid reasons for closing a performance regression bug
  • "I don't understand how my patch could regress this test"
    • If you think your patch is not to blame, we can re-trigger the jobs to confirm the before-and-after behavior. Talk to the person who filed the bug; they would be happy to help you figure out whether the regression is real and whether your patch is indeed responsible.
    • It's up to the patch author to contact the right people (test authors, another team, etc) to figure out why the test regressed.
    • As a general rule, we're all responsible for the performance of the code we write, even if the regression is in an unrelated part of the codebase.
  • "I don't have time to deal with this right now"
    • This one can be valid (see "valid reasons" above). However, if the regression is quite large and you're pressed for time, consider backing the patch out and re-landing it when you have time to study its performance impact.
    • If the patch can't be backed out and you don't have time to work on it at the moment, make sure to file a follow-up bug to fix the regression, assign it to yourself, and provide a timeline for investigating and fixing it. We should fix regressions before they reach Beta/Release users.

The bottom line is that we want a visible decision on each case where the cause for the regression is known.

Running performance tests

If, for some reason, the graph that generated the regression alert is not conclusive enough, there's a Guide to running performance tests that can help you do additional pushes to try in order to find out what's missing.
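
Once the extra try pushes finish, a quick sanity check is to compare the shift in the means against the noise in the baseline replicates. A minimal sketch using only the Python standard library (the test name and sample numbers are hypothetical, and it assumes a lower-is-better metric):

  import statistics

  def summarize(name, baseline, patched):
      base_mean = statistics.mean(baseline)
      new_mean = statistics.mean(patched)
      delta_pct = (new_mean - base_mean) / base_mean * 100.0
      # Compare the shift against the spread of the baseline samples:
      noise_pct = statistics.stdev(baseline) / base_mean * 100.0
      print(f"{name}: {base_mean:.1f} -> {new_mean:.1f} "
            f"({delta_pct:+.1f}% change, baseline noise ~{noise_pct:.1f}%)")

  # Hypothetical replicates from retriggered jobs (lower is better):
  summarize("tp5o", [812.0, 806.5, 819.2, 810.1],
            [871.3, 866.9, 880.0, 874.2])

A shift well outside the baseline noise, as in the hypothetical numbers above, is a good sign the regression is real rather than an artifact of a noisy graph.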
