TestEngineering/Performance/Sheriffing/Noise FAQ

From MozillaWiki
Jump to: navigation, search

What is Noise

Generally a test reports values that are in a range instead of a consistent value. The larger the range of 'normal' results, the more noise we have.

Some tests will post results in a small range, and when we get a data point significantly outside the range, it is easy to identify.

The problem is that many tests have a large range of expected results. It makes it hard to determine what a regression is when we might have a range += 4% from the median and we have a 3% regression. It is obvious in the graph over time, but hard to tell until you have many future data points.

Why can we not trust a single data point

This is a problem we have dealt with for years with no perfect answer. Some reasons we do know are:

  • the test is noisy due to timing, diskIO, etc.
  • the specific machine might have slight differences
  • sometimes we have longer waits starting the browser or a pageload hang for a couple extra seconds

The short answer is we don't know and have to work within the constraints we do know.

Why do we need 12 future data points

We are re-evaluating our assertions here, but the more data points we have, the more confidence we have in the analysis of the raw data to point out a specific change.

This causes problem when we land code on Mozilla-Beta and it takes 10 days to get 12 data points. We sometimes rerun tests and just retriggering a job will help provide more data points to help us generate an alert if needed.

Can't we do smarter analysis to reduce noise

Yes, we can. We have other projects and a masters thesis has been written on this subject. The reality is we will still need future data points to show a trend and depending on the source of data we will need to use different algorithms to analyze it.

Duplicate / new alerts

One problem with coalescing is that we sometimes generate an original alert on a range of changes, then when we fill in the data (backfilling/retriggering) we generate new alerts. This causes confusion while looking at the alerts.

Here are some scenarios which duplication will be seen:

  • backfilling data from coalescing, you will see a similar alert on the same branch/platform/test but a different revision
    • action: reassign the alerts to the original alert summary so all related alerts are in one place!
  • we merge changesets between branches
    • action: find the original alert summary on the upstream branch and mark the specific alert as downstream to that alert summary
  • pgo builds
    • action: reassign these to the non-pgo alert summary (if one exists), or downstream to the correct alert summary if this originally happened on another branch

In Alert Manager it is good to acknowledge the alert and use the reassign or downstream actions. This helps us keep track of alerts across branches whenever we need to investigate in the future.


On weekends (Saturday/Sunday) and many holidays, we find that the volume of pushes are much smaller. This results in much fewer tests to be run. For many tests, especially ones that are noisier than others, we find that the few data points we collect on a weekend are much less noisy (either falling to the top or bottom of the noise range).

Here is an example view of data that behaves differently on weekends:

Weekends example.png

This affects the Talos Sheriff because on Monday when our volume of pushes picks up, we get a larger range of values. Due to the way we calculate a regression, it means that we see a shift in our expected range on Monday. Usually these alerts are generated Monday evening/Tuesday morning. These are typically small regressions (<3%) and on the noisier tests.

Multi Modal

Many tests are bi-modal or multi-modal. This means that they have a consistent set of values, but 2 or 3 of them. Instead of having a bunch of scattered values between the low and high, you will have 2 values, the lower one and the higher one.

Here is an example of a graph that has two sets of values (with random ones scattered in between):

Modal example.png

This affects the alerts and results because sometimes we get a series of results that are less modal than the original- of course this generates an alert and a day later you will probably see that we are back to the original x-modal pattern as we see historically. Some of this is affected by the weekends.

Random Noise

Random noise happens all the time. In fact our unittests fail 2-3% of the time with a test or two that randomly fails.

This doesn't affect Talos alerts as much, but keep in mind that if you cannot determine a trend for an alerted regression and have done a lot of retriggers, then it is probably not worth the effort to find the root cause.