Auto-tools/Projects/Signal From Noise/StatusNovember2011


State of Performance Testing: November 2011

This document presents the state of Talos performance testing and supporting infrastructure as of November 2011, with highlights on deficiencies of the existing system.

Review of Current Performance Testing Workflow

For a push to (e.g.) mozilla-central:

It is a common misconception that TBPL Talos runs turn orange when performance regresses. In fact, as long as no infrastructure problems are encountered, Talos runs always report green regardless of the measured performance of the system under test. Graphserver and the Talos regression emails are the only sources of information about whether a push caused a regression or improvement. However, the existing methods of calculating statistics (professing to detect regressions as small as 1%) produce a high degree of noise and therefore a high ratio of false positives (and false negatives) to actual regressions. The consequence is confusion among developers over whether a reported regression is legitimate; as a result, the regression emails are largely ignored, since there are so many of them and most are merely noise.

It is another common misconception that all of the "Talos regression/improvement" emails actually come from the Talos tests. Some (e.g. the number of constructors) do not.

State of Talos, November 2011

  • Not developer friendly: hard to install and run
  • Statistics hard-coded: the averaging logic used for individual tests lives directly in the code and is not configurable short of editing the Python test harness
  • Pageset-centric statistics: (this applies to both Talos and graphserver) Talos and graphserver compute statistics at the granularity of the entire pageset (a manifest of pages). This effectively prohibits inquiries into statistics for a single page
  • Deployed to test slaves via a file: a zipfile of the non-packaged Talos is copied to the test slave and added to PYTHONPATH via buildbot
    • Cannot use dependency management or be installed as a Python package
  • PerfConfigurator, the configuration module of Talos, is fragile and contains/requires duplicated code blocks
    • A frequent complaint is that running PerfConfigurator + Talos is a two-step process

Types of Talos Tests

Talos has, in essence, three different kinds of tests, although the lines between them are blurred:

State of Graphserver

Talos performance data is uploaded to graphserver.

State of Statistics, November 2011

While "no one knows" exactly why the statistics in use made it into Talos, statistics were calculated in the following manner as of November 2011:

 ... the following aggregation process occurs: 
 1. Filter maximum value from times for each component.  
 2. Generate a median of the remaining values for each component. 
 3. Filter the maximum median from the set. 
 4. Produce an average of the remaining medians.
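The four aggregation steps above can be sketched in Python. This is an illustrative rewrite only (the function name and input shape are hypothetical; the real logic lived in the Talos/graphserver code):

```python
from statistics import mean, median

def aggregate_pageset(times_per_component):
    """times_per_component: {page_name: [load times in ms]} (hypothetical shape)."""
    medians = []
    for times in times_per_component.values():
        # 1. Filter the single maximum value from each component's times.
        remaining = sorted(times)[:-1]
        # 2. Take the median of the remaining values.
        medians.append(median(remaining))
    # 3. Filter the maximum median from the set.
    medians.remove(max(medians))
    # 4. Produce an average of the remaining medians.
    return mean(medians)

# Example: a pageset of three pages, each loaded four times.
aggregate_pageset({"a": [1, 2, 3, 100],
                   "b": [2, 4, 6, 8],
                   "c": [10, 10, 10, 10]})  # -> 3.0
```

Note that steps 1 and 3 discard one value per page and one whole page per pageset, which is part of why per-page statistics are impossible to recover from the reported number.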

Where a "component" is a single page in a page set. Taking the median value over the loads of each page ("component") is presumably done because the distribution is often multi-modal, and taking the median may, optimistically, sample from the main mode of the distribution. Regression detection works as follows:
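A small illustration of that rationale, with made-up numbers: in a bimodal set of load times, the mean is pulled toward the secondary "slow" mode, while the median stays within the main mode:

```python
from statistics import mean, median

# Illustrative data only: a main mode around 100 ms and a secondary
# slow mode around 400 ms (e.g. an occasional GC pause).
loads = [98, 100, 101, 99, 102, 100, 400, 410]

avg = mean(loads)    # 176.25 -- pulled toward the slow mode
mid = median(loads)  # 100.5  -- stays within the main mode
```

This is also why the median is only "optimistic": it tells you nothing about how often the slow mode occurs.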

 To determine whether a given point is "good" or "bad", we take 20-30 points of historical data and 5 points of future data, and compare these using a t-test. Regressions are mailed to the dev-tree-management mailing list. Regressions are calculated by the script, which uses a configuration file based on
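A minimal sketch of that detection scheme, with hedges: the function names are hypothetical, the exact t-test variant used by the script is not specified in this document (Welch's statistic is assumed here), and the threshold is purely illustrative:

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(history, future):
    """Welch's two-sample t statistic (assumed variant; the script's
    exact test is not documented here)."""
    n1, n2 = len(history), len(future)
    m1, m2 = mean(history), mean(future)
    v1, v2 = variance(history), variance(future)
    return (m2 - m1) / sqrt(v1 / n1 + v2 / n2)

def looks_like_regression(history, future, threshold=2.0):
    """Flag pushes whose 5-point future window is significantly slower
    than the 20-30 point historical window (threshold is illustrative)."""
    return t_statistic(history, future) > threshold
```

A usage sketch: with ~30 historical points near 101 ms, a future window near 110 ms is flagged, while one near 101 ms is not. As noted below, the t-test assumes normally distributed data, which page-load times are not.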


In practice, a high degree of noise and many false positives (and negatives) are observed in regression and improvement detection. Analysis of this script has pointed out its general methodology as well as statistical shortcomings and potentially faulty assumptions behind it. One notable violation of assumptions is that the t-test assumes a normal distribution, which we know for a fact not to be true (as documented elsewhere in the thesis).

One tool available to developers is compare-talos, a semi-official web app for comparing Talos numbers from different runs.

Larres and Lewchuk both investigated Talos statistics and suggested several potential areas of improvement.