State of Performance Testing: November 2011

This document presents the state of Talos performance testing and its supporting infrastructure as of November 2011, highlighting deficiencies of the existing system.

Review of Current Performance Testing Workflow

For a push to (e.g.) mozilla-central:

talos tests are run by buildbot on the test slaves
talos uploads numbers to http://graphs.mozilla.org from the slaves
a script versioned with graphserver (but unrelated to and unused by graphs.mozilla.org) mails dev-tree-management when a regression or improvement is detected: http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py . The methodology used is available from inspection of the script or https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=73

It is a common misconception that TBPL talos tests turn orange for test failure. In fact, as long as no infrastructure problems are encountered, Talos runs always report green regardless of the performance measurements of the system under test. Graphserver and the talos regression emails are the only sources of information as to whether a regression or improvement has occurred. However, the existing methods of calculating statistics (which profess to detect regressions as small as 1%) produce a high degree of noise and therefore a high ratio of false positives (and negatives) to actual regressions detected. As a result, developers cannot easily tell whether a reported regression is legitimate, and the regression emails are largely ignored: there are many of them and most are merely noise.

It is another common misconception that all of the "Talos regression/improvement" emails actually come from the Talos tests. Some (e.g. numbers of constructors) do not.

State of Talos, November 2011

Not developer friendly: hard to install/run
Statistics hard-coded: the averaging logic used for individual tests lived directly in code and could not be configured short of editing the Python test harness
Pageset-centric statistics: (this applies to both talos and graphserver) talos and graphserver had statistical granularity with respect to the entire pageset (a manifest of pages, see https://wiki.mozilla.org/Buildbot/Talos/Tests#Page_Load_Tests ). This effectively prohibited inquiries into statistics for a single page
Deployed to test slaves via a talos.zip file: a zipfile of the non-packaged Talos is copied to the test slave and added to PYTHONPATH via buildbot
- Could not use dependency management or install as a python package
PerfConfigurator, the configuration module of Talos, was both fragile and contained/required duplicate code blocks
- A frequent complaint was that running PerfConfigurator + talos was a two-step process

Types of Talos Tests

Talos has, in essence, three different kinds of tests, although the lines between them are blurred:

startup tests : https://wiki.mozilla.org/Buildbot/Talos#Startup_Tests
pageloader tests : https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests
computed (pageloader) tests : while page load tests typically measure the time it takes for a certain event to fire (e.g. MozAfterPaint), they may also compute their own metric while loading a manifest of web pages.

State of Graphserver

Talos performance data is uploaded to http://graphs.mozilla.org: see https://wiki.mozilla.org/Perfomatic .

While implemented in python, the Graphserver is not a python package
New tests/machines/etc must be added by changing https://hg.mozilla.org/graphs/file/tip/sql/data.sql and the new sql must be redeployed to refresh the live DB -- this must happen FOR EVERY TEST AND MACHINE CHANGE.
Built-in assumption that the number of interest is the mean of all pages except the longest-running page ( http://hg.mozilla.org/graphs/file/f5047264f3f3/server/pyfomatic/collect.py#l212 ), an assumption which we proved to be statistically dubious.
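One consequence of that assumption can be illustrated with a minimal sketch. This is not the code from collect.py; the function name, page names, and timings are hypothetical and chosen only to show that a metric which discards the slowest page cannot register a change confined to that page.

```python
from statistics import mean


def mean_without_slowest(page_values):
    """Mean of per-page values after dropping the single largest one."""
    values = sorted(page_values)
    if len(values) < 2:
        raise ValueError("need at least two pages")
    return mean(values[:-1])


# Hypothetical per-page load times (ms) before and after a change
# that doubles the load time of the slowest page only.
before = {"a.html": 100.0, "b.html": 120.0, "c.html": 400.0}
after = dict(before, **{"c.html": 800.0})

print(mean_without_slowest(before.values()))  # 110.0
print(mean_without_slowest(after.values()))   # 110.0, the regression is invisible
```

Under these assumed numbers the reported value is identical for both runs, even though one page became twice as slow.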

State of Statistics, November 2011

While "no one knows" exactly why the statistics being used have made it into Talos, statistics were calculated in the following manner as of November 2011 (from https://wiki.mozilla.org/Metrics/Talos_Investigation#Talos_Background ):

 ... the following aggregation process occurs: 
 1. Filter maximum value from times for each component.  
 2. Generate a median of the remaining values for each component. 
 3. Filter the maximum median from the set. 
 4. Produce an average of the remaining medians.

Here "component" is a single page in a page set. (See also: http://k0s.org/mozilla/blog/20120829151007 , https://bugzilla.mozilla.org/show_bug.cgi?id=710484 ). Taking the median value over each page load ("component") is presumably done because the distribution is often multi-modal (see https://wiki.mozilla.org/Metrics/Talos_Investigation#Non_Normal_distributions ) and taking the median, optimistically, may sample from the main mode of a distribution.
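The four aggregation steps quoted above can be sketched as follows. This is an illustration of the described procedure, not the Talos or Graphserver source; the function names, page names, and timings are hypothetical.

```python
from statistics import mean, median


def drop_max(values):
    """Return a copy of values with one occurrence of the maximum removed."""
    values = list(values)
    if len(values) < 2:
        raise ValueError("need at least two values to drop the maximum")
    values.remove(max(values))
    return values


def aggregate_pageset(times_by_page):
    """Collapse per-page load times into a single number for the pageset."""
    # Steps 1 and 2: per page, drop the maximum and take the median of the rest.
    medians = [median(drop_max(times)) for times in times_by_page.values()]
    # Steps 3 and 4: drop the largest median and average the remaining ones.
    return mean(drop_max(medians))


# Hypothetical load times (ms) for a three-page pageset, five loads per page.
pages = {
    "page1": [120, 100, 101, 99, 102],
    "page2": [250, 200, 205, 198, 202],
    "page3": [900, 610, 600, 605, 595],
}
print(aggregate_pageset(pages))  # 150.75
```

In this example the per-page medians are 100.5, 201.0 and 602.5; the largest is discarded and the remaining two are averaged, so page3 does not contribute to the final number at all.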

http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py works as follows:

 To determine whether a good point is "good" or "bad", we take 20-30 points of historical data, and 5 points of future data.  We compare these using a t-test.  See https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 . Regressions are mailed to the dev-tree-management mailing list.  Regressions are calculated by the analyze_talos.py script which uses a configuration file based on http://hg.mozilla.org/graphs/file/tip/server/analysis/analysis.cfg.template

(from https://wiki.mozilla.org/Buildbot/Talos#Regressions)
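The window comparison described in the quote can be sketched roughly as below. This is not analyze_talos.py. The choice of Welch's variant of the t-test, the decision to count the point under test as the first "future" point, the threshold value, and all names and data are assumptions made for illustration; the real settings live in the script and its configuration file.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(before) / len(before) + variance(after) / len(after))
    if se == 0:
        raise ValueError("both windows have zero variance")
    return (mean(after) - mean(before)) / se


def looks_changed(series, index, back=30, fwd=5, threshold=9.0):
    """Compare up to `back` points before `index` with `fwd` points from it.

    `threshold` is a hypothetical cutoff on |t|, not a value from the source.
    """
    before = series[max(0, index - back):index]
    after = series[index:index + fwd]
    if len(before) < 2 or len(after) < 2:
        return False  # not enough data to estimate a variance
    return abs(welch_t(before, after)) > threshold


# Hypothetical data: 30 historical points near 101, then a jump to about 111.
stepped = [100 + (i % 3) for i in range(30)] + [110 + (i % 3) for i in range(5)]
flat = [100 + (i % 3) for i in range(35)]

print(looks_changed(stepped, 30))  # True
print(looks_changed(flat, 30))     # False
```

With only five points in the forward window, the variance estimate for that window is itself noisy, which is one reason such a comparison is sensitive to outliers.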

In practice, a high level of noise and many false positives (and negatives) are observed in regression and improvement detection. https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 describes the general methodology used by this script, along with its statistical shortcomings and the potentially faulty assumptions going into it. One notable violation is that the t-test assumes normally distributed data, which we know not to be the case (as documented elsewhere in the thesis).

One tool available to developers is compare-talos: a semi-official web app that exists to compare talos numbers from different runs: http://perf.snarkfest.net/compare-talos/ .

Larres (see https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf ) and Lewchuk (see https://bugzilla.mozilla.org/show_bug.cgi?id=710484 , https://groups.google.com/forum/#!msg/mozilla.dev.platform/kXUFafYInWs/XRCsrapUUGAJ ) investigated Talos statistics and suggested several potential areas of improvement.
