State of Performance Testing: November 2011
This document presents the state of Talos performance testing and supporting infrastructure as of November 2011, with highlights on deficiencies of the existing system.
Review of Current Performance Testing Workflow
For a push to (e.g.) mozilla-central:
- talos tests are run by buildbot on the test slaves
- talos uploads numbers to http://graphs.mozilla.org from the slaves
- a script versioned with graphserver (but unrelated to and unused by graphs.mozilla.org) mails dev-tree-management when a regression or improvement is detected: http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py . The methodology used is available from inspection of the script or https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=73
It is a common misconception that Talos jobs on TBPL turn orange when a test fails. In fact, as long as no infrastructure problems are encountered, Talos runs always report green regardless of the performance measurements of the system under test. Graphserver and the Talos regression emails are the only source of information as to whether a regression or improvement has occurred. However, the existing methods of calculating statistics (which profess to detect regressions as small as 1%) are noisy and therefore yield a high ratio of false positives (and negatives) to actual regressions detected. As a result, developers cannot easily tell whether a reported regression is legitimate, and the regression emails are largely ignored: there are many of them and most are merely noise.
It is another common misconception that all of the "Talos regression/improvement" emails actually come from the Talos tests. Some (e.g. numbers of constructors) do not.
State of Talos, November 2011
- Not developer friendly: hard to install/run
- Statistics hard-coded: the averaging logic used for individual tests lived directly in code and could not be configured short of editing the Python test harness
- Pageset-centric statistics: both talos and graphserver had statistical granularity only at the level of the entire pageset (a manifest of pages, see https://wiki.mozilla.org/Buildbot/Talos/Tests#Page_Load_Tests ). This effectively prohibited inquiries into statistics for a single page
- Deployed to test slaves via a talos.zip file: a zipfile of the non-packaged Talos is copied to the test slave and added to PYTHONPATH via buildbot
- Could not use dependency management or install as a python package
- PerfConfigurator, the configuration module of Talos, was fragile and contained/required duplicate code blocks
- A frequent complaint was that running PerfConfigurator + talos was a two-step process
Types of Talos Tests
Talos has, in essence, three different kinds of tests, although the lines between them are blurred:
- startup tests : https://wiki.mozilla.org/Buildbot/Talos#Startup_Tests
- pageloader tests : https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests
- computed (pageloader) tests : while page load tests typically measure the time it takes for a certain event to fire (e.g. MozAfterPaint), they may also compute their own metric while loading a manifest of web pages.
State of Graphserver
- While implemented in python, the Graphserver is not a python package
- New tests/machines/etc must be added by changing https://hg.mozilla.org/graphs/file/tip/sql/data.sql and the new sql must be redeployed to refresh the live DB -- this must happen FOR EVERY TEST AND MACHINE CHANGE.
- Built-in assumption that the number of interest is the mean of all pages except the longest running page ( http://hg.mozilla.org/graphs/file/f5047264f3f3/server/pyfomatic/collect.py#l212 ) which we proved is a statistically dubious assertion.
State of Statistics, November 2011
While "no one knows" exactly why the statistics being used made it into Talos, statistics were calculated in the following manner as of November 2011 (from https://wiki.mozilla.org/Metrics/Talos_Investigation#Talos_Background ):
... the following aggregation process occurs: 1. Filter maximum value from times for each component. 2. Generate a median of the remaining values for each component. 3. Filter the maximum median from the set. 4. Produce an average of the remaining medians.
Where "component" is a single page in a page set. (See also: http://k0s.org/mozilla/blog/20120829151007 , https://bugzilla.mozilla.org/show_bug.cgi?id=710484 ). Taking the median value over each page load ("component") is presumably done because the distribution is often multi-modal (see https://wiki.mozilla.org/Metrics/Talos_Investigation#Non_Normal_distributions ) and taking the median, optimistically, may sample from the main mode of a distribution.
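The four aggregation steps quoted above can be sketched in a few lines of Python. This is an illustration only, not the Talos or graphserver code: the names drop_max and aggregate_pageset, the sample timings, and the choice to keep a lone value rather than discard it are all assumptions made for the example.

```python
from statistics import mean, median


def drop_max(values):
    """Return a copy of values with one occurrence of the maximum removed.

    Assumption for this sketch: a single value is kept, not discarded.
    """
    values = list(values)
    if len(values) > 1:
        values.remove(max(values))
    return values


def aggregate_pageset(times_by_page):
    """Collapse per-page load times into one number for the whole page set."""
    # Steps 1 and 2: per page, drop the slowest load, then take the median.
    medians = [median(drop_max(times)) for times in times_by_page.values()]
    # Steps 3 and 4: drop the largest per-page median, then average the rest.
    return mean(drop_max(medians))


# Hypothetical load times (ms) for a three-page set.
sample = {
    "page_a": [120, 100, 101, 99, 102],
    "page_b": [210, 200, 190, 205, 400],
    "page_c": [900, 50, 52, 51, 49],
}
print(aggregate_pageset(sample))  # 75.5
```

In this sample, page_b has the largest median and is dropped in step 3, so under this reading of the process a change confined to the slowest page would not move the reported number at all.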
http://hg.mozilla.org/graphs/file/tip/server/analysis/analyze_talos.py works as follows:
To determine whether a point is "good" or "bad", we take 20-30 points of historical data and 5 points of future data and compare them using a t-test. See https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 . Regressions are calculated by the analyze_talos.py script, which uses a configuration file based on http://hg.mozilla.org/graphs/file/tip/server/analysis/analysis.cfg.template , and are mailed to the dev-tree-management mailing list.
In practice, a high level of noise and many false positives (and negatives) are observed in regression and improvement detection. https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=74 describes the general methodology used by this script, along with its statistical shortcomings and potentially faulty assumptions. One notable violation: the t-test assumes normally distributed data, which Talos data is known not to be (as documented elsewhere in the thesis).
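The windowed comparison described above can be sketched roughly as follows. This is not the code from analyze_talos.py: the use of Welch's form of the t statistic, the function names welch_t and flag_changes, the threshold of 7.0, and the sample data are all assumptions chosen for illustration. Only the window sizes (20 historical points, 5 future points) come from the description.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Welch's t statistic for mean(after) - mean(before)."""
    diff = mean(after) - mean(before)
    standard_error = sqrt(
        variance(before) / len(before) + variance(after) / len(after)
    )
    if standard_error == 0:
        # Both windows are constant: identical means give 0, otherwise
        # the statistic is unbounded.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / standard_error


def flag_changes(series, back=20, fwd=5, threshold=7.0):
    """Return indices where the next fwd points differ from the prior back points."""
    flagged = []
    for i in range(back, len(series) - fwd + 1):
        t = welch_t(series[i - back:i], series[i:i + fwd])
        if abs(t) > threshold:  # threshold is a hypothetical cutoff
            flagged.append(i)
    return flagged


# Hypothetical series: 20 stable points near 100, then a jump to about 110.
history = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100] * 2
future = [110, 111, 109, 110, 112]
print(flag_changes(history + future))  # [20]
```

The t statistic is only a reliable guide when both windows are roughly normally distributed. With multi-modal data, a few samples landing in a secondary mode can inflate or mask the difference in means, which is one way the false positives and negatives noted above can arise.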
One tool available to developers is compare-talos: a semi-official web app that exists to compare talos numbers from different runs: http://perf.snarkfest.net/compare-talos/ .
Larres (see https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf ) and Lewchuk (see https://bugzilla.mozilla.org/show_bug.cgi?id=710484 , https://groups.google.com/forum/#!msg/mozilla.dev.platform/kXUFafYInWs/XRCsrapUUGAJ ) investigated Talos statistics and suggested several potential areas of improvement.