State of Performance Testing: November 2012
While the Signal from Noise project is not yet complete, there have been considerable improvements to Talos and the supporting infrastructure in the preceding year.
State of Talos: November 2012
The following areas have been improved in Talos as part of the SfN project:
- Talos has been made more developer friendly
- It is now a standard python package with appropriate dependencies
- It is now easier to run: https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code
- Features `talos` + `PerfConfigurator` executables
- `talos` may be executed in a single step
- Standalone Talos has been deprecated
- The Pageloader extension is now versioned with Talos
- Per the recommendation of Lewchuk and Metrics ( https://wiki.mozilla.org/Metrics/Talos_Investigation#Proposed_Filtering_Changes ), page load tests no longer ignore the maximum value; instead they ignore the first 5 values per page: https://bugzilla.mozilla.org/show_bug.cgi?id=710484 (see ignore_first:5 in http://k0s.org:8080/?show=active )
- Talos statistics being reported to graphserver are now configurable via filters (see http://k0s.org/mozilla/blog/20120215124438 ). That said, we ultimately want to remove filters entirely once we no longer have to maintain graphserver. Talos shouldn't be doing statistics; this is a stop-gap measure (albeit a year-long one) until we switch to using Datazilla and turn off graphserver.
- All raw Talos measurements are now reported to Datazilla: https://datazilla.mozilla.org/talos
- Talos tp tests no longer touch network: https://bugzilla.mozilla.org/show_bug.cgi?id=720852
- PerfConfigurator has been made a robust YAML/JSON configuration parser and generator
- run_tests.py now utilizes PerfConfigurator
- PerfConfigurator and remotePerfConfigurator have also been combined
- Separate PerfConfigurator step no longer required
- Pageloader no longer calculates statistics: https://bugzilla.mozilla.org/show_bug.cgi?id=723571
- Test definitions are no longer duplicated throughout the Talos codebase. They now live in a single python file: http://hg.mozilla.org/talos/file/tip/talos/test.py. This allows the test definitions to take advantage of inheritance, eliminates repeated code, and has made Talos far easier to configure, change, maintain, and expand. There is more we want to do here; this is merely a start. For more details see https://bugzilla.mozilla.org/show_bug.cgi?id=814228. jmaher - we should expand on the usage of this, the audience is developers
- Talos testing on try server with talos.json: Release engineering put forth considerable effort to allow talos changes to be tested with try server. The results of this effort include having the URL of a talos.zip file listed in https://mxr.mozilla.org/mozilla-central/source/testing/talos/talos.json . This change alone probably saved hundreds of man-hours. ctalbert - expand this so that we can detail a bit about how to actually go about making use of the functionality.
- Software has been written, including a web app component, that details the up-to-date names of talos tests and suites in buildbot, TBPL, talos, and graphserver: http://k0s.org/mozilla/hg/talosnames/ . A deployed instance is at http://k0s.org:8080/ .
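The ignore_first:5 recommendation and the filter mechanism described above can be illustrated with a minimal sketch. The function names here are illustrative only, not the actual Talos filter API:

```python
# Minimal sketch of the filter idea; function names are illustrative,
# not the actual Talos filter API.

def ignore_first(values, n=5):
    """Drop the first n replicates, which carry most of the warm-up noise."""
    return values[n:]

def median(values):
    """Summarize the remaining replicates with a single robust statistic."""
    sorted_vals = sorted(values)
    mid = len(sorted_vals) // 2
    if len(sorted_vals) % 2:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2.0

# 25 replicates for one page: the first few are noticeably slower.
replicates = [900, 450, 320, 310, 305] + [300] * 20
print(median(ignore_first(replicates, 5)))  # -> 300.0
```

Chaining small filters like these is what lets the statistic reported to graphserver be configured per test rather than hard-coded.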
Several contributors have also participated in Talos development. \o/ The scope of their contributions has ranged from good first bug fixes to over-arching rewrites of parts of the software. Thanks go out to all the folks who volunteered their time to help out here.
There are several remaining areas where the Talos software should be improved, such as:
- Complete our work toward a single, central definition of each Talos test, rather than definitions split across several places: https://bugzilla.mozilla.org/show_bug.cgi?id=814228
- Unification of Talos counters: https://bugzilla.mozilla.org/show_bug.cgi?id=812352
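The inheritance-based test definitions that the central test.py enables follow roughly this pattern. The classes and attribute values below are a simplified, illustrative sketch, not copied from the actual file:

```python
# Simplified sketch of inheritance-based test definitions; class and
# attribute names are illustrative, not copied from talos/test.py.

class Test:
    """Base class: shared defaults for every Talos test."""
    cycles = 1
    filters = None

class PageloaderTest(Test):
    """Defaults common to all page load tests."""
    tpcycles = 1
    tppagecycles = 25          # replicates per page
    filters = [("ignore_first", 5), ("median",)]

class tp5(PageloaderTest):
    """A concrete suite only overrides what differs from the defaults."""
    tpmanifest = "page_load_test/tp5/tp5.manifest"

print(tp5.tppagecycles)  # inherited from PageloaderTest -> 25
```

Because a concrete suite only declares what differs from its base class, adding or changing a test touches one small class instead of several duplicated config files.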
State of Datazilla: November 2012
Datazilla manages Talos data with three distinct database schemas: talos_objectstore_1, talos_perftest_1, and pushlog_hgmozilla_1. The objectstore contains a single table designed to store JSON objects. These objects contain a set of untreated replicate values for every page in a given Talos test suite. They are indexed in a separate schema called talos_perftest_1. In addition to indexing test data and reference data (product type, platform information, test suite/page names), the index also stores associated metrics data. This includes the results of Welch's one-sided t-test, the application of the false discovery rate procedure, and exponentially smoothed means and standard deviations. The application of metrics is treated generically in the schema, so any number of statistical treatments of the raw data can be supported in the future. The pushlog_hgmozilla_1 schema maintains an ordered list of pushes that is used to compare consecutive pushes to one another; the raw JSON data generated by Talos in production is received asynchronously and not necessarily in the push order that occurred in the repository. All of the database schemas can be found here: https://github.com/mozilla/datazilla/tree/master/datazilla/model/sql/template_schema .
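For illustration, a blob of the kind stored in the objectstore might look roughly like the following. The field names here are guesses for the purpose of illustration, not the exact production format:

```python
import json

# Hypothetical illustration of a Talos result blob as stored in
# talos_objectstore_1; field names are illustrative, not the exact schema.
blob = {
    "test_machine": {"name": "talos-r3-fed-001", "os": "linux", "platform": "x86_64"},
    "test_build": {"name": "Firefox", "revision": "abc123def456", "branch": "Mozilla-Inbound"},
    "testrun": {"suite": "tp5", "date": 1352246400},
    # Untreated replicate values for every page in the suite:
    "results": {
        "page-one.example.com": [722.0, 512.0, 501.0, 499.0, 503.0],
        "page-two.example.com": [980.0, 731.0, 728.0, 725.0, 730.0],
    },
}

# The objectstore table just stores the serialized JSON; indexing into
# talos_perftest_1 happens in a separate step.
serialized = json.dumps(blob)
print(len(blob["results"]))  # -> 2
```

Keeping the raw replicates intact in the objectstore is what allows any future statistical treatment to be re-run over historical data.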
The user interface for Datazilla was initially designed to drill down into and examine the raw data associated with a Talos test. This was helpful in Q1-Q2 2012 in determining what needed to be done, but it does not address the issue of performance regression detection, which is most relevant to developers and sheriffs. A new user interface was designed and implemented in Q4 to display the results of the new metrics treatment.
- Datazilla is now deployed in production. The source code can be found here: https://github.com/mozilla/datazilla
- Datazilla utilizes production Talos data
- RESTful API: http://datazilla.readthedocs.org/en/latest/webservice/
- There is a python client, as used by Talos: https://github.com/mozilla/datazilla_client
State of Statistics: November 2012
The Mozilla Metrics team, https://wiki.mozilla.org/Metrics , worked as part of Signal from Noise to audit our performance statistical methodology and help develop better models. Metrics looked at the following issues:
- Determine source(s) of variation in the data: After looking at the data from running experiments, Metrics determined two main sources of variation in the data. First, aggregating all the test pages into a single number was hiding true signal from noise as the pageload times for the 100 pages were very different. Second, the way Talos data was being collected before Q1 2012 introduced a large variation within the replicates of each test page.
- Interleaved/non-interleaved tests: as of Q1 2012, pageload tests (see https://wiki.mozilla.org/Buildbot/Talos#Page_Load_Tests ) were run such that the entire pageset was cycled through 'N' times, where 'N' is the number of replicates per page. We were concerned that this could be a source of our noise. This issue was investigated (see http://elvis314.wordpress.com/2012/03/12/reducing-the-noise-in-talos/ and http://people.mozilla.org/~ctalbert/TalosPlots/rowmajor_change/index.html ) and the run order was changed so that each page's replicates are collected consecutively. With this change, the "within" variation for individual test pages decreased, making the tests more powerful at detecting regressions between pushes.
- Non-normal distributions - https://wiki.mozilla.org/Metrics/Talos_Investigation#Non_Normal_distributions : Several non-normal distributions were found amongst the Talos data sets, including multi-modal distributions. One cause of multimodality was the aggregation of pages with very different pageload times, owing to the different characteristics of the pages we are testing in tp5. Hence, it is crucial to move to page-centric testing rather than aggregated testing.
- Determining the number of observations per test page: It is crucial that we strike a good balance between machine time for a Talos test and having enough replicates for statistical validity of the test results. The optimal number of replicates for each test page for statistical testing is about 30 (J. Devore, Probability & Statistics for Engineering & Sciences, 8th ed., p. 226). However, due to time constraints, we decided to collect 25 replicates (still a big improvement over the previous 10 replicates, but not optimal).
- Pageload tests ignore the first 5 data points (ignore_first:5): Metrics determined that ignoring the first 5 data points on pageload tests increased statistical validity because most of the variation was coming from the first few data points; see https://bugzilla.mozilla.org/show_bug.cgi?id=731391 and https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf .
- Audit of tp5 pageset: The tp5 (see https://wiki.mozilla.org/Buildbot/Talos#tp5 ) pageset was audited because some of the pages had significantly large variation within replicates: https://wiki.mozilla.org/images/3/38/Tp5_Good_Pages.pdf (Metrics recommended decreasing the size of test pages in tp5 and increasing the number of replicates).
- Quality of data: Some pages show systematic patterns which may indicate that there is a problem with the data being collected (may be due to hardware, software, validity of test pages, etc.). This should be investigated to ensure that the data we collect for testing correctly represents what we are trying to measure.
- New method for regression detection: https://wiki.mozilla.org/images/d/dd/Talos_Statistical_Analysis_Writeup.pdf : Working with Datazilla results for tp5 test pages, Metrics developed a regression detection algorithm. To compare the mean of each page in a new push to the mean of the same page in the previous push, hypothesis tests are conducted ( http://en.wikipedia.org/wiki/Statistical_hypothesis_testing ). Welch's t-test is used to determine whether a page has regressed for a given new push. Moving to page-centric testing led to a multiple hypothesis testing problem; to correct for the resulting inflation of false positives, the False Discovery Rate (FDR) procedure is used: http://www.stat.cmu.edu/~genovese/talks/hannover1-04.pdf . Due to the natural variation between consecutive pushes, exponential smoothing is applied before performing the FDR procedure. Code for this is available at https://github.com/mozilla/datazilla-metrics
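The interleaved versus consecutive (row-major) run orders investigated above can be sketched as follows; the page names and value of N are made up for illustration:

```python
# Sketch of the two cycling orders for a pageload run (illustrative only).
pages = ["a.html", "b.html", "c.html"]
N = 3  # replicates per page

# Pre-Q1-2012 "interleaved" order: cycle the whole pageset N times,
# so replicates of a page are separated by loads of every other page.
interleaved = [p for _ in range(N) for p in pages]

# Post-change "row-major" order: load each page N times in a row, so a
# page's replicates are collected back to back, reducing within-page variance.
row_major = [p for p in pages for _ in range(N)]

print(interleaved)  # ['a.html', 'b.html', 'c.html', 'a.html', ...]
print(row_major)    # ['a.html', 'a.html', 'a.html', 'b.html', ...]
```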
Datazilla utilizes these improved statistical methodologies: Welch's t-test, the FDR procedure, and exponential smoothing. A datazilla-metrics repository, https://github.com/mozilla/datazilla-metrics , has been created; it is a python package that implements statistical methods useful for Datazilla.
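A rough, self-contained sketch of these three ingredients follows. This is an illustrative reimplementation under stated assumptions, not the actual datazilla-metrics code:

```python
import math

# Illustrative sketch of the regression-detection ingredients; this is
# NOT the datazilla-metrics implementation.

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    return (mb - ma) / math.sqrt(va / na + vb / nb)

def exp_smooth(prev, new, alpha=0.05):
    """Exponentially smooth a per-page statistic across consecutive pushes."""
    return alpha * new + (1 - alpha) * prev

def benjamini_hochberg(p_values, q=0.05):
    """FDR procedure: return indices of pages flagged as regressed."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank passing the BH threshold
    return sorted(order[:cutoff])

# One page: 20 replicates before and after a push that slowed it down.
before = [300.0 + (i % 5) for i in range(20)]
after = [330.0 + (i % 5) for i in range(20)]
print(round(welch_t(before, after), 1))  # -> 65.4 (large positive t)
print(round(exp_smooth(302.0, 332.0), 2))  # -> 303.5
print(benjamini_hochberg([0.001, 0.04, 0.2, 0.9]))  # -> [0]
```

In the real pipeline the t statistic is converted to a p-value, and the FDR correction is applied across all pages in a suite so that flagging 100 pages per push does not inflate the false-positive rate.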
Performance Testing Roadmap: 2013
It is a goal for 2013 to finish up the loose ends for Talos, Datazilla, and Signal from Noise in general:
- Switch primary performance UI to be Datazilla : bug 824813
- Deprecate Graphserver : bug 824814
- Turn regressions orange on TBPL : bug 824812
In the last year, we've dug into every part of the performance testing automation at Mozilla. We have analyzed the test harness, the reporting tools, and the statistical soundness of the results that were being generated. Over the course of that year, we used what we learned to make the Talos framework easier to maintain, easier to run, simpler to set up, easier to test on try, and less error prone.
We have created Datazilla, an extensible system for storing and retrieving all our performance metrics from Talos and any future performance automation. We have rebooted our performance statistical analysis and created statistically viable, per-push regression/improvement detection. We have made all these systems easier to use and more open so that any contributor anywhere can take a look at our code and even experiment with new methods of statistical analysis on our performance data.
But we're not finished yet. There are more fixes to be done to the Talos framework itself, and the most critical piece of the infrastructure move still has to take place: we have to shift to using Datazilla in production and deprecate our use of Graphserver for new versions of Firefox. As we do that, we can clean out the remaining cruft in the Talos test framework and focus our efforts on new, groundbreaking performance automation. Stay tuned. Or better yet, get involved: https://wiki.mozilla.org/Auto-tools#Want_to_Help.3F