Signal From Noise
Signal From Noise is an effort, begun in Q1 2012, to improve Talos statistics so that regressions and improvements are detected more rigorously: https://wiki.mozilla.org/Auto-tools/Projects/Signal_From_Noise . While the project was initially scheduled for Q1, the complexity and amount of effort needed to provide statistical fidelity were far greater than expected. Signal from Noise builds on the work of Larres (https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf ) and Stephen Lewchuk, who analyzed the statistics yielded by Talos and suggested areas for improvement.
Goals of Signal from Noise
- Enhance the fidelity of performance testing measurements: we want to ensure that (nearly) all regressions and improvements are detected, while reporting as few false positives (the noise) as possible
- Understand our performance statistics models and assumptions end-to-end and ensure that the statistics we are using are valid
- Document and expose these statistical methodologies so that they are transparent
- Make it possible, and easy, for developers to detect regressions in their code, both locally and from the try server.
- Turn Talos jobs orange on TBPL when a Talos regression is detected. Developers treat https://tbpl.mozilla.org/ as the single source of truth for whether a given push is good or bad.
It is also an implicit goal of Signal from Noise to ensure that we measure statistics as close as possible to what is relevant to users. Performance tests are effectively a proxy for what the user experiences; while a test is only an analog, a principle of performance testing is that you measure analogs that have some meaning to the user. For example, while Larres' proposal to turn off address space layout randomization ( https://wiki.mozilla.org/images/c/c0/Larres-thesis.pdf#page=55 ) eliminates noise, it also creates a system under test that is quite different from how the user would experience Firefox. It would also make testing harder in general, since each developer would have to build a non-ASLR version of Firefox for Talos testing.
Execution of Signal from Noise
In general, it was initially expected that back-of-the-envelope analysis of existing Talos numbers, together with back-of-the-envelope implementations of statistics that at least appeared (though non-rigorously) less noisy, would satisfy the goals of the Signal from Noise project for the time being, with ongoing effort optimistically invested in further analysis of performance data. The initial SfN effort was scheduled for a single quarter; this proved rather optimistic. (See also: http://k0s.org/mozilla/blog/20120829151007 .)
In practice, because averaging was split between Talos and graphserver, it was effectively impossible to use the existing system to develop more robust statistics. It quickly became apparent that we needed a system that preserved the raw measurements, so that we could apply and compare the fidelity of different statistical models on the Talos data. The Talos test harness itself should not be crunching numbers: its role is measuring and reporting.
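A small sketch of why averaging in the harness is lossy (the replicate values below are invented for illustration; this is not Talos code):

```python
# Illustrative sketch: why averaging before upload is lossy.
# A page's replicate load times often contain outliers (GC pauses, cold caches).
raw_replicates = [212.0, 208.0, 215.0, 210.0, 980.0]  # hypothetical ms values

mean = sum(raw_replicates) / len(raw_replicates)  # 365.0 -- dominated by the outlier

# With the raw values preserved, a more robust statistic is still available:
median = sorted(raw_replicates)[len(raw_replicates) // 2]  # 212.0

# If only `mean` had been uploaded, the median (or any other statistic)
# could never be recovered: every downstream model is stuck with 365.0.
print(mean, median)
```

Once the harness collapses the replicates to a single number, experiments with medians, trimmed means, or any other estimator become impossible, which is exactly the situation the Talos/graphserver split created.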
Work was therefore devoted to creating a graphserver replacement that would make it possible to perform regression and improvement detection per push. This is the Datazilla project: https://wiki.mozilla.org/Auto-tools/Projects/Datazilla . The decision to write a new piece of infrastructure rather than refine graphserver was not taken lightly. While any new software brings unknown (but non-trivial) sinks of time, there was little existing code in graphserver to use as a foundation for the problems we cared about solving.
Problems We Aimed to Solve with Datazilla
- Preserve and capture raw performance numbers: the Talos test framework is a bad place to do statistics, because any averaging done before uploading the results forever destroys the ability to retrieve the original data. Instead, Datazilla should take in all raw values from Talos and provide a central platform for regression/improvement detection and statistical study.
- Reduce the granularity of Talos from a page set to a single page: statistics and regressions should be dealt with on a per-page basis, as pages may have wildly different performance characteristics. See also https://wiki.mozilla.org/Metrics/Talos_Investigation#Unrolling_Talos and http://k0s.org/mozilla/blog/20120425093346 .
- Establish a full, extensible RESTful interface to the data: Datazilla's data and statistical methods should be accessible by all developers and the tools they may wish to craft to use the data.
- Statistics should be self-evident: Talos+graphserver and other statistical systems have often been approached as a "black box": a number comes out that is "good" or "bad". This effectively leaves an interested developer in the dark as to where the number came from, and discourages understanding the system and exploring the data. Datazilla is designed to expose the statistics being used so that there are no mysteries here.
- No requirement to update the database every time a test or machine changes: unlike the maintenance nightmare that is the current data.sql in graphserver, the Datazilla schema should be dynamic in response to uploaded data.
- Allow experimentation with statistics: while in practice there will be a canonical manner (or conceivably manners) of determining regressions and improvements, alternatives should be easy to investigate and swap in. This is only possible with a system that stores all the raw data from the performance tests.
- Ability to utilize data from arbitrary performance suites, not just Talos: whatever we create next for performance analysis should be able to use Datazilla as a data storage and retrieval system. This way we can use Datazilla as a building block in our next performance automation task.
- Datazilla should be scalable enough to accumulate data per push and generate a regression/improvement analysis for that push in real time.
- The system should also provide a UI atop its own REST interfaces so that an interested developer can start on TBPL and drill into the results of a push, all the way down to the raw replicate values for a page (each page is loaded some number of times, and you should be able to drill down to that level if you want to).
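To make the per-page analysis concrete, the sketch below applies Welch's t statistic to hypothetical raw replicates for a single page across two pushes. This is illustrative only: it is not the statistic Datazilla actually ships, and all of the numbers are invented. The point is that this kind of comparison is only possible when the raw replicates survive to the server.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples of raw replicate load times.
    Illustrative only -- one of many statistics raw data makes possible."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        return m, v
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (mb - ma) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical raw replicates for one page: baseline push vs. new push (ms).
baseline = [210.0, 208.0, 213.0, 209.0, 211.0]
new_push = [231.0, 228.0, 234.0, 230.0, 229.0]

t = welch_t(baseline, new_push)
# A large positive t suggests the page got slower. A real system would
# convert t to a p-value and apply a multiple-comparison correction,
# since the test runs once per page in the suite.
print("regression suspected" if t > 3.0 else "no clear change")
```

Because each page gets its own test, a regression confined to one page is not diluted by averaging across a page set, which is the motivation for reducing the granularity from page set to single page.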