Performance/Telemetry Regression Alerts
Overview
We have over 1,000 Telemetry probes, so we need an automated way to monitor them for regressions. Noise is a major challenge, even more so than with Talos data, because Telemetry data is collected from a wide variety of computers, configurations and workloads. We need a reliable means of detecting regressions, improvements and other changes in a measurement's distribution.
Design
The current prototype uses telemetry.js to fetch the histograms for the build-ids of the past couple of months. The histograms are passed to a Python job that aggregates them by platform and channel and, for each metric, runs a regression algorithm. The algorithm computes the Bhattacharyya distance between the histogram of the current build-id and the histogram of each of the past N build-ids. A regression is reported when two conditions hold: the variance of those distances is small enough, and the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K. Histograms that don't have enough data are filtered out. Cutoff values are determined empirically from the data and from past known regressions.
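The detection rule described above can be sketched in Python roughly as follows. This is an illustrative sketch, not the prototype's actual code: the function names, the parameters `cutoff` (K), `max_variance` and `min_samples`, and the representation of a histogram as a plain list of bucket counts are all assumptions. The handling of completely disjoint histograms and the choice to take the "previous build-id" after filtering are also assumptions made to keep the example short.

```python
import math
from statistics import pvariance


def normalize(counts):
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram has no samples")
    return [c / total for c in counts]


def bhattacharyya_distance(p_counts, q_counts):
    """Bhattacharyya distance between two histograms with identical buckets."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of buckets")
    p = normalize(p_counts)
    q = normalize(q_counts)
    # Bhattacharyya coefficient: sum over buckets of sqrt(p_i * q_i).
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(bc, 1.0)  # guard against floating-point overshoot
    if bc == 0.0:
        return math.inf  # no overlapping buckets at all
    return max(0.0, -math.log(bc))


def is_regression(current, history, cutoff, max_variance, min_samples):
    """Apply the rule from the text to one metric.

    `history` holds the histograms of the past N build-ids, oldest first.
    All thresholds are hypothetical; the text says real values are chosen
    empirically.
    """
    # Filter out histograms that do not have enough data.
    if sum(current) < min_samples:
        return False
    usable = [h for h in history if sum(h) >= min_samples]
    if len(usable) < 2:
        return False

    distances = [bhattacharyya_distance(current, h) for h in usable]
    # Assumption: disjoint histograms (infinite distance) are left to
    # separate handling, since their variance is undefined.
    if not all(math.isfinite(d) for d in distances):
        return False

    stable = pvariance(distances) <= max_variance
    # Assumption: the newest usable histogram stands in for the previous build-id.
    changed = distances[-1] > cutoff
    return stable and changed
```

With a stable history and a current histogram whose mass has moved to a different bucket, `is_regression` returns `True`. When the current histogram matches the history, or has too few samples, it returns `False`.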
On our dataset, the Bhattacharyya distance has performed significantly better than the Pearson correlation, a chi-square test, a Mann-Whitney test or a one-class Support Vector Machine.
People
- Roberto Vitillo: stats work
- Mark Reid: Telemetry server-side changes
- Avi Halachmi
- Vladan Djeric
- External contributors welcome, contact vdjeric@mozilla.com
Meeting notes
Tracking bug: bug 1031011
- June 25, 2014: Initial meeting to discuss the approach & requirements