Performance/Telemetry Regression Alerts


Overview

We have over 1,000 Telemetry probes, so we need an automated way to monitor them for regressions. Noise is a major challenge, even more so than with Talos data, because Telemetry data is collected from a wide variety of computers, configurations and workloads. We require a reliable means of detecting regressions, improvements and other changes in a measurement's distribution.

Design

The current prototype uses telemetry.js to fetch the histograms for the build-ids of the past couple of months. The histograms are passed to a Python job that aggregates them by platform and channel and runs a regression-detection algorithm for each metric. Histograms that don't have enough data are filtered out.

For each metric, the Bhattacharyya distance is computed between the histogram of the current build-id and the histogram of each of the past N build-ids. A regression is reported when two conditions hold: the variance of those N distances is small enough, and the distance between the histogram of the current build-id and the histogram of the previous build-id is above a cutoff value K. Cutoff values are determined empirically from the data and from past known regressions.
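The detection rule described above can be sketched in Python as follows. This is a minimal illustration, not the prototype's actual code: the function names, the parameter names (cutoff_k, max_variance, min_samples), the use of the population variance, and the handling of sparse or disjoint histograms are all assumptions made for the example. Histograms are assumed to be lists of bucket counts over the same buckets.

```python
import math
from statistics import pvariance


def normalize(counts):
    """Turn raw bucket counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no samples")
    return [c / total for c in counts]


def bhattacharyya_distance(p_counts, q_counts):
    """Bhattacharyya distance between two histograms over identical buckets."""
    p = normalize(p_counts)
    q = normalize(q_counts)
    # Bhattacharyya coefficient: overlap between the two distributions.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(bc, 1.0)  # guard against floating-point overshoot
    if bc == 0.0:
        return math.inf  # the histograms share no populated bucket
    return -math.log(bc)


def is_regression(current, history, cutoff_k, max_variance, min_samples):
    """Apply the two-condition rule to one metric.

    `history` holds the histograms of the past N build-ids, oldest first,
    so history[-1] belongs to the previous build-id.
    """
    # Assumption: a histogram has "enough data" if it has min_samples samples.
    if sum(current) < min_samples:
        return False
    history = [h for h in history if sum(h) >= min_samples]
    if len(history) < 2:
        return False

    distances = [bhattacharyya_distance(current, h) for h in history]
    # Assumption: an infinite distance makes the variance meaningless,
    # so this sketch declines to report anything in that case.
    if any(math.isinf(d) for d in distances):
        return False

    # Condition 1: the distances to the past builds agree with each other.
    if pvariance(distances) > max_variance:
        return False
    # Condition 2: the current build differs enough from the previous one.
    return distances[-1] > cutoff_k
```

For example, with three identical past histograms [10, 20, 30, 40] and a current histogram [40, 30, 20, 10], the distances all agree and exceed a hypothetical cutoff of 0.05, so the sketch reports a regression. The threshold values used here are placeholders; real cutoffs would have to be tuned against the data.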

On our dataset, the Bhattacharyya distance has performed significantly better (in terms of false positives) than a correlation coefficient, a Chi-Square test, a Mann-Whitney test, a Kolmogorov-Smirnov test of the estimated densities, or a one-class Support Vector Machine.

People

  • Roberto Vitillo: stats work
  • Mark Reid: Telemetry server-side changes
  • Avi Halachmi
  • Vladan Djeric
  • External contributors welcome, contact vdjeric@mozilla.com

Meeting notes

Tracking bug: bug 1031011