Performance/Telemetry Regression Alerts

Overview

We have over 1,000 Telemetry probes, so we need an automated way to monitor them for regressions. Noise is a major challenge, even more so than with Talos data, because Telemetry data is collected from a wide variety of computers, configurations and workloads. We need a reliable means of detecting regressions, improvements and other changes in a measurement's distribution.

Design

The current prototype uses telemetry.js to fetch the histograms for the build-ids of the past couple of months. The histograms are passed to a Python job that aggregates them by platform and channel and runs a regression-detection algorithm for each metric. Histograms that don't have enough data are filtered out. The Bhattacharyya distance is computed between the histogram of the current build-id and the histograms of the past N build-ids. A regression is reported when two conditions hold: the variance of the distances between the histogram of the current build-id and the histograms of the past N build-ids is small enough, and the distance between the histograms of the current build-id and the previous build-id is above a cutoff value K. Cutoff values are determined empirically from the data and from past known regressions.
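The detection rule described above can be sketched roughly as follows. This is a minimal illustration and not the prototype's actual code: the function names and the parameters cutoff_k, max_variance and min_samples are hypothetical, histograms are assumed to be plain lists of bucket counts with identical bucket layouts, and the distance is taken as the negative log of the Bhattacharyya coefficient of the normalized histograms.

```python
import math


def normalize(hist):
    """Turn raw bucket counts into a discrete probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram has no samples")
    return [count / total for count in hist]


def bhattacharyya_distance(hist_a, hist_b):
    """Bhattacharyya distance between two histograms with identical buckets."""
    p, q = normalize(hist_a), normalize(hist_b)
    # Bhattacharyya coefficient: overlap between the two distributions.
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    coefficient = min(coefficient, 1.0)  # guard against floating-point overshoot
    if coefficient == 0.0:
        return math.inf  # no overlap at all
    return -math.log(coefficient)


def is_regression(current, past, cutoff_k, max_variance, min_samples):
    """Apply the rule from the text to one metric.

    `past` holds the histograms of the past N build-ids, oldest first.
    All threshold arguments are hypothetical; the text says real cutoffs
    are chosen empirically.
    """
    # Filter out histograms that don't have enough data.
    usable = [h for h in past if sum(h) >= min_samples]
    if sum(current) < min_samples or len(usable) < 2:
        return False

    distances = [bhattacharyya_distance(current, h) for h in usable]
    mean = sum(distances) / len(distances)
    variance = sum((d - mean) ** 2 for d in distances) / len(distances)

    # Condition 1: distances to the past N build-ids are consistent.
    # Condition 2: distance to the previous build-id exceeds the cutoff K.
    return variance <= max_variance and distances[-1] > cutoff_k
```

For example, with past histograms that are all close to each other and a current histogram shifted by one bucket, is_regression would report a regression only if the chosen cutoff_k is below the resulting distance; picking those thresholds is the empirical step mentioned above.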

On our dataset, the Bhattacharyya distance has performed significantly better (in terms of false positives) than a correlation coefficient, a chi-square test, a Mann-Whitney test, a Kolmogorov-Smirnov test of the estimated densities or a one-class Support Vector Machine.

People

Roberto Vitillo: stats work
Mark Reid: Telemetry server-side changes
Avi Halachmi
Vladan Djeric
External contributors are welcome; contact vdjeric@mozilla.com

Meeting notes

Tracking bug: bug 1031011

June 25, 2014: Initial meeting to discuss the approach & requirements
