Data Publishing

From MozillaWiki

Introduction

Mozilla’s history is steeped in openness and transparency - it’s simply core to what we do and how we see ourselves in the world. We are always looking for ways to bring our mission to life in ways that help create a healthy internet and support the Mozilla Manifesto. One of our commitments says “We are committed to an internet that elevates critical thinking, reasoned argument, shared knowledge, and verifiable facts”.

To this end, we have spent a good amount of time considering how we can publicly share our Mozilla telemetry data sets. It is one of the simplest and most effective ways we can enable collaboration and share knowledge, but only if it can be done safely and in a privacy-protecting, principled way. We believe we’ve designed a way to do this, and we are excited to outline our approach here.

Dataset Publishing Process

We want our data publishing review process, as well as our review decisions, to be public and understandable, similar to our Mozilla Data Collection program. To that end, our full dataset publishing policy, along with details about what we consider before determining what is safe to publish, can be found below, including a summary of the critical pieces of that process.

The goal of our data publishing process is to:

  • Reduce friction for data publishing requests with low privacy risk to users;
  • Have a review system of checks and balances that considers both data aggregations and data-level sensitivities to determine privacy risk prior to publishing; and
  • Create a public record of these reviews, including making data and the queries that generate it publicly available and putting a link to the dataset + metadata on a public-facing Mozilla property.

This page defines all of the factors that must be taken into consideration before publicly sharing Mozilla’s telemetry data. It describes:

  • The levels of possible dataset aggregations using Mozilla’s data
  • The levels of publishing sensitivity
  • What dimensions are sensitive, and at which level
  • What metrics are sensitive, and at which level
  • How we characterize the levels of aggregation

How we characterize the levels of aggregation

The table below describes the various types of aggregation levels we are defining.

Level 1: Statistical / ML Models
  Aggregation: A model built/trained using real data.
  Examples: TAAR, Federated learning models, Forecasting models

Level 2: Dimension-level aggregation w/ minimum bucket sizes
  Aggregation: Aggregated by dimensions, minimum "bucket" size of population 5,000.
  Examples: Total page loads by country, OS, locale, channel, where any combination with a count less than 5,000 is grouped into “Other”, e.g. [Canada, Linux, “Other locales”, nightly] for rare locales.

Level 3: Dimension-level aggregation w/o minimum bucket sizes
  Aggregation: Aggregated by dimensions, no minimum bucket size.
  Examples: Client ID count by country, OS, locale, channel, where there could be a bucket such as [Canada, Linux, PL, nightly] with only one client in it.

Level 4: Probabilistic Aggregates
  Aggregation: Data structures for approximations.
  Examples: HLL for computing approximate unique client counts, bloom filter for computing presence in a set.

Level 5: Anonymized individual-level data
  Aggregation: Covers “partial aggregates” like clients_daily, which is aggregated by day. Key feature is that it still has an individual-level identifier. Actual identifiers are anonymized using a one-to-one replacement value. In this example, we replaced the ID with A, B, C, etc.
  Examples:
  • Anonymized_id, date, country, os, locale, channel
  • A, 2019-08-08, Canada, Linux, PL, nightly
  • A, 2019-08-09, Canada, Linux, PL, nightly
  • A, 2019-08-10, Canada, Linux, PL, nightly
  • B, 2019-08-10, Peru, Windows, EN, release

Level 6: Not-anonymized individual-level data
  Aggregation: This data contains individual-level identifiers as they exist in the raw data. Compared with anonymized data, instead of A, B, and C we use the original identifiers.
  Examples:
  • actual_id, date, country, os, locale, channel
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-08, Canada, Linux, PL, nightly
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-09, Canada, Linux, PL, nightly
  • 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-10, Canada, Linux, PL, nightly
  • 4db8d07d-1935-9c45-93c9-6d97a790bb12, 2019-08-10, Peru, Windows, EN, release

Level 7: High resolution individual-level data
  Aggregation: The highest level of resolution is releasing events at the per-second or per-subsecond resolution.
  Examples: Raw telemetry events data, a sequence of actions in order of occurrence.
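To make the difference between levels 2 and 3 concrete, here is a minimal sketch of a level 2 aggregation in Python. It is an illustration only, not the query Mozilla runs: the function name, the input shape (one tuple per page load), and the choice to suppress any bucket that is still too small after rolling up the locale are all assumptions.

```python
from collections import Counter

MIN_BUCKET_SIZE = 5_000  # minimum bucket population for level 2, per the table above


def aggregate_with_min_bucket(rows, min_size=MIN_BUCKET_SIZE):
    """Count rows per (country, os, locale, channel) with a minimum bucket size.

    `rows` is an iterable of (country, os, locale, channel) tuples,
    one tuple per page load (hypothetical input shape).
    """
    exact = Counter(rows)  # this is the level 3 view: no minimum bucket size
    published = Counter()
    for (country, os_name, locale, channel), count in exact.items():
        if count < min_size:
            # Too small to publish on its own: hide the rare locale.
            locale = "Other locales"
        published[(country, os_name, locale, channel)] += count
    # Assumption: a rolled-up bucket that is still below the threshold
    # is suppressed rather than published.
    return {key: n for key, n in published.items() if n >= min_size}
```

Skipping the roll-up and returning `exact` directly would give the level 3 variant, where a bucket such as [Canada, Linux, PL, nightly] may contain a single client.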

How we characterize the sensitivity of dimensions

Based on the Collection Categories, most Telemetry data naturally falls within category 1 (technical data) or category 2 (interaction data), which are not considered sensitive. A notable exception is geolocation: we geocode IP addresses to extract City / Region / Country, but only include cities with a population greater than 15,000 (according to the Geonames database).
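The city threshold can be sketched as a small filter. This is a hypothetical illustration: the function name and the idea of passing the population in as an argument are assumptions (in practice the figure would come from a Geonames lookup), and keeping the region and country when the city is dropped is also an assumption.

```python
MIN_CITY_POPULATION = 15_000  # threshold stated in the text above


def publishable_geo(city, region, country, city_population):
    """Return (city, region, country), blanking cities that are too small.

    `city_population` is assumed to come from a Geonames lookup;
    None means the city was not found.
    """
    if city_population is None or city_population <= MIN_CITY_POPULATION:
        city = None  # only cities with a population > 15,000 are included
    return (city, region, country)
```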

Category 3 (web activity) or 4 (highly-sensitive) data should be excluded from the set of “safe” dimensions.

Matrix of aggregation safety vs. dimension sensitivity:

Category 1 (Technical) and 2 (Interaction)
  Aggregation Level: 1, 2, 3
  Notes: For low-sensitivity data, we may not require a minimum bucket size for aggregation.

Category 3 (Web Activity)
  Aggregation Level: 1, 2
  Notes: As sensitivity increases, minimum bucket size becomes increasingly important.

Category 4 (Highly Sensitive Data)
  Aggregation Level: 1, 2
  Notes: Technically, category 4 often involves highly sensitive data, such as explicit identifiers, that will be removed in the process of aggregation. We include it here for the sake of completeness.
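The matrix above can be read as a lookup from collection category to the aggregation levels listed for it. The sketch below encodes it that way; the names `ALLOWED_LEVELS` and `aggregation_allowed` are hypothetical, and treating a dataset as acceptable only when the level is listed for every category it contains is our reading of the matrix, not a stated rule.

```python
# Aggregation levels listed in the matrix for each collection category.
ALLOWED_LEVELS = {
    1: {1, 2, 3},  # Technical
    2: {1, 2, 3},  # Interaction
    3: {1, 2},     # Web Activity
    4: {1, 2},     # Highly Sensitive Data
}


def aggregation_allowed(categories, level):
    """Return True if `level` is listed for every category in `categories`."""
    return all(level in ALLOWED_LEVELS[category] for category in categories)
```

For example, `aggregation_allowed([1, 2], 3)` is true, while adding a category 3 dimension to the same dataset makes level 3 fall outside the matrix.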

How we characterize the sensitivity of metrics

Most metrics are not sensitive information, per se. That said, if a metric indicates or directly implies something about revenue, it is “sensitive”. Example: Search counts.

Dataset Publishing Process

The goal of this process is to (1) make the “easy” (that is, safe) data publishing requests relatively frictionless, (2) have guard rails in place so we don’t publish something that exposes us or our users to risk in some way, and (3) ensure that the dataset publishing request process closely matches other processes that are familiar to the data stewards.

Having a dataset published requires filing a bug. Requests will use the nomenclature defined in the preceding sections to answer a series of questions, including the following four. If the answer to all of them is “no”, the data may be published. A “yes” to any of them means extra review is required.

  • Is the level of aggregation 3 or higher?
  • Are there any Data Collection Category 3 (web activity) or 4 (highly-sensitive) dimensions?
  • Do any of the dimensions or metrics include sensitive data?
  • Are there any data included that do not have a corresponding data review for collection? Please link to relevant data review(s).
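The four questions above amount to a simple checklist. The following sketch shows one way to express it; the `PublishingRequest` type and its field names are hypothetical and do not correspond to any actual bug form.

```python
from dataclasses import dataclass


@dataclass
class PublishingRequest:
    """Hypothetical summary of the answers given in a publishing bug."""
    aggregation_level: int                     # 1 to 7, as defined earlier
    collection_categories: set                 # data collection categories present
    has_sensitive_dimensions_or_metrics: bool  # e.g. revenue-related metrics
    all_data_has_collection_review: bool       # every field has a data review


def extra_review_reasons(request):
    """Return the questions answered "yes"; an empty list means no extra review."""
    reasons = []
    if request.aggregation_level >= 3:
        reasons.append("aggregation level is 3 or higher")
    if request.collection_categories & {3, 4}:
        reasons.append("includes category 3 or 4 dimensions")
    if request.has_sensitive_dimensions_or_metrics:
        reasons.append("includes sensitive dimensions or metrics")
    if not request.all_data_has_collection_review:
        reasons.append("some data lacks a data collection review")
    return reasons
```

An empty result does not replace the data steward's check described next; it only mirrors the initial self-assessment in the request.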

A data steward will then be assigned to the bug, either by the requester or as part of bug triage, to double-check that these questions are correctly answered and there are no confounding factors inherent to the publishing of the data.

Once the request is approved, data engineering will do the implementation work:

  • Write (or review) the query
  • Schedule it to update on the desired frequency
  • Plumb it into the public-facing dataset infrastructure, including metadata that links the public data back to the above review bug.

Once the dataset has been published, it will be announced on the Data @ Mozilla blog. Accessing the public data is described on the data documentation page.

Definitions

Metric - A metric is anything we want to measure. Examples: the number of clients that used the developer tools console, the number of active clients.

Dimension - A dimension is a qualitative value such as OS, channel, or date. In practice, a dimension often defines a sub-population on which we can calculate a metric, allowing us to segment the metric for further analysis. Examples: if we have an OS dimension, we can analyze the number of active clients by OS.

Aggregate - A combined value of many measurements (metric values), typically grouped by dimension or sets of dimensions.

Individual-level Data - Data containing a dimension which uniquely identifies a single profile, user, client, etc.

Tabular Data - Data that consists of rows (or records) and columns (or fields). Each row has the same number of columns, and each column represents a dimension or metric for that row. Think of a spreadsheet or CSV file as examples of this type of data.
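As a small illustration of how these terms fit together, the sketch below builds a tiny tabular dataset and aggregates a metric by a dimension. The rows, the `page_loads` column, and the `aggregate` helper are made up for the example.

```python
from collections import defaultdict

# Tabular, individual-level data: client_id identifies a single client.
rows = [
    {"client_id": "A", "os": "Linux", "channel": "nightly", "page_loads": 12},
    {"client_id": "B", "os": "Windows", "channel": "release", "page_loads": 30},
    {"client_id": "C", "os": "Linux", "channel": "release", "page_loads": 8},
]


def aggregate(rows, dimensions, metric):
    """Sum `metric` over rows, grouped by the given list of dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dimension] for dimension in dimensions)
        totals[key] += row[metric]
    return dict(totals)


# Aggregate: total page loads (metric) by OS (dimension).
page_loads_by_os = aggregate(rows, ["os"], "page_loads")
```

The aggregated result no longer contains `client_id`, so it is no longer individual-level data.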

What's Been Published So Far?

Our publicly available datasets are here.