Changes

Data Publishing

779 bytes added, 18:03, 22 September 2020

added the descriptions for aggregation levels.

! Level !! Aggregation !! Examples

|-

| 1 || '''Statistical / ML Models''' A model built/trained using real data. || TAAR, Federated learning models, Forecasting models

|-

| 2 || '''Dimension-level aggregation w/ minimum bucket sizes''' Aggregated by dimensions, with a minimum "bucket" population of 5,000. || Total page loads by country, OS, locale, channel, where any combination with a count less than 5,000 is grouped into “Other”

e.g. [Canada, Linux, “Other locales”, nightly] for rare locales

|-

| 3 || '''Dimension-level aggregation w/o minimum bucket sizes''' Aggregated by dimensions, no minimum bucket size. || Client ID count by country, OS, locale, channel, where there could be a bucket such as [Canada, Linux, PL, nightly] with only one client in it.

|-

| 4 || '''Probabilistic Aggregates''' Data structures for approximations. || [https://en.wikipedia.org/wiki/HyperLogLog HLL] for computing approximate unique client counts, [https://en.wikipedia.org/wiki/Bloom_filter Bloom filter] for testing presence in a set.

|-

| 5 || '''Anonymized individual-level data''' Covers “partial aggregates” like clients_daily, which is aggregated by day. The key feature is that it still has an individual-level identifier. Actual identifiers are anonymized using a one-to-one replacement value; in this example, the IDs are replaced with A, B, C, etc. ||

* Anonymized_id, date, country, os, locale, channel

* A, 2019-08-08, Canada, Linux, PL, nightly

* B, 2019-08-10, Peru, Windows, EN, release

|-

| 6 || '''Not-anonymized individual-level data''' This data contains individual-level identifiers as they exist in the raw data. Compared with anonymized data, instead of A, B, and C we use the original identifiers. ||

* actual_id, date, country, os, locale, channel

* 859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d, 2019-08-08, Canada, Linux, PL, nightly

* 4db8d07d-1935-9c45-93c9-6d97a790bb12, 2019-08-10, Peru, Windows, EN, release

|-

| 7 || '''High resolution individual-level data''' The highest level of resolution: events released at per-second or sub-second resolution. || Raw telemetry event data, i.e. a sequence of actions in order of occurrence.

|-

|}
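As a rough illustration of Levels 2 and 5 in the table above, the following Python sketch shows one way minimum bucket sizes and one-to-one ID replacement could work. It is not taken from any Mozilla pipeline: the function names, the `anon-N` label format, and the choice to fold an entire rare combination into a single "Other" key (rather than only the rare dimension, as in the "Other locales" example) are assumptions made for brevity.

```python
from collections import Counter

MIN_BUCKET_SIZE = 5000  # threshold quoted in the Level 2 description


def aggregate_with_min_bucket(rows, min_size=MIN_BUCKET_SIZE):
    """Count rows per dimension tuple (Level 2).

    Any combination seen fewer than min_size times is folded into a single
    "Other" key. Assumption: the whole combination is folded, not just the
    rare dimension.
    """
    counts = Counter(rows)
    result = Counter()
    for dims, n in counts.items():
        key = dims if n >= min_size else "Other"
        result[key] += n
    return dict(result)


def anonymize_ids(records):
    """Replace each identifier with a stable one-to-one label (Level 5).

    The "anon-N" label format is hypothetical; the table uses A, B, C.
    """
    mapping = {}
    anonymized = []
    for actual_id, *rest in records:
        if actual_id not in mapping:
            mapping[actual_id] = f"anon-{len(mapping)}"
        anonymized.append((mapping[actual_id], *rest))
    return anonymized


if __name__ == "__main__":
    rows = [("Canada", "Linux", "EN", "nightly")] * 3 + [
        ("Canada", "Linux", "PL", "nightly")
    ]
    # A tiny threshold is used here only so the demo data triggers folding.
    print(aggregate_with_min_bucket(rows, min_size=2))

    records = [
        ("859c8a32-0b73-b547-a5e7-8ef4ed9c4c2d", "2019-08-08", "Canada"),
        ("4db8d07d-1935-9c45-93c9-6d97a790bb12", "2019-08-10", "Peru"),
    ]
    print(anonymize_ids(records))
```

Two caveats apply to this sketch: the "Other" bucket can itself end up below the threshold, and sequential labels reveal the order in which identifiers first appeared, so a real release would likely need additional safeguards.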

Mreid (Confirm, 36 edits)
