| | #REDIRECT [[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md]] |
| | |
| =Data Set Documentation=
| This document describes a set of datasets that can be queried using re:dash at sql.telemetry.mozilla.org (s.t.m.o). They can also be queried from a Spark cluster; see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark; see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial].
| |
|
| |
| ==Longitudinal==
| |
| [[Telemetry/LongitudinalExamples|Complete documentation]]
| |
|
| |
| {{longitudinal data intro}}
| |
|
| |
| ==Main Summary==
| |
| [https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md Complete Documentation]
| |
|
| |
| Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.
| |
|
| |
| ===Caveats===
| |
| Querying main summary on re:dash/s.t.m.o can '''impact performance for other users''' and can '''take a while to complete''' (about 30 minutes for simple queries). Because main summary includes a row for every ping, queries scan a large number of records and can consume a lot of resources on the shared cluster.
| |
|
| |
| Instead, we recommend using the Longitudinal dataset where possible when querying from re:dash/s.t.m.o. The Longitudinal dataset is a 1% sample of all data, organized by client_id. In the rare case where a main summary query is necessary, limit it to a short submission_date_s3 range and, ideally, filter on the sample_id field. Users who need the full dataset should preferably use Spark.
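As a rough illustration of the advice above, the sketch below builds a query restricted to a short submission_date_s3 range and a single sample_id value. The table name <code>main_summary</code>, the <code>YYYYMMDD</code> string format for submission_date_s3, and the string-valued sample_id in the range 0 to 99 are assumptions for illustration; check the Main Summary documentation before relying on them. The helper <code>build_query</code> is hypothetical.

```python
from datetime import date


def build_query(start, end, sample_id, table="main_summary"):
    """Build a SQL string limited to a date range and one sample_id bucket.

    Assumptions (not confirmed by this page): the table is named
    main_summary, submission_date_s3 is a YYYYMMDD string, and sample_id
    is a string holding an integer from 0 to 99.
    """
    if end < start:
        raise ValueError("end must not be before start")
    if not 0 <= sample_id < 100:
        raise ValueError("sample_id is assumed to be in 0..99")
    return (
        "SELECT count(*) AS n_pings\n"
        f"FROM {table}\n"
        f"WHERE submission_date_s3 BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        f"  AND sample_id = '{sample_id}'"
    )


# Three days of data, one sample_id bucket.
query = build_query(date(2017, 1, 1), date(2017, 1, 3), 42)
print(query)
```

Keeping both filters in the WHERE clause means the query only has to read a small slice of the table rather than every ping.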
| |
|
| |
| ==Cross Sectional==
| |
| The Cross Sectional dataset is a simplified version of the Longitudinal dataset.
| |
|
| |
| The majority of Longitudinal columns contain array values with one element for each ping, which are difficult to work with in SQL. The Cross Sectional dataset '''replaces these array-valued columns with summary statistics'''. For example, the Longitudinal dataset contains a column named "geo_country" where each row is an array of locales for one client (e.g. array<"en_US", "en_US", "en_GB">). The Cross Sectional dataset instead includes a column named "geo_country_mode" where each row contains a single string representing the mode of that array (e.g. "en_US"). The Cross Sectional column is '''easier to work with''' in SQL and is more representative than a single value picked from the Longitudinal array.
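The reduction from an array-valued column to its mode can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to build the dataset; the row layout and the helper <code>mode</code> are hypothetical, and how the real job breaks ties is not covered here.

```python
from collections import Counter


def mode(values):
    """Return the most frequent value in a list, or None for an empty list."""
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


# Hypothetical Longitudinal-style row: one array element per ping.
longitudinal_row = {
    "client_id": "client-a",
    "geo_country": ["en_US", "en_US", "en_GB"],
}

# Cross Sectional-style row: the array is replaced by a summary statistic.
cross_sectional_row = {
    "client_id": longitudinal_row["client_id"],
    "geo_country_mode": mode(longitudinal_row["geo_country"]),
}
print(cross_sectional_row)
```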
| |
|
| |
| Note that the Cross Sectional dataset is derived from the Longitudinal dataset, so it is also a '''1% sample of main pings'''.
| |
|
| |
| This dataset is sometimes abbreviated as the '''xsec dataset'''. You can find the current version of the code [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala here]. This dataset is under active development; please '''contact rharter@mozilla.com with any questions'''.
| |
|
| |
| ==Client Count==
| |
|
| |
| The Client Count dataset is a count of clients in a time period, broken down by a set of dimensions.
| |
|
| |
| This is useful for questions such as ''How many users of type X were there during Y?'', where X is a combination of dimensions and Y is a date range. Examples of X are: E10s Enabled, Operating System Type, or Country. For a complete list of dimensions, see [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22 here].
| |
|
| |
| Client Count does not contain a traditional integer count column. Instead, the counts are stored as HyperLogLogs in the hll column. The count for a single hll is found using <code>cardinality(cast(hll AS HLL))</code>, and multiple hlls can be merged using <code>merge(cast(hll AS HLL))</code>. An example can be found in the [https://sql.telemetry.mozilla.org/queries/81/source#129 Firefox ER Reporting] query.
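To see why the counts are merged rather than summed, the sketch below uses plain Python sets as an exact stand-in for HyperLogLog sketches. A real HLL is an approximate, fixed-size structure, so this is not how the hll column is implemented; the rows, client ids, and the helpers <code>merge</code> and <code>cardinality</code> are hypothetical and only mirror the roles of the SQL functions named above.

```python
# Each row stands for one combination of dimensions. A real row would hold
# an HLL sketch; an exact set of client ids is used here as a stand-in.
rows = [
    {"os": "Windows", "country": "US", "clients": {"c1", "c2", "c3"}},
    {"os": "Windows", "country": "DE", "clients": {"c3", "c4"}},
    {"os": "Linux", "country": "US", "clients": {"c5"}},
]


def merge(sketches):
    """Combine several sketches into one (set union in this stand-in)."""
    merged = set()
    for sketch in sketches:
        merged |= sketch
    return merged


def cardinality(sketch):
    """Number of distinct clients represented by a sketch."""
    return len(sketch)


windows = [row["clients"] for row in rows if row["os"] == "Windows"]

# Merge first, then count: client c3 appears in two rows but counts once.
merged_count = cardinality(merge(windows))

# Summing per-row counts would count c3 twice.
summed_count = sum(cardinality(sketch) for sketch in windows)

print(merged_count, summed_count)
```

In this toy data, merging gives 4 distinct Windows clients while summing the per-row counts gives 5, which is why an aggregate over several rows should merge the sketches before taking the cardinality.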
| |
|
| |
| ===Caveats===
| |
|
| |
| Currently there is no Python wrapper for the HyperLogLog library, so the client count dataset is unavailable in Spark.
| |
|
| |
| ==Crash Aggregates==
| |
| [https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md Complete Documentation]
| |
|
| |
| The Crash Aggregates dataset compiles crash statistics over various dimensions for each day. Example dimensions include channel and country; example statistics include usage hours and plugin crashes. See the [https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md complete documentation] for all available dimensions and statistics.
| |
|
| |
| This dataset is good for queries of the form ''How many crashes did users of type X experience during time Y?'' and ''Which types of users crashed the most during time Y?''.
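A query of the second form amounts to grouping by a dimension and comparing crash rates. The Python sketch below illustrates that aggregation on made-up rows; the field names <code>usage_hours</code> and <code>plugin_crashes</code>, the values, and the helper <code>crash_rate_by</code> are hypothetical, so consult the complete documentation for the real schema.

```python
from collections import defaultdict

# Made-up daily aggregate rows; the real dataset has more dimensions
# and statistics.
rows = [
    {"channel": "release", "country": "US", "usage_hours": 1000.0, "plugin_crashes": 5},
    {"channel": "release", "country": "DE", "usage_hours": 500.0, "plugin_crashes": 1},
    {"channel": "beta", "country": "US", "usage_hours": 100.0, "plugin_crashes": 2},
]


def crash_rate_by(rows, dimension):
    """Plugin crashes per 1000 usage hours, grouped by one dimension."""
    totals = defaultdict(lambda: [0.0, 0])  # dimension value -> [hours, crashes]
    for row in rows:
        entry = totals[row[dimension]]
        entry[0] += row["usage_hours"]
        entry[1] += row["plugin_crashes"]
    return {
        key: 1000.0 * crashes / hours
        for key, (hours, crashes) in totals.items()
        if hours > 0  # skip groups with no recorded usage
    }


rates = crash_rate_by(rows, "channel")
worst_channel = max(rates, key=rates.get)
print(rates, worst_channel)
```

Normalizing by usage hours keeps a small but crash-prone group from being hidden behind a larger group with more crashes in absolute terms.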
| |
|
| |
|
| ==Mobile Metrics==
| The android_events, android_clients, android_addons, and mobile_clients tables are documented here:
| https://wiki.mozilla.org/Mobile/Metrics/Redash
| |
| This document now lives here:
| https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md