Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

Telemetry/Available Telemetry Datasets and their Applications (view source)

Revision as of 21:39, 7 November 2016

4,701 bytes removed, 7 November 2016

Deprecate wiki based d10n

Harter

54

edits

@@ Line 1: / Line 1: @@
+#REDIRECT [[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md]]
 =Data Set Documentation=
-This document describes a set of datasets which can be queried using re:dash/sql.telemetry.mozilla.org (s.t.m.o). They can also be queried using a Spark cluster; see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark; see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial].
-==Longitudinal==
-[[Telemetry/LongitudinalExamples|Complete documentation]]
-{{longitudinal data intro}}
-==Main Summary==
-[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md Complete Documentation]
-Like the longitudinal dataset, main summary summarizes [https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html main pings]. Each row corresponds to a single ping. This table does no sampling and includes all desktop pings.
-===Caveats===
-Querying against main summary on re:dash/s.t.m.o can '''impact performance for other users''' and can '''take a while to complete''' (~30 minutes for simple queries). Because main summary includes a row for every ping, it contains a large number of records, and queries over it can consume a lot of resources on the shared cluster.
-Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/s.t.m.o. The Longitudinal dataset samples to 1% of all data and organizes the data by client_id. In the odd case where queries against main summary are necessary, limit them to a short submission_date_s3 range and, where possible, make use of the sample_id field. Users who require this dataset should ideally use Spark.
-==Cross Sectional==
-The Cross Sectional dataset is a simplified version of the Longitudinal dataset.
-The majority of Longitudinal columns contain array values with one element for each ping, which is difficult to work with in SQL. The Cross Sectional dataset '''replaces these array-valued columns with summary statistics'''. To give an example, the Longitudinal dataset will contain a column named "geo_country" where each row is an array of locales for one client (e.g. array<"en_US", "en_US", "en_GB">). Instead, the Cross Sectional dataset includes a column named "geo_country_mode" where each row contains a single string representing the mode (e.g. "en_US"). The Cross Sectional column is '''easier to work with''' in SQL and is more representative than just choosing a single value from the Longitudinal array.
-Note that the Cross Sectional dataset is derived from the Longitudinal dataset, so it is also a '''1% sample of main pings'''.
-This dataset is sometimes abbreviated as the '''xsec dataset'''. You can find the current version of the code [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrossSectionalView.scala here]. This dataset is under active development; please '''contact rharter@mozilla.com with any questions'''.
-==Client Count==
-The Client Count dataset is simply a count of clients in a time period, separated out into a set of dimensions.
-This is useful for questions similar to: ''How many X type of users were there during Y?'', where X is a set of dimensions and Y is a date range. Examples of X are: E10s Enabled, Operating System Type, or Country. For a complete list of dimensions, see [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/ClientCountView.scala#L22 here].
-Client Count does not contain a traditional int count column; instead, the counts are stored as HyperLogLogs in the hll column. The count of an hll is found using <code>cardinality(cast(hll AS HLL))</code>, and different hlls can be merged using <code>merge(cast(hll AS HLL))</code>. An example can be found in the [https://sql.telemetry.mozilla.org/queries/81/source#129 Firefox ER Reporting].
-===Caveats===
-Currently there is no Python wrapper for the HyperLogLog library, so the client count dataset is unavailable in Spark.
-==Crash Aggregates==
-[https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md Complete Documentation]
-The Crash Aggregates dataset compiles crash statistics over various dimensions for each day. Example dimensions include channel and country; example statistics include usage hours and plugin crashes. See the [https://github.com/mozilla/telemetry-batch-view/blob/master/docs/CrashAggregateView.md complete documentation] for all available dimensions and statistics.
-This dataset is good for queries of the form ''How many crashes did X types of users get during time Y?'' and ''Which types of users crashed the most during time Y?''.
-==Mobile Metrics==
+This document now lives here:
-The android_events, android_clients, android_addons, and mobile_clients tables are documented here:
+https://github.com/mozilla/telemetry-batch-view/blob/master/docs/choosing_a_dataset.md
-https://wiki.mozilla.org/Mobile/Metrics/Redash
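
The removed Cross Sectional section describes replacing an array-valued Longitudinal column with a single summary statistic such as the mode. A minimal sketch of that idea follows, using the example values from the removed text. It is not the code from CrossSectionalView.scala: the helper <code>mode_of</code>, the dictionary layout, and the client_id value are hypothetical and exist only for illustration.

```python
from collections import Counter


def mode_of(values):
    """Return the most frequent element of a non-empty list (hypothetical helper)."""
    # Counter.most_common(1) returns [(value, count)] for the most frequent value.
    return Counter(values).most_common(1)[0][0]


# Hypothetical Longitudinal-style row: one array element per ping for this client.
longitudinal_row = {
    "client_id": "client-a",  # made-up identifier
    "geo_country": ["en_US", "en_US", "en_GB"],  # example values from the removed text
}

# Cross Sectional-style row: the array is replaced by a single summary value.
cross_sectional_row = {
    "client_id": longitudinal_row["client_id"],
    "geo_country_mode": mode_of(longitudinal_row["geo_country"]),
}

print(cross_sectional_row)
```

A single summary column like this can be compared or grouped on directly in SQL, which is the convenience the removed text attributes to the Cross Sectional dataset.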
