Telemetry/Available Telemetry Datasets and their Applications: Difference between revisions

Revision as of 20:23, 20 September 2016

548 bytes added , 20 September 2016

frank changes

Fbertsch

@@ Line 1: / Line 1: @@
 =Data Set Documentation=
+This document describes a set of datasets which can be queried using re:dash/sql.telemetry.mozilla.org (s.t.m.o). In addition, they can be queried using a Spark cluster; see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark; see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial].
 ==Longitudinal==
 [[Telemetry/LongitudinalExamples|Complete documentation]]
@@ Line 13: / Line 15: @@
 Querying against main summary on SQL.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30m for simple queries). Since main summary includes a row for every ping, there are a large number of records which can consume a lot of resources on the shared cluster.
-Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/sql.t.m.o. The longitudinal dataset samples 1% of all data and organizes it by client_id. In the rare case where queries against main summary are necessary, limit them to a short submission_date_s3 range and, where possible, filter on the sample_id field. Even better, try using Spark.
+Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/s.t.m.o. The longitudinal dataset samples 1% of all data and organizes it by client_id. In the rare case where queries against main summary are necessary, limit them to a short submission_date_s3 range and, where possible, filter on the sample_id field. Ideally, users who require this dataset would use Spark.
 ==Cross Sectional==