(→Cross Sectional: Clarify XSec docs) |
(frank changes) |
||
Line 1: | Line 1: | ||
=Data Set Documentation= | =Data Set Documentation= | ||
This document describes a set of datasets which can be queried using re:dash/sql.telemetry.mozilla.org (s.t.m.o). In addition, they can be queried using a Spark cluster - see [https://wiki.mozilla.org/Telemetry/Custom_analysis_with_spark#How_can_I_load_parquet_datasets_in_a_Jupyter_notebook.3F these directions]. The Longitudinal dataset is also available natively within Spark, see the [https://github.com/mozilla/emr-bootstrap-spark/blob/master/examples/Longitudinal%20Dataset%20Tutorial.ipynb longitudinal tutorial]. | |||
==Longitudinal== | ==Longitudinal== | ||
[[Telemetry/LongitudinalExamples|Complete documentation]] | [[Telemetry/LongitudinalExamples|Complete documentation]] | ||
Line 13: | Line 15: | ||
Querying against main summary on SQL.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30m for simple queries). Since main summary includes a row for every ping, there are a large number of records which can consume a lot of resources on the shared cluster. | Querying against main summary on SQL.t.m.o/re:dash can '''impact performance for other users''' and can '''take a while to complete''' (~30m for simple queries). Since main summary includes a row for every ping, there are a large number of records which can consume a lot of resources on the shared cluster. | ||
Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/ | Instead, we recommend using the Longitudinal dataset where possible if querying from re:dash/s.t.m.o. The longitudinal dataset samples to 1% of all data and organizes the data by client_id. In the rare case where a query against main summary is necessary, limit it to a short submission_date_s3 range and, ideally, filter on the sample_id field. Users who require this dataset should ideally use Spark. | ||
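The advice to bound main summary queries by submission_date_s3 and sample_id can be sketched as a small Python helper that builds the SQL text. This is only an illustration under stated assumptions: the table name <code>main_summary</code>, the <code>YYYYMMDD</code> string format of submission_date_s3, the string-typed sample_id value, and the helper name <code>build_main_summary_query</code> are not confirmed by this page.

```python
from datetime import date, timedelta


def build_main_summary_query(end_day, days=7, sample_id="42"):
    """Return SQL text restricted to a short date range and a single sample."""
    if days < 1:
        raise ValueError("days must be at least 1")
    start_day = end_day - timedelta(days=days - 1)
    # Assumption: submission_date_s3 is stored as a YYYYMMDD string.
    start = start_day.strftime("%Y%m%d")
    end = end_day.strftime("%Y%m%d")
    # Assumption: the table is named main_summary and sample_id is a string.
    return (
        "SELECT COUNT(DISTINCT client_id) AS clients\n"
        "FROM main_summary\n"
        f"WHERE submission_date_s3 BETWEEN '{start}' AND '{end}'\n"
        f"  AND sample_id = '{sample_id}'"
    )


# Build a query covering the seven days ending 2017-01-31 for one sample.
print(build_main_summary_query(date(2017, 1, 31), days=7))
```

Keeping both filters in the WHERE clause means the shared cluster only has to scan a few days of data for a single sample rather than every ping.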
==Cross Sectional== | ==Cross Sectional== |