Telemetry/LongitudinalExamples: Difference between revisions

From MozillaWiki
Jump to navigation Jump to search
(Add an example about the scalars)
(Deprecate wiki based d10n)
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
===Introduction===
This document is now hosted here:
{{longitudinal data intro}}
https://github.com/mozilla/telemetry-batch-view/blob/master/docs/longitudinal_examples.md


=== Table structure ===
Wiki pages linking to this dead page:
 
https://wiki.mozilla.org/api.php?action=query&list=backlinks&bltitle=Telemetry/LongitudinalExamples
To get an overview of the longitudinal data table:
DESCRIBE longitudinal
 
That table has a row for each client, with columns for the different parts of the ping.
There are a lot of fields here, so I recommend downloading the results as a CSV if you want to search through these fields.
Unfortunately, there's no way to filter the output of DESCRIBE in Presto.
 
Because this table combines all rows for a given client id, most columns contain either Arrays or Maps (described below). A few properties are directly available to query on:
SELECT count(*) AS count
FROM longitudinal
WHERE os = 'Linux'
 
==== Arrays ====
Most properties are arrays, which contain one entry for each submission from a given client (newest first). Note that indexing starts at 1:
SELECT reason[1] AS newest_reason
FROM longitudinal
WHERE os = 'Linux'
 
To expand arrays and maps and work on the data row-wise we can use <code>UNNEST(array)</code>.
WITH lengths AS
  (SELECT os, greatest(-1, least(31, sl / (24*60*60))) AS days
    FROM longitudinal
    CROSS JOIN UNNEST(session_length, reason) AS t(sl, r)
    WHERE r = 'shutdown' OR r = 'aborted-session')
SELECT os, days, count(*) AS count
FROM lengths
GROUP BY days, os ORDER BY days ASC
 
However, it may be better to use a sample from the main_summary table instead.
 
Links:
* [https://prestodb.io/docs/current/functions/array.html Documentation on array functions]
* [https://prestodb.io/docs/current/sql/select.html#unnest <code>UNNEST</code> documentation]
 
==== Maps ====
Some fields like <code>active_addons</code> or <code>user_prefs</code> are handled as maps, on which you can use the <code>[]</code> operator and special functions:
 
WITH adp AS
  (SELECT active_addons[1]['{d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}']
            IS NOT null AS has_adblockplus
    FROM longitudinal)
SELECT has_adblockplus, count(*) AS count
FROM adp GROUP BY 1 ORDER BY 2 DESC
 
Links:
* [https://prestodb.io/docs/current/functions/map.html Documentation on map functions]
 
=== Sampling ===
 
While composing queries, it can be helpful to work on small samples to reduce query runtimes:
 
SELECT * FROM longitudinal LIMIT 1000 ...
 
There's no need to use other sampling methods, such as TABLESAMPLE, on the longitudinal set. Rows are randomly ordered, so a LIMIT sample is expected to be random.
 
=== Example Queries ===
* Blocklist URLs (extensions.blocklist.url):
SELECT bl, COUNT(bl)
FROM
  (SELECT element_at(settings, 1).user_prefs['extensions.blocklist.url'] AS bl
    FROM longitudinal)
GROUP BY bl
 
* Blocklist enabled/disabled (extensions.blocklist.enabled) count:
SELECT bl, COUNT(bl)
FROM
  (SELECT element_at(settings, 1).blocklist_enabled AS bl
    FROM longitudinal)
GROUP BY bl
 
* Parsing most recent submission_date
SELECT DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z') as parsed_submission_date
FROM longitudinal
 
* Limiting to most recent ping in the last 7 days
SELECT * FROM longitudinal
WHERE DATE_DIFF('day', DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z'), current_date) < 7
 
* Scalar measurement (how many users with more than 100 tabs)
WITH samples AS
  (SELECT
    client_id,
    normalized_channel as channel,
    mctc AS max_concurrent_tabs
  FROM longitudinal
  CROSS JOIN UNNEST(scalar_parent_browser_engagement_max_concurrent_tab_count) as t (mctc)
  WHERE
    scalar_parent_browser_engagement_max_concurrent_tab_count is not null and
    normalized_channel = 'nightly')
SELECT approx_distinct(client_id) FROM samples WHERE max_concurrent_tabs > 100
 
=== Using Views ===
 
If you find yourself copy/pasting SQL between different queries, consider using a Presto VIEW to allow for code reuse. Views create logical tables which you can reuse in other queries. For example, [https://sql.telemetry.mozilla.org/queries/776/source this view] defines some important filters and derived variables which are then used in [https://sql.telemetry.mozilla.org/queries/777/source#1311 this downstream query].
 
You can define a view by prefixing your query with
CREATE OR REPLACE VIEW view_name AS ...
Be careful not to overwrite an existing view! Using a unique name is important.
 
Find more information [https://prestodb.io/docs/current/sql/create-view.html here].
 
=== Working offline ===
It's often useful to keep a local sample of the l10l data when prototyping an analysis. The data is stored in s3://telemetry-parquet/longitudinal/. Once you have AWS credentials you can copy a shard of the parquet dataset to a local directory using `aws cp [filename] .`
 
To request AWS credentials, see [https://mana.mozilla.org/wiki/display/SVCOPS/Requesting+A+Dev+IAM+account+from+Cloud+Operations this page]. To initiate your AWS config, try `aws configure`
 
=== FAQ ===
==== I'm getting an error, "... cannot be resolved" ====
For some reason, re:dash has trouble parsing SQL strings with double quotes. Try using single quotes instead.
 
=== Other Resources ===
* [https://prestodb.io/docs/current/sql.html Presto Docs]
* [https://docs.treasuredata.com/articles/presto-query-faq Helpful FAQ covering perf/distribution]
* [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L194 Longitudinal schema definition]
* [[Custom_dashboards_with_re:dash]]

Latest revision as of 22:11, 7 November 2016