|
|
| (7 intermediate revisions by 4 users not shown) |
| Line 1: |
Line 1: |
| ===Introduction===
| | This document is now hosted here: |
| {{longitudinal data intro}}
| | https://github.com/mozilla/telemetry-batch-view/blob/master/docs/longitudinal_examples.md |
|
| |
|
| === Table structure ===
| | Wiki pages linking to this dead page: |
| | | https://wiki.mozilla.org/api.php?action=query&list=backlinks&bltitle=Telemetry/LongitudinalExamples |
| Get an overview of the longitudinal data table:
| |
| DESCRIBE longitudinal
| |
| | |
| That table has a row for each client, with columns for the different parts of the ping.
| |
| There are a lot of fields here, so I recommend downloading the results as a CSV if you want to search through these fields.
| |
| Unfortunately, there's no way to filter the outputof DESCRIBE in Presto.
| |
| | |
| Some properties are directly available to query on:
| |
| SELECT count(*) AS count
| |
| FROM longitudinal
| |
| WHERE os = 'Linux'
| |
| | |
| Note that the longitudinal table is from a 1% sample of the clients. The sampling is based on the client id and stable over time.
| |
| | |
| === Sampling ===
| |
| | |
| While composing queries, it can be helpful to work on small samples to reduce query runtimes:
| |
| | |
| SELECT * FROM longitudinal LIMIT 1000 ...
| |
| | |
| For a statistically sound sample, use TABLESAMPLE:
| |
|
| |
| SELECT * FROM longitudinal TABLESAMPLE BERNOULLI(xx)
| |
| | |
| Where xx is an integer representing what percentage of data you want to include in your sample (e.g. 10% sample -> xx=10).
| |
| | |
| A couple of caveats:
| |
| * This sampling method will only decrease your query run time if you're manipulating the data a lot. Bernoulli sampling still requires reading the whole DB before proceeding.
| |
| * This sample will not be deterministic. I.e. you will not get the same sample for every run. This can cause problems when using Presto Views or logical tables.
| |
| * Unlike LIMIT, this method does not guarantee a fixed number of results.
| |
| | |
| === Arrays ===
| |
| Other properties are arrays, which contain one entry for each submission from that client (newest first):
| |
| SELECT reason[1] AS newest_reason
| |
| FROM longitudinal
| |
| WHERE os = 'Linux'
| |
| | |
| To expand arrays and maps and work on the data in them row-wise we can use <code>UNNEST(array)</code>:
| |
| WITH lengths AS
| |
| (SELECT os, greatest(-1, least(31, sl / (24*60*60))) AS days
| |
| FROM longitudinal
| |
| CROSS JOIN UNNEST(session_length, reason) AS t(sl, r)
| |
| WHERE r = 'shutdown' OR r = 'aborted-session')
| |
| SELECT os, days, count(*) AS count
| |
| FROM lengths
| |
| GROUP BY days, os ORDER BY days ASC
| |
| | |
| Links:
| |
| * [https://prestodb.io/docs/current/functions/array.html Documentation on array functions]
| |
| * [https://prestodb.io/docs/current/sql/select.html#unnest <code>UNNEST</code> documentation]
| |
| | |
| === Maps ===
| |
| Some fields like <code>active_addons</code> or <code>user_prefs</code> are handled as maps, on which you can use the <code>[]</code> operator and special functions:
| |
| | |
| WITH adp AS
| |
| (SELECT active_addons[1]['{d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d}']
| |
| IS NOT null AS has_adblockplus
| |
| FROM longitudinal)
| |
| SELECT has_adblockplus, count(*) AS count
| |
| FROM adp GROUP BY 1 ORDER BY 2 DESC
| |
| | |
| Links:
| |
| * [https://prestodb.io/docs/current/functions/map.html Documentation on map functions]
| |
| | |
| === Example Queries ===
| |
| * Blocklist URLs (extensions.blocklist.url):
| |
| SELECT bl, COUNT(bl)
| |
| FROM
| |
| (SELECT element_at(settings, 1).user_prefs['extensions.blocklist.url'] AS bl
| |
| FROM longitudinal)
| |
| GROUP BY bl
| |
| | |
| * Blocklist enabled/disabled (extensions.blocklist.enabled) count:
| |
| SELECT bl, COUNT(bl)
| |
| FROM
| |
| (SELECT element_at(settings, 1).blocklist_enabled AS bl
| |
| FROM longitudinal)
| |
| GROUP BY bl
| |
| | |
| * Parsing most recent submission_date
| |
| SELECT DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z') as parsed_submission date
| |
| FROM longitudinal
| |
| | |
| * Limiting to most recent ping in the last 7 days
| |
| SELECT * FROM longitudinal
| |
| WHERE DATE_DIFF('day', DATE_PARSE(submission_date[1], '%Y-%m-%dT00:00:00.000Z'), current_date) < 7
| |
| | |
| === Using Views ===
| |
| | |
| If you find yourself copy/pasting SQL between different queries, consider using a Presto VIEW to allow for code reuse. Views create logical tables which you can reuse in other queries. For example, [https://sql.telemetry.mozilla.org/queries/776/source this view] defines some important filters and derived variables which are then used in [https://sql.telemetry.mozilla.org/queries/777/source#1311 this downstream query].
| |
| | |
| You can define a view by prefixing your query with
| |
| CREATE OR REPLACE VIEW view_name AS ...
| |
| Be careful not to overwrite an existing view! Using a unique name is important.
| |
| | |
| Find more information [https://prestodb.io/docs/current/sql/create-view.html here].
| |
| | |
| === FAQ ===
| |
| ==== I'm getting an error, "... cannot be resolved" ====
| |
| For some reason, re:dash has trouble parsing SQL strings with double quotes. Try using single quotes instead.
| |
| | |
| === Other Resources ===
| |
| * [https://prestodb.io/docs/current/sql.html Presto Docs]
| |
| * [https://docs.treasuredata.com/articles/presto-query-faq Helpful FAQ covering perf/distribution]
| |
| * [https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/Longitudinal.scala#L194 Longitudinal schema definition]
| |
| * [[Custom_dashboards_with_re:dash]]
| |