CloudServices/DataPipeline: Difference between revisions

 
= Overview =
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry|Telemetry]] data and cloud services server logs. The [[Firefox/Measurement|Firefox Measurement Team]] is building the data pipeline.
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry|Telemetry]] data and cloud services server logs. The ingestion pipeline is one component of the [[Data/Platform|Fx Data Platform]].


= Team Communication =
* IRC channel: #datapipeline
* Mailing list: dev-metrics-pipeline@mozilla.com
* Standup meeting: https://etherpad.mozilla.org/data-pipeline-meeting-notes
* Bugzilla: http://mzl.la/1DOOBZt
= Cross Team Communication =
* FHR mailing list: [https://mail.mozilla.org/listinfo/fhr-dev fhr-dev]
* FHR v4 standup meeting: https://etherpad.mozilla.org/fhr-v4-status
* Cross team coordination meeting (ended 3/19): https://etherpad.mozilla.org/data-pipeline-coordination
= Resources =
=== Pipeline specs/docs ===
* [https://docs.google.com/a/mozilla.com/document/d/1tzPc9hIACNi07psaQEKfpYQho8wuObC_BkMg3QEDIwA/edit#heading=h.vbs9qotdifjb Pipeline technical proposal]
* [[CloudServices/DataPipeline/Metadata|Pipeline Metadata]]


=== Reporting and tools ===
=== Data sets and other documentation ===
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Reporting and monitoring overview]
* [http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/index.html Telemetry Data]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://wiki.mozilla.org/Mobile/Metrics/Redash Mobile Metrics]
 
* [https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md Test Pilot]
=== Planning ===
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
 
= Pipeline Milestones =
* '''Q1 2015''': Launch pipeline prototype
** Architecture decisions completed; production stack up and running with monitoring dashboards
** Business Intelligence/Data Warehouse proof of concept implemented
** Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
** Backprocessing from pipeline datastore implemented
** Analysis by client ID supported
** Pipeline runs in parallel to existing infrastructure; not yet source of truth
* '''Q2 2015''': Pipeline officially supports business use cases
** FHR v4 feeds executive dashboard
** Complete set of use cases TBD (most likely primarily FHR+telemetry use cases)
** Complete set of monitoring and reporting outputs TBD: dashboards, data warehouse, monitoring, self-service access to data
** FHR+telemetry reaches full release on 2015-05-19; pipeline handles full production load
* '''Q3 2015''': Fill out monitoring and reporting capabilities; add sources and use cases
 
= Related Dates and Schedules =
* '''FHR+Telemetry client work'''
** Current plan: FF38
** 2015-02-23 Nightly
** 2015-03-30 Aurora
** 2015-05-11 Release
 
= Work Queue =
Tasks are tracked in Bugzilla: http://mzl.la/1DOOBZt
 
=== Risks and Open Questions ===
* Old-FHR data through pipeline? Yes/No: [telliot]
* Deletes & legal policy [telliot]
* Security review [telliot]


= Code =
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mreid-moz/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data


= Archive =
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Q4 2014: Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
* [https://docs.google.com/a/mozilla.com/document/d/1CTazW99zBK5K40f-fgSyTPw9IXgmFYjQmNhzxTT9Tts/edit?usp=sharing post workweek roadmap]
* [https://id.etherpad.mozilla.org/data-team old etherpad]