CloudServices/DataPipeline: Difference between revisions

From MozillaWiki
(→‎V2 Pipeline: Add a few more pipeline code links)
 
(15 intermediate revisions by 4 users not shown)
Line 1: Line 1:
= Overview =
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry|Telemetry]] data and cloud services server logs. The [[Firefox/Measurement|Firefox Measurement Team]] is building the data pipeline.
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry|Telemetry]] data and cloud services server logs. The ingestion pipeline is one component of the [[Data/Platform|Fx Data Platform]].


= Team Communication =
* IRC channel: #datapipeline
* Mailing list: dev-metrics-pipeline@mozilla.com
* Standup meeting: https://etherpad.mozilla.org/data-pipeline-meeting-notes
* Bugzilla: http://mzl.la/1DOOBZt
= Cross Team Communication =
* FHR mailing list: [https://mail.mozilla.org/listinfo/fhr-dev fhr-dev]
* FHR v4 standup meeting: https://etherpad.mozilla.org/fhr-v4-status
* Cross team coordination meeting (ended 3/19): https://etherpad.mozilla.org/data-pipeline-coordination
= Resources =
=== Pipeline specs/docs ===
* [https://docs.google.com/a/mozilla.com/document/d/1tzPc9hIACNi07psaQEKfpYQho8wuObC_BkMg3QEDIwA/edit#heading=h.vbs9qotdifjb Pipeline technical proposal]
Line 19: Line 7:
* [[CloudServices/DataPipeline/Metadata|Pipeline Metadata]]


=== Reporting and tools ===
=== Data sets and other documentation ===
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Reporting and monitoring overview]
* [http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/index.html Telemetry Data]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://wiki.mozilla.org/Mobile/Metrics/Redash Mobile Metrics]
 
* [https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md Test Pilot]
=== Planning ===
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
 
= Pipeline Milestones =
* '''Q1 2015''': Launch pipeline prototype
** Architecture decisions completed; production stack up and running with monitoring dashboards
** Business Intelligence/Data Warehouse proof of concept implemented
** Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
** Backprocessing from pipeline datastore implemented
** Analysis by client ID supported
** Pipeline runs in parallel to existing infrastructure; not yet source of truth
* '''Q2 2015''': Pipeline officially supports business use cases
** FHR v4 feeds executive dashboard
** Complete set of use cases TBD (most likely FHR+telemetry use cases)
** Complete set of monitoring and reporting outputs TBD: dashboards, data warehouse, monitoring, self-service access to data
** FHR+telemetry hits full release on 2015-05-19; pipeline handles full production load
* '''Q3 2015''': Fill out monitoring and reporting capabilities; add sources and use cases
 
= Related Dates and Schedules =
* '''FHR+Telemetry client work'''
** Current plan: FF38
** 2015-02-23 Nightly
** 2015-03-30 Aurora
** 2015-05-11 Release
 
= Work Queue =
Tracking tasks in bugzilla: http://mzl.la/1DOOBZt
 
=== Risks and Open Questions ===
* Old-FHR data through pipeline? Yes/No: [telliot]
* Deletes & legal policy [telliot]
* Security review [telliot]


= Code =
Line 73: Line 27:
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}


Line 93: Line 51:
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mreid-moz/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
Line 102: Line 62:
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
Line 107: Line 71:


= Archive =
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Q4 2014: Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
* [https://docs.google.com/a/mozilla.com/document/d/1CTazW99zBK5K40f-fgSyTPw9IXgmFYjQmNhzxTT9Tts/edit?usp=sharing post workweek roadmap]
* [https://id.etherpad.mozilla.org/data-team old etherpad]

Latest revision as of 16:44, 21 December 2016

Overview

The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device Telemetry data and cloud services server logs. The ingestion pipeline is one component of the Fx Data Platform.

Pipeline specs/docs

Data sets and other documentation

Code

V2 Pipeline

{|
! Link !! Description
|-
| https://github.com/mozilla-services/data-pipeline || Mozilla Services Data Pipeline
|-
| https://github.com/mozilla-services/lua_sandbox || Generic Lua sandbox for dynamic data analysis
|-
| https://github.com/mozilla-services/mozilla-pipeline-schemas || JSON Schema specifications of pipeline data
|-
| https://github.com/mozilla/pipeline-monitoring-dashboard || Monitoring data quality issues for metrics pipeline
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}

Telemetry

{|
! Link !! Description
|-
| https://github.com/vitillo/telemetry-onboarding || Slides / notebooks for Telemetry Onboarding
|-
| https://github.com/mozilla/telemetry-server || Code for analysis.telemetry.mozilla.org among other things
|-
| https://github.com/bsmedberg/telemetry-experiments-dashboard || A dashboard to track the deployment of Firefox Telemetry Experiments
|-
| https://github.com/mozilla/telemetry-batch-view || A Scala framework to build derived datasets, aka batch views, of Telemetry data.
|-
| https://github.com/mozilla/cerberus || Automatic alert system for telemetry histograms
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/python_moztelemetry || Spark bindings for Mozilla Telemetry
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
|}

Archive