CloudServices/DataPipeline: Difference between revisions

 
(28 intermediate revisions by 4 users not shown)
= Overview =
The cloud services data pipeline ingests data for analysis, monitoring, and reporting. The pipeline currently processes desktop and device [[Telemetry]] data and cloud services server logs. The ingestion pipeline is one component of the [[Data/Platform|Fx Data Platform]]. The data pipeline team also works on [https://docs.services.mozilla.com/heka/ Heka] (a major component of the pipeline implementation), custom dashboards for cloud services projects, and the [[Telemetry]] server.


= Team Communication =
* IRC channel: #datapipeline
* Mailing list: dev-metrics-pipeline@mozilla.com
* Standup meeting: https://etherpad.mozilla.org/data-pipeline-meeting-notes
= Cross Team Communication =
* Cross team coordination meeting: https://etherpad.mozilla.org/data-pipeline-coordination
* FHR mailing list: [https://mail.mozilla.org/listinfo/fhr-dev fhr-dev]
= Resources =
=== Pipeline specs/docs ===
* [https://docs.google.com/a/mozilla.com/document/d/1tzPc9hIACNi07psaQEKfpYQho8wuObC_BkMg3QEDIwA/edit#heading=h.vbs9qotdifjb Pipeline technical proposal]
* [[CloudServices/DataPipeline/HTTPEdgeServerSpecification|HTTP Edge Server Specification]]
=== Reporting and tools ===
* [[CloudServices/DataPipeline/Metadata|Pipeline Metadata]]
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
 
=== Planning ===
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
 
= Pipeline Milestones =
* '''Q1 2015''': Launch pipeline prototype
** Architecture decisions completed; production stack up and running with monitoring dashboards
** Business Intelligence/Data Warehouse proof of concept implemented
** Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
** Backprocessing from pipeline datastore implemented
** Pipeline runs in parallel to existing infrastructure; not yet source of truth
* '''Q2 2015''': Pipeline officially supports business use cases
** FHR v4 feeds executive dashboard
** Complete set of use cases TBD (most likely FHR+telemetry use cases, primarily)
** Complete set of monitoring and reporting outputs TBD: dashboards, data warehouse, monitoring, self-service access to data
** FHR+telemetry reaches full release on 2015-05-19; pipeline handles full production load
* '''Q3 2015''': Fill out monitoring and reporting capabilities; add sources and use cases
 
= Related Dates and Schedules =
* '''FHR+Telemetry client work'''
** Current plan: FF38
** 2015-02-23 Nightly
** 2015-03-30 Aurora
** 2015-05-11 Release
 
= Work Queue =
Tasks are tracked in Bugzilla: http://mzl.la/1DOOBZt


=== Risks and Open Questions ===
* Old-FHR data through pipeline? Yes/No: [telliot]
* Deletes & legal policy [telliot]
=== Data sets and other documentation ===
* [http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/index.html Telemetry Data]
* [https://wiki.mozilla.org/Mobile/Metrics/Redash Mobile Metrics]
* [https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md Test Pilot]


= Code =
=== V2 Pipeline ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/mozilla-services/data-pipeline || Mozilla Services Data Pipeline
|-
| https://github.com/mozilla-services/lua_sandbox || Generic Lua sandbox for dynamic data analysis
|-
| https://github.com/mozilla-services/mozilla-pipeline-schemas || JSON Schema specifications of pipeline data
|-
| https://github.com/mozilla/pipeline-monitoring-dashboard || Monitoring data quality issues for metrics pipeline
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}
 
=== Telemetry ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/vitillo/telemetry-onboarding || Slides / notebooks for Telemetry Onboarding
|-
| https://github.com/mozilla/telemetry-server || Code for analysis.telemetry.mozilla.org among other things
|-
| https://github.com/bsmedberg/telemetry-experiments-dashboard || A dashboard to track the deployment of Firefox Telemetry Experiments
|-
| https://github.com/mozilla/telemetry-batch-view || A Scala framework to build derived datasets, aka batch views, of Telemetry data.
|-
| https://github.com/mozilla/cerberus || Automatic alert system for telemetry histograms
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/python_moztelemetry || Spark bindings for Mozilla Telemetry
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
|}


= Archive =
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Q4 2014: Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
* [https://docs.google.com/a/mozilla.com/document/d/1CTazW99zBK5K40f-fgSyTPw9IXgmFYjQmNhzxTT9Tts/edit?usp=sharing post workweek roadmap]
* [https://id.etherpad.mozilla.org/data-team old etherpad]