CloudServices/DataPipeline: Difference between revisions

From MozillaWiki
m (tweak wording)
 
(52 intermediate revisions by 4 users not shown)
= Overview =
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing cloud services server logs. We're in the process of improving it to support desktop and device telemetry data. The data pipeline team also works on [https://docs.services.mozilla.com/heka/ Heka] (a major component of the pipeline implementation), custom dashboards for cloud services projects, and the [[Telemetry]] server.
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry|Telemetry]] data and cloud services server logs. The ingestion pipeline is one component of the [[Data/Platform|Fx Data Platform]].


= Resources =
=== Pipeline specs/docs ===
* IRC channel: #datapipeline
* [https://docs.google.com/a/mozilla.com/document/d/1tzPc9hIACNi07psaQEKfpYQho8wuObC_BkMg3QEDIwA/edit#heading=h.vbs9qotdifjb Pipeline technical proposal]
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Reporting and monitoring overview]
* [[CloudServices/DataPipeline/HTTPEdgeServerSpecification|HTTP Edge Server Specification]]
* [[CloudServices/DataPipeline/Metadata|Pipeline Metadata]]


=== Data sets and other documentation ===
* [http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/index.html Telemetry Data]
* [https://wiki.mozilla.org/Mobile/Metrics/Redash Mobile Metrics]
* [https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md Test Pilot]

= Pipeline Milestones =
* '''Q4 2014''': Telemetry data running through pipeline
** Server stack deploy in github ("opsified")
** Re-implement monitoring dashboards
* '''Q1 2015''': Launch pipeline prototype
** Architecture decisions completed; production stack up and running
** Business Intelligence/Data Warehouse proof of concept implemented
** Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
** Backprocessing from pipeline datastore implemented
** Pipeline runs in parallel to existing infrastructure; not yet source of truth
* '''Q2 2015''': Pipeline officially supports business use cases
** Complete set of use cases tbd (most likely primarily FHR+telemetry use cases)
** Complete set of monitoring and reporting outputs tbd: dashboards, data warehouse, monitoring, self-service access to data
** FHR+telemetry hits full release 2015-05-19, handle full production load
* '''Q3 2015''': Fill out monitoring and reporting capabilities; add sources and use cases


= Related Dates and Schedules =
* '''FHR+Telemetry client work'''
** Current plan: FF39 Nightly and uplifted to FF38. May not hit this schedule, but the pipeline needs to be ready
** 2015-02-23 Nightly
** 2015-05-19 Release

= Code =
=== V2 Pipeline ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/mozilla-services/data-pipeline || Mozilla Services Data Pipeline
|-
| https://github.com/mozilla-services/lua_sandbox || Generic Lua sandbox for dynamic data analysis
|-
| https://github.com/mozilla-services/mozilla-pipeline-schemas || JSON Schema specifications of pipeline data
|-
| https://github.com/mozilla/pipeline-monitoring-dashboard || Monitoring data quality issues for metrics pipeline
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}


=== Telemetry ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/vitillo/telemetry-onboarding || Slides / notebooks for Telemetry Onboarding
|-
| https://github.com/mozilla/telemetry-server || Code for analysis.telemetry.mozilla.org among other things
|-
| https://github.com/bsmedberg/telemetry-experiments-dashboard || A dashboard to track the deployment of Firefox Telemetry Experiments
|-
| https://github.com/mozilla/telemetry-batch-view || A Scala framework to build derived datasets, aka batch views, of Telemetry data.
|-
| https://github.com/mozilla/cerberus || Automatic alert system for telemetry histograms
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/python_moztelemetry || Spark bindings for Mozilla Telemetry
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
|}

= Work Queue =
Items in the To Do and Doing categories need to be logged in Bugzilla.
=== Risks and Open Questions ===
* Send something to dev-planning? [kparlante, telliot]
* Old-FHR data through pipeline? Yes/No: [telliot]
* Deletes & legal policy [telliot, mreid to provide cost estimate]
=== Needs more discussion ===
* Maintaining a sample data set for faster queries
* Implement a specific flag to determine if data gets warehoused or not
* Integrate Roberto’s Spark data flow into new DWH
** Implies a similar db-backed table of DWH filenames for filtering (listing S3 every time is too slow)
* Elasticsearch (Kibana) output filter
* Complete list of outputs (and filters and any other support)
* Build a shim for debugging CEPs with local data
* Store the “raw raw” data for some period to ensure we’re safe if our code and/or CEP code is badly broken. Can’t just lose data.
** Tee off to short-lived S3 before it goes through the main pipeline?
* BI query example that cross-references data sources
** example: does fxa/sync increase browser usage?
=== To Do ===
* Q4 telemetry: (re) implement telemetry monitoring dashboards [?]
* Q1 BI: define schema for data warehouse (talk to jjensen) [kparlante]
** should use multiple data sources
* Q1 BI: write filter for data warehouse [trink]
* Q1 BI: signal & schedule loading of data warehouse [mreid]
* Q1 BI: Redshift output [trink]
* Q1 BI: set up Domo and/or Tableau to look at MySQL or CSV or whatever is easy [?]
* Q1: Data format spec [kparlante, trink]
** JSON schema, specifically for FHR+telemetry, also anticipate other sources
* Q1: implement best guess at per user sampling [trink]
** follow up with saptarshi for more complex algorithm
=== Doing ===
* Opsify stack [whd]
* Q4 telemetry: Send telemetry data through the pipeline [mreid]
* Q4 telemetry: Larger payloads (32MB) for telemetry [trink]
* risk mitigation: Estimate cost of “full scan” DWH query [mreid]
* risk mitigation: Estimate cost of single DWH delete [mreid]
=== Done ===
* Parallelize sandbox filters (e.g. FHRSearch) [trink]
* Enable Lua JIT [trink]
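
The "per user sampling" item in the To Do list does not say how the sample would be chosen. One common approach, shown here only as a minimal sketch and not as the pipeline's actual method, is to hash a stable client identifier into one of 100 buckets and keep the clients whose bucket falls below the sample percentage. The function name <code>in_sample</code>, the <code>salt</code> parameter, and the identifier format are all hypothetical.

```python
import hashlib


def in_sample(client_id: str, sample_percent: int, salt: str = "") -> bool:
    """Return True if this client falls inside the sample.

    Hypothetical helper: hashes the client id into a bucket in 0..99 and
    keeps the client when the bucket is below sample_percent. The same id
    always lands in the same bucket, so all of a user's submissions are
    either kept or dropped together.
    """
    if not 0 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 0 and 100")
    digest = hashlib.sha256((salt + client_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent


if __name__ == "__main__":
    # Illustrative identifiers only; real client ids are not specified here.
    ids = ["client-%d" % i for i in range(10)]
    kept = [cid for cid in ids if in_sample(cid, 50)]
    print("kept", len(kept), "of", len(ids))
```

Because the decision depends only on the identifier, a client included at a 10% sample rate stays included at any higher rate, which keeps samples of different sizes nested. Changing the assumed <code>salt</code> would draw a different, independent sample.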


= Archive =
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Q4 2014: Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
* [https://docs.google.com/a/mozilla.com/document/d/1CTazW99zBK5K40f-fgSyTPw9IXgmFYjQmNhzxTT9Tts/edit?usp=sharing post workweek roadmap]
* [https://id.etherpad.mozilla.org/data-team old etherpad]

Latest revision as of 16:44, 21 December 2016

Overview

The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device Telemetry data and cloud services server logs. The ingestion pipeline is one component of the Fx Data Platform.

Pipeline specs/docs

Data sets and other documentation

Code

V2 Pipeline

{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/mozilla-services/data-pipeline || Mozilla Services Data Pipeline
|-
| https://github.com/mozilla-services/lua_sandbox || Generic Lua sandbox for dynamic data analysis
|-
| https://github.com/mozilla-services/mozilla-pipeline-schemas || JSON Schema specifications of pipeline data
|-
| https://github.com/mozilla/pipeline-monitoring-dashboard || Monitoring data quality issues for metrics pipeline
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}

Telemetry

{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/vitillo/telemetry-onboarding || Slides / notebooks for Telemetry Onboarding
|-
| https://github.com/mozilla/telemetry-server || Code for analysis.telemetry.mozilla.org among other things
|-
| https://github.com/bsmedberg/telemetry-experiments-dashboard || A dashboard to track the deployment of Firefox Telemetry Experiments
|-
| https://github.com/mozilla/telemetry-batch-view || A Scala framework to build derived datasets, aka batch views, of Telemetry data.
|-
| https://github.com/mozilla/cerberus || Automatic alert system for telemetry histograms
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/python_moztelemetry || Spark bindings for Mozilla Telemetry
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
|}

Archive