CloudServices/DataPipeline: Difference between revisions

(Add work queue)
 
(64 intermediate revisions by 4 users not shown)
= Overview =
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing cloud services server logs. We're in the process of improving it to support desktop and device telemetry data. The data pipeline team also works on [https://docs.services.mozilla.com/heka/ Heka] (a major component of the pipeline implementation), custom dashboards for cloud services projects, and the [[Telemetry]] server.
The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing desktop and device [[Telemetry]] data and cloud services server logs. The ingestion pipeline is one component of the [[Data/Platform|Fx Data Platform]].


= Resources =
=== Pipeline specs/docs ===
* IRC channel: #datapipeline
* [https://docs.google.com/a/mozilla.com/document/d/1tzPc9hIACNi07psaQEKfpYQho8wuObC_BkMg3QEDIwA/edit#heading=h.vbs9qotdifjb Pipeline technical proposal]
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Reporting and monitoring overview]
* [[CloudServices/DataPipeline/HTTPEdgeServerSpecification|HTTP Edge Server Specification]]
* [[CloudServices/DataPipeline/Metadata|Pipeline Metadata]]


=== Data sets and other documentation ===
* [http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/index.html Telemetry Data]
* [https://wiki.mozilla.org/Mobile/Metrics/Redash Mobile Metrics]
* [https://github.com/mozilla/testpilot/blob/master/docs/README-METRICS.md Test Pilot]

= Milestones =
* '''Q4 2014''': Telemetry data running through pipeline
** Server stack deploy in github ("opsified")
** Re-implement monitoring dashboards
* '''Q1 2015''': Launch pipeline prototype
** Architecture decisions completed; production stack up and running
** Business Intelligence/Data Warehouse proof of concept implemented
** Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
** Backprocessing from pipeline datastore implemented
** Pipeline runs in parallel to existing infrastructure; not yet source of truth
* '''Q2 2015''': Pipeline officially supports business use cases
** Complete set of use cases tbd (most likely primarily FHR+telemetry use cases)
** Complete set of monitoring and reporting outputs tbd: dashboards, data warehouse, monitoring, self-service access to data
** FHR+telemetry hits full release 2015-05-19, handle full production load
* '''Q3 2015''': Fill out monitoring and reporting capabilities; add sources and use cases


= Related Dates/Schedules =
* '''FHR+Telemetry client work'''
** Current plan: FF39 Nightly and uplifted to FF38. May not hit this schedule, but the pipeline needs to be ready
** 2015-02-23 Nightly
** 2015-05-19 Release

= Code =
=== V2 Pipeline ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/mozilla-services/data-pipeline || Mozilla Services Data Pipeline
|-
| https://github.com/mozilla-services/lua_sandbox || Generic Lua sandbox for dynamic data analysis
|-
| https://github.com/mozilla-services/mozilla-pipeline-schemas || JSON Schema specifications of pipeline data
|-
| https://github.com/mozilla/pipeline-monitoring-dashboard || Monitoring data quality issues for metrics pipeline
|-
| https://github.com/mozilla-services/heka || Data collection and processing made easy
|-
| https://github.com/mozilla-services/nginx_moz_ingest || HTTP Data Pipeline Ingestion
|-
| https://github.com/trink/hindsight || Data collection and processing made light weight, fast, and more reliable
|}


=== Telemetry ===
{| class="wikitable"
|-
! Link !! Description
|-
| https://github.com/vitillo/telemetry-onboarding || Slides / notebooks for Telemetry Onboarding
|-
| https://github.com/mozilla/telemetry-server || Code for analysis.telemetry.mozilla.org among other things
|-
| https://github.com/bsmedberg/telemetry-experiments-dashboard || A dashboard to track the deployment of Firefox Telemetry Experiments
|-
| https://github.com/mozilla/telemetry-batch-view || A Scala framework to build derived datasets, aka batch views, of Telemetry data.
|-
| https://github.com/mozilla/cerberus || Automatic alert system for telemetry histograms
|-
| https://github.com/mozilla/emr-bootstrap-spark || AWS bootstrap scripts for Mozilla's flavoured Spark setup.
|-
| https://github.com/mozilla/moz-crash-rate-aggregates || Crash Rate Aggregation code
|-
| https://github.com/mozilla/jupyter-notebook-gist || Plugin to create, list, and load GitHub Gists from Jupyter notebooks
|-
| https://github.com/mozilla/jupyter-spark || Jupyter Notebook extension for Apache Spark integration
|-
| https://github.com/mozilla/python_mozaggregator || Aggregator job for telemetry.mozilla.org
|-
| https://github.com/mozilla/python_moztelemetry || Spark bindings for Mozilla Telemetry
|-
| https://github.com/mozilla/telemetry-analysis-service || Eventual home of the revamped a.t.m.o (per Bug 1248688)
|-
| https://github.com/vitillo/telemetry-airflow || Scheduling / workflow management for Telemetry jobs
|-
| https://github.com/vitillo/e10s_analyses || Data analysis relating to Electrolysis / E10s
|-
| https://github.com/mozilla/telemetry-tools || Utility code to work with Mozilla Telemetry data
|}

= Work Queue =
== Risks/Questions ==
Send something to dev-planning? [kparlante, telliot]
Old-FHR data through pipeline? Yes/No: [telliot]
Deletes & legal policy [telliot, mreid to provide cost estimate]
Stewing
Maintaining a sample data set for faster queries
Implement a specific flag to determine if data gets warehoused or not
Integrate Roberto’s spark data flow into new DWH
Implies a similar db-backed table of DWH filenames for filtering (don’t want to list S3 every time - too slow)
Elasticsearch (Kibana) output filter
Complete list of outputs (and filters and any other support)
Build a shim for debugging CEPs with local data
Store the “raw raw” data for some period to ensure we’re safe if our code and/or CEP code is badly broken. Can’t just lose data.
Tee off to short-lived S3 before it goes through the main pipeline?
BI query example that cross references data sources
example: does fxa/sync increase browser usage?
Queueing
Q4 telemetry: (re) implement telemetry monitoring dashboards [?]
Q1 BI: define schema for data warehouse (talk to jjensen) [kparlante]
should use multiple data sources
Q1 BI: write filter for data warehouse [trink]
Q1 BI: signal & schedule loading of data warehouse [mreid]
Q1 BI: redshift output [trink]
Q1 BI: setup domo and/or tableau to look at mysql or csv or whatever is easy [?]
Q1: Data format spec [kparlante, trink]
JSON schema, specifically for FHR+telemetry, also anticipate other sources
Q1: implement best guess at per user sampling [trink]
follow up with saptarshi for more complex algorithm
Doing
Opsify stack [whd]
Q4 telemetry: Send telemetry data through the pipeline [mreid]
Q4 telemetry: Larger payloads (32MB) for telemetry [trink]
risk mitigation: Estimate cost of “full scan” DWH query [mreid]
risk mitigation: Estimate cost of single DWH delete [mreid]
Done
Parallelize sandbox filters (eg FHRSearch) [trink]
Enable Lua JIT [trink]
= Archive =
* [https://docs.google.com/a/mozilla.com/document/d/1QGiXfQ0AHCkJNXfMPArjab8Gq8zIdqDopCBr-1qD3sc/edit?usp=sharing Q4 2014: Reporting and monitoring overview]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Bespoke+Dashboards Bespoke Dashboards]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Cloud+Services+Data Cloud Services Data Projects]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/Data+Sources List of Data Sources]
* [https://mana.mozilla.org/wiki/display/CLOUDSERVICES/V1+Pipeline V1 Pipeline & Data Sources]
* [https://docs.google.com/a/mozilla.com/document/d/1CTazW99zBK5K40f-fgSyTPw9IXgmFYjQmNhzxTT9Tts/edit?usp=sharing post workweek roadmap]
* [https://id.etherpad.mozilla.org/data-team old etherpad]