CloudServices/DataPipeline

Revision as of 22:41, 8 January 2015

Overview

The cloud services data pipeline ingests data for analysis, monitoring and reporting. The pipeline is currently used for processing cloud services server logs. We're in the process of improving it to support desktop and device telemetry data. The data pipeline team also works on Heka (a major component of the pipeline implementation), custom dashboards for cloud services projects, and the Telemetry server.
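
A rough illustration only (not the actual Heka implementation; every name below is made up for the sketch): the pipeline can be thought of as a decode → filter → output pass over incoming submissions.

  import json

  def decode(raw_line):
      # Parse one raw server-log or telemetry submission into a dict.
      return json.loads(raw_line)

  def wanted(msg):
      # Placeholder filter: route only the document types we care about.
      return msg.get("docType") in ("telemetry", "fhr")

  def run(raw_lines, output):
      # decode -> filter -> output: the shape of a single pipeline pass.
      for line in raw_lines:
          msg = decode(line)
          if wanted(msg):
              output(msg)

  # Example: send matching messages to stdout instead of S3/Redshift.
  run(['{"docType": "telemetry", "clientId": "abc"}'], print)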

Resources

Milestones

  • Q4 2014: Telemetry data running through pipeline
    • Server stack deployment in GitHub ("opsified")
    • Re-implement monitoring dashboards
  • Q1 2015: Launch pipeline prototype
    • Architecture decisions completed; production stack up and running
    • Business Intelligence/Data Warehouse proof of concept implemented
    • Ingestion process completed for FHR+telemetry (start collecting on 2015-02-23)
    • Backprocessing from pipeline datastore implemented
    • Pipeline runs in parallel to existing infrastructure; not yet source of truth
  • Q2 2015: Pipeline officially supports business use cases
    • Complete set of use cases TBD (most likely primarily FHR+telemetry use cases)
    • Complete set of monitoring and reporting outputs TBD: dashboards, data warehouse, monitoring, self-service access to data
    • FHR+telemetry hits full release on 2015-05-19; handle full production load
  • Q3 2015: Fill out monitoring and reporting capabilities; add sources and use cases

Related Dates/Schedules

  • FHR+Telemetry client work
    • Current plan: lands in FF39 Nightly and is uplifted to FF38. The client work may not hit this schedule, but the pipeline needs to be ready
    • 2015-02-23 Nightly
    • 2015-05-19 Release

Work Queue

Risks/Questions

  • Send something to dev-planning? [kparlante, telliot]
  • Old FHR data through the pipeline? Yes/No [telliot]
  • Deletes & legal policy [telliot, mreid to provide cost estimate]

Needs more discussion/definition

  • Maintaining a sample data set for faster queries
  • Implement a specific flag to determine whether data gets warehoused or not
  • Integrate Roberto’s Spark data flow into the new DWH
    • Implies a similar db-backed table of DWH filenames for filtering (don’t want to list S3 every time - too slow); see the sketch after this list
  • Elasticsearch (Kibana) output filter
  • Complete list of outputs (and filters and any other support)
  • Build a shim for debugging CEPs with local data
  • Store the “raw raw” data for some period to ensure we’re safe if our code and/or CEP code is badly broken. Can’t just lose data.
    • Tee off to short-lived S3 before it goes through the main pipeline?
  • BI query example that cross-references data sources
    • example: does FxA/Sync increase browser usage?
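
The db-backed table of DWH filenames flagged above could look roughly like the sketch below. The sqlite backend, table layout, and dimension names are assumptions for illustration, not a settled design.

  import sqlite3

  conn = sqlite3.connect("dwh_index.db")
  conn.execute("""
      CREATE TABLE IF NOT EXISTS dwh_files (
          s3_key   TEXT PRIMARY KEY,  -- object name in the warehouse bucket
          doc_type TEXT,              -- e.g. 'telemetry', 'fhr'
          channel  TEXT,              -- e.g. 'nightly', 'release'
          day      TEXT               -- partition date, YYYY-MM-DD
      )
  """)

  def register_file(s3_key, doc_type, channel, day):
      # Record a newly written warehouse file (e.g. from the output filter).
      conn.execute("INSERT OR REPLACE INTO dwh_files VALUES (?, ?, ?, ?)",
                   (s3_key, doc_type, channel, day))
      conn.commit()

  def files_for(doc_type, channel, day):
      # Return just the keys a job needs, without any S3 LIST calls.
      rows = conn.execute(
          "SELECT s3_key FROM dwh_files WHERE doc_type=? AND channel=? AND day=?",
          (doc_type, channel, day))
      return [key for (key,) in rows]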

To Do

  • Q4 telemetry: (re)implement telemetry monitoring dashboards [?]
  • Q1 BI: define schema for data warehouse (talk to jjensen) [kparlante]
    • should use multiple data sources
  • Q1 BI: write filter for data warehouse [trink]
  • Q1 BI: signal & schedule loading of data warehouse [mreid]
  • Q1 BI: Redshift output [trink]
  • Q1 BI: set up Domo and/or Tableau to look at MySQL or CSV or whatever is easy [?]
  • Q1: Data format spec [kparlante, trink]
    • JSON schema, specifically for FHR+telemetry; also anticipate other sources
  • Q1: implement best guess at per-user sampling [trink] (see the sketch after this list)
    • follow up with saptarshi for a more complex algorithm
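
A rough sketch of the "best guess" per-user sampling flagged above: hash the client id into a fixed number of buckets so a given user is always either entirely in or entirely out of the sample. The bucket count, the clientId field name, and the ~1% keep rate are assumptions; the more complex algorithm is the saptarshi follow-up.

  import hashlib

  SAMPLE_BUCKETS = 100   # 100 buckets, so each bucket is ~1% of users
  SAMPLE_KEEP = {0}      # keep bucket 0 only, i.e. roughly a 1% sample

  def sample_bucket(client_id):
      # Map a client id to a stable bucket in [0, SAMPLE_BUCKETS).
      digest = hashlib.sha1(client_id.encode("utf-8")).hexdigest()
      return int(digest, 16) % SAMPLE_BUCKETS

  def in_sample(message):
      # Filter predicate: True if this FHR/telemetry message is in the sample.
      return sample_bucket(message.get("clientId", "")) in SAMPLE_KEEP

  # The same client id always gets the same answer:
  assert in_sample({"clientId": "abc-123"}) == in_sample({"clientId": "abc-123"})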

Doing

  • Opsify stack [whd]
  • Q4 telemetry: Send telemetry data through the pipeline [mreid]
  • Q4 telemetry: Larger payloads (32MB) for telemetry [trink]
  • risk mitigation: Estimate cost of “full scan” DWH query [mreid]
  • risk mitigation: Estimate cost of single DWH delete [mreid]

Done

  • Parallelize sandbox filters (eg FHRSearch) [trink]
  • Enable Lua JIT [trink]