CloudServices/DataPipeline

= Work Queue =
== Risks/Questions ==
* Send something to dev-planning? [kparlante, telliot]
* Old FHR data through the pipeline? Yes/No [telliot]
* Deletes & legal policy [telliot, mreid to provide cost estimate]

== Needs more discussion/definition ==
* Maintaining a sample data set for faster queries
* Implement a specific flag to determine whether data gets warehoused or not
* Integrate Roberto’s Spark data flow into the new DWH
** Implies a similar db-backed table of DWH filenames for filtering; we don’t want to list S3 on every query, which is too slow (see the lookup-table sketch after this list)
* Elasticsearch (Kibana) output filter (see the bulk-indexing sketch after this list)
* Complete list of outputs (and filters and any other support)
* Build a shim for debugging CEPs with local data (see the shim sketch after this list)
* Store the “raw raw” data for some period to ensure we’re safe if our code and/or CEP code is badly broken. We can’t just lose data.
** Tee off to short-lived S3 before it goes through the main pipeline? (see the retention sketch after this list)
* BI query example that cross-references data sources (see the query sketch after this list)
** Example: does FxA/Sync increase browser usage?
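
The db-backed filename table mentioned above could be as small as the sketch below: one table mapping dimension values to S3 keys, so a filter selects candidate files with a single query instead of a LIST call. The table layout and the filename convention in the comments are assumptions, not the pipeline’s actual schema.

<syntaxhighlight lang="python">
# Minimal sketch of a db-backed catalog of DWH filenames, so filters can
# find candidate S3 objects without listing the bucket on every query.
# Table layout and key convention are assumptions, not the real schema.
import sqlite3

conn = sqlite3.connect("dwh_catalog.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS dwh_files (
           s3_key     TEXT PRIMARY KEY,  -- e.g. telemetry/2014-11-20/saved_session.1.log
           doc_type   TEXT NOT NULL,     -- e.g. 'saved_session'
           day        TEXT NOT NULL,     -- YYYY-MM-DD partition
           size_bytes INTEGER NOT NULL
       )"""
)

def register_file(s3_key, doc_type, day, size_bytes):
    """Record a newly warehoused file; called once at upload time."""
    conn.execute(
        "INSERT OR REPLACE INTO dwh_files VALUES (?, ?, ?, ?)",
        (s3_key, doc_type, day, size_bytes),
    )
    conn.commit()

def files_for(doc_type, start_day, end_day):
    """Return the S3 keys a filter should fetch, with no S3 LIST call."""
    rows = conn.execute(
        "SELECT s3_key FROM dwh_files WHERE doc_type = ? AND day BETWEEN ? AND ?",
        (doc_type, start_day, end_day),
    )
    return [r[0] for r in rows]
</syntaxhighlight>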
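For the Elasticsearch (Kibana) output filter, one possible shape is batching records into the bulk API, as in this minimal sketch. The endpoint URL and index name are placeholders, and it assumes a recent Elasticsearch that does not require a mapping type in bulk actions.

<syntaxhighlight lang="python">
# Sketch of an Elasticsearch output filter: batch records and POST them to
# the _bulk endpoint so Kibana can chart them. URL and index name are
# placeholders; error handling is reduced to a status check.
import json
import requests

ES_URL = "http://localhost:9200/_bulk"  # assumed local dev instance

def flush_to_elasticsearch(records, index="pipeline-metrics"):
    """Send a batch of dicts to Elasticsearch using the bulk API."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(rec))
    body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
    resp = requests.post(
        ES_URL, data=body, headers={"Content-Type": "application/x-ndjson"}
    )
    resp.raise_for_status()
    return resp.json()
</syntaxhighlight>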
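A local-data debugging shim could be as simple as driving a CEP’s process function with newline-delimited JSON read from disk. The Filter interface below is hypothetical; the point is only that a filter testable this way never touches the live pipeline.

<syntaxhighlight lang="python">
# Sketch of a local-data shim for debugging a CEP filter: read
# newline-delimited JSON from a file and feed each record to the filter's
# process() callable instead of consuming from the live pipeline.
# The filter interface shown here is hypothetical.
import json

def run_filter_locally(filter_obj, path):
    """Drive a filter with records from a local file and print its output."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            out = filter_obj.process(json.loads(line))
            if out is not None:
                print(json.dumps(out))

class CountByDocType:
    """Toy CEP filter: emit a running count per document type."""
    def __init__(self):
        self.counts = {}
    def process(self, record):
        doc_type = record.get("docType", "unknown")
        self.counts[doc_type] = self.counts.get(doc_type, 0) + 1
        return {"docType": doc_type, "count": self.counts[doc_type]}

# Usage: run_filter_locally(CountByDocType(), "sample_submissions.ndjson")
</syntaxhighlight>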
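For the short-lived S3 tee, an S3 lifecycle rule can handle the expiry so nothing has to remember to delete the raw copies. A minimal boto3 sketch, assuming a hypothetical scratch bucket and a 14-day window:

<syntaxhighlight lang="python">
# Sketch of the "tee off to short-lived S3" idea: write every raw submission
# to a scratch bucket before the main pipeline touches it, and let an S3
# lifecycle rule expire the copies automatically. Bucket name, prefix, and
# the 14-day retention window are assumptions.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "pipeline-raw-raw"  # hypothetical scratch bucket

def configure_retention(days=14):
    """One-time setup: expire everything under raw/ after `days` days."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-raw-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }]
        },
    )

def tee_raw(payload_bytes):
    """Store one raw submission before any decoding or CEP code runs."""
    key = "raw/%s" % uuid.uuid4()
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload_bytes)
    return key
</syntaxhighlight>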
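A cross-source query for the FxA/Sync question might look like the sketch below, run against Redshift (psycopg2 speaks its wire protocol). Every table and column name is hypothetical until the DWH schema is defined.

<syntaxhighlight lang="python">
# Sketch of a cross-source BI query for the FxA/Sync example: join FHR
# usage data against sync account status and compare average active days.
# All table/column names and connection details are hypothetical.
import psycopg2

QUERY = """
SELECT s.has_sync_account,
       AVG(f.active_days_last_28) AS avg_active_days
FROM fhr_clients f
JOIN sync_status s ON s.client_id = f.client_id
GROUP BY s.has_sync_account;
"""

conn = psycopg2.connect(host="dwh.example.internal", dbname="warehouse",
                        user="analyst", password="...")  # placeholder creds
with conn.cursor() as cur:
    cur.execute(QUERY)
    for has_sync, avg_days in cur.fetchall():
        print("sync=%s avg_active_days=%.2f" % (has_sync, avg_days))
</syntaxhighlight>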

== To Do ==
* Q4 telemetry: (re)implement telemetry monitoring dashboards [?]
* Q1 BI: define schema for data warehouse (talk to jjensen) [kparlante]
** Should use multiple data sources
* Q1 BI: write filter for data warehouse [trink]
* Q1 BI: signal & schedule loading of data warehouse [mreid]
* Q1 BI: Redshift output [trink]
* Q1 BI: set up Domo and/or Tableau to look at MySQL or CSV or whatever is easy [?]
* Q1: Data format spec [kparlante, trink]
** JSON schema, specifically for FHR+telemetry; also anticipate other sources (see the schema sketch after this list)
* Q1: implement best guess at per-user sampling [trink] (see the sampling sketch after this list)
** Follow up with saptarshi for a more complex algorithm
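
As a starting point for the data format spec, a common envelope for FHR and telemetry could be expressed directly as JSON Schema. Every field name in this sketch is an assumption, not the agreed spec; the real spec is the Q1 deliverable.

<syntaxhighlight lang="python">
# Sketch of a common envelope schema for FHR and telemetry submissions,
# expressed as JSON Schema and checked with the jsonschema package.
# Every field name here is an assumption.
import jsonschema

ENVELOPE_SCHEMA = {
    "type": "object",
    "required": ["clientId", "docType", "creationDate", "payload"],
    "properties": {
        "clientId":     {"type": "string"},   # stable per-profile UUID
        "docType":      {"type": "string"},   # e.g. "fhr", "saved_session"
        "sourceName":   {"type": "string"},   # anticipate non-Firefox sources
        "creationDate": {"type": "string", "format": "date-time"},
        "payload":      {"type": "object"},   # source-specific document
    },
}

example = {
    "clientId": "c0ffee00-1234-5678-9abc-def012345678",
    "docType": "saved_session",
    "creationDate": "2014-11-20T12:00:00Z",
    "payload": {"simpleMeasurements": {"uptime": 3600}},
}
jsonschema.validate(example, ENVELOPE_SCHEMA)  # raises on structural mismatch
</syntaxhighlight>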
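A “best guess” per-user sample can be made deterministic by hashing the client id into buckets, so a given user is consistently in or out of the sample on every node. The bucket count and fraction below are assumptions pending saptarshi’s algorithm.

<syntaxhighlight lang="python">
# Sketch of deterministic per-user sampling: hash the client id into 100
# buckets and keep a fixed subset, so the same user is always sampled (or
# not) regardless of which node sees the submission. Modulus and sample
# fraction are assumptions.
import hashlib

def in_sample(client_id, sample_pct=1):
    """Deterministically keep roughly sample_pct% of users."""
    digest = hashlib.sha1(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < sample_pct

# Usage: route the record to the sample data set when in_sample() is true.
# print(in_sample("c0ffee00-1234-5678-9abc-def012345678"))
</syntaxhighlight>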

== Doing ==
* Opsify stack [whd]
* Q4 telemetry: send telemetry data through the pipeline [mreid]
* Q4 telemetry: larger payloads (32 MB) for telemetry [trink]
* Risk mitigation: estimate cost of a “full scan” DWH query [mreid] (see the back-of-envelope sketch after this list)
* Risk mitigation: estimate cost of a single DWH delete [mreid]
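
For the full-scan cost estimate, the arithmetic is just request cost plus compute over the whole corpus. The sketch below shows only the shape of the calculation; every number in it is a placeholder to be replaced with measured figures and current AWS prices, not quoted pricing.

<syntaxhighlight lang="python">
# Back-of-envelope sketch for the "full scan" DWH query cost: a full scan
# must GET every warehoused object and read every byte. All numbers below
# (corpus size, object count, prices) are placeholder assumptions.
TOTAL_TB       = 200.0      # assumed warehoused corpus size
OBJECT_COUNT   = 2000000    # assumed number of S3 objects
GET_PER_1000   = 0.0004     # assumed $ per 1,000 GET requests
EC2_HOUR       = 0.50       # assumed $ per worker-hour
SCAN_TB_PER_HR = 1.0        # assumed per-worker scan throughput

request_cost = OBJECT_COUNT / 1000.0 * GET_PER_1000
compute_cost = TOTAL_TB / SCAN_TB_PER_HR * EC2_HOUR
print("requests: $%.2f  compute: $%.2f  total: $%.2f"
      % (request_cost, compute_cost, request_cost + compute_cost))
</syntaxhighlight>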

== Done ==
* Parallelize sandbox filters (e.g. FHRSearch) [trink]
* Enable Lua JIT [trink]