CloudServices/DataPipeline

= Work Queue =
== Risks/Questions ==
* Send something to dev-planning? [kparlante, telliot]
* Old FHR data through the pipeline? Yes/No [telliot]
* Deletes & legal policy [telliot, mreid to provide cost estimate]

== Needs more discussion/definition ==
* Maintaining a sample data set for faster queries
* Implement a specific flag to determine whether data gets warehoused or not
* Integrate Roberto’s Spark data flow into the new DWH
** Implies a similar db-backed table of DWH filenames for filtering; we don’t want to list S3 on every query, which is too slow (see the lookup-table sketch after this list)
* Elasticsearch (Kibana) output filter (see the bulk-indexing sketch after this list)
* Complete list of outputs (and filters and any other support)
* Build a shim for debugging CEPs with local data (see the shim sketch after this list)
* Store the “raw raw” data for some period to ensure we’re safe if our code and/or CEP code is badly broken. We can’t just lose data.
** Tee off to short-lived S3 before it goes through the main pipeline? (see the retention sketch after this list)
* BI query example that cross-references data sources (see the query sketch after this list)
** Example: does FxA/Sync increase browser usage?
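
The db-backed filename table mentioned above could be as small as the sketch below: one table mapping dimension values to S3 keys, so a filter selects candidate files with a single query instead of a LIST call. The table layout and the filename convention in the comments are assumptions, not the pipeline’s actual schema.

<syntaxhighlight lang="python">
# Minimal sketch of a db-backed catalog of DWH filenames, so filters can
# find candidate S3 objects without listing the bucket on every query.
# Table layout and key convention are assumptions, not the real schema.
import sqlite3

conn = sqlite3.connect("dwh_catalog.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS dwh_files (
           s3_key     TEXT PRIMARY KEY,  -- e.g. telemetry/2014-11-20/saved_session.1.log
           doc_type   TEXT NOT NULL,     -- e.g. 'saved_session'
           day        TEXT NOT NULL,     -- YYYY-MM-DD partition
           size_bytes INTEGER NOT NULL
       )"""
)

def register_file(s3_key, doc_type, day, size_bytes):
    """Record a newly warehoused file; called once at upload time."""
    conn.execute(
        "INSERT OR REPLACE INTO dwh_files VALUES (?, ?, ?, ?)",
        (s3_key, doc_type, day, size_bytes),
    )
    conn.commit()

def files_for(doc_type, start_day, end_day):
    """Return the S3 keys a filter should fetch, with no S3 LIST call."""
    rows = conn.execute(
        "SELECT s3_key FROM dwh_files WHERE doc_type = ? AND day BETWEEN ? AND ?",
        (doc_type, start_day, end_day),
    )
    return [r[0] for r in rows]
</syntaxhighlight>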
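For the Elasticsearch (Kibana) output filter, one possible shape is batching records into the bulk API, as in this minimal sketch. The endpoint URL and index name are placeholders, and it assumes a recent Elasticsearch that does not require a mapping type in bulk actions.

<syntaxhighlight lang="python">
# Sketch of an Elasticsearch output filter: batch records and POST them to
# the _bulk endpoint so Kibana can chart them. URL and index name are
# placeholders; error handling is reduced to a status check.
import json
import requests

ES_URL = "http://localhost:9200/_bulk"  # assumed local dev instance

def flush_to_elasticsearch(records, index="pipeline-metrics"):
    """Send a batch of dicts to Elasticsearch using the bulk API."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(rec))
    body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline
    resp = requests.post(
        ES_URL, data=body, headers={"Content-Type": "application/x-ndjson"}
    )
    resp.raise_for_status()
    return resp.json()
</syntaxhighlight>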
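A local-data debugging shim could be as simple as driving a CEP’s process function with newline-delimited JSON read from disk. The Filter interface below is hypothetical; the point is only that a filter testable this way never touches the live pipeline.

<syntaxhighlight lang="python">
# Sketch of a local-data shim for debugging a CEP filter: read
# newline-delimited JSON from a file and feed each record to the filter's
# process() callable instead of consuming from the live pipeline.
# The filter interface shown here is hypothetical.
import json

def run_filter_locally(filter_obj, path):
    """Drive a filter with records from a local file and print its output."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            out = filter_obj.process(json.loads(line))
            if out is not None:
                print(json.dumps(out))

class CountByDocType:
    """Toy CEP filter: emit a running count per document type."""
    def __init__(self):
        self.counts = {}
    def process(self, record):
        doc_type = record.get("docType", "unknown")
        self.counts[doc_type] = self.counts.get(doc_type, 0) + 1
        return {"docType": doc_type, "count": self.counts[doc_type]}

# Usage: run_filter_locally(CountByDocType(), "sample_submissions.ndjson")
</syntaxhighlight>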
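For the short-lived S3 tee, an S3 lifecycle rule can handle the expiry so nothing has to remember to delete the raw copies. A minimal boto3 sketch, assuming a hypothetical scratch bucket and a 14-day window:

<syntaxhighlight lang="python">
# Sketch of the "tee off to short-lived S3" idea: write every raw submission
# to a scratch bucket before the main pipeline touches it, and let an S3
# lifecycle rule expire the copies automatically. Bucket name, prefix, and
# the 14-day retention window are assumptions.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "pipeline-raw-raw"  # hypothetical scratch bucket

def configure_retention(days=14):
    """One-time setup: expire everything under raw/ after `days` days."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-raw-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }]
        },
    )

def tee_raw(payload_bytes):
    """Store one raw submission before any decoding or CEP code runs."""
    key = "raw/%s" % uuid.uuid4()
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload_bytes)
    return key
</syntaxhighlight>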
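A cross-source query for the FxA/Sync question might look like the sketch below, run against Redshift (psycopg2 speaks its wire protocol). Every table and column name is hypothetical until the DWH schema is defined.

<syntaxhighlight lang="python">
# Sketch of a cross-source BI query for the FxA/Sync example: join FHR
# usage data against sync account status and compare average active days.
# All table/column names and connection details are hypothetical.
import psycopg2

QUERY = """
SELECT s.has_sync_account,
       AVG(f.active_days_last_28) AS avg_active_days
FROM fhr_clients f
JOIN sync_status s ON s.client_id = f.client_id
GROUP BY s.has_sync_account;
"""

conn = psycopg2.connect(host="dwh.example.internal", dbname="warehouse",
                        user="analyst", password="...")  # placeholder creds
with conn.cursor() as cur:
    cur.execute(QUERY)
    for has_sync, avg_days in cur.fetchall():
        print("sync=%s avg_active_days=%.2f" % (has_sync, avg_days))
</syntaxhighlight>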

== To Do ==
* Q4 telemetry: (re)implement telemetry monitoring dashboards [?]
* Q1 BI: define schema for data warehouse (talk to jjensen) [kparlante]
** Should use multiple data sources
* Q1 BI: write filter for data warehouse [trink]
* Q1 BI: signal & schedule loading of data warehouse [mreid]
* Q1 BI: Redshift output [trink]
* Q1 BI: set up Domo and/or Tableau to look at MySQL or CSV or whatever is easy [?]
* Q1: Data format spec [kparlante, trink]
** JSON schema, specifically for FHR+telemetry; also anticipate other sources (see the schema sketch after this list)
* Q1: implement best guess at per-user sampling [trink] (see the sampling sketch after this list)
** Follow up with saptarshi for a more complex algorithm
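
As a starting point for the data format spec, a common envelope for FHR and telemetry could be expressed directly as JSON Schema. Every field name in this sketch is an assumption, not the agreed spec; the real spec is the Q1 deliverable.

<syntaxhighlight lang="python">
# Sketch of a common envelope schema for FHR and telemetry submissions,
# expressed as JSON Schema and checked with the jsonschema package.
# Every field name here is an assumption.
import jsonschema

ENVELOPE_SCHEMA = {
    "type": "object",
    "required": ["clientId", "docType", "creationDate", "payload"],
    "properties": {
        "clientId":     {"type": "string"},   # stable per-profile UUID
        "docType":      {"type": "string"},   # e.g. "fhr", "saved_session"
        "sourceName":   {"type": "string"},   # anticipate non-Firefox sources
        "creationDate": {"type": "string", "format": "date-time"},
        "payload":      {"type": "object"},   # source-specific document
    },
}

example = {
    "clientId": "c0ffee00-1234-5678-9abc-def012345678",
    "docType": "saved_session",
    "creationDate": "2014-11-20T12:00:00Z",
    "payload": {"simpleMeasurements": {"uptime": 3600}},
}
jsonschema.validate(example, ENVELOPE_SCHEMA)  # raises on structural mismatch
</syntaxhighlight>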
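A “best guess” per-user sample can be made deterministic by hashing the client id into buckets, so a given user is consistently in or out of the sample on every node. The bucket count and fraction below are assumptions pending saptarshi’s algorithm.

<syntaxhighlight lang="python">
# Sketch of deterministic per-user sampling: hash the client id into 100
# buckets and keep a fixed subset, so the same user is always sampled (or
# not) regardless of which node sees the submission. Modulus and sample
# fraction are assumptions.
import hashlib

def in_sample(client_id, sample_pct=1):
    """Deterministically keep roughly sample_pct% of users."""
    digest = hashlib.sha1(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < sample_pct

# Usage: route the record to the sample data set when in_sample() is true.
# print(in_sample("c0ffee00-1234-5678-9abc-def012345678"))
</syntaxhighlight>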

== Doing ==
* Opsify stack [whd]
* Q4 telemetry: send telemetry data through the pipeline [mreid]
* Q4 telemetry: larger payloads (32 MB) for telemetry [trink]
* Risk mitigation: estimate cost of a “full scan” DWH query [mreid] (see the back-of-envelope sketch after this list)
* Risk mitigation: estimate cost of a single DWH delete [mreid]
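
For the full-scan cost estimate, the arithmetic is just request cost plus compute over the whole corpus. The sketch below shows only the shape of the calculation; every number in it is a placeholder to be replaced with measured figures and current AWS prices, not quoted pricing.

<syntaxhighlight lang="python">
# Back-of-envelope sketch for the "full scan" DWH query cost: a full scan
# must GET every warehoused object and read every byte. All numbers below
# (corpus size, object count, prices) are placeholder assumptions.
TOTAL_TB       = 200.0      # assumed warehoused corpus size
OBJECT_COUNT   = 2000000    # assumed number of S3 objects
GET_PER_1000   = 0.0004     # assumed $ per 1,000 GET requests
EC2_HOUR       = 0.50       # assumed $ per worker-hour
SCAN_TB_PER_HR = 1.0        # assumed per-worker scan throughput

request_cost = OBJECT_COUNT / 1000.0 * GET_PER_1000
compute_cost = TOTAL_TB / SCAN_TB_PER_HR * EC2_HOUR
print("requests: $%.2f  compute: $%.2f  total: $%.2f"
      % (request_cost, compute_cost, request_cost + compute_cost))
</syntaxhighlight>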

== Done ==
* Parallelize sandbox filters (e.g. FHRSearch) [trink]
* Enable Lua JIT [trink]