Unified Telemetry
Overview
We are unifying the Telemetry and Firefox Health Report (FHR) collection systems on the client and sending their data through a single data pipeline. To accomplish this on the client, we are migrating all FHR data to the Telemetry system. The new data pipeline combines features of the old telemetry pipeline with features of the cloud services data pipeline that we use to ingest server log data from Firefox services.
Goals for Unified Telemetry
- On the client, unify the telemetry and FHR measurement systems so that measurements do not have to be implemented more than once in different systems.
- Reduce the latency from the time a measurement occurs until it can be analyzed on the server.
- Increase the accuracy of measurements so that they can be better correlated with the user environment, such as the specific build, enabled add-ons, and other hardware or software factors.
- Use a common data pipeline for client telemetry and service log data.
People and Roles
- Alessio Placitelli, :Dexter (client data collection)
- Georg Fritzsche (client data collection)
- Katie Parlante (eng manager)
- Mark Reid (data pipeline, telemetry server)
- Michael Trinkala, :trink (data pipeline, heka)
- Wesley Dawson, :whd (data pipeline operations)
- Daniel Thornton, :relud (data pipeline operations)
- Benjamin Smedberg (budget, data steward)
- Brendan Colloran (metrics team, data validation)
- Sam Penrose (data validation)
- Roberto Vitillo (Spark analysis tool, telemetry data validation)
- (Telemetry dashboard)
- Thomas Huelbert (project management)
- Stuart Philp (Test automation)
Resources
- Kickoff document
- "Query Requirements" section has list of sample queries/questions that get asked frequently of FHR data
- Format documentation
Milestones
Plan of record, subject to change if acceptance criteria are not met.
Deliverables
- Monitoring and alerting about pipeline health
- Basic tool support
- Telemetry Dashboard works against new pipeline dwh
- Telemetry-dash (or new equivalent) can launch Spark and Heka reporting jobs
- Derived data sets
- Executive dashboard rollup
- 1% sample of clientIds for longitudinal analysis
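One common way to build a stable 1% sample of clientIds is to hash each ID into 100 buckets and keep a single bucket, so the same clients stay in the sample over time. The following is a minimal sketch under that assumption, not the pipeline's actual sampling method; the function names and the chosen bucket are hypothetical.

```python
import hashlib

SAMPLE_BUCKETS = 100  # 100 buckets, so one bucket holds roughly 1% of clients


def sample_bucket(client_id: str) -> int:
    """Map a clientId to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % SAMPLE_BUCKETS


def in_longitudinal_sample(client_id: str, bucket: int = 42) -> bool:
    """Return True if this client falls in the chosen bucket (hypothetical choice)."""
    return sample_bucket(client_id) == bucket
```

Because the bucket depends only on the clientId, every ping from a given client lands on the same side of the sample boundary, which is what longitudinal analysis needs.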
Dates
- 2015-05-29: 39 Beta (slipped due to 38.0.5)
- We start receiving Beta traffic on new pipeline
- FHR v2 data still sent to old pipeline
- saved-session pings to both old telemetry and new pipeline
- main pings go to new pipeline from beta, aurora, and nightly channels
- 2015-06-29: 40 Beta, 39 Release
- No change
- FHR v2 data still sent to old pipeline
- saved-session pings to both old telemetry and new pipeline
- main pings go to new pipeline from beta, aurora, and nightly channels
- 2015-08-11: 40 Release
- FHR v2 data stops
- saved-session pings stop
- main pings sent to new pipeline from all channels
- base data sent from most of release population (unless they've opted out)
Acceptance Criteria (Beta -> Release)
- metrics team signoff
- metrics team analysis can proceed on new data streams
- longitudinal data has internal consistency and consistency with v2: Tracking Bug 1169103
- executive dashboard (in particular MAU)
- search analysis
- pipeline/ops team signoff
- pipeline is ready and can handle capacity
- monitoring and alerting set up
- no blocking issues:
- performance team signoff
- performance team analysis can proceed on new data streams
- <bug tree here>
- qa signoff
- <bug tree here>
- ua signoff
- Doesn't put any burden on the user (prefs are respected, no performance issues, etc.)
- <bug tree here>
Client work
- Backlog as spreadsheet, with estimates
- Bug tree, phase 3: https://bugzilla.mozilla.org/show_bug.cgi?id=1120356
- Bug tree, phase 2: https://bugzilla.mozilla.org/show_bug.cgi?id=1069869 (Done)
- Bug tree, phase 1: https://bugzilla.mozilla.org/show_bug.cgi?id=1040800 (Done)
Pipeline work
- Bugzilla: http://mzl.la/1KWiNST
Data validation
Metrics Team Validation
- https://bugzilla.mozilla.org/show_bug.cgi?id=1134661 (An automated script to compare FHR v2 and FHR v4 results for a sample of users)
- For beta period, rollup fields compare reasonably to v2
- number of sessions
- session lengths
- searches
- default browser status
- places counts
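A check that rollup fields "compare reasonably" could be sketched as a relative-tolerance comparison per field. This is only an illustration: the field names, the 5% tolerance, and the function name are assumptions, and it is not the script tracked in the bug above.

```python
from typing import Dict, List


def compare_rollups(v2: Dict[str, float], v4: Dict[str, float],
                    rel_tol: float = 0.05) -> List[str]:
    """Return the v2 field names whose v4 value is missing or differs by more than rel_tol."""
    mismatches = []
    for field, old in v2.items():
        new = v4.get(field)
        if new is None:
            # The field is absent from the v4 rollup entirely.
            mismatches.append(field)
            continue
        denom = max(abs(old), abs(new))
        if denom == 0:
            # Both values are zero, so they agree.
            continue
        if abs(new - old) / denom > rel_tol:
            mismatches.append(field)
    return mismatches
```

For example, with hypothetical rollups `{"sessions": 100, "searches": 40}` for v2 and `{"sessions": 103, "searches": 30}` for v4, only `searches` would be flagged for follow-up.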
Client Testing
Monitoring Tasks
- Compare a few telemetry measurements between "saved-session" and "main" pings
- Reporting to make sure we don't have broken or incomplete session fragment chains
- unified-FHR quality report: activity latency
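The report on broken or incomplete session fragment chains could be sketched as follows. The sketch assumes each fragment of a session carries its own ID, the ID of the previous fragment, and a counter that starts at 1; the keys `id`, `previous_id`, and `counter` are hypothetical stand-ins, not the actual ping field names.

```python
from typing import Dict, List


def find_chain_problems(fragments: List[Dict]) -> List[str]:
    """Describe gaps, duplicates, and broken links in one session's fragments."""
    problems: List[str] = []
    ordered = sorted(fragments, key=lambda f: f["counter"])
    if not ordered:
        return problems
    if ordered[0]["counter"] != 1:
        problems.append(
            "chain starts at counter %d, expected 1" % ordered[0]["counter"])
    # Walk adjacent pairs and check that each fragment follows the previous one.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["counter"] == prev["counter"]:
            problems.append("duplicate counter %d" % cur["counter"])
        elif cur["counter"] != prev["counter"] + 1:
            problems.append("gap between counters %d and %d"
                            % (prev["counter"], cur["counter"]))
        elif cur["previous_id"] != prev["id"]:
            problems.append("fragment %d does not link to fragment %d"
                            % (cur["counter"], prev["counter"]))
    return problems
```

An empty result means the chain is complete and consistently linked; any returned message marks a session worth investigating.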
Monitors
Investigations
Analysis and Reporting
Tools
- Automated data dump for data validation exercise
- Spark
- Stream processing on real time data
- Reporting using stream processing tools
Communication
- Conversation about unified telemetry on fhr-dev: https://mail.mozilla.org/listinfo/fhr-dev
- Data verification meeting notes: https://etherpad.mozilla.org/fhr-v4-status
- IRC: #telemetry, #datapipeline, #metrics