Unified Telemetry/Status reports/July 17 2015: Difference between revisions

m (Kparlante moved page Status reports/July 17 2015 to Unified Telemetry/Status reports/July 17 2015: move to namespace)
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[https://wiki.mozilla.org/Status_reports/July_10_2015 previous week's report]
placeholder for the weeks report


== Unified Telemetry status report July 17, 2015 ==
Line 7: Line 5:
=== Overall Project Health ===


Green -  
Green - r41 is the go-live release for Unified Telemetry. All issues are triaged and assigned milestones. The dev team continues to focus on data validation.


=== Exec Summary ===
* Ongoing effort to validate data from nightly, aurora and beta channels
* Client work delayed this week by sick time. Send logic and a few other changes planned for uplift to Aurora & Beta next week. Remaining work for 41 waiting for reviews.
* Ongoing effort to prepare pipeline to scale to release traffic
* July 30 milestone for first complete pass of data validation, deployment of pipeline scaling work
* Ongoing effort to make telemetry tools and APIs work with v4 data
* Testing plan is up on the wiki: [[Telemetry/Testing]]
* Working on a mitigation plan for projects that were hoping to analyze release population data in r40
** Data available from nightly, aurora and beta channels now; analysis can begin
** Create python notebooks with example code for these projects
* Re-prioritizing two visualization projects that make use of pre-release data:
** https://bugzilla.mozilla.org/show_bug.cgi?id=1160626 (Map prerelease to release populations)
** https://bugzilla.mozilla.org/show_bug.cgi?id=1160636 (Allow query of "how many users of type X")
* Ongoing planning on FHR V2/V3 historic pipeline migration; status is linked [https://mana.mozilla.org/wiki/display/PM/FHR+historic+pipeline+update+July+6 here].
* Creation of milestones and plan for r41 delivery begins


=== Risks/Issues ===
Line 27: Line 18:
! Description of Risks/Issues !! State !! Owner !! Plan to Resolve/Mitigation !! Target Date
|-
| Data integrity between V2/V4 and V4 internal data consistency || Open || Brendan/Sam || Investigation in progress. Added resources (Sam). https://etherpad.mozilla.org/fhr-v4-validation || 7/15
| Data integrity between V2/V4 and V4 internal data consistency || Open || Brendan/Sam || Investigation in progress. Added resources (Sam). https://etherpad.mozilla.org/fhr-v4-validation || 7/30
|-
| Data continuity across V2/V4 || Open || Katie/Mark/Trink || Mark writing up plan from Whistler; metrics team specifying data sets and reviewing "executive" data set. https://bugzilla.mozilla.org/show_bug.cgi?id=1182684 || 7/15
| Data continuity across V2/V4 || Open || Katie/Mark/Trink || [https://docs.google.com/a/mozilla.com/document/d/1VzQHfzfA-S_lO2wpXDFjDzSJntJCMwP03TzefIj7RrE/edit?usp=sharing Plan], [https://bugzilla.mozilla.org/show_bug.cgi?id=1182684 Metabug] || 7/23
|-
| Legal review || Open || BDS/Legal || Meeting between groups || 8/04
|-
| QA sign off (functional, load) || Open || Stuart || Working with QA on creating test cases/test plans || 8/04
| QA sign off (functional, load) || Open || Stuart || [[Telemetry/Testing]] || 8/04
|-
| Operations - data retention requirements || Open || Travis/Katie || Eng team owes ops a doc defining ping types and data retention requirements || 8/04
|-
| Operations - analysis tools & microservices || Open || Travis/Mark/Roberto || [https://docs.google.com/a/mozilla.com/document/d/1KoLtIFV-aZtxruSVNmcc26F22MfqWjDynKgZ6adYk54/edit?usp=sharing%20 Architecture/Data flow diagram]; meeting next Monday (7/13) || 8/04
| Operations - analysis tools & microservices || Open || Travis/Mark/Roberto || [https://docs.google.com/a/mozilla.com/document/d/1KoLtIFV-aZtxruSVNmcc26F22MfqWjDynKgZ6adYk54/edit?usp=sharing%20 Architecture/Data flow diagram]|| 8/04
|-
| Data loss incident || Open || mreid/whd/trink || [https://bugzilla.mozilla.org/show_bug.cgi?id=1179128 Tee server needs to return error status from old or new]. Added Ops resources (Daniel Thornton). || 7/15
| Data loss incident || Fixed || mreid/whd/trink || [https://bugzilla.mozilla.org/show_bug.cgi?id=1179128 Tee server needs to return error status from old or new]. Added Ops resources (Daniel Thornton). || 7/15
|-
| Remote about:healthreport content || Open || Katie/Georg || Made a request to Laura Thomson for help || 8/04
Line 49: Line 40:


=== Accomplished for Last Period ===
 
Engineering & Ops
* Heka 0.10.0 beta released
* Client work: [https://docs.google.com/spreadsheets/d/1yAJmgCGYyk1d7A41DZa653Z3u2AbH-kDWsO1vPSgbfE/edit?usp=sharing Spreadsheet]
* Updates to the unified telemetry decoder and executive report
** Not uplifting recent send logic changes to Beta (needs more bake time for confidence)
* [https://docs.google.com/a/mozilla.com/document/d/1KoLtIFV-aZtxruSVNmcc26F22MfqWjDynKgZ6adYk54/edit?usp=sharing Architecture flow diagram] in preparation for meeting with ops
** Uplifting a few patches around the send-logic ([uplift2], http://bit.ly/1Je45UA) to Aurora as soon as the send-logic impact is verified
* Progress on data validation
** Remaining client work ([uplift3], http://bit.ly/1TCl4r8) for 41 is manageable and either blocked by info requests or review
** Compare FHR v2 and FHR v4 search, crash, and other fields: https://bugzilla.mozilla.org/show_bug.cgi?id=1179376 -- close agreement for search counts
* Data validation
** Saved-session vs main pings: https://bugzilla.mozilla.org/show_bug.cgi?id=1147395 -- mismatch in about 7% of sessions for one of the metrics investigated
** Generated v4 data set with complete set of pings from all clients seen on nightly: https://bugzilla.mozilla.org/show_bug.cgi?id=1171265#c24
** Work on missing subsessions analysis (hints at a client bug): https://bugzilla.mozilla.org/show_bug.cgi?id=1171268
* Pipeline scaling work
** Finished distributed aggregation work started at workweek: https://github.com/mozilla-services/data-pipeline/pull/93
** Deployed next round of changes
* Telemetry tools and microservices
** Work on memory footprint of the Spark jobs: https://bugzilla.mozilla.org/show_bug.cgi?id=1182499
** Kickoff meeting for deployment plan for telemetry tools and microservices: [https://docs.google.com/a/mozilla.com/document/d/1KoLtIFV-aZtxruSVNmcc26F22MfqWjDynKgZ6adYk54/edit?usp=sharing Architecture flow diagram]
QA
* test cases, bug closing
Project management
* meeting, emails, hand waving


=== Planned for Upcoming Period ===


Engineering
* Uplift final client changes for r40: [https://docs.google.com/spreadsheets/d/1yAJmgCGYyk1d7A41DZa653Z3u2AbH-kDWsO1vPSgbfE/edit?usp=sharing spreadsheet]
* Client
* Data validation: https://etherpad.mozilla.org/fhr-v4-validation
** Do code reviews for deletion pings and choices info bar
* Continue work in the "b5" milestone: http://mzl.la/1FPWuJG
** Pending ping cleanup
** Investigate count discrepancies between "main" pings and "saved session" pings
* Pipeline
** Continue with scaling work
** Monitoring work for Telemetry data
** Investigate executive stream discrepancies
** Bug fixes
* Data validation
** Join corresponding v2 data to v4 nightly clients data set
** Continue writing callbacks that look at other measures
** Breadth first, do a first pass at most validations and flag big issues
** Deep dive on missing subsessions as it may indicate a client bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1171268
* Data continuity
** Document strategy for executive dashboards with v2 + v4 data
Ops
* Meeting to go over Telemetry tools/microservices production deployment
* Databricks investigation (big jobs on big clusters): cost, resourcing, etc.
* Continued work on scaling for release loads
QA
* closing bugs
* test suite creation
* finalizing long-term QA engagement (SoftVision engagement, tooling asks for CI-loop-based testing)
Project Management
* create meeting for legal review
* Finish triage of bugs
* follow up with ops and qa
* remainder of release tasks scheduled
* mitigation plan for projects depending on UT
* reassess milestones given schedule adjustment


=== Outstanding requests not yet road mapped into a release ===

Latest revision as of 20:53, 17 July 2015

previous week's report

Unified Telemetry status report July 17, 2015

Overall Project Health

Green - r41 is the go-live release for Unified Telemetry. All issues are triaged and assigned milestones. The dev team continues to focus on data validation.

Exec Summary

  • Client work delayed this week by sick time. Send logic and a few other changes planned for uplift to Aurora & Beta next week. Remaining work for 41 waiting for reviews.
  • July 30 milestone for first complete pass of data validation, deployment of pipeline scaling work
  • Testing plan is up on the wiki: Telemetry/Testing
  • Ongoing planning on FHR V2/V3 historic pipeline migration; status: https://mana.mozilla.org/wiki/display/PM/FHR+historic+pipeline+update+July+6

Risks/Issues

{|
! Description of Risks/Issues !! State !! Owner !! Plan to Resolve/Mitigation !! Target Date
|-
| Data integrity between V2/V4 and V4 internal data consistency || Open || Brendan/Sam || Investigation in progress. Added resources (Sam). https://etherpad.mozilla.org/fhr-v4-validation || 7/30
|-
| Data continuity across V2/V4 || Open || Katie/Mark/Trink || Plan, Metabug || 7/23
|-
| Legal review || Open || BDS/Legal || Meeting between groups || 8/04
|-
| QA sign off (functional, load) || Open || Stuart || Telemetry/Testing || 8/04
|-
| Operations - data retention requirements || Open || Travis/Katie || Eng team owes ops a doc defining ping types and data retention requirements || 8/04
|-
| Operations - analysis tools & microservices || Open || Travis/Mark/Roberto || Architecture/Data flow diagram || 8/04
|-
| Data loss incident || Fixed || mreid/whd/trink || Tee server needs to return error status from old or new. Added Ops resources (Daniel Thornton). || 7/15
|-
| Remote about:healthreport content || Open || Katie/Georg || Made a request to Laura Thomson for help || 8/04
|-
| Budget, size of UT pings || Open || Mark/BDS || https://bugzilla.mozilla.org/show_bug.cgi?id=1182693 || 8/04
|-
| Analysis difficulty || Open || Katie/tbd || No plan yet, aside from ongoing work on tools || 8/04
|}

Accomplished for Last Period

Engineering & Ops

QA

  • test cases, bug closing

Project management

  • meeting, emails, hand waving

Planned for Upcoming Period

Engineering

  • Client
    • Do code reviews for deletion pings and choices info bar
    • Pending ping cleanup
    • Investigate count discrepancies between "main" pings and "saved session" pings
  • Pipeline
    • Continue with scaling work
    • Monitoring work for Telemetry data
    • Investigate executive stream discrepancies
    • Bug fixes
  • Data validation
    • Join corresponding v2 data to v4 nightly clients data set
    • Continue writing callbacks that look at other measures
    • Breadth first, do a first pass at most validations and flag big issues
    • Deep dive on missing subsessions as it may indicate a client bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1171268
  • Data continuity
    • Document strategy for executive dashboards with v2 + v4 data

Ops

  • Databricks investigation (big jobs on big clusters): cost, resourcing, etc.

QA

  • closing bugs
  • test suite creation
  • finalizing long-term QA engagement (SoftVision engagement, tooling asks for CI-loop-based testing)

Project Management

  • Finish triage of bugs
  • remainder of release tasks scheduled

Outstanding requests not yet road mapped into a release

{|
! Description !! State !! Owner !! Plan to Resolve/Mitigation !! Target Date
|-
| Firefox OS - app pings || Open || Katie || Need to schedule and understand impact on project || TBD
|-
| histograms for loop/hello || Open || Katie || Need to schedule and understand impact on project || TBD
|}

Important Links/References