Telemetry Meeting 2012-05-16
Attendance: Metrics and Perf teams
Note: I was late to this meeting and had Vidyo trouble in the middle. There may be missing content.
Data validation
- Data validation doesn’t need to be perfect
- We can prototype validation with demo data
- Validation is currently limited to looking at trunk versions of Firefox and isn't applicable to older data; we still have data from older versions coming in every day
- Data from older versions is still valuable to collect
- We're already making decisions based on Telemetry in its current form, but we can't make as many decisions as we'd like
- Historical data is used by MemShrink to track memory usage over time
- We have the ability to validate old data; we just have to do it per version
- How do we maintain continuity of data when it differs across versions?
- View data as different histograms
- We should see whether the trend is in any way comparable across versions (see the histogram-comparison sketch after this list)
- When do we stop paying attention to releases?
- Stakeholders have different requirements.
- Validate consistency of each payload
- How much data do we take in?
- 5% is 3 GB, 60 GB/day as of May 14 – this is data that we're storing
- might be as high as 120 GB/day
- Can get max of 2 pings in 24 hours
- 2 pings is the common (70%) case
- Telemetry persistence is enabled in the beta population
- This is a concern for Metrics, as it can potentially double the amount of traffic
- Persistence went in at the same time as compression, which only reduces over-the-wire transmission
- We can look at optimizing data storage
- The first piece of validation is the amount of traffic on a daily basis
- Gauge it based on how much traffic we expect to get
- Do we want to base this on the data that we were previously receiving?
- Have a volume check in the dashboard
- Want a poll check
- We don't have a production system yet
- Multiple ways to fail:
- Failure to send data at the browser level
- Failure to ETL data from HBase to ElasticSearch
- The metadata that says how much data made it through to ElasticSearch may also be incorrect
- Perf cares about whether submissions are coming in, as there is no way to recover from that failure
- Should implement fail-safes that notify us if volume changes significantly (see the volume-monitoring sketch after this list)
- Should push the validation work through, as the specific validations would be useful in the new system
- Raw push notification of volume – count, size – of submissions going into HBase
- Needs an owner to monitor
- For as long as we have multiple tiers we can do the same thing for each tier
- Would like the Perf team to be responsible for the validation checks
- Metrics will take extracts of the data you want – versions, other variations – and deliver a corpus of JSON payloads; Perf devises the appropriate checks to flag each submission as valid or invalid
- Metrics proposed a method of transforming invalid data
- Suggestion that Metrics either give Perf the data or give access so Perf can do the validation – agreed
- The production deliverable is a harness that pulls down the validation script on a regular basis and tests the data (see the validation-harness sketch after this list)
- Eventually the whole data set needs to be validated; a 5% sample should catch most of the issues
- The only thing not handled is start-up histograms
- Perf wants to run the script on the entire data set
- Metrics doesn't have enough domain experience to look at data in ElasticSearch for inconsistencies
- We think we need to do this the hard, manual way, with domain experts who understand the data
- Perf thinks that testing is not feasible for this data
- The Metrics team will not be responsible for this type of data
- Enable push notifications of data volume for the 3 storage repositories
- Once we have a validation script, it will push out counts of valid/invalid
- Want a check on the percentage of Firefox users reporting to Telemetry
- Metrics will create a data-validation proposal and share it with Perf for review
- Lawrence and Taras to set the statement of work
- Perf will sign off on validation
- Metrics (Daniel) commits to Perf being able to run validation scripts on the server
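The cross-version comparison idea above – viewing the data as different histograms per version and checking whether the trends are comparable – could start with something as simple as normalizing each version's histogram and computing the overlap. A minimal sketch in Python, assuming made-up bucket labels and counts rather than the actual Telemetry schema:

 # Hypothetical sketch: compare the "same" histogram across two Firefox
 # versions by normalizing bucket counts and computing their overlap.
 # Bucket labels and numbers below are made up for illustration.

 def normalize(hist):
     """Convert raw bucket counts into a probability distribution."""
     total = float(sum(hist.values()))
     return dict((b, c / total) for b, c in hist.items()) if total else {}

 def overlap(p, q):
     """Overlap coefficient in [0, 1]; 1.0 means identical shapes."""
     buckets = set(p) | set(q)
     return sum(min(p.get(b, 0.0), q.get(b, 0.0)) for b in buckets)

 v13 = {"0-10ms": 900, "10-50ms": 80, "50-100ms": 15, "100ms+": 5}
 v14 = {"0-10ms": 850, "10-50ms": 120, "50-100ms": 20, "100ms+": 10}
 print("overlap: %.3f" % overlap(normalize(v13), normalize(v14)))

A low overlap would flag a probe whose distribution changed enough between versions to warrant manual review by a domain expert.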
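The volume fail-safe could likewise begin as a comparison of each day's submission count and byte size against a rolling baseline. A minimal sketch, assuming a hypothetical 50% deviation threshold; the data source and alert hook here are placeholders, not the production system:

 # Hypothetical volume fail-safe: flag days whose submission volume
 # deviates sharply from the recent average. Thresholds and numbers
 # are assumptions for illustration.

 def mean(xs):
     return sum(xs) / float(len(xs))

 def check_volume(history, today_count, today_bytes, threshold=0.5):
     """history: list of (count, bytes) tuples for recent days.
     Returns alert strings; an empty list means volume looks normal."""
     alerts = []
     base_count = mean([c for c, _ in history])
     base_bytes = mean([b for _, b in history])
     if abs(today_count - base_count) > threshold * base_count:
         alerts.append("count %d vs baseline %.0f" % (today_count, base_count))
     if abs(today_bytes - base_bytes) > threshold * base_bytes:
         alerts.append("bytes %d vs baseline %.0f" % (today_bytes, base_bytes))
     return alerts

 # Made-up example: ~1M submissions/day at ~60 GB/day, then a sudden drop.
 history = [(1000000, 60e9), (980000, 59e9), (1020000, 61e9)]
 for alert in check_volume(history, today_count=400000, today_bytes=25e9):
     print("ALERT: " + alert)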
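For the per-payload consistency checks the harness would run over a corpus of JSON submissions, the sketch below counts valid/invalid. The required keys and histogram layout are assumptions about the payload format, not the real schema; the actual checks are Perf's to devise:

 # Hypothetical validation pass over a corpus of JSON submissions,
 # one per line. Field names ("ver", "info", "histograms") are
 # illustrative assumptions about the payload format.
 import json

 REQUIRED_KEYS = ("ver", "info", "histograms")

 def validate(payload):
     """Return None if the payload looks consistent, else a reason."""
     for key in REQUIRED_KEYS:
         if key not in payload:
             return "missing key: " + key
     for name, hist in payload["histograms"].items():
         if any(c < 0 for c in hist.get("values", {}).values()):
             return "negative bucket count in " + name
         if hist.get("sum", 0) < 0:
             return "negative sum in " + name
     return None

 def run(lines):
     """Tally valid/invalid over an iterable of JSON strings."""
     valid = invalid = 0
     for line in lines:
         try:
             reason = validate(json.loads(line))
         except ValueError:
             reason = "unparseable JSON"
         if reason is None:
             valid += 1
         else:
             invalid += 1
     print("valid=%d invalid=%d" % (valid, invalid))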
Team interaction
- Metrics feels that they are being told what to do by Perf
- Metrics has suggestions that we need to listen to for better collaboration
- Two problems from Webdetails' perspective:
- 1. The analyst team feels that the Metrics team should be able to give more input into how we are collecting Telemetry data
- They don't understand why we are aggregating and collecting the data in the way that we are
- 2. The tools we have are the only way Telemetry data can be viewed; the two dashboards provide the only visibility into the data, which is why it's so important for Perf to communicate deliverables
- Metrics hasn't been able to prioritize features properly – we're being inefficient
- Webdetails can’t keep working in the way that they have
- We had a problem with IT removing a server
- We need well-defined statements of work from the Perf team, with documented requirements, so that we can prioritize work
- Answers that are expected immediately put us in a bad position
- Returning to problem 1: the analyst team feels the Metrics team should be able to give more input into how we are collecting Telemetry data
- There is a difference between a technical-detail requirement and a business-detail requirement
- Here is what my devs want to do
- The first set of requirements will include some that are infeasible, too expensive, etc.
- Lots of discussion is needed on how to get from the first requirements to implementation
- Need to try to isolate requests so that we can act on them and provide a good deliverable
- More isolated chunks of work, with specific requirements about what you're trying to analyze
- Telemetry evolution is an example of this
- Never had reasonable checkpoints that both sides agreed on
- Taras isn’t the Telemetry person
- Can have conversations but there are more people who need to be involved
- There is a comprehensive set of business questions coming from various sources
- We're getting these from the interviews that Daniel and Lawrence conducted
- Once we have them, analysts can work on proposals for how we can meet these requirements
Other
- Should we start Telemetry over from the business questions to see what it would look like and then scope potential modifications to the existing system?
- Who decided on histograms?
- We used Chrome as an example because they had an implementation that took multiple years to develop
- Lots of opposition to the idea of getting rid of histograms
- In response, we dropped the idea
- Taras understood that Justin, Joy, and he had agreed on this
- It might be a better investment of time to build an A/B testing framework instead of building out Telemetry further
- Could be used by other teams
- Telemetry needs to sample the world
- The question is how does this feature behave in the real world?
- We're not performing experiments
- JS does want A/B testing
- MemShrink may want it as well
- Other projects need further discussion for prioritization and scope
- Telemetry UX
- Notifications
- Production quality offering