ProgramManagement/Programs/Telemetry/Meetings/2012-05-16

Telemetry Meeting 2012-05-16

Attendance: Metrics and Perf teams

Note: I was late to this meeting and had Vidyo trouble in the middle. There may be missing content.

Data validation

  • Data validation doesn’t need to be perfect
  • We can prototype validation with demo data
  • Validation has been limited to trunk versions of Firefox and isn’t applicable to older data, yet data from older versions still comes in every day
    • Data from older versions is still valuable to collect
    • Already making decisions based on telemetry in current form, can’t make as many decisions as we’d like
    • Historical data is used by MemShrink to track trends over time
    • We have the ability to validate old data; we just have to do it per version
  • How do we have continuity of data when it differs?
    • View data as different histograms
    • We should see if the trend is in any way comparable
  • When do we stop paying attention to releases?
    • Stakeholders have different requirements.
  • Validate consistency of each payload
  • How much data do we take in?
    • 5% is 3 GB; 60 GB/day as of May 14 – this is data that we’re storing
    • might be as high as 120 GB/day
  • Can get max of 2 pings in 24 hours
    • 2 pings is the common (70%) case
  • Telemetry persistence is in beta population
    • Concern for Metrics as this can potentially double the amount of traffic
    • Persistence went in at the same time as compression, which only reduces over-the-wire transmission
    • We can look at optimizing data storage
  • First piece of validation is amount of traffic on a daily basis
    • Gauge based on how much traffic we expect to get
    • Do we want to base this off of the data that we were previously receiving?
  • Have volume check in the dashboard
    • Want poll check
    • Don’t have production system yet
    • Multiple ways to fail:
      • Failure to send data at the browser level
      • Failure to ETL data from HBase to Elasticsearch
      • Metadata that says how much data made it through to Elasticsearch may also be incorrect
    • Perf cares about whether submissions are coming in as there is no way to recover from that failure
    • Should implement fail-safes to notify us if volume changes significantly (see the volume-check sketch after this list)
  • Should push validation through as the specific validation would be useful in the new system
    • Raw push notification of volume – count, size – of submissions going into HBase
      • Needs an owner to monitor
    • For as long as we have multiple tiers we can do the same thing for each tier
  • Would like Perf team to be responsible for the validation checks
    • Metrics will take extracts of the data you want – versions, other variations – and deliver a corpus of JSON submissions; Perf devises the appropriate checks to mark each one valid or invalid
    • Metrics proposed a method of transforming invalid data
    • Suggestion is that we give Perf the data, or give Perf access, to do the validation – agreed
  • Production deliverable is a harness that pulls down the validation script on a regular basis and tests data (see the validation sketch after this list)
    • Eventually the whole data set needs to be validated; a 5% sample should catch most of the issues
    • The only thing not handled is start-up histograms
    • Perf wants to run the script on the entire data set
  • Metrics doesn’t have enough domain experience to look at data in Elasticsearch for inconsistencies
    • Think we need to do this the hard, manual way, with domain experts who understand the data
    • Perf thinks that testing is not feasible for this data
    • The Metrics team will not be culpable for this type of data
  • Enable push notifications of data volume for the 3 storage repositories
    • Once we have a validation script it will push out counts of valid/invalid
  • Want a check on the percentage of Firefox users reporting to Telemetry
  • Metrics will create a validation proposal for data validation and share with Perf for review
    • Lawrence and Taras to set the statement of work
    • Perf will sign off on validation
  • Metrics (Daniel) commits to Perf being able to run validation scripts on the server
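
As a rough illustration of the volume fail-safe discussed above: the check only needs to compare each day's submission count and stored bytes against a trailing baseline and alert on large swings. This is a sketch only – the threshold, the data source, and the example numbers are assumptions rather than an agreed design, and Python is used purely as an example language.

  # Hypothetical volume fail-safe: flag days whose submission count or byte
  # volume deviates sharply from the trailing average.
  from statistics import mean

  ALERT_THRESHOLD = 0.30  # flag swings of more than +/-30% vs. the trailing average

  def volume_alerts(history, today):
      """history: list of (count, bytes) tuples for prior days;
      today: (count, bytes) for the day being checked.
      Returns human-readable alert strings (empty list = volume looks normal)."""
      alerts = []
      checks = (
          ("submission count", [h[0] for h in history], today[0]),
          ("bytes stored", [h[1] for h in history], today[1]),
      )
      for label, past, current in checks:
          baseline = mean(past)
          if baseline == 0:
              continue
          change = (current - baseline) / baseline
          if abs(change) > ALERT_THRESHOLD:
              alerts.append("%s changed %+.0f%% vs. trailing average (%.0f -> %.0f)"
                            % (label, change * 100, baseline, current))
      return alerts

  if __name__ == "__main__":
      # Example numbers only, loosely based on the ~60 GB/day figure above.
      past_week = [(2000000, 60e9)] * 7
      print(volume_alerts(past_week, (1200000, 35e9)))

The same check could run independently against each tier (submissions received, HBase, Elasticsearch), so a drop between tiers points at the failing hand-off.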
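
And a minimal sketch of the payload-consistency check the harness would run over a corpus of JSON submissions, emitting valid/invalid counts. The field names used here (info, histograms, values) are assumptions about the ping layout; the actual checks are for Perf to define.

  # Hypothetical validation script: read JSON submissions from disk and mark
  # each one valid or invalid based on simple structural checks.
  import json
  import sys

  def validate_payload(payload):
      """Return a list of problems found in one submission (empty = valid)."""
      problems = []
      for key in ("info", "histograms"):
          if key not in payload:
              problems.append("missing top-level key: %s" % key)
      for name, hist in payload.get("histograms", {}).items():
          values = hist.get("values")
          if not isinstance(values, dict):
              problems.append("%s: missing or malformed values" % name)
          elif any(int(v) < 0 for v in values.values()):
              problems.append("%s: negative bucket count" % name)
      return problems

  def main(paths):
      valid = invalid = 0
      for path in paths:
          with open(path) as f:
              problems = validate_payload(json.load(f))
          if problems:
              invalid += 1
              print("INVALID %s: %s" % (path, "; ".join(problems)))
          else:
              valid += 1
      print("valid=%d invalid=%d" % (valid, invalid))

  if __name__ == "__main__":
      main(sys.argv[1:])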

Team interaction

  • Metrics feels that they are being told what to do by Perf
    • Metrics has suggestions that we need to listen to for this to be a collaboration
  • Two problems from Webdetails’ perspective
    • 1. The analyst team feels that the Metrics team should be able to give more input into how we are collecting Telemetry data
      • Don’t understand why we are aggregating and collecting the data the way that we are
    • 2. The tools we have are the only way Telemetry data can be viewed; the two dashboards provide the only visibility into the data, which is why it’s so important for Perf to communicate deliverables
      • Metrics hasn’t been able to prioritize features in a proper way - we're being inefficient
      • Webdetails can’t keep working in the way that they have
      • We had a problem with IT removing a server
      • We need well defined statements of work from Perf team where requirements are documented so that we can prioritize work
      • Answers that are expected immediately put us in a bad position
  • There is a difference between a technical-detail requirement and a business-detail requirement
    • Here is what my devs want to do
    • First set of requirements will have some that are infeasible, too expensive, etc.
      • Lots of discussion on how to get from first requirement to implementation
  • Need to try to insulate requests so that we can act on them and provide a good deliverable
  • More isolated chunks of work with specific requirements about what you’re trying to analyze
    • Telemetry evolution is an example of this
    • Never had reasonable checkpoints that both sides agreed on
  • Taras isn’t the Telemetry person
    • Can have conversations but there are more people who need to be involved
    • There is a comprehensive set of business questions coming from various sources
      • Getting this from interviews that Daniel and Lawrence conducted
    • Once we have this, analysts can work on proposals for how we can meet these requirements

Other

  • Should we start Telemetry over from the business questions to see what it would look like and then scope potential modifications to the existing system?
  • Who decided on histograms?
    • We used Chrome as an example because they had an implementation that took multiple years to develop
    • Lots of opposition to the idea of getting rid of histograms
      • In response, we dropped the idea
      • Taras understood that Justin, Joy, and he agreed on this
  • Might be a better investment of time to build an A/B test framework instead of building out Telemetry further
    • Could be used by other teams
    • Telemetry needs to sample the world
      • The question is how does this feature behave in the real world?
      • We're not performing experiments
    • JS does want A/B testing
    • MemShrink may want it as well
  • Other projects need further discussion for prioritization and scope
    • Telemetry UX
    • Notifications
    • Production quality offering