Breakpad/Status Meetings/2016-08-03



Meeting Info

Breakpad status meetings occur on Wed at 10:00am Pacific Time.

Conference numbers:

   Vidyo: Stability 
   650-903-0800 x92 conf 98200#
   800-707-2533 (pin 369) conf 98200# 

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

Operations Updates

P1 Infra Bugs

  • focused on "not being on fire"
    • seems to be going well
  • root cause of last week's issues
    • configuration mismatch with the rest of the cluster
    • Puppet missed putting the .yaml config file in place
    • the nodes defaulted to 2GB, and when they exhausted that, everything went to hell
    • we initially suspected that it was retention related
    • we debated, but didn't land, a change that would temporarily lower retention
  • new Pingdom accounts are coming if you already have one
  • monitoring of ES
    • Jason has been helping us to figure out our ES config and make it more robust
    • JP has a new monitoring agent
    • we expect to have new, aggressive alerts
  • Super Search errors are now checked in the webapp health check
    • should catch individual shard failures (a sketch follows this list)
    • shard failures now trip Pingdom and show up in Sentry
    • JP will own a plan for handling failures
  • Python upgrade
    • on the horizon
    • JP wants a stable stage and prod before he does it
    • let's do it this week, shortly after our next ship to prod
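
As a rough illustration of the shard check described above, here is a minimal sketch using the elasticsearch-py client. The host, index name, and field checks are placeholders, not Socorro's actual health check code.

    # Minimal sketch of an ES shard-failure check, as described above.
    # Host and index name are hypothetical placeholders.
    from elasticsearch import Elasticsearch


    def check_es_shards(hosts, index="crash_reports"):
        """Return a list of problems found for the index; empty means healthy."""
        es = Elasticsearch(hosts)
        problems = []
        health = es.cluster.health(index=index)
        if health["status"] == "red":
            problems.append("cluster status is red for index %s" % index)
        if health.get("unassigned_shards", 0) > 0:
            problems.append("%d unassigned shards" % health["unassigned_shards"])
        return problems


    if __name__ == "__main__":
        problems = check_es_shards(["http://localhost:9200"])
        if problems:
            # A real health check endpoint would surface these to Pingdom and
            # Sentry by returning a non-200 response or raising an exception.
            raise SystemExit("; ".join(problems))
        print("ok")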

Project Updates

  • Socorro::Middleware component moved to Graveyard
  • Monitoring/healthcheck now checks for ES shard errors. In prod. Every minute.
  • Home page AJAX code now cached! Yay faster home page.
  • during the ES fire there was talk of a spike among release drivers
    • Laura was saying we needed to be stable so they could investigate
  • the API now sends cache headers
    • the webapp uses them now, which improved performance (a sketch follows this list)
  • Google auth is getting ready to go out
    • we see an error on stage that we cannot reproduce
    • we suspect it's the security scanner tools and is benign
    • one more fix is going out, and then we're ready for prod
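
A minimal sketch of the idea behind the API cache headers mentioned above, using Django's cache_control decorator. The view name, payload, and max-age value are hypothetical, not Socorro's actual code.

    # Hypothetical API view that sends a Cache-Control header so the webapp
    # (or any intermediate cache) can reuse the response for a few minutes.
    from django.http import JsonResponse
    from django.views.decorators.cache import cache_control


    @cache_control(public=True, max_age=300)
    def supersearch_api(request):
        """Return search results with Cache-Control: public, max-age=300."""
        results = {"hits": [], "total": 0}  # placeholder payload
        return JsonResponse(results)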

Deployment Triage

PR Triage

Major Projects

Migrating off of Persona

Sending public data to Parquet for reading from Spark/re:dash

  • Adrian and Peter have a prototype that adds another crash storage that sends to S3
  • Mark's awareness of reprocessing (a.k.a. primary keys)
    • how useful is it and how often do we do it?
  • we are going to unify raw and processed crashes into a single crash report JSON document based on the public schema (a sketch follows this list)
    • avoids duplicate info, unifies all the info we have into one doc, and uses the prettier name where we have it
    • starts only at the point where we transmit to the Telemetry data platform
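
A rough sketch of the "single crash report document" idea above: merge the raw and processed crash into one dict, preferring the prettier public-schema name where one exists. The field names and mapping here are hypothetical, not the real public schema.

    # Hypothetical mapping from raw-crash field names to public-schema names.
    PUBLIC_SCHEMA_NAMES = {
        "ProductName": "product",
        "Version": "version",
        "uuid": "crash_id",
    }


    def unify_crash(raw_crash, processed_crash):
        """Build one document from the raw and processed crash dicts.

        Raw fields are renamed per the public schema; processed fields win on
        conflicts since they carry the derived/cleaned values, which avoids
        duplicating the same information under two names.
        """
        doc = {}
        for key, value in raw_crash.items():
            doc[PUBLIC_SCHEMA_NAMES.get(key, key)] = value
        doc.update(processed_crash)
        return doc


    if __name__ == "__main__":
        raw = {"ProductName": "Firefox", "Version": "48.0", "uuid": "abc-123"}
        processed = {"signature": "OOM | small", "crash_id": "abc-123"}
        print(unify_crash(raw, processed))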

Symbols service refactoring (snappy, somewhat tangential to us)

No update.

Signature generation across crash reporters

Splitting out collector

No update.

Collecting client-side JavaScript errors

  • No update

Handling more PII data in crashes

Sending stacks for all crashes from the client

  • No update

Replacing FTPscraper

Other business

Travel, etc

Links