Breakpad/Status Meetings/2016-08-03: Difference between revisions

Revision as of 17:22, 3 August 2016

Meeting Info

Breakpad status meetings occur on Wed at 10:00am Pacific Time.

Conference numbers:

   Vidyo: Stability 
   650-903-0800 x92 conf 98200#
   800-707-2533 (pin 369) conf 98200# 

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

Operations Updates

  • focused on "not being on fire"
    • seems to be going well
  • root cause of last week's issues
    • configuration mismatch with the rest of the cluster
    • puppet missed putting the .yaml file in there
    • they defaulted to 2GB and when they exhausted themselves everything went to hell
    • we initially suspected that it was retention related
    • debated but didn't land a change that would lower retention temporarily
  • new Pingdom accounts are coming for those who already have one
  • monitoring of ES
    • Jason has been helping us to figure out our ES config and make it more robust
    • JP has new monitoring agent
    • we expect to have new, aggressive alerts
  • super search errors are checked in webapp health check
    • should catch individual shard failures
    • shard failures break pingdom and sentry now
    • JP will own a plan for failure
  • python upgrade
    • on the horizon
    • JP wants a stable stage and prod before he does it
    • let's do it this week, shortly after our next ship to prod

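P1 infra bugs: https://bugzilla.mozilla.org/buglist.cgi?priority=P1&resolution=---&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=Infra&product=Socorro&list_id=13148014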

Project Updates

  • Socorro::Middleware component to Graveyard
  • Monitoring/healthcheck now checks for ES shard errors. In prod. Every minute.
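
As a rough illustration of what a shard check like the one described above might look at: an Elasticsearch search response carries a `_shards` section with `total`, `successful`, and `failed` counts and, when shards fail, a `failures` list. The sketch below only inspects an already-parsed response; the function names and the sample index name are hypothetical and are not taken from the Socorro code.

```python
def shard_problems(search_response):
    """Return a list of shard problems found in an ES search response.

    `search_response` is the parsed JSON body of a search request.
    An empty list means no shard reported a failure.
    """
    shards = search_response.get("_shards", {})
    problems = []

    failed = shards.get("failed", 0)
    if failed:
        problems.append(
            "{} of {} shards failed".format(failed, shards.get("total", 0))
        )

    # Each entry in "failures" describes one failing shard.
    for failure in shards.get("failures", []):
        problems.append(
            "index={} shard={} reason={}".format(
                failure.get("index"),
                failure.get("shard"),
                failure.get("reason"),
            )
        )
    return problems


def is_healthy(search_response):
    """Hypothetical health-check helper: healthy only if no shard failed."""
    return not shard_problems(search_response)


# Hypothetical response with one failed shard (index name is made up).
sample = {
    "_shards": {
        "total": 5,
        "successful": 4,
        "failed": 1,
        "failures": [
            {"index": "example_index", "shard": 2, "reason": "node unavailable"}
        ],
    }
}
```

A health check that runs every minute could issue a cheap search and report unhealthy whenever `is_healthy` returns False, so that a single failing shard is caught even though the search itself still returns results.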

Deployment Triage

PR Triage

Major Projects

Migrating off of persona

  • Deployed in stage
    • emails will be sent as soon as we put this in prod

Sending public data to parquet for reading from spark/re:dash

Symbols service refactoring (snappy, somewhat tangential to us)

No update.

Signature generation across crash reporters

Splitting out collector

No update.

Collecting client-side JavaScript errors

Handling more PII data in crashes

Sending stacks for all crashes from the client

Replacing FTPscraper

other business

Travel, etc

Links