* focused on "not being on fire"
** seems to be going well
* root cause of last week's issues
** configuration mismatch with the rest of the cluster
** Puppet missed putting the .yaml file in there
** they defaulted to 2GB, and when they exhausted that everything went to hell
** we initially suspected that it was retention-related
** debated, but didn't land, a change that would temporarily lower retention
* new Pingdom accounts are coming for those who already have one
* monitoring of ES
** Jason has been helping us figure out our ES config and make it more robust
** JP has a new monitoring agent
** we expect to have new, aggressive alerts
* Super Search errors are checked in the webapp health check
** should catch individual shard failures
** shard failures now break Pingdom and Sentry
** JP will own a plan for failure
* Python upgrade
** on the horizon
** JP wants a stable stage and prod before he does it
** let's do it this week, shortly after our next ship to prod
[https://bugzilla.mozilla.org/buglist.cgi?priority=P1&resolution=---&query_format=advanced&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=Infra&product=Socorro&list_id=13148014 P1 infra bugs]
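
For reference on the shard-failure item above, here is a rough sketch of how a health check could flag individual shard failures. It is not the actual webapp code. It assumes the check has access to a raw Elasticsearch search response, which reports per-request shard counts under <code>_shards</code>. The function names <code>find_shard_failures</code> and <code>supersearch_healthy</code> are made up for illustration.

```python
def find_shard_failures(response):
    """Return (failed_count, failure_details) from an ES search response dict.

    Assumption: `response` is the parsed JSON body of an Elasticsearch
    search request, which carries a `_shards` section with `total`,
    `successful` and `failed` counts, plus an optional `failures` list.
    """
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    failures = shards.get("failures", [])
    return failed, failures


def supersearch_healthy(response):
    """Hypothetical health predicate: no failed shards and no timeout."""
    failed, _ = find_shard_failures(response)
    return failed == 0 and not response.get("timed_out", False)
```

A check along these lines would report unhealthy as soon as a single shard fails, even though the query as a whole still returns partial results.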
== Project Updates ==