Breakpad/Status Meetings/2015-08-19: Difference between revisions

From MozillaWiki

Revision as of 18:25, 19 August 2015


Meeting Info

Breakpad status meetings occur on Wed at 11:00am Pacific Time.

Conference numbers:

   Vidyo: Stability 
   650-903-0800 x92 conf 98200#
   800-707-2533 (pin 369) conf 98200# 

IRC backchannel: #breakpad
Mountain View: Dancing Baby (3rd floor)

Operations Updates

  • stage is working again
    • fair and balanced submitter is running
      • 10% of what we receive is now going to stage
    • adi is coming in now
  • JSON problems recurred in prod
    • stage has a patch that filters keys as well as values
    • manually submitting a problematic crash leads us to believe stage is fixed
    • write out the null bytes
  • updating supersearchfields is not working
    • adrian working on a patch
  • datadog runs on UDP
    • we've been losing windows of time
  • webapp has been registering as down to pingdom/nagios
    • the webapp shows high response times, but it is still serving requests in those windows
    • can we get newrelic to look?
      • talk to Travis
      • we need 8 hosts for a while
    • moztrap was seeing this
      • cache was acclimating pingdom to fast responses
      • whole cache was invalidating at once
    • we are not actually down during these windows; response times may just be exceeding the monitoring timeouts
    • the only thing more dangerous than no alerting is noisy alerting
  • pingdom accounts for the team?
    • going to take approx 1 month to set us up for pingdom on the mozilla account
  • sentry is down
    • cannot ingest
    • should be back by the end of the day
  • loggly
    • django errors don't end up in syslog because they don't go to stdout
    • maybe we don't want to continue with loggly
      • the features it has over other log aggregators are not that useful
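
The JSON item above mentions a stage patch that filters keys as well as values, plus a note about null bytes. A minimal sketch of what such a filter could look like, assuming the offending characters are NUL bytes in strings; `strip_null_bytes` is a hypothetical helper for illustration, not the actual Socorro patch:

```python
def strip_null_bytes(value):
    """Recursively remove NUL characters from strings, dict keys, and lists."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, dict):
        # Clean keys as well as values; cleaning only values leaves bad keys behind.
        return {strip_null_bytes(k): strip_null_bytes(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_null_bytes(item) for item in value]
    # Numbers, booleans, None, and other types pass through unchanged.
    return value
```

Note that two keys differing only by a NUL byte would collapse into one after cleaning; a real patch would need to decide how to handle that case.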

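The loggly item notes that django errors do not reach stdout. One way this is commonly addressed is a `LOGGING` setting that attaches a stream handler to the `django` logger. The following is a sketch under the assumption that the webapp uses Django's standard `LOGGING` setting; the handler name `stdout` and the `ERROR` level are illustrative choices, not taken from the Socorro configuration:

```python
import logging.config

# Intended for a Django settings module; Django passes this dict to
# logging.config.dictConfig at startup.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # StreamHandler defaults to stderr
        },
    },
    "loggers": {
        # "django.request" and other django.* loggers propagate up to "django".
        "django": {"handlers": ["stdout"], "level": "ERROR", "propagate": False},
    },
}
```

Outside Django, the same dict can be applied with `logging.config.dictConfig(LOGGING)`.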

Other

  • How many EC2 webhead nodes do we have in prod?
    • Wanna scale that down?

Project Updates

Socorro Bug Tracker
this week's bugs

Deployment Triage

PR Triage

QA

  • Help tracking down an intermittent failure
    • ReadTimeout: HTTPSConnectionPool(host='crash-stats.mozilla.com', port=443): Read timed out. (read timeout=10)
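
While the intermittent read timeout is being tracked down, a test could retry the request a few times before failing. A minimal sketch, assuming a retry is acceptable for this test; `call_with_retries` and its parameters are hypothetical names, not part of any existing test suite:

```python
import time


def call_with_retries(func, retry_on=(TimeoutError,), attempts=3,
                      base_delay=1.0, sleep=time.sleep):
    """Call func(); on a listed exception, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the `requests` library, the call might be wrapped as `call_with_retries(lambda: requests.get(url, timeout=10), retry_on=(requests.exceptions.ReadTimeout,))`, where `url` is whatever the test already fetches. Retrying can mask a real slowdown, so it is worth logging each retry rather than hiding it.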

other business

Travel, etc

Links