Socorro:DowntimePostmortem


Agenda

Background

  • Thursday's downtime:
    • bug 600998
    • Root cause: traffic? (how sure are we?)
    • Exactly what changes were made to address it?
    • How long will these changes tide us over?
  • Today's downtime:
    • bug 601625
    • Compaction during peak hours
    • Concern over IO problems on the boxes
    • Collectors at capacity (added two more in bug 601622 - thanks IT for fast action)

Future

  • Discussion of traffic numbers (dre)
  • What else do we need to do to keep things ticking over before PHX?
  • Is PHX hardware going to be sufficient?
  • What's the growth plan?
  • What else do we need to do?

Actions

  1. Create a script to schedule compactions. Non-automatic compactions have some issues, so we need to clean up manually. (dre)
  2. Add a Nagios monitor on compactionQueueSize (dre, aravind)
  3. Continue plans to upgrade to 0.89 (rhelmer, xstevens testing on research cluster)
  4. Hire {one, some} HBase Ops Engineer(s) (everybody)
  5. Pull hadoop{29,30} from the research cluster, reformat, reinstall, and add to production (dre?)
    1. These boxes don't use RAID5. Profile them once in production to see whether they perform better. If so, reformat and reimage the rest of production to not use RAID5. (dre, stack)
  6. Run forensics on the Ganglia metrics from this morning to see if there are any other issues we need to deal with (dre, stack)
  7. Review fallback storage requirements and plan (laura)
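
Sketch for actions 1 and 2

Actions 1 and 2 could start from something like the Python sketch below. It is only an illustration, not the script or monitor that was actually deployed. The thresholds, the off-peak window, and both function names are hypothetical, and fetching the compactionQueueSize reading from the region servers is left to a wrapper that is not shown. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from datetime import time

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical values: tune these against real cluster metrics.
WARN_QUEUE_SIZE = 10
CRIT_QUEUE_SIZE = 50
OFFPEAK_START = time(2, 0)
OFFPEAK_END = time(6, 0)


def in_offpeak_window(now, start=OFFPEAK_START, end=OFFPEAK_END):
    """Return True if `now` (a datetime.time) falls inside the compaction window."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 22:00 to 04:00.
    return now >= start or now < end


def check_compaction_queue(queue_size, warn=WARN_QUEUE_SIZE, crit=CRIT_QUEUE_SIZE):
    """Map a compactionQueueSize reading to an (exit_code, message) pair."""
    if queue_size is None or queue_size < 0:
        return UNKNOWN, "UNKNOWN - no compactionQueueSize reading"
    if queue_size >= crit:
        return CRITICAL, f"CRITICAL - compactionQueueSize={queue_size}"
    if queue_size >= warn:
        return WARNING, f"WARNING - compactionQueueSize={queue_size}"
    return OK, f"OK - compactionQueueSize={queue_size}"
```

A scheduling script would call in_offpeak_window() before triggering a compaction, so that compactions stay out of peak hours. A Nagios wrapper would print the message from check_compaction_queue() and exit with the returned code.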