Socorro:DowntimePostmortem
Agenda
Background
- Thursday's downtime:
- bug 600998
- Root cause: traffic? (how sure are we?)
- Exactly what changes were made to address it?
- How long will these changes tide us over?
- Today's downtime:
- bug 601625
- Compaction during peak hours
- Concern over IO problems on the boxes
- Collectors at capacity (added two more in bug 601622 - thanks IT for fast action)
Future
- Discussion of traffic numbers (dre)
- What else do we need to do to keep things ticking over before PHX?
- Is PHX hardware going to be sufficient?
- What's the growth plan?
- What else do we need to do?
Actions
- Create a script to schedule compactions. Non-automatic compactions have some issues, so we need to clean up manually. (dre)
- Add a nagios monitor on the compactionQueueSize (dre, aravind)
- Continue plans to upgrade to 0.89 (rhelmer, xstevens testing on research cluster)
- Hire {one, some} HBase Ops Engineer(s) (everybody)
- Pull hadoop{29,30} from the research cluster, reformat, reinstall, and add to production (dre?)
- These boxes don't use RAID5. Subpoint: profile them once they are in prod to see if they perform better. If so, reformat and reimage the rest of prod to not use RAID5. (dre, stack)
- Run forensics on the ganglia metrics from this morning to see if there are any other issues we need to deal with (dre, stack)
- Review fallback storage requirements and plan (laura)
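
The first two action items (scheduling compactions away from peak hours, and a nagios monitor on compactionQueueSize) could be sketched roughly as below. This is a minimal illustration, not the script that was actually written: the thresholds, the off-peak window, and every function name are hypothetical, and how the compactionQueueSize reading is obtained from the cluster is left out because these notes do not say. The only fixed convention used is the standard nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import datetime

# Standard nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical thresholds; real values would come from observed queue sizes.
WARN_THRESHOLD = 10
CRIT_THRESHOLD = 30

# Hypothetical off-peak window in local hours (start inclusive, end exclusive).
OFF_PEAK_START = 1
OFF_PEAK_END = 5


def check_compaction_queue(size, warn=WARN_THRESHOLD, crit=CRIT_THRESHOLD):
    """Map a compactionQueueSize reading to a (exit code, message) pair."""
    if size is None or size < 0:
        return UNKNOWN, "UNKNOWN - compactionQueueSize unavailable"
    if size >= crit:
        return CRITICAL, "CRITICAL - compactionQueueSize=%d" % size
    if size >= warn:
        return WARNING, "WARNING - compactionQueueSize=%d" % size
    return OK, "OK - compactionQueueSize=%d" % size


def in_off_peak(now, start=OFF_PEAK_START, end=OFF_PEAK_END):
    """Return True if the hour of `now` falls inside the off-peak window."""
    if start <= end:
        return start <= now.hour < end
    # The window wraps past midnight, e.g. 22:00 to 04:00.
    return now.hour >= start or now.hour < end


def main(argv):
    """Treat argv[1] as the reading (an assumption made for illustration)."""
    try:
        size = int(argv[1])
    except (IndexError, ValueError):
        size = None
    code, message = check_compaction_queue(size)
    print(message)
    return code
```

A nagios wrapper would call `sys.exit(main(sys.argv))` so the exit code reaches the monitor, and a scheduled compaction job would check `in_off_peak(datetime.datetime.now())` before triggering any compaction, which avoids compacting during peak hours as happened in bug 601625.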