Agenda

Background

Thursday's downtime:
- bug 600998
- Root cause: traffic? (how sure are we?)
- Exactly what changes were made to address it?
- How long will these changes tide us over?
Today's downtime:
- bug 601625
- Compaction during peak hours
- Concern over IO problems on the boxes
- Collectors at capacity (added two more in bug 601622 - thanks IT for fast action)

Create a script to schedule compactions. Non-automatic compactions have some issues, so we need to clean up manually. (dre)
Add a Nagios monitor on compactionQueueSize (dre, aravind)
Continue plans to upgrade to 0.89 (rhelmer, xstevens testing on research cluster)
Hire {one, some} HBase Ops Engineer(s) (everybody)
Pull hadoop{29,30} from the research cluster, reformat, reinstall, and add to production (dre?)
- These boxes don't use RAID5. Once they are in prod, profile them to see whether they perform better. If so, reformat and reimage the rest of prod without RAID5. (dre, stack)
Forensics on Ganglia metrics from this morning to see whether there are any other issues we need to deal with (dre, stack)
Review fallback storage requirements and plan (laura)
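
A rough sketch of how the first two items above (scheduled compactions and the compactionQueueSize monitor) might look in Python. Everything here is an assumption for illustration: the 02:00-05:00 off-peak window, the warn/crit thresholds, and the function names are all placeholders. How the metric is actually read (Ganglia, JMX, or otherwise) is left out on purpose. The only fixed conventions are the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the HBase shell `major_compact` command.

```python
from datetime import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def in_offpeak_window(now, start=time(2, 0), end=time(5, 0)):
    """Return True if `now` (a datetime.time) falls in [start, end).

    The 02:00-05:00 default is a placeholder, not a measured off-peak window.
    Windows that wrap past midnight (e.g. 22:00-04:00) are handled.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def major_compact_commands(tables):
    """Build HBase shell commands, one major_compact per table."""
    return ["major_compact '%s'" % t for t in tables]


def check_compaction_queue(queue_size, warn=10, crit=50):
    """Map a compactionQueueSize reading to a Nagios (exit_code, message) pair.

    warn/crit are hypothetical thresholds to be tuned against real metrics.
    """
    if queue_size is None or queue_size < 0:
        return UNKNOWN, "UNKNOWN - no compactionQueueSize reading"
    if queue_size >= crit:
        return CRITICAL, "CRITICAL - compactionQueueSize=%d" % queue_size
    if queue_size >= warn:
        return WARNING, "WARNING - compactionQueueSize=%d" % queue_size
    return OK, "OK - compactionQueueSize=%d" % queue_size
```

A cron job could call `in_offpeak_window` before piping the output of `major_compact_commands` into `hbase shell`, so a compaction never starts during peak hours. A Nagios wrapper would print the message from `check_compaction_queue` and exit with the returned code.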