Improve stability
- Solve compaction issue
- Better compaction monitoring (done)
- Upgrade to 0.89
- Review system architecture with stability as a goal
- Architecture meeting starting 10/25 [details]
- Seek advice from experts (Cloudera, SU, PGExperts, Rabbit)
- Review fallback storage requirements (done)
- See section below for architectural changes
- Hire {one, some} Hadoop Ops engineers PD
- Schedule 3x weekly HBase cluster restarts to avoid unscheduled restarts
Modify architecture for improved stability
- Change collectors to write to local storage by default bug 608734
- Build a simpler HBase API (Q1 2010)
- Investigate queues to use in front of HBase, behind API (Q1 2010)
- Hazelcast: talk to reference users, build a prototype
- Rabbit?
- Memcache?
Create build process
- Set up automated tests with Hudson bug 601501 (done)
- Create a build script bug 602559 (done)
- Write load testing guidelines and plan bug 605491
- Enable replication of production traffic to a mirror environment bug 599859
Improve release process
Improve insight into systems
- Improve monitoring
- Add compaction queue monitors (done)
- Add collector log monitors bug 604397
- Build an ops dashboard bug 605495
- On-call roster for Metrics bug 598746
Move to new, bigger hardware in PHX (updated 11/8)
- Rack boxes [IT] - ETA 11/19
- Last delivery due 11/17 (processors, webapp, middleware); all else due 11/12
- Racking in progress
- Generate a diagram of required network connections for dmoore [dre, laura]
- Have puppet configs ready to go [rhelmer, jabba]
- Install software - ETA 12/1
- Sneakernet data transfer - by 12/1
- Load test - by 12/8
- QA 12/8 - 12/15
- Sneakernet data update 1/5
- GO-NO-GO Wednesday 1/5
- Cutover to new system Saturday 1/8