SocorroStabilityPlan
From MozillaWiki
Improve stability
- Solve compaction issue
- Better compaction monitoring (done)
- Upgrade to 0.89
- Test on research cluster bug 602201 (done)
- Test in stage bug 605490
- Load test bug 605500
- Review system architecture with stability as a goal
- Architecture meeting starting 10/25 [details]
- Seek advice from experts (Cloudera, SU, PGExperts, Rabbit)
- Review fallback storage requirements (done)
- See section below for architectural changes
- Hire {one, some} Hadoop Ops engineers PD
- Schedule 3x weekly HBase cluster restarts to avoid unscheduled restarts
Modify architecture for improved stability
- Change collectors to write to local storage by default bug 608734
- Build a simpler HBase API (Q1 2010)
- Investigate queues to use in front of HBase, behind API (Q1 2010)
- Hazelcast: talk to reference users, build a prototype
- Rabbit?
- Memcache?
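As a rough illustration of the "write to local storage by default" idea above, the sketch below has a collector spool each crash to local disk first, with a separate step draining the spool into the primary store. It is not Socorro's actual code. `LocalSpool`, `drain`, and the `store` callable are hypothetical names, and the JSON file layout is an assumption made for the example.

```python
import json
import os
import tempfile
import uuid


class LocalSpool:
    """Hypothetical on-disk spool that a collector writes to first."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, crash_id):
        return os.path.join(self.root, crash_id + ".json")

    def save(self, crash):
        """Write one crash report to local disk and return its id."""
        crash_id = uuid.uuid4().hex
        # Write to a temp file and rename it, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(crash, f)
        os.replace(tmp_path, self._path(crash_id))
        return crash_id

    def pending(self):
        """Return the ids of crashes still waiting on local disk."""
        return sorted(
            name[:-len(".json")]
            for name in os.listdir(self.root)
            if name.endswith(".json")
        )

    def load(self, crash_id):
        with open(self._path(crash_id)) as f:
            return json.load(f)

    def remove(self, crash_id):
        os.remove(self._path(crash_id))


def drain(spool, store):
    """Move spooled crashes into the primary store.

    `store` is any callable taking (crash_id, crash); it stands in for
    whatever sits in front of HBase. If it raises, the crash stays on
    disk and draining stops until the next run.
    """
    moved = 0
    for crash_id in spool.pending():
        try:
            store(crash_id, spool.load(crash_id))
        except Exception:
            break  # primary store unavailable; retry later
        spool.remove(crash_id)
        moved += 1
    return moved
```

With this split, an outage in the primary store only delays the drain step. The collector keeps accepting crashes as long as local disk is available.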
Create build process
- Set up automated tests with Hudson bug 601501 (done)
- Create a build script bug 602559 (done)
- Write load testing guidelines and plan bug 605491
- Enable replication of production traffic to a mirror environment bug 599859
Improve release process
- New release management practices bug 598386 (in review)
- Developer access to stage bug 598750 (done)
- Developer access to prod bug 598750
- Simplify configuration bug 598734
- Introduce config management bug 598747
- Access to prod configs for devs prior to release bug 598748
- Better staging environment (for next release, PHX) bug 497026
Improve insight into systems
- Improve monitoring
- Add compaction queue monitors (done)
- Add collector log monitors bug 604397
- Build an ops dashboard bug 605495
- On-call roster for Metrics bug 598746
Move to new, bigger hardware in PHX (updated 11/8)
- Rack boxes [IT] - ETA 11/19
- Last delivery due 11/17 (processors, webapp, middleware); all else due 11/12
- Racking in progress
- Generate a diagram of required network connections for dmoore [dre, laura]
- Have puppet configs ready to go [rhelmer, jabba]
- Install software - ETA 12/1
- Sneakernet data transfer - by 12/1
- Load test - by 12/8
- QA 12/8 - 12/15
- Sneakernet data update 1/5
- GO-NO-GO Wednesday 1/5
- Cutover to new system Saturday 1/8