SocorroStabilityPlan

Improve stability

  • Solve compaction issue
    • Better compaction monitoring (done)
  • Upgrade to 0.89
  • Review system architecture with stability as a goal
    • Architecture meeting starting 10/25 [details]
    • Seek advice from experts (Cloudera, SU, PGExperts, Rabbit)
    • Review fallback storage requirements (done)
    • See section below for architectural changes
  • Hire {one, some} Hadoop Ops engineers PD
  • Schedule 3x weekly HBase cluster restarts to avoid unscheduled restarts

Modify architecture for improved stability

  • Change collectors to write to local storage by default bug 608734
  • Build a simpler HBase API (Q1 2011)
  • Investigate queues to use in front of HBase, behind API (Q1 2011)
    • Hazelcast: talk to reference users, build a prototype
    • Rabbit?
    • Memcache?
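
As a rough illustration of the first and third items above, here is a minimal sketch of a collector that always writes to local storage first and only then hands crash IDs to a queue that is drained into the primary store. Everything in it is an assumption made for illustration: the LocalFirstCollector class, the JSON spool files, and the in-process queue standing in for whichever queue (Hazelcast, Rabbit, Memcache) is eventually chosen. It is not Socorro's actual collector code.

```python
import json
import os
import queue
import uuid


class LocalFirstCollector:
    """Hypothetical collector: spool each crash to local disk, then queue its ID."""

    def __init__(self, spool_dir):
        self.spool_dir = spool_dir
        # In-process stand-in for the real queue that would sit in front of HBase.
        self.pending = queue.Queue()
        os.makedirs(spool_dir, exist_ok=True)

    def _path(self, crash_id):
        return os.path.join(self.spool_dir, crash_id + ".json")

    def submit(self, crash):
        """Accept a crash without touching the primary store at all."""
        crash_id = uuid.uuid4().hex
        with open(self._path(crash_id), "w") as f:
            json.dump(crash, f)
        self.pending.put(crash_id)
        return crash_id

    def drain(self, store):
        """Move queued crashes to `store(crash_id, crash)`.

        If the store raises, the crash stays on disk and is requeued,
        so an outage of the primary store does not lose submissions.
        Returns the number of crashes moved.
        """
        moved = 0
        failed = []
        while True:
            try:
                crash_id = self.pending.get_nowait()
            except queue.Empty:
                break
            with open(self._path(crash_id)) as f:
                crash = json.load(f)
            try:
                store(crash_id, crash)
            except Exception:
                failed.append(crash_id)
                continue
            os.remove(self._path(crash_id))
            moved += 1
        for crash_id in failed:
            self.pending.put(crash_id)
        return moved
```

The point of the design is that submit() never depends on the primary store being up; only drain() does, and it can be retried whenever the store recovers.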

Create build process

  • Set up automated tests with Hudson bug 601501 (done)
  • Create a build script bug 602559 (done)
  • Write load testing guidelines and plan bug 605491
    • Enable replication of production traffic to a mirror environment bug 599859

Improve release process

Improve insight into systems

  • Improve monitoring
    • Add compaction queue monitors (done)
    • Add collector log monitors bug 604397
  • Build an ops dashboard bug 605495
  • On-call roster for Metrics bug 598746
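
To make the monitoring items above more concrete, the following is a small sketch of how a compaction queue check might classify recent queue sizes. The function name, the thresholds, and the "sustained" window are all hypothetical, and the return values simply follow the common Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL). How the queue size is actually collected from HBase is left out.

```python
# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_compaction_queue(samples, warn=10, crit=25, sustained=3):
    """Classify recent compaction queue sizes (oldest first).

    A level is reported only when the last `sustained` samples all reach
    its threshold, so a single spike does not raise an alert.
    The default thresholds are placeholders, not tuned values.
    """
    recent = samples[-sustained:]
    if len(recent) < sustained:
        return OK  # not enough data to judge yet
    if all(s >= crit for s in recent):
        return CRITICAL
    if all(s >= warn for s in recent):
        return WARNING
    return OK
```

Requiring several consecutive high samples is one way to keep an on-call roster from being paged for short-lived bursts; the right window would depend on how often the metric is sampled.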

Move to new, bigger hardware in PHX (updated 11/8)

  • Rack boxes [IT] - ETA 11/19
    • Last delivery due 11/17 (processors, webapp, mware), all else 11/12
    • Racking in progress
  • Generate a diagram of required network connections for dmoore [dre, laura]
  • Have puppet configs ready to go [rhelmer, jabba]
  • Install software - ETA 12/1
  • Sneakernet data transfer - by 12/1
  • Load test - by 12/8
  • QA - 12/8 to 12/15
  • Sneakernet data update 1/5
  • GO-NO-GO Wednesday 1/5
  • Cutover to new system Saturday 1/8