Socorro:MigrationPostmortem
Schedule / Location / Call Information
- Thursday, 2011-01-27 @ 1:30 pm PST
- Location: The Bridge
- 650-903-0800 x92 Conf# 362 (US/INTL)
- 1-800-707-2533 (pin 369) Conf# 362 (US)
- join irc.mozilla.org #socorro for back channel
Overview
- Stability Plan included migration to Phoenix
- Timeline leading up to migration
- Migration checklist
Things that went right
- Teamwork between Dev/IT was the best Laura has seen (anywhere)
- Smoke/load testing gave us a good level of confidence
- Getting configs into puppet
- Instrumentation via nagios/ganglia helped us find problems during tests
- Actual release day went great - total anticlimax
- Having a unified task list with dates leading up to the migration
- Having a checklist and rollback plan on the day
- QA tests on WebUI were passing for days beforehand
Things that went wrong
- HBase data sync was not completed and verified until just a few days before the migration
- Various network configuration issues through the last week: Zeus slowness, bonding setup, etc.
- Getting the backlog in was going to take a long time. We corrected our approach on Sunday night (and finished back processing before Monday a.m.), but we should have acted on this earlier.
- Difficulties getting correct ADUs due to an unrelated problem with Vertica; poor timing here (SJC was broken in the same way as PHX)
- Missed a cron job despite multiple audits (signatureProductDims)
- Upgrade to RHEL6 coincided with hardware/software architecture change.
- Checklist could have been followed more closely
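The missed cron job (signatureProductDims) is the kind of gap a mechanical comparison can catch alongside manual audits. A minimal sketch, assuming `crontab -l` output has been saved from both the old and the new environment; the function names and the script paths in the usage example are hypothetical, not taken from the actual Socorro setup:

```python
def cron_commands(crontab_text):
    """Return the set of commands found in a crontab dump, ignoring schedules."""
    commands = set()
    for line in crontab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith("@"):
            # Shortcut schedules such as "@daily command".
            fields = line.split(None, 1)
            if len(fields) == 2:
                commands.add(fields[1])
            continue
        # Standard entries: five schedule fields followed by the command.
        fields = line.split(None, 5)
        if len(fields) < 6:
            # Most likely an environment assignment (e.g. MAILTO=...); skip it.
            continue
        commands.add(fields[5])
    return commands


def missing_jobs(old_crontab, new_crontab):
    """List commands scheduled in the old environment but absent from the new one."""
    return sorted(cron_commands(old_crontab) - cron_commands(new_crontab))


# Hypothetical crontab dumps for illustration only.
old = """# old environment
MAILTO=ops@example.com
0 2 * * * /hypothetical/signatureProductDims.sh
*/5 * * * * /hypothetical/monitor.sh
"""
new = """*/5 * * * * /hypothetical/monitor.sh
"""
print(missing_jobs(old, new))
```

This compares commands only, so a job whose schedule changed but whose command is the same would not be flagged; it is meant to supplement the audits, not replace them.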
Suggested Improvements
- Better communication with Netops:
- get a unified list of requested changes in order before requesting a change
- get notice of concurrent changes in the network environment
- Better communication to groups outside release-drivers. Actions bug 628318:
- Get Socorro on status.mozilla.org
- Set up a Socorro specific blog for maintenance/downtime information
- devs and ops have different bugzilla workflows (discussion versus unit-of-work action)
- file ops bugs as clear action items, spun off from dev bug or tracking bug
- need more insight into network/systems changes
- cannot keep everyone apprised of every change
- maybe better tools can help here
- Need better coordination of all staff with timeline
- Need better tools than bugzilla + google spreadsheets -- a professional project management tool is called for
- Mozilla is aware of and working on this, agreement that Bugzilla is not ideal
- Need to have a manager in charge for every day of the timeline, even when people are on vacation
- If we ever have a project this big again, consider hiring a professional project management consultant.
Actions
- xstevens to publicize MapReduce Backup job (alternative to distcp)
- cshields to publicize network problems (esp. Broadcom driver bug)
- netops to run a training session at the next onsite