Socorro:MigrationPostmortem

From MozillaWiki

Schedule / Location / Call Information

  • Thursday, 2011-01-27 @ 1:30 pm PST
  • Location: The Bridge
  • 650-903-0800 x92 Conf# 362 (US/INTL)
  • 1-800-707-2533 (pin 369) Conf# 362 (US)
  • Join #socorro on irc.mozilla.org for the back channel

Overview

Things that went right

  • Teamwork between Dev/IT was the best Laura has seen (anywhere)
  • Smoke/load testing gave us a good level of confidence
  • Getting configs into puppet
  • Instrumentation via nagios/ganglia helped us find problems during tests
  • Actual release day went great - total anticlimax
  • Having a unified task list with dates leading up to the migration
  • Having a checklist and rollback plan on the day
  • QA tests on WebUI were passing for days beforehand

Things that went wrong

  • HBase data sync was not complete and verified until just a few days before the migration
  • Various network configuration issues throughout the last week: Zeus slowness, bonding setup, etc.
  • Getting backlog in was going to take a long time. We corrected our approach on Sunday night (and finished back processing before Monday a.m.), but we should have acted on this earlier.
  • Difficulties getting correct ADUs due to an unrelated problem with Vertica; poor timing here (SJC was broken in the same way as PHX)
  • Missed a cron job despite multiple audits (signatureProductDims)
  • Upgrade to RHEL6 coincided with hardware/software architecture change.
  • Checklist could have been followed more closely
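
A missed cron job like signatureProductDims is the kind of gap a mechanical comparison can catch alongside manual audits. The following is only a minimal sketch, assuming crontab listings from the old and new environments are available as text (for example, from `crontab -l`). The function names, script paths, and sample entries are hypothetical and are not taken from the actual Socorro configuration.

```python
def parse_crontab(text):
    """Return the set of commands scheduled in a crontab listing.

    Comments, blank lines, and environment assignments are skipped.
    Only the command is kept, so a job still matches if its schedule changed.
    """
    commands = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("@"):
            # Shorthand schedule such as "@daily command".
            parts = line.split(None, 1)
            if len(parts) == 2:
                commands.add(parts[1])
            continue
        # A standard entry has five schedule fields followed by the command.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # e.g. MAILTO=... or a malformed line
        commands.add(fields[5])
    return commands


def missing_jobs(old_text, new_text):
    """List commands present in the old crontab but absent from the new one."""
    return sorted(parse_crontab(old_text) - parse_crontab(new_text))


# Hypothetical sample data: the paths below are invented for illustration.
OLD_CRONTAB = """
MAILTO=ops@example.com
0 2 * * * /data/socorro/cron/signatureProductDims.sh
*/5 * * * * /data/socorro/cron/topCrashes.sh
@daily /data/socorro/cron/cleanup.sh
"""

NEW_CRONTAB = """
*/10 * * * * /data/socorro/cron/topCrashes.sh
@daily /data/socorro/cron/cleanup.sh
"""

if __name__ == "__main__":
    for command in missing_jobs(OLD_CRONTAB, NEW_CRONTAB):
        print("missing on new host:", command)
```

Comparing commands rather than whole lines means a job whose schedule was deliberately changed during the migration is not reported as missing. A command that was renamed or moved would still show up and need a manual check.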

Suggested Improvements

  • Better communication with Netops:
    • Put together a single, ordered list of requested changes before submitting any change request
    • Get notice of concurrent changes in the network environment
  • Better communication to groups outside release-drivers. Action items are in bug 628318:
    • Get Socorro on status.mozilla.org
    • Set up a Socorro specific blog for maintenance/downtime information
    • Devs and ops have different Bugzilla workflows (discussion versus unit-of-work action)
    • File ops bugs as clear action items, spun off from the dev bug or tracking bug
  • Need more insight into network/systems changes
    • It is not feasible to keep everyone apprised of every change
      • Better tools may help here
  • Need better coordination of all staff with timeline
    • Need better tools than bugzilla + google spreadsheets -- a professional project management tool is called for
      • Mozilla is aware of and working on this, agreement that Bugzilla is not ideal
    • Need to have a manager in charge for every day of the timeline, even when people are on vacation
    • If we ever have a project this big again, consider hiring a professional project management consultant.

Actions

  • xstevens to publicize MapReduce Backup job (alternative to distcp)
  • cshields to publicize network problems (esp. Broadcom driver bug)
  • netops to run a training session at the next onsite