Socorro:MigrationPostmortem

From MozillaWiki
Latest revision as of 22:48, 27 January 2011

Schedule / Location / Call Information

  • Thursday, 2011-01-27 @ 1:30 pm PST
  • Location: The Bridge
  • 650-903-0800 x92 Conf# 362 (US/INTL)
  • 1-800-707-2533 (pin 369) Conf# 362 (US)
  • Join #socorro on irc.mozilla.org for back channel

Overview

Things that went right

  • Teamwork between Dev/IT was the best Laura has seen (anywhere)
  • Smoke/load testing gave us a good level of confidence
  • Getting configs into puppet
  • Instrumentation via nagios/ganglia helped us find problems during tests
  • Actual release day went great - total anticlimax
  • Having a unified task list with dates leading up to the migration
  • Having a checklist and rollback plan on the day
  • QA tests on WebUI were passing for days beforehand

Things that went wrong

  • HBase data sync not complete and verified until just a few days before the migration
  • Various network configuration issues through the last week: Zeus slowness, bonding setup, etc.
  • Getting the backlog in was going to take a long time. We corrected our approach on Sunday night (and finished back processing before Monday a.m.), but we should have acted on this earlier.
  • Difficulties getting correct ADUs due to an unrelated problem with Vertica; poor timing here (SJC was broken in the same way as PHX)
  • Missed a cron job despite multiple audits (signatureProductDims)
  • Upgrade to RHEL6 coincided with hardware/software architecture change.
  • Checklist could have been followed more closely

Suggested Improvements

  • Better communication with Netops:
    • Get a unified list of requested changes in order before requesting a change
    • Get notice of concurrent changes in the network environment
  • Better communication to groups outside release-drivers. Actions bug 628318:
    • Get Socorro on status.mozilla.org
    • Set up a Socorro-specific blog for maintenance/downtime information
    • Devs and ops have different Bugzilla workflows (discussion versus unit-of-work action)
    • File ops bugs as clear action items, spun off from the dev bug or tracking bug
  • Need more insight into network/systems changes
    • Cannot keep everyone apprised of every change
      • Maybe better tools can help here
  • Need better coordination of all staff with timeline
    • Need better tools than Bugzilla + Google spreadsheets; a professional project management tool is called for
      • Mozilla is aware of this and working on it; there is agreement that Bugzilla is not ideal
    • Need to have a manager in charge for every day of the timeline, even when people are on vacation
    • If we ever have a project this big again, consider hiring a professional project management consultant.

Actions

  • xstevens to publicize MapReduce Backup job (alternative to distcp)
  • cshields to publicize network problems (esp. Broadcom driver bug)
  • netops to run a training session at the next onsite