Socorro:Releases/18

Upgrade Steps

This upgrade requires downtime. The upgrade itself will likely take only about half an hour, but we should schedule a 1-hour window just in case. Both the processors and the web UI should be down during the upgrade.

Database changes must be deployed first, followed by the UI and middleware changes.

FULL STEPS FOR UPGRADE

1PM: begin archival backup of database (mpressman). This must be complete before upgrade.

5PM: begin upgrade

  1. disable alerts (mpressman)
  2. stop monitor and processors (mpressman)
  3. stop cron jobs (mpressman)
  4. break replication between master01 and master02 and to ReplayDB (jberkus)
  5. fail over zeus from master01 to master02 (netops)
  6. upgrade database (jberkus)
  7. verify database (jberkus, mpressman)
  8. code push (ops)
  9. fail over zeus from master02 to master01 (netops)
  10. verify web application (QA)
  11. pass/no-pass decision (team); if no pass, see ROLLBACK PROCEDURE
  12. deploy cron job changes (rhelmer?)
  13. restart monitor and processors (mpressman)
  14. restart cron jobs (mpressman)
  15. backfill missing hours (jberkus)
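
Step 15 covers the hours that were not processed while the cron jobs were stopped. As a rough illustration only, the sketch below lists the hourly buckets that overlap a downtime window. It assumes backfill is done per clock hour; the function name `missing_hours` and the timestamps are hypothetical and are not taken from the Socorro code or from the actual upgrade schedule.

```python
from datetime import datetime, timedelta


def missing_hours(down_start, down_end):
    """Return the start of every clock hour that overlaps the downtime window.

    Hypothetical helper: assumes aggregates are built per clock hour, so any
    hour touched by the downtime needs to be backfilled.
    """
    if down_end <= down_start:
        return []
    # Truncate to the top of the hour in which the downtime began.
    hour = down_start.replace(minute=0, second=0, microsecond=0)
    hours = []
    while hour < down_end:
        hours.append(hour)
        hour += timedelta(hours=1)
    return hours


# Example with placeholder timestamps: a downtime that spills past the hour
# touches two hourly buckets.
for h in missing_hours(datetime(2012, 1, 1, 17, 10), datetime(2012, 1, 1, 18, 5)):
    print(h.isoformat())
```

Listing the buckets explicitly makes it easy to confirm after the backfill that none were skipped.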

ROLLBACK PROCEDURE

  1. fail over zeus to master02
  2. roll back code push
  3. roll back cron job changes.

Post-upgrade cleanup

  1. resync master02 from master01 and restart replication (mpressman or jberkus)
  2. restore nagios/ganglia checks (mpressman)

Database Upgrade

We will need to coordinate with NetOps on the failover (bug 790711) and with IT on the code push (bug 790707).

IMPORTANT: Several hours before the upgrade, we need to do a full archival backup of the pre-Mobeta database (bug 790705).

This database upgrade involves a number of irreversible changes. As such, two additional steps are required before deploying it:

  1. Take a final archival "pre-mobeta" offsite backup. See bug: https://bugzilla.mozilla.org/show_bug.cgi?id=762305
  2. Disable replication to master02 before running the upgrade, and do not restore it until after QA verification. This may then require a full resync of master02.

This database upgrade takes about half an hour to run, assuming a 2-week backfill of the new matviews. Should we decide to backfill more than 2 weeks, the backfill window can be adjusted by editing upgrade.sh.
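
To make the backfill window concrete, here is a minimal sketch that computes the date range for a backfill of a given number of weeks. How upgrade.sh actually expresses the window is not shown on this page, so the function name `backfill_range`, its parameters, and the example date are assumptions for illustration only.

```python
from datetime import date, timedelta


def backfill_range(end_day, weeks=2):
    """Return (start_day, end_day) for a backfill covering `weeks` weeks.

    Hypothetical helper: the default of 2 weeks mirrors the assumption in the
    text; it does not reflect the real contents of upgrade.sh.
    """
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    return end_day - timedelta(weeks=weeks), end_day


# Example with a placeholder end date.
start, end = backfill_range(date(2012, 1, 15), weeks=2)
print(start.isoformat(), end.isoformat())
```

A longer window lengthens the upgrade run, so the scheduled downtime should be revisited if the default is changed.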

This database upgrade is lock-sensitive, so it requires the downtime described above. The downtime must cover the processors, monitor, web application, and cron jobs.

Procedure for minimal downtime upgrade:

  1. stop monitor, processors, cron jobs
  2. break replication between master01 and master02
  3. fail over to master02
  4. upgrade master01
  5. fail back to master01
  6. QA master01
  7. if it passes, resync master02.
    • otherwise, fail over to master02.
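
The procedure above branches only at the QA step. The sketch below models it as an ordered list of steps whose final entry depends on the QA result. It is an illustration of the sequence, not a deployment script: the function name `upgrade_plan` is hypothetical and nothing here touches real hosts.

```python
def upgrade_plan(qa_passed):
    """Return the ordered steps of the minimal-downtime procedure.

    Hypothetical helper: the step text is copied from the list above and the
    last step depends on whether QA of master01 passed.
    """
    steps = [
        "stop monitor, processors, cron jobs",
        "break replication between master01 and master02",
        "fail over to master02",
        "upgrade master01",
        "fail back to master01",
        "QA master01",
    ]
    if qa_passed:
        # master02 still holds the pre-upgrade data, so it is resynced.
        steps.append("resync master02")
    else:
        # master02 was never upgraded, so it serves as the fallback.
        steps.append("fail over to master02")
    return steps


for number, step in enumerate(upgrade_plan(qa_passed=False), start=1):
    print(f"{number}. {step}")
```

Breaking replication before the upgrade is what keeps master02 usable as a fallback, which is why it comes before the failover in the sequence.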

Cron job adjustments

  • remove oldtcbs cron (cron_aggregates.sh) for bug 778255
    • dev - Sep 5 09:04
    • stage - TODO
    • prod - TODO

Config changes

  • bug 789410 changes the daily.php-dist file. The updated file needs to be checked into puppet, where it will be stored at /etc/socorro/web/ and then copied out to the socorro install by the deploy scripts
    • dev - Sep 12 09:29
    • stage - bug 790646
    • prod - TODO