Socorro:Releases/18
Latest revision as of 19:24, 12 September 2012
Upgrade Steps
This upgrade requires downtime. The downtime will likely be only half an hour, but we should schedule one hour just in case. Both the processors and the web UI should be down during the upgrade.
Database changes need to be deployed first, then UI/middleware changes.
FULL STEPS FOR UPGRADE
1PM: begin archival backup of database (mpressman). This must be complete before upgrade.
5PM: begin upgrade
- disable alerts (mpressman)
- stop monitor and processors (mpressman)
- stop cron jobs (mpressman)
- break replication between master01 and master02 and to ReplayDB (jberkus)
- fail over zeus from master01 to master02 (netops)
- upgrade database (jberkus)
- verify database (jberkus, mpressman)
- code push (ops)
- fail over zeus from master02 to master01 (netops)
- verify web application (QA)
- decide whether the upgrade passes (team); if it does not, see ROLLBACK PROCEDURE
- deploy cron job changes (rhelmer?)
- restart monitor and processors (mpressman)
- restart cron jobs (mpressman)
- backfill missing hours (jberkus)
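The step list above can be kept as data so that each owner can pull out their own steps. This is only a sketch: the steps and owners are copied from the list, while the names `UPGRADE_STEPS` and `steps_by_owner` are made up for illustration and are not part of Socorro.

```python
from collections import defaultdict

# (step, owners) pairs copied from the list above, in execution order.
# "rhelmer?" keeps the question mark from the source: that owner is unconfirmed.
UPGRADE_STEPS = [
    ("disable alerts", ["mpressman"]),
    ("stop monitor and processors", ["mpressman"]),
    ("stop cron jobs", ["mpressman"]),
    ("break replication between master01 and master02 and to ReplayDB", ["jberkus"]),
    ("fail over zeus from master01 to master02", ["netops"]),
    ("upgrade database", ["jberkus"]),
    ("verify database", ["jberkus", "mpressman"]),
    ("code push", ["ops"]),
    ("fail over zeus from master02 to master01", ["netops"]),
    ("verify web application", ["QA"]),
    ("decide to pass or not", ["team"]),
    ("deploy cron job changes", ["rhelmer?"]),
    ("restart monitor and processors", ["mpressman"]),
    ("restart cron jobs", ["mpressman"]),
    ("backfill missing hours", ["jberkus"]),
]


def steps_by_owner(steps=UPGRADE_STEPS):
    """Map each owner to the 1-based numbers of the steps they are responsible for."""
    result = defaultdict(list)
    for number, (_description, owners) in enumerate(steps, start=1):
        for owner in owners:
            result[owner].append(number)
    return dict(result)


if __name__ == "__main__":
    for owner, numbers in steps_by_owner().items():
        print(owner, numbers)
```

Keeping the order in one list makes it easy to see that netops is needed twice, once on each side of the database upgrade and code push.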
ROLLBACK PROCEDURE
- fail over zeus to master02
- roll back code push
- roll back cron job changes.
Post-upgrade cleanup
- resync master02 from master01 and restart replication (mpressman or jberkus)
- restore nagios/ganglia checks (mpressman)
Database Upgrade
We will need to coordinate with NetOps about failover (bug 790711) and with IT for the code push (bug 790707).
IMPORTANT: Several hours before the upgrade, we need to do a full archival backup of the pre-Mobeta database (bug 790705).
This database upgrade involves a number of irreversible changes. As such, two additional steps are required before deploying it:
- Archival final "pre-mobeta" offsite backup. See bug: https://bugzilla.mozilla.org/show_bug.cgi?id=762305
- Disable replication to master02 before running the upgrade, and do not restore it until after QA verification. This may then require a full resync of master02.
This database upgrade takes around 1/2 hour to run, assuming a 2-week backfill of the new matviews. Should we decide to do more than 2 weeks of backfill, this can be adjusted by editing upgrade.sh.
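To see what a longer backfill would cover, the window can be listed day by day. This is a rough sketch only: the page does not show how upgrade.sh expresses the backfill length, so the function name `backfill_dates` and its parameters are hypothetical, and the end date in the example is simply the date of this revision.

```python
from datetime import date, timedelta


def backfill_dates(end, weeks=2):
    """Return every date in the backfill window, oldest first, ending at `end` inclusive.

    `weeks` defaults to the 2-week backfill assumed in the text above.
    """
    days = weeks * 7
    return [end - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


if __name__ == "__main__":
    window = backfill_dates(date(2012, 9, 12))
    print(len(window), "days from", window[0], "to", window[-1])
```

Run time should grow with the number of days in the window, so a longer backfill also means revisiting the half-hour estimate.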
This database upgrade is lock-sensitive. As such, it requires the downtime described above, which must cover the processors, monitor, web application, and cron jobs.
Procedure for minimal downtime upgrade:
- stop monitor, processors, cron jobs
- break replication between master01 and master02
- fail over to master02
- upgrade master01
- fail back to master01
- QA master01
- if it passes, resync master02.
  - otherwise, fail over to master02.
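The only branch in the procedure above is the QA result. A minimal sketch of that branching, with the step text copied from the list and a hypothetical function name, looks like this:

```python
def minimal_downtime_plan(qa_passed):
    """Return the ordered steps of the minimal-downtime upgrade for a given QA outcome."""
    steps = [
        "stop monitor, processors, cron jobs",
        "break replication between master01 and master02",
        "fail over to master02",
        "upgrade master01",
        "fail back to master01",
        "QA master01",
    ]
    # master02 is untouched by the upgrade, so it stays available as the fallback
    # until QA passes; only then is it resynced from the upgraded master01.
    steps.append("resync master02" if qa_passed else "fail over to master02")
    return steps


if __name__ == "__main__":
    for outcome in (True, False):
        print("QA passed:", outcome, "->", minimal_downtime_plan(outcome)[-1])
```

This is why replication must stay broken until QA verification: resyncing master02 earlier would remove the only copy of the pre-upgrade database that can still serve traffic.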
Cron job adjustments
- remove oldtcbs cron (cron_aggregates.sh) for bug 778255
  - dev - Sep 5 09:04
  - stage - TODO
  - prod - TODO
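Removing the cron entry amounts to dropping the crontab lines that call cron_aggregates.sh. The sketch below shows that filter on crontab text. The function name, the sample schedule, and the `/path/to/` locations are assumptions for illustration, not the real Socorro crontab.

```python
def drop_cron_entries(crontab_text, script_name):
    """Return crontab_text without the active entries that invoke script_name.

    Comment lines are kept even if they mention the script.
    """
    kept = []
    for line in crontab_text.splitlines():
        is_comment = line.lstrip().startswith("#")
        if script_name in line and not is_comment:
            continue  # drop the active entry
        kept.append(line)
    trailing_newline = "\n" if crontab_text.endswith("\n") else ""
    return "\n".join(kept) + trailing_newline


if __name__ == "__main__":
    # Hypothetical crontab contents.
    sample = (
        "0 * * * * /path/to/cron_hourly.sh\n"
        "5 0 * * * /path/to/cron_aggregates.sh\n"
    )
    print(drop_cron_entries(sample, "cron_aggregates.sh"), end="")
```

The same filter can be run against dev, stage, and prod crontabs in turn, which matches the per-environment checklist above.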
Config changes
- bug 789410 makes changes to the daily.php-dist file - this needs to be checked into puppet, where it will be stored at /etc/socorro/web/ and then copied out to the socorro install using deploy scripts
  - dev - Sep 12 09:29
  - stage - bug 790646 (struck through)
  - prod - TODO
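The copy step that the deploy scripts perform can be sketched as below. This is an illustration under assumptions, not the actual deploy script: the function name, the default file name `daily.php`, and the destination directory are hypothetical, and only /etc/socorro/web/ comes from the text above.

```python
import shutil
from pathlib import Path


def copy_config(source_dir, install_config_dir, filename="daily.php"):
    """Copy one puppet-managed config file into the install's config directory.

    Raises FileNotFoundError if the managed file is missing, so a deploy does
    not silently continue with a stale config.
    """
    source = Path(source_dir) / filename
    if not source.is_file():
        raise FileNotFoundError(f"missing managed config: {source}")
    destination = Path(install_config_dir) / filename
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)  # copy2 also preserves timestamps
    return destination
```

A call would look like `copy_config("/etc/socorro/web", install_config_dir)`, where `install_config_dir` is wherever the webapp's config directory lives on that host.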