ReleaseEngineering/How To/Bring Buildbot Masters Up After an Outage

We had a data centre event on June 9, 2015 (bug 1172666), which caused database corruption (bug 1172750). Recovery required restarting all of our buildbot masters.

Our systems do not scale up well in these circumstances. The steps to bring them up in an organized fashion are:

Bring up scheduling masters, then some build masters, then some test masters
Run test build and test jobs
Modify the cloud tools code so that instance generation is not held back by default: in build-cloud-tools/scripts/aws_watch_pending.py on the aws-manager, modify find_prev_latest_amis_needed so that the maximum number of AMIs is requested
Remove stale jobs from the outage window from the buildbot scheduling db
Bring up remaining test and build masters gradually
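
The ordering in the steps above can be sketched as a small planning helper. This is only an illustration, not part of our tooling: the function name plan_bringup, the role labels, the master names, and the initial_per_role and batch_size values are all assumptions made for the example, and actually starting a master is left out entirely. The two checkpoint stages stand in for the manual steps (running test jobs, then the cloud tools change and stale job cleanup).

```python
from collections import defaultdict

# Roles in the order they are first brought up, following the steps above.
# The role labels themselves are hypothetical.
ROLE_ORDER = ("scheduler", "build", "test")


def plan_bringup(masters, initial_per_role=2, batch_size=2):
    """Return an ordered list of (stage_name, [master names]) tuples.

    masters is an iterable of (name, role) pairs, with role in ROLE_ORDER.
    Stages with an empty list are manual checkpoints, not masters to start.
    """
    by_role = defaultdict(list)
    for name, role in masters:
        if role not in ROLE_ORDER:
            raise ValueError(f"unknown role {role!r} for master {name!r}")
        by_role[role].append(name)

    stages = []
    remaining = []
    for role in ROLE_ORDER:
        names = by_role[role]
        # All scheduling masters come up first; only a few of the others.
        first = names if role == "scheduler" else names[:initial_per_role]
        if first:
            stages.append((f"initial {role}", first))
        remaining.extend(names[len(first):])

    # Manual checkpoints before the rest of the masters are started.
    stages.append(("checkpoint: run test build and test jobs", []))
    stages.append(("checkpoint: cloud tools change, stale job cleanup", []))

    # Bring up whatever is left in small batches.
    for i in range(0, len(remaining), batch_size):
        stages.append(("gradual", remaining[i:i + batch_size]))
    return stages
```

Calling plan_bringup with a list of (name, role) pairs and printing each stage gives an operator a checklist to work through; a smaller batch_size makes the final bring-up more gradual.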
