ReleaseEngineering/How To/Bring Buildbot Masters Up After an Outage



We had a data centre event on June 9, 2015 (bug 1172666), which caused database corruption (bug 1172750). Recovering from it required restarting all of our buildbot masters.

Our systems do not scale up well in these circumstances. The steps to bring them up in an organized fashion are:

  1. Bring up scheduling masters, then some build masters, then some test masters
  2. Run a few trial build jobs and test jobs to confirm the masters are working
  3. Modify the cloud tools code so that instance generation is not held back by default: in build-cloud-tools/scripts/aws_watch_pending.py on the aws-manager, change find_prev_latest_amis_needed so that the maximum number of AMIs is requested
  4. Remove stale jobs from the outage window from the buildbot scheduling db
  5. Bring up remaining test and build masters gradually
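
The ordering above can be sketched as a small planning helper. This is only an illustration, not an existing releng tool: the Master tuple, the plan_bringup function, the role names, and the batch sizes are all hypothetical, and the sketch only computes the order in which masters would be started. It does not start anything.

```python
from collections import namedtuple

# Hypothetical record: role is one of "scheduler", "build", or "test".
Master = namedtuple("Master", ["name", "role"])


def plan_bringup(masters, initial_per_role=2, batch_size=2):
    """Return an ordered list of (label, [master names]) stages.

    Schedulers come first, then a few build masters, then a few test
    masters, then a checkpoint for steps 2-4, then the remaining
    masters in small batches.
    """
    schedulers = [m.name for m in masters if m.role == "scheduler"]
    builds = [m.name for m in masters if m.role == "build"]
    tests = [m.name for m in masters if m.role == "test"]

    stages = []
    if schedulers:
        stages.append(("schedulers", schedulers))
    if builds[:initial_per_role]:
        stages.append(("initial build", builds[:initial_per_role]))
    if tests[:initial_per_role]:
        stages.append(("initial test", tests[:initial_per_role]))

    # Placeholder for the manual work: run trial jobs, adjust the cloud
    # tools code, and clear stale jobs before starting anything else.
    stages.append(("checkpoint", []))

    # Everything not started yet is brought up gradually, batch by batch.
    remaining = builds[initial_per_role:] + tests[initial_per_role:]
    for i in range(0, len(remaining), batch_size):
        label = "remaining batch %d" % (i // batch_size + 1)
        stages.append((label, remaining[i:i + batch_size]))
    return stages


if __name__ == "__main__":
    # Made-up master names, for illustration only.
    example = [
        Master("sched-1", "scheduler"),
        Master("build-1", "build"),
        Master("build-2", "build"),
        Master("test-1", "test"),
        Master("test-2", "test"),
        Master("test-3", "test"),
    ]
    for label, names in plan_bringup(example, initial_per_role=1):
        print(label, names)
```

Keeping the checkpoint as an explicit stage makes it harder to skip the trial jobs and the stale-job cleanup before the bulk of the masters come back and start claiming work.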