ReleaseEngineering/How To/Restart Buildbot Masters
We occasionally need to restart buildbot masters for various reasons:
- upgrades to the underlying OS
- gradual increase in memory usage over time, leading to reduced master performance
There's a Nagios check in place reminding us when the masters need to be rebooted:
<nagios-releng> Wed 23:19:55 UTC  [moc] nagios1.private.releng.scl3.mozilla.com:buildbot-master-machines-buildbot_age cluster is WARNING:CLUSTER WARNING: buildbot-master-machines-buildbot_age cluster: 0 ok, 41 warning, 0 unknown, 0 critical (http://m.mozilla.org/buildbot-master-machines-buildbot_age+cluster)
If you need to restart a single master by hand, here's the sequence you should follow:
- disable the master in slavealloc. This prevents the master from taking more slave connections while you're waiting for it to shut down.
- click the "Clean Shutdown" button on the web interface for the given master, e.g. http://buildbot-master82.bb.releng.scl3.mozilla.com:8001/
- wait for the jobs currently running on that master to complete. You can track progress in two ways:
  - search in-page for "Running" on the master's buildslaves page, e.g. http://buildbot-master82.bb.releng.scl3.mozilla.com:8001/buildslaves?no_builders=1
  - check BuildAPI for the list of running jobs and look for the ones corresponding to that master
- once the master is shut down, perform whatever upgrades are required, etc.
- restart the master. NOTE: buildbot masters are configured to restart buildbot automatically on boot, so if you reboot the master, buildbot will restart itself. To restart manually:
    xebec:buildduty ccooper$ ssh cltbld@buildbot-master82
    Unauthorized access prohibited
    [cltbld@buildbot-master82 ~]$ cd /builds/buildbot/build1/
    [cltbld@buildbot-master82 build1]$ make start
- re-enable the master in slavealloc.
- if you use the restart script described below and it gets interrupted for any reason, don't forget to remove the reconfig.lock file.
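The "search in-page for Running" check from the manual steps can be approximated in code. The following is a minimal sketch, not part of restart_masters.py: the function names are hypothetical, and it assumes the buildslaves page marks each busy slave with the literal word "Running", which may not match the real page layout exactly.

```python
import re
from urllib.request import urlopen


def count_running(page_html: str) -> int:
    """Count whole-word, case-sensitive occurrences of "Running" in page text."""
    return len(re.findall(r"\bRunning\b", page_html))


def fetch_buildslaves_page(master_url: str, timeout: float = 30.0) -> str:
    """Fetch a master's buildslaves page (needs network access to the master).

    master_url is the master's web interface root, for example
    http://buildbot-master82.bb.releng.scl3.mozilla.com:8001/
    """
    url = master_url.rstrip("/") + "/buildslaves?no_builders=1"
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

A count of zero from `count_running(fetch_buildslaves_page(...))` would suggest that no jobs are still running on that master, but it is worth cross-checking against BuildAPI before proceeding.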
The manual steps above have been encapsulated into a script: https://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/restart_masters.py. The script is set up to run on buildduty-tools.srv.releng.usw2.mozilla.com, located at /home/buildduty/restart_masters/ under buildduty's account.
The wrapper script /home/buildduty/restart_masters.sh is used to update repositories and call restart_masters.py. The latter will prompt for the usernames and passwords it needs (ldap email for slavealloc, cltbld and root for ssh).
Here is an example invocation:
    # ssh -A buildduty@buildduty-tools
    $ screen -R restart_masters
    $ /home/buildduty/restart_masters.sh
You can enter nonsense for the cltbld and root passwords because key auth is used. For the username, however, you need to give the full LDAP email. Otherwise the shell prompt will reappear right after the script invocation and nothing will happen, and the papertrail logs will show an error like:
"ERROR - __main__ - Unable to retrieve masters from slavealloc. Check LDAP credentials.".
Forwarding your ssh agent is required for ssh access to the masters. If you use a timeout make sure it's sufficiently long for the script to repeat, but don't leave it running indefinitely. NB: rebooting masters by adding the -r arg to restart_masters.py is non-functional because root logins are disabled.
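The two requirements above (a full LDAP email as the username, and a forwarded ssh agent) can be checked before starting the script. This is a small sketch with hypothetical helper names; it is not something restart_masters.py is known to do. It assumes that a non-empty SSH_AUTH_SOCK environment variable indicates an available agent, which shows an agent socket is configured but does not prove the right keys are loaded.

```python
import os


def looks_like_full_ldap_email(username: str) -> bool:
    """Rough check that the username is a full email, not just a short login."""
    local, sep, domain = username.partition("@")
    return bool(local) and sep == "@" and "." in domain


def ssh_agent_available(env=None) -> bool:
    """True if SSH_AUTH_SOCK is set and non-empty (assumed sign of an agent)."""
    env = os.environ if env is None else env
    return bool(env.get("SSH_AUTH_SOCK"))


def preflight_problems(username: str, env=None) -> list:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    if not looks_like_full_ldap_email(username):
        problems.append("username should be the full LDAP email")
    if not ssh_agent_available(env):
        problems.append("no ssh agent found; connect with ssh -A")
    return problems
```

For example, `preflight_problems("user", {})` reports both problems, while a full email plus a populated SSH_AUTH_SOCK yields an empty list.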
The logs go to papertrail, outputting a progress report every 60 minutes. For a really minimal view (which may hide errors), see this filtered view.
You can send a SIGUSR1 to restart_masters.py to prompt an extra progress report; allow time for the current status check to complete first.
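From a shell, the signal can be sent with `kill -USR1 <pid>`. The sketch below illustrates the general mechanism in Python on a Unix-like system: one function sends SIGUSR1 to a given process ID, and a handler records the request. The handler and all names here are illustrative assumptions and are not taken from restart_masters.py, whose actual signal handling may differ.

```python
import os
import signal
import time

# Timestamps of progress-report requests received so far (illustrative only).
progress_requests = []


def _on_sigusr1(signum, frame):
    """Record that an extra progress report was requested."""
    progress_requests.append(time.time())


def install_progress_handler():
    """Register the SIGUSR1 handler; must be called from the main thread."""
    signal.signal(signal.SIGUSR1, _on_sigusr1)


def request_progress_report(pid: int):
    """Send SIGUSR1 to the process with the given PID."""
    os.kill(pid, signal.SIGUSR1)
```

A long-running script would call `install_progress_handler()` once at startup; an operator, or another process, would then call `request_progress_report(pid)` with the script's process ID.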
The above script requires sensitive credentials that shouldn't be stored on disk. For now, we're still running this script by hand.