ReleaseEngineering/Breakout Sessions/Rolling Restarts of Buildbot Masters
To discuss a long-term solution for when we need to do any of the following en masse:
- restart buildbot master services
- reboot buildbot master machines
- terminate and start buildbot AWS machines
If you care about this topic and are available at this time, it would be great to have you.
Sometimes we are required to restart our buildbot masters, e.g. for EC2 maintenance work (which may include shutting down a VM so that it starts up on different hardware), or at other times just to pick up some other system change.
At the moment we have no easy way to do this, and at times people sacrifice weekend time to get it done.
A big part of the problem is that we have to do this in a rolling fashion to keep throughput high: we only really want to restart one master (of a particular type) at a time. Another problem is that graceful restarts can take a *long* time, so doing this manually becomes a real pain. The more masters we have, the harder this becomes.
One radical solution might be to forget graceful restarts altogether, and see if our buildbot infrastructure is robust enough so that bringing down a master "gracelessly" (e.g. ssh root@<buildbotmaster> shutdown -r now) will correctly farm running jobs to alternative masters, with minimal disruption. This might allow us to even cron restarts directly on the masters (i.e. the root cron has its own reboot command, to run once per month, in a staggered arrangement so that several masters are not rebooting at the same time for a given buildbot master type).
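As a rough illustration of the staggered arrangement, the sketch below generates one root crontab line per master so that no two masters of the same type reboot on the same day of the month. The hostnames, the 03:00 run time, the restriction to days 1-28 and the function name are all assumptions made for the example, not anything agreed in this proposal.

```python
def staggered_reboot_schedule(masters_by_type, hour=3, minute=0):
    """Return {hostname: crontab line} with a distinct day of month per master of a type.

    Hypothetical helper: spreads the masters of each type evenly over days 1..28
    (days that exist in every month), so a type never has two masters rebooting
    on the same day.
    """
    schedule = {}
    for master_type, hosts in sorted(masters_by_type.items()):
        hosts = sorted(hosts)
        if not hosts:
            continue
        if len(hosts) > 28:
            raise ValueError(
                "more than 28 masters of type %r: cannot give each its own day"
                % master_type
            )
        step = 28 // len(hosts)  # gap in days between consecutive masters
        for index, host in enumerate(hosts):
            day = 1 + index * step
            # crontab fields: minute hour day-of-month month day-of-week command
            schedule[host] = f"{minute} {hour} {day} * * /sbin/shutdown -r now"
    return schedule


if __name__ == "__main__":
    # Made-up hostnames, purely for illustration.
    example = {
        "build": ["bm01", "bm02"],
        "tests": ["tm01", "tm02", "tm03", "tm04"],
    }
    for host, line in sorted(staggered_reboot_schedule(example).items()):
        print(host, "->", line)
```

Masters of different types may still share a day in this sketch, which matches the constraint above that only masters of the same type need to be kept apart.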
There are of course alternatives: cron jobs that do a graceful shutdown, or a centralised service that manages the reboots of masters remotely.
Before going too deep into theory, I'd like to see if a radical approach can at least be tested without too much impact, to see how effective/dangerous it may be. This *may* be a quick win without too much work if we can get agreement e.g. from sheriffs to just "pull the plug" on some masters and see how well buildbot handles it.
Come join us for the breakout, if this is a topic you might be interested in!
Release Engineering Room, Monday 29 September 2014, 8am Pacific Time
Meeting etherpad: https://etherpad.mozilla.org/BfQdzn4twP
- Attendees: catlee, arr, coop, pmoore, dustin, kmoir, Callek
- Apologies: nthomas (unsuitable time in NZ!)
We should spin off a project to take care of this. Goals would be:
- Use graceful restarts for test masters and build masters, to avoid wasting resources (if we kill masters aggressively, all slave jobs have to start from scratch)
- Scheduler master can be aggressively restarted - no losses, and has no UI to provide graceful restart interface anyway
- We should aim to build this into RelEngAPI in order to benefit from standard logging, distributed job scheduling with badpenny, job distribution with celery, a self-documenting API, authentication/authorisation controls etc
- We should have a mechanism to trigger restarts on-demand
- Need to integrate with slaveapi to disable masters while rebooting masters
- Need to integrate with nagios to downtime alerts while rebooting masters
- Need nagios monitoring to make sure service is/has been running
- At a minimum, aim to parallelise rebooting testers / builders / try builders / scheduler (only one scheduler). Ideally parallelise even further, based on slave pool (e.g. Windows tester masters can be rebooted at the same time as OS X tester masters), and potentially even across environments (staging / production)
- Graceful restarts issued via the manage_masters.py action can sometimes hang - need to make sure there are timeouts, and handle hangs intelligently
- On Amazon consider doing a full shutdown and restart, to potentially move to new hardware
- We might want to reboot some "buckets" on a different cadence to others - each bucket can have its own cadence
- During development/roll out, we probably want to do it on a Sunday night PDT so that NZ and Europe have a chance to look on Monday morning before things are too busy
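The goals above around timeouts and parallelising by bucket could be sketched roughly as follows. This is only an illustration under stated assumptions: the callables `start_graceful`, `is_finished` and `force_restart` are hypothetical hooks (in practice they might wrap manage_masters.py, slaveapi and nagios), and falling back to a forced restart after the timeout is just one possible way to "handle intelligently".

```python
import time
from concurrent.futures import ThreadPoolExecutor


def restart_with_timeout(master, start_graceful, is_finished, force_restart,
                         timeout_s, poll_s=30.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Request a graceful restart and force one if it has not finished in time.

    All three callables are hypothetical hooks supplied by the caller.
    Returns "graceful" or "forced".
    """
    start_graceful(master)
    deadline = clock() + timeout_s
    while True:
        if is_finished(master):
            return "graceful"
        if clock() >= deadline:
            break
        sleep(poll_s)
    # Assumed policy: a hung graceful restart is replaced by a forced one.
    force_restart(master)
    return "forced"


def rolling_restart(buckets, restart_one):
    """Restart masters one at a time within each bucket, with buckets in parallel.

    buckets: {bucket name: [master, ...]}, e.g. one bucket per master type
    or slave pool. restart_one: callable taking a master, returning an outcome.
    Returns {bucket name: [(master, outcome), ...]}.
    """
    def run_bucket(masters):
        # Sequential within a bucket, so only one master of a kind is down.
        return [(master, restart_one(master)) for master in masters]

    with ThreadPoolExecutor(max_workers=max(1, len(buckets))) as pool:
        futures = {name: pool.submit(run_bucket, masters)
                   for name, masters in buckets.items()}
        return {name: future.result() for name, future in futures.items()}
```

A caller would typically bind the hooks and timeout with `functools.partial(restart_with_timeout, ...)` and pass the result as `restart_one`. A per-bucket cadence could then be layered on top by invoking `rolling_restart` with only the buckets that are due.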
This meeting focussed on brainstorming a strategy, but did not include planning around a project to deliver these changes.