ReleaseEngineering/Breakout Sessions/Rolling Restarts of Buildbot Masters

From MozillaWiki
Revision as of 15:08, 25 September 2014 by Pmoore

Sometimes we need to restart our buildbot masters, e.g. for EC2 maintenance work (which may include shutting down a VM so that it starts up on different hardware), or simply to pick up some other system change.

At the moment we have no easy way to do this, and at times people sacrifice weekend time to get it done.

A big part of the problem is that we have to do this in a rolling fashion: to keep throughput high, we only really want to restart one master (of a particular type) at a time. Another problem is that graceful restarts can take a *long* time, so doing this manually becomes a real pain. The more masters we have, the harder this becomes.
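To make the "one master per type at a time" constraint concrete, here is a minimal sketch that splits a set of masters into restart waves, where each wave contains at most one master of any given type. The function name, the master types and the hostnames are all hypothetical and only for illustration; this is not an existing tool.

```python
from itertools import zip_longest
from typing import Dict, List


def rolling_waves(masters_by_type: Dict[str, List[str]]) -> List[List[str]]:
    """Group masters into waves with at most one master per type in each wave.

    masters_by_type maps a (hypothetical) master type to its hostnames.
    """
    # One sorted list of hosts per type, in a stable order.
    groups = [sorted(hosts) for _, hosts in sorted(masters_by_type.items())]
    # zip_longest takes the i-th host of every type; types with fewer
    # masters are padded with None, which is dropped again here.
    return [[host for host in wave if host is not None]
            for wave in zip_longest(*groups)]


if __name__ == "__main__":
    # Hypothetical inventory, not real hostnames.
    example = {"build": ["bm01", "bm02", "bm03"], "test": ["tm01", "tm02"]}
    for number, wave in enumerate(rolling_waves(example), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Each wave would have to finish (gracefully or not) before the next one starts, which is exactly where the long graceful restart times hurt.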

One radical solution might be to forget graceful restarts altogether, and see if our buildbot infrastructure is robust enough that bringing down a master "gracelessly" (e.g. ssh root@<buildbotmaster> shutdown -r now) will correctly farm running jobs out to alternative masters, with minimal disruption. This might even allow us to cron restarts directly on the masters (i.e. root's crontab on each master has its own reboot command, run once per month, staggered so that several masters of a given buildbot master type are not rebooting at the same time).
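As a rough sketch of the staggered cron idea, the snippet below generates one root crontab line per master, giving every master of a given type its own day of the month. The hostnames, the 03:00 reboot hour and the /sbin/shutdown path are assumptions made for illustration; masters of different types may share a day, since the constraint is only per type.

```python
from typing import Dict, List


def staggered_reboot_crontab(masters_by_type: Dict[str, List[str]],
                             hour: int = 3) -> Dict[str, str]:
    """Return one monthly reboot crontab line per master.

    Within a type, every master gets a distinct day of the month.
    """
    max_day = 28  # days 1..28 exist in every month
    lines: Dict[str, str] = {}
    for master_type, hosts in masters_by_type.items():
        if not hosts:
            continue
        if len(hosts) > max_day:
            raise ValueError(
                f"too many '{master_type}' masters for one reboot day each")
        # Spread the reboot days of this type evenly over the month.
        step = max_day // len(hosts)
        for index, host in enumerate(sorted(hosts)):
            day = 1 + index * step
            # crontab fields: minute hour day-of-month month day-of-week
            lines[host] = f"0 {hour} {day} * * /sbin/shutdown -r now"
    return lines


if __name__ == "__main__":
    # Hypothetical inventory, not real hostnames.
    example = {"build": ["bm01", "bm02", "bm03"], "test": ["tm01", "tm02"]}
    for host, line in sorted(staggered_reboot_crontab(example).items()):
        print(f"{host}: {line}")
```

Using only days 1 to 28 keeps the schedule valid in February; with more than 28 masters of one type, the hour would need to be staggered as well.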

There are of course alternatives: cron jobs that do a graceful shutdown, or a centralised service that manages the reboots of masters remotely.

Before going too deep into theory, I'd like to see if the radical approach can at least be tested without too much impact, to see how effective or dangerous it may be. This *may* be a quick win without too much work if we can get agreement, e.g. from sheriffs, to just "pull the plug" on some masters and see how well buildbot handles it.

Come join us for the breakout if this is a topic you might be interested in!

Release Engineering Room, Monday 29 September 2014, 8am Pacific Time