User:Ianconnolly/runner
So, following the example of Anhad, I've decided to document my own project here, both to make sure I've got it straight in my own head and to let other people correct any misunderstandings I may have.
Resources
- Original Bugzilla thread: https://bugzilla.mozilla.org/show_bug.cgi?id=712206
- Puppet code: https://github.com/catlee/puppet/compare/master...runner
- Catlee's Runner: https://github.com/catlee/runner
The Idea
"A lot of slave maintenance is done at the beginning of build and test jobs that could be done instead before even starting buildbot. things like doing clobbers, purging old directories, making sure the tools checkout is up-to-date. in the case where we have spare capacity in the pool this can be a slight win for end-to-end time since builds don't have to waste time cleaning up. in the case where we're maxed out, it's no worse than the current situation." -- catlee
So if I understand this correctly, build and test job configs are cluttered with pre-/post-flight logic that could be pulled out for performance and sanity wins. Catlee has built Runner, a framework for running a set of these tasks, and has also made a start on a Puppet module for deploying it.
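To make the idea concrete for myself, here is a minimal Python sketch of the shape of such a task runner: execute a directory of pre-flight scripts in order, before buildbot ever starts. This is not catlee's actual implementation or API; the tasks.d path, the naming scheme and the halt-on-failure behaviour are all assumptions of mine. See the Runner repo above for the real thing.

```python
#!/usr/bin/env python
"""Minimal sketch of the Runner idea: run a directory of pre-flight task
scripts, in order, before buildbot starts.

NOT catlee's actual implementation; the tasks.d path, the naming scheme
and the halt-on-failure behaviour are illustrative assumptions only.
"""
import os
import subprocess
import sys

TASKS_DIR = "/opt/runner/tasks.d"  # hypothetical task directory


def run_tasks(tasks_dir=TASKS_DIR):
    """Run each executable task in lexical order, e.g. 01-clobber,
    02-purge-old-dirs, 03-update-tools-checkout, and stop on failure."""
    for name in sorted(os.listdir(tasks_dir)):
        path = os.path.join(tasks_dir, name)
        if not os.access(path, os.X_OK):
            continue
        print("runner: starting task %s" % name)
        rc = subprocess.call([path])
        if rc != 0:
            print("runner: task %s failed (exit %d)" % (name, rc))
            return False
    return True


if __name__ == "__main__":
    sys.exit(0 if run_tasks() else 1)
```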
Short-term To-Dos
- Lint that Puppet code!
- Templatise the Puppet code, abstracting out the hard-coded paths
- Make Runner distributable
- Ensure Runner runs before buildbot
- Deploy Runner as-is for a quick win and to identify problems
As of June 24th, Runner 1.0 is deployed.
Bugs arising:
- Bug 1029777 - Redirect runner logging output to /var/log/runner.log
- Bug 1029704 - rc3.d/ symlinks should refresh on initscript updates for runner and buildbot
- Bug 1029903 - Stop update_shared_repos runner task from pulling Try
Long-term To-Dos/Stretch goals
Currently we reboot after almost every job on buildbot in order to do things like:
- Make sure we re-puppetize
- Clean up temporary files
- Make sure no old processes are running
- Clean up memory fragmentation
However, by rebooting, we cause some problems:
- We lose the filesystem cache between every job. In AWS this turns into lots of extra IO to read back the same files over and over after each reboot
- We waste 2-5 minutes per job doing a reboot
- Extra load on puppet masters
We can address nearly all of the issues we reboot for in pre-flight checks:
- Check if there are puppet (or AMI) changes that need to be applied
- We can still clean up temporary files
- We can kill stray processes
- I don't think memory fragmentation is an issue any more. We used to have problems on 32-bit linux machines that were up for a long time. Eventually they weren't able to link large libraries. All our build machines are 64-bit now I believe.
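Before getting into how runner would drive buildbot, here is a rough sketch of what those checks could look like as standalone tasks. This is my own illustration, not anything from the bug: the temp-file pattern, the process names and the puppet invocation are placeholder assumptions.

```python
#!/usr/bin/env python
"""Sketch of the non-reboot pre-flight checks listed above.

The temp-file pattern, the process names and the puppet invocation are
placeholder assumptions, not an agreed design.
"""
import glob
import os
import shutil
import subprocess


def clean_temp_files(pattern="/tmp/tmp*"):
    # Remove temporary files and directories left behind by the last job.
    for path in glob.glob(pattern):
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
        else:
            try:
                os.remove(path)
            except OSError:
                pass


def kill_stray_processes(names=("ssltunnel", "xpcshell")):
    # Kill leftover processes from the previous job (hypothetical names).
    for name in names:
        subprocess.call(["pkill", "-f", name])


def puppet_changes_pending():
    # One possible check: a no-op puppet run with --detailed-exitcodes
    # exits 2 when changes would have been applied.
    rc = subprocess.call(["puppet", "agent", "--test", "--noop",
                          "--detailed-exitcodes"])
    return rc == 2


if __name__ == "__main__":
    clean_temp_files()
    kill_stray_processes()
    if puppet_changes_pending():
        print("puppet changes pending; re-puppetize before the next job")
```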
This will require that 'runner' be in charge of starting and stopping buildbot. I imagine we'd do something like this:
- Run through pre-flight checks
- Start buildbot
- Watch twistd.log, make sure buildbot actually starts and connects to a master
- Initiate graceful shutdown of buildbot after X minutes (30?). There are ways to do this locally (e.g. by touching a shutdown.stamp file) instead of poking the buildbot master.
- Run any post-flight tasks
- Go back to beginning
From Bug 1028191 - Stop rebooting after every job
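To get my head around the control flow, here is a sketch of the runner-owns-buildbot loop described above. Only the overall shape (pre-flight, start buildbot, watch twistd.log, graceful shutdown via a stamp file, post-flight, repeat) comes from the bug; the basedir, the start-up command, the log line I look for and the 30-minute window are placeholders of mine.

```python
#!/usr/bin/env python
"""Sketch of the runner-owns-buildbot loop from bug 1028191.

Only the overall shape comes from the plan above; the basedir, the
start-up command, the log line and the 30-minute window are placeholders.
"""
import os
import subprocess
import time

BUILDBOT_DIR = "/builds/slave"                      # placeholder basedir
TWISTD_LOG = os.path.join(BUILDBOT_DIR, "twistd.log")
SHUTDOWN_STAMP = os.path.join(BUILDBOT_DIR, "shutdown.stamp")
START_BUILDBOT = ["/usr/local/bin/start-buildbot"]  # hypothetical wrapper that
                                                    # runs buildbot in the foreground
GRACEFUL_AFTER = 30 * 60                            # "X minutes (30?)"


def connected_to_master(timeout=120):
    # Watch twistd.log until the slave reports it has connected to a master.
    # The exact log line to look for is an assumption.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(TWISTD_LOG):
            with open(TWISTD_LOG) as log:
                if "Connected to" in log.read():
                    return True
        time.sleep(5)
    return False


def main():
    while True:
        # 1. Pre-flight checks (placeholder hook).
        subprocess.check_call(["runner-preflight"])

        # 2. Start buildbot and make sure it actually connects to a master.
        buildbot = subprocess.Popen(START_BUILDBOT)
        if not connected_to_master():
            buildbot.terminate()
        else:
            # 3. After X minutes, ask buildbot to shut down gracefully by
            #    touching a local stamp file instead of poking the master.
            time.sleep(GRACEFUL_AFTER)
            open(SHUTDOWN_STAMP, "w").close()
        buildbot.wait()

        # 4. Post-flight tasks (placeholder hook), then go back to the start.
        subprocess.call(["runner-postflight"])


if __name__ == "__main__":
    main()
```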
Stuff I Need
- Somewhere to test the puppet deployments: Catlee indicated I should ping whoever's buildduty (currently coop) - Thanks coop!
- SSH access to puppet masters - Thanks dustin!
- Help with slave CA/config problem - Thanks dustin!