User:Ianconnolly/runner


So, following the example of Anhad, I've decided to document my own project here, both to make sure I've got this straight in my own head and to let other people correct any misunderstandings I may have.

Resources

The Idea

"A lot of slave maintenance is done at the beginning of build and test jobs that could be done instead before even starting buildbot. things like doing clobbers, purging old directories, making sure the tools checkout is up-to-date. in the case where we have spare capacity in the pool this can be a slight win for end-to-end time since builds don't have to waste time cleaning up. in the case where we're maxed out, it's no worse than the current situation." -- catlee

So if I understand this correctly, build and test job configs are cluttered with pre-/post-flight logic that could be pulled out for performance and sanity wins. Catlee has built Runner, a framework for running a set of these maintenance tasks, and he's also made a start on a runner Puppet module for deploying it.
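To make the idea concrete, here's a rough sketch of what one such pre-flight task might look like, assuming (purely as an illustration, not a description of catlee's actual code) that Runner just executes standalone scripts and treats a non-zero exit as a failure. The paths and age threshold below are made up.

 #!/usr/bin/env python
 # Hypothetical pre-flight task: purge stale build directories so jobs
 # don't have to spend time cleaning up themselves. The directory and the
 # age threshold are illustrative, not the real production values.
 from __future__ import print_function
 import os
 import shutil
 import sys
 import time

 BUILDS_DIR = "/builds/slave"   # assumed location of per-builder directories
 MAX_AGE_DAYS = 7               # purge anything untouched for longer than this

 def purge_old_dirs():
     now = time.time()
     for name in os.listdir(BUILDS_DIR):
         path = os.path.join(BUILDS_DIR, name)
         if not os.path.isdir(path):
             continue
         age_days = (now - os.path.getmtime(path)) / 86400.0
         if age_days > MAX_AGE_DAYS:
             print("purging %s (%.1f days old)" % (path, age_days))
             shutil.rmtree(path, ignore_errors=True)

 if __name__ == "__main__":
     try:
         purge_old_dirs()
     except OSError as e:
         # A non-zero exit signals that this pre-flight step failed.
         print(e, file=sys.stderr)
         sys.exit(1)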

Short-term To-Dos

  • Lint that Puppet code!
  • Templatise the Puppet code, abstracting out the hard-coded paths
  • Make Runner distributable
  • Ensure Runner runs before buildbot (see the sketch after this list)
  • Deploy Runner as-is for a quick win and to identify problems.
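For the "runs before buildbot" item, the simplest wiring I can imagine is a small wrapper that runs every task and only then hands off to buildbot. This is only a sketch: the task directory layout and the buildslave start command are assumptions, not how Runner is actually invoked in production.

 #!/usr/bin/env python
 # Hypothetical wrapper: run all pre-flight tasks, then hand off to buildbot.
 # The task directory and the start command are assumptions for illustration.
 from __future__ import print_function
 import os
 import subprocess

 TASK_DIR = "/opt/runner/tasks.d"                          # assumed layout: one executable per task
 BUILDBOT_CMD = ["buildslave", "start", "/builds/slave"]   # assumed start command

 def run_tasks():
     for task in sorted(os.listdir(TASK_DIR)):
         path = os.path.join(TASK_DIR, task)
         if os.access(path, os.X_OK):
             print("running pre-flight task:", task)
             subprocess.check_call([path])   # a failing task aborts the start

 if __name__ == "__main__":
     run_tasks()
     os.execvp(BUILDBOT_CMD[0], BUILDBOT_CMD)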

As of June 24th, Runner 1.0 is deployed.

Bugs arising:

Long-term To-Dos/Stretch goals

Currently we reboot after almost every job on buildbot in order to do things like:

  • Make sure we re-puppetize
  • Clean up temporary files
  • Make sure no old processes are running
  • Clean up memory fragmentation

However, by rebooting, we cause some problems:

  • We lose the filesystem cache between every job. In AWS this turns into lots of extra I/O to read back the same files over and over after each reboot.
  • We waste 2-5 minutes per job doing a reboot.
  • We put extra load on the puppet masters.

We can address nearly all of the issues we reboot for with pre-flight checks (a rough sketch follows this list):

  • Check if there are puppet (or AMI) changes that need to be applied
  • We can still clean up temporary files
  • We can kill stray processes
  • I don't think memory fragmentation is an issue any more. We used to have problems on 32-bit Linux machines that had been up for a long time: eventually they weren't able to link large libraries. All our build machines are 64-bit now, I believe.
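As an illustration of the middle two items, the checks could be as simple as the following. The temp-file locations and the list of stray process names are guesses for the sketch, not the actual production values, and the puppet/AMI check is left out.

 # Hypothetical pre-flight tasks covering two of the reboot reasons above:
 # cleaning up temporary files and killing stray processes left over from
 # the previous job. Paths and process names are illustrative only.
 import glob
 import os
 import shutil
 import subprocess

 TMP_GLOBS = ["/tmp/tmp*", "/builds/slave/*/build/tmp*"]   # assumed locations
 STRAY_PROCS = ["firefox", "xpcshell", "ssltunnel"]        # assumed leftovers

 def clean_tmp():
     for pattern in TMP_GLOBS:
         for path in glob.glob(pattern):
             if os.path.isdir(path):
                 shutil.rmtree(path, ignore_errors=True)
             else:
                 try:
                     os.remove(path)
                 except OSError:
                     pass

 def kill_strays():
     for name in STRAY_PROCS:
         # pkill exits non-zero when nothing matches, which is fine here
         subprocess.call(["pkill", "-9", name])

 if __name__ == "__main__":
     clean_tmp()
     kill_strays()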

This will require that 'runner' be in charge of starting and stopping buildbot. I imagine we'd do something like this (roughly sketched in code below):

  • Run through pre-flight checks
  • Start buildbot
  • Watch twistd.log to make sure buildbot actually starts and connects to a master
  • Initiate graceful shutdown of buildbot after X minutes (30?). There are ways to do this locally (e.g. by touching a shutdown.stamp file) instead of poking the buildbot master.
  • Run any post-flight tasks
  • Go back to beginning

From Bug 1028191 - Stop rebooting after every job
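Here's a very rough sketch of that loop, mostly to check my own understanding rather than to describe catlee's actual implementation. The task directory layout, the buildslave commands, the log message I look for, and the shutdown.stamp mechanism are all assumptions for illustration.

 # Sketch of the proposed runner main loop: pre-flight tasks, start buildbot,
 # confirm it connected, ask it to shut down gracefully after ~30 minutes,
 # post-flight tasks, repeat. All names and commands here are assumptions.
 import os
 import subprocess
 import time

 SLAVE_DIR = "/builds/slave"                      # assumed buildslave basedir
 TWISTD_LOG = os.path.join(SLAVE_DIR, "twistd.log")
 MAX_UPTIME = 30 * 60                             # ask for shutdown after ~30 minutes

 def run_tasks(phase):
     """Run every executable in the pre-/post-flight task directory (assumed layout)."""
     task_dir = "/opt/runner/%s.d" % phase
     for task in sorted(os.listdir(task_dir)):
         path = os.path.join(task_dir, task)
         if os.access(path, os.X_OK):
             subprocess.check_call([path])

 def wait_for_connection(timeout=300):
     """Poll twistd.log until buildbot reports it connected (the message text is a guess)."""
     deadline = time.time() + timeout
     while time.time() < deadline:
         if os.path.exists(TWISTD_LOG):
             with open(TWISTD_LOG) as f:
                 if "Connected to" in f.read():
                     return True
         time.sleep(5)
     return False

 while True:
     run_tasks("preflight")
     subprocess.check_call(["buildslave", "start", SLAVE_DIR])
     if not wait_for_connection():
         raise RuntimeError("buildbot never connected to a master")
     # Simplification: a real version would track when the current job finishes
     # rather than sleeping for a fixed interval.
     time.sleep(MAX_UPTIME)
     # Request a graceful shutdown locally (e.g. via a stamp file the slave is
     # configured to watch) instead of poking the master, then wait for the
     # process to finish its current job and exit (twistd removes its pid file).
     open(os.path.join(SLAVE_DIR, "shutdown.stamp"), "w").close()
     while os.path.exists(os.path.join(SLAVE_DIR, "twistd.pid")):
         time.sleep(30)
     run_tasks("postflight")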

Stuff I Need

  • Somewhere to test the puppet deployments: Catlee indicated I should ping whoever's on buildduty (currently coop) - Thanks coop!
  • SSH access to puppet masters - Thanks dustin!
  • Help with slave CA/config problem - Thanks dustin!