User:Ianconnolly/runner


So, following the example of Anhad, I've decided to document my own project here, both to make sure I've got this straight in my own head and to let other people correct any misunderstandings I may have.

Resources

The Idea

"A lot of slave maintenance is done at the beginning of build and test jobs that could be done instead before even starting buildbot. things like doing clobbers, purging old directories, making sure the tools checkout is up-to-date. in the case where we have spare capacity in the pool this can be a slight win for end-to-end time since builds don't have to waste time cleaning up. in the case where we're maxed out, it's no worse than the current situation." -- catlee

So if I understand this correctly, build and test job configs are cluttered with pre-/post-flight logic that could be pulled out for performance and sanity wins. Catlee has built Runner, a framework for running a set of these maintenance tasks, and he's also made a start on a runner Puppet module for deploying it.
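To make the idea concrete, here's a rough sketch of what one such pre-flight task might look like, assuming (purely as an illustration, not a description of catlee's actual code) that Runner just executes standalone scripts and treats a non-zero exit as a failure. The paths and age threshold below are made up.

 #!/usr/bin/env python
 # Hypothetical pre-flight task: purge stale build directories so jobs
 # don't have to spend time cleaning up themselves. The directory and the
 # age threshold are illustrative, not the real production values.
 from __future__ import print_function
 import os
 import shutil
 import sys
 import time

 BUILDS_DIR = "/builds/slave"   # assumed location of per-builder directories
 MAX_AGE_DAYS = 7               # purge anything untouched for longer than this

 def purge_old_dirs():
     now = time.time()
     for name in os.listdir(BUILDS_DIR):
         path = os.path.join(BUILDS_DIR, name)
         if not os.path.isdir(path):
             continue
         age_days = (now - os.path.getmtime(path)) / 86400.0
         if age_days > MAX_AGE_DAYS:
             print("purging %s (%.1f days old)" % (path, age_days))
             shutil.rmtree(path, ignore_errors=True)

 if __name__ == "__main__":
     try:
         purge_old_dirs()
     except OSError as e:
         # A non-zero exit signals that this pre-flight step failed.
         print(e, file=sys.stderr)
         sys.exit(1)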

Short-term To-Dos

  • Lint that Puppet code!
  • Templatise the Puppet code, abstracting out the hard-coded paths
  • Make Runner distributable
  • Ensure Runner runs before buildbot (see the sketch after this list)
  • Deploy Runner as-is for a quick win and to identify problems.
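For the "runs before buildbot" item, the simplest wiring I can imagine is a small wrapper that runs every task and only then hands off to buildbot. This is only a sketch: the task directory layout and the buildslave start command are assumptions, not how Runner is actually invoked in production.

 #!/usr/bin/env python
 # Hypothetical wrapper: run all pre-flight tasks, then hand off to buildbot.
 # The task directory and the start command are assumptions for illustration.
 from __future__ import print_function
 import os
 import subprocess

 TASK_DIR = "/opt/runner/tasks.d"                          # assumed layout: one executable per task
 BUILDBOT_CMD = ["buildslave", "start", "/builds/slave"]   # assumed start command

 def run_tasks():
     for task in sorted(os.listdir(TASK_DIR)):
         path = os.path.join(TASK_DIR, task)
         if os.access(path, os.X_OK):
             print("running pre-flight task:", task)
             subprocess.check_call([path])   # a failing task aborts the start

 if __name__ == "__main__":
     run_tasks()
     os.execvp(BUILDBOT_CMD[0], BUILDBOT_CMD)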

As of June 24th, Runner 1.0 is deployed.

Bugs arising:

Long-term To-Dos/Stretch goals

Currently we reboot after almost every job on buildbot in order to do things like:

  • Make sure we re-puppetize
  • Clean up temporary files
  • Make sure no old processes are running
  • Clean up memory fragmentation

However, by rebooting, we cause some problems:

  • We lose the filesystem cache between every job. In AWS this turns into lots of extra I/O to read back the same files over and over after each reboot.
  • We waste 2-5 minutes per job doing a reboot.
  • We put extra load on the puppet masters.

We can address nearly all of the issues we reboot for with pre-flight checks (a rough sketch follows this list):

  • Check if there are puppet (or AMI) changes that need to be applied
  • We can still clean up temporary files
  • We can kill stray processes
  • I don't think memory fragmentation is an issue any more. We used to have problems on 32-bit Linux machines that had been up for a long time: eventually they weren't able to link large libraries. All our build machines are 64-bit now, I believe.
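As an illustration of the middle two items, the checks could be as simple as the following. The temp-file locations and the list of stray process names are guesses for the sketch, not the actual production values, and the puppet/AMI check is left out.

 # Hypothetical pre-flight tasks covering two of the reboot reasons above:
 # cleaning up temporary files and killing stray processes left over from
 # the previous job. Paths and process names are illustrative only.
 import glob
 import os
 import shutil
 import subprocess

 TMP_GLOBS = ["/tmp/tmp*", "/builds/slave/*/build/tmp*"]   # assumed locations
 STRAY_PROCS = ["firefox", "xpcshell", "ssltunnel"]        # assumed leftovers

 def clean_tmp():
     for pattern in TMP_GLOBS:
         for path in glob.glob(pattern):
             if os.path.isdir(path):
                 shutil.rmtree(path, ignore_errors=True)
             else:
                 try:
                     os.remove(path)
                 except OSError:
                     pass

 def kill_strays():
     for name in STRAY_PROCS:
         # pkill exits non-zero when nothing matches, which is fine here
         subprocess.call(["pkill", "-9", name])

 if __name__ == "__main__":
     clean_tmp()
     kill_strays()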

This will require that 'runner' be in charge of starting and stopping buildbot. I imagine we'd do something like this (roughly sketched in code below):

  • Run through pre-flight checks
  • Start buildbot
  • Watch twistd.log to make sure buildbot actually starts and connects to a master
  • Initiate graceful shutdown of buildbot after X minutes (30?). There are ways to do this locally (e.g. by touching a shutdown.stamp file) instead of poking the buildbot master.
  • Run any post-flight tasks
  • Go back to beginning

From Bug 1028191 - Stop rebooting after every job
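Here's a very rough sketch of that loop, mostly to check my own understanding rather than to describe catlee's actual implementation. The task directory layout, the buildslave commands, the log message I look for, and the shutdown.stamp mechanism are all assumptions for illustration.

 # Sketch of the proposed runner main loop: pre-flight tasks, start buildbot,
 # confirm it connected, ask it to shut down gracefully after ~30 minutes,
 # post-flight tasks, repeat. All names and commands here are assumptions.
 import os
 import subprocess
 import time

 SLAVE_DIR = "/builds/slave"                      # assumed buildslave basedir
 TWISTD_LOG = os.path.join(SLAVE_DIR, "twistd.log")
 MAX_UPTIME = 30 * 60                             # ask for shutdown after ~30 minutes

 def run_tasks(phase):
     """Run every executable in the pre-/post-flight task directory (assumed layout)."""
     task_dir = "/opt/runner/%s.d" % phase
     for task in sorted(os.listdir(task_dir)):
         path = os.path.join(task_dir, task)
         if os.access(path, os.X_OK):
             subprocess.check_call([path])

 def wait_for_connection(timeout=300):
     """Poll twistd.log until buildbot reports it connected (the message text is a guess)."""
     deadline = time.time() + timeout
     while time.time() < deadline:
         if os.path.exists(TWISTD_LOG):
             with open(TWISTD_LOG) as f:
                 if "Connected to" in f.read():
                     return True
         time.sleep(5)
     return False

 while True:
     run_tasks("preflight")
     subprocess.check_call(["buildslave", "start", SLAVE_DIR])
     if not wait_for_connection():
         raise RuntimeError("buildbot never connected to a master")
     # Simplification: a real version would track when the current job finishes
     # rather than sleeping for a fixed interval.
     time.sleep(MAX_UPTIME)
     # Request a graceful shutdown locally (e.g. via a stamp file the slave is
     # configured to watch) instead of poking the master, then wait for the
     # process to finish its current job and exit (twistd removes its pid file).
     open(os.path.join(SLAVE_DIR, "shutdown.stamp"), "w").close()
     while os.path.exists(os.path.join(SLAVE_DIR, "twistd.pid")):
         time.sleep(30)
     run_tasks("postflight")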

Stuff I Need

  • Somewhere to test the puppet deployments: Catlee indicated I should ping whoever's on buildduty (currently coop) - Thanks coop!
  • SSH access to puppet masters - Thanks dustin!
  • Help with slave CA/config problem - Thanks dustin!