User:Djmitche/Slave Wrangling with Nagios

From MozillaWiki

The overall plan for getting a handle on slave up-ness and keeping wait times low is to make nagios a lot smarter about slaves' state.

Proposal

There are a *lot* of things that can go wrong with slaves, and a lot of behaviors that are perfectly normal. The common factor, however, is that a slave should continue to make progress around the "do stuff; reboot" cycle. When a slave fails to make progress in this cycle for a significant amount of time, something is wrong.

Since this is a cycle, if we monitor a single point in the cycle and check that it occurs with a certain minimal frequency, we can catch all slave failures (aside from slaves that burn builds quickly yet keep cycling, of course). That minimal frequency is still pretty long - on the order of 6 hours - so where possible, and for common failure modes, we should have additional, more sensitive checks.
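
As a rough illustration of the single-point idea, the sketch below treats a slave as stalled when it has not passed the monitored point within a maximum interval. The function name, the timestamps, and the 6-hour default are assumptions chosen for illustration; they are not part of any existing tool.

```python
from datetime import datetime, timedelta


def is_stalled(last_checkin, now, max_interval=timedelta(hours=6)):
    """Return True if the slave has not passed the monitored point
    in the cycle within max_interval (assumed default: 6 hours)."""
    return now - last_checkin > max_interval


# Illustrative timestamps only.
now = datetime(2011, 1, 20, 12, 0)
print(is_stalled(datetime(2011, 1, 20, 9, 0), now))  # 3h since last check-in -> False
print(is_stalled(datetime(2011, 1, 20, 3, 0), now))  # 9h since last check-in -> True
```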

The cycle looks something like:

  • system startup
  • run config management (OPSI or Puppet; not on all slaves)
  • start buildbot process
  • execute zero or more steps
  • reboot
  • repeat

The proposal in bug 627126 is to instrument the 'start buildbot process' step with a passive nagios check. The nagios service entry stays green as long as this check occurs at least every 7h.
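
A passive check result could be built along the following lines. This is only a sketch: it formats a line in the Nagios external command syntax (PROCESS_SERVICE_CHECK_RESULT), and the host name, service description, and helper name are hypothetical. How the result travels from the slave to the nagios server (NSCA or otherwise) is not settled here, and the 7h window would presumably be enforced on the nagios side with its freshness settings.

```python
import time

OK = 0  # Nagios return code for an OK result


def passive_result_command(host, service, return_code, output, timestamp=None):
    """Format a passive service check result as a Nagios external command line.

    Hypothetical helper: the caller is responsible for delivering the line
    to the nagios server.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


# Hypothetical host and service names, submitted just before buildbot starts.
print(passive_result_command(
    "slave01.example.com", "buildbot-start", OK, "buildbot starting"), end="")
```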

Advantages

This will catch all sorts of common failures:

  • slaves that fail to start or hang on boot
  • slaves that are hung trying to connect to puppet/OPSI
  • slaves that cannot connect to their master
  • slaves that are running a hung job
  • kernel panics, hung VMs, etc.

Issues

We have lots of slaves that are idle for four or more days - particularly slow slaves (which are disfavored for jobs) and slaves on masters that do not run the fuzzer. Bug 565397 would have slaves automatically restart every 6h, which has many incidental benefits aside from keeping the proposed nagios service green.

Some slaves are still attached to 0.7 masters, and it doesn't make sense to implement bug 565397 for these slaves. Instead, they can be marked as a long-term downtime in nagios, so that we are not alerted because of idleness.

Slaves which cannot contact their master will restart after 6h, and thus seem alive. This shouldn't be a problem for two reasons: first, the slave allocator should be sending slaves to working masters; second, if we set the check up correctly, the slave can report a CRITICAL status for the nagios check once it has been unable to connect to the master for some short time, and then reboot automatically.
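
One possible shape for that second point is sketched below: given how long the slave has been unable to reach its master, decide which status to report and whether to reboot. The 10-minute CRITICAL threshold stands in for "some short time" and is an assumption, as are the function and parameter names; the 6h reboot comes from the behavior described above.

```python
from enum import IntEnum


class Status(IntEnum):
    # Standard Nagios return codes.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def evaluate_connection(seconds_disconnected,
                        critical_after=10 * 60,     # assumed "short time"
                        reboot_after=6 * 60 * 60):  # 6h restart
    """Return (status to report, whether the slave should reboot now)."""
    if seconds_disconnected >= reboot_after:
        return Status.CRITICAL, True
    if seconds_disconnected >= critical_after:
        return Status.CRITICAL, False
    return Status.OK, False


for seconds in (30, 15 * 60, 6 * 60 * 60):
    status, reboot = evaluate_connection(seconds)
    print(seconds, status.name, reboot)
```

With a scheme like this, a slave that reboots after 6h without reaching its master has already reported CRITICAL, so it does not merely look alive.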