<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.mozilla.org/index.php?action=history&amp;feed=atom&amp;title=User%3ADjmitche%2FSlave_Wrangling_with_Nagios</id>
	<title>User:Djmitche/Slave Wrangling with Nagios - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.mozilla.org/index.php?action=history&amp;feed=atom&amp;title=User%3ADjmitche%2FSlave_Wrangling_with_Nagios"/>
	<link rel="alternate" type="text/html" href="https://wiki.mozilla.org/index.php?title=User:Djmitche/Slave_Wrangling_with_Nagios&amp;action=history"/>
	<updated>2026-06-12T08:39:11Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.10</generator>
	<entry>
		<id>https://wiki.mozilla.org/index.php?title=User:Djmitche/Slave_Wrangling_with_Nagios&amp;diff=280642&amp;oldid=prev</id>
		<title>Djmitche: Created page with &quot;The overall plan for getting a handle on slave up-ness and keeping wait times low is to make nagios a lot smarter about slaves&#039; state.  = Proposal = There are a *lot* of things t...&quot;</title>
		<link rel="alternate" type="text/html" href="https://wiki.mozilla.org/index.php?title=User:Djmitche/Slave_Wrangling_with_Nagios&amp;diff=280642&amp;oldid=prev"/>
		<updated>2011-01-27T21:18:00Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;The overall plan for getting a handle on slave up-ness and keeping wait times low is to make nagios a lot smarter about slaves&amp;#039; state.  = Proposal = There are a *lot* of things t...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;The overall plan for getting a handle on slave up-ness and keeping wait times low is to make nagios a lot smarter about slaves&amp;#039; state.&lt;br /&gt;
&lt;br /&gt;
= Proposal =&lt;br /&gt;
There are a *lot* of things that can go wrong with slaves, and a lot of behaviors that are perfectly normal.  The common factor, however, is that a slave should continue to make progress around the &amp;quot;do stuff; reboot&amp;quot; cycle.  When a slave fails to make progress in this cycle for a significant amount of time, something is wrong.&lt;br /&gt;
&lt;br /&gt;
Since this is a cycle, if we monitor a single point in the cycle and check that it occurs with a certain minimal frequency, we can catch &amp;#039;&amp;#039;all&amp;#039;&amp;#039; slave failures (aside from quickly burning builds, of course).  The longest allowable gap between occurrences is still pretty long - on the order of 6 hours - so where possible, and for common failure modes, we should have additional, more sensitive checks.&lt;br /&gt;
&lt;br /&gt;
The cycle looks something like:&lt;br /&gt;
* system startup&lt;br /&gt;
* run config management (OPSI or Puppet; not on all slaves)&lt;br /&gt;
* start buildbot process&lt;br /&gt;
* execute zero or more steps&lt;br /&gt;
* reboot&lt;br /&gt;
* repeat&lt;br /&gt;
&lt;br /&gt;
The proposal in {{bug|627126}} is to instrument the &amp;#039;start buildbot process&amp;#039; step with a passive nagios check.  The nagios service entry stays green as long as this check reports in at least once every 7h.&lt;br /&gt;
&lt;br /&gt;
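As a rough illustration of the passive check described above, here is a minimal Python sketch.  It assumes the result is handed to nagios as a PROCESS_SERVICE_CHECK_RESULT external command line; the host name, service name, and helper functions are hypothetical and not taken from {{bug|627126}}.&lt;br /&gt;

```python
import time

# Assumption: the 7h window from the proposal, expressed in seconds.
FRESHNESS_WINDOW_SECONDS = 7 * 60 * 60

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host, service, code, output, now=None):
    """Build one passive check result line in external command format."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, code, output)


def is_stale(last_checkin, now, window=FRESHNESS_WINDOW_SECONDS):
    """Return True when the last check-in is older than the window."""
    return now - last_checkin > window


if __name__ == "__main__":
    # Hypothetical names; a slave would emit this just before starting buildbot.
    print(format_passive_result("slave01", "buildbot-start", OK,
                                "buildbot starting"))
```

How the line travels from the slave to the nagios server is left open here.  The is_stale helper only mirrors the freshness rule that nagios itself would apply to keep the service green.&lt;br /&gt;
&lt;br /&gt;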
== Advantages ==&lt;br /&gt;
This will catch all sorts of common failures:&lt;br /&gt;
* slaves that fail to start or hang on boot&lt;br /&gt;
* slaves that are hung trying to connect to puppet/OPSI&lt;br /&gt;
* slaves that cannot connect to their master&lt;br /&gt;
* slaves that are running a hung job&lt;br /&gt;
* kernel panics, hung VMs, etc.&lt;br /&gt;
&lt;br /&gt;
== Issues ==&lt;br /&gt;
We have lots of slaves that are idle for four or more days - particularly slow slaves (which are disfavored for jobs) and slaves on masters that do not run the fuzzer.  {{bug|565397}} would have slaves automatically restarting every 6h, which has many incidental benefits aside from keeping the proposed nagios service green.&lt;br /&gt;
&lt;br /&gt;
Some slaves are still attached to 0.7 masters, and it doesn&amp;#039;t make sense to implement {{bug|565397}} for these slaves.  Instead, they can be placed in long-term downtime in nagios, so that we are not alerted because of idleness.&lt;br /&gt;
&lt;br /&gt;
Slaves which cannot contact their master will restart after 6h, and thus appear alive to the check.  This shouldn&amp;#039;t be a problem for two reasons: first, the slave allocator should be sending slaves to working masters; second, if we set the check up correctly, the slave can send a CRITICAL status for the nagios check when it has been unable to connect to the master for some short time, and then reboot automatically.&lt;/div&gt;</summary>
		<author><name>Djmitche</name></author>
	</entry>
</feed>