(Provide a more recent example for adding slaves to slavealloc) |
(Various minor updates) |
||
Line 1: | Line 1: | ||
In general, slave management involves: | In general, slave management involves: | ||
* keeping as many slaves up as possible, including | * keeping as many slaves up as possible, including | ||
** proactively checking for hung/broken slaves - see [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html | ** proactively checking for hung/broken slaves - see [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth] dashboard. | ||
** returning re-imaged slaves to production | ** returning re-imaged slaves to production | ||
* handling [[ReleaseEngineering:Buildduty:Nagios|nagios alerts]] for slaves | * handling [[ReleaseEngineering:Buildduty:Nagios|nagios alerts]] for slaves | ||
Line 7: | Line 7: | ||
= Known failure modes = | = Known failure modes = | ||
* | * Hardware machines can become unreachable; a reboot is generally needed: | ||
** | ** all Mac OS machines (bld-lion-r5* and t-yosemite-r7*) are connected to [https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI PDU]. | ||
* | ** Windows and Linux machines use [https://wiki.mozilla.org/ReleaseEngineering/How_To/Connect_To_IPMI IPMI] | ||
* | |||
= Automated = | = Automated = | ||
There are currently no automated mechanisms for recovering individual slaves. | There are currently no automated mechanisms for recovering individual slaves. | ||
* AWS instances will automatically terminate when idle. | |||
AWS instances will automatically terminate when idle. | |||
= Manual = | = Manual = | ||
== Rebooting slaves == | == Rebooting slaves == | ||
Find the slave page on | Find the slave page on [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html SlaveHealth]. There's a button to reboot the machine. | ||
== Filing bugs for IT == | == Filing bugs for IT == | ||
* File a bug using the link in the | * File a bug using the link in the SlaveHealth page for the slave - it will "do the right thing" to set up a new bug if needed. | ||
* File a [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations%3A%20DCOps&short_desc=HOST%20is%20unreachable "slave is unreachable bug"] for IT. | * File a [https://bugzilla.mozilla.org/enter_bug.cgi?product=mozilla.org&component=Server%20Operations%3A%20DCOps&short_desc=HOST%20is%20unreachable "slave is unreachable bug"] for IT. | ||
* Create dependent bugs for any IT actions. (As of | *** '''Note''': SlaveAPI will do that automatically when failing to reboot the machine. | ||
** should block | * Create dependent bugs for any IT actions. (As of 2017, we should file per-slave bugs for reboots instead of grouping together machines in the same DC into one bug.) | ||
** should block the per host bug (for record keeping) | |||
** consider whether the slave should be disabled in slavealloc, and note that in the bug (no slave should be disabled without a detailed bug) | ** consider whether the slave should be disabled in slavealloc, and note that in the bug (no slave should be disabled without a detailed bug) | ||
** | ** DCOps assumes that, if there is no separate bug, they only need to reboot the machine and confirm that it comes back online. | ||
** | ** e.g. bug [https://bugzilla.mozilla.org/show_bug.cgi?id=1420132 1420132]. | ||
== Slave Tracking == | == Slave Tracking == | ||
Line 78: | Line 75: | ||
<pre> | <pre> | ||
nickname,fqdn,http_port,pb_port,datacenter,pool | nickname,fqdn,http_port,pb_port,datacenter,pool | ||
bm141-tests1-linux32,buildbot-master141.bb.releng.use1.mozilla.com,8201,9201,scl3,tests-use1-linux32 | |||
bm142-tests1-linux32,buildbot-master142.bb.releng.usw2.mozilla.com,8201,9201,scl3,tests-usw2-linux32 | |||
</pre> | </pre> | ||
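Before importing a CSV like the one above, it can help to sanity-check it locally. The following is a minimal sketch, not part of slavealloc: the helper name <code>check_masters_csv</code> and the specific checks (exact header, numeric ports, unique nicknames) are assumptions made for illustration, based only on the column names shown in the example.

```python
import csv
import io

# Column order taken from the example header above.
EXPECTED_FIELDS = ["nickname", "fqdn", "http_port", "pb_port", "datacenter", "pool"]


def check_masters_csv(text):
    """Hypothetical helper: return a list of problems found in the CSV text.

    An empty list means the file passed these (assumed) sanity checks.
    """
    reader = csv.DictReader(io.StringIO(text.strip()))
    if reader.fieldnames != EXPECTED_FIELDS:
        return ["unexpected header: %r" % (reader.fieldnames,)]

    problems = []
    seen = set()
    for lineno, row in enumerate(reader, start=2):
        # DictReader stores extra columns under the key None and fills
        # missing columns with the value None.
        if None in row or None in row.values():
            problems.append("line %d: wrong number of columns" % lineno)
            continue
        for field in ("http_port", "pb_port"):
            if not row[field].isdigit():
                problems.append("line %d: %s is not numeric" % (lineno, field))
        if row["nickname"] in seen:
            problems.append("line %d: duplicate nickname %s" % (lineno, row["nickname"]))
        seen.add(row["nickname"])
    return problems


if __name__ == "__main__":
    sample = """nickname,fqdn,http_port,pb_port,datacenter,pool
bm141-tests1-linux32,buildbot-master141.bb.releng.use1.mozilla.com,8201,9201,scl3,tests-use1-linux32
bm142-tests1-linux32,buildbot-master142.bb.releng.usw2.mozilla.com,8201,9201,scl3,tests-usw2-linux32
"""
    for problem in check_masters_csv(sample):
        print(problem)
```

The check is deliberately strict about the header so that a reordered or truncated file is rejected before any row is examined.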
Line 108: | Line 106: | ||
<pre> | <pre> | ||
SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%'; | SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%'; | ||
DELETE | DELETE FROM slaves WHERE notes LIKE '%bumblebumble%'; | ||
</pre> | </pre> | ||
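The SELECT-then-DELETE pattern above can be wrapped so that the delete only commits when it removes exactly the rows that were previewed. This is a hedged sketch only: it uses an in-memory SQLite database as a stand-in for the real slavealloc database, and the two-column <code>slaves</code> schema and the function name <code>delete_slaves_by_note</code> are assumptions for illustration.

```python
import sqlite3


def delete_slaves_by_note(conn, pattern):
    """Hypothetical helper: preview matching slaves, then delete them.

    Rolls back if the number of deleted rows differs from the preview.
    Returns the names of the deleted slaves.
    """
    cur = conn.cursor()
    cur.execute("SELECT name FROM slaves WHERE notes LIKE ? ORDER BY name", (pattern,))
    names = [row[0] for row in cur.fetchall()]

    cur.execute("DELETE FROM slaves WHERE notes LIKE ?", (pattern,))
    if cur.rowcount != len(names):
        conn.rollback()
        raise RuntimeError("preview and delete disagree; nothing was deleted")
    conn.commit()
    return names


if __name__ == "__main__":
    # Stand-in database with an assumed, simplified schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slaves (name TEXT, notes TEXT)")
    conn.executemany(
        "INSERT INTO slaves VALUES (?, ?)",
        [
            ("slave-a", "added by bumblebumble import"),
            ("slave-b", "bumblebumble"),
            ("slave-c", "production"),
        ],
    )
    conn.commit()
    print(delete_slaves_by_note(conn, "%bumblebumble%"))
```

Using a distinctive marker string in the notes column, as in the example above, is what makes the LIKE pattern safe to use for cleanup.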
== Returning a re-imaged slave to production == | == Returning a re-imaged slave to production == | ||
* see [[ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave]] | |||
== How to decommission a slave == | == How to decommission a slave == | ||
* | * https://wiki.mozilla.org/ReleaseEngineering/How_To/Decommission_Slave | ||
= Windows = | = Windows = |