CIDuty/How To/Deprecated / Archived/Slave Management

Revision as of 05:51, 31 January 2013 by ChrisCooper (talk | contribs)

Slave Management

In general, slave management involves:

  • keeping as many slaves up as possible
  • handling nagios alerts for slaves
  • interacting with IT regarding slave maintenance

Known failure modes

  • talos-r3-*
  • talos-r3-fed|fed64
    • these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely. If one fails to puppetize cleanly, also check for a stale puppet lock at /var/lib/puppet/state/puppetdlock.
  • talos-r3-[w7|xp]
    • Windows slaves have issues with modal dialogs, and sometimes the msys shell will fail to close properly. A manual reboot will usually clear this up.
  • talos-r4-[lion|snow], talos-mtnlion-r5
  • tegras and pandas
  • AWS slaves
    • a common failure is running out of disk space. They have a default disk allocation of 150GB, versus 250GB on our in-house slaves. Catlee is working on changing that.
      • To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all. See bug 829186 for an example.
    • Rail wrote a tool to manage AWS slaves: it can enable or disable automatic reboot and automatic shutdown.
    • Mozilla DNS servers don't resolve AWS hostnames; this document describes how to resolve them.
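The stale-lock check mentioned above can be sketched as a small shell helper (the `check_stale_lock` name is illustrative; the lock path is the one noted in the list):

```shell
#!/bin/sh
# check_stale_lock: remove a puppet lock file when no puppetd process is
# running (i.e. the lock is stale). The helper name is illustrative.
check_stale_lock() {
    lock="$1"
    if [ -e "$lock" ] && ! pgrep -x puppetd >/dev/null 2>&1; then
        echo "removing stale lock: $lock"
        rm -f "$lock"
    fi
}

# On an affected slave, run it against the lock path noted above:
# check_stale_lock /var/lib/puppet/state/puppetdlock
```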

Rebooting slaves

You should always try to connect to bad slaves via ssh first. This gives you the chance to examine the current state of the machine and hopefully grab any logs that might be pertinent to the failure. If possible, you can also try connecting via VNC to see whether a stray crash or system dialog is being displayed.

Rebooting via ssh

Windows

shutdown -r -f -t 0

Linux/Mac

sudo reboot

When ssh doesn't work

PDU

You can determine which PDU and outlet a slave is connected to by checking the inventory (login required). Find the entry for the slave in question, and then scroll down to the Key/Value Store. There should be a key like the following:

Key                      Value
system.pdu.0             pdu1.r101-21.ops.releng.scl3:BC3
system.hostname.alias.0  talos-mtnlion-r5-006.test.releng.scl3

From the system.pdu.0 line, we can see that we should connect to http://pdu1.r101-21.ops.releng.scl3.mozilla.com to power-cycle this slave (login required), and that the slave is attached to outlet BC3 on the PDU. Some of the PDUs have the outlets labelled with the slave name, but it's always good to double-check before rebooting anything.
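The `system.pdu.0` value splits mechanically into the PDU host and outlet with shell parameter expansion (the value below is the inventory example above):

```shell
#!/bin/sh
# Split a system.pdu.0 value ("pdu-host:OUTLET") into PDU host and outlet.
# The value below is the inventory example from the table above.
pdu_entry="pdu1.r101-21.ops.releng.scl3:BC3"
pdu_host="${pdu_entry%%:*}"   # everything before the ':'
outlet="${pdu_entry##*:}"     # everything after the ':'
echo "PDU web interface: http://${pdu_host}.mozilla.com"
echo "outlet:            ${outlet}"
```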

Once connected to the web interface of the PDU, navigate to Outlet Control->Individual and find the appropriate outlet to reset.

Slaves attached to PDUs:

IPMI

All iX build slaves can be rebooted via an IPMI interface. If the slave name is linux-ix-slave22, then you can access the IPMI interface for that slave at http://linux-ix-slave22-mgmt.build.mozilla.org/. It's protected by a username/password that you can get from any release engineer. Power Control is under the Remote Control menu.
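Since the management hostname follows directly from the slave name, the URL can be derived mechanically (the slave name below is the example from the text):

```shell
#!/bin/sh
# Derive the IPMI management URL for an iX build slave from its name.
# The slave name is the example used in the text above.
slave="linux-ix-slave22"
mgmt_url="http://${slave}-mgmt.build.mozilla.org/"
echo "$mgmt_url"
```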

Slaves that have IPMI:

  • linux-ix-*
  • linux64-ix-*
  • mv-moz2-linux-ix-*
  • mw32-ix-*
  • w64-ix-*

Filing bugs for IT

  • File a bug using the link in slavealloc - it will "do the right thing" to set up a new bug if needed.
  • Make the individual slave bug block the appropriate colo reboot/recovery bug (check the machine domain):
    • reboots-mtv1 - MTV
    • reboots-scl1 - SCL1
    • reboots-scl3 - SCL3
    • tegra-recovery - tegras
    • These bugs get closed when IT has recovered all of the individual blocking slaves. You should clone the recovery bug and move the alias forward as required; otherwise, machines later added to the original alias may be unintentionally rebooted.
  • Make sure the alias of the bug is the hostname (done automatically if you follow slavealloc bug link)
  • Create dependent bugs for any IT actions (beyond normal reboot)
    • should block both the datacenter bug & the per host bug (for record keeping)
    • consider whether the slave should be disabled in slavealloc, and note that in the bug (no slave should be disabled without a detailed bug)
    • DCOps assumes that if there is no separate bug, they only need to reboot the machine and see it come back online.

Slave Tracking

  • Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc.

NOTE: you no longer need to add the slave-specific bug number to the Notes field. Clicking on the help icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead, using the format: 'bug #######.'

Slavealloc

Adding a slave

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.

You'll want a command line something like:

/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv

where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':

name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled
talos-r3-xp-096,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\talos-slave,1
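For a batch of slaves, the CSV can be generated rather than written by hand. A sketch, reusing the field values from the example row above (the slave numbers 096-098 are hypothetical):

```shell
#!/bin/sh
# Write dbimport CSV rows for a batch of slaves to a temporary file.
# Field values mirror the example row above; the slave numbers
# (096-098) are hypothetical.
csv=$(mktemp)
printf '%s\n' 'name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled' > "$csv"
for n in 096 097 098; do
    printf 'talos-r3-xp-%s,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\\\talos-slave,1\n' "$n" >> "$csv"
done
# then: /tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data "$csv"
```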

Adding masters is similar - see dbimport's help for more information.

Removing slaves

Connect to slavealloc@slavealloc and look at the history for a command looking like this:

 mysql -h $host_ip -p -u buildslaves buildslaves
 # type the password
 SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
 DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';

Returning a re-imaged slave to production

XXX: this is one specific instance until I get the chance to generalize it

  • changed password for cltbld via passwd
  • cleaned the existing slave cert on scl3-production-puppet.srv.releng.scl3.mozilla.com
puppetca --clean bld-lion-r5-047.build.releng.scl3.mozilla.com
  • ran puppetd on the slave
/usr/bin/puppetd --onetime --no-daemonize --logdest console --server scl3-production-puppet.srv.releng.scl3.mozilla.com
  • a new version of python (2.7.3) was installed, so I deleted (as root, |sudo bash|) the existing buildbot dirs so they would be re-installed by puppet
rm -rf /tools/buildbot /tools/buildbot-0.8.4-pre-moz2
  • logged into the slave via VNC. Opened 'Users & Groups' under System Preferences. Clicked 'Login Options', and then toggled the Automatic Login for cltbld off, and then back on again (needed to put in the new cltbld password)
  • rebooted the box
  • re-ran the same puppetd command as above to re-install buildbot. The screen resolution check passes now too because auto-login is working.
  • copied ssh keys from another slave of the same class, e.g. lion build
cd $HOME;mv .ssh .ssh-old; scp -r bld-lion-r5-046:~/.ssh .;rm -rf .ssh-old
  • re-enabled slave in slavealloc
  • rebooted slave again, and it came back up in production
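The ssh-key step above can be wrapped so the old keys are only discarded once the copy succeeds (a sketch; `refresh_ssh_keys` is an illustrative name, and the donor host is the one from the example):

```shell
#!/bin/sh
# refresh_ssh_keys: replace $HOME/.ssh with a copy from a donor slave of
# the same class, restoring the old keys if the copy fails. This wraps
# the one-liner above; the function name is illustrative.
refresh_ssh_keys() {
    donor="$1"
    cd "$HOME" || return 1
    mv .ssh .ssh-old
    if scp -r "${donor}:~/.ssh" .; then
        rm -rf .ssh-old
    else
        mv .ssh-old .ssh   # copy failed: put the old keys back
        return 1
    fi
}

# e.g. refresh_ssh_keys bld-lion-r5-046
```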

How to decommission a slave

  • disable the slave in slavealloc, also setting its environment to "decomm"
  • if the hardware has failed:
    • file a bug against Server Ops:Releng to decomm the slave. They should (at the very least) make sure the nagios alerts are updated, DNS updated, and the hardware recovered from the dc.
  • if the hardware is still viable and can be used by another pool (e.g. r3 mini)
    • file a bug against Server Ops:Releng to have the slave re-imaged as another OS whose pool has bad wait times (usually Windows)
    • add the new slave to the buildbot configs, and make sure nagios monitoring is setup for the new slave (may require a new bug against relops)
  • remove the slave from the buildbot configs
  • remove the slave from puppet or opsi configs, if it exists in one

Using briar-patch tools (kitten) to manage slaves

See ReleaseEngineering:Buildduty:Kitten