Slave Management

In general, slave management involves:

keeping as many slaves up as possible, including
- proactively checking for hung/broken slaves - see slave health dashboard.
- returning re-imaged slaves to production
handling nagios alerts for slaves
interacting with IT regarding slave maintenance

Known failure modes

talos-r3-*
- all of the r3 slaves are minis and require manual intervention if you cannot ping them or ssh into them to reboot them yourself. Add slaves in this mode to the appropriate reboots bug for IT.
talos-r3-fed|fed64
- these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely. Also, if they fail to puppetize cleanly, check for a stale puppet lock at /var/lib/puppet/state/puppetdlock.
talos-r4-snow, talos-mtnlion-r5
- ~~These slaves will sometimes fail to puppetize correctly. The remote_scutil_cmds.bash script can help with this.~~
- all r4 and r5 slaves are connected to PDUs for power-cycling
tegras and pandas
- tegras and pandas can fail in many disparate ways. See ReleaseEngineering/How_To/Android_Tegras for more info.
AWS slaves
- a common failure is running out of disk space. They have default disk allocations of 150GB versus ours, which have 250GB. Catlee is working on changing that.
  - To clean them, you can run: mock_mozilla -v -r mozilla-centos6-i386 --scrub=all (see bug 829186 for an example).
- Rail wrote a tool to manage aws slaves - enable or disable automatic reboot and automatic shutdown.
- ~~Mozilla DNS servers don't resolve AWS hostnames, thus this document describes how to resolve them~~
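
For the disk-space failure described for AWS slaves above, a quick check before scrubbing can be sketched as follows. This is only an illustration: the function name, the path checked, and the 20GB threshold are assumptions, not part of any existing tool.

```python
import shutil


def disk_usage_report(path="/", min_free_gb=20.0):
    """Return (free_gb, is_low) for the filesystem that contains `path`.

    `min_free_gb` is a hypothetical threshold chosen for illustration.
    """
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1024 ** 3
    return free_gb, free_gb < min_free_gb


if __name__ == "__main__":
    free_gb, is_low = disk_usage_report("/")
    print(f"{free_gb:.1f} GB free; low on space: {is_low}")
```

If the check reports low space, the mock_mozilla scrub command above is the cleanup step to try.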

Automated

Slave Rebooter

Slave rebooter is a script that analyzes recent slave activity and attempts to reboot slaves that it thinks are stuck. It is a SlaveAPI based replacement for Kittenherder. It lives in the build/tools repository, gets deployed by Puppet, and currently runs on buildbot-master65.

At the time of writing, it works for all hardware machines except Tegras and Pandas. Cloud machines are explicitly ignored because they don't suffer from the same types of transient failures.
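
The selection logic described above can be sketched roughly as follows. This is not the actual slave rebooter code: the idle threshold, the record layout, the name prefixes, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical threshold; the real script's criteria are not documented here.
IDLE_THRESHOLD = timedelta(hours=6)

# Tegras and Pandas are not handled; the prefixes are assumed naming.
SKIPPED_PREFIXES = ("tegra-", "panda-")


def is_reboot_candidate(name, last_activity, now, is_cloud=False):
    """Return True if a slave looks stuck and is of a type we handle."""
    if is_cloud or name.startswith(SKIPPED_PREFIXES):
        return False
    return now - last_activity > IDLE_THRESHOLD


def pick_reboot_candidates(slaves, now):
    """Filter a list of slave records down to the names worth rebooting."""
    return [
        s["name"]
        for s in slaves
        if is_reboot_candidate(
            s["name"], s["last_activity"], now, s.get("is_cloud", False)
        )
    ]
```

Cloud machines and Tegras/Pandas are filtered out first, so only hardware slaves that have been idle for longer than the threshold are returned.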

Manual

Rebooting slaves

Find the slave page on slave health. There's a button to reboot the machine.

Filing bugs for IT

File a bug using the link in the slave health page for the slave - it will "do the right thing" to set up a new bug if needed.
File a "slave is unreachable" bug for IT.
Create dependent bugs for any IT actions. (As of 2014, we should file per-slave bugs for reboots instead of grouping together machines in the same DC into one bug.)
- should block both the datacenter bug & the per host bug (for record keeping)
- consider whether the slave should be disabled in slavealloc, and note that in bug (no slave without a detailed bug should be disabled)
- dcops assumes if there is no separate bug, they only need to reboot and see the machine come online.
- Examples: https://bugzilla.mozilla.org/show_bug.cgi?id=966954, https://bugzilla.mozilla.org/show_bug.cgi?id=828602

Slave Tracking

Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc.

NOTE: you no longer need to add the slave-specific bug number to the Notes field. Clicking on the icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead using the format: 'bug #######.'

Slavealloc

Connecting

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command.

You will need to ssh as your own user onto the server which hosts slavealloc:

ssh <your user>@relengwebadm.private.scl3.mozilla.com

Staging vs production

The DB urls for staging and production are shared in a PGP encrypted file used by the Release Engineering team. Ask someone else in the team if you do not have this file.

Adding a slave

Once you connect to relengwebadm (see above), to see the help for the slavealloc dbimport command, run:

/data/releng/www/slavealloc/slavealloc dbimport -h

To import data, first you need to create a CSV file, like this one:

name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir
panda-0887,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0887
panda-0888,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0888
panda-0889,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0889

You'll want a command line something like:

/data/releng/www/slavealloc/slavealloc dbimport -D mysql://user:password@host/DB_name --slave-data <the csv file you just created containing slaves>
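
When adding many slaves at once, the CSV file can be generated rather than typed by hand. The sketch below reproduces the panda example above; the helper names are made up for illustration, and the field values are simply copied from that example, so adjust them for other slave types.

```python
import csv
import io

# Column order taken from the example CSV above.
FIELDS = [
    "name", "distro", "bitlength", "speed", "datacenter",
    "trustlevel", "environment", "purpose", "pool", "basedir",
]


def panda_rows(first, last):
    """Yield one row per panda, numbered first..last inclusive."""
    for n in range(first, last + 1):
        name = f"panda-{n:04d}"
        yield {
            "name": name, "distro": "panda", "bitlength": "32",
            "speed": "mini", "datacenter": "scl1", "trustlevel": "try",
            "environment": "prod", "purpose": "tests",
            "pool": "tests-scl1-panda", "basedir": f"/builds/{name}",
        }


def build_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(build_csv(panda_rows(887, 889)), end="")
```

Redirect the output to a file and pass that file to --slave-data.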

Adding a master

Adding masters is similar to adding a slave:

/data/releng/www/slavealloc/slavealloc dbimport -D mysql://user:password@host/DB_name  --master-data <csv file containing masters>

The following example shows the required fields, and example values:

nickname,fqdn,http_port,pb_port,datacenter,pool
bm89-tests1-panda,buildbot-master89.srv.releng.scl3.mozilla.com,8201,9201,scl3,tests-panda

To get a full list of allowed values for the various normalized fields to use in both import files, you can connect to the mysql database and query the tables directly:

SELECT name FROM bitlengths;
SELECT name FROM datacenters;
SELECT name FROM distros;
SELECT name FROM environments;
SELECT name FROM pools;
SELECT name FROM purposes;
SELECT name FROM speeds;
SELECT name FROM trustlevels;

Please note you'll need to set values in your CSV file that correspond to these allowed values.

The slavealloc dbimport mechanism will convert lines of the CSV file into INSERT sql statements. Non specified fields will essentially be set to NULL. To see how the fields are mapped and normalized, see: https://hg.mozilla.org/build/tools/file/5439f10a7127/lib/python/slavealloc/scripts/dbimport.py#l111 (lines 111-137).
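
Before importing, it can help to check the CSV against the allowed values returned by the queries above. The following is a minimal sketch of such a check, not part of slavealloc itself; the function name is hypothetical, and the mapping of CSV columns to lookup tables is assumed from the column and table names.

```python
import csv
import io

# CSV columns that are assumed to correspond to the lookup tables above
# (bitlength -> bitlengths, datacenter -> datacenters, and so on).
NORMALIZED_FIELDS = [
    "bitlength", "datacenter", "distro", "environment",
    "pool", "purpose", "speed", "trustlevel",
]


def find_invalid_values(csv_text, allowed):
    """Return (line_number, field, value) for each value not in `allowed`.

    `allowed` maps a field name to the set of names fetched from the
    matching table. Fields missing from the file are skipped, since
    unspecified fields are left unset by the import.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for field in NORMALIZED_FIELDS:
            value = row.get(field)
            if value is None or value == "":
                continue
            if value not in allowed.get(field, set()):
                problems.append((line_no, field, value))
    return problems
```

An empty result means every normalized value in the file matches an allowed name.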

Moving slaves

Connect to relengwebadm and then connect to the mysql DB.

You have to determine the correct poolid and trustid values.

UPDATE slaves SET poolid=43, trustid=4 WHERE notes LIKE 'bug 917923 - to be converted into try hosts';
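
One way to avoid hard-coding the poolid and trustid is to look them up by name first. The sketch below runs against an in-memory SQLite database so that it is self-contained; the real database is MySQL. The schema, the ID column names in the lookup tables, the pool names, and the slave names are all assumptions for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the slavealloc DB (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pools (poolid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE trustlevels (trustid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE slaves (name TEXT, poolid INTEGER, trustid INTEGER, notes TEXT);
        INSERT INTO pools VALUES (12, 'build-scl1'), (43, 'try-scl1');
        INSERT INTO trustlevels VALUES (2, 'core'), (4, 'try');
        INSERT INTO slaves VALUES
            ('linux-ix-slave01', 12, 2, 'bug 917923 - to be converted into try hosts'),
            ('linux-ix-slave02', 12, 2, NULL);
    """)
    return conn


def lookup_id(conn, table, id_column, name):
    """Resolve a name to its numeric ID; table/column are fixed by the caller."""
    row = conn.execute(
        f"SELECT {id_column} FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"no row named {name!r} in {table}")
    return row[0]


def move_slaves(conn, pool_name, trust_name, notes_pattern):
    """Move every slave whose notes match the pattern; return the row count."""
    poolid = lookup_id(conn, "pools", "poolid", pool_name)
    trustid = lookup_id(conn, "trustlevels", "trustid", trust_name)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE slaves SET poolid = ?, trustid = ? WHERE notes LIKE ?",
            (poolid, trustid, notes_pattern),
        )
    return cur.rowcount
```

Checking the returned row count against the number of slaves you expected to move is a cheap sanity check.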

Removing slaves

Connect to relengwebadm and then connect to the mysql DB. Check which rows match with the SELECT before running the DELETE:

SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
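
The preview-then-delete pattern above can also be scripted so that the DELETE only runs when the SELECT returns exactly the slaves you expect. This is a sketch against an in-memory SQLite database with a simplified, assumed slaves table; the real database is MySQL, and the function and slave names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """In-memory stand-in with a simplified slaves table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slaves (name TEXT, notes TEXT)")
    conn.executemany(
        "INSERT INTO slaves VALUES (?, ?)",
        [
            ("talos-r3-fed-001", "decomm bumblebumble"),
            ("talos-r3-fed-002", "decomm bumblebumble"),
            ("talos-r3-fed-003", None),
        ],
    )
    return conn


def remove_slaves(conn, notes_pattern, expected_names):
    """Delete matching slaves only if the preview equals `expected_names`."""
    preview = sorted(
        row[0]
        for row in conn.execute(
            "SELECT name FROM slaves WHERE notes LIKE ?", (notes_pattern,)
        )
    )
    if preview != sorted(expected_names):
        raise ValueError(f"refusing to delete; pattern matches {preview}")
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM slaves WHERE notes LIKE ?", (notes_pattern,))
    return preview
```

Passing the expected names explicitly guards against a pattern that matches more rows than intended.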

Returning a re-imaged slave to production

(aka. post-imaging)

See ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave

How to decommission a slave

disable the slave in slavealloc, also setting its environment to "decomm"
if the hardware has failed:
- file a bug against Server Ops:Releng to decomm the slave. They should (at the very least) make sure the nagios alerts are updated, DNS updated, and the hardware recovered from the dc.
if the hardware is still viable and can be used by another pool (e.g. r3 mini)
- file a bug against Server Ops:Releng to have the slave re-imaged to another OS with bad wait times (usually Windows)
- add the new slave to the buildbot configs, and make sure nagios monitoring is set up for the new slave (may require a new bug against relops)
remove the slave from the buildbot configs
remove the slave from puppet or opsi configs, if it exists in one

Old stuff

Rebooting slaves

You should always try to connect to bad slaves via ssh first. This gives you the chance to examine the current state of the machine and hopefully grab any logs that might be pertinent to the failure. If possible, you can also try connecting via VNC to see whether a stray crash or system dialog is being displayed.

Rebooting via ssh

Windows

shutdown -r -f -t 0

Linux/Mac

sudo reboot

When ssh doesn't work

On Windows, if the SSH connection closes immediately upon connecting, chances are that the KTS SSH daemon either a) has banned your IP address or b) has a reference to a disconnected SSH session and is preventing new connections with that username. To resolve, delete files under <program files>\KTS\log\active-sessions\*disconnected* and <program files>\KTS\log\ip-ban\*.* using Administrator privileges.
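
The cleanup above could be scripted along these lines. This is a sketch only: the function names are hypothetical, the KTS install directory must be supplied by the caller, and it defaults to a dry run so that nothing is deleted until the file list has been reviewed.

```python
from pathlib import Path


def stale_kts_files(kts_dir):
    """List the files the cleanup would remove under <kts_dir>/log."""
    log_dir = Path(kts_dir) / "log"
    files = list((log_dir / "active-sessions").glob("*disconnected*"))
    # On Windows, *.* matches every file, so "*" is used here to match that.
    files += list((log_dir / "ip-ban").glob("*"))
    return sorted(f for f in files if f.is_file())


def clean_kts(kts_dir, dry_run=True):
    """Return the stale files; delete them only when dry_run is False."""
    files = stale_kts_files(kts_dir)
    if not dry_run:
        for f in files:
            f.unlink()
    return files
```

Run it once with the default dry_run=True to see what would be removed, then again with dry_run=False from an Administrator session.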

PDU

You can determine which PDU and outlet a slave is connected to by checking the inventory (login required). Find the entry for the slave in question, and then scroll down to the Key/Value Store. There should be keys like the following:

Key                      Value
system.pdu.0             pdu1.r101-21.ops.releng.scl3:BC3
system.hostname.alias.0  talos-mtnlion-r5-006.test.releng.scl3

From the system.pdu.0 line, we can see that we should connect to http://pdu1.r101-21.ops.releng.scl3.mozilla.com to power-cycle this slave (login required), and that the slave is attached to outlet BC3 on the PDU. Some of the PDUs have the outlets labelled with the slave name, but it's always good to double-check before rebooting anything.

Once connected to the web interface of the PDU, navigate to Outlet Control->Individual and find the appropriate outlet to reset.
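
Turning a system.pdu.0 value into the PDU URL and outlet name, as done by hand above, can be sketched like this. The function name is made up for illustration, and the sketch assumes every PDU hostname takes the same .mozilla.com suffix as the example.

```python
def parse_pdu_value(value):
    """Split a 'host:outlet' inventory value into (pdu_url, outlet)."""
    host, sep, outlet = value.partition(":")
    if not sep or not host or not outlet:
        raise ValueError(f"expected 'host:outlet', got {value!r}")
    # Assumption: the short hostname resolves under .mozilla.com.
    return f"http://{host}.mozilla.com", outlet


if __name__ == "__main__":
    url, outlet = parse_pdu_value("pdu1.r101-21.ops.releng.scl3:BC3")
    print(url, outlet)
```

For the example value above this gives http://pdu1.r101-21.ops.releng.scl3.mozilla.com and outlet BC3.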

Slaves attached to PDUs:

talos-r4-*
talos-mtnlion-r5-*
tegras, but you should really follow the reboot instructions in ReleaseEngineering/How_To/Android_Tegras

IPMI

All iX build slaves can be rebooted via an IPMI interface. If the slave name is linux-ix-slave22, then you can access the IPMI interface for that slave at http://linux-ix-slave22-mgmt.build.mozilla.org/. It's protected by a username/password that you can get from any release engineer. Power Control is under the Remote Control menu.

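You can also use ipmitool from the command line, passing the slave's management hostname (one matching .*-ix-.*-mgmt) to -H:

ipmitool -U <user> -P <password> -H <mgmt hostname> chassis power soft

Note that chassis power soft requests a soft shutdown through ACPI; chassis power cycle or chassis power reset restarts the machine.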

Slaves that have IPMI:

linux-ix-*
linux64-ix-*
mv-moz2-linux-ix-*
mw32-ix-*
w64-ix-*
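
Deriving the IPMI address from a slave name can be sketched as follows. The function name is hypothetical, and the sketch assumes that every slave type in the list above follows the same <name>-mgmt.build.mozilla.org pattern as the linux-ix-slave22 example.

```python
from fnmatch import fnmatchcase

# Name patterns copied from the list of slaves that have IPMI.
IPMI_PATTERNS = [
    "linux-ix-*",
    "linux64-ix-*",
    "mv-moz2-linux-ix-*",
    "mw32-ix-*",
    "w64-ix-*",
]


def ipmi_url(slave_name):
    """Return the IPMI web interface URL, or None if the slave has no IPMI."""
    if not any(fnmatchcase(slave_name, p) for p in IPMI_PATTERNS):
        return None
    # Assumed hostname pattern, based on the single example in the text.
    return f"http://{slave_name}-mgmt.build.mozilla.org/"
```

A result of None means the slave is not an iX machine, so use the PDU or an IT bug instead.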
