CIDuty/How To/Deprecated / Archived/Slave Management

Revision as of 05:51, 31 January 2013 by ChrisCooper (talk | contribs)

Slave Management

In general, slave management involves:

  • keeping as many slaves up as possible
  • handling nagios alerts for slaves
  • interacting with IT regarding slave maintenance

Known failure modes

  • talos-r3-*
  • talos-r3-fed|fed64
    • these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely. If one fails to puppetize cleanly, also check for a stale puppet lock at /var/lib/puppet/state/puppetdlock.
  • talos-r3-[w7|xp]
    • Windows slaves have issues with modal dialogs, and sometimes the msys shell will fail to close properly. A manual reboot will usually clear this up.
  • talos-r4-[lion|snow], talos-mtnlion-r5
  • tegras and pandas
  • AWS slaves
    • a common failure is running out of disk space. They have a default disk allocation of 150GB, versus 250GB on our in-house slaves. Catlee is working on changing that.
      • To clean them, you can run mock_mozilla -v -r mozilla-centos6-i386 --scrub=all. See bug 829186 for an example.
    • Rail wrote a tool to manage AWS slaves: it can enable or disable automatic reboot and automatic shutdown.
    • Mozilla DNS servers don't resolve AWS hostnames; this document describes how to resolve them.
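The stale-lock check mentioned above can be sketched as a small shell helper (the `check_stale_lock` name is illustrative; the lock path is the one noted in the list):

```shell
#!/bin/sh
# check_stale_lock: remove a puppet lock file when no puppetd process is
# running (i.e. the lock is stale). The helper name is illustrative.
check_stale_lock() {
    lock="$1"
    if [ -e "$lock" ] && ! pgrep -x puppetd >/dev/null 2>&1; then
        echo "removing stale lock: $lock"
        rm -f "$lock"
    fi
}

# On an affected slave, run it against the lock path noted above:
# check_stale_lock /var/lib/puppet/state/puppetdlock
```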

Rebooting slaves

You should always try to connect to bad slaves via ssh first. This gives you the chance to examine the current state of the machine and hopefully grab any logs that might be pertinent to the failure. If possible, you can also try connecting via VNC to see whether a stray crash or system dialog is being displayed.

Rebooting via ssh

Windows

shutdown -r -f -t 0

Linux/Mac

sudo reboot

When ssh doesn't work

PDU

You can determine which PDU and outlet a slave is connected to by checking the inventory (login required). Find the entry for the slave in question, and then scroll down to the Key/Value Store. There should be a key like the following:

Key                      Value
system.pdu.0             pdu1.r101-21.ops.releng.scl3:BC3
system.hostname.alias.0  talos-mtnlion-r5-006.test.releng.scl3

From the system.pdu.0 line, we can see that we should connect to http://pdu1.r101-21.ops.releng.scl3.mozilla.com to power-cycle this slave (login required), and that the slave is attached to outlet BC3 on the PDU. Some of the PDUs have the outlets labelled with the slave name, but it's always good to double-check before rebooting anything.
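The `system.pdu.0` value splits mechanically into the PDU host and outlet with shell parameter expansion (the value below is the inventory example above):

```shell
#!/bin/sh
# Split a system.pdu.0 value ("pdu-host:OUTLET") into PDU host and outlet.
# The value below is the inventory example from the table above.
pdu_entry="pdu1.r101-21.ops.releng.scl3:BC3"
pdu_host="${pdu_entry%%:*}"   # everything before the ':'
outlet="${pdu_entry##*:}"     # everything after the ':'
echo "PDU web interface: http://${pdu_host}.mozilla.com"
echo "outlet:            ${outlet}"
```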

Once connected to the web interface of the PDU, navigate to Outlet Control->Individual and find the appropriate outlet to reset.

Slaves attached to PDUs:

IPMI

All iX build slaves can be rebooted via an IPMI interface. If the slave name is linux-ix-slave22, then you can access the IPMI interface for that slave at http://linux-ix-slave22-mgmt.build.mozilla.org/. It's protected by a username/password that you can get from any release engineer. Power Control is under the Remote Control menu.
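Since the management hostname follows directly from the slave name, the URL can be derived mechanically (the slave name below is the example from the text):

```shell
#!/bin/sh
# Derive the IPMI management URL for an iX build slave from its name.
# The slave name is the example used in the text above.
slave="linux-ix-slave22"
mgmt_url="http://${slave}-mgmt.build.mozilla.org/"
echo "$mgmt_url"
```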

Slaves that have IPMI:

  • linux-ix-*
  • linux64-ix-*
  • mv-moz2-linux-ix-*
  • mw32-ix-*
  • w64-ix-*

Filing bugs for IT

  • File a bug using the link in slavealloc - it will "do the right thing" to set up a new bug if needed.
  • Make the individual slave bug block the appropriate colo reboot/recovery bug (check the machine domain):
    • reboots-mtv1 - MTV
    • reboots-scl1 - SCL1
    • reboots-scl3 - SCL3
    • tegra-recovery - tegras
    • These bugs get closed when IT has recovered all of the individual blocking slaves. You should clone the recovery bug and move the alias forward as required; otherwise, machines later added to the original alias may be unintentionally rebooted.
  • Make sure the alias of the bug is the hostname (done automatically if you follow slavealloc bug link)
  • Create dependent bugs for any IT actions (beyond normal reboot)
    • should block both the datacenter bug & the per host bug (for record keeping)
    • consider whether the slave should be disabled in slavealloc, and note that in the bug (no slave should be disabled without a detailed bug)
    • DCOps assumes that if there is no separate bug, they only need to reboot the machine and see it come back online.

Slave Tracking

  • Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc.

NOTE: you no longer need to add the slave-specific bug number to the Notes field. Clicking on the help icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead, using the format: 'bug #######.'

Slavealloc

Adding a slave

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.

You'll want a command line something like:

/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv

where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':

name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled
talos-r3-xp-096,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\talos-slave,1
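For a batch of slaves, the CSV can be generated rather than written by hand. A sketch, reusing the field values from the example row above (the slave numbers 096-098 are hypothetical):

```shell
#!/bin/sh
# Write dbimport CSV rows for a batch of slaves to a temporary file.
# Field values mirror the example row above; the slave numbers
# (096-098) are hypothetical.
csv=$(mktemp)
printf '%s\n' 'name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir,enabled' > "$csv"
for n in 096 097 098; do
    printf 'talos-r3-xp-%s,winxp,32,mini,scl1,try,prod,tests,tests-scl1-windows,C:\\\\talos-slave,1\n' "$n" >> "$csv"
done
# then: /tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data "$csv"
```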

Adding masters is similar - see dbimport's help for more information.

Removing slaves

Connect to slavealloc@slavealloc and look at the history for a command looking like this:

 mysql -h $host_ip -p -u buildslaves buildslaves
 # type the password
 SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
 DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';

Returning a re-imaged slave to production

XXX: this is one specific instance until I get the chance to generalize it

  • changed password for cltbld via passwd
  • cleaned the existing slave cert on scl3-production-puppet.srv.releng.scl3.mozilla.com
puppetca --clean bld-lion-r5-047.build.releng.scl3.mozilla.com
  • ran puppetd on the slave
/usr/bin/puppetd --onetime --no-daemonize --logdest console --server scl3-production-puppet.srv.releng.scl3.mozilla.com
  • a new version of python (2.7.3) was installed, so I deleted (as root, |sudo bash|) the existing buildbot dirs so they would be re-installed by puppet
rm -rf /tools/buildbot /tools/buildbot-0.8.4-pre-moz2
  • logged into the slave via VNC. Opened 'Users & Groups' under System Preferences. Clicked 'Login Options', and then toggled the Automatic Login for cltbld off, and then back on again (needed to put in the new cltbld password)
  • rebooted the box
  • re-ran the same puppetd command as above to re-install buildbot. The screen resolution check passes now too because auto-login is working.
  • copied ssh keys from another slave of the same class, e.g. lion build
cd $HOME;mv .ssh .ssh-old; scp -r bld-lion-r5-046:~/.ssh .;rm -rf .ssh-old
  • re-enabled slave in slavealloc
  • rebooted slave again, and it came back up in production
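The ssh-key step above can be wrapped so the old keys are only discarded once the copy succeeds (a sketch; `refresh_ssh_keys` is an illustrative name, and the donor host is the one from the example):

```shell
#!/bin/sh
# refresh_ssh_keys: replace $HOME/.ssh with a copy from a donor slave of
# the same class, restoring the old keys if the copy fails. This wraps
# the one-liner above; the function name is illustrative.
refresh_ssh_keys() {
    donor="$1"
    cd "$HOME" || return 1
    mv .ssh .ssh-old
    if scp -r "${donor}:~/.ssh" .; then
        rm -rf .ssh-old
    else
        mv .ssh-old .ssh   # copy failed: put the old keys back
        return 1
    fi
}

# e.g. refresh_ssh_keys bld-lion-r5-046
```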

How to decommission a slave

  • disable the slave in slavealloc, also setting its environment to "decomm"
  • if the hardware has failed:
    • file a bug against Server Ops:Releng to decomm the slave. They should (at the very least) make sure the nagios alerts are updated, DNS updated, and the hardware recovered from the dc.
  • if the hardware is still viable and can be used by another pool (e.g. r3 mini)
    • file a bug against Server Ops:Releng to have the slave re-imaged as another OS whose pool has bad wait times (usually Windows)
    • add the new slave to the buildbot configs, and make sure nagios monitoring is setup for the new slave (may require a new bug against relops)
  • remove the slave from the buildbot configs
  • remove the slave from puppet or opsi configs, if it exists in one

Using briar-patch tools (kitten) to manage slaves

See ReleaseEngineering:Buildduty:Kitten