ReleaseEngineering/How To/Set Up a Freshly Imaged Slave

From MozillaWiki

Revision as of 20:42, 22 January 2013


If the machine is being re-purposed, more steps than these are needed. See How to create new slaves or move them to other pools.

If your machine has simply been re-imaged, follow the instructions in the appropriate section.

Linux/Mac

hostname verification

Linux

Verify the hostname, checking that it ends in 'build.(datacenter).mozilla.com':

hostname --fqdn

To fix it:

  • "su -" to become root.
  • edit the file /etc/sysconfig/network
    • changing the hostname to the host's *long* (with datacenter) fully qualified domain name.
  • reboot before running puppet
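
The hostname check above can be sketched as a small shell function, assuming a POSIX shell. The name is_build_fqdn is made up for illustration, and the pattern is deliberately loose: it only checks the overall shape of the name, not that the datacenter part is a real one.

```shell
# Hypothetical helper: succeed if the name looks like
# <host>.build.<datacenter>.mozilla.com, fail otherwise.
is_build_fqdn() {
    case "$1" in
        ?*.build.?*.mozilla.com) return 0 ;;
        *) return 1 ;;
    esac
}
```

On Linux it could be used as: is_build_fqdn "$(hostname --fqdn)" || echo "hostname needs fixing". On Mac, the output of plain hostname can be passed in the same way.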

Mac

Verify the hostname, checking that it ends in 'build.(datacenter).mozilla.com':

hostname

To fix it (su - to become root):

  • run the following: scutil --set HostName XXX

armenzg: TODO: Does anyone know why this step is needed? In my experience, even though the hostname looks like talos-r3-leopard-ref (XX), 1) Web Sharing, 2) Remote Login and 3) Remote Management all seem to have the correct HostName after running scutil and rebooting.

  • open System Preferences -> Sharing and change host name there

From Armen's experience this does not require intervention:

  • Note: the cltbld user is listed for auto-login in the System Preferences -> Accounts -> Login Options dialog

Aki couldn't get CotVNC to work:

  • cmd-K vnc://... on mac finder

puppet

Note: For PuppetAgain slaves (e.g. HP slaves) you should not need to do anything special for the machine to puppetize after a reimage. Just make sure that ~root/puppetize.log is from some point after it was imaged and that the last lines in it do not show errors.
See PuppetAgain Process Docs for the gritty details on why this is true.
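
A rough sketch of the "last lines do not show errors" check, assuming a POSIX shell. The function name log_tail_clean is hypothetical, and the pattern it greps for ('err' or 'fail', case-insensitive) is only a guess at what an error in puppetize.log looks like, so treat a match as a prompt to read the log rather than as proof of a problem.

```shell
# Hypothetical helper: look at the last N lines (default 20) of a log.
# Returns 0 if none of them look like errors, 1 if any do (and prints
# them), 2 if the log cannot be read at all.
log_tail_clean() {
    log=$1
    lines=${2:-20}
    [ -r "$log" ] || return 2
    # The pattern is an assumption, not the documented puppetize.log format.
    if tail -n "$lines" "$log" | grep -i -E 'err|fail'; then
        return 1
    fi
    return 0
}
```

As root this could be run as: log_tail_clean ~root/puppetize.log. It does not check the age of the file; ls -l ~root/puppetize.log shows whether the log was written after the reimage.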

Note that initial setup of puppet on slaves is very different from that on buildbot masters. On slaves the daemon is not run; instead, updates are polled for at times when they won't impact jobs. Do not enable the standard puppet service daemon on slaves.

To find the correct master to use with your slave(s), consult the puppet server list. If your slave isn't using a PuppetAgain master, you'll have to adjust /etc/sysconfig/puppet manually to reflect the correct master value for PUPPET_SERVER. (Search for your slave's hostname in the http://hg.mozilla.org/build/puppet-manifests *production.pp files at the root of the repo)
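
The lookup described above can be sketched in shell, assuming a local checkout of the puppet-manifests repo. Both function names are made up for illustration. The first assumes the manifests refer to slaves by their short hostname (everything before the first dot), and the second assumes /etc/sysconfig/puppet holds a plain PUPPET_SERVER=value line; neither assumption has been checked against the real files.

```shell
# Hypothetical helper: list the *production.pp files at the root of a
# puppet-manifests checkout that mention the given slave.
manifests_for_host() {
    checkout=$1
    short=${2%%.*}    # strip the domain, keep the short hostname
    grep -l -F -- "$short" "$checkout"/*production.pp
}

# Hypothetical helper: print the PUPPET_SERVER value from a
# sysconfig-style file, ignoring commented-out lines and quotes.
puppet_server_from() {
    sed -n 's/^PUPPET_SERVER=//p' "$1" | tr -d "\"'"
}
```

For example: manifests_for_host ~/puppet-manifests "$(hostname)" on a machine with a checkout, and puppet_server_from /etc/sysconfig/puppet on the slave to see which master it currently points at.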

Darwin note: remember to kill all instances of the run-puppet-and-buildbot.sh script first. It will be running with the refimage config, and until you kill it, it will keep overwriting your attempts to fix the puppet certs.

talos-r3-fed example (to help with doing it):

 uname -a # check that the hostname is correct and is the FQDN
 su - # switch to root
 # on a linux slave: remove the old certs
 rm -rf /var/lib/puppet/ssl/certs/*
 # on a mac slave: remove the old certs
 rm -rf /etc/puppet/ssl/certs/*
 # on the puppet master (which master depends on the datacenter the slave belongs to):
 # clean out any old cert for this slave
 puppetca --clean talos-r3-fed64-007.build.scl1.mozilla.com
 # on the slave: the first run submits a new cert request
 puppetd --test --server scl-production-puppet.build.scl1.mozilla.com
 # on the puppet master: sign the request
 puppetca --sign talos-r3-fed64-007.build.scl1.mozilla.com
 # on the slave: run again now that the cert is signed
 puppetd --test --server scl-production-puppet.build.scl1.mozilla.com
 # wait a few seconds and it should reboot
  • get the slave talking to puppet. This will require a lot of repetitive work:
    • adjust the master it talks to be appropriate to its location
      • NOTE: syncing against the correct master (I believe) adjusts these values.
      • Linux builder: /etc/sysconfig/puppet
      • Linux tester: /home/cltbld/.config/autostart/gnome-terminal.desktop
      • Mac: /Library/LaunchDaemons/com.reductiv*.plist
    • run puppetd --test --noop --server $server_you_chose
      • If you get an error about directory not existing on linux, run without "--noop" once, so the directories can be created.
      • If you get an error that the slave can't access /N to fetch certain packages, update the fileserver.conf on the slave to ensure that the subnet the slave resides on is included in the list of subnets that can access /N. Otherwise the slave won't be able to access the resources it needs from the /N directory served by Apache.
      • note that the scl server has a funny name!
      • if you see errors about certificates, remove the certificate files (/var/lib/puppet/ssl/certs/*, /var/puppet/ssl/certs/*, or /etc/puppet/ssl/certs/*, depending on the slave)
      • run puppetca --sign $slave_fqdn repeatedly on the appropriate puppet master. Cron runs it every 60 seconds, but waiting for the cron task just slows you down.
      • if told to, run puppetca --clean $slave_fqdn on the master - this occurs when the master has an old key for this slave
      • Note that if you see a successful run but nothing happens, you're probably talking to a master which has no configuration for this slave - check that you're talking to the right master, and that the master's site.pp file contains the slave's name, and try again.
    • once puppet hits the right master, it will both blow away the certificates (even though they were correct) and reboot. So you'll need to wait for a restart, log in, and go through the above process again. Hopefully you'll only need to do this once.
  • once puppet is done eviscerating itself, have a look at the slave's twistd.log. If it's getting an UnauthorizedLogin for connection to the staging master, fix the password or add the slave to the master's config. Otherwise, watch the staging master until the slave finishes a job.
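
Since several of the steps above boil down to "run it again until it works", here is a small generic retry helper as a sketch, assuming a POSIX shell. The name retry is made up, and using it with puppetca assumes that puppetca --sign exits non-zero while no request from the slave is pending, which has not been verified here.

```shell
# Hypothetical helper: run a command until it succeeds.
# Usage: retry MAX_TRIES DELAY_SECONDS command [args...]
# Returns 0 as soon as the command succeeds, 1 if every try failed.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while ! "$@"; do
        if [ "$try" -ge "$max_tries" ]; then
            return 1
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    return 0
}
```

On the puppet master this might be used as: retry 10 5 puppetca --sign "$slave_fqdn", which tries up to ten times, five seconds apart, instead of waiting for the 60-second cron task.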

How to fix the hostname for Windows

Rather than repeating the information in each section, here are the instructions for all of our Windows platforms.

  • right-click on 'My Computer', go to 'Properties', 'Computer Name'
  • change the hostname
  • the domain name should be build.mozilla.org
    • Otherwise, click 'Change', type the computer name, click 'More', type the domain, and click OK until it restarts.

Windows slaves will come back from a re-image with "talos-r3-xp-ref" or "talos-r3-w7-ref" as the hostname.

Windows 2008 64-bit (MDT & unmanaged)

These machines are set up almost entirely by Group Policy; only the following needs to be set up after re-imaging:

Windows 2003 (soon to be obsoleted)

Activation

Nothing to do here except keep track of it. Windows 2003 already comes pre-activated. You can check with:

oobe/msoobe /a

Hostname

OPSI

  • No action required

The entry for your slave on OPSI has been created from a template which has a package called "passwordupdate". That package is set to run "always", so even if the snapshot has older passwords they are updated immediately to the current ones.

Windows XP (OPSI partially)

tasklist

Make sure that you can run the command tasklist. If you can't, ask IT to re-image again. This issue is documented in their imaging instructions: https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=28575847

Hostname

If you don't add the DNS change for Windows slaves using OPSI, you will most likely get a Mit_Netzlaufwerken_verbinden error before the machine logs in.

OPSI

  • No action required

The entry for your slave on OPSI has been created from a template which has a package called "passwordupdate". That package is set to run "always", so even if the snapshot has older passwords they are updated immediately to the current ones.

Windows 7 (unmanaged)

The test reference platform is fairly complete.

Activation

Win7 will need to be activated. IT should have done this, but check by going to Control Panel -> System -> Activate Windows. A failure to activate will burn builds later.

If it is not activated, ask IT to do so.

Hostname

Slavealloc notes and settings keys

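  1. Install the correct set of secrets on the machine. These include:
    • Update the ssh keys to the correct values for the destination pool
    • Linux (doing android builds): scp -oBatchMode=no -r other_same_class_host:.{android,mozpass.cfg} .
    • if you're troubleshooting a recently returned slave, you may also want to reverse engineer How To/Clean A Slave For Shipment Externally (https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Clean_A_Slave_For_Shipment_Externally)
  2. Change the slave's fields, e.g. for production (non-try):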
    • Trust: core
    • Environ: prod
    • Pool: build-scl1 (or whatever is appropriate)
  3. reboot it.