CIDuty/How To/Deprecated / Archived/Slave Management: Difference between revisions

From MozillaWiki
Revision as of 19:36, 15 April 2014

In general, slave management involves:

  • keeping as many slaves up as possible, including
    • proactively checking for hung/broken slaves - see slave health dashboard.
    • returning re-imaged slaves to production
  • handling nagios alerts for slaves
  • interacting with IT regarding slave maintenance

Known failure modes

  • talos-r4-snow, talos-mtnlion-r5
  • tegras and pandas
  • AWS slaves
    • a common failure is running out of disk space. They have a default disk allocation of 150GB, versus ours, which have 250GB. Catlee is working on changing that.
      • To clean them, you can run: mock_mozilla -v -r mozilla-centos6-i386 --scrub=all (see bug 829186 for an example).
    • Rail wrote a tool to manage AWS slaves: it enables or disables automatic reboot and automatic shutdown.
    • Mozilla DNS servers don't resolve AWS hostnames; this document describes how to resolve them.
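
The disk-space failure mode above can be spotted before a job fails. The following is a minimal sketch, not an existing tool: the 10 GiB threshold and the function names are assumptions made for illustration.

```python
import shutil

# Assumed threshold; this page does not say how much free space a job needs.
MIN_FREE_GB = 10


def free_gb(path="/"):
    """Return free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024 ** 3


def needs_scrub(path="/", min_free_gb=MIN_FREE_GB):
    """True when free space has dropped below the assumed threshold."""
    return free_gb(path) < min_free_gb


if __name__ == "__main__":
    if needs_scrub():
        print("Low disk space: consider running the mock_mozilla scrub command")
    else:
        print("Disk space OK: %.1f GiB free" % free_gb())
```

Run on the slave itself, this only reports the state; the cleanup remains the mock_mozilla command quoted above.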

Automated

Slave Rebooter

Slave rebooter is a script that analyzes recent slave activity and attempts to reboot slaves that it thinks are stuck. It is a SlaveAPI-based replacement for Kittenherder. Its code lives in the build/tools repository; it is deployed by Puppet and currently runs on buildbot-master65.

At the time of writing, it works for all hardware machines except Tegras and Pandas. Cloud machines are explicitly ignored because they don't suffer from the same types of transient failures.
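
To make the idea concrete, here is a rough sketch of how "stuck" slaves might be picked from recent activity. It is not the real slave rebooter: the 6-hour idle limit, the "tegra-" prefix, the `is_cloud` callback, and the function name are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumed values for illustration only; the real script's thresholds and
# naming rules are not described on this page.
IDLE_LIMIT = timedelta(hours=6)
SKIPPED_PREFIXES = ("tegra-", "panda-")


def reboot_candidates(last_activity, now, is_cloud=lambda name: False):
    """Return slaves whose last activity is older than IDLE_LIMIT.

    last_activity maps slave name -> datetime of its most recent job.
    Tegras, Pandas and cloud machines are skipped, as the text describes.
    """
    candidates = []
    for name, seen in sorted(last_activity.items()):
        if name.startswith(SKIPPED_PREFIXES) or is_cloud(name):
            continue
        if now - seen > IDLE_LIMIT:
            candidates.append(name)
    return candidates
```

The caller supplies `is_cloud` because this page does not say how cloud machines are named.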

Manual

Rebooting slaves

Find the slave page on slave health. There's a button to reboot the machine.

Filing bugs for IT

  • File a bug using the link in the slave health page for the slave - it will "do the right thing" to set up a new bug if needed.
  • File a "slave is unreachable" bug for IT.
  • Create dependent bugs for any IT actions. (As of 2014, we should file per-slave bugs for reboots instead of grouping together machines in the same DC into one bug.)

Slave Tracking

  • Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc.

NOTE: you no longer need to add the slave-specific bug number to the Notes field. Clicking on the help icon in slavealloc will look up the bug number and status for you, or create a template you can use to file a new bug. If there is another bug, e.g. for IT re-imaging, please add that extra bug number to the Notes field instead, using the format: 'bug #######.'

Slavealloc

Connecting

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command.

You will need to ssh as your own user onto the server which hosts slavealloc:

ssh <your user>@relengwebadm.private.scl3.mozilla.com 

Staging vs production

The DB urls for staging and production are shared in a PGP encrypted file used by the Release Engineering team. Ask someone else in the team if you do not have this file.

Adding a slave

Once you connect to relengwebadm (see above), to see the help for the slavealloc dbimport command, run:

/data/releng/www/slavealloc/slavealloc dbimport -h

To import data, first you need to create a CSV file, like this one:

name,distro,bitlength,speed,datacenter,trustlevel,environment,purpose,pool,basedir
panda-0887,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0887
panda-0888,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0888
panda-0889,panda,32,mini,scl1,try,prod,tests,tests-scl1-panda,/builds/panda-0889
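
For a long run of consecutively numbered slaves, the CSV can be generated rather than typed. This is a minimal sketch that reproduces the example above; the helper names `panda_rows` and `build_csv` are made up here, and the field values are simply the ones shown in the sample rows.

```python
import csv
import io

FIELDS = ["name", "distro", "bitlength", "speed", "datacenter",
          "trustlevel", "environment", "purpose", "pool", "basedir"]


def panda_rows(first, last):
    """Yield one row dict per panda in the inclusive range [first, last]."""
    for number in range(first, last + 1):
        name = "panda-%04d" % number
        yield {
            "name": name, "distro": "panda", "bitlength": "32",
            "speed": "mini", "datacenter": "scl1", "trustlevel": "try",
            "environment": "prod", "purpose": "tests",
            "pool": "tests-scl1-panda", "basedir": "/builds/" + name,
        }


def build_csv(rows):
    """Serialize rows to CSV text with the header line first."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling build_csv(panda_rows(887, 889)) yields the header plus the three panda lines shown above; write the result to a file and pass it to --slave-data.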

You'll want a command line something like:

/data/releng/www/slavealloc/slavealloc dbimport -D mysql://user:password@host/DB_name --slave-data <the csv file you just created containing slaves>

Adding a master

Adding masters is similar to adding a slave:

/data/releng/www/slavealloc/slavealloc dbimport -D mysql://user:password@host/DB_name  --master-data <csv file containing masters>

The following example shows the required fields, and example values:

nickname,fqdn,http_port,pb_port,datacenter,pool
bm89-tests1-panda,buildbot-master89.srv.releng.scl3.mozilla.com,8201,9201,scl3,tests-panda

To get a full list of allowed values for the various normalized fields to use in both import files, you can connect to the mysql database and query the tables directly:

SELECT name FROM bitlengths;
SELECT name FROM datacenters;
SELECT name FROM distros;
SELECT name FROM environments;
SELECT name FROM pools;
SELECT name FROM purposes;
SELECT name FROM speeds;
SELECT name FROM trustlevels;

Please note you'll need to set values in your CSV file that correspond to these allowed values.
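
A CSV file can be checked against those allowed values before it is imported. The sketch below is an illustration, not part of slavealloc: it assumes you have already collected the results of the SELECT statements above into Python sets, and the function name is hypothetical.

```python
import csv
import io

# CSV column -> lookup table named in the queries above.
LOOKUP_TABLES = {
    "bitlength": "bitlengths", "datacenter": "datacenters",
    "distro": "distros", "environment": "environments", "pool": "pools",
    "purpose": "purposes", "speed": "speeds", "trustlevel": "trustlevels",
}


def invalid_values(csv_text, allowed):
    """Return (line_number, column, value) for every value not allowed.

    `allowed` maps a lookup table name to the set of names it contains,
    e.g. the results of the SELECT statements above.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, table in LOOKUP_TABLES.items():
            value = row.get(column)
            # Columns absent from this CSV (e.g. in a masters file) are skipped.
            if value is not None and value not in allowed.get(table, set()):
                problems.append((line_number, column, value))
    return problems
```

An empty result means every normalized field in the file matches a known value.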

The slavealloc dbimport mechanism converts each line of the CSV file into an SQL INSERT statement. Unspecified fields will essentially be set to NULL. To see how the fields are mapped and normalized, see: https://hg.mozilla.org/build/tools/file/5439f10a7127/lib/python/slavealloc/scripts/dbimport.py#l111 (lines 111-137).
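
The NULL behaviour can be illustrated with a simplified stand-in. This sketch is not the dbimport code: it uses an in-memory SQLite database instead of MySQL, a reduced four-column table, and skips the normalization step entirely.

```python
import csv
import io
import sqlite3

COLUMNS = ["name", "distro", "speed", "basedir"]  # reduced, illustrative schema


def import_rows(conn, csv_text):
    """Insert one row per CSV line; columns missing from the CSV become NULL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO slaves (%s) VALUES (%s)" % (", ".join(COLUMNS), placeholders)
    for row in reader:
        # row.get() returns None for absent columns, which is stored as NULL.
        conn.execute(sql, [row.get(column) for column in COLUMNS])
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slaves (name TEXT, distro TEXT, speed TEXT, basedir TEXT)")
    # The CSV only provides name and distro; speed and basedir are left out.
    import_rows(conn, "name,distro\npanda-0887,panda\n")
    return conn.execute("SELECT name, distro, speed, basedir FROM slaves").fetchall()
```

In the demo, the speed and basedir columns come back as None (NULL) because the CSV never mentioned them.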

Moving slaves

Connect to relengwebadm and then connect to the mysql DB.

You have to determine the correct poolid and trustid values.

UPDATE slaves SET poolid=43, trustid=4 WHERE notes LIKE 'bug 917923 - to be converted into try hosts';
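
One way to determine the ids is to look them up by name in the pools and trustlevels tables. The sketch below runs against an in-memory SQLite stand-in so it is self-contained; the poolid and trustid column names in the lookup tables, the pool name, and the id values are assumptions, and a MySQL driver would typically use %s placeholders instead of ?.

```python
import sqlite3


def lookup_ids(conn, pool_name, trust_name):
    """Return (poolid, trustid) for the given names, or raise if either is unknown."""
    pool = conn.execute(
        "SELECT poolid FROM pools WHERE name = ?", (pool_name,)).fetchone()
    trust = conn.execute(
        "SELECT trustid FROM trustlevels WHERE name = ?", (trust_name,)).fetchone()
    if pool is None or trust is None:
        raise LookupError("unknown pool or trust level")
    return pool[0], trust[0]


def demo():
    # In-memory stand-in for the real database; ids and the pool name are made up.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pools (poolid INTEGER, name TEXT)")
    conn.execute("CREATE TABLE trustlevels (trustid INTEGER, name TEXT)")
    conn.execute("INSERT INTO pools VALUES (43, 'example-try-pool')")
    conn.execute("INSERT INTO trustlevels VALUES (4, 'try')")
    return lookup_ids(conn, "example-try-pool", "try")
```

The returned pair can then be substituted into the UPDATE statement above.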

Removing slaves

Connect to relengwebadm and then connect to the mysql DB.

 SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
 DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
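
Running the SELECT first shows what the DELETE will remove. That check can be enforced in code, as in this sketch; it uses an in-memory SQLite stand-in with a reduced two-column table, and the function name and `expected_count` parameter are hypothetical.

```python
import sqlite3


def delete_matching(conn, pattern, expected_count):
    """Delete slaves whose notes match `pattern`, but only if the number of
    matches equals `expected_count`; otherwise raise without deleting."""
    names = [row[0] for row in
             conn.execute("SELECT name FROM slaves WHERE notes LIKE ?", (pattern,))]
    if len(names) != expected_count:
        raise ValueError("expected %d rows, matched %d" % (expected_count, len(names)))
    conn.execute("DELETE FROM slaves WHERE notes LIKE ?", (pattern,))
    conn.commit()
    return names


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE slaves (name TEXT, notes TEXT)")
    conn.executemany("INSERT INTO slaves VALUES (?, ?)",
                     [("panda-0887", "bumblebumble decomm"), ("panda-0888", "keep")])
    deleted = delete_matching(conn, "%bumblebumble%", expected_count=1)
    remaining = [row[0] for row in conn.execute("SELECT name FROM slaves")]
    return deleted, remaining
```

Requiring the expected count up front guards against a LIKE pattern that matches more slaves than intended.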

Returning a re-imaged slave to production

(aka. post-imaging)

See ReleaseEngineering/How_To/Set_Up_a_Freshly_Imaged_Slave

How to decommission a slave

  • disable the slave in slavealloc, also setting its environment to "decomm"
  • if the hardware has failed:
    • file a bug against Server Ops:Releng to decomm the slave. They should (at the very least) make sure the nagios alerts are updated, DNS updated, and the hardware recovered from the dc.
  • if the hardware is still viable and can be used by another pool (e.g. r3 mini)
    • file a bug against Server Ops:Releng to have the slave re-imaged to another OS with bad wait times (usually Windows)
    • add the new slave to the buildbot configs, and make sure nagios monitoring is set up for the new slave (may require a new bug against relops)
  • remove the slave from the buildbot configs
  • remove the slave from puppet or opsi configs, if it exists in one

Windows

I'm hoping to add enough info to demystify Windows and allow anyone to debug a Windows machine.

Start up flow

This is how buildbot starts:

scheduled task (after login) -> start talos bat -> C:\slave\runslave.py

We started logging the start up of runslave.py under C:\slave\runslave.log. We do some clean up steps inside of the .bat file.

TODO: correct file names and paths

Trigger buildbot the natural way

You're logged in and you want to trigger buildbot the same way as if the machine had come back from a reboot.

Go to the Task Library, change the property of the scheduled task to allow running manually and hit "run" on the task (more or less).

Infra setup

The Windows machines are managed via GPO.

The Windows test machines have an on-board graphics card and a third-party graphics card. The screenshot shows two devices listed (screenshot: Xp - two graphic cards.png).

Windows basics

Command Prompt

Also known as cmd.exe. You can start it by clicking on the "Start" button, clicking on "Run...", and entering cmd.

Quick edit mode

You can change the properties of a Command Prompt to allow you to do these neat things:

  • right-click to paste
  • select text with the mouse and press Enter to copy it

You can do so by right-clicking on the Command Prompt window and changing the properties. You can also change the default settings for Command Prompts opened in the future.

If I recall correctly, we asked RelOps to deploy this setting to all of our Windows machines.

runas

In many places you can right click and run a process as root. However, sometimes you would want to do that from the command prompt.

runas /user:root command_that_you_want

Screen resolution

Manually: You can do a right click on the desktop and click on "Properties". You can then click on the "Settings" tab.

A while ago I wrote a script that adjusts the screen resolution on Win7 machines: http://hg.mozilla.org/build/tools/file/default/scripts/support/mouse_and_screen_resolution.py

The script also includes code to query screen resolutions.

We should find a way to prevent machines from starting up with a screen resolution that is too small. We could use runslave.py or start-buildbot.bat to prevent that (since we don't have pre-flight tasks yet).
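
Such a guard could look like the sketch below. It is not taken from runslave.py or the script linked above: the 1280x1024 minimum is an assumed value, the function names are hypothetical, and the resolution is read with the Win32 GetSystemMetrics call through ctypes.

```python
import sys

# Assumed minimum; this page does not state the resolution the tests require.
MIN_WIDTH, MIN_HEIGHT = 1280, 1024


def current_resolution():
    """Return (width, height) of the primary display on Windows."""
    import ctypes  # imported lazily; ctypes.windll only exists on Windows
    user32 = ctypes.windll.user32
    # 0 and 1 are SM_CXSCREEN and SM_CYSCREEN.
    return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)


def resolution_ok(width, height, min_width=MIN_WIDTH, min_height=MIN_HEIGHT):
    """True when the display is at least the required size in both dimensions."""
    return width >= min_width and height >= min_height


if __name__ == "__main__" and sys.platform == "win32":
    width, height = current_resolution()
    if not resolution_ok(width, height):
        sys.exit("Screen resolution %dx%d is too small; not starting buildbot"
                 % (width, height))
```

Exiting with a non-zero status before buildbot starts would keep a misconfigured machine from taking jobs, which is the point of the check.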

Registry

You can start the registry editor by running "regedit".

Computer Management

Right-click on "My Computer" and click on "Manage" (screenshot: Xp - computer management.png).

Check logs

You can review the logs of the Windows machine to debug issues. You should see things like reboot times, among others (screenshot: Xp - looking at logs.png).

Task Library

TODO

Xp

Working graphics setup

Xp - working screen setup.png

Win7

Working graphical setup

TODO

Task library

NOTE: To run the talosslave task manually, you need to change the properties of the task (screenshot: W7 - task library.png).

Win8 x64

TODO