ReleaseEngineering/How To/Manage Buildbot with Fabric
RelEng has started writing some tools to manage all the buildbot masters using fabric.
The manage_masters.py script is available from the tools repository, in the buildfarm/maintenance directory.
Fabric is a prerequisite for running these tools; it installs easily into a virtual environment with pip. Setting up ssh-agent is strongly recommended (see below for details).
Setup
hg clone ssh://hg.mozilla.org/build/tools
cd tools
mkvirtualenv tools
pip install fabric
Usage
Usage: manage_masters.py [options] action [action ...]
Supported actions (run the script to see if there are more): check, checkconfig, show_revisions, reconfig, restart, graceful_restart, stop, graceful_stop, start, update
Example:
python manage_masters.py \
    -f http://hg.mozilla.org/build/tools/raw-file/tip/buildfarm/maintenance/production-masters.json \
    -R scheduler check
or, if your tools repo is up to date, simply:
python manage_masters.py -f production-masters.json -R scheduler check
buildbot-wrangler.py
Make sure you run fabric from "buildfarm/maintenance", since buildbot-wrangler.py lives there and must be uploaded to the masters during a reconfig. Running from anywhere else fails with an error like:
Traceback (most recent call last):
  File "build/bdist.macosx-10.6-universal/egg/fabric/main.py", line 540, in main
  File "/Users/armenzg/repos/releng/braindump/buildbot-related/master_fabric.py", line 99, in reconfig
    put('buildbot-wrangler.py', '%s/buildbot-wrangler.py' % m['master_dir'])
  File "build/bdist.macosx-10.6-universal/egg/fabric/network.py", line 391, in host_prompting_wrapper
  File "build/bdist.macosx-10.6-universal/egg/fabric/operations.py", line 283, in put
ValueError: 'buildbot-wrangler.py' is not a valid local path or glob.
Disconnecting from production-master02.build.mozilla.org... done.
Suggestions
Don't use fabric to reconfig the test masters if you are in a rush (e.g. backing something out): the reconfigs are sequential, so it takes forever.
If you need to reconfig everything, it is much faster to run four instances of fabric (each in a different terminal). The reconfig step blocks: fabric won't continue to the next host in a role group until it finishes. (Remember that the reconfig step does NOT update.)
# In case it is not clear: run each of these in a different window
python manage_masters.py -f production-masters.json -j16 -R scheduler update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R build update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R try update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -R tests update checkconfig reconfig
The tests reconfig can take a really long time, so you can parallelize further by using -M {macosx|windows|linux|tegra|panda} (instead of "-R tests"), each in a different tab, plus -j16. So, replace the last line/window with these 5 (for a total of 8 windows):
python manage_masters.py -f production-masters.json -j16 -M macosx update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M windows update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M linux update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M tegra update checkconfig reconfig
python manage_masters.py -f production-masters.json -j16 -M panda update checkconfig reconfig
To validate the above (i.e. we haven't added any new platforms since the docs were updated), run:
diff -u \
    <(./manage_masters.py -f production-masters.json -l -R tests) \
    <(./manage_masters.py -f production-masters.json -l -M macosx \
        -M windows -M linux -M tegra -M panda)
If any differences are reported, include those platforms and update the docs.
Hosts and role groups
Fabric works on individual hosts, and supports organizing these hosts into groups. This is mostly a good fit for how we need to work, except we often have multiple buildbot masters on a single host, so there is a bit of hacking in master_fabric.py to pick out the right hosts to operate on depending on what the user has selected.
Hosts are selected with the -H flag, and roles are selected with the -R flag. Hosts correspond to the 'name' field in the masters json file, and are short abbreviations to refer to each master, e.g. bm13-build1, bm19-tests1-tegra, bm33-try1, bm36-build_scheduler. We have 4 roles defined: build, scheduler, try, and tests. Selecting a role restricts fabric to masters that serve that role.
The string 'all', when specified via -H or -R, means that all masters in the masters file will be operated on. You can also use the -M flag to match on strings in the master name, e.g. -M tests1-windows to pick up all the windows test masters. Note that manage_masters.py "or"s together all host specifications from the command line, e.g. "-R tests -M windows" will return all hosts in role "tests" (plus anything matching "windows"), not just the windows test masters.
Fabric relies on being able to ssh to the masters without password authentication, so be sure to have your ssh keys set up! That means the needed keys must be loaded into your running ssh-agent (e.g. via ssh-add); your "~/.ssh/config" file is not consulted by Paramiko. If you don't have the keys set up, you'll be asked for your password once per invocation, so use multiple commands per invocation where appropriate.
Updating checkout
python manage_masters.py -f production-masters.json -R scheduler update
[production-master02.build.mozilla.org] run: hg pull
[production-master02.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbotcustom
[production-master02.build.mozilla.org] out: searching for changes
[production-master02.build.mozilla.org] out: adding changesets
[production-master02.build.mozilla.org] out: adding manifests
[production-master02.build.mozilla.org] out: adding file changes
[production-master02.build.mozilla.org] out: added 11 changesets with 19 changes to 12 files
[production-master02.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master02.build.mozilla.org] run: hg update -r default
[production-master02.build.mozilla.org] err: .hgtags@8546abc704ee, line 93: tag 'FIREFOX_3_6_9_BUILD1' refers to unknown node
[production-master02.build.mozilla.org] err: .hgtags@8546abc704ee, line 94: tag 'FIREFOX_3_6_9_RELEASE' refers to unknown node
[production-master02.build.mozilla.org] out: 12 files updated, 0 files merged, 2 files removed, 0 files unresolved
[production-master02.build.mozilla.org] run: hg pull
[production-master02.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbot-configs
[production-master02.build.mozilla.org] out: searching for changes
[production-master02.build.mozilla.org] out: adding changesets
[production-master02.build.mozilla.org] out: adding manifests
[production-master02.build.mozilla.org] out: adding file changes
[production-master02.build.mozilla.org] out: added 35 changesets with 49 changes to 32 files
[production-master02.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master02.build.mozilla.org] run: hg update -r default
[production-master02.build.mozilla.org] err: .hgtags@ac95f8973f7e, line 221: tag 'FIREFOX_3_6_13_RELEASE' refers to unknown node
[production-master02.build.mozilla.org] err: .hgtags@ac95f8973f7e, line 222: tag 'FIREFOX_3_6_13_BUILD1' refers to unknown node
[production-master02.build.mozilla.org] out: 32 files updated, 0 files merged, 0 files removed, 0 files unresolved
[production-master01.build.mozilla.org] run: hg pull
[production-master01.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbotcustom
[production-master01.build.mozilla.org] out: searching for changes
[production-master01.build.mozilla.org] out: adding changesets
[production-master01.build.mozilla.org] out: adding manifests
[production-master01.build.mozilla.org] out: adding file changes
[production-master01.build.mozilla.org] out: added 5 changesets with 13 changes to 10 files
[production-master01.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master01.build.mozilla.org] run: hg update -r default
[production-master01.build.mozilla.org] out: 10 files updated, 0 files merged, 2 files removed, 0 files unresolved
[production-master01.build.mozilla.org] run: hg pull
[production-master01.build.mozilla.org] out: pulling from http://hg.mozilla.org/build/buildbot-configs
[production-master01.build.mozilla.org] out: searching for changes
[production-master01.build.mozilla.org] out: adding changesets
[production-master01.build.mozilla.org] out: adding manifests
[production-master01.build.mozilla.org] out: adding file changes
[production-master01.build.mozilla.org] out: added 10 changesets with 11 changes to 9 files
[production-master01.build.mozilla.org] out: (run 'hg update' to get a working copy)
[production-master01.build.mozilla.org] run: hg update -r default
[production-master01.build.mozilla.org] out: 9 files updated, 0 files merged, 0 files removed, 0 files unresolved
Done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from production-master02.build.mozilla.org... done.
Show which revisions are checked out
The column order is: master name, then the checked-out revisions of buildbotcustom, buildbot-configs, and tools.
$ python manage_masters.py -f production-masters.json -R build -R scheduler show_revisions
pm01-bm 1046bc8c7e00 57e8bc4354d2 cfca31588669
pm01-sm 1046bc8c7e00 57e8bc4354d2 cfca31588669
pm02-sm 1046bc8c7e00 57e8bc4354d2 cfca31588669
pm03-bm 1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm3 1046bc8c7e00+ 57e8bc4354d2 cfca31588669
bm4 1046bc8c7e00 57e8bc4354d2 cfca31588669
Looks like we have some local modifications! Bad Release Engineers, no scotch^W cookie for you.
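The trailing '+' that hg appends to a modified working copy's revision makes local modifications easy to spot mechanically. A sketch (dirty_masters is a hypothetical helper, not part of the tools repo):

```python
def dirty_masters(output):
    """Given show_revisions output lines (master name followed by the
    buildbotcustom, buildbot-configs, and tools revisions), return the names
    of masters whose checkouts carry local modifications, which hg flags
    with a trailing '+' on the revision hash."""
    dirty = []
    for line in output.strip().splitlines():
        fields = line.split()
        name, revs = fields[0], fields[1:]
        if any(rev.endswith("+") for rev in revs):
            dirty.append(name)
    return dirty
```

Piping the show_revisions output through something like this is a quick way to audit a large role group for stray edits.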
Checkconfig
python manage_masters.py -f production-masters.json -R build -R scheduler checkconfig
bm3 OK
pm02-sm OK
pm01-bm OK
pm01-sm OK
bm4 OK
Done.
Disconnecting from buildbot-master1.build.mozilla.org... done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from buildbot-master2.build.mozilla.org... done.
Disconnecting from production-master02.build.mozilla.org... done.
Reconfigure
Reminder: 'reconfig' only performs the reconfiguration; you must have run 'update' and 'checkconfig' beforehand.
python manage_masters.py -f production-masters.json -R build reconfig
[buildbot-master1.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/build_master3/master/buildbot-wrangler.py
[buildbot-master1.build.mozilla.org] run: rm -f *.pyc
[buildbot-master1.build.mozilla.org] run: python buildbot-wrangler.py reconfig .
[production-master01.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/builder_master1/buildbot-wrangler.py
[production-master01.build.mozilla.org] run: rm -f *.pyc
[production-master01.build.mozilla.org] run: python buildbot-wrangler.py reconfig .
[production-master01.build.mozilla.org] err: 2010-11-24 06:58:26-0800 [Broker,252,10.2.71.15] Unhandled Error
[production-master01.build.mozilla.org] err: Traceback (most recent call last):
[production-master01.build.mozilla.org] err: Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
[production-master01.build.mozilla.org] err: ]
[buildbot-master2.build.mozilla.org] put: buildbot-wrangler.py -> /builds/buildbot/build_master4/master/buildbot-wrangler.py
[buildbot-master2.build.mozilla.org] run: rm -f *.pyc
[buildbot-master2.build.mozilla.org] run: python buildbot-wrangler.py reconfig .
Done.
Disconnecting from buildbot-master1.build.mozilla.org... done.
Disconnecting from production-master01.build.mozilla.org... done.
Disconnecting from buildbot-master2.build.mozilla.org... done.
If the reconfig gets stuck, see How To/Unstick a Stuck Slave From A Master.
As a special case for test masters, you can unstick things by either:
- triggering a "Clean Shutdown" from the web UI for that master, or
- using the manage_masters.py graceful_restart command
After jobs complete, the master will shut down (its web page will no longer be served). Fabric should notice and unstick itself at that point. If it doesn't, run the update and start steps individually in a separate window. If fabric still doesn't notice, good luck, and document whatever works.