Looking for who is on buildduty? - check the tree-info dropdown on tbpl
Buildduty not around? - please open a bug

Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "buildduty."

Here's how to do it.

Schedule

Mozilla Releng Sheriff Schedule (Google Calendar|iCal|XML)

General Duties

How should I make myself available for duty?

Add 'buildduty' to your IRC nick
Be in at least #developers, #buildduty and #build (as well as #mozbuild, of course)
- It is also useful to be in #mobile, #planning, #release-drivers, and #ateam
Watch http://tbpl.mozilla.org

What else should I take care of?

You should keep on top of:

Handle developer requests in #developers and #build.
- Direct people to http://mzl.la/tryhelp for self-serve documentation when appropriate.
Pending builds - available in graphs or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.
Wait times - either this page or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason).
- The wait time emails are sent by crontab entries set up on buildapi01.build.scl1.mozilla.com under the buildapi user.
Bad/hung slaves - bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
- See the section below on #Slave_Maintenance
All bugs tagged with [buildduty] in the whiteboard: buildduty saved search
Monitor the dev.tree-management newsgroup (by email or by nntp)
Watch for long-running builds that are holding on to slaves, i.e. for more than 1 day. See the build api page.
You may need to plan a downtime. Coordinate with IT to send downtime notices with enough advance notice.
- See the section below on #Downtimes
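
For the long-running-build check in the list above, a minimal sketch of the filtering step might look like this. The record keys ("slave", "builder", "start") and the sample data are made up for illustration; the real build api output may look different.

```python
from datetime import datetime, timedelta

# Threshold from the duty list above: builds holding a slave for more than 1 day.
MAX_RUNTIME = timedelta(days=1)


def long_running(builds, now):
    """Return the builds that started more than MAX_RUNTIME before `now`.

    `builds` is a list of dicts with hypothetical keys "slave", "builder"
    and "start" (a datetime).
    """
    return [b for b in builds if now - b["start"] > MAX_RUNTIME]


# Made-up sample data, for illustration only.
now = datetime(2012, 6, 1, 12, 0)
sample = [
    {"slave": "slave-a", "builder": "builder-1", "start": datetime(2012, 5, 30, 9, 0)},
    {"slave": "slave-b", "builder": "builder-2", "start": datetime(2012, 6, 1, 11, 0)},
]
for build in long_running(sample, now):
    print(build["slave"], build["builder"], now - build["start"])
```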

Scheduled Reconfigs

Buildduty is responsible for reconfiguring the Buildbot masters every Monday and Thursday, their time. During this, buildduty needs to merge default -> production branches and reconfig the affected masters. This wiki page has step-by-step instructions. Additional reconfigs at other times are fine too.

If the reconfig gets stuck, see How To/Unstick a Stuck Slave From A Master.

You should use Fabric to do the reconfig!

The person doing reconfigs should also update the reconfig deployments page.

Tree Maintenance

Repo Errors

If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:

File a bug (or have the dev file it) and then poke noahm in #ops
- If he doesn't respond, then escalate the bug to page on-call
Follow the steps below for "How do I close the tree"

How do I see problems in TBPL?

All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.

How do I close the tree?

See ReleaseEngineering/How_To/Close_or_Open_the_Tree

How do I claim a rentable project branch?

See ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE

Re-run jobs

How to trigger Talos jobs

see ReleaseEngineering/How_To/Trigger_Talos_Jobs

How to re-trigger all Talos runs for a build (by using sendchange)

see ReleaseEngineering/How_To/Trigger_Talos_Jobs

How to re-run a build

Do not go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.

Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that you MUST specify the branch, so that there are no null keys in builds-running.js.

Try Server

Deploy TryChooser changes

ReleaseEngineering/How_To/Update_the_Try_Syntax

Jobs not scheduled at all?

Recreate the comment of their change with http://people.mozilla.org/~lsblakk/trychooser/ and compare the two to make sure theirs is correct.

Then do a sendchange and tail the scheduler master:

 buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit

If tryserver was just reset, verify that the scheduler has been reset as well.

How do I trigger additional talos/test runs for a given try build?

see ReleaseEngineering/How_To/Trigger_Talos_Jobs

Using the TryChooser to submit build/test requests

buildduty can also use the same TryChooser syntax as developers use to (re)submit build and testing requests. Here is an example:

 buildbot sendchange --master buildbot-master10:9301 --revision 923103d5a656 --branch try --username mpalmgren@mozilla.com --comments "try: -b d -p linux -u all" doit
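
If you find yourself retyping that command a lot, a small helper along these lines can assemble it for you. This is only a sketch: `sendchange_argv` is a made-up name, and it uses nothing beyond the flags shown in the example above.

```python
import shlex


def sendchange_argv(master, revision, branch, username, comments, files=("doit",)):
    """Build the argv list for `buildbot sendchange` from the flags shown above."""
    return [
        "buildbot", "sendchange",
        "--master", master,
        "--revision", revision,
        "--branch", branch,
        "--username", username,
        "--comments", comments,
        *files,
    ]


argv = sendchange_argv(
    master="buildbot-master10:9301",
    revision="923103d5a656",
    branch="try",
    username="mpalmgren@mozilla.com",
    comments="try: -b d -p linux -u all",
)
# Print a shell-quoted command line for review; nothing is executed here.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the arguments in a list (instead of one string) means the try comment stays a single argument if you later hand `argv` to `subprocess.run`.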

How do I cancel existing jobs?

The cancellator.py script is set up on pm02. Here is a standard example:

# Dry run first to see what would be cancelled. 
python cancellator.py -b try -r 5ff84b660e90
# Same command run again with the force option specified (--yes-really) to actually cancel the builds
python cancellator.py -b try -r 5ff84b660e90 --yes-really

The script is intended for try builds, but can be used on other branches as long as you are careful to check that no other changes have been merged into the jobs. Use the revision/branch/rev report to check.
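
The dry-run-then-force habit can be captured in a tiny wrapper. This is a sketch only: `cancellator_commands` is a hypothetical helper, and it just builds the two command lines from the example above without running them.

```python
def cancellator_commands(branch, revision):
    """Return (dry_run, forced) argv lists for cancellator.py, in the order to run them."""
    dry_run = ["python", "cancellator.py", "-b", branch, "-r", revision]
    # The forced run is the same command plus --yes-really.
    forced = dry_run + ["--yes-really"]
    return dry_run, forced


dry_run, forced = cancellator_commands("try", "5ff84b660e90")
print(" ".join(dry_run))
print(" ".join(forced))
```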

Bug Commenter

This is on cruncher and is run in a crontab in lsblakk's account:

source /home/lsblakk/autoland/bin/activate && cd /home/lsblakk/autoland/tools/scripts/autoland \
&& time python schedulerDBpoller.py -b try -f -c schedulerdb_config.ini -u None -p None -v

You can see quickly if things are working by looking at:

/home/lsblakk/autoland/tools/scripts/autoland/postedbugs.log  # this shows what's been posted lately
/home/lsblakk/autoland/tools/scripts/autoland/try_cache  # this shows what the script thinks is 'pending' completion

Nightlies

How do I re-spin mozilla-central nightlies?

To rebuild the same nightly, buildbot's Rebuild button works fine.

To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.

You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.
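
To see which builders the /.*mozilla-central.*nightly/ pattern would pick up, you can try it against a list of builder names. The names below are invented for illustration; check the real ones on a build master.

```python
import re

# The pattern quoted above for the nightly builders.
NIGHTLY = re.compile(r".*mozilla-central.*nightly")

# Hypothetical builder names, for illustration only.
builders = [
    "Linux mozilla-central nightly",
    "OS X 10.6 mozilla-central nightly",
    "Linux mozilla-central build",
    "Linux mozilla-aurora nightly",
]

matching = [name for name in builders if NIGHTLY.match(name)]
for name in matching:
    print(name)
```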

Mobile

Android Tegras

Android Tegra BuildDuty Notes

Android Updates aren't working!

Did the version number just change? If so, you may be hitting bug 629528. Kick off another Android nightly.
Check aus3-staging for size 0 complete.txt snippets:
- https://bugzilla.mozilla.org/show_bug.cgi?id=652667#c1
- https://bugzilla.mozilla.org/show_bug.cgi?id=651925#c5
- If so, copy a non-size-0 complete.txt over the size 0 one. Only the most recent buildid should be size 0.
Check aus3-staging to see if the checksum is correct:
- https://bugzilla.mozilla.org/show_bug.cgi?id=652785#c2
- If so, either copy the complete.txt with the correct checksum to the 2nd-most-recent buildid directory, or kick off another Android nightly.
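
The size-0 check above can be scripted. This is a rough sketch that assumes the snippets sit in one directory per buildid under some root; the buildid names in the demo are invented and the real layout on aus3-staging may differ.

```python
import os
import tempfile


def empty_snippets(root, name="complete.txt"):
    """Return sorted paths of zero-byte `name` files found anywhere under `root`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                hits.append(path)
    return sorted(hits)


# Demo against a throwaway directory with made-up buildid names.
with tempfile.TemporaryDirectory() as root:
    for buildid, content in (("20120101000000", b"data"), ("20120102000000", b"")):
        os.makedirs(os.path.join(root, buildid))
        with open(os.path.join(root, buildid, "complete.txt"), "wb") as fh:
            fh.write(content)
    found = [os.path.relpath(path, root) for path in empty_snippets(root)]
print(found)
```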

Update mobile talos webhosts

We have a load balancer (bm-remote) in front of three web hosts (bm-remote-talos-webhost-0{1,2,3}). Update procedure:

 ssh root@bm-remote-talos-webhost-01
 cd /var/www/html/talos-repo
 # NOTICE that we have uncommitted files
 hg st
 # ? talos/page_load_test/tp4
 # Take note of the current revision to revert to (just in case)
 hg id
 hg pull -u
 # 488bc187a3ef tip
 rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-02:/var/www/html/.
 rsync -azv --delete /var/www/html/. bm-remote-talos-webhost-03:/var/www/html/.

Keep track of which revision is being run.

Deploy new tegra-host-utils.zip

There are three hosts behind a load balancer.

See bug 742597 for a previous instance of this.

ssh root@bm-remote-talos-webhost-01
cd /var/www/html/tegra
wget -O tegra-host-utils.742597.zip http://people.mozilla.org/~jmaher/tegra-host-utils.zip
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-02:/var/www/html/tegra/
rsync -azv /var/www/html/tegra/. bm-remote-talos-webhost-03:/var/www/html/tegra/

Slave Maintenance

In general, slave maintenance involves:

keeping as many slaves up as possible, including
- proactively checking for hung/broken slaves - see the last build per slave page which is updated once an hour.
- returning re-imaged slaves to production
handling nagios alerts for slaves
interacting with IT regarding slave maintenance

Known failure modes

talos-r3-fed|fed64
- these slaves frequently fail to reboot cleanly, knocking themselves off the network entirely
talos-r3-[w7|xp]
- Windows slaves have issues with modal dialogs, and sometimes the msys shell will fail to close properly. A manual reboot will usually clear this up.
talos-r4-[lion|snow]
- These slaves will sometimes fail to puppetize correctly. The remote_scutil_cmds.bash script can help with this.
- r4 slaves
tegras
- tegras can fail in many disparate ways. See ReleaseEngineering/How_To/Android_Tegras for more info.

File a bug

Use these bugzilla templates to file a new bug:
- Build slave bug:
- Test slave bug:
Make the individual slave bug block the appropriate colo reboot/recovery bug (check the machine domain):
- reboots-mtv1 - MTV
- reboots-scl1 - SCL1
- reboots-scl3 - SCL3
- tegra-recovery - tegras
- These bugs get closed when IT has recovered all of the individual blocking slaves. You should clone the recovery bug and move the alias forward as required. Otherwise, you may risk having other machines unintentionally rebooted that were added to the original alias.
Make sure the alias of the bug is the hostname
Create dependent bugs for any IT actions

Slave Tracking

Slave tracking is done via the Slave Allocator. Please disable/enable slaves in slavealloc and add relevant bug numbers to the Notes field.

Slavealloc

Adding a slave

Slaves are added to slavealloc via the 'dbimport' subcommand of the 'slavealloc' command. This is generally run as the slavealloc user on the slavealloc server, which is most easily accessed via su from root.

You'll want a command line something like

/tools/slavealloc/bin/slavealloc dbimport -D $db_url --slave-data mydata.csv

where $db_url is most easily found in slavealloc's shell history. The CSV file should have the headers specified by 'slavealloc dbimport --help':

name,basedir,distro,bitlength,purpose,datacenter,trustlevel,speed,environment,pool

Adding masters is similar - see dbimport's help for more information.
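
If you need to generate the CSV rather than hand-edit it, something like the following should do. It is a sketch: the header list is copied from above, but every value in the sample row is invented and must be replaced with real data.

```python
import csv
import io

# Header row as listed above for 'slavealloc dbimport'.
HEADERS = ["name", "basedir", "distro", "bitlength", "purpose",
           "datacenter", "trustlevel", "speed", "environment", "pool"]


def slave_csv(rows):
    """Render a list of dicts as CSV text with the dbimport header row first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # raises ValueError on keys that are not in HEADERS
    return buf.getvalue()


# All values below are placeholders, not real slavealloc data.
sample_row = {
    "name": "example-slave-001", "basedir": "/builds/slave", "distro": "example-distro",
    "bitlength": "64", "purpose": "build", "datacenter": "scl1",
    "trustlevel": "try", "speed": "fast", "environment": "prod", "pool": "example-pool",
}
text = slave_csv([sample_row])
print(text)
```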

Removing slaves

Connect to slavealloc@slavealloc and look at the history for a command looking like this:

 mysql -h $host_ip -p -u buildslaves buildslaves
 # type the password
 SELECT name FROM slaves WHERE notes LIKE '%bumblebumble%';
 # once the SELECT shows only the rows you expect, delete them
 DELETE FROM slaves WHERE notes LIKE '%bumblebumble%';
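
The select-first, delete-second habit is worth keeping. Here is a self-contained sketch of it that uses an in-memory SQLite table as a stand-in for the real MySQL database; the table contents are invented.

```python
import sqlite3

# In-memory stand-in for the buildslaves database; rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slaves (name TEXT, notes TEXT)")
conn.executemany("INSERT INTO slaves VALUES (?, ?)", [
    ("slave-01", "decommissioned bumblebumble"),
    ("slave-02", "in production"),
])

pattern = "%bumblebumble%"

# Step 1: look at what would be removed.
doomed = [row[0] for row in
          conn.execute("SELECT name FROM slaves WHERE notes LIKE ?", (pattern,))]
print("would delete:", doomed)

# Step 2: delete with exactly the same WHERE clause.
cursor = conn.execute("DELETE FROM slaves WHERE notes LIKE ?", (pattern,))
deleted = cursor.rowcount

remaining = [row[0] for row in conn.execute("SELECT name FROM slaves ORDER BY name")]
print("deleted:", deleted, "remaining:", remaining)
```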

Using briar-patch tools (kitten) to manage slaves

See ReleaseEngineering:Buildduty:Kitten

Nagios

What's the difference between a downtime and an ack?

Both will make nagios stop alerting, but there's an important difference: acks are forever. Never ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.

How do I interact with the nagios IRC bot?

nagios: status  (gives current server stats)
nagios: status $regexp  (gives status for a particular host)
nagios: status host:svc  (gives status for a particular service)
nagios: ignore  (shows ignores)
nagios: ignore $regexp  (ignores alerts matching $regexp)
nagios: unignore $regexp  (unignores an existing ignore)
nagios: ack $num $comment  (adds an acknowledgement comment; $num comes from [brackets] in the alert)
   (note that the numbers only count up to 100, so ack things quickly or use the web interface)
nagios: unack $num    (reverse an acknowledgement)
nagios: downtime $service $time $comment (copy/paste the $service from the alert; time suffixes are m,h,d)
e.g.: nagios-sjc1: downtime buildbot-master06.build.scl1:buildbot 2h bug 712988
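
A small formatter can help avoid typos in downtime requests. This is a sketch: `downtime_command` is a made-up helper, and it assumes the duration is a whole number followed by one of the m, h, d suffixes listed above.

```python
import re


def downtime_command(service, duration, comment, bot="nagios"):
    """Format a downtime request for the nagios IRC bot.

    Assumption: `duration` is digits followed by one of m, h or d.
    """
    if not re.fullmatch(r"\d+[mhd]", duration):
        raise ValueError("duration must look like 30m, 2h or 1d: %r" % duration)
    return "%s: downtime %s %s %s" % (bot, service, duration, comment)


print(downtime_command("buildbot-master06.build.scl1:buildbot", "2h",
                       "bug 712988", bot="nagios-sjc1"))
```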

How do I scan all problems Nagios has detected?

All unacknowledged problems:
- http://admin1.infra.scl1.mozilla.com/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=10
All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling):
- http://admin1.infra.scl1.mozilla.com/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346
Group hosts check
- http://admin1.infra.scl1.mozilla.com/nagios/cgi-bin/status.cgi?hostgroup=all&style=summary
  - e.g. tegras: http://admin1.infra.scl1.mozilla.com/nagios/cgi-bin/status.cgi?hostgroup=releng-build-tegra&style=overview
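
The status URLs above differ only in their query parameters, so they are easy to generate. A minimal sketch, using only the parameters that appear in the links above (`status_url` is a made-up helper name):

```python
from urllib.parse import urlencode

NAGIOS_STATUS = "http://admin1.infra.scl1.mozilla.com/nagios/cgi-bin/status.cgi"


def status_url(**params):
    """Build a status.cgi URL from query parameters, keeping their order."""
    return NAGIOS_STATUS + "?" + urlencode(params)


# All unacknowledged problems.
unacked = status_url(host="all", servicestatustypes=28,
                     hoststatustypes=15, serviceprops=10)
# Overview of one host group, e.g. the tegras.
tegras = status_url(hostgroup="releng-build-tegra", style="overview")
print(unacked)
print(tegras)
```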

How do I deal with Nagios problems?

Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.

Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ever disable notifications.

You can acknowledge a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.

For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.

You can also mark a service or host for downtime. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.

At worst, if you're overwhelmed, you can ignore certain alerts (see above) and scan the full list of problems (again, see above), then unignore.

Downtimes

The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the Downtimes page.

Talos

Note: because a change to the Talos bundle always changes the baseline times, the following should be done for *any* change:

close all trees that are impacted by the change
ensure all pending builds are done and GREEN
do the update step below
send a Talos changeset to all trees to generate new baselines

How to update the talos zips

NOTE: Deploying talos.zip is not scary anymore: we no longer replace the existing file, and the a-team has to land a change in the tree before the new zip is used.

You may need to get IT to turn on access to build.mozilla.org.

# on your localhost
wget http://people.mozilla.org/~jmaher/taloszips/zips/talos.07322bbe0f7d.zip
# wget from people doesn't work anymore
scp talos.07322bbe0f7d.zip armenzg@relengweb1.dmz.scl3.mozilla.com:/var/www/html/build/talos/zips
# ssh into the machine
ssh armenzg@build
chmod 644 /var/www/html/build/talos/zips/talos.07322bbe0f7d.zip

Note that you can get to root by running |sudo su -|

For talos.zip changes: Once deployed, notify the a-team and let them know that they can land at their own convenience.

Updating talos for Tegras

To update talos on Android,

# for foopy05 through foopy24 (there is no foopy21 in the list)
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}
cd /builds/talos-data/talos
hg pull -u

This will update talos on each foopy to the tip of default.
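
The foopy list is easy to get wrong by hand, so it can be generated instead. A sketch, assuming the range stays 05 through 24 with 21 skipped, as in the command above:

```python
# foopy05 .. foopy24, skipping foopy21 (matches the host list in the command above).
hosts = ["foopy%02d" % n for n in range(5, 25) if n != 21]

# Command line to paste into a terminal; nothing is executed here.
command = "csshX --login=cltbld " + " ".join(hosts)
print(command)
```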

TBPL

How to deploy changes

RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.

How to hide/unhide builders

In the 'Tree Info' menu select 'Open tree admin panel'
Filter/select the builders you want to change
Save changes
Enter the sheriff password and a description (with bug number if available) of your changes

Useful Links

Build Dashboard Main Page
- You can get JSON dumps for people to analyze by adding &format=json
- You can see all build and test jobs for a certain branch and revision by appending branch/revision to this link (e.g. revision/places/c4f8232c7aef)
http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.
http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)
L10n Nightly Dashboard
public How To documents
private How To documents

Standard Bugs

Reboots bugs have the Bugzilla aliases shown above.
For IT bugs that are marked "infra only" yet still need to be readable by RelEng, adding the release@ alias is not enough: people get updates but are not able to comment or read prior comments. Instead, cc the following:
- :aki, :armenzg, :bhearsum, :catlee, :coop, :hwine, :jhopkins, :joduinn, :joey, :kmoir, :nthomas, :rail

Ganglia

If you see that a host is reporting to ganglia incorrectly, this may be all it takes to fix it (e.g. bug 674233):

switch to root, service gmond restart

Queue Directories

If you see this in #build:

<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items

It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the Queue directories wiki page for details.
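
Once the underlying issue is fixed, the retry step can be scripted. This is only a sketch: it assumes "dead" and "new" are sibling directories under one queue directory, and the file names in the demo are invented. Check the Queue directories wiki page for the real layout before using anything like it.

```python
import os
import shutil
import tempfile


def retry_dead_items(queue_dir):
    """Move only the .json files from <queue_dir>/dead to <queue_dir>/new.

    Everything else in dead/ (e.g. logs) is left where it is.
    """
    dead = os.path.join(queue_dir, "dead")
    new = os.path.join(queue_dir, "new")
    moved = []
    for name in sorted(os.listdir(dead)):
        if name.endswith(".json"):
            shutil.move(os.path.join(dead, name), os.path.join(new, name))
            moved.append(name)
    return moved


# Demo against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as queue_dir:
    os.makedirs(os.path.join(queue_dir, "dead"))
    os.makedirs(os.path.join(queue_dir, "new"))
    for name in ("item-1.json", "item-1.log"):
        with open(os.path.join(queue_dir, "dead", name), "w") as fh:
            fh.write("{}")
    moved = retry_dead_items(queue_dir)
    left_behind = sorted(os.listdir(os.path.join(queue_dir, "dead")))
    retried = sorted(os.listdir(os.path.join(queue_dir, "new")))
print(moved, left_behind, retried)
```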

Cruncher

If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):

<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):

As root:

du -s -h /var/spool/*
# confirm that mqueue or clientmqueue is the oversized culprit
# stop sendmail, clean out the queues, restart sendmail
/etc/init.d/sendmail stop
rm -rf /var/spool/clientmqueue/*
rm -rf /var/spool/mqueue/*
/etc/init.d/sendmail start
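
If du is not handy, a rough equivalent of the first step can be sketched in Python. Note that this sums file sizes rather than disk blocks, so the numbers will not match du exactly; the demo runs against a throwaway directory, not the real /var/spool.

```python
import os
import tempfile


def tree_size(path):
    """Sum the sizes in bytes of all regular files under `path`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(root):
    """Return (name, bytes) for each direct subdirectory of `root`, largest first."""
    sizes = [(entry.name, tree_size(entry.path))
             for entry in os.scandir(root)
             if entry.is_dir(follow_symlinks=False)]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Demo with fake spool directories of known size.
with tempfile.TemporaryDirectory() as spool:
    for sub, nbytes in (("mqueue", 2000), ("clientmqueue", 10)):
        os.makedirs(os.path.join(spool, sub))
        with open(os.path.join(spool, sub, "msg"), "wb") as fh:
            fh.write(b"x" * nbytes)
    report = largest_subdirs(spool)
print(report)
```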
