

From MozillaWiki



What is buildduty?

Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "buildduty."

Who is on buildduty? (schedule)

Check the tree-info dropdown on tbpl. The person on buildduty should also have 'buildduty' appended to their IRC nick, and should be available in the #developers, #releng, and #buildduty IRC channels.

Mozilla Releng Sheriff Schedule (Google Calendar|iCal|XML)

Buildduty not around?

It happens, especially outside of standard North American working hours (0600-1800 PST). Please open a bug under these circumstances.

Buildduty priorities

How should I make myself available for duty?

What should I take care of?


Things fail. It's sad. Getting systems and services stood back up again is buildduty's top priority. Note: this doesn't mean you need to do all the work yourself. For big outages, rope in whatever help you need: domain experts from releng, managers, netops, relops...whoever.


  • Triage
    • Move bugs to the right component
    • Grab them if you're going to work on them
    • Find owners for the bugs if you're swamped
    • If a bug is not that important to fix, mark it P1/P2 to indicate that it has been triaged and that we will deal with it in the future
      • Mark them as "enhancement" if they're bugs that will help buildduty initiatives
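For triage, a query against the Bugzilla REST API can surface open, unprioritized bugs in one place. A minimal sketch; the product and component names in the usage comment are illustrative assumptions, so adjust them to the components you actually triage:

```shell
#!/bin/sh
# Build a Bugzilla REST query for open bugs with no priority set.
# The product/component values are supplied by the caller and should
# already be URL-encoded.
triage_url() {
  product=$1
  component=$2
  echo "https://bugzilla.mozilla.org/rest/bug?product=${product}&component=${component}&resolution=---&priority=--"
}

# Usage (requires network access; names are illustrative):
# curl -s "$(triage_url Release%20Engineering Buildduty)" | python -m json.tool
```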
Recurring tasks:

  • Loan requests triage: during your buildduty week, check this daily and keep it close to zero. (Instructions)
  • Buildduty report (generated once a day): make sure that bugs with no dependencies get dependencies filed, e.g. a diagnosis bug, decomm bug, etc. Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, etc.). Use the "View list in bugzilla" links to navigate the bugs more easily.
  • Not acknowledged nagios alerts: deal with them; file bugs if needed.

You might find it easiest to do this by creating a text file containing the list of the slaves you want to reboot (let's call it naughty_slaves.list), setting the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, and making sure you are on the VPN; then run:

 while read slave; do curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" -dfoo=bar "${slave}/actions/reboot"; done < naughty_slaves.list

This should reboot the lot in one fell swoop.


  • Reconfigs
  • Review "long running" and "lazy" AWS instances
    • When: 2-3 times a week (e.g. Mondays or after weekends/holidays, Wednesdays, and Fridays)
    • How: use aws sanity check email (sent daily):
      • Email filter => to:, subject: [cron] aws sanity check
      • for each host under heading "Long running instances", follow steps in dealing with long running instances
      • for each host under heading "Lazy long running instances", figure out why they're still up and not taking jobs
        • check twistd.log, uptime, the reboot history on the slave health page, etc.
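The "long running instances" check above can be partly scripted: the aws CLI lists running instances with their launch times, and instance age falls out with a little date arithmetic. A sketch, assuming the aws CLI is installed and configured and GNU date is available; the 7-day threshold is illustrative, not releng policy:

```shell
#!/bin/sh
# Flag EC2 instances that have been running longer than a threshold.

days_running() {
  # $1 = launch time as epoch seconds; prints whole days since launch
  echo $(( ( $(date +%s) - $1 ) / 86400 ))
}

if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' \
    --output text |
  while read -r instance_id launch_time; do
    # GNU date parses the ISO 8601 LaunchTime that AWS returns
    age=$(days_running "$(date -d "$launch_time" +%s)")
    [ "$age" -ge 7 ] && echo "$instance_id up for $age days"
  done
fi
```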
  • Buildduty triage


  • Review instances with an 'Unknown State/Type' or that have stopped for a while
    • When:
      • Once a week, ideally spaced evenly between passes, so Fridays if possible.
    • How:
      • use aws sanity check email (sent daily):
        • Email filter => to:, subject: [cron] aws sanity check
      • for each host under the headings "Unknown State" and "Unknown Type"
      • for each host under the heading "Stopped For A While"
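For the "Stopped For A While" hosts, the aws CLI also reports when and why each instance stopped, which helps decide between restarting and decommissioning. A sketch under the same aws-CLI assumption; the StateTransitionReason format shown in the comment is the usual AWS one:

```shell
#!/bin/sh
# List stopped instances with the reason/time they stopped.

# Extract the timestamp from a StateTransitionReason string such as
# "User initiated (2015-03-06 21:01:32 GMT)".
stop_time() {
  echo "$1" | sed -n 's/.*(\(.*\)).*/\1/p'
}

if command -v aws >/dev/null 2>&1; then
  aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=stopped \
    --query 'Reservations[].Instances[].[InstanceId,StateTransitionReason]' \
    --output text
fi
```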

Infrastructure performance

  • Pending builds
    • A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet)
    • The number of pending builds is available in graphs or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.
  • Wait times
    • This can be related to pending builds above.
    • Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
      • Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g) and are usually over 95% unless there is an outage.
      • Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
        • Fixing errant test slaves is hence more important than fixing build slaves. See Slave Management below.
    • Wait times are available either from the buildAPI wait times report or the daily emails that go to dev.tree-management (un-filter them in Zimbra). Respond to any unusually long wait times in email, preferably with a reason.
      • wait times emails are run via crontab entries set up under the buildapi user
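The 95%-within-15-minutes commitment above is easy to check mechanically once you have the counts from a wait times report. A small sketch; the counts in the usage line are hypothetical, so substitute the real numbers from the buildAPI report:

```shell
#!/bin/sh
# Check the releng wait-time commitment: at least 95% of jobs should start
# within 15 minutes of submission. Integer arithmetic avoids needing bc:
# started*100 >= total*95 is equivalent to started/total >= 0.95.
meets_commitment() {
  started_within_15=$1
  total=$2
  if [ $(( started_within_15 * 100 )) -ge $(( total * 95 )) ]; then
    echo OK
  else
    echo MISS
  fi
}

# Hypothetical counts from a daily report (1910/2000 = 95.5%):
meets_commitment 1910 2000   # prints OK
```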
  • Slave management
    • Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
    • The Nagios wiki has more information about finding problem slaves using nagios.
  • dev.tree-management
    • Monitor dev.tree-management newsgroup (by email or by nntp).
  • Watch for long running builds that are holding on to a slave, i.e. >1 day.


You should keep on top of:

Useful Links

Standard Bugs

  • For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias: people get updates but are not able to comment or read prior comments. Instead, cc the following:
    •  :aki, :armenzg, :bhearsum, :catlee, :coop, :hwine, :jhopkins, :kmoir, :nthomas, :rail
    •  :edmorley, :Tomcat, :RyanVM, :KWierso

Meeting Notes