ReleaseEngineering/Buildduty


What is buildduty?

Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "buildduty."

Who is on buildduty? (schedule)

The person on buildduty should have 'buildduty' appended to their IRC nick, and should be available in the #developers, #releng, and #buildduty IRC channels.

Mozilla Releng Buildduty Schedule (Google Calendar|iCal|XML)

Buildduty not around?

It happens, especially outside of standard North American working hours (0600-1800 PT). Please open a bug under these circumstances.

Buildduty priorities

How should I make myself available for duty?

What should I take care of?

Outages

Things fail. It's sad. Getting systems and services stood back up again is buildduty's top priority. Note: this doesn't mean you need to do all the work yourself. For big outages, rope in whatever help you need: domain experts from releng, managers, netops, relops...whoever.

Daily

  • Triage
    • Move bugs to the right component
    • Grab them if you're going to work on them
    • Find other owners for the bugs if the bug is urgent and you're swamped
  • Loan requests triage: during your buildduty week, check this daily and keep the count close to zero. (Instructions)
  • Buildduty report (generated hourly): make sure that bugs with no dependencies get dependencies filed, e.g. diagnosis bug, decomm bug, etc. Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, etc). Use the "View list in bugzilla" links to navigate the bugs more easily.
  • Not acknowledged nagios alerts: deal with them. File bugs if needed.

To reboot a batch of slaves, you might find it easiest to create a text file listing the slaves you want to reboot (let's call it naughty_slaves.list), set the MY_LDAP_USER and MY_LDAP_PASSWORD environment variables to your LDAP credentials, make sure you are on the VPN, and then run:

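# Reboot every slave listed in naughty_slaves.list (one hostname per line).
# -d foo=bar sends a dummy form field, which makes curl issue a POST request.
while IFS= read -r slave; do
  curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" -d foo=bar \
    "https://secure.pub.build.mozilla.org/slaveapi/slaves/${slave}/actions/reboot"
done < naughty_slaves.list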

This should reboot the lot in one fell swoop.

Please note, if you need to call other Slave API actions, such as "shutdown" instead of reboot, see the API docs here: http://mozilla-slaveapi.readthedocs.org/en/latest/api/#endpoints
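
As a rough sketch, the reboot loop could be generalised to take the action name as a parameter. This assumes other actions use the same URL pattern as reboot, which you should confirm in the API docs linked above. The names slaveapi_url, slaveapi_action, SLAVEAPI_BASE and DRY_RUN are made up for this illustration and are not part of Slave API.

```shell
#!/bin/sh
# Hypothetical helpers: the names here are illustrative, not part of Slave API.
SLAVEAPI_BASE="https://secure.pub.build.mozilla.org/slaveapi"

# Build the action URL for one slave.
# ASSUMPTION: every action lives under /slaves/<slave>/actions/<action>,
# as the reboot URL does.
slaveapi_url() {
  printf '%s/slaves/%s/actions/%s\n' "$SLAVEAPI_BASE" "$1" "$2"
}

# Request an action for every slave named on stdin (one per line).
# With DRY_RUN=1 the URLs are printed instead of being requested.
slaveapi_action() {
  action=$1
  while IFS= read -r slave; do
    [ -n "$slave" ] || continue   # skip blank lines
    url=$(slaveapi_url "$slave" "$action")
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "$url"
    else
      curl -u "${MY_LDAP_USER}:${MY_LDAP_PASSWORD}" -d foo=bar "$url"
    fi
  done
}
```

Running `DRY_RUN=1 slaveapi_action shutdown < naughty_slaves.list` first lets you eyeball the URLs before anything is actually sent.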

Semi-Daily

  • Reconfigs
  • Review "long running" and "lazy" AWS instances
    • When: 2-3 times a week (e.g. Mondays or after weekends/holidays, Wednesdays, and Fridays)
    • How: use aws sanity check email (sent daily):
      • Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
      • for each host under heading "Long running instances", follow steps in dealing with long running instances
      • for each host under heading "Lazy long running instances", figure out why they're still up and not taking jobs
        • twistd.log, uptime, reboot history on the slave health page et al
  • Buildduty triage

Weekly

  • Review instances listed as "Unknown State", "Unknown Type", or "Stopped For A While"
    • When:
      • Once a week. Preferably the reviews are evenly spaced, so let's say Fridays if possible.
    • How:
      • use aws sanity check email (sent daily):
        • Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
      • for each host under heading "Unknown State", "Unknown Type"
      • for each host under heading "Stopped For A While"
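
Both the semi-daily and the weekly reviews work through the hosts listed under a heading in the aws sanity check email. Here is a minimal sketch for pulling those hosts out of a saved copy of the email. The email layout is an assumption: a line containing the heading text, then one line per host with the hostname as the first field, then a blank line. The function name hosts_under_heading is hypothetical; adjust the parsing to whatever the real email looks like.

```shell
# Print the first field of each line that follows a line containing the
# given heading text, stopping at the next blank line.
# ASSUMPTION (not confirmed by the email itself): the body looks like
#   Long running instances
#   host1 (details...)
#   host2 (details...)
#   <blank line>
# Usage: hosts_under_heading "HEADING" saved_email.txt
hosts_under_heading() {
  awk -v heading="$1" '
    in_section && NF == 0   { in_section = 0 }   # a blank line ends the section
    in_section              { print $1 }         # hostname is the first field
    index($0, heading) > 0  { in_section = 1 }   # the heading line starts it
  ' "$2"
}
```

The match is case-sensitive, so "Long running instances" does not also pick up the "Lazy long running instances" section.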

Infrastructure performance

  • Pending builds
    • A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet)
    • The number of pending builds is available in graphs. The graphs are helpful for noticing anomalous behavior.
  • Wait times
    • This can be related to pending builds above.
    • Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
      • Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g) and are usually over 95% unless there is an outage.
      • Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
        • Fixing errant test slaves is hence more important than fixing build slaves. See Slave management below.
    • Wait times are available either from the buildAPI wait times report or the daily emails that go to dev.tree-management (un-filter them in Zimbra). Respond to any unusually long wait times in email, preferably with a reason.
      • The wait times emails are sent by crontab entries set up on relengwebadm.private.scl3.mozilla.com under the buildapi user
  • Slave management
    • Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
    • The Nagios wiki has more information about finding problem slaves using nagios.
  • dev.tree-management
    • Monitor dev.tree-management newsgroup (by email or by nntp).
  • Watch for long running builds that are holding on to a slave, i.e. for more than 1 day.
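
To make the 95%-within-15-minutes commitment concrete, here is a small sketch that computes the share of jobs that started within a given limit. The input format (one wait time in minutes per line) and the function name pct_within are assumptions for illustration; the buildAPI wait times report and the daily emails remain the real source of these numbers.

```shell
# Print the percentage of jobs that started within LIMIT minutes.
# Input: one wait time in minutes per line on stdin (hypothetical format).
# Usage: pct_within LIMIT < wait_times.txt
pct_within() {
  awk -v limit="$1" '
    NF  { total++; if ($1 + 0 <= limit + 0) ok++ }   # count non-blank lines
    END { if (total) printf "%.1f\n", 100 * ok / total; else print "n/a" }
  '
}

# Example: 3 of these 5 jobs started within 15 minutes.
printf '%s\n' 2 5 14 16 40 | pct_within 15   # prints 60.0
```

A result below 95.0 for a pool means that pool is missing the commitment for the sampled period.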

Others

You should keep on top of:

Useful Links

Standard Bugs

  • For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias: people get updates but are not able to comment or read prior comments. Instead, cc the following:
    •  :aki, :armenzg, :bhearsum, :catlee, :coop, :hwine, :jhopkins, :kmoir, :nthomas, :rail
    •  :edmorley, :Tomcat, :RyanVM, :KWierso

Meeting Notes