ReleaseEngineering/Buildduty

From MozillaWiki

What is buildduty?

Every month, one person from the Release Engineering (releng) team is dedicated to helping developers with releng-related issues. This person is available during their regular work hours for the whole month. The role is similar to the sheriff role that rotates through the sheriffing team. To avoid confusion, the releng sheriff position is known as "buildduty."

Who is on buildduty? (schedule)

The person on buildduty should have 'buildduty' appended to their IRC nick, and should be available in the #developers, #releng, and #buildduty IRC channels.

Mozilla Releng Buildduty Schedule (Google Calendar|iCal|XML)

Buildduty not around?

It happens, especially outside of standard North American working hours (0600-1800 PT). If nobody on buildduty is around, please open a bug.

Buildduty priorities

How should I make myself available for duty?

  • Add 'buildduty' to your IRC nick
  • Be available in the following IRC channels (at least): #developers, #releng, and #buildduty (as well as #mozbuild of course)

What should I take care of?

Outages

Things fail. It's sad. Getting systems and services stood back up again is buildduty's top priority. Note: this doesn't mean you need to do all the work yourself. For big outages, rope in whatever help you need: domain experts from releng, managers, netops, relops...whoever.

The Dealing with Outages wiki has more instructions.

Daily

Buildduty Triage

The Buildduty report (generated hourly) should be your starting point for triage.

Note: Use the "View list in bugzilla" links in the buildduty report to navigate the bugs more easily.

At the top, it lists unassigned bugs for loan requests. You should try to keep this queue empty to make sure developers are unblocked. The wiki has instructions for how to loan a slave.

After loans are taken care of, make sure that bugs in the "No dependencies" section get dependencies filed, for example a diagnosis bug or a decomm bug.

Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, return to production, etc).
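The triage order described above can be sketched in a few lines of Python. This is only an illustration: the section names in TRIAGE_ORDER and the shape of the `report` dictionary are assumptions, not the actual format of the Buildduty report.

```python
# Hypothetical section names; the real report's headings may differ.
TRIAGE_ORDER = [
    "Loan requests",
    "No dependencies",
    "All dependencies resolved",
]


def triage_queue(report):
    """Return (section, bug_id) pairs in the order buildduty would handle them.

    `report` maps a section name to a list of bug IDs. Sections not named in
    TRIAGE_ORDER are placed last, in their original order.
    """
    known = [s for s in TRIAGE_ORDER if s in report]
    other = [s for s in report if s not in TRIAGE_ORDER]
    return [(section, bug) for section in known + other for bug in report[section]]
```

Given a report with bugs in several sections, `triage_queue` puts loan requests first so that developers are unblocked before anything else is looked at.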

Note: systemic issues (e.g. test failures that require further investigation) should *not* stay in the buildduty bugzilla component. It may be OK for you to take the bug and work on it depending on how much time you have, but generally these types of bugs should be moved to a more-appropriate component (e.g. General Automation) once buildduty has triaged them.

Aside from the buildduty report, there may also be unacknowledged nagios alerts in the #buildduty IRC channel. Deal with them, filing bugs as needed.

Infrastructure performance

In addition to the individual slave bugs tackled in triage above, there may be systemic issues that need investigating. The Infrastructure performance wiki has more details about how to do this, and links to the wiki page for how to deal with high pending counts.

Semi-Daily

  • Reconfigs
    • Run reconfigs (every day or two) for other relengers
  • Review "long running" and "lazy" AWS instances
    • When: 2-3 times a week (e.g. Mondays or after weekends/holidays, Wednesdays, and Fridays)
    • How: use aws sanity check email (sent daily):
      • Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
      • for each host under heading "Long running instances", follow steps in dealing with long running instances
      • for each host under heading "Lazy long running instances", figure out why they're still up and not taking jobs
        • twistd.log, uptime, reboot history on the slave health page et al
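If you would rather pick out the sanity check email with a script than with a mail client filter, the filter above could be expressed roughly as follows. This is a minimal sketch using Python's standard `email` module; the function name `is_aws_sanity_check` is made up for illustration, and the code only checks the recipient and subject given above. It does not parse the body of the email, whose layout is not described here.

```python
from email import message_from_string
from email.utils import getaddresses

# Values taken from the email filter described above.
SANITY_TO = "release+aws-sanity-check@mozilla.com"
SANITY_SUBJECT = "[cron] aws sanity check"


def is_aws_sanity_check(raw_message):
    """Return True if a raw RFC 822 message matches the sanity check filter."""
    msg = message_from_string(raw_message)
    # Collect every address in the To: header(s), lowercased for comparison.
    recipients = {addr.lower() for _, addr in getaddresses(msg.get_all("To", []))}
    subject = msg.get("Subject", "")
    return SANITY_TO in recipients and SANITY_SUBJECT in subject.lower()
```

A message addressed to the alias with a matching subject returns True; anything else returns False.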

Weekly

  • Review AWS instances that are in an "Unknown State", have an "Unknown Type", or have been stopped for a while
    • When:
      • Once a week. Ideally these reviews are evenly spaced, so aim for Fridays if possible.
    • How:
      • use aws sanity check email (sent daily):
        • Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
      • for each host under heading "Unknown State", "Unknown Type"
      • for each host under heading "Stopped For A While"

Others

You should keep on top of:

Useful Links

Standard Bugs

  • For IT bugs that are marked "infra only" but still need to be readable by RelEng, adding the release@ alias is not enough: people receive updates but cannot comment or read prior comments. Instead, cc the following:
    •  :bhearsum, :Callek, :catlee, :coop, :hwine, :jlund, :kmoir, :mrrrgn, :nthomas, :rail
    •  :Tomcat, :RyanVM, :KWierso

Meeting Notes