Deciding to close a tree
Many objective and subjective criteria are part of the decision to close a tree. Tree closure means that developers are prevented from pushing or merging code to a codebase. Later, sheriffs will reopen the trees when the problem appears to be resolved.
Some of the criteria used include:
- Broken build on an integration or main tree (e.g. mozilla-inbound, mozilla-central, autoland)
- Excessive backlog for builds or tests on any platform (Grafana monitoring dashboard). Example:
The upper graph shows the count of active workers for a worker type; the lower one shows the number of jobs that are pending and waiting to run. In a normal situation, the number of active workers increases to reduce the backlog. If that is not possible (in the example, after 20:00), e.g. because the worker limit has been reached or there is an infrastructure issue, regularly check the trees monitored by sheriffs: builds should start in less than 15 minutes and tests in less than 30 minutes. If they do not, close the trees (category "infrastructure" if the full capacity is not in use, "backlog" if Taskcluster is using machines up to the capacity limit). Notify #ci on IRC and file a bug regardless of whether the trees need to be closed.
- Infrastructure or systems failures that affect a significant number of tests or builds (e.g. AWS, data center, networking issues)
- Mass "bustage" that could hide other test failures. This happens when a landing causes multiple tests to fail across multiple chunks or suites of tests, making it harder to catch further failures if something else lands *during* the period in which those tests are still failing.
- Infrastructure failure that affects our ability to see what's happening (e.g. Treeherder being down, not ingesting jobs, the data it consumes not being updated, or treestatus being broken so we're closed by default)
In short, if the state of the tree and the surrounding systems is such that things are going to get worse if the trees stay open, it is time to close the tree.
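The backlog criterion above is the most mechanical of these checks, so it can be written down as a small decision function. The following Python sketch is illustrative only: the `PoolSnapshot` type, its field names, and `closure_category` are hypothetical and not part of Treeherder, TreeStatus, or Taskcluster. A sheriff would fill in the values by reading the dashboard. The sketch encodes only the 15- and 30-minute thresholds and the two category names given above.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the text above: builds should start in less than
# 15 minutes and tests in less than 30 minutes.
MAX_BUILD_WAIT_MIN = 15
MAX_TEST_WAIT_MIN = 30


@dataclass
class PoolSnapshot:
    """Hypothetical summary of one worker type, read off the dashboard."""
    build_wait_min: float  # how long pending builds wait before starting
    test_wait_min: float   # how long pending tests wait before starting
    active_workers: int    # workers currently running
    worker_limit: int      # configured maximum number of workers


def closure_category(snapshot: PoolSnapshot) -> Optional[str]:
    """Return None if the trees can stay open, else the closure category."""
    builds_ok = snapshot.build_wait_min < MAX_BUILD_WAIT_MIN
    tests_ok = snapshot.test_wait_min < MAX_TEST_WAIT_MIN
    if builds_ok and tests_ok:
        return None
    # Running at the worker limit means the pool is simply overloaded.
    if snapshot.active_workers >= snapshot.worker_limit:
        return "backlog"
    # Spare capacity exists but jobs still wait: suspect infrastructure.
    return "infrastructure"
```

A result of `None` means the wait times are within the limits. Any other value is only a suggested category for the TreeStatus entry, and the decision to close remains the sheriff's.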
Closing a tree
- Open Treeherder
- Click Infra > TreeStatus > Login > Treestatus (https://mozilla-releng.net/treestatus)
- Check the tree(s) you want to be closed > Update tree(s)
- Status = closed; Tags = select the reason why you’re closing the tree. If “Other”, write the reason, e.g.: failures on bug 123456
- Make sure the “Remember change” option is checked
- Click Update
How to re-open tree(s):
- From Treestatus’ main page, you should see the closed tree(s) in the Recent Changes section.
- Click the green Restore button in order to re-open the trees.
Actions to take
Once you've decided to close the tree, you need to take the following steps:
- Decide which trees are affected.
  - If the cause is not infrastructure- or load-related, you can probably leave the Try branch open and close only the affected tree, e.g. mozilla-inbound.
  - If the cause *is* infrastructure- or load-related, you should close all trees, including Try.
- Use the TreeStatus tool to close the affected trees.
- If a bug doesn't already exist, create a bug for the tree closure.
- Communicate the tree closure to developers. Announce the closure in IRC in #developers and change the channel topic to point to the tree closure bug. This avoids several unnecessary frustrations:
  - developers trying to push and finding out it's not possible
  - developers spending time investigating failures in pending/running test runs (especially on Try) that are not caused by their changes
  - repeated inquiries to the sheriffs about why a tree is closed, what the ETA is, etc.
- Engage the people needed to fix the issue. This could be:
  - the developer(s) who landed the suspected code (if this is known)
  - domain experts for the module where the builds/tests are failing. The Module owner list can help track people down.
  - the ciduty, releng, and/or taskcluster teams if it's an infrastructure issue
- If the tree closure is expected to last a long time, post a short mail to the mozilla.dev.platform newsgroup, e.g. https://groups.google.com/forum/#!topic/mozilla.dev.platform/Kzd1es4KiYA
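The tree-selection rule from the first step above can be sketched in a few lines. This is a minimal illustration in Python: `trees_to_close` and the `ALL_TREES` list are hypothetical, the tree names are simply the ones mentioned on this page, and the real decision still involves judgment (note the "probably" in that step).

```python
from typing import Iterable, List


def trees_to_close(affected: Iterable[str],
                   all_trees: Iterable[str],
                   infra_or_load_related: bool) -> List[str]:
    """Return the trees to close, following the rule described above."""
    if infra_or_load_related:
        # Infrastructure or load problems hit every tree, Try included.
        return sorted(set(all_trees))
    # Otherwise close only the affected tree(s) and leave the rest open.
    return sorted(set(affected))


# Example tree names taken from this page; the list itself is an assumption.
ALL_TREES = ["autoland", "mozilla-central", "mozilla-inbound", "try"]

print(trees_to_close(["mozilla-inbound"], ALL_TREES, infra_or_load_related=False))
print(trees_to_close(["mozilla-inbound"], ALL_TREES, infra_or_load_related=True))
```

The first call returns only the affected tree; the second returns every tree in the list, including Try.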
In case it wasn't clear from the previous section, COMMUNICATION is the most important thing during any outage. When engaging others to help, letting them know that the trees are closed usually encourages prompt cooperation. Developers or service teams may need your help to test fixes efficiently on Try, or to back out particular changesets once they are implicated. Make yourself available to these people as required.
Everyone else not involved in fixing the issue is simply waiting for the trees to reopen and is effectively blocked. Developers and service teams may not have the time or experience to reliably update the tree closure bugs (or IRC) with status. Sheriffs should take the lead on this and keep other developers updated, both in the bug and in IRC.
In the event of longer tree closures, you may hit the end of your workday. If this happens, or better still, *before* it happens, find someone who can continue overseeing the tree closure. If you've followed all the above steps, there should be an adequate paper trail for someone to follow and continue.
Note: in the event of more systemic failures, e.g. major infrastructure failures or AWS outages, it is best to escalate the issue to the MOC (Mozilla Operations Center, #moc on IRC). They have 24/7 support and much experience dealing with outages.