Sheriffing/Deciding To Close A Tree: Difference between revisions

{{Sheriffing How To|Deciding to close a tree}}
== Deciding to close a tree ==
Many objective and subjective criteria are part of the decision to close a tree. Tree closure means that developers are prevented from pushing or merging code to a codebase. Later, sheriffs will reopen the trees when the problem appears to be resolved.
Some of the criteria used include:  
* Broken build on an integration or main tree (e.g. mozilla-inbound, mozilla-central, autoland)
* Excessive backlog for builds or tests in any platform ([https://yardstick.mozilla.org/d/ieg6Sho5/workers?orgId=1&from=now-24h&to=now&timezone=browser&var-provisioner=$__all&var-workerType=$__all&var-Adhoc=&var-Filters=&refresh=5m Grafana monitoring dashboard]). Example:
[[File:Sheriffing workers vs queue.png|center]]
The upper graph shows the number of active workers for a worker type; the lower graph shows the number of pending jobs waiting to run. Normally, the number of active workers increases to work off the backlog. If that is not possible (in the example, after 20:00), e.g. because the worker limit has been reached or there is an infrastructure issue, regularly check the trees monitored by sheriffs: builds should start in less than 15 minutes and tests in less than 30 minutes. If they do not, close the trees (category "infrastructure" if the full capacity is not being used, "backlog" if Taskcluster is using machines up to the capacity limit). Notify #ci on IRC about the issue and file a bug, regardless of whether the trees need to be closed.
* Infrastructure or systems failures that affect a significant number of tests or builds (e.g. AWS, data center, networking issues)
* Mass "bustage" that could hide other test failures (this is when code lands and causes multiple tests to fail across multiple chunks of tests or suites of tests, making it harder to catch further failures if something else lands *during* the period in which these tests are failing from the original code landing)


In short, if the state of the tree and the surrounding systems is such that things are going to get worse if the trees stay open, it is time to close the tree.
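
The backlog criterion above can be sketched as a small decision function. This is only an illustration of the rule as described on this page, not a tool sheriffs actually run: the function name, its parameters, and the idea of feeding it numbers read off the Grafana dashboard are assumptions, and the 30-minute limit for tests is read as "less than 30 minutes".

```python
from typing import Optional

# Thresholds taken from the text above.
BUILD_START_LIMIT_MIN = 15  # builds should start in less than 15 minutes
TEST_START_LIMIT_MIN = 30   # tests should start in less than 30 minutes (assumed reading)


def closure_category(build_wait_min: float, test_wait_min: float,
                     active_workers: int, worker_limit: int) -> Optional[str]:
    """Hypothetical helper: return None to keep the trees open,
    otherwise the closure category to select."""
    builds_ok = build_wait_min < BUILD_START_LIMIT_MIN
    tests_ok = test_wait_min < TEST_START_LIMIT_MIN
    if builds_ok and tests_ok:
        return None
    # Workers are at the capacity limit: the pool is simply too small.
    if active_workers >= worker_limit:
        return "backlog"
    # Capacity is available but unused: something in the infrastructure is wrong.
    return "infrastructure"
```

For example, `closure_category(20, 10, 500, 500)` yields `"backlog"`, because builds wait longer than 15 minutes while the worker pool is already at its limit.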
== Closing a tree ==
# Open Treeherder
# Click '''Infra > TreeStatus > Login > Treestatus''' (https://lando.services.mozilla.com/treestatus/)
# Check the tree(s) you want to be closed > '''Update tree(s)'''
# Set '''Status''' to closed and, under '''Tags''', select the reason why you’re closing the tree. If you choose “Other”, write the reason, e.g.: failures on bug 123456
# Make sure the “Remember change” option is checked
# Click '''Update'''
== How to re-open tree(s) ==
# From Treestatus’ main page, you should see the closed tree(s) on the '''Recent Changes''' section.
# Click the green <span style="color:#FF0000">'''Restore'''</span> button in order to re-open the trees.


== Actions to take ==
** If the cause is not infrastructure- or load-related, you can probably leave the Try branch open and only close the affected tree, e.g. mozilla-inbound
** If the cause *is* infrastructure- or load-related, you should close all trees, including Try.
* Use the [https://lando.services.mozilla.com/treestatus/ TreeStatus] tool to close the affected trees.
* If a bug doesn't already exist, create a bug for the tree closure.
* Communicate the tree closure to developers. Announce the closure in IRC in #developers and change the channel topic to point to the tree closure bug. This avoids several unnecessary frustrations:
** the developer(s) who landed the suspected code (if this is known)
** domain experts for the module where the builds/tests are failing. The [[Modules/All|Module owner list]] can help track people down.
** the ciduty, releng, and/or the taskcluster teams if it's an infrastructure issue
* If the tree closure is expected to last a long time, post a short mail to the mozilla.dev.platform newsgroup, e.g. https://groups.google.com/forum/#!topic/mozilla.dev.platform/Kzd1es4KiYA
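
The rule above for choosing which trees to close can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, the tree names are the ones mentioned on this page, and the actual closing is still done by hand in TreeStatus.

```python
from typing import List


def trees_to_close(affected: List[str], all_trees: List[str],
                   infra_or_load_related: bool) -> List[str]:
    """Hypothetical helper: pick the trees a sheriff would close."""
    if infra_or_load_related:
        # Infrastructure or load problems hit every tree, including Try.
        return list(all_trees)
    # Otherwise only the affected tree(s) are closed; Try stays open.
    return [tree for tree in all_trees if tree in affected]


trees = ["autoland", "mozilla-central", "try"]
print(trees_to_close(["autoland"], trees, infra_or_load_related=False))
print(trees_to_close(["autoland"], trees, infra_or_load_related=True))
```

The first call selects only the affected tree; the second selects every tree, Try included.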




Note: in the event of more systemic failures, e.g. major infrastructure failures or AWS outages, it is best to [[Sheriffing/How:To:Escalate|escalate the issue to the MOC]] (Mozilla Operations Center, #moc on IRC). They have 24/7 support and much experience dealing with outages.
== Tree closing policies by other teams ==
* [https://moz-releng-docs.readthedocs.io/en/latest/procedures/TCW_Process.html Tree Closing Window (TCW) planning by Release Engineering (RelEng)]
* [https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/ops.html#scheduled-outage-policy hg.mozilla.org maintenance outage policy]