Some of the criteria used include:
* Broken build on an integration or main tree (e.g. mozilla-inbound, mozilla-central, autoland)
* Excessive backlog for builds or tests on any platform ([https://yardstick.mozilla.org/d/ieg6Sho5/workers?orgId=1&from=now-24h&to=now&timezone=browser&var-provisioner=$__all&var-workerType=$__all&var-Adhoc=&var-Filters=&refresh=5m Grafana monitoring dashboard]). Example:
[[File:Sheriffing workers vs queue.png|center]]
The upper graph shows the number of active workers for a worker type; the lower graph shows the number of jobs that are pending and waiting to run. Normally, the number of active workers increases to reduce the backlog. If that is not possible (as in the example after 20:00), e.g. because the worker limit has been reached or because of an infrastructure issue, sheriffs must regularly check the trees they monitor to confirm that builds start in less than 15 minutes and tests in less than 30 minutes. Otherwise, the trees must be closed, using the category "infrastructure" if Taskcluster is not using its full capacity, or "backlog" if Taskcluster is using machines up to the capacity limit. Regardless of whether the trees need to be closed, notify #ci on IRC about the issue and file a bug.
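The closure rule above can be summarized as a small decision function. This is a minimal sketch, assuming a sheriff has already read the longest wait of pending builds and of pending tests, plus the active and maximum worker counts, off the dashboard. The `BacklogSnapshot` type, its fields, and `closure_reason` are hypothetical names for illustration only; they are not part of any Mozilla tool or Grafana API.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds from the text: builds must start in under 15 minutes, tests in under 30.
BUILD_START_LIMIT_MIN = 15
TEST_START_LIMIT_MIN = 30


@dataclass
class BacklogSnapshot:
    """Hypothetical summary of what a sheriff reads off the dashboard."""
    build_wait_min: float  # longest wait of a pending build, in minutes (assumed metric)
    test_wait_min: float   # longest wait of a pending test, in minutes (assumed metric)
    active_workers: int    # current number of active workers for the worker type
    max_workers: int       # configured worker limit for the worker type


def closure_reason(s: BacklogSnapshot) -> Optional[str]:
    """Return the tree-closure category, or None if the trees can stay open."""
    within_limits = (
        s.build_wait_min < BUILD_START_LIMIT_MIN
        and s.test_wait_min < TEST_START_LIMIT_MIN
    )
    if within_limits:
        return None
    # Workers are at the limit: Taskcluster is using its full capacity.
    if s.active_workers >= s.max_workers:
        return "backlog"
    # Below the limit but still backlogged: capacity is not being used.
    return "infrastructure"


print(closure_reason(BacklogSnapshot(10, 20, 50, 100)))   # None
print(closure_reason(BacklogSnapshot(20, 20, 100, 100)))  # backlog
print(closure_reason(BacklogSnapshot(10, 45, 60, 100)))   # infrastructure
```

Notifying #ci and filing a bug happen in every case, so they are deliberately left out of the function, which only decides whether to close the trees and with which category.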