CIDuty: Difference between revisions

87 bytes removed, 1 April 2015
Line 60: Line 60:


== Infrastructure performance ==
* '''Pending builds'''
=== Pending Jobs ===
** A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet).
We will sometimes be starved for capacity on one or more platforms. Because there are multiple potential causes, and hence multiple possible paths to resolution, the steps for [[ReleaseEngineering/How_To/Dealing_with_high_pending_counts|dealing with high pending counts]] are on their own page.
** The number of pending builds is available in [http://builddata.pub.build.mozilla.org/reports/pending/pending.html graphs].  The graphs are helpful for noticing anomalous behavior.
 
* '''Wait times'''
=== Wait times ===
** Long wait times are often related to the pending builds described above.
Long wait times are often related to the pending jobs described above.
** Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
 
*** Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g) and usually meet the 95% target unless there is an outage.
Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
*** Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
 
**** Fixing errant test slaves is hence more important than fixing build slaves. See '''Slave Management''' below.
Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g) and usually meet the 95% target unless there is an outage.
** Wait times are available either from [https://secure.pub.build.mozilla.org/buildapi/reports/waittimes the buildAPI wait times report] or the daily emails that go to dev.tree-management (un-filter them in Zimbra).  Respond to any unusually long wait times in email, preferably with a reason.
 
*** The wait times emails are sent via crontab entries set up on relengwebadm.private.scl3.mozilla.com under the buildapi user.
Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
* '''Slave management'''
 
** Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
Fixing errant test slaves is hence more important than fixing build slaves. See '''Slave Management''' below.
** The [[ReleaseEngineering/Buildduty/Nagios|Nagios wiki]] has more information about finding problem slaves using Nagios.
 
*** See the [[ReleaseEngineering/Buildduty/Slave_Management|Slave Management wiki]] for more information about fixing those slaves.
Wait times are available either from [https://secure.pub.build.mozilla.org/buildapi/reports/waittimes the buildAPI wait times report] or the daily emails that go to dev.tree-management (un-filter them in Zimbra).  Respond to any unusually long wait times in email, preferably with a reason.
* '''dev.tree-management'''
 
** Monitor the dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp]).
The wait times emails are sent via crontab entries set up on relengwebadm.private.scl3.mozilla.com under the buildapi user.
* Watch for long-running builds that are holding on to a slave, i.e. for more than 1 day.
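The 95%-within-15-minutes commitment can be checked mechanically once submission and start times are known. The following is only a minimal sketch: the function names and the <code>(submitted, started)</code> pair layout are hypothetical and do not reflect the format of the buildAPI wait times report.

```python
from datetime import datetime, timedelta

# Values taken from the commitment described above.
COMMITMENT_WINDOW = timedelta(minutes=15)
COMMITMENT_TARGET = 0.95


def fraction_started_in_time(jobs, window=COMMITMENT_WINDOW):
    """Return the fraction of jobs that started within `window` of submission.

    `jobs` is an iterable of (submitted, started) datetime pairs; this layout
    is an assumption made for illustration. Returns None when there are no jobs.
    """
    jobs = list(jobs)
    if not jobs:
        return None
    on_time = sum(1 for submitted, started in jobs
                  if started - submitted <= window)
    return on_time / len(jobs)


def meets_commitment(jobs, target=COMMITMENT_TARGET):
    """True if at least `target` of the jobs started inside the window."""
    fraction = fraction_started_in_time(jobs)
    return fraction is not None and fraction >= target
```

Running the check per pool (build, try, test) rather than over all jobs together would match how the commitment is discussed above, since the test pool is the one that usually falls short.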
 
** See the [https://secure.pub.build.mozilla.org/buildapi/running buildAPI list of running builds].
=== Slave management ===
Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
 
The [[ReleaseEngineering/Buildduty/Nagios|Nagios wiki]] has more information about finding problem slaves using Nagios.
 
See the [[ReleaseEngineering/Buildduty/Slave_Management|Slave Management wiki]] for more information about fixing those slaves.
=== dev.tree-management ===
Monitor the dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp]).
 
Watch for long-running builds that are holding on to a slave, i.e. for more than 1 day.
 
See the [https://secure.pub.build.mozilla.org/buildapi/running buildAPI list of running builds].
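Spotting builds that have held a slave for more than a day amounts to a simple age filter over the running builds. The sketch below assumes each running build is a dict with hypothetical <code>slave</code> and <code>started</code> keys; the real buildAPI output may be structured differently.

```python
from datetime import datetime, timedelta

# Threshold taken from the "more than 1 day" guideline above.
MAX_RUNTIME = timedelta(days=1)


def long_running(builds, now, limit=MAX_RUNTIME):
    """Return the builds that have been running for longer than `limit`.

    `builds` is an iterable of dicts with the hypothetical keys 'slave'
    (slave name) and 'started' (datetime the build began). `now` is passed
    in explicitly so the check is easy to test and reproduce.
    """
    return [build for build in builds if now - build["started"] > limit]
```

Passing <code>now</code> as an argument instead of calling <code>datetime.now()</code> inside the function keeps the result deterministic for a given snapshot of running builds.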


== Others ==