CIDuty/Infrastructure Performance


Infrastructure performance

Pending Jobs

We will sometimes be starved for capacity on one or more platforms. Because there are multiple potential causes, and hence multiple possible paths to resolution, the steps for dealing with high pending counts are on their own page.

Wait times

High wait times are often related to the pending-job backlogs described above.

Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
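As a rough illustration of that target, here is a minimal sketch (hypothetical code, not the actual buildapi reporting code) that checks whether a batch of jobs met the 95%-within-15-minutes commitment, assuming you have submit and start timestamps for each job:

from datetime import datetime, timedelta

TARGET_FRACTION = 0.95
MAX_WAIT = timedelta(minutes=15)

def meets_commitment(jobs):
    """jobs: iterable of (submitted_at, started_at) datetime pairs."""
    jobs = list(jobs)
    if not jobs:
        return True  # nothing submitted, trivially within target
    on_time = sum(1 for submitted, started in jobs
                  if started - submitted <= MAX_WAIT)
    return on_time / float(len(jobs)) >= TARGET_FRACTION

# Made-up example: one job waited 5 minutes, the other 40 minutes.
sample = [
    (datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 6, 9, 5)),
    (datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 6, 9, 40)),
]
print(meets_commitment(sample))  # False: only 50% started within 15 minutes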

Build and Try (build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g), so they usually meet the 95% target unless there is an outage.

Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.

Fixing errant test slaves is therefore more important than fixing build slaves. See Slave Management below.

Wait times are available either from the buildAPI wait times report or from the daily emails sent to dev.tree-management (un-filter them in Zimbra). Respond by email to any unusually long wait times, preferably with a reason.
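If you want to pull the numbers programmatically rather than read the email, a sketch along these lines may help; note that the report URL, query parameters, and JSON field names below are assumptions, so check the actual buildAPI wait times report for the real interface:

import requests

# Assumed report location; verify against the live buildapi instance.
WAITTIMES_URL = "https://secure.pub.build.mozilla.org/buildapi/reports/waittimes"

def fetch_wait_times(pool="buildpool"):
    """Fetch a wait-times summary for one slave pool (fields assumed)."""
    resp = requests.get(WAITTIMES_URL,
                        params={"pool": pool, "format": "json"},
                        timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = fetch_wait_times()
    # Print whatever summary the report exposes; these keys are placeholders.
    print(data.get("pool"), data.get("pending"), data.get("wait_times"))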

Wait times emails are sent via crontab entries set up on relengwebadm.private.scl3.mozilla.com under the buildapi user.

Slave management

Bad slaves can burn builds, and hung slaves can cause bad wait times. These slaves need to be rebooted or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.

The Nagios wiki has more information about finding problem slaves using Nagios.

See the Slave Management wiki for more information about fixing those slaves.

dev.tree-management

Monitor the dev.tree-management newsgroup (by email or by NNTP).

Watch for long-running builds (i.e. running for more than a day) that are holding on to a slave.

See the buildAPI list of running builds.
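A quick way to spot those builds is a filter like the sketch below; the input format is an assumption (a JSON list of running builds, each with an epoch start_time and a slave_name field, roughly what the running-builds report can export), so verify the field names before relying on it:

import json
import sys
import time

ONE_DAY = 24 * 60 * 60  # seconds

def long_running(builds, now=None):
    """Return builds that have been running for more than a day."""
    now = now or time.time()
    return [b for b in builds if now - b.get("start_time", now) > ONE_DAY]

if __name__ == "__main__":
    # e.g. python long_running.py < running_builds.json
    builds = json.load(sys.stdin)
    for b in long_running(builds):
        hours = (time.time() - b["start_time"]) / 3600.0
        print("%-30s running for %.1f hours" % (b.get("slave_name", "?"), hours))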