ReleaseEngineering/How To/Process nagios alerts
This page covers various nagios alerts and gives pointers on how to resolve them. It also marks alerts that should now be handled automatically by the new automation infrastructure, so bugs can be filed as appropriate.
How to add/modify releng nagios alerts
The releng nagios alerts live in the sysadmins svn repo.
svn co svn+ssh://svn.mozilla.org/sysadmins/puppet/trunk/modules/nagios/manifests/releng
When adding new alerts, it's preferable to create hostgroups to define a class of machines that will share alerting characteristics, rather than adding alerts for single machines.
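A hostgroup-based alert might look roughly like the following sketch. The hostgroup name, member hosts, and check command here are illustrative only, not copied from the actual releng manifests:

```
# Illustrative nagios objects -- names, members, and commands are hypothetical.
define hostgroup {
    hostgroup_name  releng-buildbot-masters
    alias           Buildbot masters
    members         bm01.example.com, bm02.example.com
}

define service {
    use                   generic-service
    hostgroup_name        releng-buildbot-masters
    service_description   pending builds
    check_command         check_pending_builds
}
```

Attaching the service to the hostgroup means new masters only need to be added to `members` to pick up all of the group's checks.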
Processing existing alerts
- Affects: end-to-end time for developers. When we hit our warning threshold (currently 6hr), there have been builds waiting to *start* for that long.
- Runs on: nagios server, checking https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js
- Possible solutions:
- kill off unnecessary jobs
- make sure builds-pending.js isn't stale
- restart buildbot masters if they are slow
- for full options: Dealing with high pending counts
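The two quick checks above (pending count and staleness) can be sketched in shell. The sample file contents below are hypothetical; the real file is at https://secure.pub.build.mozilla.org/builddata/buildjson/builds-pending.js and its exact schema should be verified against a live copy before relying on this layout:

```shell
# Hypothetical local sample of builds-pending.js (assumed JSON layout:
# {"pending": {branch: {revision: [job, ...]}}}).
cat > builds-pending.js <<'EOF'
{"pending": {"mozilla-central": {"abc123": [{"id": 1}, {"id": 2}]},
             "mozilla-inbound": {"def456": [{"id": 3}]}}}
EOF

# Total pending jobs across all branches and revisions
total=$(python3 - <<'EOF'
import json
with open('builds-pending.js') as f:
    data = json.load(f)
print(sum(len(jobs)
          for branch in data['pending'].values()
          for jobs in branch.values()))
EOF
)
echo "pending: $total"

# Staleness check: flag the file if it hasn't been rewritten in 15 minutes
if find builds-pending.js -mmin +15 | grep -q .; then
    echo "builds-pending.js is stale"
fi
```

A steadily rising total alongside a fresh file points at real backlog; a plausible total with a stale file points at the generator, not the pool.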
- Affects: treeherder. This data is used to provide job history in treeherder. The specific file can be found here: http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz
- Runs on: relengwebadm host, as a cronjob under the buildapi user
- Possible solutions: usually this script fails or runs slowly when there is a problem with the buildbot status database: a lock, another long-running query, or simply load. Killing off the offending query and re-running the report-4hr script will fix this, but be aware that report-4hr can take a while to run, especially on a cold cache.
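Finding and killing the offending query typically looks like the following. Every specific here is a placeholder: the host, user, query id, and script path must come from the real buildapi/statusdb setup, and the credentials live on the relengwebadm host:

```
# All values in <angle brackets> are placeholders.
mysql -h <statusdb-host> -u <user> -p -e 'SHOW FULL PROCESSLIST'
# Identify the long-running or locked query in the output, then:
mysql -h <statusdb-host> -u <user> -p -e 'KILL <id>'
# Finally re-run the report as the buildapi user (path is hypothetical):
sudo -u buildapi <path-to-report-4hr-script>
```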
- Affects: buildbot masters. These are jobs that have become wedged (or have failed) in the queue directories and need to be resubmitted or deleted.
- See ReleaseEngineering/Queue_directories for debugging instructions.
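The queue directories follow a new/cur/dead layout: wedged or failed items land in dead/, and resubmitting means moving an item back into new/. The sketch below demonstrates that pattern against a throwaway temporary directory; the real paths on the masters are documented at ReleaseEngineering/Queue_directories and are not reproduced here:

```shell
# Demonstration only: a throwaway queuedir with the assumed new/cur/dead layout.
QUEUEDIR=$(mktemp -d)
mkdir -p "$QUEUEDIR/new" "$QUEUEDIR/cur" "$QUEUEDIR/dead"

# A wedged item ends up in dead/ -- a nonzero count here is what trips the alert
echo 'sendchange arguments' > "$QUEUEDIR/dead/item-1"
dead_count=$(ls "$QUEUEDIR/dead" | wc -l)
echo "dead items: $dead_count"

# To resubmit, move the item back into new/ so the queue processor retries it;
# to give up on it, delete it from dead/ instead.
mv "$QUEUEDIR/dead/item-1" "$QUEUEDIR/new/"
```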