ReleaseEngineering/How To/Process nagios alerts

This page covers various nagios alerts and explains how to resolve them. It also marks alerts that the new automation infrastructure should now handle automatically, so that bugs can be filed as appropriate.

How to add/modify releng nagios alerts

The releng nagios alerts live in the sysadmins svn repo.

svn co svn+ssh://svn.mozilla.org/sysadmins/puppet/trunk/modules/nagios/manifests/releng

When adding new alerts, it's preferable to create hostgroups to define a class of machines that will share alerting characteristics, rather than adding alerts for single machines.

Processing existing alerts

Backlog Age

builds-4hr

  • Affects: treeherder. This data is used to provide job history in treeherder. The specific file can be found here: http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz
  • Runs on: relengwebadm host, as a cronjob under the buildapi user
  • Possible solutions: this script usually fails or runs slowly when there are problems with the buildbot status database: a lock, another long-running query, or simply high load. Killing the offending query and re-running the report-4hr script will fix this. Be aware that the report-4hr script can take a while to run, especially on a cold cache.
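
When triaging this alert it can help to confirm how stale the published file is before touching the database. The following is a minimal sketch, not the actual nagios check. It assumes the web server returns a Last-Modified header for builds-4hr.js.gz. The function names and the warning/critical thresholds are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

# URL taken from the "Affects" note above.
BUILDS_4HR_URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz"

# Hypothetical thresholds in seconds; the real alert may use different values.
WARN_AGE = 15 * 60
CRIT_AGE = 30 * 60


def fetch_last_modified(url: str = BUILDS_4HR_URL) -> str:
    """Send a HEAD request and return the Last-Modified header.

    Assumes the server sets this header; raises ValueError if it does not.
    """
    with urlopen(Request(url, method="HEAD"), timeout=30) as response:
        header = response.headers.get("Last-Modified")
    if header is None:
        raise ValueError("server did not return a Last-Modified header")
    return header


def file_age_seconds(last_modified: str, now: datetime) -> float:
    """Return the age in seconds of a file, given its Last-Modified header."""
    modified = parsedate_to_datetime(last_modified)
    return (now - modified).total_seconds()


def classify_age(age: float, warn: float = WARN_AGE, crit: float = CRIT_AGE) -> str:
    """Map an age in seconds onto a nagios-style status string."""
    if age >= crit:
        return "CRITICAL"
    if age >= warn:
        return "WARNING"
    return "OK"
```

To check the live file, call classify_age(file_age_seconds(fetch_last_modified(), datetime.now(timezone.utc))). If the file is still stale after report-4hr has been re-run, the underlying database problem has probably not been resolved.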

Command Queue

  • Affects: buildbot masters. These are jobs that become wedged (possibly failed) in the queue and need to be resubmitted or deleted.
  • See ReleaseEngineering/Queue_directories for debugging instructions.