ReleaseEngineering/How To/Process nagios alerts
Revision as of 02:50, 26 April 2016
This page covers various nagios alerts and gives pointers on how to resolve them. It also marks alerts that should now be handled automatically by the new automation infrastructure, so that bugs can be filed as appropriate.
How to add/modify releng nagios alerts
The releng nagios alerts live in the sysadmins svn repo.
svn co svn+ssh://svn.mozilla.org/sysadmins/puppet/trunk/modules/nagios/manifests/releng
When adding new alerts, it's preferable to create hostgroups to define a class of machines that will share alerting characteristics, rather than adding alerts for single machines.
Processing existing alerts
Command Queue
- Affects: buildbot masters. These are jobs that become wedged (possibly failed) in the queue and need to be resubmitted or deleted.
- See ReleaseEngineering/Queue_directories for debugging instructions.
builds-4hr
- Affects: treeherder. This data is used to provide job history in treeherder. The specific file can be found here: http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz
- Runs on: relengwebadm host, as a cronjob under the buildapi user
- Possible solutions: this script usually fails or runs slowly when there are problems with the buildbot status database: a lock, another long-running query, or simply load. Killing the offending query and re-running the report-4hr script will fix this, but be aware that the report-4hr script can take a while to run, especially on a cold cache.
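
After re-running the report-4hr script, it can be useful to confirm that the published file has actually been refreshed. The following is a minimal sketch of such a check, not the nagios check itself. It assumes that the web server returns a Last-Modified header for the file and that staleness is what matters; the 15-minute threshold and all function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

# URL taken from the builds-4hr entry above.
BUILDS_4HR_URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-4hr.js.gz"

# Hypothetical threshold; the real nagios check may use a different value.
MAX_AGE_SECONDS = 15 * 60


def age_seconds(last_modified, now=None):
    """Return how many seconds ago an HTTP Last-Modified value was."""
    modified = parsedate_to_datetime(last_modified)
    if modified.tzinfo is None:
        # Treat a header without zone information as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - modified).total_seconds()


def is_stale(last_modified, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the file is older than max_age seconds."""
    return age_seconds(last_modified, now) > max_age


def fetch_last_modified(url=BUILDS_4HR_URL, timeout=30):
    """Issue a HEAD request and return the Last-Modified header, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")
```

Calling is_stale(fetch_last_modified()) after a re-run gives a quick yes/no answer. Note that fetch_last_modified() returns None if the server sends no Last-Modified header, so a caller should handle that case before passing the value on.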