CIDuty/Nagios

From MozillaWiki

Nagios

Which nagios alerts are important?

Nagios alerts pop up for critical services that may be legitimately down, e.g. due to slave reboots. Alerts in a HARD failure state (i.e. the check has hit its retry attempt ceiling) should be acted on (see the nagios links below). If all slaves of a given class/OS start alerting the same way in rapid succession, that should also be acted on.

Other one-off alerts can safely be ignored. Usually the machine is simply rebooting. It will hit the HARD failure state eventually if it's a real failure. Use your triage skills here to find the burning fires.
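The triage rule above can be sketched in a few lines of Python. This is only an illustration, not a tool we run: the Alert record, the slave_class field, and the thresholds (3 hosts within 10 minutes counts as "rapid succession") are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str          # hypothetical slave name
    slave_class: str   # hypothetical grouping key, e.g. OS or pool
    service: str
    state_type: str    # "SOFT" or "HARD", as Nagios reports it
    timestamp: float   # seconds since the epoch


def needs_action(alerts, burst_size=3, burst_window=600):
    """Return the alerts worth acting on.

    An alert qualifies if it is in a HARD state, or if at least
    burst_size distinct hosts of the same class raised the same
    service alert within burst_window seconds (assumed thresholds).
    """
    actionable = {a for a in alerts if a.state_type == "HARD"}

    groups = defaultdict(list)
    for a in alerts:
        groups[(a.slave_class, a.service)].append(a)

    for group in groups.values():
        group.sort(key=lambda a: a.timestamp)
        start = 0
        for end in range(len(group)):
            # Shrink the window until it spans at most burst_window seconds.
            while group[end].timestamp - group[start].timestamp > burst_window:
                start += 1
            window = group[start:end + 1]
            if len({a.host for a in window}) >= burst_size:
                actionable.update(window)

    return sorted(actionable, key=lambda a: (a.timestamp, a.host))
```

Everything the function does not return is a one-off SOFT alert, which is the kind that can usually be left alone until it either clears or goes HARD.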

What's the difference between a downtime and an ack?

Both will make nagios stop alerting, but there's an important difference: a downtime expires on its own, while an ack stays in place for as long as the problem does. Never ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.

How do I interact with the nagios IRC bot?

Corey kindly provided the following:

   help; this will give the authoritative list of current commands
   ack <number> <message>; will acknowledge the nagios alert
   unack <number> <message>; will unacknowledge the nagios alert
   downtime <host> <X[s,m,h,d]> <comments>; will schedule downtime for this hostname
   downtime <host>:<service> <X[s,m,h,d]> <comments>; will schedule downtime for this service on this hostname
   downtime <alert_id> <X[s,m,h,d]> <comments>; will schedule downtime for this alert?
   status; will report the nagios host status on that nagios server
   status <servername>; will report the nagios host status on that server
   status <servername>:*; will report all service statuses for <servername>
   status <servername>:<service_name>; will report the status of <service_name> on that server
   oncall; will report who is currently on call
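
If you script your bot messages, a small helper can catch a malformed duration before it reaches the channel. The following is a minimal sketch based only on the command list above; the function name is made up for illustration, and the validation is stricter than whatever the bot itself may accept (run help for the authoritative syntax).

```python
import re

# Matches the <X[s,m,h,d]> form from the command list, e.g. "12h" or "30m".
_DURATION = re.compile(r"^[1-9][0-9]*[smhd]$")


def downtime_command(target, duration, comment):
    """Build a 'downtime' line to send to the IRC bot.

    target is a host, "host:service", or an alert number.
    """
    target = str(target).strip()
    comment = comment.strip()
    if not target:
        raise ValueError("target must not be empty")
    if not _DURATION.match(duration):
        raise ValueError("duration must look like 90s, 30m, 12h or 2d")
    if not comment:
        # Policy choice of this sketch: always say where the problem is tracked.
        raise ValueError("a comment is required")
    return f"downtime {target} {duration} {comment}"
```

For example, downtime_command("somehost:ping", "12h", "waiting for slave duty") yields the line "downtime somehost:ping 12h waiting for slave duty" (the host and service names here are placeholders).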

Also read the code

How do I scan all problems Nagios has detected?

Once bug 927941 is fixed, we'll no longer be seeing alerts for individual slaves in IRC. We still need to deal with these alerts, however, which means turning to the nagios web interface: http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/

Here are some useful direct links:

Note - values for status.cgi query parameters can be found at http://roshamboot.org/main/?p=74.
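
As an illustration of how those query parameters combine, here is a hedged Python sketch that builds an "unhandled service problems" link. The bit values are the ones I recall from Nagios Core's cgiutils.h and should be checked against the page above or your Nagios version; the cgi-bin/status.cgi path under the base URL is an assumption.

```python
from urllib.parse import urlencode

# Base URL from the text above; the "cgi-bin/status.cgi" path is an assumption.
BASE = "http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/"
STATUS_CGI = BASE + "cgi-bin/status.cgi"

# Bit flags as recalled from Nagios Core's cgiutils.h (verify before relying on them).
SERVICE_WARNING, SERVICE_UNKNOWN, SERVICE_CRITICAL = 4, 8, 16
HOST_PENDING, HOST_UP = 1, 2
SERVICE_NO_SCHEDULED_DOWNTIME = 2
SERVICE_STATE_UNACKNOWLEDGED = 8
SERVICE_CHECKS_ENABLED = 32


def unhandled_service_problems_url(host="all"):
    """Link to non-OK services that are neither ack'd nor in downtime."""
    params = [
        ("host", host),
        ("type", "detail"),
        # Each parameter is a bitmask: OR together the flags you want to match.
        ("servicestatustypes",
         SERVICE_WARNING | SERVICE_UNKNOWN | SERVICE_CRITICAL),
        ("hoststatustypes", HOST_PENDING | HOST_UP),
        ("serviceprops",
         SERVICE_NO_SCHEDULED_DOWNTIME
         | SERVICE_STATE_UNACKNOWLEDGED
         | SERVICE_CHECKS_ENABLED),
    ]
    return STATUS_CGI + "?" + urlencode(params)
```

The point of the sketch is the bitmask arithmetic: each status.cgi parameter is a sum of flags, so a link can be narrowed or widened by adding or removing one flag.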

How do I deal with Nagios problems?

Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.

Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ever disable notifications.

You can acknowledge a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.

For example, the ack comment can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.

You can also mark a service or host for downtime. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.
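
To make the difference between the two concrete, here is a toy Python model of the behaviour described above. It is a sketch of the rules as stated in this section, not of Nagios internals; the class and method names are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Downtime:
    start: datetime
    duration: timedelta

    def covers(self, now):
        # Active from start (inclusive) until start + duration (exclusive).
        return self.start <= now < self.start + self.duration


@dataclass
class ServiceState:
    status: str = "OK"
    acked: bool = False
    downtime: Optional[Downtime] = None

    def update(self, status):
        """Record a new check result."""
        self.status = status
        if status == "OK":
            # Recovery clears the ack, so the next failure alerts again.
            self.acked = False

    def should_notify(self, now):
        if self.status == "OK":
            return False
        if self.acked:
            # An ack silences the problem for as long as it persists.
            return False
        if self.downtime is not None and self.downtime.covers(now):
            # A downtime silences alerts only inside its window.
            return False
        return True
```

In this model an ack on a service that never recovers suppresses notifications indefinitely, while a downtime stops suppressing the moment its window ends, which is why a downtime is the safer way to buy time.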