ReleaseEngineering/How To/Process release email

This is a list of automatically generated emails you should expect to receive as a release engineer at Mozilla. It is not complete.

Index

Zimbra glob/wildcard syntax, in alpha order by field then match string

Field   | Wildcard                        | Further Notes
Subject | collapse report                 | see #Performance Metrics
Subject | Humpty Dumpty Error *           | see #Puppet failing too many times on a slave
Subject | idle kittens report             | see #briar-patch idle kittens reporting
Subject | [puppet-monitoring]*            | see #Puppet Log Monitoring
Subject | Suspected machine issue (*      | not an actionable email at this point (from: nobody@cruncher; see also bug 825625)
Subject | Talos Suspected machine issue * | if you don't know, you don't care
Subject | Try submission *                | to: autolanduser@mozilla.com
Subject | [vcs2vcs] alert_major_errors*   | major processing error; make sure buildduty and/or hwine know the details
Subject | [vcs2vcs] process delays*       | major processing error; make sure buildduty and/or hwine know the details
To      | release+aws@mozilla.com         | AWS admin email (contact catlee for now); both important & marketing
To      | release+bitbucket@mozilla.com   | Mozilla Bitbucket admin email (contact hwine for now)
To      | release+vcs2vcs@mozilla.com     | output from the vcs2vcs hg<->git conversion (details)
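
The wildcard column above uses shell-style globbing, where '*' matches any run of characters. If you ever need to triage these messages outside Zimbra, a minimal Python sketch along the following lines would do; the function name and the prefix table are illustrative only (the globs are reduced to simple prefix matches for brevity) and are not part of any deployed tool:

 # Illustrative only: map a Subject line to the section of this page that covers it.
 SUBJECT_PREFIXES = [
     ("collapse report",              "Performance Metrics"),
     ("Humpty Dumpty Error",          "Puppet failing too many times on a slave"),
     ("idle kittens report",          "briar-patch idle kittens reporting"),
     ("[puppet-monitoring]",          "Puppet Log Monitoring"),
     ("[vcs2vcs] alert_major_errors", "vcs2vcs System"),
     ("[vcs2vcs] process delays",     "vcs2vcs System"),
 ]

 def classify(subject):
     """Return the section covering this Subject line, or None if it is unknown."""
     for prefix, section in SUBJECT_PREFIXES:
         if subject.startswith(prefix):
             return section
     return None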


briar-patch idle kittens reporting

Why we get them

An email report outlining the status of any host that has been flagged as "idle".

What is sending them

A cron job that runs the kittenreaper.py task with the following parameters:

 python kittenreaper.py -w 1 -e

It pulls the list of hosts to check from http://build.mozilla.org/builds/slaves_needing_reboot.txt
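
For illustration, the same list can be fetched and inspected by hand with a few lines of Python; this is not the kittenreaper code itself, and it assumes the file is plain text with one hostname per line:

 # Fetch the host list the reaper consumes (assumed: plain text, one hostname per line).
 try:
     from urllib2 import urlopen          # Python 2
 except ImportError:
     from urllib.request import urlopen   # Python 3

 url = "http://build.mozilla.org/builds/slaves_needing_reboot.txt"
 hosts = [line.strip() for line in urlopen(url).read().decode().splitlines() if line.strip()]
 print("%d hosts currently flagged as needing a reboot" % len(hosts))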

What to do when one is received

Not sure yet, unless you're buildduty, in which case you should be watching it.

Future plans

This will be replaced by the briar-patch dashboard

How to best filter these emails

Filtering can be done by matching the subject line, which will not change.


Puppet Log Monitoring

Why we get them

There are messages in the puppet master logs that indicate something is wrong with a slave or master. Since we have no other master monitoring tools, we are defaulting to sending email.

What is sending them

scl-production-puppet (and soon all puppet masters) has an instance of 'watch-puppet.py' running under screen as root.

The code for this script is stored here

What to do when one is received

  • if the title contains "[puppet-monitoring][master_name] <slavename> is waiting to be signed", this is for information and requires no immediate action
  • if the title contains "[puppet-monitoring][master_name] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If the cleanup is successful, you'll see a matching "<slavename> is waiting to be signed" email once there is a waiting signing request, and the key will be automatically signed by a cron job (roughly as sketched below)
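
For context, that automatic signing amounts to signing every waiting request. A rough Python sketch, illustrative only (not the actual cron job, and the puppetca output format is assumed):

 import subprocess

 # List waiting cert requests and sign each one (puppet 2.x style commands).
 waiting = subprocess.check_output(["puppetca", "--list"]).decode().splitlines()
 for line in waiting:
     if not line.strip():
         continue
     # assumed format: each line starts with the (possibly quoted) slave hostname
     slavename = line.split()[0].strip('"')
     subprocess.call(["puppetca", "--sign", slavename])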

How to silence or acknowledge this alert

It is not currently possible to silence this email. This script will send email each time the corresponding line pattern is seen in /var/log/messages. This means that most likely, each time a slave tries to puppet, an email will be sent.
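
The behaviour described above is essentially "tail the log, mail every matching line". A simplified Python sketch of that loop (not the actual watch-puppet.py; the patterns and addresses are placeholders):

 import re
 import time
 import smtplib
 from email.mime.text import MIMEText

 # Placeholder patterns and addresses; the real script lives in the repository linked above.
 PATTERNS = [re.compile(r"waiting to be signed"), re.compile(r"has invalid cert")]

 def watch(logfile="/var/log/messages", recipient="release@example.com"):
     with open(logfile) as log:
         log.seek(0, 2)                       # start at the end of the file, like tail -f
         while True:
             line = log.readline()
             if not line:
                 time.sleep(1)
                 continue
             if any(p.search(line) for p in PATTERNS):
                 msg = MIMEText(line)
                 msg["Subject"] = "[puppet-monitoring] " + line.strip()
                 msg["To"] = recipient
                 smtplib.SMTP("localhost").sendmail("root@localhost", [recipient], msg.as_string())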

Future plans

In the short term, we'd like to have this script monitor the puppet logs for more error conditions. It would also make sense to monitor all puppet masters

How to best filter these emails

  • subject includes [puppet-monitoring]

Puppet failing too many times on a slave

Why we get them

We have no other monitoring for slaves failing to run puppet successfully. This became a large issue with the rev4 talos machines due to bug 700672. We are now doing an exponential back-off on these slaves with a set number of iterations. Once the maximum number of iterations is reached, the slave will send this email then reboot. This helps us avoid puppet master load as well as allowing the machines to try to fix themselves by rebooting.
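
The retry logic is roughly of the following shape. This is only a sketch of the idea; the real implementation is the run-puppet.sh script noted below, and the retry limit, commands, and addresses here are placeholders:

 import subprocess
 import time

 MAX_TRIES = 5            # placeholder; the real limit lives in run-puppet.sh
 delay = 60               # seconds before the first retry

 for attempt in range(MAX_TRIES):
     if subprocess.call(["puppetd", "--test"]) == 0:
         break                                # puppet run succeeded, stop retrying
     time.sleep(delay)
     delay *= 2                               # exponential back-off between attempts
 else:
     # Every attempt failed: send the "Humpty Dumpty Error" mail, then reboot.
     mail = subprocess.Popen(["mail", "-s", "Humpty Dumpty Error", "release@example.com"],
                             stdin=subprocess.PIPE)
     mail.communicate(b"puppet failed too many times; rebooting\n")
     subprocess.call(["reboot"])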

What is sending them

Each machine that has these emails enabled sends the email itself when its final puppet attempt fails, right before it reboots.

The code that sends them is unversioned, but is deployed to the slaves from

scl-production-puppet:/N/production/darwin10-i386/test/usr/local/bin/run-puppet.sh 

What to do when one is received

  • either ignore the email or find the root of the problem and fix it.

How to silence or acknowledge this alert

This email is a temporary workaround until we get a real puppet client monitoring tool. This email will be sent each time the maximum number of retries is reached, which is every couple of hours.

Future plans

We would really like to replace these emails with real puppet monitoring.

How to best filter these emails

These emails are best filtered by having "Humpty Dumpty Error" in their subject. Because the hostname on the slave might not always be correct, filtering on domain names might not catch all cases.

Performance Metrics

Why we get them

We get various emails containing raw data that relates to a performance bottleneck at some point in time. Typically these are produced by cron jobs, and so are received regularly regardless of metric status. (I.e. they may not require any action.)

What is sending them

Since this is a "catch all" category, various tools send them. Check the full headers for information on sender and source machine as needed.

What to do when one is recieved

If you don't know what it's about, you don't need to deal with it beyond setting up a filter to ignore it.

How to silence or acknowledge this alert

It's not an alert, so they'll keep coming until the end of time. Filter them if you're not involved with them.

Future plans

Ad hoc, so this varies by email. Theoretically, these should be transitional, and moved into automation and alerting as soon as the metric is understood.

How to best filter these emails

Since these are ad hoc, you'll need ad hoc filters. It would be nice if folks used a common prefix on subjects, such as "[releng metrics]".

vcs2vcs System

Why we get them

These emails are the interim notification for the vcs2vcs system, and indicate an error that must be addressed. The b2g project is dependent upon parts of the vcs2vcs system, as are other developers.

What is sending them

All emails are sent (perhaps indirectly) by a script from vcs2vcs tools. The hosts sending the email will be one of the ones listed in the configs. Full details of how each script is run, including troubleshooting tips, are in the docs (a formatted copy may be online here).

What to do when one is received

  • if the subject contains "[vcs2vcs] process delays", this is a service outage - one or more repositories are no longer being updated. The email contents will give specific errors. Consult the troubleshooting section of the docs (above) for guidance and/or PAGE hwine.
    • The most common cause of this is "socket hang", which is a quick fix. Please add to bug 829025 if you fix it, or block that bug with a new bug.
  • if the subject contains "[vcs2vcs] alert_major_errors alert", this is a major problem - one or more repositories are no longer being updated. The email contents will give specific errors. Consult the troubleshooting section of the docs (above) for guidance and/or contact hwine.
    • The most common cause of this is hg repo corruption; the recovery is scripted but can take some time. Please add to bug 808129 if you fix it, or block that bug with a new bug.
    • NOTE: you may receive an additional email after the root cause is resolved. (The alert checks on the hour for problems in the prior hour.)
  • if the subject contains "TERMINATED", this is a major problem with the update of gecko.git (interim version). Contact (page) hwine (note: this comes from the interim script "keep_clean_room_updated" running in vcs2vcs@github-sync2.dmz.scl3:/opt/vcs2vcs/b2g/wip/ as of 2012-11-10)
  • if the subject is something else, this is likely unexpected output from a cron job. Judge the severity and escalate to hwine appropriately. File a bug to get better diagnosis of this error condition in the future.

How to silence or acknowledge this alert

Resolving the root cause will stop the emails.

Future plans

The system will eventually be transitioned to IT for operations. Specific emails will be converted to Nagios alerts before then.

How to best filter these emails

All of these emails are sent to addresses of the form release+vcs2vcs*@mozilla.com. Common sub-addresses are:

  • release+vcs2vcs: mail that will have specifics in the Subject line.
  • release+vcs2vcs+forward: mail to the vcs2vcs user, forwarded via the ~/.forward file.
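
If you filter programmatically rather than in Zimbra, the sub-address is simply whatever follows the first '+' in the local part. An illustrative Python helper (not part of any deployed tool):

 def vcs2vcs_subaddress(to_header):
     """Return the sub-address of a release+...@mozilla.com recipient, or None."""
     local = to_header.split("@", 1)[0]
     if "+" not in local:
         return None
     return local.split("+", 1)[1]

 # vcs2vcs_subaddress("release+vcs2vcs+forward@mozilla.com") -> "vcs2vcs+forward"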

Sample

Why we get them

Give a brief explanation of what this email is for, what it helps us do, and why it should be watched.

What is sending them

Include a link to the source of the program sending the email. Include information on which hosts are sending the email, and give information on how the program runs. Is it a daemon? Does it have an init script? Do you run it under screen?

What to do when one is received

  • if the title contains "[scl-production-puppet-new] <slavename> is waiting to be signed", this is for information and requires no immediate action
  • if the title contains "[scl-production-puppet-new] <slavename> has invalid cert", the script will try once to clean the cert before sending the email. If this is successful, you'll see a matching "<slavename> is waiting to be signed" email. The key will be automatically signed

How to silence or acknowledge this alert

Include information on how to make the emails stop

Future plans

Provide any future plans for this email. Is it temporary? Is it going to be replaced by a real dashboard? Are you going to add/change things people filter on?

How to best filter these emails

Provide insight on how to filter these emails. Is there a distinguishing header? Is it always from a specific host, or family of hosts? Is there a distinctive subject?