Release:Release Automation on Mercurial:Troubleshooting

From MozillaWiki

Paperwork

Some releases require build notes to be created when failures occur; others should already have a page created at release start (see the requirements). Either way, document all problems and their solutions.

How to investigate release runner failures

"[release-runner] failed"

Release runner can fail to start a release for many reasons (e.g., release sanity failures or network issues). Unless something very unusual happens, you will receive an e-mail with the subject line "[release-runner] failed" when it encounters an issue. The e-mail should contain brief details of the failure; for example, it may include an excerpt from release sanity. If that isn't enough information to debug the problem, log on to buildbot-master81 and inspect /builds/releaserunner/release-runner.log for more detail.
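
When the log is long, it can help to pull out only the most recent lines that look like failures. The following is a minimal sketch, not part of release runner itself: the marker strings are assumptions (this page does not show the log format), and recent_problem_lines is a hypothetical helper name.

```python
from collections import deque

# Assumed markers: the real release-runner.log format is not documented here,
# so adjust these strings to whatever the log actually contains.
ERROR_MARKERS = ("Traceback", "ERROR", "CalledProcessError")


def recent_problem_lines(log_path, markers=ERROR_MARKERS, limit=20):
    """Return the last `limit` log lines that contain any of `markers`."""
    matches = deque(maxlen=limit)  # keeps only the newest `limit` matches
    with open(log_path, errors="replace") as log:
        for line in log:
            if any(marker in line for marker in markers):
                matches.append(line.rstrip("\n"))
    return list(matches)
```

On buildbot-master81 this would be called as recent_problem_lines("/builds/releaserunner/release-runner.log"). A match only tells you where to start reading; the surrounding lines usually carry the actual cause.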

If you're unable to resolve the issue on your own, ask someone for help. Once you believe the issue has been resolved, you need to mark the release as "ready" again on Ship It and restart the release runner process. This can be done with the following command on buildbot-master81 (as root):

supervisorctl restart releaserunner

The release might fail again, with an error similar to:

subprocess.CalledProcessError: Command '['hg', 'commit', '-m', 'Update release config for Fennec-27.0b9-build1', '-u', 'ffxbld']' returned non-zero exit status 1

This is because the prior failure happened after the release configs were already updated. The solution is to:

  1. Revert the changes on both the default and production branches of "buildbot-configs".
  2. Mark the releases as "ready" again, and hit "do eeet".

"[release-runner] WARNING: Reconfig exceeded (time)"

If release runner is unable to reconfig the required masters after 15 minutes, you'll receive a mail like this. This initial mail is just a heads-up that something may need intervention. If the reconfig still isn't complete after 30 minutes, look at buildbot-master81:/builds/releaserunner/release-runner.log to see which master(s) are stuck, then deal with them as you would if you were doing a reconfig by hand.
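
The two thresholds can be summarised as a small decision helper. This is only an illustrative sketch: it assumes both times are measured from the start of the reconfig, and the function name and returned labels are hypothetical, not anything release runner emits.

```python
WARNING_AFTER_MIN = 15    # release runner sends the WARNING mail around here
INTERVENE_AFTER_MIN = 30  # past this point, go and look at the log


def reconfig_action(elapsed_minutes):
    """Map minutes since the reconfig started to a suggested action."""
    if elapsed_minutes >= INTERVENE_AFTER_MIN:
        return "investigate"  # read release-runner.log, find the stuck master(s)
    if elapsed_minutes >= WARNING_AFTER_MIN:
        return "wait"         # heads-up only; the reconfig may still finish
    return "ok"
```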

Addressing Disk Space Issues

Each builder should ensure adequate disk space before starting a job, so this "shouldn't happen" unless the build's needs have grown. In that case, the logs will not give a good clue, and the job will be retried. Free up space on the build slave by deleting older builds, AND file a bug to have the configs fixed. (No, you can't just run purge_builds.py -- at least until we have local "tools" checkouts.)

You can confirm that disk space is the issue by browsing Graphite.
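
If you are already logged on to the slave, free space can also be checked directly. The sketch below uses only the Python standard library; the path and the required size are placeholders you would supply, and neither value comes from the build configs.

```python
import shutil


def free_gib(path):
    """Return free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, required_gib):
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib
```

For example, has_enough_space("/builds", 10) would report whether at least 10 GiB is free on the filesystem containing /builds (both values are hypothetical).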

Restarting failed builders without patching the config

Most builders can be safely rebuilt with the "rebuild" button. Individual locales can be triggered through the standalone repack builders. Ask someone if you don't know whether it's safe to restart something.

Tagging failed out part way through

Don't try to recover from this; just do a new build.

Re-spinning a single locale

WARNING: These instructions are extremely out of date. Consider them nothing more than a vague guideline, and think the process through yourself if you need to do it.

We sometimes run into cases where individual locales need to be re-spun, for example after network timeouts or build slave failures. When this happens, you can use the following steps to recover and re-spin the missing locale(s):

  1. Manually re-tag the locale's repository (if necessary). You can find the appropriate revision to use for tagging from the shipped-locales file, or from the l10n shipping dashboard.
  2. Delete the current build of the locale and clean up the l10n build dirs on the build slaves.
  3. Manually force the repack on each of the "$platform_standalone_repack" builders. See the Standalone_Repack_Builders section for details.
  4. Manually sign it and update the *SUMS files.
    • You need to download the new locale builds to the signing machine, along with the SUMS files, the en-US Windows build (used for caching), and the zh-TW builds (monitoring tools check this locale to detect when the directory has changed).
    • Run sign-files on the new builds.
    • Manually update the SUMS files with the new md5/sha1 sums.
    • Remove the .asc file for the en-US Windows build.
    • Push the signed builds back to stage.
    • The Build Notes for 3.6b4 show an example of how this is done.
  5. Manually create a partial MAR for the locale
    • Use an appropriate patcher_config file for your release.
    • On a Linux slave (preferably a fast one), download the builds with patcher2.pl.
    • Use patcher2.pl to create the update MAR files and snippets.
    • Ensure file (644) and directory (755) modes are correct for your created files.
    • Transfer the MAR files to stage.
    • Transfer the update snippets to the aus2 server(s); there may be more than one.
      • It is good practice to use a new directory name on the aus2 server to mark the new snippets as part of a distinct respin, e.g. 20091125-Firefox-3.6b4-fr-respin-test. Please also add that new directory name to the list of directories to be run through backupsnip/pushsnip in the build notes.
    • The Build Notes for 3.6b4 show an example of how this is done.
  6. Re-run the update verify builder from the waterfall.
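
For the manual SUMS update in step 4, the new md5 and sha1 digests can be computed with the Python standard library. This is a sketch under assumptions: it formats each entry the way md5sum and sha1sum print it (digest, two spaces, path), which may or may not match the exact layout of the SUMS files for your release, so compare against an existing entry before editing. The helper names are hypothetical.

```python
import hashlib


def file_digests(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at `path`, read in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def sums_line(digest, relative_path):
    """Format one entry as `<digest>  <path>` (assumed md5sum-style layout)."""
    return "%s  %s" % (digest, relative_path)
```

Reading in chunks keeps memory use flat for large installers and MAR files. Replace the old entry for the re-spun file rather than appending a second one.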

Overwriting files that have been pushed to releases/

If a rebuild happens after an earlier build has already been pushed to mirrors, a few steps need to be taken to make sure that the files can be pushed and that the CDN serves the new content. (This is always the case for beta respins, as the prior build will have pushed to mirrors as part of automation.) The following should happen before "push to mirrors" runs in the new build. (If you're not in a rush, it's best to do these before kicking off the new release to make sure they do in fact happen in time):

  • Delete the directory from releases. For example:
# from any master...
ssh -i ~/.ssh/ffxbld_rsa ffxbld@stage.mozilla.org
# ffxbld@upload1
rm -rf /pub/mozilla.org/firefox/releases/19.0b20
  • File an IT bug to have the CDN caches purged. These should generally be filed as critical or blocker. Definitely file as a blocker if you're under time pressure.

If you don't delete the releases directory prior to "push to mirrors" running, you'll end up with that builder and "check permissions" failing. These should be re-run after you delete the existing contents of the directory.
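
Because the deletion is an rm -rf on a shared server, it is worth building the path carefully rather than typing it by hand. The sketch below is a hypothetical guard, not an existing tool: the version pattern and the default root are assumptions based on the example path above, and the function only returns a string, it deletes nothing.

```python
import re

# Assumed version shapes such as "19.0b20", "27.0" or "24.2.0esr".
VERSION_RE = re.compile(r"\d+\.\d+(\.\d+)?(b\d+|esr)?")


def releases_dir(product, version, root="/pub/mozilla.org"):
    """Return the releases directory for one build, refusing odd input."""
    if not re.fullmatch(r"[a-z]+", product):
        raise ValueError("unexpected product name: %r" % product)
    if not VERSION_RE.fullmatch(version):
        # An empty or malformed version would otherwise point at the whole
        # releases/ tree, which must never be removed.
        raise ValueError("unexpected version: %r" % version)
    return "%s/%s/releases/%s" % (root, product, version)
```

releases_dir("firefox", "19.0b20") gives the path used in the example above; an empty version raises ValueError instead of returning the parent directory.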

It's a good idea to verify that everything has been purged correctly, too. You can test the individual CDNs with the check_cdn script (providing a current URL). A sample run showing a stale file on one CDN:

   $ ./check_cdn thunderbird/releases/34.0b1/update/linux-x86_64/zh-TW/thunderbird-33.0b1-34.0b1.partial.mar
   http://ftp.mozilla.org/pub/thunderbird/releases/34.0b1/update/linux-x86_64/zh-TW/thunderbird-33.0b1-34.0b1.partial.mar
   < Last-Modified: Thu, 13 Nov 2014 14:47:30 GMT
   < Content-Length: 16759161
   http://wildcard.cdn.mozilla.net.edgesuite.net/pub/thunderbird/releases/34.0b1/update/linux-x86_64/zh-TW/thunderbird-33.0b1-34.0b1.partial.mar
   < Last-Modified: Tue, 11 Nov 2014 00:00:54 GMT
   < Content-Length: 16759952
   http://cds.d6b5y3z2.hwcdn.net/pub/thunderbird/releases/34.0b1/update/linux-x86_64/zh-TW/thunderbird-33.0b1-34.0b1.partial.mar
   < Last-Modified: Thu, 13 Nov 2014 14:47:30 GMT
   < Content-Length: 16759161
   http://wpc.1237.edgecastcdn.net/pub/thunderbird/releases/34.0b1/update/linux-x86_64/zh-TW/thunderbird-33.0b1-34.0b1.partial.mar
   < Last-Modified: Thu, 13 Nov 2014 14:47:30 GMT
   < Content-Length: 16759161

The "final verify" builder should be rerun after the CDN is cleared. If final verify fails again, the CDNs may not have finished purging. Running the script above against the failing URLs will show when each URL is valid again. The current final-verify builder pulls from "random" CDNs, so a passing final verify doesn't mean all files have been purged successfully. (Note that individual CDNs may not be consistent; see bug 1099048 for an example.)
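
Reading the script output by eye gets tedious with many URLs, so the comparison can be automated. The following sketch is not the check_cdn script; it only compares header values you have already collected, and it assumes ftp.mozilla.org is the reference copy. The data is taken from the sample run above.

```python
def stale_hosts(headers_by_host, origin_host):
    """Return hosts whose (Last-Modified, Content-Length) differ from the origin."""
    reference = headers_by_host[origin_host]
    return sorted(
        host
        for host, headers in headers_by_host.items()
        if host != origin_host and headers != reference
    )


# (Last-Modified, Content-Length) per host, copied from the sample run.
sample = {
    "ftp.mozilla.org": ("Thu, 13 Nov 2014 14:47:30 GMT", 16759161),
    "wildcard.cdn.mozilla.net.edgesuite.net": ("Tue, 11 Nov 2014 00:00:54 GMT", 16759952),
    "cds.d6b5y3z2.hwcdn.net": ("Thu, 13 Nov 2014 14:47:30 GMT", 16759161),
    "wpc.1237.edgecastcdn.net": ("Thu, 13 Nov 2014 14:47:30 GMT", 16759161),
}

print(stale_hosts(sample, "ftp.mozilla.org"))
# → ['wildcard.cdn.mozilla.net.edgesuite.net']
```

An empty result for every failing URL is a reasonable signal that the purge has completed and final verify can be rerun.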