CIDuty
Revision as of 20:10, 29 January 2013
What is buildduty?
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues. This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "buildduty."
Who is on buildduty? (schedule)
Check the tree-info dropdown on tbpl. The person on buildduty should also have 'buildduty' appended to their IRC nick, and should be available in the #developers, #releng, and #buildduty IRC channels.
Mozilla Releng Sheriff Schedule (Google Calendar|iCal|XML)
Buildduty not around?
It happens, especially outside of standard North American working hours (0600-1800 PST). Please open a bug under these circumstances.
Buildduty priorities
How should I make myself available for duty?
- Add 'buildduty' to your IRC nick
- be in at least #developers, #releng, and #buildduty (as well as #mozbuild of course)
- also useful to be in #mobile, #planning, #release-drivers, and #ateam
What should I take care of?
You should keep on top of:
- Requests from release-drivers (if no one else in releng is available to pick it up)
- Developer requests in IRC.
- Direct people to http://mzl.la/tryhelp for self-serve documentation when appropriate.
- Pending builds - available in graphs or in the "Infrastructure" pulldown on TBPL. The graphs are helpful for noticing anomalous behavior.
- Wait times - either this page or the emails (un-filter them in Zimbra). Respond to any unusually long wait times (hopefully with a reason)
- wait times emails are run via crontab entries set up on buildapi01.build.scl1.mozilla.com under the buildapi user
- Slave management:
- Bad slaves can burn builds, and hung slaves can cause bad wait times. These slaves need to be rebooted or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
- See the section below on #Slave_Management
- Run reconfigs (scheduled or otherwise) for releng
- All bugs tagged with [buildduty] in the whiteboard:
- Monitor dev.tree-management newsgroup (by email or by nntp)
- Watch for long-running builds that are holding on to slaves, i.e. > 1 day. See the build api page.
- You may need to plan a downtime. Coordinate with IT to send downtime notices with enough advance notice.
- See the section below on #Downtimes
- You may need to promote unagi builds to beta - see ReleaseEngineering/How_To/Promote_Unagi_to_beta
Tree Maintenance
Repo Errors
If a dev reports a problem pushing to hg (either m-c or try repo) then you need to do the following:
- File a bug (or have dev file it) and then poke in #ops noahm
- If he doesn't respond, then escalate the bug to page on-call
- Follow the steps below for "How do I close the tree"
How do I see problems in TBPL?
All "infrastructure" (that's us!) problems should be purple at http://tbpl.mozilla.org. Some aren't, so keep your eyes open in IRC, but get on any purples quickly.
How do I close the tree?
See ReleaseEngineering/How_To/Close_or_Open_the_Tree
How do I claim a rentable project branch?
See ReleaseEngineering/DisposableProjectBranches#BOOKING_SCHEDULE
Re-run jobs
How to trigger Talos jobs
see ReleaseEngineering/How_To/Trigger_Talos_Jobs
How to re-trigger all Talos runs for a build (by using sendchange)
see ReleaseEngineering/How_To/Trigger_Talos_Jobs
How to re-run a build
Do not go to the page of the build you'd like to re-run and cook up a sendchange to try to re-create the change that caused it. Changes without revlinks trigger releases, which is not what you want.
Find the revision you want, find a builder page for the builder you want (preferably, but not necessarily, on the same master), and plug the revision, your name, and a comment into the "Force Build" form. Note that YOU MUST specify the branch, so there are no null keys in builds-running.js.
Nightlies
How do I re-spin mozilla-central nightlies?
To rebuild the same nightly, buildbot's Rebuild button works fine.
To build a different revision, Force build all builders matching /.*mozilla-central.*nightly/, on any of the regular build masters. Set revision to the desired revision. With no revision set, the tip of the default branch will be used, but it's probably best to get an explicit revision from hg.mozilla.org/mozilla-central.
You can use https://build.mozilla.org/buildapi/self-serve/mozilla-central to initiate this build, using the changeset at the tip of http://hg.mozilla.org/mozilla-central. Sometimes the developer will request a specific changeset in the bug.
To respin just the android nightlies, find the revisions in the fennec*txt file here and here. Then kick off a build (specifying the revision in the revision field) for armv6 and armv7.
Disable updates
If you're asked to disable updates for whatever reason, you can log on to aus3-staging to do it. Depending on what you're asked to shut off, you'll have to chmod a different directory (or directories) to 700. Log on to aus3-staging.mozilla.org with your LDAP account and use 'sudo su - ffxbld' (or tbirdbld) to gain the correct privileges. Some examples of shutting off different updates are below:
- 64-bit Windows on the ux branch:
chmod 700 /opt/aus2/incoming/2/Firefox/ux/WINNT_x86_64-msvc
- All updates on Nightly:
chmod 700 /opt/aus2/incoming/2/Firefox/mozilla-central
- Linux (32-bit + 64-bit) updates on Aurora:
chmod 700 /opt/aus2/incoming/2/Firefox/mozilla-aurora/Linux_x86-gcc3 /opt/aus2/incoming/2/Firefox/mozilla-aurora/Linux_x86_64-gcc3
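When you later get the all-clear, re-enabling is just restoring the directory permissions. 755 as the normal serving mode is an assumption here; record the actual mode (e.g. with `stat -c %a` or `ls -ld`) before you chmod to 700 so you can restore it exactly. A sketch of the pattern on a scratch directory:

```shell
# Scratch-directory demo: chmod 700 disables serving; restoring the prior
# mode re-enables it. 755 is an assumed normal mode; record the real one first.
DIR=$(mktemp -d)
MODE=$(stat -c %a "$DIR")     # record the original mode before disabling
chmod 700 "$DIR"              # disable: group/other lose access
stat -c %a "$DIR"             # 700
chmod 755 "$DIR"              # re-enable (or: chmod "$MODE" "$DIR")
stat -c %a "$DIR"             # 755
```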
Nagios
What's the difference between a downtime and an ack?
Both will make nagios stop alerting, but there's an important difference: acks are forever. Never ack an alert unless the path to victory for that alert is tracked elsewhere (in a bug, probably). For example, if you're annoyed by tinderbox alerts every 5 minutes, which you can't address, and you ack them to make them disappear, then unless you remember to unack them later, nobody will ever see that alert again. For such a purpose, use a downtime of 12h or a suitable interval until someone who *should* see the alert is available.
How do I interact with the nagios IRC bot?
Corey kindly provided the following:
- help; this will give the authoritative list of current commands
- ack <number> <message>; will acknowledge the nagios alert
- unack <number> <message>; will unacknowledge the nagios alert
- downtime <host> <X[s,m,h,d]> <comments>; will schedule downtime for this hostname
- downtime <host>:<service> <X[s,m,h,d]> <comments>; will schedule downtime for this service on this hostname
- downtime <alert_id> <X[s,m,h,d]> <comments>; will schedule downtime for this alert
- status; will report the nagios host status on that nagios server
- status <servername>; will report the nagios host status on that server
- status <servername>:*; will report all service statuses for <servername>
- status <servername>:<service_name>; will report the nagios host status on that server
- oncall; will report who is currently on call
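For example, typical commands sent to the bot might look like the following (the host name, alert number, and bug number here are hypothetical):

```
downtime talos-r3-fed-055 4h rebooting, see bug 123456
downtime buildbot-master12.build.scl1:Command Queue 2h investigating dead items, bug 123456
ack 54 tracked in bug 123456
```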
Also read the code
How do I scan all problems Nagios has detected?
- All unacknowledged problems:
- All unacknowledged problems with notifications enabled with HARD failure states (i.e. have hit the retry attempt ceiling), longest duration first:
- Group hosts check
Note - values for status.cgi query parameters can be found at http://roshamboot.org/main/?p=74.
How do I deal with Nagios problems?
Note that most of the nagios alerts are slave-oriented, and the slave duty person should take care of them. If you see something that needs to be rectified immediately (e.g., a slave burning builds), do so, and hand off to slave duty as soon as possible.
Nagios will alert every 2 hours for most problems. This can get annoying if you don't deal with the issues. However: do not ever disable notifications.
You can acknowledge a problem if it's tracked to be dealt with elsewhere, indicating that "elsewhere" in the comment. Nagios will stop alerting for ack'd services, but will continue monitoring them and clear the acknowledgement as soon as the service returns to "OK" status -- so we hear about it next time it goes down.
For example, this can point to a bug (often the reboots bug) or to the slave-tracking spreadsheet. If you're dealing with the problem right away, an ACK is not usually necessary, as Nagios will notice that the problem has been resolved. Do *not* ack a problem and then leave it hanging - when we were cleaning out nagios we found lots of acks from 3-6 months ago with no resolution to the underlying problem.
You can also mark a service or host for downtime. You will usually do this in advance of a planned downtime, e.g., a mass move of slaves. You specify a start time and duration for a downtime, and nagios will silence alerts during that time, but begin alerting again when the downtime is complete. Again, this avoids getting us in a state where we are ignoring alerts for months at a time.
Downtimes
The downtimes section had grown quite large. If you have questions about how to schedule a downtime, who to notify, or how to coordinate downtimes with IT, please see the Downtimes page.
Talos
Note: because a change to the Talos bundle always causes changes in the baseline times, the following should be done for *any* change...
- close all trees that are impacted by the change
- ensure all pending builds are done and GREEN
- do the update step below
- send a Talos changeset to all trees to generate new baselines
How to update the talos zips
NOTE: Deploying talos.zip is no longer scary: we don't replace the file in place anymore, and the a-team has to land a change in the tree.
You may need to get IT to turn on access to build.mozilla.org.
# on your localhost
export URL=http://people.mozilla.org/~jmaher/taloszips/zips/talos.07322bbe0f7d.zip
export TALOS_ZIP=`basename $URL`
wget $URL
# wget from people doesn't work anymore
export RELENGWEB_USER=`whoami`
scp ${TALOS_ZIP} ${RELENGWEB_USER}@relengweb1.dmz.scl3.mozilla.com:/var/www/html/build/talos/zips
ssh ${RELENGWEB_USER}@build.mozilla.org "chmod 644 /var/www/html/build/talos/zips/${TALOS_ZIP}"
ssh ${RELENGWEB_USER}@build.mozilla.org "sha1sum /var/www/html/build/talos/zips/${TALOS_ZIP}"
curl -I http://build.mozilla.org/talos/zips/${TALOS_ZIP}
Note that you can get to root by running |sudo su -|
For talos.zip changes: Once deployed, notify the a-team and let them know that they can land at their own convenience.
Updating talos for Tegras
To update talos on Android,
# for foopy05-24 (foopy21 is skipped)
csshX --login=cltbld foopy{05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,22,23,24}
cd /builds/talos-data/talos
hg pull -u
This will update talos on each foopy to the tip of default.
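If csshX isn't handy, the same update can be done with a plain ssh loop. A sketch, with the host list copied from the csshX line above; `echo` is left in front of the ssh command so you can preview the loop before running it for real:

```shell
# Build the host list (foopy21 is intentionally absent, matching the csshX line).
FOOPIES=""
for n in 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 22 23 24; do
  FOOPIES="$FOOPIES foopy$n"
done
echo "$FOOPIES" | wc -w    # 19 hosts

# Preview the update command for each host; drop the leading 'echo' to run it.
for h in $FOOPIES; do
  echo ssh cltbld@"$h" "cd /builds/talos-data/talos && hg pull -u"
done
```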
B2G Emulator
How to update the emulator
# build.mozilla.org
# The password is on our intranet
curl -u b2g -o emulator.zip http://ec2-107-20-108-245.compute-1.amazonaws.com/jenkins/job/b2g-build/ws/package.zip
# enter password, wait for download to finish
export SHA512=`openssl sha512 emulator.zip | cut -d' ' -f2`
sudo mv ~/emulator.zip /var/www/html/runtime-binaries/tooltool/sha512/${SHA512}
chmod 644 /var/www/html/runtime-binaries/tooltool/sha512/${SHA512}
ls -l /var/www/html/runtime-binaries/tooltool/sha512/${SHA512}
# copy and save the filesize (from ls -l) and sha512
- Test that the file is readable from your localhost:
curl -I http://runtime-binaries.pvt.build.mozilla.org/tooltool/sha512/${SHA512}
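tooltool stores files under their sha512 digest, so a quick integrity check is to confirm that the digest of the stored file matches its filename. A self-contained sketch of that check, with a scratch file standing in for the real emulator.zip and a temp directory standing in for the tooltool sha512 directory:

```shell
# Verify that a tooltool-style file's sha512 digest matches its filename.
F=$(mktemp)
echo "example payload" > "$F"
SHA512=$(openssl sha512 "$F" | cut -d' ' -f2)
STORE=$(mktemp -d)            # stands in for /var/www/html/runtime-binaries/tooltool/sha512
cp "$F" "$STORE/$SHA512"
CHECK=$(openssl sha512 "$STORE/$SHA512" | cut -d' ' -f2)
[ "$CHECK" = "$SHA512" ] && echo "digest matches filename"
```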
- There will need to be an in-tree patch like this one to update the emulator; a-team will probably handle this.
TBPL
How to deploy changes
RelEng no longer has access to do this. TBPL devs will request a push from Server Ops.
How to hide/unhide builders
- In the 'Tree Info' menu select 'Open tree admin panel'
- Filter/select the builders you want to change
- Save changes
- Enter the sheriff password and a description (with bug number if available) of your changes
- CC :edmorley & :philor on the relevant bug so that they know what to expect when sheriffing.
Useful Links
- Build Dashboard Main Page
- You can get JSON dumps for people to analyze by adding &format=json
- You can see all build and test jobs for a certain branch for a certain revision by appending branch/revision to this link (e.g. revision/places/c4f8232c7aef)
- http://cruncher.build.mozilla.org/~bhearsum/cgi-bin/missing-slaves.py -- a list of slaves which are known on production masters but are not connected to any production masters. Note that this includes preprod and staging slaves, as well as some slaves that just don't exist. Use with care.
- http://build.mozilla.org/builds/last-job-per-slave.html (replace html with txt for text only version)
- L10n Nightly Dashboard
- public How To documents
- private How To documents
Standard Bugs
- Reboots bugs have the Bugzilla aliases shown above.
- For IT bugs that are marked "infra only" yet still need to be readable by RelEng, it is not enough to add the release@ alias - people get updates but are not able to comment or read prior comments. Instead, cc the following:
- :aki, :armenzg, :bhearsum, :catlee, :coop, :hwine, :jhopkins, :joduinn, :joey, :kmoir, :nthomas, :rail, :edmorley, :Tomcat
Ganglia
- if you see that a host is reporting to ganglia incorrectly, it might just take the following to fix it (e.g. bug 674233):
switch to root, service gmond restart
Queue Directories
If you see this in #build:
<nagios-sjc1> [54] buildbot-master12.build.scl1:Command Queue is CRITICAL: 4 dead items
It means that there are items in the "dead" queue for the given master. You need to look at the logs and fix any underlying issue and then retry the command by moving *only* the json file over to the "new" queue. See the Queue directories wiki page for details.
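The retry step can be sketched like this on a scratch directory (the real queue paths live on the master in question; see the Queue directories wiki page for the actual layout):

```shell
# Scratch-directory demo of the retry pattern: move ONLY the .json item
# from dead/ back to new/; leave any logs behind for later inspection.
QUEUE=$(mktemp -d)
mkdir -p "$QUEUE/dead" "$QUEUE/new"
touch "$QUEUE/dead/item-1234.json" "$QUEUE/dead/item-1234.log"
mv "$QUEUE/dead/item-1234.json" "$QUEUE/new/"
ls "$QUEUE/new"     # item-1234.json
ls "$QUEUE/dead"    # item-1234.log (still there for debugging)
```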
Cruncher
If you get an alert about cruncher running out of space it might be a sendmail issue (backed up emails taking up too much space and not getting sent out):
<nagios-sjc1> [07] cruncher.build.sjc1:disk - / is WARNING: DISK WARNING - free space: / 384 MB (5% inode=93%):
As root:
du -s -h /var/spool/*   # confirm that mqueue or clientmqueue is the oversized culprit
# stop sendmail, clean out the queues, restart sendmail
/etc/init.d/sendmail stop
rm -rf /var/spool/clientmqueue/*
rm -rf /var/spool/mqueue/*
/etc/init.d/sendmail start
hg<->git conversion
This is a production system RelEng built, but has not yet transitioned to full IT operation. As a production system, it is supported 24x7x365 - escalate to IT oncall (who can page) as needed.
We'll get problem reports from 2 sources:
- via email from vcs2vcs user to release+vcs2vcs@m.c - see email handling instructions for those.
- via a bug report for a customer visible condition - this should only be if there is a new error we aren't detecting ourselves. See the resources below and/or page hwine.
Documentation for this system:
- recent docs
- source code: http://hg.mozilla.org/users/hwine_mozilla.com/repo-sync-tools/
- config files: http://hg.mozilla.org/users/hwine_mozilla.com/repo-sync-configs/
All services run as user vcs2vcs on one of the following hosts (as of 2013-01-07): github-sync1-dev.dmz.scl3.mozilla.com, github-sync1.dmz.scl3.mozilla.com, github-sync2.dmz.scl3.mozilla.com, github-sync3.dmz.scl3.mozilla.com
disable/reenable aurora updates
After merge day.
Disable
We need to disable aurora updates on merge day until aurora builds pass QA.
- RelMan sends email
- Write a patch like this; get review; land.
- reconfig
Reenable
After QA signs off, we'll get an email/bug about reenabling.
- To enable the previous nightly:
# ffxbld@aus3-staging
cd /opt/aus2/incoming/2/Firefox
rsync -av mozilla-aurora-test/ mozilla-aurora/
cd /opt/aus2/incoming/2/Fennec
rsync -av mozilla-aurora-test/ mozilla-aurora/
- Then, to reenable updates for further nightlies, revert the previous patch and reconfig.
- Update bouncer links for stub installer (increment the major version in each of these):