How do I schedule a downtime?
Whenever RelEng/IT/WebDev wants a tree-closing downtime for work on any systems, they should contact the RelEng buildduty person of the day. This can be done in #infra, by email, or preferably by putting the exact wording they'd like in the downtime notice into the relevant bug and then nominating the bug using the "needs-treeclosure?" flag.
- NOTE: All ServerOps components now have the "needs-treeclosure" flag enabled (previously only some ServerOps components had the flag).
- NOTE: ServerOps "infra only" bugs are not visible to RelEng buildduty by default. If the bug can be changed to "Moco specific", then it will be visible to all of RelEng, and show up in all the queries. If the bug needs to remain "infra only", then the bug assignee needs to file, and closely track all changes to, a new separate bug that RelEng buildduty can use for scheduling.
When planning a downtime, RelEng buildduty should consider:
- the urgency of the work
- what other work, if any, can be safely done in the same downtime
...and propose a time that:
- is low impact to developers
- is low impact to releases (assessed by asking the list of pre-approvers)
- is not scheduled near any other downtimes / planned outages
- fits the schedule of the person who understands the work
- has RelEng buildduty available to handle tree closing/opening and field questions in #developers (other channels as needed)
- gives >1 day notice to newsgroups if at all possible
Preparing for the downtime
- review all bugs nominated with "needs-treeclosure?"
- for bugs approved to land in the downtime, RelEng buildduty will:
- verify that the bug is assigned to the person who will actually be doing the work in the downtime
- set the "needs-treeclosure+" flag in the bugs
- set the whiteboard field with the proposed time/date.
- write a downtime notice that includes:
- bug# and description for each item.
- for security sensitive work, be vague, but still include the bug# and vague description. (this reduces confusion about whether an item of work is in/out of a downtime).
- boilerplate text about the timing of the downtime and how it will affect developers. I've included the boilerplate below:
* When can I push?
We pride ourselves on having the self-serve tools to make it easier to recover from build failures caused by a downtime. However, we understand that some developers may not be available to re-trigger failed runs after a downtime is done, or may not want to incur that hassle. Some would rather push early enough to receive all their results before the downtime starts; others would rather wait until the downtime is complete.
If you have LDAP access to Mozilla servers, which if you're landing code you likely do, you can check the current end-to-end times for your chosen development branch. Compare your end-to-end time with the declared start of the downtime in order to make an informed decision about whether you really want to push _now_.
1. https://build.mozilla.org/buildapi/self-serve
2. https://build.mozilla.org/buildapi/reports/endtoend
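The comparison described in the boilerplate can be sketched in a few lines of Python. This is only an illustration: the function name, the example times, and the end-to-end durations are made up, and the end-to-end time is assumed to be read by hand from the end-to-end report rather than fetched automatically.

```python
from datetime import datetime, timedelta, timezone


def results_before_downtime(push_time, end_to_end, downtime_start):
    """Return True if a push made at push_time is expected to have all
    its results before downtime_start, given the current end-to-end time.

    Hypothetical helper for illustration; not part of any RelEng tool.
    """
    return push_time + end_to_end <= downtime_start


# Made-up example values, all in UTC.
push_time = datetime(2012, 4, 25, 14, 0, tzinfo=timezone.utc)
downtime_start = datetime(2012, 4, 25, 18, 0, tzinfo=timezone.utc)

# A 3h30m end-to-end time finishes at 17:30, before the downtime.
print(results_before_downtime(push_time, timedelta(hours=3, minutes=30), downtime_start))

# A 4h30m end-to-end time would finish at 18:30, after the downtime starts.
print(results_before_downtime(push_time, timedelta(hours=4, minutes=30), downtime_start))
```

End-to-end times vary with load, so a result close to the boundary is better treated as "wait until after the downtime".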
Who do I notify, and when?
- etherpad/email everyone who will be working in the downtime, and ask them to approve the draft downtime notice
- BEFORE posting to the newsgroups, send a draft copy of the downtime notice to the list of pre-approvers. Ask for any objections/questions. As of 2012/04/25, the current list of pre-approvers is:
- Bob Moss <email@example.com>, Chris Hofmann <firstname.lastname@example.org>, Alex Keybl <email@example.com>, Damon Sicore <firstname.lastname@example.org>, Johnathan Nightingale <email@example.com>, JP Rosevear <firstname.lastname@example.org>, Sheila Mooney <email@example.com>
- cc firstname.lastname@example.org and email@example.com
- post the downtime notice to the dev.planning & dev.tree-management newsgroups, and send a copy of the notice to firstname.lastname@example.org.
- err on the side of over-communication, i.e. play it safe: if you think a group will be impacted by a downtime and they are not included in the lists above, contact them.
- All of the above notifications should go out *at least 24 hours* before the planned downtime.
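As a quick sanity check on the 24-hour rule, buildduty could compute the latest acceptable send time for a proposed downtime. The sketch below is illustrative only; the names `MIN_NOTICE`, `notice_deadline`, and `has_enough_notice` are hypothetical, and the dates are made up.

```python
from datetime import datetime, timedelta, timezone

# Minimum notice required by the process described above.
MIN_NOTICE = timedelta(hours=24)


def notice_deadline(downtime_start):
    """Latest time at which all notifications must have gone out."""
    return downtime_start - MIN_NOTICE


def has_enough_notice(sent_at, downtime_start):
    """True if a notice sent at sent_at gives at least MIN_NOTICE."""
    return downtime_start - sent_at >= MIN_NOTICE


# Made-up example: downtime proposed for 2012-04-27 13:00 UTC.
downtime_start = datetime(2012, 4, 27, 13, 0, tzinfo=timezone.utc)
print(notice_deadline(downtime_start).isoformat())

# A notice sent the previous afternoon is too late.
sent_at = datetime(2012, 4, 26, 15, 0, tzinfo=timezone.utc)
print(has_enough_notice(sent_at, downtime_start))
```

Using timezone-aware datetimes avoids off-by-hours mistakes when the downtime is announced in one timezone and the notice is sent from another.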
Running the downtime
- Be sure to check dev.planning, dev.tree-management newsgroups and planet.m.o regularly to ensure nothing comes up in response that would require changes to, or outright cancelling of, the downtime. A standout example here would be a chem-spill release.
- Before starting the downtime, RelEng buildduty will notify sheriff in #developers, and close trees.
After the downtime
RelEng buildduty will:
- reopen the trees
- verify with the sheriff that the trees are open and everything is OK
- update the bugs with status ("landed-and-stuck", "rolled-back") - it's possible this will be done first by the person who attempted the work in the downtime. However, if not, RelEng buildduty should update the bugs for the record.
- send "TREE OPEN" newsgroup post/email, listing what did / didn't get done
How do I coordinate downtimes with IT?
Some IT maintenance requires tree closure. Maintaining or rebooting any of the following systems needs a downtime coordinated between RelEng and IT. It usually also needs advance notice of the tree closure posted to the usual sources:
- build.m.o (clobberer, build data, tryserver symbols)
- cruncher.build.m.o (graphs dashboard, dumping build data, dashboards, pulse)
- cvs.mozilla.org (talos)
- hg.mozilla.org (firefox source repos, build repos)
- tinderbox.mozilla.org (central reporting point for all builds)
- ftp.mozilla.org (release updates on beta channel)
- stage.mozilla.org (publishing builds, downloading builds for talos/unittest)
- graphs.mozilla.org (performance tracking)
- buildbot-rw-vip.db.scl3.mozilla.com (buildbot scheduler db, graphserver?)
- buildbot-ro-vip.db.scl3.mozilla.com (used by cruncher)
- mail.build.mozilla.org - currently dm-mail03.m.o (build mail to tinderbox)
- aus3-staging.mozilla.org (update snippets)
- nm-ops03.build.mozilla.org (releng VMs)
- nagios.mozilla.org (monitoring)
- relengweb1.dmz.scl3.mozilla.com (replacement for build.m.o)
- tbpl.mozilla.org (build status)
For any questions, or if you're not sure about a particular server, please check with buildduty in #infra.
If possible, consolidate RelEng and IT downtimes that need tree closures, to avoid the disruption of having two tree closures soon after each other. This is "nice to do", not a "requirement"; if doing two separate downtimes reduces risk, that's fine!
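For anyone scripting maintenance checks, the host list above can be turned into a simple lookup. This is a minimal sketch: the set is copied by hand from the list on this page, the function name is hypothetical, and a host that is not in the set is not automatically safe, so the "check with buildduty" rule still applies.

```python
# Hosts copied from the list above; keep in sync with the page by hand.
TREE_CLOSING_HOSTS = {
    "build.m.o",
    "cruncher.build.m.o",
    "cvs.mozilla.org",
    "hg.mozilla.org",
    "tinderbox.mozilla.org",
    "ftp.mozilla.org",
    "stage.mozilla.org",
    "graphs.mozilla.org",
    "buildbot-rw-vip.db.scl3.mozilla.com",
    "buildbot-ro-vip.db.scl3.mozilla.com",
    "mail.build.mozilla.org",
    "dm-mail03.m.o",
    "aus3-staging.mozilla.org",
    "nm-ops03.build.mozilla.org",
    "nagios.mozilla.org",
    "relengweb1.dmz.scl3.mozilla.com",
    "tbpl.mozilla.org",
}


def needs_coordinated_downtime(host):
    """True if host is on the known list of tree-closing systems.

    False only means "not on the list"; when unsure, ask buildduty.
    """
    return host.strip().lower() in TREE_CLOSING_HOSTS


print(needs_coordinated_downtime("hg.mozilla.org"))
print(needs_coordinated_downtime("example.invalid"))
```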