CIDuty: Difference between revisions

From MozillaWiki
(Moved 2 links.)
 
(66 intermediate revisions by 17 users not shown)
Line 1: Line 1:
__TOC__
__TOC__


= What is buildduty? =
= What is CIDuty? =
Each week there is one person from the Release Engineering team dedicated to helping out developers with releng-related issues.  This person will be available during his or her regular work hours for the whole week. This is similar to the sheriff role that rotates through the developer community. To avoid confusion, the releng sheriff position is known as "'''buildduty'''."
CiDuty (formerly BuildDuty) is a team dedicated to helping developers with Firefox continuous integration infrastructure issues and enquiries. We currently have six people based in Romania who provide 24/7 support. CiDuty complements the [[Sheriff|sheriffing team]]: sheriffs respond to Firefox code regressions, while CiDuty responds to problems in the infrastructure that builds and tests Firefox code.


= Who is on buildduty? (schedule) =
Have a question or issue with the Firefox build and test infrastructure? CiDuty can help and ensure your inquiry gets answered.
Check the tree-info dropdown on [https://tbpl.mozilla.org/ tbpl]. The person on buildduty should also have 'buildduty' appended to their IRC nick, and should be available in the #developers, #releng, and #buildduty IRC channels.


Mozilla Releng Sheriff Schedule ([http://www.google.com/calendar/embed?src=aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com&ctz=America/New_York Google Calendar]|[http://www.google.com/calendar/ical/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic.ics iCal]|[http://www.google.com/calendar/feeds/aelh98g866kuc80d5nbfqo6u54%40group.calendar.google.com/public/basic XML])
= Communication =


== Buildduty not around? ==
As a 24/7 support team, CiDuty is available via IRC, email, and Bugzilla.
It happens, especially outside of standard North American working hours (0600-1800 PST). Please [https://bugzilla.mozilla.org/enter_bug.cgi?product=Release%20Engineering&component=Buildduty open a bug] under these circumstances.


= Buildduty priorities =
irc:
== How should I make myself available for duty? ==
* #ci - look for 'ciduty' in nick (monitors other channels as well)
* Add 'buildduty' to your IRC nick
 
* Be available in the following IRC channels (at least): [irc://irc.mozilla.org/#developers #developers], [irc://irc.mozilla.org/#releng #releng], and [irc://irc.mozilla.org/#buildduty #buildduty] (as well as #mozbuild of course)
bugzilla:
** also useful to be in [irc://irc.mozilla.org/#mobile #mobile] and [irc://irc.mozilla.org/#ateam #ateam]
* needinfo or assign ciduty@mozilla.com
* file under [https://bugzilla.mozilla.org/enter_bug.cgi?product=Infrastructure%20%26%20Operations&component=CIDuty CIDuty] component if you are not sure where to file your CI related ticket
 
email:
* ciduty@mozilla.com
 
= Manifesto =
The [[ReleaseEngineering/Buildduty_manifesto| CIDuty manifesto]] describes the team responsibilities in a nutshell.
 
= Team =


== What should I take care of? ==
=== Outages ===
Things fail. It's sad. Getting systems and services stood back up again is buildduty's top priority. Note: this doesn't mean you need to do all the work yourself. For big outages, rope in whatever help you need: domain experts from releng, managers, netops, relops...whoever.
* [[ReleaseEngineering/Buildduty/Dealing With Outages|Dealing with Outages]]
=== Daily ===
* '''Triage'''
** Move bugs to the right component
** Grab them if you're going to work on them
** Find owners to the bugs if you're swamped
** If the bug is not that important to fix, mark it with P1/P2 to indicate that they have been triaged and that we will deal with them in the future
*** Mark them as "enhancement" if they're bugs that will help buildduty initiatives
{| border=1
{| border=1
| '''Explanation + hyperlink'''
| '''Name'''
| '''Actions'''
| '''Profile'''
| '''Social'''
| '''Blog'''
|-
|-
| [https://bugzilla.mozilla.org/buglist.cgi?cmdtype=dorem&remaction=run&namedcmd=releng-loan-requests-triage&sharer_id=30066&list_id=7937951 Loan requests triage]
| Jordan Lund
| During your buildduty week check this daily and keep it closer to zero.  [[ReleaseEngineering/How_To/Loan_a_Slave|Instructions]]
| [https://mozillians.org/u/jlund jlund]
| [https://github.com/lundjordan github]
| [http://jordan-lund.ghost.io/ blog]
|-
|-
| [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html Buildduty report (generated once a day)]
| Zsolt Fay
| Make sure that bugs with no dependencies get dependencies filed, e.g. diagnosis bug, decomm bug, etc. Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, etc). Use the "View list in bugzilla" links to navigate the bugs more easily.
| [https://mozillians.org/en-US/u/zfay/ zfay]
| [https://github.com/Rivulu5 github]
| N/A
|-
|-
| [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346&hostprops=270346 Not acknowledged nagios alerts]
| Radu Iman
| Deal with them. File bugs if needed.
| [https://mozillians.org/en-US/u/riman/ riman]
| [https://github.com/raduiman github]
| N/A
|-
| Bogdan Crisan
| [https://mozillians.org/en-US/u/bcrisan/ bcrisan]
| [https://github.com/bccrisan github]
| N/A
|-
| Danut Labici
| [https://mozillians.org/en-US/u/dlabici/ dlabici]
| [https://github.com/akhliskun github]
| N/A
|-
| Roland Mutter
| [https://mozillians.org/en-US/u/rmutter/ rmutter]
| [https://github.com/mutterroland github]
| N/A
|-
| Adrian Pop
| [https://mozillians.org/en-US/u/apop/ apop]
| [https://github.com/popadrianc github]
| N/A
|}
|}
** Reboot <font color="red">red</font> [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-xp32-ix t-xp32-ix slaves] until {{Bug|977341}} is fixed.  You can do this right from the slave health page.


=== Semi-Daily ===
= CiDuty priorities =
* '''Reconfigs'''
The [[ReleaseEngineering/Buildduty_actionable| CiDuty actionable]] enumerates the team's daily and weekly sanity jobs.
** [[ReleaseEngineering/Buildduty/Reconfigs|Run reconfigs (every day or two days) for other relengers]]
*'''Review "long running" and "lazy" AWS instances'''
** ''When'': 2-3 times a week (eg: Mondays or after weekends/holidays, Wednesdays, and Fridays)
** ''How'': use aws sanity check email (sent daily):
*** Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
*** for each host under heading "Long running instances", follow steps in dealing with [https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves#Long_Running_Instances long running instances]
*** for each host under heading "Lazy long running instances", figure out why they're still up and not taking jobs
**** twistd.log, uptime, reboot history on the slave health page et al
* [https://bugzilla.mozilla.org/buglist.cgi?priority=--&list_id=9730542&short_desc=.*problem%20tracking.*&bug_severity=blocker&bug_severity=critical&bug_severity=major&bug_severity=normal&bug_severity=minor&bug_severity=trivial&columnlist=assigned_to%2Cshort_desc%2Cstatus_whiteboard%2Cchangeddate&resolution=---&emailtype1=exact&query_format=advanced&emailassigned_to1=1&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&short_desc_type=notregexp&email1=nobody%40mozilla.org&component=Buildduty&product=Release%20Engineering Buildduty triage]


=== Weekly ===
= Documentation =
*'''Review 'Unknown State/Type or have stopped for a while'''
There's a [https://wiki.mozilla.org/CIDuty/How_To HowTo wiki page] that aggregates useful info related to the tasks CiDuty is taking care of (as of January 2019).  
** ''When'':
 
*** Once a week. Preferably the reviews are evenly spaced, so let's say Fridays if possible.
= Useful Links =
** ''How'':
* [[ReleaseEngineering/Buildduty/day_1_checklist|Day 1 checklist]]
*** use aws sanity check email (sent daily):
* [https://tools.taskcluster.net/provisioners Provision Explorer]
**** Email filter => to: release+aws-sanity-check@mozilla.com, subject: [cron] aws sanity check
* [https://wiki.mozilla.org/Buildduty/How_To Public "How To" documents]
*** for each host under heading "Unknown State", "Unknown Type"
* [https://mana.mozilla.org/wiki/dosearchsite.action?queryString=title%3A%22How%20To%22&where=RelEng Private "How To" documents]
**** follow steps in dealing with [https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves#Unknown_Type_Or_State_Instances unknown state or type] instances
*** for each host under heading "Stopped For A While"
**** follow steps in dealing with [https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_AWS_slaves#Stopped_For_A_While_Instances stopped for a while instances]


== Infrastructure performance ==
= Deprecated / Archived =
* '''Pending builds'''
The following links and pages are out-of-date or not used anymore. They are still here for historical reasons.
** A high number of pending builds can indicate a problem with the scheduler, (a set of) buildbot-masters, or a particular pool of slaves (and hence possibly puppet)
** The number of pending builds is available in [http://builddata.pub.build.mozilla.org/reports/pending/pending.html graphs] or in the "Infrastructure" pulldown on TBPL.  The graphs are helpful for noticing anomalous behavior.
* '''Wait times'''
** This can be related to pending builds above.
** Releng has made a commitment to developers that 95% or more of their jobs will start within 15 minutes of submission.
*** Build and Try (Build) slave pools have greater capacity (and can expand into AWS as required for linux/mobile/b2g) and are usually over 95% unless there is an outage.
*** Many Test jobs are triggered per build/try job, and the current slave pool is finite, so it is rare for us to meet our turnaround commitment for test jobs.
**** Fixing errant test slaves is hence more important than fixing build slaves. See '''Slave Management''' below.
** Wait times are available either from [https://secure.pub.build.mozilla.org/buildapi/reports/waittimes the buildAPI wait times report] or the daily emails that go to dev.tree-management (un-filter them in Zimbra).  Respond to any unusually long wait times in email, preferably with a reason.
*** wait times emails are run via crontab entries setup on buildapi01.build.scl1.mozilla.com under the buildapi user
* '''Slave management'''
** Bad slaves can burn builds and hung slaves can cause bad wait times. These slaves need to be rebooted, or handed to IT for recovery. Recovered slaves need to be tracked on their way back to operational status.
** The [[ReleaseEngineering/Buildduty/Nagios|Nagios wiki]] has more information about finding problem slaves using nagios.
*** See the [[ReleaseEngineering/Buildduty/Slave_Management|Slave Management wiki]] for more information about fixing those slaves.
* '''dev.tree-management'''
** Monitor dev.tree-management newsgroup (by [https://lists.mozilla.org/listinfo/dev-tree-management email] or by [nntp://mozilla.dev.tree-management nntp]).
* Watch for long-running builds that are holding on to a slave, i.e. >1 day.
** See the [https://secure.pub.build.mozilla.org/buildapi/running buildAPI list of running builds].
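
As an illustration of the wait-time commitment described above (95% or more of jobs starting within 15 minutes of submission), the following Python sketch computes the on-time fraction for a list of jobs. The function names and the (submitted, started) record shape are hypothetical and chosen only for this example. The real figures come from the buildAPI wait times report, whose data format is not described on this page.

```python
from datetime import datetime, timedelta


def fraction_started_on_time(jobs, limit=timedelta(minutes=15)):
    """Return the fraction of jobs that started within `limit` of submission.

    `jobs` is a list of (submitted, started) datetime pairs. This record
    shape is an assumption for illustration, not the buildAPI format.
    """
    if not jobs:
        raise ValueError("no jobs to evaluate")
    on_time = sum(1 for submitted, started in jobs if started - submitted <= limit)
    return on_time / len(jobs)


def meets_commitment(jobs, target=0.95):
    """True when at least `target` of the jobs started within 15 minutes."""
    return fraction_started_on_time(jobs) >= target


if __name__ == "__main__":
    t0 = datetime(2014, 1, 1, 12, 0)
    # Hypothetical sample: three jobs start after 5 minutes, one after 40.
    sample = [(t0, t0 + timedelta(minutes=5))] * 3 + [(t0, t0 + timedelta(minutes=40))]
    print(fraction_started_on_time(sample), meets_commitment(sample))
```

Computing the fraction separately for each slave pool would show the pattern noted above, where build pools usually meet the target and test pools rarely do.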


== Others ==
== Others ==
You should keep on top of:
* '''[[CIDuty/Other_Duties|other, less-frequent duties]]''' that CiDuty can assist with.
* '''Developer requests in IRC'''
* [[ReleaseEngineering/How_To|Old/Deprecated Public "How To" documents]]
** Direct people to [http://mzl.la/tryhelp http://mzl.la/tryhelp] for self-serve documentation when appropriate.
* '''[[ReleaseEngineering/Buildduty/Other_Duties|Other less-frequent duties]]'''
 
= Useful Links =
* [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/ Slave Health]
* [https://secure.pub.build.mozilla.org/buildapi/ Build Dashboard Main Page]
** You can get JSON dumps for people to analyze by adding <code>&format=json</code>
* [[ReleaseEngineering/How_To|Public "How To" documents]]
* [https://intranet.mozilla.org/RelEngWiki/index.php/Category:HowTo Private "How To" documents]
 
= Standard Bugs =
* For IT bugs that are marked "infra only", yet still need to be readable by RelEng, it is not enough to add the release@ alias: people get updates but are not able to comment or read prior comments. Instead, cc the following:
** :aki, :armenzg, :bhearsum, :catlee, :coop, :hwine, :jhopkins, :kmoir, :nthomas, :rail
** :edmorley, :Tomcat, :RyanVM, :KWierso


= Meeting Notes =
== Meeting Notes ==
* [https://releng.etherpad.mozilla.org/buildduty buildduty log]
Old meeting docs from BuildDuty era.
* [[ReleaseEngineering/Buildduty/Meetings|Meetings Notes]]
* [https://etherpad.mozilla.org/buildduty-notes Daily buildduty stand-up notes]
* [[ReleaseEngineering/Buildduty/Meetings|Old buildduty weekly meetings notes]]
* [[ReleaseEngineering/Buildduty/SVMeetings| SoftVision buildduty stand-up notes]]

Latest revision as of 15:09, 15 January 2019

What is CIDuty?

CiDuty (formerly BuildDuty) is a team dedicated to helping developers with Firefox continuous integration infrastructure issues and enquiries. We currently have six people based in Romania who provide 24/7 support. CiDuty complements the sheriffing team: sheriffs respond to Firefox code regressions, while CiDuty responds to problems in the infrastructure that builds and tests Firefox code.

Have a question or issue with the Firefox build and test infrastructure? CiDuty can help and ensure your inquiry gets answered.

Communication

As a 24/7 support team, CiDuty is available via IRC, email, and Bugzilla.

irc:

  • #ci - look for 'ciduty' in nick (monitors other channels as well)

bugzilla:

  • needinfo or assign ciduty@mozilla.com
  • file under CIDuty component if you are not sure where to file your CI related ticket
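
The CIDuty component link above is an ordinary Bugzilla "enter bug" URL with a product and a component in the query string. As a minimal sketch, assuming you want to generate that link from a script, the Python below rebuilds it with the standard library. The helper name `ciduty_bug_url` is hypothetical; the product and component values are the ones used by the link on this page.

```python
from urllib.parse import quote, urlencode

BUGZILLA_ENTER_BUG = "https://bugzilla.mozilla.org/enter_bug.cgi"


def ciduty_bug_url(product="Infrastructure & Operations", component="CIDuty"):
    """Build a Bugzilla "enter bug" URL for the given product and component."""
    # quote_via=quote encodes spaces as %20 rather than "+", which matches
    # the form of the link used on this page.
    query = urlencode({"product": product, "component": component}, quote_via=quote)
    return f"{BUGZILLA_ENTER_BUG}?{query}"


if __name__ == "__main__":
    print(ciduty_bug_url())
```

Passing a different `component` gives the link for another component of the same product, which can help when redirecting a ticket that was filed under CIDuty by default.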

email:

  • ciduty@mozilla.com

Manifesto

The CIDuty manifesto describes the team responsibilities in a nutshell.

Team

Name          | Profile | Social | Blog
Jordan Lund   | jlund   | github | blog
Zsolt Fay     | zfay    | github | N/A
Radu Iman     | riman   | github | N/A
Bogdan Crisan | bcrisan | github | N/A
Danut Labici  | dlabici | github | N/A
Roland Mutter | rmutter | github | N/A
Adrian Pop    | apop    | github | N/A

CiDuty priorities

The CiDuty actionable enumerates the team's daily and weekly sanity jobs.

Documentation

There's a HowTo wiki page that aggregates useful info related to the tasks CiDuty is taking care of (as of January 2019).

Useful Links

Deprecated / Archived

The following links and pages are out-of-date or not used anymore. They are still here for historical reasons.

Others

Meeting Notes

Old meeting docs from BuildDuty era.