CiDuty/actionable: Difference between revisions

From MozillaWiki
(Updated information on page.)
Line 1: Line 1:
== Intro ==
CiDuty is responsible for routine keep-the-lights-on (KTLO) waterline tasks. These include but are not limited to providing loaners to developers, triaging urgent break-fix issues, handling trees closures, and maintaining documentation. If any of these come up early in CiDuty morning, they take priority over anything else. CiDuty should continue performing checks on a regular basis throughout  
CiDuty is responsible for routine keep-the-lights-on (KTLO) waterline tasks. These include but are not limited to providing loaners to developers, triaging urgent break-fix issues, handling trees closures, and maintaining documentation. If any of these come up early in CiDuty morning, they take priority over anything else. CiDuty should continue performing checks on a regular basis throughout the day to make sure things are in a good state.  
the day to make sure things are in a good state.  


== Continuously ==
'''1.''' Nagios and SNS Alerts - Monitor alerts from [https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?host=all&servicestatustypes=28/ MDC1] Nagios instances and SNS alerts from [https://papertrailapp.com/dashboard/ papertrail] in the #ci IRC channel. Triage unacknowledged alerts and file/fix bugs as necessary according to the [https://wiki.mozilla.org/ReleaseEngineering/How_To/ Release Engineering How-Tos]. Make sure that all CiDuty bugs have the correct priority set according to the [[ReleaseEngineering/Buildduty_actionable#Buildduty_Bugzilla_Priority_levels/pending_counts| priority list]] below. A few examples of such alerts include:
'''1.''' Nagios and SNS Alerts - Monitor alerts from [https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/ MDC1] / [https://nagios1.private.releng.mdc2.mozilla.com/releng-mdc2/ MDC2] Nagios instances and SNS alerts from [https://papertrailapp.com/dashboard/ papertrail] in the #ci IRC channel. Triage unacknowledged alerts and file/fix bugs as necessary according to the [https://wiki.mozilla.org/CIDuty/How_To CiDuty How-Tos]. Make sure that all CiDuty bugs have the correct priority set according to the [[ReleaseEngineering/Buildduty_actionable#Buildduty_Bugzilla_Priority_levels| priority list]] below. A few examples of such alerts include:
* High CI wait times
* CI pending job backlogs
* Relengbot failures
* Golden AMI generation failures
* Unresponsive machines
* Disk/RAM/CPU issues
* Failed processes


'''2.''' Monitor the #releng and #taskcluster irc channels for requests/questions from developers and other ops teams.
'''2.''' Monitor the #ci, #platform-ops-alerts, #releaseduty and #taskcluster irc channels for requests/questions from developers and other ops teams.


'''3.''' Bug triaging - The [https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html CiDuty report] (generated hourly) should be your starting point for bug triaging. Click on the bug number link in the CiDuty report to see the specified bug in bugzilla. You can also view the list of bugs with this [https://mzl.la/2wDHhDZ bugzilla search].
'''3.''' Bug triaging - The [https://bugzilla.mozilla.org/buglist.cgi?quicksearch=CiDuty&list_id=14513240 CiDuty report] should be your starting point for bug triaging. Click on the bug number link in the CiDuty report to see the specified bug in bugzilla.
* At the top, it lists unassigned bugs for loan requests. Keep this queue empty and make sure developers are unblocked. The wiki has instructions for [https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave/ how to loan a machine].
* After loans are taken care of, make sure that bugs in the "No dependencies" section get dependencies filed, e.g. diagnosis bug, decomm bug, etc. The specific next steps will depend on the issue; [https://wiki.mozilla.org/ReleaseEngineering/How_To/ Release Engineering How-Tos] has details covering each case.
* Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, return to production, etc). Again, the specific next steps will depend on the issue.
* Systemic issues (e.g. test failures that require further investigation) should not stay in the CiDuty bugzilla component. It may be OK for you to take the bug and work on it depending on how much time you have, but generally these types of bugs should be escalated to the appropriate team and moved to their component (e.g. General Automation) once CiDuty has triaged them.


== Daily ==
Line 26: Line 19:
'''2.''' Check for and terminate long-running/outdated AWS instances.


'''3.''' Sanity checks in TaskCluster making sure that the workers in /provisioner/releng-hardware are working as expected.
'''3.''' Sanity checks in [https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/ TaskCluster Provisioner]making sure that the workers are working as expected.


== Weekly ==
Line 49: Line 42:
This includes onboarding and runbook documentation. Creating/expanding this documentation will help us deliver faster on other bugs.


'''P4.''' TC migration cleanup.
'''P4.''' Operational development.
This includes all tasks related to decommissioning, capacity and load reduction, etc as we transition from Buildbot to Taskcluster.
 
'''P5.''' Operational development.
This includes adding new scripts or improving existing scripts, e.g. to reduce running times, improve logging and notifications system, etc.
This includes adding new scripts or improving existing scripts, e.g. to reduce running times, improve logging and notifications system, etc.

Revision as of 14:37, 15 January 2019

Intro

CiDuty is responsible for routine keep-the-lights-on (KTLO) waterline tasks. These include, but are not limited to, providing loaners to developers, triaging urgent break-fix issues, handling tree closures, and maintaining documentation. If any of these come up early in the CiDuty morning, they take priority over anything else. CiDuty should continue performing checks regularly throughout the day to make sure things are in a good state.

Continuously

1. Nagios and SNS Alerts - Monitor alerts from MDC1 / MDC2 Nagios instances and SNS alerts from papertrail in the #ci IRC channel. Triage unacknowledged alerts and file/fix bugs as necessary according to the CiDuty How-Tos. Make sure that all CiDuty bugs have the correct priority set according to the priority list below. A few examples of such alerts include:

  • High CI wait times
  • CI pending job backlogs
  • Unresponsive machines
  • Disk/RAM/CPU issues
  • Failed processes

2. Monitor the #ci, #platform-ops-alerts, #releaseduty, and #taskcluster IRC channels for requests/questions from developers and other ops teams.

3. Bug triaging - The CiDuty report should be your starting point for bug triaging. Click on a bug number link in the CiDuty report to see that bug in Bugzilla.
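
The CiDuty report is a Bugzilla quicksearch for "CiDuty". As a minimal sketch, the same search can be expressed as a REST query URL. This assumes the Bugzilla REST `/rest/bug` endpoint and its `quicksearch` and `include_fields` parameters; the function name and the chosen fields are illustrative, not part of any CiDuty tooling.

```python
from urllib.parse import urlencode

# Assumption: the standard Bugzilla REST search endpoint on bugzilla.mozilla.org.
BUGZILLA_REST = "https://bugzilla.mozilla.org/rest/bug"


def ciduty_search_url(term="CiDuty", fields=("id", "summary", "priority")):
    """Build a REST query URL mirroring the quicksearch behind the CiDuty report."""
    query = {"quicksearch": term, "include_fields": ",".join(fields)}
    return f"{BUGZILLA_REST}?{urlencode(query)}"


# The URL can be opened in a browser or fetched with any HTTP client.
print(ciduty_search_url())
```

Only the URL is built here; no request is sent, so the sketch can be adapted without touching the live Bugzilla instance.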

Daily

1. Review email from AWS looking for announced maintenance, degraded instances, etc. Resolve any issues with instances specified in the AWS email, and notify the appropriate groups of any planned maintenance.

2. Check for and terminate long-running/outdated AWS instances.
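
One way to approach this check is to filter the instance list by age. The sketch below is illustrative only: the `Instance` record, the 24-hour threshold, and the instance IDs are hypothetical, and in practice the data would come from the AWS console or API rather than a hand-built list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Instance:
    # Hypothetical record; the fields mirror what an AWS instance listing reports.
    instance_id: str
    launch_time: datetime
    state: str


def long_running(instances, now, max_age=timedelta(hours=24)):
    """Return running instances older than max_age, oldest first."""
    stale = [
        inst for inst in instances
        if inst.state == "running" and now - inst.launch_time > max_age
    ]
    return sorted(stale, key=lambda inst: inst.launch_time)


# Example with made-up data: only the 30-hour-old running instance is flagged.
now = datetime(2019, 1, 15, 14, 37, tzinfo=timezone.utc)
fleet = [
    Instance("i-hypothetical-a", now - timedelta(hours=30), "running"),
    Instance("i-hypothetical-b", now - timedelta(hours=2), "running"),
    Instance("i-hypothetical-c", now - timedelta(days=3), "stopped"),
]
for inst in long_running(fleet, now):
    print(inst.instance_id, now - inst.launch_time)
```

The function only reports candidates. Whether to terminate one remains a manual decision, since an old instance may still be in legitimate use.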

3. Run sanity checks in the TaskCluster Provisioner to make sure that the workers are working as expected.

Weekly

1. Review AWS instances, in particular all those that have an 'Unknown State/Type' or have been 'stopped for a while'.

2. Run the aws_manage_routingtables script to make sure that the AWS routing tables are up to date and error free. Fix any reported errors.

3. Review the loaner email report and verify that loaners are still required. Reimage/terminate returned loaners as necessary.

CiDuty Bugzilla Priority levels

P1: Waterline KTLO work. This includes developer loaners, urgent break-fix, tree closures, etc.

P2: Projects above the waterline that take high priority and have hard deadlines. This includes various fixes that we need to take care of once, not on a recurring basis. Examples are bug 1363897 (disabling legacy extensions prior to 57) and bug 1315977 (upgrading Python to 2.7.6 on Mac builders).

P3: Training/Daily Documentation. This includes onboarding and runbook documentation. Creating/expanding this documentation will help us deliver faster on other bugs.

P4: Operational development. This includes adding new scripts or improving existing scripts, e.g. to reduce running times or to improve the logging and notification systems.
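
The priority levels above can be used to order a triage queue. The following is a small sketch under stated assumptions: bugs are plain dictionaries with a `priority` field, the bug IDs are made up, and unprioritized bugs (Bugzilla's `--`) are sorted last so they stand out as needing a priority.

```python
# Short labels paraphrasing the priority list above.
PRIORITY_LEVELS = {
    "P1": "Waterline KTLO work",
    "P2": "High-priority projects with hard deadlines",
    "P3": "Training/daily documentation",
    "P4": "Operational development",
}


def sort_by_priority(bugs):
    """Order bugs P1 first; bugs with no recognized priority go last."""
    order = {level: rank for rank, level in enumerate(PRIORITY_LEVELS)}
    return sorted(bugs, key=lambda bug: order.get(bug["priority"], len(order)))


# Hypothetical queue; IDs and summaries are placeholders.
queue = [
    {"id": 3, "priority": "P4", "summary": "improve script logging"},
    {"id": 1, "priority": "--", "summary": "new unprioritized bug"},
    {"id": 2, "priority": "P1", "summary": "developer loaner request"},
]
for bug in sort_by_priority(queue):
    label = PRIORITY_LEVELS.get(bug["priority"], "needs a priority")
    print(bug["priority"], bug["id"], label)
```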