CiDuty/actionable

From MozillaWiki

Intro

CiDuty is responsible for routine keep-the-lights-on (KTLO) waterline tasks. These include, but are not limited to, providing loaners to developers, triaging urgent break-fix issues, handling tree closures, and maintaining documentation. If any of these come up early in the CiDuty morning, they take priority over anything else. CiDuty should continue performing checks on a regular basis throughout the day to make sure things are in a good state.

Continuously

1. Nagios and SNS Alerts - Monitor alerts from the MDC1 / MDC2 Nagios instances and SNS alerts from Papertrail in the #ci IRC channel. Triage unacknowledged alerts and file/fix bugs as necessary according to the CiDuty How-Tos. Make sure that all CiDuty bugs have the correct priority set according to the priority list below. A few examples of such alerts include:

  • High CI wait times
  • CI pending job backlogs
  • Unresponsive machines
  • Disk/RAM/CPU issues
  • Failed processes

2. Monitor the #ci, #platform-ops-alerts, #releaseduty and #taskcluster IRC channels for requests/questions from developers and other ops teams.

3. Bug triaging - The CiDuty report should be your starting point for bug triaging. Click on the bug number link in the CiDuty report to see the specified bug in Bugzilla.

Daily

1. Review email from AWS looking for announced maintenance, degraded instances, etc. Resolve any issues with instances specified in the AWS email, and notify the appropriate groups of any planned maintenance.

2. Check for and terminate long-running/outdated AWS instances.

3. Run sanity checks in the TaskCluster Provisioner to make sure that the workers are working as expected.
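
The check for long-running instances in item 2 can be sketched in Python as below. This is only a minimal illustration, not the script CiDuty actually uses: the function name find_long_running and the 24-hour threshold are assumptions (this page does not define what counts as "long-running"), and the input is assumed to be a list of instance records shaped like the entries returned by boto3's EC2 describe_instances call (InstanceId, LaunchTime, State).

```python
from datetime import datetime, timedelta, timezone


def find_long_running(instances, max_age, now=None):
    """Return the IDs of running instances launched more than max_age ago.

    `instances` is assumed to be a list of dicts with the keys
    "InstanceId", "LaunchTime" (timezone-aware datetime) and
    "State" ({"Name": ...}), as in boto3 describe_instances output.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for inst in instances:
        # Only running instances are candidates for termination here.
        if inst["State"]["Name"] != "running":
            continue
        if now - inst["LaunchTime"] > max_age:
            stale.append(inst["InstanceId"])
    return stale


if __name__ == "__main__":
    # Hypothetical sample data; the instance IDs are made up.
    now = datetime(2017, 10, 2, 12, 0, tzinfo=timezone.utc)
    sample = [
        {"InstanceId": "i-old", "State": {"Name": "running"},
         "LaunchTime": now - timedelta(days=3)},
        {"InstanceId": "i-new", "State": {"Name": "running"},
         "LaunchTime": now - timedelta(hours=2)},
        {"InstanceId": "i-stopped", "State": {"Name": "stopped"},
         "LaunchTime": now - timedelta(days=10)},
    ]
    # The 24-hour threshold is an assumption for illustration only.
    print(find_long_running(sample, timedelta(hours=24), now=now))
```

The sketch only reports candidates; whether an instance should really be terminated still needs a human decision, since some instances may be long-running on purpose.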

Weekly

1. Review AWS instances, in particular those that have an 'Unknown State/Type' or have been 'stopped for a while'.

2. Run the aws_manage_routingtables script to make sure that the AWS routing tables are up to date and error free. Fix any reported errors.

3. Review the loaner email report and verify that loaners are still required. Reimage/terminate returned loaners as necessary.

CiDuty Bugzilla Priority levels

P1: Waterline KTLO work. This includes developer loaners, urgent break-fix, tree closures, etc.

P2: Projects above waterline that take high priority and have hard deadlines. This includes various fixes that we need to take care of once, not on a recurring basis. Examples of this are 1363897 (disabling legacy extensions prior to 57) and 1315977 (upgrading Python to 2.7.6 on mac builders).

P3: Training/Daily Documentation. This includes onboarding and runbook documentation. Creating/expanding this documentation will help us deliver faster on other bugs.

P4: Operational development. This includes adding new scripts or improving existing scripts, e.g. to reduce running times or to improve the logging and notification systems.
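
As a rough illustration of how the priority levels above could be applied during bug triage, the following Python sketch orders a list of bugs by priority and separates out those that still need one. It is an assumption-laden example rather than an existing CiDuty tool: the function name split_by_priority is hypothetical, the bug IDs are made up, and each bug is assumed to be a dict with "id" and "priority" fields, where an unset priority appears as "--".

```python
# Summary of the CiDuty priority levels described on this page.
PRIORITY_LEVELS = {
    "P1": "Waterline KTLO work (loaners, urgent break-fix, tree closures)",
    "P2": "High-priority projects above waterline with hard deadlines",
    "P3": "Training/daily documentation",
    "P4": "Operational development (new or improved scripts)",
}


def split_by_priority(bugs):
    """Return (prioritized, unprioritized) lists of bugs.

    Bugs with a P1-P4 priority are sorted so that P1 comes first;
    everything else is returned separately so a priority can be set.
    """
    prioritized = [b for b in bugs if b.get("priority") in PRIORITY_LEVELS]
    unprioritized = [b for b in bugs if b.get("priority") not in PRIORITY_LEVELS]
    # "P1" < "P2" < ... holds for plain string comparison.
    prioritized.sort(key=lambda b: b["priority"])
    return prioritized, unprioritized


if __name__ == "__main__":
    # Hypothetical bugs; the IDs do not refer to real Bugzilla entries.
    bugs = [
        {"id": 101, "priority": "P3"},
        {"id": 102, "priority": "--"},
        {"id": 103, "priority": "P1"},
    ]
    ordered, needs_priority = split_by_priority(bugs)
    for bug in ordered:
        print(bug["id"], bug["priority"], PRIORITY_LEVELS[bug["priority"]])
    print("Needs a priority:", [b["id"] for b in needs_priority])
```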