CiDuty is responsible for routine keep-the-lights-on (KTLO) waterline tasks. These include, but are not limited to, providing loaners to developers, triaging urgent break-fix issues, handling tree closures, and maintaining documentation. If any of these come up early in the CiDuty morning, they take priority over anything else. CiDuty should continue performing checks at regular intervals throughout the day to make sure things are in a good state.
1. Nagios and SNS Alerts - Monitor alerts from the SCL3 and MDC1 Nagios instances and SNS alerts from Papertrail in the #ci IRC channel. Triage unacknowledged alerts and file/fix bugs as necessary according to the Release Engineering How-Tos. Make sure that all CiDuty bugs have the correct priority set according to the priority list below. A few examples of such alerts include:
- High CI wait times
- CI pending job backlogs
- Buildbot misconfigurations
- Relengbot failures
- Golden AMI generation failures
- Unresponsive machines
- Disk/RAM/CPU issues
- Failed processes
- Buildbot master process age
2. Monitor the #releng and #taskcluster IRC channels for requests/questions from developers and other ops teams.
3. Bug triaging - The CiDuty report (generated hourly) should be your starting point for bug triaging. Click a bug number in the CiDuty report to open that bug in Bugzilla. You can also view the list of bugs with this Bugzilla search.
- At the top, it lists unassigned bugs for loan requests. Keep this queue empty and make sure developers are unblocked. The wiki has instructions for how to loan a machine.
- After loans are taken care of, make sure that bugs in the "No dependencies" section get dependencies filed, e.g. diagnosis bug, decomm bug, etc. The specific next steps will depend on the issue; Release Engineering How-Tos has details covering each case.
- Do the same for bugs in the "All dependencies resolved" section to make sure the next action is taken (re-image, decomm, return to production, etc). Again, the specific next steps will depend on the issue.
- Systemic issues (e.g. test failures that require further investigation) should not stay in the CiDuty Bugzilla component. It may be OK for you to take the bug and work on it if you have time, but generally these bugs should be escalated to the appropriate team and moved to that team's component (e.g. General Automation) once CiDuty has triaged them.
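The triage order described above (loan requests first, then bugs with no dependencies, then bugs whose dependencies are all resolved) can be sketched as a simple sort. This is a minimal illustration only: the section labels, the `triage_order` helper, and the bug dictionaries are hypothetical and do not reflect the actual format of the CiDuty report.

```python
from typing import Dict, List

# Hypothetical labels for the three report sections described above.
SECTION_ORDER = ["loan_requests", "no_dependencies", "all_dependencies_resolved"]


def triage_order(bugs: List[Dict]) -> List[Dict]:
    """Return bugs sorted by report section, then by bug id."""

    def sort_key(bug: Dict):
        section = bug.get("section")
        # Bugs from unrecognized sections sort after the known ones.
        if section in SECTION_ORDER:
            rank = SECTION_ORDER.index(section)
        else:
            rank = len(SECTION_ORDER)
        return (rank, bug["id"])

    return sorted(bugs, key=sort_key)


if __name__ == "__main__":
    # Made-up bug ids, for illustration only.
    sample = [
        {"id": 1003, "section": "all_dependencies_resolved"},
        {"id": 1002, "section": "no_dependencies"},
        {"id": 1001, "section": "loan_requests"},
    ]
    for bug in triage_order(sample):
        print(bug["id"], bug["section"])
```

Sorting by an explicit section list keeps the "loans first" rule in one place, so the order is easy to adjust if the report layout changes.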
1. Review email from AWS looking for announced maintenance, degraded instances, etc. Resolve any issues with instances specified in the AWS email, and notify the appropriate groups of any planned maintenance.
2. Check for and terminate long-running/outdated AWS instances.
3. Check slave health for errored instances.
4. Perform buildbot reconfigs if needed.
5. Perform buildbot master restarts as needed.
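The check for long-running instances in the list above could be approximated as follows. This is a sketch under stated assumptions: the 24-hour cutoff, the `find_long_running` helper, and the instance dictionaries (with `id` and `launch_time` keys) are all hypothetical, and fetching the instance list from AWS is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Assumed cutoff; the real threshold depends on instance type and team policy.
MAX_AGE = timedelta(hours=24)


def find_long_running(
    instances: List[Dict],
    now: Optional[datetime] = None,
    max_age: timedelta = MAX_AGE,
) -> List[str]:
    """Return the ids of instances that have been up longer than max_age."""
    now = now or datetime.now(timezone.utc)
    return [
        inst["id"]
        for inst in instances
        if now - inst["launch_time"] > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Made-up instances, for illustration only.
    sample = [
        {"id": "i-old", "launch_time": now - timedelta(hours=30)},
        {"id": "i-new", "launch_time": now - timedelta(hours=2)},
    ]
    print(find_long_running(sample, now=now))
```

The function only reports candidates; whether an instance should actually be terminated remains a manual decision.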
1. Review AWS instances that are reported as 'Unknown State', 'Unknown Type', or 'Stopped For A While'.
- Check the AWS sanity logs in Papertrail.
- For each host under the "Unknown State" or "Unknown Type" heading, follow this guide.
- For each host under the "Stopped For A While" heading, follow this guide.
2. Run the aws_manage_routingtables script to make sure that the AWS routing tables are up to date and error free. Fix any reported errors.
3. Review the loaner email report and verify that loaners are still required. Reimage/terminate returned loaners as necessary.
CiDuty Bugzilla Priority levels
P1: Waterline KTLO work. This includes developer loaners, urgent break-fix, tree closures, etc.
P2: Projects above waterline that take high priority and have hard deadlines. This includes one-off fixes rather than recurring work. Examples are bug 1363897 (disabling legacy extensions prior to 57) and bug 1315977 (upgrading Python to 2.7.6 on Mac builders).
P3: Training/Daily Documentation. This includes onboarding and runbook documentation. Creating and expanding this documentation will help us deliver faster on other bugs.
P4: TC migration cleanup. This includes all tasks related to decommissioning, capacity and load reduction, etc., as we transition from Buildbot to Taskcluster.
P5: Operational development. This includes adding new scripts or improving existing ones, e.g. to reduce running times or improve the logging and notification systems.
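As an illustration of checking that every CiDuty bug carries one of the priority levels above, the sketch below flags bugs with a missing or unrecognized priority and orders the rest from P1 to P5. The bug dictionaries (with `id` and `priority` keys) and both helper names are hypothetical, and the `--` value for an unset priority is an assumption; how bugs are fetched from Bugzilla is out of scope here.

```python
from typing import Dict, List

# Priority levels from the list above; descriptions are abbreviated.
PRIORITIES = {
    "P1": "Waterline KTLO work",
    "P2": "High-priority projects with hard deadlines",
    "P3": "Training/daily documentation",
    "P4": "TC migration cleanup",
    "P5": "Operational development",
}


def bugs_missing_priority(bugs: List[Dict]) -> List[int]:
    """Return ids of bugs whose priority is not one of P1..P5."""
    return [bug["id"] for bug in bugs if bug.get("priority") not in PRIORITIES]


def by_priority(bugs: List[Dict]) -> List[Dict]:
    """Return bugs with a valid priority, most urgent first."""
    valid = [bug for bug in bugs if bug.get("priority") in PRIORITIES]
    return sorted(valid, key=lambda bug: (bug["priority"], bug["id"]))


if __name__ == "__main__":
    # Made-up bugs, for illustration only; "--" is assumed to mean "unset".
    sample = [
        {"id": 1, "priority": "P3"},
        {"id": 2, "priority": "--"},
        {"id": 3, "priority": "P1"},
    ]
    print("Needs a priority:", bugs_missing_priority(sample))
    for bug in by_priority(sample):
        print(bug["id"], bug["priority"], PRIORITIES[bug["priority"]])
```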