Buildduty/manifesto

From MozillaWiki

Intro

CiDuty is an operational support team dedicated to monitoring and maintaining the health of Firefox’s continuous integration (CI) infrastructure. Team members are contractors located in Romania who provide 24/7 support.

The team’s responsibilities cover a wide variety of tasks; however, they are not deeply knowledgeable about any particular tool, worker, or task running in our infrastructure. They should therefore be treated as quick responders who can assess the state of the system in a timely manner and escalate issues and inquiries to the appropriate person.

Things CiDuty can help with

Track Firefox CI infrastructure changes

CiDuty tracks and helps publish all CI-related changes. Although this change record would be public, CiDuty can point people to it and help correlate regressions with the changes it lists.

Firefox CI support and case management

First and foremost, CiDuty acts as a "case manager" for your CI needs as a developer. They have escalation paths and a well-defined knowledge of the CI system as a whole. Given that, they are excellent at responding to issues and inquiries, and at making sure anything related to Firefox CI is triaged and managed appropriately.

Firefox CI infrastructure outage coordination and investigation

When the Firefox CI system fails, getting services online again is CiDuty's top priority. They are the initial point of contact for outages, but will likely escalate to other teams with subject matter experts for resolution.

Monitoring, investigating, and debugging issues with the Linux, Windows, and OS X Firefox CI infrastructure

CiDuty monitors the Firefox CI infrastructure using the Nagios GUI and the alerts posted in the #ci IRC channel. They routinely look for system issues and either resolve them using our automation tooling or work with datacenter staff to repair offline or degraded hardware. They also monitor email from AWS about infrastructure that is degraded or requires maintenance.

Monitoring Firefox CI backlog/pending counts

CiDuty is the first point of contact for monitoring the load on the Firefox CI system and determining the cause of any high backlog or pending job counts. If they are unable to determine the root cause and solve the issue, CiDuty escalates to other teams who have subject matter experts.
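
The triage flow described above can be sketched roughly as follows. This is only an illustration: the `PoolStatus` type, the `triage` function, the pool names, and the threshold value are all hypothetical and do not reflect any real CiDuty tooling or limits.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration; not a real CiDuty value.
PENDING_THRESHOLD = 1000


@dataclass
class PoolStatus:
    name: str               # hypothetical worker pool label
    pending: int            # number of jobs waiting to start
    root_cause_known: bool  # whether the cause of the backlog was identified


def triage(pool: PoolStatus, threshold: int = PENDING_THRESHOLD) -> str:
    """Return the next step for a pool: 'ok', 'resolve', or 'escalate'."""
    if pool.pending <= threshold:
        # Backlog is within the assumed normal range; nothing to do.
        return "ok"
    if pool.root_cause_known:
        # Cause identified, so the issue can be handled directly.
        return "resolve"
    # High backlog with no known cause: hand off to subject matter experts.
    return "escalate"
```

For example, a pool with 5000 pending jobs and no identified cause would be marked `escalate` under these assumed values, while the same backlog with a known cause would be marked `resolve`.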

Tree closing and opening

Closing and opening the trees (denying and allowing code check-ins to our Mercurial repos) is typically handled by the Mozilla Code Sheriffs, but CiDuty can also help out with this if needed.

Loaning Firefox build/test instances to developers

CiDuty processes Bugzilla requests from developers for Firefox CI build or test loaners. To obtain a loaner, submit a request in Bugzilla under CiDuty and expect a response in less than one working day (UTC+2).

Upload new packages or Python modules to our internal mirrors

CiDuty can help a developer who needs a new software package uploaded to tooltool or a Python package uploaded to our internal PyPI mirror. They can also grant other developers access to upload packages to tooltool, for a given subset of paths, to allow for future self-service.

Routine maintenance of the Firefox CI configuration

While most of the Taskcluster configuration is handled by the end-developer, some of our infrastructure still runs on Buildbot. CiDuty has the knowledge and capability to modify the Firefox buildbot-configs and perform general maintenance of the Buildbot systems. Maintenance includes tasks such as retasking machines from one platform to another as capacity requirements demand, decommissioning machines, and updating keys and secrets.

PyUp PR work

CiDuty is in charge of keeping a few repos, such as build-puppet, addonscript and treescript, up to date with the PRs that come from PyUp. In general, they know how to investigate why tests are failing and how to make the appropriate changes.

Things CiDuty is not responsible for

Fixing Firefox build and test tasks

While CiDuty has the skills to diagnose CI infrastructure health and make sure that the workers are in a good state, they are not knowledgeable about the internal logic of build and test tasks. They do, however, know who owns what and can help you escalate to the appropriate team.