CloudServices/Principles

From MozillaWiki

Services Operations is guided by several "First Principles". These ensure we can deliver the best services with the highest availability possible. Asking (or forcing) us to deviate from any of these principles means we cannot deliver a top-notch service. That is not to say everything needs "five 9s" (99.999% availability); just realize there are trade-offs.
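
To put the "five 9s" trade-off in numbers, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for illustration; they are not part of any Services Operations tooling.

```python
# Yearly downtime budget for a given availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the allowed downtime in minutes per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, target in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.1f} minutes/year")
```

Each extra 9 cuts the budget by a factor of ten: roughly 526 minutes a year at three 9s, but only about 5.3 minutes a year at five 9s.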

Careful change control

This is the idea that change is an expected part of Operations and should be managed in a sane and complete manner. These guidelines are for "big" changes, not day-to-day procedures.

Change is managed through these policies:

  • complex changes require a step-by-step plan, including a back-out plan (if "perform the roll-out steps in reverse" is a correct back-out plan, that's fine).
    • All plans are to undergo a step-by-step "dry run" before being executed live.
  • a secondary person helps with and checks plans for complex changes (buddy system)
  • points of commitment (or "no return") have go/no-go calls with a clear owner and a deterministic data point
  • every event that warrants a change plan must have a clearly defined owner who is responsible for the "heads-up", "starting", and "all-clear" email communications (and Twitter, IRC, Slack, etc. as needed), as well as any write-up on deviations from the plan or other problems.
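
As an illustration of the policies above, here is a minimal sketch of a change plan that supports a dry run, a go/no-go call at points of no return, and a back-out that performs the roll-out steps in reverse. The `Step` class, `run_plan` function, and the `go_no_go` callback are hypothetical names invented for this example; they do not describe an existing tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a change plan (hypothetical structure for illustration)."""
    description: str
    apply: Callable[[], None]
    revert: Callable[[], None]
    point_of_no_return: bool = False  # needs an explicit go/no-go call


def run_plan(steps: List[Step],
             go_no_go: Callable[[Step], bool],
             dry_run: bool = True) -> List[str]:
    """Walk the plan in order; on a no-go or an error, back out in reverse.

    In a dry run nothing is applied or reverted; the steps are only logged.
    A step whose apply() raises is not reverted, only the steps before it.
    """
    log: List[str] = []
    completed: List[Step] = []
    prefix = "DRY RUN" if dry_run else "LIVE"
    try:
        for step in steps:
            if step.point_of_no_return and not go_no_go(step):
                raise RuntimeError(f"no-go at '{step.description}'")
            log.append(f"{prefix} apply: {step.description}")
            if not dry_run:
                step.apply()
            completed.append(step)
    except Exception as exc:
        log.append(f"{prefix} aborted: {exc}")
        for step in reversed(completed):
            log.append(f"{prefix} revert: {step.description}")
            if not dry_run:
                step.revert()
    return log


# Example: rehearse a two-step plan without touching anything.
state: List[str] = []
plan = [
    Step("drain host from pool",
         lambda: state.append("drained"), lambda: state.remove("drained")),
    Step("upgrade package",
         lambda: state.append("upgraded"), lambda: state.remove("upgraded"),
         point_of_no_return=True),
]
for line in run_plan(plan, go_no_go=lambda step: True, dry_run=True):
    print(line)
```

The returned log doubles as raw material for the owner's "starting" and "all-clear" messages and for any write-up on deviations from the plan.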

Rules of engagement

Call them "design principles" if you want. The point is to provide basic risk-reduction guidelines.

  • design and code reviews are standard, expected tasks
  • systems and environments have a production look-alike to test complex changes
  • applications may not run as root
  • Non-Ops users do not and will not have shell access to application or administrative accounts
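
One way an application can enforce the "may not run as root" rule on itself is to check its effective user ID at start-up. The sketch below assumes a Unix-like host (where `os.geteuid` is available); the function name `ensure_not_root` is made up for this example.

```python
import os
from typing import Optional


def ensure_not_root(euid: Optional[int] = None) -> None:
    """Raise PermissionError if the effective user ID is 0 (root).

    Assumption: a Unix-like host. On platforms without os.geteuid the
    check is skipped rather than guessed at.
    """
    if euid is None:
        if not hasattr(os, "geteuid"):
            return
        euid = os.geteuid()
    if euid == 0:
        raise PermissionError("refusing to start: this application must not run as root")


# Example: call this first thing in the application's entry point.
try:
    ensure_not_root()
    print("ok: not running as root")
except PermissionError as exc:
    print(exc)
```

A self-check like this is a backstop only; the service account and process supervisor should already be configured so the application never starts as root.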

Close the loop

This is the idea that for every action, event, or condition that leads to a "bad result" we will find a way to either prevent it, or correct it at the earliest point possible to minimize impact.

That is reflected in all our processes, for example:

  • monitoring and alerting
  • log scraping & analysis
  • metrics (both automated and human analysis)
  • continuous process refinement

No SPOFs

(SPOF == Single Point of Failure)

We are very careful to design systems that have inherent redundancy. Systems that depend on a single data-path or host are [usually] excluded from production.

To avoid critical failures:

  • redundant multi-host & round-robin systems are preferred over cold fail-over systems
  • the single host is the smallest unit of allowed (and expected) failure
  • solutions for redundancy are evaluated purely on merit -- not hardware vs software or by vendor / project
  • MTBF (mean time between failures) is evaluated and is a key driver of system choices
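
The effect of redundancy and MTBF on availability can be estimated with two standard formulas, sketched below. Steady-state availability is MTBF / (MTBF + MTTR), where MTTR (mean time to repair) is a quantity introduced here for the example. The redundancy formula assumes host failures are independent and that any one surviving host can carry the load, which real systems only approximate.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one host: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single_host: float, hosts: int) -> float:
    """Availability of `hosts` redundant hosts where any one can serve.

    Assumption: failures are independent, so the system is down only
    when every host is down at the same time.
    """
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    return 1.0 - (1.0 - single_host) ** hosts


# Example with made-up numbers: a host that fails every 1000 hours on
# average and takes 10 hours to repair, alone and in a pool of two or three.
one = availability(mtbf_hours=1000, mttr_hours=10)
for n in (1, 2, 3):
    print(f"{n} host(s): {redundant_availability(one, n):.6f}")
```

Under these assumptions a second host improves availability far more than modest gains in a single host's MTBF, which is why a single host is treated as an expected unit of failure.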

Close contact

With constant change and high demands on our staff and systems, we believe in openness and the free exchange of information. If nothing else, being open is easier: you don't have to track down the right person to ask!

  • early involvement with Eng project staff
  • designated Lead and Backup Ops Eng for each project
  • weekly(ish) project sync-ups for active projects, including Eng, Ops, & QA
  • internally open request queue, logs, procedures, documents, and schedules
  • public exposure where possible (should be the default unless there is an explicit concern)

Operational readiness

All our systems (hosts, networks, software and staff) should be ready to serve their function as required. Expected failures should have well understood and limited impact.

  • "ad hoc" assistance is never required for normal operation. Any discovered requirement for ad hoc assistance is a Blocker-level problem
  • the system should be secure from external tampering
  • the system is efficient in terms of resources used
  • data is protected from expected failure (e.g. disk failure, network congestion)
  • any system can be reconstructed procedurally (meaning no one has to "figure out" how a system was built or configured)

Privilege and Responsibility are coupled

This links the ability to ask for a change with the consequences of that change.

  • access to privileged commands is limited to responsible parties (i.e. "If you can't fix a meltdown, you can't press the Big Red Button")
  • all privileged access must be logged and must provide an audit trail
  • Engineering resources are included in the escalation tree
  • production checkout after planned (or unplanned) changes
  • direct pager coverage
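
The first two points above (privileged commands limited to responsible parties, and every privileged access logged) can be sketched as a small wrapper. The `RESPONDERS` set, the `privileged` decorator, and `restart_service` are hypothetical names for illustration; a real deployment would rely on the platform's own access control and audit facilities.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
audit_log = logging.getLogger("audit")

# Hypothetical list of people who carry the pager for this service.
RESPONDERS = {"alice", "bob"}


def privileged(func):
    """Allow a command only for responders, and log every attempt."""
    @functools.wraps(func)
    def wrapper(user, *args, **kwargs):
        allowed = user in RESPONDERS
        # Log before acting, so denied attempts also leave an audit trail.
        audit_log.info("user=%s command=%s args=%r allowed=%s",
                       user, func.__name__, args, allowed)
        if not allowed:
            raise PermissionError(f"{user} is not a responder for {func.__name__}")
        return func(user, *args, **kwargs)
    return wrapper


@privileged
def restart_service(user, name):
    """Stand-in for a real privileged action."""
    return f"{name} restarted by {user}"


print(restart_service("alice", "sync"))
```

Tying the allow-list to the on-call roster keeps the ability to press the Big Red Button with the people who would be paged to fix the result.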