CloudServices/Principles

Services Operations is guided by several "First Principles". These ensure we can deliver the best services with the highest availability possible. Asking (or forcing) us to deviate from any of these principles means we cannot deliver a top-notch service. This is not to say that everything needs "five 9s" of availability, only that every deviation is a trade-off.

Careful change control

This is the idea that change is an expected part of Operations and should be managed in a sane and complete manner. These guidelines are for "big" changes, not day-to-day procedures.

Change is managed through these policies:

complex changes require a step-by-step plan, including a back-out plan (if "perform roll-out steps in reverse" is a correct back-out plan, that's fine)
all plans undergo a step-by-step "dry run" before executing live
a second person helps with and checks plans for complex changes (buddy system)
points of commitment (or "no return") have go/no-go calls with a clear owner or a deterministic datapoint
all change-plan-worthy events must have a clearly defined owner who is responsible for "heads-up", "starting", and "all-clear" email communications (and Twitter, IRC, Slack, etc. as needed), as well as any write-up on deviations from the plan or other problems
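
As an illustration only, the policies above could be captured in a small data structure. This is a minimal sketch, not a tool Services Operations actually uses: the names `Step`, `ChangePlan`, `back_out_steps`, and `dry_run` are hypothetical, and it assumes the plan owner also makes the go/no-go calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    description: str
    # Hypothetical flag: marks a point of commitment ("no return").
    point_of_no_return: bool = False


@dataclass
class ChangePlan:
    owner: str  # responsible for heads-up, starting, and all-clear notices
    buddy: str  # second person who checks the plan (buddy system)
    steps: List[Step] = field(default_factory=list)

    def back_out_steps(self) -> List[str]:
        """Default back-out plan: undo the roll-out steps in reverse order."""
        return [f"undo: {s.description}" for s in reversed(self.steps)]

    def dry_run(self) -> List[str]:
        """Walk the plan step by step without executing anything."""
        lines = [f"[{self.owner}] heads-up"]
        for number, step in enumerate(self.steps, start=1):
            if step.point_of_no_return:
                # Assumption: the plan owner makes the go/no-go call.
                lines.append(f"go/no-go call by {self.owner} before step {number}")
            lines.append(f"step {number}: {step.description}")
        lines.append(f"[{self.owner}] all-clear")
        return lines
```

Printing `dry_run()` for a plan gives the buddy something concrete to review, and `back_out_steps()` is only a valid back-out plan when reversing the roll-out really does restore the previous state.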

Rules of engagement

Call them "design principles" if you want. Either way, these are basic risk-reduction guidelines.

design and code reviews are standard, expected tasks
systems and environments have a production look-alike to test complex changes
applications may not run as root
non-Ops users do not and will not have shell access to application or administrative accounts

Close the loop

This is the idea that for every action, event, or condition that leads to a "bad result" we will find a way to either prevent it, or correct it at the earliest point possible to minimize impact.

That is reflected in all our processes, for example:

monitoring and alerting
log scraping & analysis
metrics (both automated and human analysis)
continuous process refinement
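
A minimal sketch of the log scraping and alerting items above, assuming a hypothetical log format in which a severity word (`ERROR` or `CRITICAL`) is followed by a component name. The pattern, the function names, and the threshold are illustrative assumptions, not a description of any existing tooling.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed line format: "<timestamp> <SEVERITY> <component> <message>".
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL)\b\s+(?P<component>[\w.-]+)")


def count_errors(lines: Iterable[str]) -> Counter:
    """Count error-level log lines per component."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            counts[match.group("component")] += 1
    return counts


def components_to_alert(counts: Counter, threshold: int) -> List[str]:
    """Return the components whose error count reached the threshold."""
    return sorted(name for name, n in counts.items() if n >= threshold)
```

Feeding the alert list back into a ticket or a process change is what closes the loop; the counting itself is only the detection half.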

No SPOFs

(SPOF == Single Point of Failure)

We are very careful to design systems that have inherent redundancy. Systems that depend on a single data-path or host are [usually] excluded from production.

To avoid critical failures:

redundant multi-host & round-robin systems are preferred over cold fail-over systems
the single host is the smallest unit of allowed (and expected) failure
solutions for redundancy are evaluated purely on merit -- not hardware vs software or by vendor / project
the effect on MTBF (mean time between failures) is evaluated and is a key driver of system choices
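
The reasoning behind these points can be sketched with some standard availability arithmetic. This is a rough model under stated assumptions: steady-state availability is taken as MTBF / (MTBF + MTTR), host failures are assumed independent, and any one live host is assumed able to serve. The function names and the `pick_hosts` helper are hypothetical.

```python
from typing import List, Sequence, Set


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one host: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, hosts: int) -> float:
    """Availability of N redundant hosts, assuming independent failures."""
    return 1.0 - (1.0 - single) ** hosts


def pick_hosts(hosts: Sequence[str], healthy: Set[str], requests: int) -> List[str]:
    """Round-robin over the hosts, skipping any that are not healthy."""
    live = [h for h in hosts if h in healthy]
    if not live:
        raise RuntimeError("no healthy hosts left: this is a total outage")
    return [live[i % len(live)] for i in range(requests)]
```

Under this model, a host with an MTBF of 999 hours and a one-hour repair time is about 99.9% available, and two such hosts behind a round-robin are about 99.9999% available. Real failures are often correlated (shared power, network, or configuration), so the figure is an upper bound rather than a prediction.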

Close contact

With constant change and high demands on our staff and systems, we believe in openness and free exchange of information. If nothing else, being open is easier: you don't have to track down the right person to ask!

early involvement with Eng project staff
designated Lead and Backup Ops Eng for each project
weekly(ish) project sync-ups for active projects, including Eng, Ops, & QA
internally open request queue, logs, procedures, documents, and schedules
public exposure where possible (should be the default unless there is an explicit concern)

Operational readiness

All our systems (hosts, networks, software and staff) should be ready to serve their function as required. Expected failures should have well understood and limited impact.

"ad hoc" assistance is never required for normal operation; any discovered requirement for ad hoc assistance is a Blocker-level problem
the system should be secure from external tampering
the system is efficient in terms of resources used
data is protected from expected failure (e.g. disk failure, network congestion)
any system can be reconstructed procedurally (meaning, no having to "figure out" how a system was built/config'd)

Privilege and Responsibility are coupled

This links the ability to ask for a change with the consequences of that change.

access to privileged commands is limited to responsible parties (i.e. "If you can't fix a meltdown, you can't press the Big Red Button")
all privileged access must be logged and must provide an audit trail
Engineering resources in the escalation tree
production checkout after planned (or unplanned) changes
direct pager coverage
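
One way to picture the first two points above is a wrapper that both checks responsibility and records every attempt. This is a hedged sketch only: the `RESPONSIBLE_PARTIES` set, the `privileged` decorator, and `restart_service` are hypothetical names, and a real audit trail would go to a durable, tamper-resistant store rather than an in-process logger.

```python
import functools
import logging

audit_log = logging.getLogger("audit")
audit_log.setLevel(logging.INFO)  # attach a handler to ship records elsewhere

# Hypothetical list of people who are on the hook for the consequences.
RESPONSIBLE_PARTIES = {"alice", "bob"}


def privileged(func):
    """Log every call attempt, then allow it only for responsible parties."""
    @functools.wraps(func)
    def wrapper(user, *args, **kwargs):
        allowed = user in RESPONSIBLE_PARTIES
        # Denied attempts are logged too, so the trail is complete.
        audit_log.info("user=%s command=%s allowed=%s", user, func.__name__, allowed)
        if not allowed:
            raise PermissionError(f"{user} may not run {func.__name__}")
        return func(user, *args, **kwargs)
    return wrapper


@privileged
def restart_service(user, name):
    # Placeholder for a real privileged action.
    return f"{name} restarted by {user}"
```

Logging before the permission check is a deliberate choice here: it means a refused request still leaves a record.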
