User:Djmitche/New Infrastructure Provisioning Cloud Synergy

This describes where we want to get to.

Domains

We will separate the domains of infrastructure (compute resource provision) and applications (doing builds, tests, releases, etc.). Examples of each:

RelOps

  • Network Support - Firewalls, DNS, DHCP, Nagios, NTP
    • I would like to use split-horizon DNS to direct staging systems to different hosts from production
    • Some of this is and will remain completely the domain of netops
  • Provisioning Support - Puppet, OPSI, etc., plus any cloud and virtualization software we use
  • Internal Services - RabbitMQ
  • Slavealloc

The last is an example of a service that is arguably in the application category, but which we would like to have span the buildbot environments. I think we will see more of these - releng-specific infrastructural tools that need to sit above the level of a single releng application or environment.

RelApps

(see also https://wiki.mozilla.org/ReleaseEngineering/Applications#App_Store)

  • Buildbot (masters and slaves)
  • Clobberer
  • BuildAPI
  • Regression Detection
  • Talos Web Servers
  • Signing Systems

Environments

We will need to supply infrastructure consistently to all buildbot environments - staging, preproduction, and production. But we will also need to be able to stage infrastructural changes without affecting any of the buildbot environments.

So we will have a RelOps-staging environment and a RelOps-production environment. RelOps-staging will run a *very* scaled-down version of each of the RelApps applications (one master and one slave of each type, for example). RelOps-production will contain the applications' production systems and, where applicable, their staging and preproduction systems.

That means a change confined to Buildbot can be staged entirely within RelOps-production; similarly, puppet, DNS, or slave-allocator changes can bake in RelOps-staging before affecting anything in RelOps-production. In general, RelOps tools should not distinguish application environments -- for example, all machines in RelOps-production talk to production puppet masters and the production slave allocator.

Cases where changes at the two levels absolutely must be coordinated are rare, and can usually be handled by temporarily configuring the production infrastructure to treat staging slaves and production slaves differently.

It's quite possible that we'll have multiple isolated production RelOps environments at some point - for example, one for add-ons testing.

Isolation

We will isolate environments using DNS scoping and, where necessary, VLANs, but not split-horizon DNS. That will require changing everything to use unqualified hostnames, and making those hostnames resolve properly based on the current DNS search path.
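
To illustrate (this is only a sketch, and the domain names are made up), the same unqualified hostname would resolve to a different host depending on the search domain each environment's resolver is configured with:

    # Minimal sketch with hypothetical per-environment search domains.
    # In reality the search path lives in the resolver config (e.g. /etc/resolv.conf),
    # not in application code.
    SEARCH_DOMAINS = {
        "relops-staging": "staging.releng.example.com",
        "relops-production": "build.releng.example.com",
    }

    def qualify(hostname, environment):
        """Return the FQDN an unqualified hostname would resolve to."""
        return "%s.%s" % (hostname, SEARCH_DOMAINS[environment])

    # The same config referring to "puppet1" lands on a different host per environment:
    assert qualify("puppet1", "relops-staging") == "puppet1.staging.releng.example.com"
    assert qualify("puppet1", "relops-production") == "puppet1.build.releng.example.com"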

Resource Provisioning

RelOps will provide compute resources to RelApps using a "black-box" model. The ideal is that machines are never manually touched. If three new w7 talos slaves are required, a request for three systems running the most recent "releng-talos-w7" image goes into the black box, and the three slaves pop out the other end in some reasonably short interval (minutes if we can automate it fully) and start doing builds. The same can happen for the "releng-buildmaster" image or the "releng-buildapi-webhead" image. Machines can be decommissioned, too.
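
As a rough sketch of the interface this implies (the function and argument names are hypothetical - nothing here exists yet):

    # Hypothetical black-box provisioning interface, per the description above.
    def provision(image, count, environment="relops-production"):
        """Request `count` systems running the latest build of `image`.

        Returns the new hostnames once the machines are up, imaged, and
        have taken their first configuration run.
        """
        raise NotImplementedError("the black box goes here")

    def decommission(hostnames):
        """Return systems to the pool; the hardware may be reused for any image."""
        raise NotImplementedError("the black box goes here")

    # Three new Windows 7 talos slaves, as in the example above:
    # slaves = provision("releng-talos-w7", count=3)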

The provisioning system should handle all image management. That means making sure that machines are running the latest and greatest configurations as often as possible - at boot for slaves, and periodically for masters. It also means balancing incremental versus pave-over-from-bare-metal methods of upgrading machine images. RelApps should not have any visibility into whether a system that was shut down a few minutes ago came back up on the same hardware, nor whether it was re-imaged while it was down.
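
A sketch of the refresh policy this implies; the role names and the 24-hour period for masters are assumptions, only the boot-time/periodic split comes from the description above:

    # Sketch only: a policy function deciding when a machine gets refreshed
    # (whether incrementally or by pave-over is the provisioning system's business).
    REFRESH_POLICY = {
        "slave": {"on_boot": True, "period_hours": None},
        "master": {"on_boot": False, "period_hours": 24},   # assumed interval
    }

    def needs_refresh(role, hours_since_last, booting):
        """Decide whether a machine should be brought up to date right now."""
        policy = REFRESH_POLICY[role]
        if booting and policy["on_boot"]:
            return True
        period = policy["period_hours"]
        return period is not None and hours_since_last >= period

    # Slaves refresh at boot; masters refresh on a schedule.
    assert needs_refresh("slave", hours_since_last=1, booting=True)
    assert not needs_refresh("slave", hours_since_last=100, booting=False)
    assert needs_refresh("master", hours_since_last=30, booting=False)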

This provisioning system will become our backup process and our failover process. If a BuildAPI webhead's hard drive explodes, we provision a new system with the "releng-buildapi-webhead" image, then fix the hard drive at our leisure. In fact, the automation may be able to do the first part automatically.
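
In terms of the hypothetical interface sketched under Resource Provisioning, failover is just another request (the hostname here is made up):

    # Hypothetical failover using the provisioning sketch above.
    # replacement = provision("releng-buildapi-webhead", count=1)
    # decommission(["buildapi-webhead1"])  # then fix the dead hardware at our leisure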

This process precludes *any* hand-crafting of any system - master, slave, whatever.

Source of Truth

The source of truth for what each environment should look like is embodied in the inventory (succinct description of each system) and the puppet manifests (detailed description of each image), along with a well-defined set of binary inputs (packages, ISOs, etc.).
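
For illustration only, an inventory entry might be as small as this (field names and values are hypothetical); everything about what the image actually contains stays in the puppet manifests:

    # Hypothetical inventory record: just enough to (re)build the machine,
    # with all image detail left to puppet and the binary inputs.
    EXAMPLE_ENTRY = {
        "hostname": "talos-r3-w7-001",        # made-up example name
        "image": "releng-talos-w7",           # which manifest-described image to apply
        "environment": "relops-production",
        "location": {"datacenter": "example-dc1", "rack": None},   # placeholders
    }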

Historical

We sometimes find the need to build particularly old branches - it's probably going to take a long time to kill 3.5.x, for example - so it would be good to be able to reproduce old images without too much trouble. That means keeping base OS images around, even if they are unused. It also means constructing the puppet manifests in such a way that we can still set up old systems - for example, installing a 3-year-old compiler toolchain, while still installing a modern copy of buildbot.

Externally Reproducible

This system needs to be at least basically reproducible for people outside of releng and even outside of MoCo. That means all of it, or at least a large part of it, should be publicly available and documented well enough to redeploy and keep up to date outside of our systems. Of course, passwords, keys, and licensed software need to be kept private.

Open Questions

This is still vague on a lot of points:

  • How do we enforce location diversity (e.g., different racks, different rows)?
    • Should a system request specify a datacenter?
  • How automated can we be?
    • Probably different for different OSes
    • Some cloud-like apps are available, e.g., Nebula
  • How much virtualization does this involve?
    • None for macs, maybe some (one VM per host) for linux/windows talos
  • How dynamic is this?
    • Not very - we anticipate all inputs to the black box are made by humans, at least in the short term, and only a few requests per week
  • Systems need to be re-imaged periodically, but also need to have some persistent data (objdirs, hg caches, ccache on slaves; master basedirs on buildbot masters)
    • Model on Amazon's EBS: some partitions are persistent, but root is built from scratch on every boot?
    • If we can reimage easily on every boot, then maybe we don't need to distinguish build from try slaves -> bigger silos
  • Mobile? (bear can probably answer this once the desktop stuff is rolling)

Terminology

Yeah, RelOps and RelApps sound similar. We can work that out.