Web Operations/Reference Specification
- 1 Working Model
- 2 Overview
- 3 What problems are we trying to solve?
- 4 How will we solve the problems?
- 4.1 Provide service matrix
- 4.2 Provide a centralized point for deploying Applications
- 4.3 Follow puppet roles and profiles model
- 4.4 Use librarian-puppet & r10k for managing puppet modules and environments
- 4.5 Provide application isolation
- 4.6 Use API compliant YAML file to describe applications
- 4.7 Implement everything on a compliant cloud API
- 4.8 Applications will have the ability to autoscale
- 4.9 There will be no SLAs for applications that do not align with this specification
- 4.10 Provide best practice advice for deploying applications in our environment
- 4.11 Provide a deployment system that is easy to use and configure
- 4.12 Provide segregation strategy for application and system configuration
- 4.13 Provide standard monitoring suite
- 4.14 Provide users ready access to logs
- 4.15 Provide an automatic backup solution
- 4.16 Submitting new technology for consideration
- 5 What technical solutions are we proposing?
This is currently a rough draft document being developed by a small working group within the WebOps team. We are working on an iteration model where each iteration opens the discussion to a wider audience. As soon as we have progressed to a defined point we will open this discussion up to the next wider audience based on an iteration plan. Please keep in mind that this is a fluid plan and quite subject to change. Finally, we do not currently have a timeline other than that we are working as fast as we can.
- Working group
- We will move to the next iteration as soon as we have a working, repeatable prototype application.
- The WebOps team
- We will move on as soon as the team reaches a consensus on the general layout and direction.
- This should be a quick iteration as the team will be consulted heavily during the first iteration.
- Other teams within IT
- During this iteration we will be working with various teams within IT to help us solve technical problems for which they are the subject matter experts.
- We will be solving numerous technical problems such as:
- Automatic provisioning of VMs
- Automation of firewall rules (netflows)
- Designing a backup solution (or working with an existing solution if it is defined by that time)
- Web developers within MoCo
- Our primary goal here is to ensure that we are developing a system that will serve the needs of our developers while providing them a framework they are willing (and hopefully happy) to work in.
- All Mozillians
- We will open the entire plan for comment to the community. While we are currently unaware of how our decisions and actions here will affect community contributors, we want to be sure that we are not overlooking anything.
- Further, we are hopeful that this project might provide some inroads for community contributors to take a more active role within IT. Specifically:
- We are striving to use open source solutions
- We are developing everything possible in the open
- This entire project is documented on the Mozilla wiki (where you are now!)
- The entire Puppet tree is available on Github here (will be relocated to Mozilla GitHub URL pre-production)
This document is designed to construct a framework which will drive automation and address the process and technology issues faced by Mozilla IT in general and the WebOps team specifically. This document is broken into three distinct sections: section one identifies the problems we are trying to solve; section two delineates decisions around how we intend to solve these problems; section three identifies the technical solutions we are choosing to implement the decisions. This document is primarily concerned with the first two sections; however, a number of the problems have ready-made technical solutions, so a number of technical solutions are presented herein.
The overarching issues we are addressing are:
- It takes too long to deploy new applications.
- There is no clear line dividing what is part of the application and what is part of the underlying infrastructure.
- Application requirements are delivered in a non-standard fashion leading to hard-to-support snowflakes.
- Developers do not have easy access (self-service) to deploying on our infrastructure.
- There is no well defined IT process for evaluating new technology to be offered to and used by developers.
Notes and Definitions
- Private vs. Public cloud - This document is not attempting to take a stance on this debate. It is, however, built on the assumption that any implementation of this specification will be deployed in a service-based (??aaS) environment.
- Standard Cloud - This is the idea that all clouds (on or off premises) essentially offer the same set of services. Further, these services are typically offered behind nearly identical cloud API schemas.
Reference Specification Parts
The Cloud API
We are dividing the entire IT/Application stack into two distinct sub-stacks divided by a single API. The bottom part of the stack is all of the infrastructure managed by IT. The upper half of the stack is the application and the application's configuration. We are calling the division between these two halves the "cloud API", which will eventually manifest itself as a programmatic communication channel between the lower and upper halves of the stack. This communication channel will be used by consumers (developers) to describe their applications. It will also be used by IT to expose the infrastructure for consumption. This creates a clean line under which IT can develop service offerings without the constraints implied when designing for a particular application.
A note about open source software and cloud standards
We hold as a basic tenet of this design process that we should not reinvent any existing solutions. This assumes that a given solution is open source and has an active community around it. Provided these two criteria are met, it would be contrary to progress and reusability not to take advantage of them. As such we are proposing that we use, to the greatest extent possible, ready-made technologies. This will enable us to reduce our "time to market" as well as allow us to give back to the community at large.
Additionally we believe that we should develop our new system in the open. By default all documentation, puppet modules, configuration systems, and the like should be open. For example, we will be hosting all of our puppet modules on github. The only exceptions to this will be for secrets and PII (Personally Identifiable Information), however these should be defined in such a way that the items requiring these secrets can themselves be open.
We are collecting various technical discussions in a series of Discussion Pages. The intent of these pages is to capture pros and cons of various technical pieces for future reference. Please feel free to add to existing pages or create new discussion pages as topics warrant.
What problems are we trying to solve?
- Little use of standards based technologies
- Practically no usage of pre-developed open source puppet modules
- Shared infrastructure which leads to issues such as:
- Security: The more applications on the infra the larger the attack surface
- Upgrades: Upgrading a library for one application may mean upgrading for all apps
- Scalability: If one application needs to scale, all other applications on the shared infra will scale too
- Uptime: If one application brings down the infrastructure, all applications come down
- Lack of standardization leading to deployment difficulties
- No service catalog for developers to plan against
- No easily deployable development environments which mimic production
- Lack of best practices around multi-tenant applications
- Puppet modules do not work in a distributed environment
- No development environment for module testing
- No per module permissions model
- Workflow leads to unusable revision history
- Lack of template usage leads to duplication of work and errors
- No centralized deployment point (lack of a “cloud API” like model)
- No standardized boundary between what developers manage and what IT manages.
- Manually creating things like netflows and vlans (we call this the Bugzilla or Human API)
- Custom compiled packages (RPMs) are difficult to maintain
- No simple deployment mechanism for applications
- Things are manually configured on production instances
- Where configuration information is stored is ambiguous
- application configuration / application passwords
- system configuration / system passwords
- Lack of cohesive monitoring system
- Developers purchase new domains which leads to delays during deployment
- No log aggregation system / developers have poor access to logs
- Backups are not automatic and are often missing altogether
- Lack of or out of date documentation
How will we solve the problems?
Provide service matrix
The service matrix is a table which describes the services that WebOps provides as standard. This implies that these services are all available through the cloud API. This will be a simple listing where developers can go to help guide their technology decisions. This list will link to the git repository for the back end puppet module (BEM), the front end puppet module (FEM) and the virtual front end module (VFEM) which collectively represent a service.
If a technology is not listed in the service matrix it will not be supported. The technology can be submitted for consideration or run unsupported. If the unsupported option is chosen, a discussion should be initiated with WebOps prior to developing on that technology to discuss potential pitfalls.
Provide a centralized point for deploying Applications
We will provide a centralized place for describing and deploying applications. This will be made up of two pieces: the cloud API and the PaaS. The intent of the PaaS is to provide a simple, low barrier of entry for rapidly deploying simple applications. The cloud API will be used to describe any application that is too complex for the PaaS or where the developers desire a higher level of control. It is worth noting that the PaaS will be built on top of the cloud API.
Follow puppet roles and profiles model
This is covered at length in a series of blog posts by Gary Larizza, but it is summarized below. We intend to take this model wholesale and rework our puppet layout to fit. The one role per machine classification is critical for enabling our applications to autoscale and be cloud deployable.
- Roles abstract profiles
- Enables the ability to autoscale
- Provides easy host classification
- Each node is classified by a single role
- Example: project_a_web_node (includes python_app & mediawiki profiles)
- Profiles abstract component modules
- Logically group and configure isolated and re-usable component modules
- Should only include user defined modules and no puppet built-ins
- Hiera lookups occur only in profile definitions
- Many profiles can compose a single role
- Example: mediawiki (Includes apache & mysql & memcache modules)
- Example: python_app (Includes apache & mysql & memcache modules)
- Component modules abstract resources
- Contains built-in resource definitions and no modules
- Must be parameterized
- Many component modules can compose a single profile
- Example: apache (includes package & file resources)
- Example: mysql (includes package & file resources)
- Resources abstract the underlying OS implementation
- Many resources compose a component module
- Example: user, file, package, etc…
- Hiera abstracts configuration data
- Some may come from the MetaData API
- Example: A mysql password
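The layering above can be sketched in Puppet as follows; the role, profile, and class names are illustrative assumptions, not part of this specification.

```puppet
# roles/manifests/project_a_web_node.pp -- exactly one role per machine
class role::project_a_web_node {
  include profile::python_app
  include profile::mediawiki
}

# profiles/manifests/mediawiki.pp -- profiles wire component modules
# together and are the only place Hiera lookups occur
class profile::mediawiki {
  $db_password = hiera('profile::mediawiki::db_password')

  class { 'apache': }
  class { 'mysql::server':
    root_password => $db_password,
  }
  class { 'memcached': }
}
```

Note that the role contains only `include` statements and the profile contains only component-module declarations plus Hiera lookups, which is what keeps each layer swappable.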
The following are puppet requirements that must be followed for this specification to function:
- Everything must be configured in puppet, without exception
- Modules should not define other modules but they *can* depend on other modules
- No modules should be defined in the root puppet repository
- Each module should have its own repository
- Modules will set all defaults in the ::params class
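The `::params` convention referenced above typically looks like the following sketch (the module and parameter names are hypothetical):

```puppet
# apache/manifests/params.pp -- all defaults live here
class apache::params {
  $package_name = 'httpd'
  $service_name = 'httpd'
}

# apache/manifests/init.pp -- parameterized, defaulting to ::params
class apache (
  $package_name = $apache::params::package_name,
  $service_name = $apache::params::service_name,
) inherits apache::params {
  package { $package_name:
    ensure => installed,
  }
  service { $service_name:
    ensure  => running,
    require => Package[$package_name],
  }
}
```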
Use librarian-puppet & r10k for managing puppet modules and environments
We will use librarian-puppet for managing all puppet modules. Librarian puppet uses a “Puppetfile” that lists which puppet modules are to be installed into the “modules/” directory. Modules themselves are stored individually in remote repositories (Github, puppetforge, GitMo) and are linked to explicitly in the Puppetfile. Since modules themselves can depend on other modules listed in their own Puppetfile, librarian-puppet is in charge of resolving dependencies.
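A Puppetfile along these lines declares where each module comes from (the module names and URLs below are illustrative):

```ruby
# Puppetfile -- consumed by librarian-puppet to populate modules/
forge "https://forgeapi.puppetlabs.com"

# A module from the Puppet Forge, pinned to a version
mod 'puppetlabs-apache', '1.1.1'

# Modules from individual git repositories (hypothetical URLs)
mod 'profiles',
  :git => 'https://github.com/example/puppet-profiles.git'
mod 'roles',
  :git => 'https://github.com/example/puppet-roles.git',
  :ref => 'v0.1.0'
```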
Puppet environments are critical for obtaining the ability to test changes in our puppet code. To manage environments we are using r10k to automatically create an environment for every git branch in our puppet repository. Each environment is available to be tested on a specific and targeted node.
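A minimal r10k configuration for this workflow might look like the following (the paths and remote URL are assumptions):

```yaml
# /etc/r10k.yaml -- one environment per git branch of the root repository
:cachedir: '/var/cache/r10k'
:sources:
  :webops:
    remote: 'https://github.com/example/puppet.git'
    basedir: '/etc/puppet/environments'
```

With this in place, a topic branch such as `fix_apache` becomes an environment of the same name that can be exercised on a single node with `puppet agent -t --environment fix_apache`.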
Provide application isolation
We will provide a vlan or network for each application which will address the shared infrastructure problem mentioned above. This, in conjunction with automated firewall rules, will enable us to autoscale applications and provide increased security.
Use API compliant YAML file to describe applications
Every application should be delivered with a YAML file describing the underlying infrastructure it requires. This file will contain a description of every machine the application needs, where each instance stanza has a value describing its "puppet role" (e.g. project_a_web_node). By tagging the machine with a "puppet role", the machine will know how to configure itself and whether or not the machine instance's role is eligible for autoscaling.
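As a hypothetical illustration, such a description could take the shape of a HOT-style template like the one below; every name and property here is an assumption, not a committed schema.

```yaml
heat_template_version: 2013-05-23
description: Hypothetical two-node description for project_a

resources:
  project_a_web:
    type: OS::Nova::Server
    properties:
      image: ubuntu_14.04_x86_3.0.2
      flavor: m1.small
      metadata:
        # The role tag the instance uses to classify and configure itself
        puppet_role: project_a_web_node

  project_a_db:
    type: OS::Nova::Server
    properties:
      image: ubuntu_14.04_x86_3.0.2
      flavor: m1.medium
      metadata:
        puppet_role: project_a_db_node
```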
Implement everything on a compliant cloud API
We must ensure that anything we build on premises is cloud API compliant. Most cloud service providers offer a standards compliant API. This is a new and evolving standard; however, we should strive to remain compliant with everything we build. This will enable portability between all the technologies we intend to utilize, including PaaS (Stackato), IaaS (private cloud), AWS, etc.
Applications will have the ability to autoscale
We will provide a method by which applications can transparently autoscale. Applications will be able to describe their requirements for both resource minimum and maximum. Autoscaling itself will be triggered by an underlying monitoring system. We will use the puppet roles and profiles pattern so that new nodes can easily be turned up or down.
There will be no SLAs for applications that do not align with this specification
Going forward we will not provide support or SLAs for any technology that is not defined in our service matrix. This does not mean that this technology is in any way banned or discouraged, it only means that IT will not be providing any resources for support or troubleshooting. This goes hand in hand with the process for submitting new technology for matrix inclusion consideration. If you are deviating from this specification and knowingly incurring technical debt then it will require a conversation.
Provide best practice advice for deploying applications in our environment
We will create a wiki page on wiki.m.o with handy tips for developing applications. This document is intended to be collaborative and be updated by developers as well as IT. This document will primarily be targeted at new developers, but should also be useful as a reference for seasoned developers.
Provide a deployment system that is easy to use and configure
We will provide a simple framework by which Developers can push code updates to their applications. Ideally this system will work across all cloud environments. There will be a centralized web portal through which developers can trigger updates.
The configuration for this system will consist of two files: a first-run script and an update script. Applications should supply a single first-run script that initializes any application specific data (e.g. initial db data). Applications must supply a single update script, which is run once per deploy for the entire application every time the application is updated.
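As a rough sketch of how those two hooks could interact, the demo below guards the first-run script with a marker file so it executes exactly once while the update script runs on every deploy; all paths and names here are assumptions, not the specification.

```shell
#!/bin/sh
# Demo of the two deploy hooks. Paths are hypothetical.
APP_ROOT="/tmp/demo-app"
MARKER="$APP_ROOT/.first_run_done"

rm -rf "$APP_ROOT"          # start clean for this demo only
mkdir -p "$APP_ROOT"

first_run() {
    # One-time initialization, e.g. loading initial db data
    echo "seeding initial data" >> "$APP_ROOT/deploy.log"
    touch "$MARKER"
}

update() {
    # Runs once per deploy for the entire application
    echo "running migrations" >> "$APP_ROOT/deploy.log"
}

deploy() {
    [ -f "$MARKER" ] || first_run
    update
}

deploy   # first deploy: first_run then update
deploy   # second deploy: update only
cat "$APP_ROOT/deploy.log"
```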
Provide segregation strategy for application and system configuration
There are numerous types of configuration data that need to be stored throughout the entire stack. We have divided these into four types: static configs, system level configs, application configs, and application secrets. We will provide specific ways to store each of these data types.
Provide standard monitoring suite
We will provide standard surface monitoring of all underlying technology in a standard format. For example, when an application includes a MySQL service, monitoring appropriate for that service will automatically be enabled. Escalation roles and structure will be built in to the auto-deployed tests. We will additionally provide hooks which enable users to attach monitoring to their test suite for deep, functional monitoring. This also provides easy access for QA integrated testing. Users will be able to customize where alerts go, and this may eventually be tied into SLA levels of service.
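For illustration, auto-enabled surface monitoring of a MySQL service could amount to generating a standard Nagios service definition like the fragment below (the host and contact group names are hypothetical; `check_mysql` is the stock Nagios plugin):

```text
define service {
    use                   generic-service
    host_name             project-a-db1
    service_description   MySQL
    check_command         check_mysql
    contact_groups        webops-oncall   ; escalation target, customizable per app
}
```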
Provide users ready access to logs
We will provide users a simple way to view logs.
Provide an automatic backup solution
We will provide an automatic backup system which will backup user data automatically. This will assume that users are deploying in accordance with the best practices guidelines. Additionally there will be a way for users to describe custom backup locations and schedules.
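A custom backup description could be as simple as a small YAML stanza; the schema below is purely hypothetical and shown only to indicate the intended level of effort for users.

```yaml
# Hypothetical per-application backup overrides
backups:
  - path: /data/project_a/uploads
    schedule: daily
    retention: 30d
  - path: /data/project_a/db_dumps
    schedule: hourly
    retention: 7d
```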
Submitting new technology for consideration
We will provide a way for users to submit new technologies for inclusion in the service matrix. There will need to be a demonstrated business case and likely multiple requesters. The technology would then need to be built in a repeatable test environment where it could be evaluated for maintainability and cost of operation at scale. We could then have a discussion around whether we would support the technology with a standard SLA. The new service will need to be provided with a puppet role, likely a single module, which can be included in a profile. This will need to be a repository on git that can be pulled in and tested in a development environment.
What technical solutions are we proposing?
By virtue of making the decisions in section two it is necessary to propose certain technology solutions. Nonetheless, section three is by no means complete; there are many technical solutions yet to be selected. This will require additional time in discovery and testing of solutions before final recommendations can be made.
A number of technology decisions come with discussions and rationalizations. In an attempt to capture this we have created a collection of Discussion Pages. Many are also linked as footnotes herein.
We have started a Platform Blueprint document that begins to describe the architecture in detail.
- Librarian-puppet & r10k
- Use Puppet Roles and Profiles model
- Cloud based infrastructure
- HOT (Heat Orchestration Template) for the cloud API YAML file
- Should work for managing AWS resources as well as OpenStack resources
- Solves autoscaling
- Centralized service description for each application
- Captain / Shove
- Likely replacement for Chief
- Easy to set up basic Nagios and/or New Relic monitoring… is that enough?
- How would we tie into MOC monitoring? AFAIK they don’t have a dashboard yet. :(
- Laura would like to integrate QA testing with monitoring somehow
- Best Practice for creating status pages that Nagios (or Nimsoft) can check
- Existing tape-backup solution is not automated and ownership is poorly defined
- AWS S3/Glacier?
- NetApp is HA, but very expensive- not cost-effective by itself
- Configuration settings for applications will be??
- The HOT template from production will be used to deploy development environments
- There will be a puppet root repository. It will:
- Contain all modules (managed by librarian-puppet)
- Contain all profile definitions
- Contain all role definitions
- Will have tagged releases
- A golden image will be built that contains:
- The base OS and everything that comes with that
- Nothing that would be installed via the puppet tree
- Except for git (needed for bootstrapping)
- The checked out contents of the root puppet tree (at a certain tag)
- The tag needs to be specified when the image is being built
- A sane version of hiera
- A puppet client
- It doesn't matter which version as the root puppet repository will install its own version
- An init script (described below)
- When a machine is spun up it will do the following via an init script:
- Introspect the nova-metadata-api to find:
- Its puppet role
- Where ZooKeeper is (possibly? Maybe we should use a global DNS name)
- Might use hiera backed by ZooKeeper
- Run puppet, applying the correct puppet role
- (v1) Hiera will be configured to use a yaml file pulled in via the clone of the root puppet tree. Secrets will be stored in a git submodule
- (v2) Hiera will look to ZooKeeper for configuration data
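The boot sequence above might reduce to an init script along the following lines. For the demo the metadata is read from a local file rather than the nova metadata service, and all names and paths are assumptions.

```shell
#!/bin/sh
# First-boot sketch: discover the node's puppet role from instance
# metadata, then apply that single role. On a real instance the JSON
# would come from http://169.254.169.254/openstack/latest/meta_data.json.
META_FILE="/tmp/meta_data.json"

# Stand-in for the metadata service (demo only)
cat > "$META_FILE" <<'EOF'
{"meta": {"puppet_role": "project_a_web_node"}}
EOF

# Extract the role tag; a real script would likely use jq or python
ROLE=$(sed -n 's/.*"puppet_role": *"\([^"]*\)".*/\1/p' "$META_FILE")
echo "classifying node as role: $ROLE"

# Apply the role (command shown but not executed in this demo):
# puppet apply -e "include role::${ROLE}"
```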
- The yaml file will look something like this:
- machine_role: foo_webserver
- image: ubuntu_14.04_x86_3.0.2
- The image id follows the following format:
- This format would allow us to easily write a wrapper script that first checks for whether this image exists in glance, and if it doesn't, ask an api to build the image and upload it to glance.
- What prevents a node from applying a role from a different project?
- Profiles have hiera lookups in them, which makes them only work for one project at a time.
- To reuse profiles we should proxy them to mozprofile