Community Ops/paas/Backups

From MozillaWiki

Infrastructure backup

Assumptions

  • Most of our infrastructure is built on the idea of having as many immutable parts as possible:
    • Docker images
    • Marathon deployed apps
    • Software stack
      • Mesos
      • Marathon
      • Zookeeper
      • Consul checks
  • We should be comfortable with the loss of some of our EC2 instances
  • Our EC2-based infrastructure is highly available (HA)
  • A "backup" is a point-in-time copy of a service or resource
  • We should utilize AWS hosted services to avoid maintenance overhead
  • All the backups should be encrypted

Mutable parts

  • Persistent storage
    • EFS
  • Marathon app definitions
  • Chronos task definitions
  • Databases
  • Consul KV
  • WP sites

External dependencies & redundancy

At deploy time we should not rely on a single external (3rd party) service, because it is a SPOF that we don't control. We need redundant access to any data that lives in external dependencies.

  • Docker images

Backup implementation

EFS

  • Backup is going to live in S3/Glacier
  • Implement a script to do scheduled backups based on a backup tool
  • Deploy it on Chronos
  • Schedule policy
    • 7 times a week
      • Lives in S3
    • 4 times per month
      • Lives in Glacier
    • 12 times per year
      • Lives in Glacier
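
The schedule policy above could be sketched as a small helper that a Chronos job calls on every run to decide which backup tiers are due. This is only an illustration: it assumes that "7 times a week" means daily, "4 times per month" means weekly and "12 times per year" means monthly, and the choice of Sunday and the 1st of the month, as well as the name `backup_tiers`, are hypothetical.

```python
from datetime import date


def backup_tiers(day: date) -> list:
    """Return the (tier, storage target) pairs that are due on `day`."""
    tiers = [("daily", "S3")]  # 7 times a week, lives in S3
    if day.weekday() == 6:  # assumption: the weekly copy is taken on Sundays
        tiers.append(("weekly", "Glacier"))
    if day.day == 1:  # assumption: the monthly copy is taken on the 1st
        tiers.append(("monthly", "Glacier"))
    return tiers


if __name__ == "__main__":
    for tier, target in backup_tiers(date.today()):
        print(f"run {tier} EFS backup -> {target}")
```

With this shape a single daily Chronos job is enough, since the weekly and monthly copies are triggered from the same run.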

Marathon/Chronos definitions

  • Backup is going to live in a versioned S3 bucket
  • Implement a script to do scheduled backups using the Marathon/Chronos HTTP APIs
  • Deploy it on Chronos
  • Schedule policy
    • 7 times a week
    • 4 times per month
    • 12 times per year
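
A minimal sketch of such a script, assuming Marathon lists its apps at `GET /v2/apps` and Chronos lists its jobs at `GET /scheduler/jobs` (the Chronos path differs between versions, so check it against the deployed one). The host names and the function names are hypothetical, and the upload to S3 is left out.

```python
import json
import urllib.request
from datetime import datetime, timezone

MARATHON_URL = "http://marathon.example.internal:8080"  # hypothetical host
CHRONOS_URL = "http://chronos.example.internal:4400"  # hypothetical host


def http_get_json(url):
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def snapshot_definitions(get_json=http_get_json, now=None):
    """Collect all app and job definitions into one JSON document (bytes)."""
    now = now or datetime.now(timezone.utc)
    snapshot = {
        "taken_at": now.isoformat(),
        # Marathon wraps the list of apps in an "apps" key.
        "marathon_apps": get_json(MARATHON_URL + "/v2/apps")["apps"],
        # Assumption: this Chronos version returns a plain JSON list of jobs.
        "chronos_jobs": get_json(CHRONOS_URL + "/scheduler/jobs"),
    }
    return json.dumps(snapshot, indent=2, sort_keys=True).encode("utf-8")
```

Writing the result to the same object key on every run would let the versioned bucket keep the history, so the script itself does not need to manage file names.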

Databases

  • Already backed by RDS
  • Current policy
    • 7 times a week
  • Future policy
    • 7 times a week on RDS
    • 12 times a year on S3/Glacier

Consul KV

  • Backup is going to live in a versioned S3 bucket
  • Implement a script to do scheduled backups using the Consul HTTP API
  • Deploy it on Chronos
  • Schedule policy
    • 7 times a week
    • 4 times per month
    • 12 times per year
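
As a rough sketch of the Consul part: the KV store can be dumped with `GET /v1/kv/?recurse=true`, which returns a JSON list of entries whose `Value` field is base64-encoded. The host name and the function names below are hypothetical. The values are kept base64-encoded in the backup so that binary values survive unchanged.

```python
import json
import urllib.request

CONSUL_URL = "http://consul.example.internal:8500"  # hypothetical host


def fetch_kv(prefix=""):
    """Fetch every KV entry under `prefix` from the Consul HTTP API."""
    url = f"{CONSUL_URL}/v1/kv/{prefix}?recurse=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def to_backup_document(entries):
    """Keep only the fields needed to recreate each key, sorted by key."""
    kept = [
        {"Key": e["Key"], "Value": e["Value"], "Flags": e.get("Flags", 0)}
        for e in entries
    ]
    return json.dumps(sorted(kept, key=lambda e: e["Key"]), indent=2)
```

Sorting the keys and dropping index fields such as `ModifyIndex` keeps successive backups easy to diff.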

WP sites

  • Backup is going to live in S3
  • Use MainWP native backup functionality
  • Schedule policy
    • Once per week

3rd party services

Docker

  • Docker registry mirror
  • Maybe a hosted one
  • EC2 Container Registry (ECR) is not the best option, but it is hosted by AWS

Restoring from backup

Infrastructure

  • Ansible playbooks for config management
  • Terraform for resource management

Storage

  • Use the backup tool to revert to a point in time
  • Implementation
    • Native tool functionality

Marathon/Chronos/Consul

  • Redeploy the definition
  • Implementation
    • Write a script to populate the service definitions using HTTP API
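
One possible shape for that script, covering Marathon and Consul only: it builds `PUT /v2/apps/{id}` requests for app definitions and `PUT /v1/kv/{key}` requests for KV entries. The function names are hypothetical, and the sketch assumes the backup stores Consul values base64-encoded, as the Consul API returns them. App definitions read back from Marathon may contain read-only fields that have to be stripped before they are accepted again; that step is not shown.

```python
import base64
import json
import urllib.parse
import urllib.request


def marathon_restore_request(base_url, app):
    """Build a PUT request that creates or updates one Marathon app."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v2/apps/" + app["id"].lstrip("/"),
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def consul_restore_request(base_url, entry):
    """Build a PUT request that writes one KV entry back to Consul."""
    # Consul expects the raw value as the request body, not base64.
    value = base64.b64decode(entry["Value"]) if entry.get("Value") else b""
    key = urllib.parse.quote(entry["Key"])  # "/" is left unescaped
    flags = entry.get("Flags", 0)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/kv/{key}?flags={flags}",
        data=value,
        method="PUT",
    )
```

Each request can then be sent with `urllib.request.urlopen(req)`. Keeping request construction separate from sending makes a dry run possible before anything is written to the cluster.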

WP Sites

  • Native restore functionality in MainWP
  • Implementation
    • Native tool functionality

Databases

  • Restore from snapshot
  • Implementation
    • Native tool functionality