Identity/AttachedServices/StorageServiceArchitecture

Summary

This is a working proposal for the backend storage architecture of the PiCL server. It tries to take some of the good bits from the Firefox Sync backend, add in some lessons learned from running that in the field, simplify things a little, and make some adjustments towards stronger durability. It is far from final. All feedback welcome!

Goals

  • Scale to billions of users. Quickly. Easily.
  • Don't lose user data. Even if a machine dies. Mostly even if a meteor hits a data-center.
  • Provide a simple programming model to the client, and to the web application.
  • Provide a relatively simple and well-understood Ops environment.
  • Try to be low-cost, while maintaining acceptable levels of durability and availability.
  • Provide for on-going infrastructure experiments, refinements, and upgrades.


Boundary Conditions

  • Each user's data is completely independent; there's no need for queries that cross multiple user accounts.
    • This means that our data storage problem is embarrassingly shardable. Good times!
  • The client-facing API is strongly consistent, and exposes an atomic check-and-set operation.
  • Initial deployment will be into AWS.
  • It's OK to have brief periods of unavailability.
    • This is, after all, a background service. There's no user in the loop most of the time.
    • The user-agent will be expected to deal gracefully with server unavailability.
  • Ops would like the ability to move users onto different levels of infrastructure, depending on their usage profile.
    • For example, moving highly active users out of AWS and onto bare-metal hardware.
    • Or, moving inactive users off onto lower-cost storage.
    • Or, just experimenting with a new setup for a select subset of users.


Overview

Each user account will be assigned an opaque, immutable, numeric userid. This is only for internal reference, and client applications are not required to know it. It will only change if the user completely deletes and then re-creates their account.

We run one or more independent storage clusters. Each cluster is identified by a URL at which it speaks a common storage-server protocol. Different clusters may be implemented in vastly different ways and have different operational properties. For example, one might be using Cassandra, another might be using MySQL. But they all look the same from the outside.

Each user account is explicitly assigned to a particular cluster. This mapping is managed in a separate, high-availability system called the UserDB.

A user's cluster assignment might change over time, due to evolving infrastructure needs. For example, we might decommission a cluster and migrate all its users to a shiny new one. We will take responsibility for moving the data around during a migration.

Clients are responsible for discovering their assigned cluster and communicating with it using the common storage protocol. They must be prepared to re-discover their cluster URL if we happen to migrate the user to a different cluster.

Architecturally, the system winds up looking something like this:


            login handshake      +--------+
        +----------------------->| UserDB |<-------------------+
        |+-----------------------| System |   management api   |
        ||    cluster URL        +--------+                    |
        ||                                                     |
        ||                                                     |
        |v                                                     |
 +--------+   storage protocol   +----------------------+      |
 | client |<-------------------->| MySQL-Backed Cluster |<-----+
 +--------+                      +----------------------+      |
                                                               |
                                 +----------------------+      |
                                 | MySQL-Backed Cluster |<-----+
                                 +----------------------+      |
                                                               |
                                 +--------------------------+  |
                                 | Cassandra-Backed Cluster |<-+
                                 +--------------------------+


Making explicit allowance for different clusters gives us a lot of operational flexibility. We can transparently do things like:

  • Experiment with new storage backends in relative safety
  • Move heavy users onto a special cluster that's running on real hardware rather than AWS
  • Move light or inactive users onto a special cluster using slower-but-cheaper infrastructure


Having the client explicitly discover their cluster via a handshake means that we don't have to look up that information on every request, and don't have to internally route things to the correct location.

What the Client Sees

To begin a syncing session, the user-agent first "logs in" to the storage system, performing a handshake to exchange its BrowserID assertion for some short-lived Hawk access credentials. As part of this handshake, it will be told the base_url to which it should direct its storage operations.

For simple third-party deployments, the base_url will point back to the originating server. For at-scale Mozilla deployments, it will point into the user's assigned cluster.

In this example, the user has id "12345" and is assigned to the "mysql3" cluster:

   >  POST https://storage.picl.services.mozilla.com HTTP/1.1
   >  {
   >   "assertion": <browserid assertion>,
   >   "device": <device UUID>
   >  }
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "base_url": "https://mysql3.storage.picl.services.mozilla.com/storage/12345",
   <   "id": <hawk auth id>,
   <   "key": <hawk auth secret key>
   <  }


The client then syncs away by talking to this base_url via the as-yet-undefined sync protocol:

   >  GET https://mysql3.storage.picl.services.mozilla.com/storage/12345 HTTP/1.1
   >  Authorization:  <hawk auth parameters>
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "collections": {
   <     "XXXXX": 42,
   <     "YYYYY": 128
   <   }
   <  }


When the Hawk credentials expire, or when the user's cluster assignment is changed, it will receive a "401 Unauthorized" response from the storage server. To continue syncing, it will have to perform a new handshake and get a new base_url. In this example, the user has been re-assigned to the "cassandra1" cluster:

   >  GET https://mysql3.storage.picl.services.mozilla.com/storage/12345 HTTP/1.1
   >  Authorization:  <hawk auth parameters>
   .
   .
   <  HTTP/1.1 401 Unauthorized
   <  Content-Length: 0
   .
   .
   >  POST https://storage.picl.services.mozilla.com HTTP/1.1
   >  {
   >   "assertion": <fresh browserid assertion>,
   >   "device": <device UUID>
   >  }
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "base_url": "https://cassandra1.storage.picl.services.mozilla.com/storage/12345",
   <   "id": <hawk auth id>,
   <   "key": <hawk auth secret key>
   <  }
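
Putting the examples together, the client-side logic is roughly "use the cached base_url until it stops working, then re-handshake". Below is a minimal sketch in Python, assuming the requests library; the helper names (handshake, get_collections, hawk_header) are invented for illustration, and real Hawk request signing is stubbed out:

   import requests

   LOGIN_URL = "https://storage.picl.services.mozilla.com"

   def handshake(assertion, device_id):
       """Exchange a BrowserID assertion for Hawk credentials and a base_url."""
       r = requests.post(LOGIN_URL, json={"assertion": assertion, "device": device_id})
       r.raise_for_status()
       return r.json()  # {"base_url": ..., "id": ..., "key": ...}

   def hawk_header(session):
       # Placeholder only: a real client would compute a Hawk MAC over each
       # request using session["id"] and session["key"].
       return 'Hawk id="%s", ...' % session["id"]

   def get_collections(session, get_fresh_assertion, device_id):
       """Fetch collection info, re-doing the handshake if we get a 401."""
       for _attempt in range(2):
           r = requests.get(session["base_url"],
                            headers={"Authorization": hawk_header(session)})
           if r.status_code != 401:
               r.raise_for_status()
               return r.json()
           # Credentials expired, or the user was migrated to another cluster:
           # discard the cached session and perform a fresh handshake.
           session.update(handshake(get_fresh_assertion(), device_id))
       raise RuntimeError("still unauthorized after a fresh handshake")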


The UserDB System

The UserDB system contains the mapping of user account emails to userids, and the mapping of userids to clusters.

This component has a lot of similarity to the TokenServer from the Sync2.0 architecture:

 https://wiki.mozilla.org/Services/Sagrada/TokenServer
 https://docs.services.mozilla.com/token/index.html

However, we intend for it to manage a relatively small number of clusters, which each have their own internal sharding or other scaling techniques, rather than managing a large number of service node shards. We're also going to simplify some of the secrets/signing management, and are not trying to support multiple services from a single user account.

This system is not terribly write-heavy, but it contains very valuable data that must be kept strongly consistent: if we lose the ability to direct a user to the correct cluster, or if we send different devices to different clusters, the user is not going to be happy.

It also needs to be highly available for reads: if UserDB read capability goes down, clients lose the ability to "log in" across all clusters.

To keep things simple, reliable, and available, this will use a Multi-DC Replicated MySQL setup. It would be awesome if the write load is small enough to do synchronous replication here, using something like Galera Cluster:

 http://codership.com/content/using-galera-cluster

If not, then a standard master/slave setup should be OK, as long as we're careful not to give users stale cluster assignments.

Example schema:

   CREATE TABLE users (
     userid INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
     email VARCHAR(128) NOT NULL UNIQUE,
     clusterid INTEGER NOT NULL,
     previous_clusterid INTEGER
   );

Each user is assigned to a particular cluster. We can also track the cluster to which they were previously assigned, to help with managing migration of users between clusters.

   CREATE TABLE clusters (
     clusterid INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
     base_url VARCHAR(128) NOT NULL,
     assignment_weight INTEGER NOT NULL
   );

Each cluster has a base_url and an assignment_weight. When a new user account gets created, we randomly assign them to a cluster with probability proportional to the assignment_weight. Set it to zero to stop sending new users to a particular cluster.
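
As an illustrative sketch only (not a committed implementation), the weighted assignment could look something like this in Python; the pick_cluster helper is hypothetical, and its input mirrors the clusters table from the example schema above:

   import random

   def pick_cluster(clusters):
       """Pick a cluster for a new account, weighted by assignment_weight.

       `clusters` is a list of (clusterid, assignment_weight) rows, e.g. the
       result of: SELECT clusterid, assignment_weight FROM clusters
       Clusters with an assignment_weight of zero are never chosen.
       """
       ids = [cid for cid, weight in clusters]
       weights = [weight for cid, weight in clusters]
       # random.choices draws with probability proportional to the weights.
       return random.choices(ids, weights=weights, k=1)[0]

   # Example: cluster 2 gets roughly twice as many new users as cluster 1,
   # and cluster 3 gets none.
   print(pick_cluster([(1, 100), (2, 200), (3, 0)]))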

This service will need to have a user-facing API to support the login handshake dance, and some private management APIs for managing clusters, assignments, etc. Maybe even a nice friendly admin UI for the ops folks to use.
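
To tie the schema and the handshake together, here is a hypothetical sketch of the user-facing login endpoint. The db helper object, verify_assertion, and mint_hawk_credentials are invented names standing in for whatever we actually use; pick_cluster is the weighted-assignment sketch from above, and we assume the clusters.base_url column holds the cluster root, with the per-user path appended at login time:

   def handle_login(db, assertion, device_id):
       """Handle the login handshake: map an email to a cluster base_url."""
       email = verify_assertion(assertion)  # hypothetical BrowserID verification
       row = db.query_one("SELECT userid, clusterid FROM users WHERE email = %s",
                          email)
       if row is None:
           # New account: choose a cluster in proportion to assignment_weight.
           clusterid = pick_cluster(db.query_all(
               "SELECT clusterid, assignment_weight FROM clusters"))
           userid = db.insert("INSERT INTO users (email, clusterid) VALUES (%s, %s)",
                              email, clusterid)
       else:
           userid, clusterid = row
       cluster_url = db.query_one("SELECT base_url FROM clusters WHERE clusterid = %s",
                                  clusterid)[0]
       hawk_id, hawk_key = mint_hawk_credentials(userid, device_id)  # hypothetical
       return {"base_url": "%s/storage/%s" % (cluster_url.rstrip("/"), userid),
               "id": hawk_id,
               "key": hawk_key}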

Types of Cluster

We'll likely start with a single cluster into which all users are assigned. But here are some ideas for how we could implement different types of cluster with different performance, costs, tradeoffs, etc.

Massively-Sharded MySQL

One of the leading options for storage is a massively-sharded MySQL setup, taking advantage of the highly shardable nature of the data set. This is essentially the storage architecture underlying Firefox Sync, but we could make a lot of operational improvements.

Details here: MySQL Storage Cluster

Basic principles:

  • Each user is transparently mapped to a shard via e.g. consistent hashing (see the sketch after this list).
  • All reads and writes for a shard go to a single master MySQL database, to avoid consistency headaches.
  • Each master synchronously replicates to one or more hot standby dbs in the same DC, to guard against individual machine failure.
  • One of the standby dbs is periodically snapshotted into S3, to guard against data loss if the whole DC goes down.
  • There is no cross-DC replication; if the DC goes down, the cluster becomes unavailable and we might have to restore from S3.
  • All sharding logic and management lives in a stand-alone "db router" process, so that it's transparent to the webapp code.
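
For illustration only, here is a tiny consistent-hash ring in Python showing how a userid could be mapped to a shard. The shard names and virtual-node count are made up, and a real deployment might instead use an off-the-shelf router or a vbucket-style lookup table:

   import bisect
   import hashlib

   class HashRing:
       """Minimal consistent-hash ring: userids map to shards, and adding or
       removing a shard only moves a small fraction of users."""

       def __init__(self, shards, vnodes=100):
           ring = []
           for shard in shards:
               for i in range(vnodes):
                   ring.append((self._hash("%s#%d" % (shard, i)), shard))
           ring.sort()
           self._ring = ring
           self._points = [point for point, _shard in ring]

       @staticmethod
       def _hash(key):
           return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)

       def shard_for(self, userid):
           # Find the first point on the ring at or after the userid's hash,
           # wrapping around to the start of the ring if necessary.
           idx = bisect.bisect(self._points, self._hash(userid)) % len(self._ring)
           return self._ring[idx][1]

   ring = HashRing(["shard-%d" % i for i in range(8)])
   print(ring.shard_for(12345))  # e.g. "shard-3"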


There's a commercial software product called "ScaleBase" that implements much of this functionality off the shelf. We should start there, but keep in mind the possibility of a custom dbrouter process.

Pros: Well-known and well-understood technology. No-one ever got fired for choosing MySQL.

Cons: Lots of moving parts. MySQL may not be very friendly to our write-heavy performance profile.

Cassandra Cluster

Another promising storage option is Cassandra. It provides a rich-enough data model and automatic cluster management, at the cost of eventual consistency and the vague fear that it will try to do something "clever" when you really don't want it to. To get strong consistency back, we'd use a locking layer such as ZooKeeper or memcached.

Details here: Cassandra Storage Cluster

Basic principles:

  • There is a single Cassandra storage node cluster fronted by the usual array of webhead machines.
  • It uses a replication factor of 3, QUORUM reads and writes, and all nodes live in a single datacenter.
  • The webheads also have a shared ZooKeeper or memcached install, which they use to serialize operations on a per-user basis (sketched below).
  • Cassandra is periodically snapshotted into S3 for extra durability.
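
As a rough sketch of the per-user serialization idea, assuming we go the memcached route and a client library that offers add()/delete() with the usual semantics (key names and timeouts below are invented for illustration):

   import time
   from contextlib import contextmanager

   @contextmanager
   def user_lock(mc, userid, ttl=30, wait=5.0):
       """Serialize operations for one user by using memcached add() as a mutex.

       add() only succeeds if the key does not already exist, so whichever
       webhead adds the key first holds the lock; the TTL guards against a
       crashed holder never releasing it.
       """
       key = "lock:user:%s" % userid
       deadline = time.time() + wait
       while not mc.add(key, "1", ttl):
           if time.time() > deadline:
               raise RuntimeError("could not obtain lock for user %s" % userid)
           time.sleep(0.05)
       try:
           yield
       finally:
           mc.delete(key)

   # Usage on a webhead, e.g. around a Cassandra read-modify-write:
   #   with user_lock(memcached_client, 12345):
   #       apply_changes_for_user(12345)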


Pros: Easy management and scalability. Very friendly to write-heavy workloads.

Cons: Unknown and untrusted. Harder to hire expertise. Eventual consistency scares me.

Hibernation Cluster

If a user doesn't use the service in, say, six months, then we could migrate them out of one of the active clusters and into a special "hibernation cluster".

Data that is moved into this cluster might simply be snapshotted into low-cost storage such as S3. Or it might get put onto a very crowded, very slow MySQL machine that can only handle a trickle of user requests.

If they come back and try to use their data again, we immediately trigger a migration back to one of the active clusters.

Pros: Massive cost savings.

Cons: Have to actually monitor usage and implement this.

Things To Think About

  • There's a bit of management overhead in the API, with the handshake etc. We could consider factoring that out and just doing the routing internally. But there's something to be said for explicitness.
  • We could avoid the client having to be "cluster aware" by caching the cluster-assignment details in their Hawk Auth credentials. This would simplify the client somewhat, but complicate the server because we'd have to route each request to its appropriate end-point internally.
  • We need a detailed and careful plan for how to migrate users from one cluster to another. Very doable, just fiddly and potentially quite slow.