Identity/AttachedServices/StorageServiceArchitecture

Summary

This is a working proposal for the backend storage architecture of PiCL server. It tries to take some of the good bits from the Firefox Sync backend, add in some lessons learned from running that in the field, simplify things a little, and make some adjustments towards stronger durability. It is far from final. All feedback welcome!

Goals:

  • Scale to billions of users. Quickly. Easily.
  • Don't lose user data. Even if a machine dies. Even if a meteor hits a data-center.
  • Provide a simple programming model to the client and to the web application.
  • Provide a relatively simple and well-understood Ops environment.
  • Try to be low-cost, while maintaining acceptable levels of durability and availability.
  • Provide for on-going infrastructure experiments, refinements, and upgrades.


Boundary Conditions:

  • Each user's data is completely independent; there's no need for queries that cross multiple user accounts.
    • This means that our data storage problem is embarrassingly shardable. Good times!
  • The client-facing API is strongly consistent, and exposes an atomic check-and-set operation.
    • This makes an eventually-consistent NoSQL store rather less attractive, unless coupled with a strongly-consistent control layer, e.g. ZooKeeper.

  • Initial deployment will be into AWS.
    • Assuming PiCL succeeds in replacing sync, we can probably subsume some of the sync hardware over time.
  • It's OK to have brief periods of unavailability
    • This is, after all, a background service. There's no user in the loop most of the time.
    • The user-agent will be expected to deal gracefully with server unavailability.
  • Ops would like the ability to move users onto different levels of infrastructure, depending on their usage profile
    • For example, moving highly active users out of AWS and onto bare metal hardware.
    • Or, moving inactive users off onto lower-cost storage.
    • Or, just experimenting with a new setup for a select subset of users.


Basic Principles:

  • Each user account gets an opaque, immutable user id.
    • This will only change if they completely delete and then re-create their account.
  • Each user account is explicitly assigned to a particular cluster.
    • Each cluster is a stand-alone piece of infrastructure with no links to other clusters.
    • Each cluster is responsible for its own durability, replication, scalability and so-on.
  • Each cluster is identified by a URL, at which it speaks a common protocol.
    • Different clusters may have different underlying technologies, e.g. one may be MySQL, one may be Cassandra.
    • But they all look the same from the outside.
  • A user's cluster assignment might change over time; this migration will require careful management.
    • This would be fairly infrequent, however.
  • The user-account and cluster-mapping information lives in a stand-alone piece of infra, the "userdb".


Architecturally, the system winds up looking something like this:


            login handshake      +--------+
        +----------------------->| UserDB |<-------------------+
        |+-----------------------| System |   management api   |
        ||    cluster URL        +--------+                    |
        ||                                                     |
        ||                                                     |
        |v                                                     |
 +--------+   storage protocol   +----------------------+      |
 | client |<-------------------->| MySQL-Backed Cluster |<-----+
 +--------+                      +----------------------+      |
                                                               |
                                 +----------------------+      |
                                 | MySQL-Backed Cluster |<-----+
                                 +----------------------+      |
                                                               |
                                 +--------------------------+  |
                                 | Cassandra-Backed Cluster |<-+
                                 +--------------------------+


What the Client Sees

To begin a syncing session, the user-agent first "logs in" to the storage system, performing a handshake to exchange its BrowserID assertion for some short-lived Hawk access credentials. As part of this handshake, it will be told the base_url to which it should direct its storage operations.

For simple third-party deployments, the base_url will point back to the originating server. For at-scale Mozilla deployments, it will point into the user's assigned cluster.

In this example, the user has id "12345" and is assigned to the "mysql3" cluster:

   >  POST https://storage.picl.services.mozilla.com HTTP/1.1
   >  {
   >   "assertion": <browserid assertion>,
   >   "device": <device UUID>
   >  }
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "base_url": "https://mysql3.storage.picl.services.mozilla.com/storage/12345",
   <   "id": <hawk auth id>,
   <   "key": <hawk auth secret key>
   <  }


The client then syncs away by talking to this base_url via the as-yet-undefined sync protocol:

   >  GET https://mysql3.storage.picl.services.mozilla.com/storage/12345 HTTP/1.1
   >  Authorization:  <hawk auth parameters>
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "collections": {
   <     "XXXXX": 42,
   <     "YYYYY": 128
   <   }
   <  }
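

For concreteness, here's a rough Python sketch of the client side of this flow. It uses the requests and mohawk libraries and made-up function names; none of this is mandated by the proposal, and the real user-agent will of course be browser code rather than Python:

   import requests
   from mohawk import Sender

   HANDSHAKE_URL = "https://storage.picl.services.mozilla.com"

   def login(assertion, device_uuid):
       """Exchange a BrowserID assertion for Hawk credentials and a base_url."""
       resp = requests.post(HANDSHAKE_URL, json={
           "assertion": assertion,
           "device": device_uuid,
       })
       resp.raise_for_status()
       return resp.json()  # {"base_url": ..., "id": ..., "key": ...}

   def get_collections(session):
       """Fetch collection versions from the assigned cluster, signing with Hawk."""
       creds = {"id": session["id"], "key": session["key"], "algorithm": "sha256"}
       sender = Sender(creds, session["base_url"], "GET", content="", content_type="")
       resp = requests.get(session["base_url"],
                           headers={"Authorization": sender.request_header})
       if resp.status_code == 401:
           # Credentials expired, or the user was moved to another cluster;
           # the caller should re-do the login handshake and retry.
           raise PermissionError("handshake required")
       resp.raise_for_status()
       return resp.json()["collections"]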


When the Hawk credentials expire, or when the user's cluster assignment is changed, it will receive a "401 Unauthorized" response from the storage server. To continue syncing, it will have to perform a new handshake and get a new base_url. In this example, the user has been re-assigned to the "cassandra1" cluster:

   >  GET https://mysql3.storage.picl.services.mozilla.com/storage/12345 HTTP/1.1
   >  Authorization:  <hawk auth parameters>
   .
   .
   <  HTTP/1.1 401 Unauthorized
   <  Content-Length: 0
   .
   .
   >  POST https://storage.picl.services.mozilla.com HTTP/1.1
   >  {
   >   "assertion": <fresh browserid assertion>,
   >   "device": <device UUID>
   >  }
   .
   .
   <  HTTP/1.1 200 OK
   <  Content-Type: application/json
   <  {
   <   "base_url": "https://cassandra1.storage.picl.services.mozilla.com/storage/12345",
   <   "id": <hawk auth id>,
   <   "key": <hawk auth secret key>
   <  }


The UserDB System

The UserDB system contains the mapping of user account emails to userids, and the mapping of userids to clusters.

This component has a lot of similarity to the TokenServer from the Sync2.0 architecture:

 https://wiki.mozilla.org/Services/Sagrada/TokenServer
 https://docs.services.mozilla.com/token/index.html

However, we intend for it to manage a relatively small number of clusters, which each have their own internal sharding or other scaling techniques, rather than managing a large number of service node shards. We're also going to simplify some of the secrets/signing management, and will not support multiple services from a single user account.

It's not terribly write-heavy, but it is very valuable data that must be kept strongly consistent - if we lose the ability to direct a user to the correct cluster, or if we send different devices to different clusters, the user is not going to be happy.

It also needs to be highly available for reads: if UserDB reads go down, we effectively lose access to every cluster.

To keep things simple, reliable and available, this will use a multi-DC replicated MySQL setup. It would be awesome if the write load were small enough to do synchronous replication here, using something like Galera Cluster:

 http://codership.com/content/using-galera-cluster

If not, then a standard master/slave setup should be OK, as long as we're careful not to send users to stale cluster assignments.

Example schema:

   CREATE TABLE users (
     userid INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
     email VARCHAR(128) NOT NULL,
     clusterid INTEGER NOT NULL,
     previous_clusterid INTEGER
   );

Each user is assigned to a particular cluster. We can also track the cluster they were previously assigned to, which might help with managing migration of users between clusters.


   CREATE TABLE clusters (
     clusterid INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
     base_url VARCHAR(128) NOT NULL,
     assignment_weight INTEGER NOT NULL
   );

Each cluster has a base_url and an assignment_weight. When a new user account gets created, we randomly assign it to a cluster with probability proportional to the assignment_weight. Set the weight to zero to stop sending new users to a particular cluster.
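
As an illustration, here's a minimal Python sketch of that weighted assignment (function and variable names are made up; the real UserDB logic is not yet defined):

   import random

   def choose_cluster(clusters):
       """Pick a clusterid with probability proportional to assignment_weight.

       `clusters` is a list of (clusterid, assignment_weight) rows, e.g. from
       "SELECT clusterid, assignment_weight FROM clusters".  A cluster whose
       weight is zero is never chosen, so zeroing the weight stops new users
       being sent to that cluster.
       """
       candidates = [(cid, w) for cid, w in clusters if w > 0]
       if not candidates:
           raise RuntimeError("no cluster is currently accepting new users")
       ids, weights = zip(*candidates)
       return random.choices(ids, weights=weights, k=1)[0]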

This service will need to have a user-facing API to support the login handshake dance, and some private management APIs for managing clusters, assignments, etc. Maybe even a nice friendly admin UI for the ops folks to use.


A Massively-Sharded MySQL Cluster

One of the leading options for storage is a massively-sharded MySQL setup, taking advantage of the highly shardable nature of the data set. Basic principles:

  • Each user is transparently mapped to a shard, e.g. via consistent hashing.
  • All reads and writes for a shard go to a single master MySQL database.
    • This saves us having to deal with conflicting writes and other multi-master headaches.
    • To keep things simple, there are no read slaves. The sharding is the only thing responsible for distributing server load.
    • A single DB host machine might hold multiple shards.
  • Each master synchronously replicates to one or more hot standby dbs in the same DC, to guard against individual machine failure.
  • One of the standby dbs is periodically snapshotted into S3, to guard against data loss if the whole DC goes down.
  • There is no cross-DC replication; if the DC goes down, the cluster becomes unavailable and we might have to restore from S3.
  • All sharding logic and management lives in a stand-alone "db router" process, so that it's transparent to the webapp code.
  • We should try to implement this using ScaleBase to start, but keep in mind the possibility of a custom dbrouter process.


What the WebApp Sees

From the POV of the webapp code, it's just talking to a regular old MySQL database:

   +---------+          +--------------+
   | Web App |--------->| MySQL Server |
   +---------+          +--------------+

It has to respect a couple of restrictions though:

  • No DDL statements are allowed, only regular queries. Schema changes and sharding don't mix.
  • All queries must specify a fixed userid, e.g. by having a "userid = XXX" component in the WHERE clause, or by inserting rows with a known userid. This enables us to do the sharding transparently.
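
For example, a couple of hypothetical webapp helpers (table and column names are illustrative only, since the storage schema isn't defined here, and `db` is an ordinary MySQL DB-API connection which in production points at the router). Note that every statement pins a single userid, which is what lets the router pick a shard:

   def get_collection_versions(db, userid):
       """Read the per-collection version numbers for one user."""
       cur = db.cursor()
       cur.execute(
           "SELECT collection, version FROM collections WHERE userid = %s",
           (userid,),
       )
       return dict(cur.fetchall())

   def put_item(db, userid, collection, itemid, payload):
       """Insert a row with an explicitly-known userid."""
       cur = db.cursor()
       cur.execute(
           "INSERT INTO items (userid, collection, itemid, payload) "
           "VALUES (%s, %s, %s, %s)",
           (userid, collection, itemid, payload),
       )
       db.commit()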


Transparent DB Router

The application code is actually talking to a "db router" server that speaks the MySQL wire protocol. In turn, the router is talking to the individual MySQL servers that are hosting each shard:

                                             +---------------------------+
                                      +----->| MySQL Server for Shard #1 |
   +---------+        +-----------+   |      +---------------------------+
   | Web App |------->| DB Router |---+
   +---------+        +-----------+   |      +---------------------------+
                                      +----->| MySQL Server for Shard #2 |
                                             +---------------------------+


The db router will:

  • Receive, parse and validate each incoming query
  • Extract the target userid, erroring out if the query does not have one.
  • Look up the shard number and corresponding database for that userid.
  • Forward the query to the appropriate database host, and proxy back the results.


The particulars of shard selection/lookup are not defined in this proposal, and are orthogonal to the rest of the setup. :rfkelly likes the consistent-hashing-plus-vbucket approach taken by Couchbase, but it could be as simple as a lookup table. We assume that the router implements this appropriately and efficiently.
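
As a rough sketch of the vbucket-style approach (the bucket count, hash choice and map format are all illustrative, not decided):

   import hashlib

   NUM_VBUCKETS = 1024  # fixed when the cluster is created; illustrative value

   def vbucket_for_user(userid):
       """Hash the userid into a stable virtual bucket number."""
       digest = hashlib.sha1(str(userid).encode("ascii")).digest()
       return int.from_bytes(digest[:4], "big") % NUM_VBUCKETS

   def db_for_user(userid, vbucket_map):
       """Look up the database currently hosting this user's vbucket.

       `vbucket_map` is the small, rarely-changing mapping shared between
       router processes, e.g. {vbucket: "db-host:3306", ...}.  Moving a shard
       between databases just means updating this map.
       """
       return vbucket_map[vbucket_for_user(userid)]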

Handling all the sharding logic in a separate process gives us a number of advantages:

  • Application code is greatly simplified.
    • The same code paths are exercised in deployment, testing, and third-party deployments against a single database machine.
  • The total number of connections is reduced, along with various connection-related overheads.
  • The router can do centralized health monitoring of the individual servers, handle failover, etc.


Intra-DC Redundancy

We need to guard against the loss of any individual server within a DC. There are separate redundancy schemes for the MySQL servers, and for the other supporting services.

MySQL Redundancy

To guard against the loss of any individual database server, each shard will also have a hot standby database, living in the same DC and configured for synchronous (semi-synchronous?) replication. For AWS it would be in a separate Availability Zone. The router monitors the health of the standby database, but does not forward it any queries. Its only job is to serve as a backup for the active master:

                                             +---------------------+
                                      +----->| Master for Shard #1 |
                                      |      +----------+----------+
                                      |                 | (replication)
   +---------+        +-----------+   |      +----------V---------------+
   | Web App |------->| DB Router |---+----->| Hot Standby for Shard #1 |
   +---------+        +-----------+   |      +--------------------------+
                                      |
                                      |      +---------------------+
                                      +----->| Master for Shard #2 |
                                      |      +----------+----------+
                                      |                 | (replication)
                                      |      +----------V---------------+
                                      +----->| Hot Standby for Shard #2 |
                                             +--------------------------+


The router process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong. If the active master appears to be down, the router will transparently promote the hot standby and start sending queries to it. When the downed master comes back up, it is demoted to being the new standby.
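
A highly simplified sketch of the kind of health-check loop the router might run per shard (the `shard` object and its methods are hypothetical, and a real version would need fencing and coordination between router instances so that exactly one promotion happens):

   import time

   def monitor_shard(shard, alert):
       """Watch a shard's master, promoting the hot standby if the master dies."""
       while True:
           if not shard.master.ping():
               alert("master for shard %s is down, promoting standby" % shard.id)
               shard.standby.promote_to_master()  # e.g. stop replication, allow writes
               shard.master, shard.standby = shard.standby, shard.master
               # The old master rejoins as the new standby once it comes back up.
           elif not shard.standby.ping():
               alert("standby for shard %s is down" % shard.id)
           time.sleep(5)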

TODO: The failover could be performed manually, if we're a bit leery of infrastructure being too clever for its own good.

TODO: Just one standby? Two? The principle should be the same regardless of how many we have. Star Topology FTW.

TODO: We could use the standby as a read slave, but I don't see the point. In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.


Other Service Redundancy

We don't want any single points of failure, so we'll have to have multiple instances of the webapp talking to multiple instances of the router. These are connected via load-balancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier a non-event:

 +--------------+    +-----------------+
 | Web App Tier |    | DB Router Tier  |         +---------------------+
 |              |    |                 |     +-->| Master for Shard #N |
 |  +---------+ |    | +-----------+   |     |   +----------+----------+
 |  | Web App | |--->| | DB Router |   |-----+              | (replication)
 |  +---------+ |    | +-----------+   |     |   +----------V---------------+
 |  +---------+ |    | +-----------+   |     +-->| Hot Standby for Shard #N |
 |  | Web App | |    | | DB Router |   |         +--------------------------+
 |  +---------+ |    | +-----------+   |
 +--------------+    +-----------------+


Note that we're not doing this for the MySQL servers. There are too many of them, and we already have a custom redundancy scheme via the hot standby.

Rendered concretely into AWS, we would have an Elastic Load Balancer and corresponding Auto-Scaling Group for each of these Tiers. The ELB provides a single endpoint for each service to contact the next, while being a magical auto-cloud-managed non-single-point-of-failure itself:


             +--------------+                  +----------------+
             | Auto-Scale   |                  | Auto-Scale     |         +---------------------+
             |              |                  |                |     +-->| Master for Shard #N |
 +-----+     |  +---------+ |      +-----+     |  +-----------+ |     |   +----------+----------+
 | ELB |--+--|->| Web App |-|--+-->| ELB |--+--|->| DB Router | |-----+              | (replication)
 +-----+  |  |  +---------+ |  |   +-----+  |  |  +-----------+ |     |   +----------V---------------+
          |  |  +---------+ |  |            |  |  +-----------+ |     +-->| Hot Standby for Shard #N |
          +--|->| Web App |-|--+            +--|->| DB Router | |         +--------------------------+
             |  +---------+ |                  |  +-----------+ |
             +--------------+                  +----------------+


With multiple DB Router processes, we run into the problem of shared state. They must all agree on the current mapping of userids to shards, of shards to database machines, and which database machines are master versus standby. One solution is to have them operate as a ZooKeeper (or similar) cluster to store this state in a consistent and highly-available fashion:

  +----------------------------------------------+
  | DB Router Tier                               |
  |                                              |
  |  +------------------+    +-----------------+ |
  |  | DB Router:       |    | DB Router:      | |
  |  |  ZooKeeper Node <+----+> ZooKeeper Node | |
  |  |  Router Process  |    |  Router Process | |
  |  +----|-------------+    +----|------------+ |
  |       |                       |              |
  |       +-----------+-----------+              |
  +-------------------|--------------------------+
                      V
              ...................
              : MySQL Instances :
              :.................:


Note that this shard-state metadata will be very small and updated very infrequently, which should make it very friendly to a local ZooKeeper installation. We might even be able to provide a nice web-based status view and management console.
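
As a sketch, a router process might publish and watch the shard map via the kazoo ZooKeeper client (the znode path and JSON format are made up for illustration):

   import json
   from kazoo.client import KazooClient

   zk = KazooClient(hosts="router1:2181,router2:2181,router3:2181")
   zk.start()

   SHARD_MAP_PATH = "/picl/mysql-cluster/shard-map"  # illustrative path

   def publish_shard_map(shard_map):
       """Write the shard->database mapping as JSON (small, rarely updated)."""
       data = json.dumps(shard_map).encode("utf-8")
       if zk.exists(SHARD_MAP_PATH):
           zk.set(SHARD_MAP_PATH, data)
       else:
           zk.create(SHARD_MAP_PATH, data, makepath=True)

   # Each router keeps a local copy, refreshed whenever the znode changes.
   current_map = {}

   @zk.DataWatch(SHARD_MAP_PATH)
   def on_map_change(data, stat):
       if data is not None:
           current_map.clear()
           current_map.update(json.loads(data.decode("utf-8")))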


Database Snapshots

For a final level of redundancy, we periodically snapshot each database into long-term storage, e.g. S3. We would likely take the snapshot from the least up-to-date replica, to minimize the chance of impacting production capacity.

As well as providing redundancy, these snapshots allow us to quickly bring up another DB for a particular shard. E.g. if we lose the hot standby, we can start a fresh one, restore it from a snapshot, then set it to work catching up from that point via standard replication. We'd use a similar process if we need to move or split shards - bring up a new replica from snapshot, get it up to date, then start sending traffic to it.
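
A minimal sketch of that snapshot step, here taking a logical dump from the standby with mysqldump and pushing it to S3 with boto3 (the tooling, bucket name and database naming are all open questions; EBS/filesystem snapshots are another option):

   import datetime
   import subprocess
   import boto3

   def snapshot_shard(shard_id, standby_host, bucket="picl-db-snapshots"):
       """Dump one shard's database from its standby and upload it to S3."""
       stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
       dump_file = "/tmp/shard-%s-%s.sql.gz" % (shard_id, stamp)
       # Dump from the (least up-to-date) standby so the master's production
       # capacity is not affected.
       with open(dump_file, "wb") as out:
           dump = subprocess.Popen(
               ["mysqldump", "-h", standby_host, "--single-transaction",
                "--databases", "shard_%s" % shard_id],
               stdout=subprocess.PIPE)
           gzip = subprocess.Popen(["gzip", "-c"], stdin=dump.stdout, stdout=out)
           dump.stdout.close()
           gzip.communicate()
           dump.wait()
       boto3.client("s3").upload_file(
           dump_file, bucket, "mysql/%s/%s.sql.gz" % (shard_id, stamp))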


Inter-DC Redundancy

There is no inter-DC redundancy from an availability perspective. If a DC goes down (e.g. an AWS region outage), then we just tell the client that we're unavailable and to come back soon.

For durability, we periodically snapshot the data into offsite long-term storage, e.g. S3. For a prolonged region outage, we could consider re-creating the entire cluster from these snapshots, but that sounds like an awful lot of work...

TODO: If we want to spend the money, we could keep replicas on standby in another DC. I doubt we'll want to spend the money.


Implications for the Client

Using a single master for each shard means we don't have to worry about conflicts or consistency. The sharding means this should not be a bottle-neck, and the use of an intermediate router process means we can fail over fast if the master goes down.

However, since we're doing asynchronous replication, there's a chance that recent database writes could be lost in the event of failure. The client will see a consistent, but out-of-date view of its data. It must be able to recover from such a situation, although we hope this would be a very rare occurrence!


Implementing the Router

The DB Router process is obviously key here, and looks like a reasonably complex beast. Can we use an off-the-shelf solution for this? There are some that have most of the required features, e.g. ScaleBase.

On the other hand, the feature set seems small enough that we could realistically implement the router in-house, with the benefit of tighter focus and greater control over the details of monitoring, failover, replication etc.


Things to Think About

  • Needs a detailed and careful plan for how we'll bring up new DBs for existing shards, how we'll move shards between DBs, and how we'll split shards if that becomes necessary. All very doable, just fiddly.
  • Increasing the number of shards could be very tricky. It might be simpler to:
    • spin up a new, bigger cluster using the same architecture
    • stop sending new users to the old cluster, start sending them to the new one
    • gradually migrate old users over to the new cluster
    • tear down the old cluster when finished


A Cassandra Cluster

Another promising storage option is Cassandra. It provides a rich-enough data model and automatic cluster management, at the cost of eventual consistency and the vague fear that it will try to do something "clever" when you really don't want it to. To get strong consistency back, we'd use a locking layer such as ZooKeeper or memcached. Basic principles:

  • There is a single Cassandra storage node cluster behind the usual array of webhead machines.
    • We set a replication factor of 3 and do LOCAL_QUORUM reads and writes for all queries.
  • The Cassandra cluster spans multiple DCs for durability (since it's not clear to me how well it would handle being snapshotted into S3).
    • All reads and writes are done in a single datacenter, so that we can enforce consistency.
    • Read/write locks are taken in ZooKeeper/memcached, on a per-user basis, to ensure consistency.
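
A rough sketch of a per-user write under this scheme, using the DataStax Python driver and a kazoo-based lock (the keyspace, table and lock paths are purely illustrative):

   from cassandra import ConsistencyLevel
   from cassandra.cluster import Cluster
   from cassandra.query import SimpleStatement
   from kazoo.client import KazooClient

   zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
   zk.start()
   session = Cluster(["cass1", "cass2", "cass3"]).connect("picl")

   def write_item(userid, collection, itemid, payload):
       """Write one item under a per-user lock, at LOCAL_QUORUM consistency."""
       lock = zk.Lock("/picl/locks/user/%s" % userid, "webhead")
       with lock:  # serialize concurrent writers for this user
           stmt = SimpleStatement(
               "INSERT INTO items (userid, collection, itemid, payload) "
               "VALUES (%s, %s, %s, %s)",
               consistency_level=ConsistencyLevel.LOCAL_QUORUM)
           session.execute(stmt, (userid, collection, itemid, payload))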


From the POV of the webapp code, it's just talking to ZooKeeper and the Cassandra storage nodes as abstract systems:

   +---------+          +-----------+
   | Web App |--------->| ZooKeeper |
   +---------+          +-----------+
           |
           |            +-----------+
           +----------->| Cassandra |
                        +-----------+


The fact that these are clustered, and membership may grow/shrink over time, should be transparent.

TODO: Try to use Route53 to provide consistent names for some of the nodes, to act as introducers even if the entire membership of the cluster has changed.

The Cassandra cluster replicates out to another DC for durability, but everything else stays in the one DC. If that DC goes down, the cluster becomes unavailable but no data is lost. We can re-create it in the other DC, or wait for it to come back up.


   +-------------------------------+   +-----------------+
   |  US-East                      |   |  US-West        |
   |                               |   |                 |
   | +---------+    +-----------+  |   |                 |
   | | Web App |--->| ZooKeeper |  |   |  +-----------+  |
   | +---------+    +-----------+  |   |  | Cassandra |  |
   |        |                      |   |  +-----------+  |
   |        |                      |   |      ^          |
   |        |       +-----------+  |   |      |          |
   |        +------>| Cassandra |<-|---|------+          |
   |                +-----------+  |   |                 |
   |                               |   |                 |
   +-------------------------------+   +-----------------+


A Hibernation Cluster

If a user doesn't use the service in, say, six months, then we could migrate them out of one of the active clusters and into a special "hibernation cluster".

Data that is moved into this cluster might simply be snapshotted into low-cost storage such as S3. Or it might get put onto a very crowded, very slow MySQL machine that can only handle a trickle of user requests.

If they come back and try to use their data again, we immediately trigger a migration back to one of the active clusters.


High-Level Things To Think About

  • There's a bit of management overhead in the API, with the handshake etc. We could consider factoring that out and just doing the routing internally. But there's something to be said for explicitness.
  • Needs a detailed and careful plan for how we would migrate users from one cluster to another. Very doable, just fiddly and potentially quite slow.
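
A very rough sketch of what a cluster-to-cluster migration might look like against the UserDB schema above (the copy step and any quiescing/read-only handling on the source cluster are hand-waved; this just shows where previous_clusterid fits in):

   def migrate_user(userdb, userid, new_clusterid, copy_data):
       """Move a user to a new cluster, remembering where they came from.

       `copy_data(userid, old_clusterid, new_clusterid)` is assumed to copy
       the user's records between clusters while the source is quiesced.
       Once the UserDB row is updated, the next login handshake hands out the
       new cluster's base_url, and requests to the old cluster get a 401.
       """
       cur = userdb.cursor()
       cur.execute("SELECT clusterid FROM users WHERE userid = %s", (userid,))
       (old_clusterid,) = cur.fetchone()

       copy_data(userid, old_clusterid, new_clusterid)

       cur.execute(
           "UPDATE users SET previous_clusterid = %s, clusterid = %s "
           "WHERE userid = %s AND clusterid = %s",
           (old_clusterid, new_clusterid, userid, old_clusterid),
       )
       userdb.commit()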