Identity/AttachedServices/StorageServiceArchitecture

Summary

This is a working proposal for the backend storage architecture of the PiCL server. It's based on a massively-sharded and cross-DC-replicated MySQL installation, and is far from final. All feedback welcome!

The goals are:

  • Scale to billions of users. Quickly. Easily.
  • Don't lose user data. Even if a machine dies. Even if a meteor hits a data-center.
  • Maximize uptime, running costs be damned.
  • Provide a simple programming model to the application.
  • Provide a relatively simple and well-understood Ops environment.


The basic principles:

  • Each user's data is completely independent; there's no need for queries that cross multiple user accounts.
    • This means that our data storage problem is embarrassingly shardable. Good times!
  • Each user account is assigned to a particular shard, identified by an integer.
    • Their shard assignment will never change unless they delete and re-create their account.
  • All reads and writes for a shard go to a single master MySQL database.
    • This saves us having to deal with conflicting writes and other multi-master headaches.
    • To keep things simple, there are no read slaves. The sharding is the only thing responsible for distributing server load.
    • A single DB host machine might hold multiple shards.
  • Each master synchronously replicates to one or more hot standby dbs in the same DC, to guard against individual machine failure.
  • The entire DC is asynchronously replicated to a warm standby setup in another region, to guard against whole-DC failure.
  • All sharding logic and management lives in a stand-alone "db router" process, so that it's transparent to the application.


What the App Sees

From the POV of the application code, it's just talking to a regular old MySQL database:

   +---------+          +--------------+
   | Web App |--------->| MySQL Server |
   +---------+          +--------------+

It has to respect a couple of restrictions though:

  • No DDL statements are allowed, only regular queries. Schema changes and sharding don't mix.
  • All queries must specify a fixed userid, e.g. by having a "userid = XXX" component in the WHERE clause, or by inserting rows with a known userid. This enables us to do the sharding transparently.
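
To make that second restriction concrete, here are a few illustrative queries against a made-up "items" table (the table and column names are not from any actual schema). The first group satisfies the restriction; the second group would be rejected by the router:

  # Acceptable: every query is scoped to a single, explicit userid.
  OK_QUERIES = [
      "SELECT * FROM items WHERE userid = 1234 AND collection = 'bookmarks'",
      "INSERT INTO items (userid, collection, payload) VALUES (1234, 'history', '...')",
      "DELETE FROM items WHERE userid = 1234 AND id = 42",
  ]

  # Rejected: no single userid to shard on, or DDL.
  BAD_QUERIES = [
      "SELECT COUNT(*) FROM items",                       # spans all users
      "UPDATE items SET payload = '...' WHERE id = 42",   # no userid in the WHERE clause
      "ALTER TABLE items ADD COLUMN extra TEXT",          # DDL is not allowed at all
  ]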


Transparent DB Router

The application code is actually talking to a "db router" server that speaks the MySQL wire protocol. In turn, the router is talking to the individual MySQL servers that are hosting each shard:

                                             +---------------------------+
                                      +----->| MySQL Server for Shard #1 |
   +---------+        +-----------+   |      +---------------------------+
   | Web App |------->| DB Router |---+
   +---------+        +-----------+   |      +---------------------------+
                                      +----->| MySQL Server for Shard #2 |
                                             +---------------------------+


The db router will:

  • Receive, parse and validate each incoming query
  • Extract the target userid, erroring out if the query does not have one.
  • Look up the shard number and corresponding database for that userid.
  • Forward the query to the appropriate database host, and proxy back the results.
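
As a rough sketch of how those steps might fit together, assuming some hypothetical helpers: an already-open connection per shard, and a lookup_shard() function like the one sketched just below. A real router would need a proper SQL parser rather than the simplistic regex used here:

  import re

  USERID_RE = re.compile(r"\buserid\s*=\s*(\d+)", re.IGNORECASE)

  def route_query(sql, connections, lookup_shard):
      """Validate the query, find its target userid, and forward it to the right shard."""
      if sql.lstrip().upper().startswith(("CREATE", "ALTER", "DROP")):
          raise ValueError("DDL statements are not allowed through the router")
      match = USERID_RE.search(sql)    # a real parser would also handle INSERT value lists
      if match is None:
          raise ValueError("query must be scoped to a single userid")
      shard = lookup_shard(int(match.group(1)))
      conn = connections[shard]        # connection to that shard's master database
      with conn.cursor() as cur:
          cur.execute(sql)
          return cur.fetchall()        # proxy the result rows back to the application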


The particulars of shard selection/lookup are not defined in this proposal, and are orthogonal to the rest of the setup. :rfkelly likes the consistent-hashing-plus-vbucket approach taken by couchbase, but it could be as simple as a lookup table. We assume that the router implements this appropriately and efficiently.
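
As one possible sketch of that approach, a userid is hashed onto a fixed number of vbuckets, and a small vbucket-to-shard table (which would live in the shared router state described below) says where each vbucket currently resides. The vbucket and shard counts here are arbitrary illustrative values:

  import zlib

  NUM_VBUCKETS = 1024   # fixed forever; rebalancing just remaps vbuckets to shards
  NUM_SHARDS = 4        # illustrative only

  # In practice this table would be loaded from shared state, not hard-coded.
  VBUCKET_TO_SHARD = {vb: vb % NUM_SHARDS for vb in range(NUM_VBUCKETS)}

  def lookup_shard(userid):
      """Map a userid to its shard via a stable hash and the vbucket table."""
      vbucket = zlib.crc32(str(userid).encode("ascii")) % NUM_VBUCKETS
      return VBUCKET_TO_SHARD[vbucket]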

Handling all the sharding logic in a separate process gives us a number of advantages:

  • Application code is greatly simplified.
    • The same code paths are exercised in production, in testing, and in third-party deployments that run against a single database machine.
  • The total number of connections is reduced, along with various connection-related overheads.
  • The router can do centralized health monitoring of the individual servers, handle failover, etc.


Intra-DC Redundancy

We need to guard against the loss of any individual server within a DC. There are separate redundancy schemes for the MySQL servers, and for the other supporting services.

MySQL Redundancy

To guard against the loss of any individual database server, each shard will also have a hot standby database, living in the same DC and configured for synchronous (semi-synchronous?) replication. On AWS, it would be in a separate Availability Zone. The router monitors the health of the standby database, but does not forward it any queries. Its only job is to serve as a backup for the active master:

                                             +---------------------+
                                      +----->| Master for Shard #1 |
                                      |      +----------+----------+
                                      |                 | (replication)
   +---------+        +-----------+   |      +----------V---------------+
   | Web App |------->| DB Router |---+----->| Hot Standby for Shard #1 |
   +---------+        +-----------+   |      +--------------------------+
                                      |
                                      |      +---------------------+
                                      +----->| Master for Shard #2 |
                                      |      +----------+----------+
                                      |                 | (replication)
                                      |      +----------V---------------+
                                      +----->| Hot Standby for Shard #2 |
                                             +--------------------------+


The router process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong. If the active master appears to be down, the router will transparently promote the hot standby and start sending queries to it. When the downed master comes back up, it is demoted to being the new standby.
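
A very rough sketch of that monitoring loop follows. The ping, promote, and alert helpers are hypothetical placeholders; a real implementation would need proper fencing of the old master and would coordinate the promotion through the shared router state described later:

  import time

  def monitor_shard(shard, topology, ping, promote, alert):
      """Poll the shard's master; after repeated failures, promote its hot standby."""
      failures = 0
      while True:
          if ping(topology.master_for(shard)):     # e.g. a cheap "SELECT 1" with a short timeout
              failures = 0
          else:
              failures += 1
              if failures >= 3:                    # don't fail over on a single network blip
                  alert("master for shard %d appears down; promoting hot standby" % shard)
                  promote(shard)                   # swaps master/standby in the shared shard map
                  failures = 0
          time.sleep(5)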

TODO: The failover could be performed manually, if we're a bit leery of infrastructure being too clever for its own good.

TODO: Just one standby? Two? The principle should be the same regardless of how many we have. Star Topology FTW.

TODO: We could use the standby as a read slave, but I don't see the point. In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.

Other Service Redundancy

We don't want any single points of failure, so we'll have multiple instances of the webapp talking to multiple instances of the router. These are connected via load balancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier be a non-event:

 +--------------+    +-----------------+
 | Web App Tier |    | DB Router Tier  |         +---------------------+
 |              |    |                 |     +-->| Master for Shard #N |
 |  +---------+ |    | +-----------+   |     |   +----------+----------+
 |  | Web App | |--->| | DB Router |   |-----+              | (replication)
 |  +---------+ |    | +-----------+   |     |   +----------V---------------+
 |  +---------+ |    | +-----------+   |     +-->| Hot Standby for Shard #N |
 |  | Web App | |    | | DB Router |   |         +--------------------------+
 |  +---------+ |    | +-----------+   |
 +--------------+    +-----------------+


Note that we're not doing this to the MySQL servers. There are too many of them, and we already have a custom redundancy scheme via the hot standby.

Rendered concretely into AWS, we would have an Elastic Load Balancer and corresponding Auto-Scaling Group for each of these Tiers. The ELB provides a single endpoint for each service to contact the next, while being a magical auto-cloud-managed non-single-point-of-failure itself:


             +--------------+                  +----------------+
             | Auto-Scale   |                  | Auto-Scale     |         +---------------------+
             |              |                  |                |     +-->| Master for Shard #N |
 +-----+     |  +---------+ |      +-----+     |  +-----------+ |     |   +----------+----------+
 | ELB |--+--|->| Web App |-|--+-->| ELB |--+--|->| DB Router | |-----+              | (replication)
 +-----+  |  |  +---------+ |  |   +-----+  |  |  +-----------+ |     |   +----------V---------------+
          |  |  +---------+ |  |            |  |  +-----------+ |     +-->| Hot Standby for Shard #N |
          +--|->| Web App |-|--+            +--|->| DB Router | |         +--------------------------+
             |  +---------+ |                  |  +-----------+ |
             +--------------+                  +----------------+


With multiple DB Router processes, we run into the problem of shared state. They must all agree on the current mapping of userids to shards, of shards to database machines, and which database machines are master versus standby. They'll operate as a ZooKeeper (or similar) cluster to store this state in a consistent and highly-available fashion:

  +----------------------------------------------+
  | DB Router Tier                               |
  |                                              |
  |  +------------------+    +-----------------+ |
  |  | DB Router:       |    | DB Router:      | |
  |  |  ZooKeeper Node <+----+> ZooKeeper Node | |
  |  |  Router Process  |    |  Router Process | |
  |  +----|-------------+    +----|------------+ |
  |       |                       |              |
  |       +-----------+-----------+              |
  +-------------------|--------------------------+
                      V
              ...................
              : MySQL Instances :
              :.................:


Note that this shard-state metadata will be very small and be updated very infrequently, which should make it very friendly to a local zookeeper installation. We might even be able to provide a nice web-based status view and management console.
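
As an illustration only, a router process might publish and watch the shard map roughly like this, using the kazoo client library as one possible ZooKeeper binding; the paths, JSON layout, and reload_routing_table() helper are invented for the example:

  import json
  from kazoo.client import KazooClient

  zk = KazooClient(hosts="127.0.0.1:2181")
  zk.start()

  # Publish the (tiny, rarely-changing) shard map.
  shard_map = {"vbuckets": 1024, "shards": {"0": "db-0.internal", "1": "db-1.internal"}}
  zk.ensure_path("/storage/shard_map")
  zk.set("/storage/shard_map", json.dumps(shard_map).encode("utf-8"))

  # Every router watches the node and reloads its routing table whenever it changes.
  @zk.DataWatch("/storage/shard_map")
  def on_shard_map_change(data, stat):
      if data is not None:
          reload_routing_table(json.loads(data.decode("utf-8")))  # hypothetical helper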


Inter-DC Redundancy

We'll replicate the entire stack into a second data-center, which will maintain a full backup copy of all shards. In concrete AWS terms, this means a second AWS Region.

One DC will be the active master for all shards, and the other is purely a backup. We could have it accepting web traffic and forwarding all DB queries across to the master, or we could remove it from DNS entirely unless we need it. Every shard has a designated warm-standby host in this DC, configured for asynchronous WAN replication from the hot standby (so that the master doesn't have additional load from this replication). Likewise, the internal DB Router state is replicated into the second DC:


 +--------------------------------------------------------------------------------+
 | US-East Data Center                                                            |
 |                                                                                |
 |   +--------------+    +----------------+                                       |
 |   | Web App Tier |    | DB Router Tier |         +---------------------+       |
 |   |              |    |                |     +-->| Master for Shard #N |       |
 |   |  +---------+ |    | +-----------+  |     |   +----------+----------+       |
 |   |  | Web App | |--->| | DB Router |  |-----+              | (replication)    |
 |   |  +---------+ |    | +-----------+  |     |   +----------V---------------+  |
 |   |  +---------+ |    | +-----------+  |     +-->| Hot Standby for Shard #N |--+-----+
 |   |  | Web App | |    | | DB Router |  |         +--------------------------+  |     |
 |   |  +---------+ |    | +-----------+  |                                       |     |
 |   +--------------+    +----------------+                                       |     |
 |                                |                                               |     |
 +--------------------------------+-----------------------------------------------+     |
                                  |                                                     |
                                  | (very slow replication)                             | (very slow replication)
                                  |                                                     |
 +--------------------------------+-------------------------------------------------+   |
 | US-West Data Center            |                                                 |   |
 |                                V                                                 |   |
 |   +--------------+    +----------------+                                         |   |
 |   | Web App Tier |    | DB Router Tier |         +---------------------------+   |   |
 |   |              |    |                |     +-->| Warm Standby for Shard #N |<--|---+
 |   |  +---------+ |    | +-----------+  |     |   +----------+----------------+   |
 |   |  | Web App | |--->| | DB Router |  |-----+              | (replication)      |
 |   |  +---------+ |    | +-----------+  |     |   +----------V-----------------+  |
 |   |  +---------+ |    | +-----------+  |     +-->| Tepid Standby for Shard #N |  |
 |   |  | Web App | |    | | DB Router |  |         +----------------------------+  |
 |   |  +---------+ |    | +-----------+  |                                         |
 |   +--------------+    +----------------+                                         |
 +----------------------------------------------------------------------------------+
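
As a sketch of how that WAN replication link might be wired up using standard MySQL replication commands, the warm standby in the second DC is pointed at the hot standby in the master DC. The hostnames, credentials, and binlog coordinates are placeholders:

  import pymysql

  # Run against the warm standby in the backup DC.
  conn = pymysql.connect(host="shard1-warm.us-west.internal", user="admin", password="...")
  with conn.cursor() as cur:
      cur.execute("""
          CHANGE MASTER TO
              MASTER_HOST = 'shard1-hot-standby.us-east.internal',
              MASTER_USER = 'repl',
              MASTER_PASSWORD = '...',
              MASTER_LOG_FILE = 'mysql-bin.000123',
              MASTER_LOG_POS  = 4
      """)
      cur.execute("START SLAVE")
  conn.close()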


Since this is replicating cross-DC, any attempt to fail over to the warm standby will almost certainly lose recently-written transactions. We should probably not try to automate whole-DC failover, so that Ops can ensure consistent state before anything tries to send writes to a new location.

TODO: How many DCs? The principle should be the same regardless of how many we have. Nested Star Topology FTW.

TODO: We could potentially fail over to the second DC for individual shards, if we happen to lose all DBs for that shard in the master DC. At the cost of sending DB queries to a separate region. Worth it?


Database Snapshots

For a final level of redundancy, we periodically snapshot each database into long-term storage, e.g. S3. We would likely take the snapshot on the least up-to-date replica, to minimize the chance of impacting production capacity.

As well as providing redundancy, these snapshots allow us to quickly bring up another DB for a particular shard. E.g. if we lose the hot standby, we can start a fresh one, restore it from a snapshot, then set it to work catching up from that point via standard replication.
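
A minimal sketch of such a snapshot job, assuming mysqldump is available and using boto3 as one way to push the dump to S3; the bucket name, hostnames, and paths are placeholders:

  import datetime
  import subprocess
  import boto3

  def snapshot_shard(shard, replica_host):
      """Dump one shard from its least up-to-date replica and ship it to S3."""
      stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
      dumpfile = "/tmp/shard-%d-%s.sql.gz" % (shard, stamp)
      # --single-transaction gives a consistent InnoDB snapshot without locking;
      # --master-data=2 records the binlog position so a restored copy can catch up.
      cmd = ("mysqldump --single-transaction --master-data=2 --all-databases "
             "-h %s | gzip > %s" % (replica_host, dumpfile))
      subprocess.run(cmd, shell=True, check=True)
      boto3.client("s3").upload_file(dumpfile, "picl-db-snapshots",
                                     "shard-%d/%s.sql.gz" % (shard, stamp))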


Implications

Using a single master for each shard means we don't have to worry about conflicts or consistency. The sharding means this should not be a bottleneck, and the use of an intermediate router process means we can fail over fast if the master goes down.

However, since we're doing asynchronous replication, there's a chance that recent database writes could be lost in the event of failure. The client will see a consistent, but out-of-date view of its data. It must be able to recover from such a situation, although we hope this would be a very rare occurrence!


Implementing the Router

The DB Router process is obviously key here, and looks like a reasonably complex beast. Can we use an off-the-shelf solution for this? There are some that have most of the required features, e.g. ScaleBase.

On the other hand, the feature set seems small enough that we could realistically implement the router in-house, with the benefit of tighter focus and greater control over the details of monitoring, failover, replication etc.


Things To Think About

I've tried to strike a balance between operational simplicity, application simplicity, and functionality here. We pay a price for it though:

  • There are quite a few moving parts here. ZooKeeper is a beast. The router process has a few different, interacting responsibilities that would have to be carefully modeled and managed.
  • There's only a single active DC which has to handle all traffic. That's the price we pay for using MySQL and exposing a strongly-consistent client API.
    • Since this is not a user-facing API, I think this is a good trade-off. We don't care quite as much about the perceived latency and responsiveness of these requests.
    • It's not like that DC is going to run out of capacity, right?
    • We could have multiple DCs active and serving web traffic, and route the individual DB queries to the proper master. Seems like an unnecessary pain and expense though.
  • There's the potential for severe cross-DC latency if you receive an HTTP request in one data-center, but have to forward all the MySQL queries over to the master in another data-center. I don't think there's any way around this without going to an eventually-consistent model, which would complicate the client API.
  • There's a lot of redundancy here, which will cost a lot to run. Are our uptime requirements really so tight that we need a warm standby in a separate DC? Could we get away with just the hot standby and periodic database dumps into S3, with which we can (slowly) recover from meteor-hit-the-data-center scale emergencies?
  • How will we cope with moving shards between database hosts, or with replacing dead hosts with fresh machines that have to catch up to the master? Checkpointing for faster recovery?