Identity/AttachedServices/StorageServiceArchitecture
Summary
This is a working proposal for the backend storage architecture of PiCL server. It's based on a massively-sharded and cross-DC-replicated MySQL installation, and is far from final. All feedback welcome!
The basic principles:
- Each user's data is completely independent, there's no need for queries that cross multiple user accounts.
- This means that our data storage problem is embarrassingly shardable. Good times!
- Each user account is assigned to a particular shard, identified by an integer.
- Their shard assignment will never change unless they delete and re-create their account.
- All reads and writes for a shard go to a single master MySQL database.
- This saves us having to deal with conflicting writes and other multi-master headaches.
- To keep things simple, there are no read slaves. The sharding is the only thing responsible for distributing server load.
- A single DB host machine might hold multiple shards.
- Each master synchronously replicates to a hot standby in the same DC, to guard against individual machine failure.
- This may not be necessary, depending on instance reliability and our uptime requirements.
- Each master asynchronously replicates to a warm standby in a separate DC, to guard against whole-DC failure.
- Or maybe to multiple independent DCs. Same same.
- The warm standby should probably be a synchronous replication pair itself, both for symmetry and to make failover easier.
- All sharding logic and management lives in a stand-alone proxy process, so that it's transparent to the application.
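As a toy illustration of the shard-assignment principles above (a user is pinned to an integer shard at account creation, and the assignment never changes), the sketch below uses a round-robin policy and an in-memory dict; both are purely illustrative placeholders, not part of the proposal:

import itertools

NUM_SHARDS = 1024                  # illustrative; the real shard count is undecided
_next_shard = itertools.cycle(range(NUM_SHARDS))
SHARD_ASSIGNMENTS = {}             # userid -> shard number; in reality this is persisted

def assign_shard(userid):
    # Called once, at account creation time. Deleting and re-creating the account
    # is the only way a user ends up on a different shard.
    if userid not in SHARD_ASSIGNMENTS:
        SHARD_ASSIGNMENTS[userid] = next(_next_shard)
    return SHARD_ASSIGNMENTS[userid]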
What the App Sees
From the POV of the application code, it's just talking to a regular old MySQL database:
+---------+          +--------------+
| Web App |--------->| MySQL Server |
+---------+          +--------------+
It has to respect a couple of restrictions though:
- No DDL statements are allowed, only regular queries.
- All queries must specify a fixed userid, e.g. by having a "userid = XXX" component in the WHERE clause, or by inserting rows with a known userid.
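For example (using an arbitrary MySQL client library, and hypothetical table/column names), the first query below is routable because the proxy can see exactly which userid it targets, while the commented-out one is not:

import pymysql   # any client works; the proxy speaks the standard MySQL wire protocol

conn = pymysql.connect(host="sharding-proxy.internal", user="app",
                       password="s3kr1t", db="picl")
with conn.cursor() as cur:
    # OK: the userid is fixed, so the query can be routed to that user's shard.
    cur.execute("SELECT * FROM collections WHERE userid = %s AND name = %s",
                (1234, "bookmarks"))
    rows = cur.fetchall()

    # Not OK: no userid anywhere, so there is no way to pick a shard.
    # cur.execute("SELECT COUNT(*) FROM collections")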
Transparent Sharding Proxy
In actuality, the application code is talking to a proxy server that speaks the MySQL wire protocol. In turn, the proxy is talking to the individual MySQL servers that are hosting each shard:
                                               +---------------------------+
                                        +----->| MySQL Server for Shard #1 |
+---------+        +----------------+   |      +---------------------------+
| Web App |------->| Sharding Proxy |---+
+---------+        +----------------+   |      +---------------------------+
                                        +----->| MySQL Server for Shard #2 |
                                               +---------------------------+
The proxy process will:
- Receive, parse, and validate each incoming query.
- Extract the target userid, erroring out if the query does not have one.
- Look up the shard number and corresponding database for that userid.
- Forward the query to the appropriate database host, and proxy back the results.
The particulars of shard selection/lookup are not defined in this proposal. :rfkelly likes the consistent-hashing-plus-vbucket approach taken by Couchbase, but it could be as simple as a lookup table. We assume that the proxy implements this appropriately and efficiently.
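A minimal sketch of that routing step, assuming the simple lookup-table scheme; the table contents, the regex-based userid extraction, and the host names are all invented for illustration:

import re

USER_SHARDS = {1234: 0, 5678: 3}          # userid -> shard number
SHARD_MASTERS = {0: "db1.internal", 1: "db1.internal",
                 2: "db2.internal", 3: "db2.internal"}   # shard -> current master host

def route(sql):
    # Extract the target userid; refuse queries that don't pin one down.
    # (A real proxy would do this from the parsed query, not with a regex.)
    match = re.search(r"userid\s*=\s*(\d+)", sql, re.IGNORECASE)
    if match is None:
        raise ValueError("query must specify a single userid")
    shard = USER_SHARDS[int(match.group(1))]
    return SHARD_MASTERS[shard]            # forward the query to this host

print(route("SELECT * FROM collections WHERE userid = 1234"))   # -> db1.internal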
Moving the shard-handling into a separate proxy process gives us a number of advantages:
- Application code is greatly simplified.
- Same code paths can be used for testing, or for third-party deployments against a single database machine.
- Reduces total number of connections and connection-related overheads.
- Proxy can do centralized health monitoring of the individual servers.
Intra-DC Redundancy
We need to guard against the loss of any individual server within a DC. There are separate redundancy schemes for the MySQL servers, and for the supporting infrastructure.
MySQL Redundancy
To guard against the loss of any individual database server, each shard will also have a hot standby database, living in the same DC and configured for synchronous (semi-synchronous?) replication. The proxy monitors the health of the standby database, but does not forward it any queries. Its only job is to serve as a backup for the active master:
                                               +---------------------+
                                        +----->| Master for Shard #1 |
                                        |      +----------+----------+
                                        |                 | (replication)
+---------+        +----------------+   |      +----------V---------------+
| Web App |------->| Sharding Proxy |---+----->| Hot Standby for Shard #1 |
+---------+        +----------------+   |      +--------------------------+
                                        |
                                        |      +---------------------+
                                        +----->| Master for Shard #2 |
                                        |      +----------+----------+
                                        |                 | (replication)
                                        |      +----------V---------------+
                                        +----->| Hot Standby for Shard #2 |
                                               +--------------------------+
The proxy process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong. If the active master appears to be down, the proxy will transparently promote the hot standby and start sending queries to it. When the downed master comes back up, it is demoted to being the new standby.
TODO: Just one standby? Two? The principle should be the same regardless of how many we have.
TODO: We could use the standby as a read slave, but I don't see the point. In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.
TODO: The failover could also be performed manually, if we're a bit leery of infrastructure being too clever for its own good.
TODO: Depending on our uptime requirements, we might not need to spend the money on establishing this layer. Cross-DC replication and the acceptance of an occasional lost transaction may be a better tradeoff.
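A rough sketch of the monitoring/failover behaviour described above, whichever variant we settle on; the shard_state structure and the TCP-level health probe are stand-ins for whatever checks Ops actually wants:

import socket

def is_healthy(host, port=3306, timeout=1.0):
    # Crude liveness probe: can we open a TCP connection to mysqld at all?
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def check_shard(shard_state):
    # shard_state example: {"master": "db1.internal", "standby": "db1-standby.internal"}
    if not is_healthy(shard_state["master"]):
        # Promote the hot standby; when the old master comes back it is the new standby.
        shard_state["master"], shard_state["standby"] = (
            shard_state["standby"], shard_state["master"])
        # ...sound the alarm here, and record the change where all proxies can see it.
    return shard_state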
Other Service Redundancy
We don't want a single point of failure, so we'll run multiple instances of the webapp talking to multiple instances of the proxy. These are connected via load balancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier a non-event:
+--------------+    +------------------+
| Web App Tier |    | Shard Proxy Tier |         +---------------------+
|              |    |                  |     +-->| Master for Shard #N |
| +---------+  |    | +-------------+  |     |   +----------+----------+
| | Web App |  |--->| | Shard Proxy |  |-----+              | (replication)
| +---------+  |    | +-------------+  |     |   +----------V---------------+
| +---------+  |    | +-------------+  |     +-->| Hot Standby for Shard #N |
| | Web App |  |    | | Shard Proxy |  |         +--------------------------+
| +---------+  |    | +-------------+  |
+--------------+    +------------------+
Note that we're not doing this for the MySQL servers: there are too many of them, and they already have their own redundancy scheme in the form of the hot standby.
With multiple Shard Proxy processes, we run into the problem of shared state. They must all agree on the current mapping of userids to shards, of shards to database machines, and on which database machine in each pair is currently master versus standby. They'll operate as a ZooKeeper (or similar) cluster to store this state in a consistent and highly-available fashion:
+----------------------------------------------+
|  Shard Proxy Tier                            |
|                                              |
|  +-----------------+    +-----------------+  |
|  | Shard Proxy:    |    | Shard Proxy:    |  |
|  | ZooKeeper Node <+----+> ZooKeeper Node |  |
|  | Proxy Process   |    | Proxy Process   |  |
|  +----|------------+    +----|------------+  |
|       |                      |               |
|       +-----------+----------+               |
+-------------------|--------------------------+
                    V
           ...................
           : MySQL Instances :
           :.................:
Note that this shard-state metadata will be very small and updated very infrequently, which should make it very friendly to a local ZooKeeper installation.
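For concreteness, here's what publishing and watching that metadata might look like with the kazoo ZooKeeper client; the znode path and JSON layout are assumptions, not a decided format:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.internal:2181,zk2.internal:2181")
zk.start()

STATE_PATH = "/picl/shard-state"
state = {"0": {"master": "db1.internal", "standby": "db1-standby.internal"},
         "1": {"master": "db2.internal", "standby": "db2-standby.internal"}}

# Publish the current mapping, e.g. after a failover.
zk.ensure_path(STATE_PATH)
zk.set(STATE_PATH, json.dumps(state).encode("utf-8"))

# Every proxy watches the same znode, so they all converge on one view of the mapping.
@zk.DataWatch(STATE_PATH)
def on_state_change(data, stat):
    if data is not None:
        print("shard state (version %d): %s" % (stat.version, json.loads(data)))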
Inter-DC Redundancy
We'll replicate the entire stack into several data-centers, each of which will maintain a full copy of all shards.
One DC will be the active master for each shard. All reads and writes for that shard will be forwarded into that DC and routed to the master. (This saves us a world of pain by avoiding conflicting writes going into different DCs.) Other DCs are designated as warm-standby hosts for that shard, configured for asynchronous WAN replication. They can be failed over to if there is a serious outage in the master DC, but this will almost certainly result in the loss of some recent transactions:
+--------------------------------------------------------------------------------+
| US-East Data Center                                                             |
|                                                                                 |
| +--------------+    +------------------+                                        |
| | Web App Tier |    | Shard Proxy Tier |         +---------------------+        |
| |              |    |                  |     +-->| Master for Shard #N |-------|-----+
| | +---------+  |    | +-------------+  |     |   +----------+----------+        |     |
| | | Web App |  |--->| | Shard Proxy |  |-----+              | (replication)    |     |
| | +---------+  |    | +-------------+  |     |   +----------V---------------+  |     |
| | +---------+  |    | +-------------+  |     +-->| Hot Standby for Shard #N |  |     |   ( v e r y )
| | | Web App |  |    | | Shard Proxy |  |         +--------------------------+  |     |   ( s l o w )
| | +---------+  |    | +-------------+  |                                       |     |   ( r e p l i c a t i o n )
| +--------------+    +------------------+                                       |     |
+--------------------------------------------------------------------------------+     |
                                                                                       |
                                                                                       |
+----------------------------------------------------------------------------------+   |
| US-West Data Center                                                               |   |
|                                                                                   |   |
| +--------------+    +------------------+                                          |   |
| | Web App Tier |    | Shard Proxy Tier |         +---------------------------+    |   |
| |              |    |                  |     +-->| Warm Standby for Shard #N |<--|---+
| | +---------+  |    | +-------------+  |     |   +----------+----------------+   |
| | | Web App |  |--->| | Shard Proxy |  |-----+              | (replication)      |
| | +---------+  |    | +-------------+  |     |   +----------V-----------------+  |
| | +---------+  |    | +-------------+  |     +-->| Tepid Standby for Shard #N |  |
| | | Web App |  |    | | Shard Proxy |  |         +----------------------------+  |
| | +---------+  |    | +-------------+  |                                         |
| +--------------+    +------------------+                                         |
+----------------------------------------------------------------------------------+
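To make the routing rule concrete, here is a sketch of the per-shard dispatch decision, assuming the shard-state metadata also records which DC currently holds the active master; the DC names and the forwarding mechanism are illustrative only:

LOCAL_DC = "us-west"

SHARD_MASTER_DC = {0: "us-east", 1: "us-west"}   # shard number -> DC holding the active master

def dispatch(shard, sql):
    master_dc = SHARD_MASTER_DC[shard]
    if master_dc == LOCAL_DC:
        return ("local", LOCAL_DC, sql)      # hand it to this DC's master for the shard
    # Reads and writes alike get forwarded to the master DC, so only one DC ever
    # accepts writes for a given shard; the warm standby just replicates.
    return ("forward", master_dc, sql)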