* All sharding logic and management lives in a stand-alone proxy process, so that it's transparent to the application.
== What the App Sees ==
The application has to respect a couple of restrictions, though:
* No DDL statements are allowed, only regular queries. Schema changes and sharding don't mix.
* All queries must specify a fixed userid, e.g. by having a "userid = XXX" component in the WHERE clause, or by inserting rows with a known userid. This enables us to do the sharding transparently.
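For illustration, here's roughly what this looks like from the application side. This is a sketch only; the table and column names are made up and not part of this proposal:

<pre>
# Illustrative only: table/column names are invented for this sketch.
def get_collections(cursor, userid):
    # OK: the WHERE clause pins the query to a single userid, so the proxy
    # can tell which shard to send it to.
    cursor.execute("SELECT id, name FROM collections WHERE userid = %s", (userid,))
    return cursor.fetchall()

def add_item(cursor, userid, payload):
    # OK: the insert carries a known userid.
    cursor.execute("INSERT INTO items (userid, payload) VALUES (%s, %s)",
                   (userid, payload))

# Not OK: DDL such as "ALTER TABLE items ADD COLUMN ...", or any query that
# doesn't fix the userid, e.g. "SELECT COUNT(*) FROM items" -- the proxy has
# no way to pick a shard for those.
</pre>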
== Transparent Sharding Proxy ==
In actuality, the application code is talking to a proxy server that speaks the MySQL wire protocol. In turn, the proxy talks to the individual MySQL servers that host each shard:
<pre>
 +-------------+      +-------------+      +---------------------------+
 | Application |----->| Shard Proxy |--+-->| MySQL Server for Shard #1 |
 +-------------+      +-------------+  |   +---------------------------+
                                       |
                                       |   +---------------------------+
                                       +-->| MySQL Server for Shard #2 |
                                           +---------------------------+
</pre>
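Because the proxy speaks the MySQL wire protocol, the application can use a stock MySQL client and simply point it at the proxy instead of a real database server. A minimal sketch, assuming PyMySQL and made-up host names and credentials:

<pre>
# The connection settings point at the shard proxy, not an actual MySQL
# server; all values below are placeholders for illustration.
import pymysql

conn = pymysql.connect(
    host="shard-proxy.internal",   # the proxy, which looks like MySQL to us
    port=3306,
    user="webapp",
    password="s3kr3t",
    database="storage",
)
with conn.cursor() as cur:
    cur.execute("SELECT id, name FROM collections WHERE userid = %s", (42,))
    rows = cur.fetchall()
</pre>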
The proxy process will:
* Inspect each incoming query to determine the userid it targets.
* Map that userid onto a shard, and forward the query to the MySQL server hosting that shard.
* Relay the results back to the application as if it were an ordinary MySQL server.
The particulars of shard selection/lookup are not defined in this proposal, and are orthogonal to the rest of the setup. :rfkelly likes the consistent-hashing-plus-vbucket approach taken by Couchbase, but it could be as simple as a lookup table. We assume that the proxy implements this appropriately and efficiently.
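As a strawman, a vbucket-style lookup inside the proxy might look something like the sketch below. The bucket count, shard count, and function names are all assumptions, and a plain lookup table would slot in just as easily:

<pre>
# Strawman vbucket-style shard selection; all constants are assumptions.
import hashlib

NUM_VBUCKETS = 1024   # fixed number of virtual buckets
NUM_SHARDS = 2        # e.g. two shards, as in the diagram above

# vbucket -> shard assignment; in practice this table would be shared
# between proxies (e.g. via ZooKeeper) so they all agree on it.
VBUCKET_TO_SHARD = {vb: vb % NUM_SHARDS for vb in range(NUM_VBUCKETS)}

def shard_for_userid(userid):
    """Map a userid onto a shard via a stable hash into a vbucket."""
    digest = hashlib.sha1(str(userid).encode("utf-8")).digest()
    vbucket = int.from_bytes(digest[:4], "big") % NUM_VBUCKETS
    return VBUCKET_TO_SHARD[vbucket]
</pre>

Moving users between shards then becomes a matter of re-assigning vbuckets rather than re-hashing every userid.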
Moving all the sharding logic into a separate proxy process gives us a number of advantages:
* Application code is greatly simplified.
* The same code paths are exercised in deployment, testing, and third-party deployments against a single database machine.
* The total number of connections is reduced, along with various connection-related overheads.
* The proxy can do centralized health monitoring of the individual servers.
The proxy process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong. If the active master appears to be down, the proxy will transparently promote the hot standby and start sending queries to it. When the downed master comes back up, it is demoted to being the new standby.
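To make the intent concrete, the promotion logic might look something like the following. This is a sketch only: the health check, the failure threshold, and the notification hook are all assumptions, not part of this proposal:

<pre>
# Sketch of master/standby failover inside the proxy; thresholds and
# helper names are assumptions.
import time
import pymysql

def is_alive(host):
    """Health check: can we connect and run a trivial query?"""
    try:
        conn = pymysql.connect(host=host, user="monitor", password="monitor",
                               connect_timeout=2)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
        finally:
            conn.close()
        return True
    except pymysql.MySQLError:
        return False

def monitor_shard(shard):
    """Promote the hot standby if the active master fails repeated checks."""
    failures = 0
    while True:
        if is_alive(shard["master"]):
            failures = 0
        else:
            failures += 1
            if failures >= 3:
                # Swap roles: the standby becomes master, and the downed
                # master will be the new standby when it comes back up.
                shard["master"], shard["standby"] = shard["standby"], shard["master"]
                failures = 0
                # ...sound the alarm to Ops here...
        time.sleep(5)
</pre>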
'''TODO:''' The failover could be performed manually, if we're a bit leery of infrastructure being too clever for its own good.
'''TODO:''' Just one standby? Two? The principle should be the same regardless of how many we have. Star Topology FTW.
'''TODO:''' We could use the standby as a read slave, but I don't see the point. In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.
'''TODO:''' Depending on our uptime requirements, we might not need to spend the money on establishing this layer. Cross-DC replication and the acceptance of an occasional lost transaction may be a better tradeoff.
=== Other Service Redundancy ===
We don't want any single points of failure, so we'll have to have multiple instances of the webapp talking to multiple instances of the proxy. These are connected via load balancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier a non-event:
<pre>
 +--------------+      +------------------+
 | Web App Tier |----->| Shard Proxy Tier |
 +--------------+      +------------------+
</pre>
For this scheme to work, every Shard Proxy process in every DC needs to agree on where the master is for every shard. That's not trivial. But it should be pretty doable with some replication between the respective ZooKeeper masters in each DC.
We would probably not try to automate whole-DC failover, so that Ops can ensure consistent state before anything tries to send writes to a new location.
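For the shard-to-master mapping itself, something like the following could be kept in ZooKeeper so that every proxy reads the same answer. This is a sketch assuming the kazoo client library, with invented znode paths and hostnames:

<pre>
# Sketch of publishing shard-master locations via ZooKeeper (kazoo client);
# hostnames and znode paths are invented for illustration.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

def set_master(shard_id, address):
    """Ops (or the failover logic) records the current master for a shard."""
    path = "/shards/%d/master" % shard_id
    zk.ensure_path(path)
    zk.set(path, address.encode("utf-8"))

def get_master(shard_id):
    """Every Shard Proxy in every DC reads this to decide where writes go."""
    data, _stat = zk.get("/shards/%d/master" % shard_id)
    return data.decode("utf-8")
</pre>

Cross-DC agreement then reduces to the ZooKeeper replication mentioned above.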
'''TODO:''' How many DCs? The principle should be the same regardless of how many we have. Nested Star Topology FTW.
'''TODO:''' Will we wind up in a situation where different shards have their master db in different DCs? Does it matter?
'''TODO:''' Do we want to try for any sort of geo-locality with data storage, e.g. putting the master for European users in Europe? This could greatly complicate the assignment of users to shards.
== Implementing the Proxy ==
The Sharding Proxy process is obviously key here, and looks like a reasonably complex beast. Can we use an off-the-shelf solution for this? There are some that have most of the required features, e.g. ScaleBase.
On the other hand, the feature set seems small enough that we could realistically implement the proxy in-house, with the benefit of tighter focus and greater control over the details of monitoring, failover, replication etc.
== Things To Think About ==
I've tried to strike a balance between operational simplicity, application simplicity, and functionality here. We pay a price for it:
* There are quite a few moving parts here. In particular, the proxy process has a few different, interacting responsibilities.
* There's the potential for severe cross-DC latency if you receive an HTTP request in one data-center, but have to forward all the MySQL queries over to the master in another data-center. I don't think there's any way around this without going to an eventually-consistent model, which will complicate the client API.
* Are our uptime requirements so tight that we really need a warm-standby in a separate DC? Could we get away with just the hot standby and periodic database dumps into S3, with which we can (slowly) recover from meteor-hit-the-data-center scale emergencies?