This service will need to have a user-facing API to support the login handshake dance, and some private management APIs for managing clusters, assignments, etc. Maybe even a nice friendly admin UI for the ops folks to use.
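To make this concrete, here is a purely illustrative sketch of what the private management API might look like, written as a tiny Flask app.  The endpoint names and payloads are invented for this example; they are not a proposed interface, just a feel for the shape of the thing.

<pre>
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/clusters", methods=["GET"])
def list_clusters():
    # e.g. the active MySQL shard clusters, the hibernation cluster, etc.
    return jsonify(clusters=["mysql-1", "mysql-2", "hibernation"])

@app.route("/assignments/<userid>", methods=["GET"])
def get_assignment(userid):
    # Which cluster (and shard) a given user currently lives on.
    return jsonify(userid=userid, cluster="mysql-1", shard=42)

@app.route("/assignments/<userid>", methods=["PUT"])
def move_user(userid):
    # Ops-only endpoint for migrating a user to a different cluster.
    target = request.get_json()["cluster"]
    return jsonify(userid=userid, cluster=target, status="migrating")

if __name__ == "__main__":
    app.run()
</pre>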
== A Massively-Sharded MySQL Cluster ==
One of the leading options for storage is a massively-sharded MySQL setup, taking advantage of the highly shardable nature of the data set.  This is essentially the storage architecture underlying Firefox Sync, but we could make a lot of operational improvements.

Details here: [[Identity/AttachedServices/StorageServiceArchitecture/MySQLStorageCluster|MySQL Storage Cluster]]

Basic principles:
* Each user is transparently mapped to a shard via e.g. consistent hashing
* All reads and writes for a shard go to a single '''master''' MySQL database.
** This saves us having to deal with conflicting writes and other multi-master headaches.
** To keep things simple, there are no read slaves.  The sharding is the only thing responsible for distributing server load.
** A single DB host machine might hold multiple shards.
* Each master synchronously replicates to one or more '''hot standby''' dbs in the same DC, to guard against individual machine failure.
* One of the standby dbs is periodically snapshotted into S3, to guard against data loss if the whole DC goes down.
* There is no cross-DC replication; if the DC goes down, the cluster becomes unavailable and we might have to restore from S3.
* All sharding logic and management lives in a stand-alone "db router" process, so that it's transparent to the webapp code.
* To start, we should try to implement this using a commercial software product called ScaleBase, which provides much of this functionality off the shelf, but keep in mind the possibility of a custom dbrouter process.

=== What the WebApp Sees ===

From the POV of the webapp code, it's just talking to a regular old MySQL database:

 +---------+          +--------------+
 | Web App |--------->| MySQL Server |
 +---------+          +--------------+

It has to respect a couple of restrictions though:

* No DDL statements are allowed, only regular queries.  Schema changes and sharding don't mix.
* All queries must specify a fixed userid, e.g. by having a "userid = XXX" component in the WHERE clause, or by inserting rows with a known userid.  This enables us to do the sharding transparently.

=== Transparent DB Router ===

The application code is actually talking to a "db router" server that speaks the MySQL wire protocol.  In turn, the router is talking to the individual MySQL servers that are hosting each shard:

 +---------+    +-----------+        +---------------------------+
 | Web App |--->| DB Router |---+--->| MySQL Server for Shard #1 |
 +---------+    +-----------+   |    +---------------------------+
                                |
                                |    +---------------------------+
                                +--->| MySQL Server for Shard #2 |
                                     +---------------------------+

The db router will:

* Receive, parse and validate each incoming query.
* Extract the target userid, erroring out if the query does not have one.
* Look up the shard number and corresponding database for that userid.
* Forward the query to the appropriate database host, and proxy back the results.

The particulars of shard selection/lookup are not defined in this proposal, and are orthogonal to the rest of the setup.  :rfkelly likes the consistent-hashing-plus-vbucket approach taken by couchbase, but it could be as simple as a lookup table.  We assume that the router implements this appropriately and efficiently.
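For concreteness, here is a minimal Python sketch of the consistent-hashing-plus-vbucket idea: hash the userid into a fixed number of "vbuckets", then map vbuckets onto database hosts via a small, rarely-changing table.  The host names, vbucket count and map layout are invented for illustration; they are not part of the proposal.

<pre>
import hashlib

NUM_VBUCKETS = 1024  # fixed at cluster-creation time

# vbucket -> database host.  In production this small table would live in
# shared storage (e.g. ZooKeeper, see below) so that all routers agree on it.
VBUCKET_MAP = {v: "mysql-shard-%d.internal" % (v % 4) for v in range(NUM_VBUCKETS)}

def vbucket_for_user(userid):
    """Hash the userid into a stable vbucket number."""
    digest = hashlib.sha1(str(userid).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS

def database_for_user(userid):
    """Return the database host currently responsible for this user."""
    return VBUCKET_MAP[vbucket_for_user(userid)]

# e.g. the router would forward a parsed query for userid 12345 to:
print(database_for_user(12345))
</pre>

The nice property of the vbucket indirection is that moving a shard to a new host only means repointing a handful of table entries, without rehashing any users.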
Handling all the sharding logic in a separate process gives us a number of advantages:

* Application code is greatly simplified.
** The same code paths are exercised in deployment, in testing, and in third-party deployments against a single database machine.
* The total number of connections is reduced, along with various connection-related overheads.
* The router can do centralized health monitoring of the individual servers, handle failover, etc.

=== Intra-DC Redundancy ===

We need to guard against the loss of any individual server within a DC.  There are separate redundancy schemes for the MySQL servers and for the other supporting services.

==== MySQL Redundancy ====

To guard against the loss of any individual database server, each shard will also have a hot standby database, living in the same DC and configured for synchronous (semi-synchronous?) replication.  For AWS it would be in a separate Availability Zone.  The router monitors the health of the standby database, but does not forward it any queries.  Its only job is to serve as a backup for the active master:

 +---------+    +-----------+        +---------------------+
 | Web App |--->| DB Router |---+--->| Master for Shard #1 |
 +---------+    +-----------+   |    +----------+----------+
                                |               | (replication)
                                |    +----------V---------------+
                                |    | Hot Standby for Shard #1 |
                                |    +--------------------------+
                                |
                                |    +---------------------+
                                +--->| Master for Shard #2 |
                                     +----------+----------+
                                                | (replication)
                                     +----------V---------------+
                                     | Hot Standby for Shard #2 |
                                     +--------------------------+

The router process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong.  If the active master appears to be down, the router will transparently promote the hot standby and start sending queries to it.  When the downed master comes back up, it is demoted to being the new standby.

'''TODO:''' The failover could be performed manually, if we're a bit leery of infrastructure being too clever for its own good.

'''TODO:''' Just one standby?  Two?  The principle should be the same regardless of how many we have.  Star Topology FTW.

'''TODO:''' We could use the standby as a read slave, but I don't see the point.  In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.

==== Other Service Redundancy ====

We don't want any single points of failure, so we'll have to have multiple instances of the webapp talking to multiple instances of the router.  These are connected via loadbalancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier be a non-event:

 +---------------+      +-----------------+
 | Web App Tier  |      | DB Router Tier  |
 |               |      |                 |
 |  +---------+  |      |  +-----------+  |      +---------------------+
 |  | Web App |--|----->|  | DB Router |--|--+-->| Master for Shard #N |
 |  +---------+  |      |  +-----------+  |  |   +----------+----------+
 |               |      |                 |  |              | (replication)
 |  +---------+  |      |  +-----------+  |  |   +----------V---------------+
 |  | Web App |  |      |  | DB Router |  |  +-->| Hot Standby for Shard #N |
 |  +---------+  |      |  +-----------+  |      +--------------------------+
 +---------------+      +-----------------+

Note that we're '''not''' doing this for the MySQL servers.  There are too many of them, and we already have a custom redundancy scheme from the hot standby.

Rendered concretely into AWS, we would have an Elastic Load Balancer and corresponding Auto-Scaling Group for each of these tiers.  The ELB provides a single endpoint for each service to contact the next, while being a magical auto-cloud-managed non-single-point-of-failure itself:

 +-----+     +--------------+     +-----+     +----------------+       +---------------------+
 | ELB |---->|  Auto-Scale  |---->| ELB |---->|   Auto-Scale   |--+--->| Master for Shard #N |
 +-----+     |              |     +-----+     |                |  |    +----------+----------+
             | +---------+  |                 | +------------+ |  |               | (replication)
             | | Web App |  |                 | | DB Router  | |  |    +----------V---------------+
             | +---------+  |                 | +------------+ |  +--->| Hot Standby for Shard #N |
             | +---------+  |                 | +------------+ |       +--------------------------+
             | | Web App |  |                 | | DB Router  | |
             | +---------+  |                 | +------------+ |
             +--------------+                 +----------------+

With multiple DB Router processes, we run into the problem of shared state.  They must all agree on the current mapping of userids to shards, of shards to database machines, and which database machines are master versus standby.  One solution is to have them operate as a ZooKeeper (or similar) cluster to store this state in a consistent and highly-available fashion:

 +----------------------------------------------+
 | DB Router Tier                               |
 |                                              |
 |  +------------------+    +-----------------+ |
 |  | DB Router:       |    | DB Router:      | |
 |  | ZooKeeper Node  <+----+> ZooKeeper Node | |
 |  | Router Process   |    | Router Process  | |
 |  +----|-------------+    +----|------------+ |
 |       |                       |              |
 |       +-----------+-----------+              |
 +-------------------|--------------------------+
                     V
            ...................
            : MySQL Instances :
            :.................:

Note that this shard-state metadata will be very small and updated very infrequently, which should make it very friendly to a local ZooKeeper installation.  We might even be able to provide a nice web-based status view and management console.
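As a rough illustration of how the router processes might share that state, here is a sketch using the Python kazoo client; the ZooKeeper paths, hostnames and map format are made up for the example, and this is not a design commitment.

<pre>
import json

from kazoo.client import KazooClient

MAP_PATH = "/dbrouter/shard-map"  # hypothetical ZooKeeper node

# Connect to the ZooKeeper ensemble embedded in the DB Router tier.
zk = KazooClient(hosts="router1:2181,router2:2181,router3:2181")
zk.start()

def publish_shard_map(shard_map):
    """Publish a new map, e.g. {"0": {"master": "db1", "standby": "db2"}, ...}."""
    zk.ensure_path(MAP_PATH)
    zk.set(MAP_PATH, json.dumps(shard_map).encode("utf-8"))

# Each router keeps a local cache, refreshed automatically via a data watch.
CURRENT_MAP = {}

@zk.DataWatch(MAP_PATH)
def _on_map_change(data, stat):
    # Called once at registration time, and again whenever the node changes.
    if data is not None:
        CURRENT_MAP.clear()
        CURRENT_MAP.update(json.loads(data.decode("utf-8")))
</pre>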
=== Database Snapshots ===

For a final level of redundancy, we periodically snapshot each database into long-term storage, e.g. S3.  We'd likely take the snapshot on the least up-to-date replica, to minimize the chance of impacting production capacity.

As well as providing redundancy, these snapshots allow us to quickly bring up another DB for a particular shard.  E.g. if we lose the hot standby, we can start a fresh one, restore it from a snapshot, then set it to work catching up from that point via standard replication.  We'd use a similar process if we need to move or split shards: bring up a new replica from snapshot, get it up to date, then start sending traffic to it.

=== Inter-DC Redundancy ===

There is no inter-DC redundancy from an availability perspective.  If a DC goes down (e.g. an AWS region outage) then we just tell the client that we're unavailable, come back soon.

For durability, we periodically snapshot the data into offsite long-term storage, e.g. S3.  For a prolonged region outage, we could consider re-creating the entire cluster from these snapshots, but that sounds like an awful lot of work...

'''TODO:''' If we want to spend the money, we could keep replicas on standby in another DC.  I doubt we'll want to spend the money.

=== Implications for the Client ===

Using a single master for each shard means we don't have to worry about conflicts or consistency.  The sharding means this should not be a bottleneck, and the use of an intermediate router process means we can fail over fast if the master goes down.

''However'', since replication and snapshotting are not fully synchronous, there's a chance that recent database writes could be lost in the event of a failure.  The client will see a consistent, but out-of-date, view of its data.  It must be able to recover from such a situation, although we hope this would be a very rare occurrence!

=== Implementing the Router ===

The DB Router process is obviously key here, and looks like a reasonably complex beast.  Can we use an off-the-shelf solution for this?  There are some products that have most of the required features, e.g. ScaleBase.  On the other hand, the feature set seems small enough that we could realistically implement the router in-house, with the benefit of tighter focus and greater control over the details of monitoring, failover, replication, etc.
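To give a feel for what that in-house monitoring and failover logic might involve, here is a very rough sketch of a health-check loop for a single shard.  The shard-state structure, the liveness check and the thresholds are all invented for the example; a real implementation would query MySQL itself and coordinate the role change through the shared shard map.

<pre>
import socket
import time

def mysql_reachable(host, port=3306, timeout=2.0):
    """Cheap liveness check: can we open a TCP connection to the MySQL port?
    A real router would run an actual query and check replication state."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor_shard(shard):
    """`shard` is a dict like {"master": "db1", "standby": "db2"} (hypothetical)."""
    failures = 0
    while True:
        if mysql_reachable(shard["master"]):
            failures = 0
        else:
            failures += 1
        # Only fail over after several consecutive failures, to avoid flapping
        # on transient network blips, and only if the standby looks healthy.
        if failures >= 3 and mysql_reachable(shard["standby"]):
            # Swap roles; in practice this means publishing an updated shard
            # map so that every router agrees on the new master.
            shard["master"], shard["standby"] = shard["standby"], shard["master"]
            failures = 0
        time.sleep(5)
</pre>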
=== Things to Think About ===

* Needs a detailed and careful plan for how we'll bring up new DBs for existing shards, how we'll move shards between DBs, and how we'll split shards if that becomes necessary.  All very doable, just fiddly.
* Increasing the number of shards could be '''very''' tricky.  It might be simpler to:
** spin up a new, bigger cluster using the same architecture
** stop sending new users to the old cluster, and start sending them to the new one
** gradually migrate old users over to the new cluster
** tear down the old cluster when finished

== A Cassandra Cluster ==

Another promising storage option is Cassandra.  It provides a rich-enough data model and automatic cluster management, at the cost of eventual consistency and the vague fear that it will try to do something "clever" when you really don't want it to.  To get strong consistency back, we'd use a locking layer such as ZooKeeper or memcached.

Basic principles:

* There is a single Cassandra storage cluster behind the usual array of webhead machines.
** We set a replication factor of 3 and do LOCAL_QUORUM reads and writes for all queries.
* The Cassandra cluster spans multiple DCs for durability (since it's not clear to me how well it would handle being snapshotted into S3).
** All reads and writes are done in a single datacenter, so that we can enforce consistency.
** Read/write locks are taken in ZooKeeper/memcached, on a per-user basis, to ensure consistency.

From the POV of the webapp code, it's just talking to ZooKeeper and Cassandra as abstract systems:

 +---------+          +-----------+
 | Web App |--------->| ZooKeeper |
 +---------+          +-----------+
      |
      |               +-----------+
      +-------------->| Cassandra |
                      +-----------+
The fact that these are clustered, and that membership may grow/shrink over time, should be transparent.
'''TODO:''' Try to use Route53 to provide consistent names for some of the nodes, to act as introducers even if the entire membership of the cluster has changed
The cluster replicates out to another DC for durability, but everything else stays in the one DC.  If that DC goes down, the cluster becomes unavailable but no data is lost.  We can re-create it in the other DC, or wait for it to come back up.
 +-------------------------------+     +-----------------+
 | US-East                       |     | US-West         |
 |                               |     |                 |
 |  +---------+    +-----------+ |     |                 |
 |  | Web App |--->| ZooKeeper | |     |                 |
 |  +---------+    +-----------+ |     |                 |
 |       |                       |     |                 |
 |       |         +-----------+ |     |  +-----------+  |
 |       +-------->| Cassandra |<-------->| Cassandra |  |
 |                 +-----------+ |     |  +-----------+  |
 +-------------------------------+     +-----------------+
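As a sketch of the per-user serialization described in the principles above, here is roughly what a webhead read-modify-write might look like, assuming the kazoo ZooKeeper client and the DataStax cassandra-driver purely for illustration; the keyspace, table and lock paths are invented.

<pre>
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()
session = Cluster(["cassandra1", "cassandra2"]).connect("storage")

def append_item(userid, new_item):
    # Serialize all read-modify-write cycles for this user via a ZooKeeper
    # lock, so that concurrent webheads can't interleave their updates
    # despite Cassandra's eventual consistency.
    with zk.Lock("/storage/locks/%s" % userid, "webhead-1"):
        read = SimpleStatement(
            "SELECT payload FROM user_data WHERE userid = %s",
            consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        row = session.execute(read, (userid,)).one()
        payload = (row.payload if row else "") + new_item
        write = SimpleStatement(
            "UPDATE user_data SET payload = %s WHERE userid = %s",
            consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        session.execute(write, (payload, userid))
</pre>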
== A Hibernation Cluster ==
If a user doesn't use the service in, say, six months, then we could migrate them out of one of the active clusters and into a special "hibernation cluster".
Data that is moved into this cluster might simply be snapshotted into low-cost storage such as S3.  Or it might get put onto a very crowded, very slow MySQL machine that can only handle a trickle of user requests.
If they come back and try to use their data again, we immediately trigger a migration back to one of the active clusters.
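As a tiny illustration of this policy (the inactivity threshold and the shape of the input data are hypothetical), the migration job might be little more than:

<pre>
import datetime

HIBERNATION_THRESHOLD = datetime.timedelta(days=180)  # "say, six months"

def find_hibernation_candidates(users, now=None):
    """Yield userids whose last_active timestamp is older than the threshold.

    `users` is assumed to be an iterable of (userid, last_active) pairs,
    e.g. pulled from the cluster-management database.
    """
    now = now or datetime.datetime.utcnow()
    for userid, last_active in users:
        if now - last_active > HIBERNATION_THRESHOLD:
            yield userid

# Example: user 1 is stale and gets queued for hibernation, user 2 does not.
now = datetime.datetime.utcnow()
stale = now - datetime.timedelta(days=300)
print(list(find_hibernation_candidates([(1, stale), (2, now)], now)))
</pre>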