Identity/AttachedServices/StorageServiceArchitecture

== Intra-DC Redundancy ==

We need to guard against the loss of any individual server within a DC. There are separate redundancy schemes for the MySQL servers, and for the supporting infrastructure.

=== MySQL Redundancy: Hot Standby Databases ===
To guard against the loss of any individual database server, each shard will also have a hot standby database, living in the same DC and configured for synchronous (semi-synchronous?) replication. The proxy monitors the health of the standby database, but does not forward it any queries. Its only job is to serve as a backup for the active master.
The proxy process is responsible for monitoring the health of these machines and sounding the alarm if something goes wrong. If the active master appears to be down, the proxy will transparently promote the hot standby and start sending queries to it. When the downed master comes back up, it is demoted to being the new standby.
 
'''TODO:''' Just one standby? Two? The principle should be the same regardless of how many we have.
'''TODO:''' We could use the standby as a read slave, but I don't see the point. In a failure scenario the master needs to be able to handle the entire read load on its own, so it might as well do that all the time.
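
As a concrete illustration of the failover behaviour described above, here is a minimal sketch of the promote-on-failure logic; the class and helper names are hypothetical, not taken from any actual proxy implementation:

<source lang="python">
# Sketch only: ShardFailover and is_healthy are illustrative names, not a real proxy API.

class ShardFailover:
    """Tracks which MySQL server is the active master for a single shard."""

    def __init__(self, master, standby, is_healthy):
        self.master = master          # address of the active master
        self.standby = standby        # address of the hot standby
        self.is_healthy = is_healthy  # callable(addr) -> bool, e.g. a MySQL ping

    def route_query(self):
        """Return the server that should receive queries right now."""
        if not self.is_healthy(self.master):
            # Promote the standby; when the downed master comes back up it
            # simply finds itself holding the standby role.
            self.master, self.standby = self.standby, self.master
        return self.master


if __name__ == "__main__":
    # Simulate the active master going down: only db2 answers pings.
    alive = {"db2:3306"}
    failover = ShardFailover("db1:3306", "db2:3306", lambda addr: addr in alive)
    print(failover.route_query())  # -> "db2:3306", the standby has been promoted
</source>

As the TODO above notes, the same principle extends to more than one standby: the proxy would simply promote the first healthy candidate.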
=== Intra-Tier Redundancy for Other Services ===
We don't want a single point of failure, so we'll need multiple instances of the webapp talking to multiple instances of the proxy. These are connected via load balancing, virtual IPs, and whatever Ops wizardry is required to make single-machine failures in each tier a non-event:
(Diagram: load-balanced Web App and Shard Proxy tiers in front of the MySQL instances.)
 
Note that this shard-state metadata will be very small and updated very infrequently, which should make it very friendly to a local ZooKeeper installation.
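
For illustration, here is a hedged sketch of how a proxy might publish and watch that shard-state metadata in a local ZooKeeper installation using the kazoo client; the znode path and JSON layout are assumptions made up for this example:

<source lang="python">
# Sketch only: the znode path and JSON payload are illustrative assumptions.
import json

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # local ZooKeeper installation
zk.start()

SHARD_PATH = "/storage/shards/42"  # hypothetical znode for shard #42

# Publish the shard's current state (written rarely, e.g. only on failover).
zk.ensure_path(SHARD_PATH)
zk.set(SHARD_PATH, json.dumps({
    "master": "db1.us-east.example.com:3306",
    "standby": "db2.us-east.example.com:3306",
}).encode("utf-8"))

# Each proxy watches the znode, so it notices a failover without polling.
@zk.DataWatch(SHARD_PATH)
def on_shard_state_change(data, stat):
    if data is not None:
        state = json.loads(data.decode("utf-8"))
        print("shard 42 master is now", state["master"])

zk.stop()
</source>

Because the data is a few hundred bytes and changes only on failover, the read and watch traffic this generates should be negligible.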
 
 
== Inter-DC Redundancy ==
 
We'll replicate the entire stack into several data centers, each of which will maintain a full copy of all shards.

One DC will be the active master for each shard. All reads and writes for that shard will be forwarded into that DC and routed to the master. (This will save us a ''world of pain'' by not having multiple conflicting writes going into different DCs.) Other DCs are designated as warm-standby hosts for that shard, configured for asynchronous WAN replication. They can be failed over to if there is a serious outage in the master DC, but this will almost certainly result in the loss of some recent transactions (see the routing sketch after the diagram below):
 
+----------------------------------------------------------------------------------+
| US-East Data Center |
| |
| +--------------+ +------------------+ |
| | Web App Tier | | Shard Proxy Tier | +---------------------+ |
| | | | | +-->| Master for Shard #N |-------|-----+
| | +---------+ | | +-------------+ | | +----------+----------+ | |
| | | Web App | |--->| | Shard Proxy | |-----+ | (replication) | |
| | +---------+ | | +-------------+ | | +----------V---------------+ | |
| | +---------+ | | +-------------+ | +-->| Hot Standby for Shard #N | | | ( v e r y )
| | | Web App | | | | Shard Proxy | | +--------------------------+ | | ( s l o w )
| | +---------+ | | +-------------+ | | | ( r e p l i c a t i o n )
| +--------------+ +------------------+ | |
+----------------------------------------------------------------------------------+ |
|
|
+------------------------------------------------------------------------------------+ |
| US-West Data Center | |
| | |
| +--------------+ +------------------+ | |
| | Web App Tier | | Shard Proxy Tier | +---------------------------+ | |
| | | | | +-->| Warm Standby for Shard #N |<--|---+
| | +---------+ | | +-------------+ | | +----------+----------------+ |
| | | Web App | |--->| | Shard Proxy | |-----+ | (replication) |
| | +---------+ | | +-------------+ | | +----------V-----------------+ |
| | +---------+ | | +-------------+ | +-->| Tepid Standby for Shard #N | |
| | | Web App | | | | Shard Proxy | | +----------------------------+ |
| | +---------+ | | +-------------+ | |
| +--------------+ +------------------+ |
+------------------------------------------------------------------------------------+
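
The routing rule in the diagram boils down to a per-shard lookup: every read and write goes to the shard's master DC, and a standby DC is used only during a declared outage there. A minimal sketch, where the shard-to-DC table and the function name are assumptions for illustration:

<source lang="python">
# Sketch only: the table contents and names are illustrative, not real configuration.

SHARD_DC_MAP = {
    # shard id -> (master DC, ordered list of standby DCs)
    42: ("us-east", ["us-west"]),
    43: ("us-west", ["us-east"]),
}


def dc_for_request(shard_id, failed_dcs=frozenset()):
    """Pick the DC that should service all reads and writes for this shard.

    Traffic always goes to the shard's master DC; a standby DC is used only
    when the master DC has been declared down, accepting the likely loss of
    recent, not-yet-replicated transactions.
    """
    master_dc, standby_dcs = SHARD_DC_MAP[shard_id]
    if master_dc not in failed_dcs:
        return master_dc
    for dc in standby_dcs:
        if dc not in failed_dcs:
            return dc
    raise RuntimeError("no available DC for shard %d" % shard_id)


print(dc_for_request(42))                          # -> "us-east"
print(dc_for_request(42, failed_dcs={"us-east"}))  # -> "us-west"
</source>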