Loop/Architecture/Redis

From MozillaWiki

Note that the changes proposed here are preliminary, and will be updated as the result of conversations with interested parties.

Date Author Changes
Wed Sep 17 11:48:12 CDT 2014 Rémy Hubscher Brainstorming about using Redis as a Key-Value data store
Wed Sep 17 12:18:12 CDT 2014 Alexis Métaireau Adding information about how the data flows, and a general overview of the problems.

What are the problems we're trying to solve?

Currently, the Loop server uses Redis exclusively to store its data. It accesses it via ElastiCache, with only one master and one replica.

The problems:

  • Redis can reach capacity and we currently don't do anything about it (sharding is one potential answer to that concern);
  • In case the redis master goes down, we don't have any mitigation plan.

Data flow

We currently use Redis to:

  • Store long-lived data:
    • data identifying the user (data from the FxA / MSISDN assertions);
    • simple push urls;
    • call-url associated data;
    • user sessions.
  • Store short-lived data:
    • Call state information;
    • Calls information.
  • Handle the pubsub between call parties in the call setup protocol (progressurl)
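The long-lived vs. short-lived split above can be sketched as a key/TTL schema. This is an illustrative layout, not the actual loop-server schema: the key names and TTL values are assumptions.

```python
# Hypothetical key layout for the data listed above; key names and TTLs
# are illustrative, not the actual loop-server schema.
LONG_LIVED = None  # kept until explicitly deleted

KEY_TTLS = {
    "userid.{hawkId}": LONG_LIVED,     # FxA / MSISDN identity data
    "spurl.{userId}": LONG_LIVED,      # simple push URLs
    "callurl.{token}": LONG_LIVED,     # call-url associated data
    "hawk.{hawkId}": LONG_LIVED,       # user sessions
    "callstate.{callId}": 60,          # call state, expires quickly
    "call.{callId}": 600,              # call information
}

def write_command(key, value, ttl):
    """Pick the Redis command: SETEX for short-lived data (sets the
    value and its expiry atomically), plain SET for long-lived data."""
    if ttl is None:
        return ("SET", key, value)
    return ("SETEX", key, ttl, value)
```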

When a call takes place,

  1. The user registers on the server:
    1. (read) If a session is passed, check that it is valid;
    2. (write) A simple push URL is stored;
    3. (write) If no session is passed, a new one is created.
  2. The user creates a call-url:
    1. (write) The session is touched;
    2. (write) A new call-url is stored.
  3. The user checks which calls are active:
    1. (write) The session is touched;
    2. (read) The list of calls is retrieved.
  4. A new call is issued:
    1. (write, optional) The session is touched, if a session exists;
    2. (read) The simple push URLs of the callee are retrieved;
    3. (write) A new call is created.
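A minimal sketch of a few of the steps above, against an in-memory stand-in for Redis. Function and key names are illustrative assumptions, not the loop-server API:

```python
# In-memory sketch of the register / create call-url / issue call flow.
# `store` stands in for Redis; names are hypothetical.
import uuid

store = {}  # key -> value

def register(session=None, push_url=None):
    if session is not None:
        assert session in store                   # (read) validate session
    else:
        session = uuid.uuid4().hex
        store[session] = {"push_urls": []}        # (write) create a session
    store[session]["push_urls"].append(push_url)  # (write) store push URL
    return session

def create_call_url(session):
    store[session]["touched"] = True              # (write) touch the session
    token = uuid.uuid4().hex
    store["callurl." + token] = session           # (write) store the call-url
    return token

def issue_call(callee_session, caller="caller"):
    push_urls = store[callee_session]["push_urls"]  # (read) callee push URLs
    call_id = uuid.uuid4().hex
    store["call." + call_id] = {"from": caller}     # (write) create the call
    return call_id, push_urls

# Usage: register a callee, create a call-url, then place a call.
session = register(push_url="https://push.example.com/x")
token = create_call_url(session)
call_id, push_urls = issue_call(session)
```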

Architecting ElastiCache Redis Tier

What does ElastiCache give us?

  • A master with up to 5 read-only replicas. If the master goes down, the data is still available read-only.
  • It is possible to promote a read-only replica as the new master using an API call:
https://elasticache.us-east-1.amazonaws.com/
  ?Action=ModifyReplicationGroup
  &ReplicationGroupId=my-repgroup
  &PrimaryClusterId=my-replica-1  
  &Version=2013-06-15
  &SignatureVersion=4
  &SignatureMethod=HmacSHA256
  &Timestamp=20140421T220302Z
  &X-Amz-Algorithm=AWS4-HMAC-SHA256
  &X-Amz-Date=20140421T220302Z
  &X-Amz-SignedHeaders=Host
  &X-Amz-Expires=20140421T220302Z
  &X-Amz-Credential=<credential>
  &X-Amz-Signature=<signature>
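The core of the request above can be reconstructed as a query string with the standard library. The signature fields are omitted here because they would normally be computed by an AWS SigV4 signer:

```python
# Rebuild the promotion request shown above (minus the SigV4 signature
# fields, which a real client would compute and append).
from urllib.parse import urlencode

params = {
    "Action": "ModifyReplicationGroup",
    "ReplicationGroupId": "my-repgroup",
    "PrimaryClusterId": "my-replica-1",  # replica to promote to master
    "Version": "2013-06-15",
}
url = "https://elasticache.us-east-1.amazonaws.com/?" + urlencode(params)
```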
  • The maximum ElastiCache size is 237GB

What are our options?

First of all, we need to think about sharding: 60GB is too tight a constraint, since we keep adding data that stays in the database longer than before (rooms-url, sessionToken, call-urls).

We need sharding to make sure we will continue to scale up our cluster even if we need to store more than 237GB of data in it.

My plan is to build a cluster of 5 small ElastiCache Redis instances (and their read-only replicas) that we will be able to scale up using TwemProxy.

TwemProxy will be responsible for sharding the keys. I will also create a worker that makes sure each node of the master cluster doesn't turn into a read-only master. If that happens, an API call will be made on the cluster to promote a replica as the new master.
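A TwemProxy pool for such a cluster could look like the fragment below. The hostnames and tuning values are illustrative, not our actual deployment config:

```yaml
# Hypothetical TwemProxy (nutcracker) pool for the 5-node Redis cluster.
loop:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama      # consistent hashing across the shards
  redis: true               # speak the Redis protocol, not memcached
  auto_eject_hosts: false   # don't silently drop a failing shard
  servers:
   - redis-node-1.example.com:6379:1
   - redis-node-2.example.com:6379:1
   - redis-node-3.example.com:6379:1
   - redis-node-4.example.com:6379:1
   - redis-node-5.example.com:6379:1
```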

What do we need to do on redis client side?

  • First we need to add TwemProxy to our stack (deploy and configure it on each loop-server node).
  • Then we need to replace every multi-key Redis call with per-key async calls, so that TwemProxy is able to shard the request for each key.
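The multi-key replacement could look like the sketch below: one MGET becomes concurrent single-key GETs, each of which TwemProxy can route to its own shard. The client is stubbed here; a real deployment would point a Redis client at the proxy:

```python
# Replace one multi-key MGET with concurrent single-key GETs, so each
# key can be sharded independently by TwemProxy. The backend is stubbed.
import asyncio

FAKE_DATA = {"spurl.alice": "https://push/1", "spurl.bob": "https://push/2"}

async def get(key):
    # Stand-in for a single-key GET issued through TwemProxy.
    return FAKE_DATA.get(key)

async def mget(keys):
    # One GET per key, issued concurrently; results keep the input order.
    return await asyncio.gather(*(get(k) for k in keys))

values = asyncio.run(mget(["spurl.alice", "spurl.bob"]))
```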

During the promotion phase, the loop-server will return a 503 with a Backoff header that lets the loop-client know how long it should wait before retrying. This window is the time needed to promote the new master.
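The client-side handling could be sketched as below. The exact header name and response shapes are assumptions for illustration:

```python
# Sketch of client-side handling: on a 503 with a Backoff header, wait
# the advertised number of seconds before retrying. Header name and the
# fake responses are illustrative.
import time

def request_with_backoff(do_request, max_retries=3, sleep=time.sleep):
    for _ in range(max_retries):
        status, headers, body = do_request()
        if status == 503 and "Backoff" in headers:
            sleep(int(headers["Backoff"]))  # wait out the failover window
            continue
        return status, body
    raise RuntimeError("server still failing over after retries")

# Usage: the first attempt hits the failover window, the retry succeeds.
responses = iter([(503, {"Backoff": "30"}, ""), (200, {}, "ok")])
status, body = request_with_backoff(lambda: next(responses),
                                    sleep=lambda s: None)
```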


Links / Resources