CloudServices/Sagrada/TokenServer

From MozillaWiki

Goals

tl;dr: provide a centralized login service.

See: http://docs.services.mozilla.com/token/index.html#goal-of-the-service

APIs

See http://docs.services.mozilla.com/token/apis.html


Proposed Design

This solution proposes a token-based authentication system. A user who wants to connect to one of our services asks a central server for an access token.

The central server, a.k.a. the Login Server, checks the authenticity of the user with a supported authentication method and assigns the user a server to use with that token.

The server that gets called, a.k.a. the Service Node, checks the validity of the token included in the request. Tokens have a limited lifespan.


Definitions and assumptions

See http://docs.services.mozilla.com/token/index.html#assumptions

Flow

See http://docs.services.mozilla.com/token/user-flow.html

Authorization token

A token is a JSON-encoded mapping. The keys of the Authorization Token are:

  • expires: an expiry timestamp (UTC); defaults to the current time + 30 minutes
  • uid: the app-specific user id (the user id integer in the case of sync)
  • salt: a randomly-generated salt for use in the calculation of the Token Secret (optional)
  • node: the name of the service node to which the user is assigned

Example:

 auth_token = {"uid": 123, "node": "https://sync-1.services.mozilla.com", "expires": 1324654308.907832, "salt": "sghfwq6875765..UYgs"}  


The token is signed using the Signing Secret and then base64-encoded. The signature is HMAC-SHA256:

 auth_token, signature = HMAC-SHA256(auth_token, sig_secret)
 auth_token = b64encode(auth_token, signature)

The authorization token is signed but not encrypted, so its contents are readable by anyone who holds it.
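The signing and checking steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not the tokenserver's actual code: the function names `make_token` and `check_token` are hypothetical, and the wire layout (JSON payload followed by the raw 32-byte signature, all URL-safe base64-encoded) is an assumption, since the pseudocode above does not pin it down.

```python
import base64
import hashlib
import hmac
import json
import time

SIG_LEN = hashlib.sha256().digest_size  # 32 bytes for HMAC-SHA256


def make_token(uid, node, sig_secret, ttl=30 * 60, now=None):
    """Build, sign and base64-encode an authorization token (hypothetical helper)."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"uid": uid, "node": node, "expires": now + ttl},
        sort_keys=True,
    ).encode("utf-8")
    signature = hmac.new(sig_secret, payload, hashlib.sha256).digest()
    # Assumed layout: JSON payload followed by the raw signature bytes.
    return base64.urlsafe_b64encode(payload + signature).decode("ascii")


def check_token(token, sig_secret, now=None):
    """Return the token mapping; raise ValueError if it is invalid or expired."""
    now = time.time() if now is None else now
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    payload, signature = raw[:-SIG_LEN], raw[-SIG_LEN:]
    expected = hmac.new(sig_secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    data = json.loads(payload)
    if data["expires"] <= now:
        raise ValueError("token expired")
    return data


if __name__ == "__main__":
    secret = b"example-signing-secret"  # placeholder, not a real secret
    token = make_token(123, "https://sync-1.services.mozilla.com", secret)
    print(check_token(token, secret))
```

Because the payload is only signed, a Service Node can read `uid`, `node` and `expires` directly after verifying the signature, without any decryption step.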

Secrets

Each Service Node has a unique Master Secret that it shares with the Login Server, which is used to sign and validate authentication tokens. Multiple secrets can be active at any one time to support a graceful rollover to a new secret.

To simplify management of these secrets, the tokenserver maintains a single list of master secrets and derives a secret specific to each node using HKDF:

  • node-info = "services.mozilla.com/mozsvc/v1/node_secret/" + node-name
  • node-master-secret = HKDF(master-secret, salt=None, info=node-info, size=digest-length)

The node-specific Master Secret is used to derive keys for various cryptographic routines. At startup time, the Login Server and Node should pre-calculate and cache the signing key as follows:

  • sig-secret: HKDF(node-master-secret, salt=None, info="SIGNING", size=digest-length)

By using no salt (or a fixed salt), these secrets can be calculated once and then reused for each request.

When issuing or checking an Auth Token, the corresponding Token Secret is calculated as:

  • token-secret: b64encode(HKDF(node-master-secret, salt=token-salt, info=auth-token, size=digest-length))

Note that the token-secret is base64-encoded for ease of transmission back to the client.
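The derivation chain above can be sketched with a small HKDF (RFC 5869) implementation built on the standard library. This is an illustration under assumptions rather than the mozsvc code: the helper names are hypothetical, HMAC-SHA256 is assumed as the digest, and all inputs are assumed to be byte strings (how mozsvc encodes master secrets, salts and tokens is not specified here).

```python
import base64
import hashlib
import hmac

DIGEST = hashlib.sha256
DIGEST_LEN = DIGEST().digest_size  # "digest-length" in the text: 32 bytes


def hkdf(secret, salt, info, size=DIGEST_LEN):
    """HKDF extract-then-expand (RFC 5869) using HMAC-SHA256."""
    if salt is None:
        salt = b"\x00" * DIGEST_LEN  # RFC 5869 default when no salt is given
    prk = hmac.new(salt, secret, DIGEST).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < size:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), DIGEST).digest()
        okm += block
        counter += 1
    return okm[:size]


def node_master_secret(master_secret, node_name):
    """Derive the node-specific master secret from the shared master secret."""
    info = ("services.mozilla.com/mozsvc/v1/node_secret/" + node_name).encode("utf-8")
    return hkdf(master_secret, None, info)


def signing_secret(node_secret):
    """Derive the per-node signing key; can be cached at startup."""
    return hkdf(node_secret, None, b"SIGNING")


def token_secret(node_secret, token_salt, auth_token):
    """Derive the per-token secret, base64-encoded for transmission."""
    return base64.b64encode(hkdf(node_secret, token_salt, auth_token))


if __name__ == "__main__":
    master = b"master-secret-one"  # placeholder value
    node = node_master_secret(master, "https://sync-1.services.mozilla.com")
    print(signing_secret(node).hex())
```

Since the node name is part of the `info` string, two nodes never share a derived secret even though the Login Server holds only one master secret list.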


Configuring Secrets

The tokenserver should be configured to use the DerivedSecrets class with the list of master secrets:

   [tokenserver]
   secrets.backend = mozsvc.secrets.DerivedSecrets
   secrets.master_secrets = master-secret-one master-secret-two

A suitable master secret can be generated using mozsvc as follows:

   python -m mozsvc.secrets new

Each node should be configured to use the FixedSecrets class and its corresponding derived secret:

   [hawkauth]
   secrets.backend = mozsvc.secrets.FixedSecrets
   secrets.secrets = node-master-secret-one, node-master-secret-two

This prevents a compromise on one service node from leaking the secrets on all nodes. A suitable node-specific secret can be derived from the master secret as follows:

   python -m mozsvc.secrets derive <master_secret> https://<node_name>


Secret Update Process

To revoke the secrets for a specific node, simply rename it so that its derived secret will be different.

To update the master secrets, the following procedure should be used:

1) Generate the new master secret, but keep the old one as well for now

2) For each storage node, derive both the new and old node-specific secrets and push them out, so that its config file looks like this:

    [hawkauth]
    secrets.backend = mozsvc.secrets.FixedSecrets
    secrets.secrets = <old-derived-node-secret-as-hex> <new-derived-node-secret-as-hex>

Restart it. It is now able to accept tokens signed with either secret.

3) For each tokenserver webhead, update it with the new master secret, removing the old one. Its config file will look like:

    [tokenserver]
    secrets.backend = mozsvc.secrets.DerivedSecrets
    secrets.master_secrets = <new-master-secret-as-hex>

Restart it. It now generates tokens signed with the new derived secrets.

4) Discard the old master secret.

5) Wait for one token expiration period, e.g. five minutes.

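6) For each storage node, derive just the new node-specific secret and push it out, so that its config file looks like this:

    [hawkauth]
    secrets.backend = mozsvc.secrets.FixedSecrets
    secrets.secrets = <new-derived-node-secret-as-hex>

Restart it. It now accepts only tokens signed with the new secret.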

Pulling a secret

To invalidate a secret immediately, add a new secret as described above but prune the old secrets right away, so that any tokens already issued are instantly rejected.
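The difference between a graceful rollover and pulling a secret comes down to which secrets a node keeps in its active list. The sketch below illustrates this with a hypothetical `verify_with_any` helper; it is not the mozsvc implementation, and the secret values are placeholders.

```python
import hashlib
import hmac


def sign(payload, secret):
    """HMAC-SHA256 signature of a token payload."""
    return hmac.new(secret, payload, hashlib.sha256).digest()


def verify_with_any(payload, signature, active_secrets):
    """Return True if any currently configured secret validates the signature."""
    return any(
        hmac.compare_digest(signature, sign(payload, secret))
        for secret in active_secrets
    )


if __name__ == "__main__":
    old, new = b"old-derived-secret", b"new-derived-secret"  # placeholders
    payload = b'{"uid": 123}'
    old_sig = sign(payload, old)

    # Graceful rollover: the node lists both secrets, so old tokens still pass.
    print(verify_with_any(payload, old_sig, [old, new]))

    # Pulled secret: the old secret is pruned at once, so old tokens fail.
    print(verify_with_any(payload, old_sig, [new]))
```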

Backward Compatibility

The Login Server uses the same snode and LDAP servers, so both authentication systems can coexist during a transition period.

Infra/Scaling

On the Login Server

The flow is:

  1. the user asks for a token, with a BrowserID assertion
  2. the server verifies the assertion locally [CPU bound]
  3. the server calls the User DB [I/O bound]
  4. the server calls the Node Assignment Server [I/O bound] (optional)
  5. the server builds the token and sends it back [CPU bound]
  6. the user uses the node for the duration of the TTL (30 minutes)

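With a 30-minute TTL, each active user requests a new token twice per hour. For 100k users that means 200k requests per hour on the Login Server, or roughly 50 RPS (200,000 / 3,600 ≈ 56). The load scales linearly: roughly 500 RPS for 1M users, 5,000 RPS for 10M users, and 50,000 RPS for 100M users.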
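The estimate above can be reproduced with a few lines of arithmetic. The sketch assumes every user is continuously active and renews a token exactly once per TTL, which is a worst-case simplification; `login_rps` is a hypothetical helper name.

```python
def login_rps(users, ttl_seconds=30 * 60):
    """Average Login Server requests per second if each user renews once per TTL."""
    # users * (3600 / ttl) requests per hour, divided by 3600 seconds per hour.
    return users / ttl_seconds


if __name__ == "__main__":
    for users in (100_000, 1_000_000, 10_000_000, 100_000_000):
        print(f"{users:>11,} users: {login_rps(users):,.0f} RPS")
```

The exact figures come out about 11% higher than the rounded numbers in the text, and real traffic will be burstier than this average.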


Deployment

  • A Login Server is stateless, so we can deploy as many as we want and have Zeus load balance over them
  • A Login Server sees all secrets, so it can be cross-cluster / cross-datacenter
  • The shared secrets files can stay in memory -- updating the files should ping the app so we reload them
  • The User DB is the current LDAP, and may evolve into a more specialised metadata DB later

On each Service Node

The flow is:

  1. the server checks the token [CPU bound]
  2. the server processes the request [Sync = I/O bound]


Phase 1

[End of January? Need to check with ally]

End to end prototype with low-level scaling

  • Fully defined API, including headers and errors
  • Assigns Nodes
  • Maintains Node state for a user (in the existing LDAP)
  • Issues valid tokens
  • Downs nodes if needed

Phase 2

[End of Q1?]

Scalable implementation of the above in place.

  • Migration
  • Operational support scripts (TBD)
  • Logging and Metrics


Implementation details

  • The Token Server web service is implemented using Cornice and Pyramid, and sends crypto work to a crypto service via zmq.
  • The Crypto worker is a C++ program using cryptopp.


token.png