CloudServices/Sagrada/TokenServer: Difference between revisions
Revision as of 21:44, 27 January 2014
Goals
tl;dr: provide a centralized login service.
See: http://docs.services.mozilla.com/token/index.html#goal-of-the-service
APIs
See http://docs.services.mozilla.com/token/apis.html
Proposed Design
This design uses a token-based authentication system. A user who wants to connect to one of our services asks a central server for an access token.
The central server, a.k.a. the Login Server, checks the authenticity of the user with a supported authentication method, and assigns the user a server to use with that token.
The server that gets called, a.k.a. the Service Node, checks the validity of the token included in the request. Tokens have a limited lifespan.
Definitions and assumptions
See http://docs.services.mozilla.com/token/index.html#assumptions
Flow
See http://docs.services.mozilla.com/token/user-flow.html
Authorization token
A token is a JSON-encoded mapping. The keys of the Authorization Token are:
- expires: an expiration timestamp (UTC); defaults to the current time + 30 min
- uid: the app-specific user id (the user id integer in the case of sync)
- salt: a randomly-generated salt for use in the calculation of the Token Secret (optional)
- node: the name of the service node to which the user is assigned
Example:
auth_token = {"uid": 123, "node": "https://sync-1.services.mozilla.com", "expires": 1324654308.907832, "salt": "sghfwq6875765..UYgs"}
The token is signed using the Signing Secret and base64-ed. The signature is HMAC-SHA256:
auth_token, signature = HMAC-SHA256(auth_token, sig_secret)
auth_token = b64encode(auth_token, signature)
The authorization token is not encrypted.
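The signing step above can be sketched in Python as follows. The text does not say how the payload and the signature are combined before base64 encoding, so this sketch assumes the raw 32-byte HMAC-SHA256 digest is appended to the JSON payload and the result is URL-safe base64-encoded. The function names make_token and parse_token and the example secret are hypothetical.

```python
import base64
import hashlib
import hmac
import json

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def make_token(payload, sig_secret):
    """Serialize, sign, and encode a token (assumed layout: JSON + digest)."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    signature = hmac.new(sig_secret, body, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(body + signature).decode("ascii")


def parse_token(token, sig_secret):
    """Return the payload if the signature matches, else raise ValueError."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    body, signature = raw[:-DIGEST_SIZE], raw[-DIGEST_SIZE:]
    expected = hmac.new(sig_secret, body, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("invalid token signature")
    return json.loads(body.decode("utf-8"))


payload = {"uid": 123, "node": "https://sync-1.services.mozilla.com",
           "expires": 1324654308.907832, "salt": "sghfwq6875765..UYgs"}
token = make_token(payload, b"example-signing-secret")  # placeholder secret
```

Because the token is only signed, anyone holding it can base64-decode it and read the payload; the signature only prevents tampering.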
Secrets
Each Service Node has a unique Master Secret per Node it serves, which it shares with the Login Server. A Master Secret is a timestamp rounded to the second, followed by a colon and a pseudo-random hex string of 256 characters from [a-f0-9].
Example of generating such string:
>>> import binascii, os, time
>>> print('%d:%s' % (int(time.time()), binascii.b2a_hex(os.urandom(256))[:256].decode('ascii')))
1326322983:646dc48...4ad86dca82d
(XXX crypto review required, not sure if this is the best/correct way to use HKDF for this purpose)
The Master Secret is used to derive keys for various cryptographic routines. At startup time, the Login Server and Node should pre-calculate and cache the signing key as follows:
- sig-secret: HKDF(master-secret, salt=None, info="SIGNING", size=digest-length)
By using no salt (or a fixed salt), these secrets can be calculated once and then used for each request.
When issuing or checking an Auth Token, the corresponding Token Secret is calculated as:
- token-secret: b64encode(HKDF(master-secret, salt=token-salt, info=auth-token, size=digest-length))
Note that the token-secret is base64-encoded for ease of transmission back to the client.
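The two derivations above can be sketched with the standard library alone. This is a minimal HKDF (RFC 5869) built on HMAC-SHA256, not the project's actual code. It assumes SHA-256 as the hash (so digest-length is 32 bytes), that the master secret is used as raw bytes, and that auth-token means the encoded token string; the master secret value shown is a placeholder.

```python
import base64
import hashlib
import hmac

HASH = hashlib.sha256
DIGEST_LENGTH = HASH().digest_size  # 32 bytes


def hkdf(master_secret, salt, info, size):
    """HKDF (RFC 5869) extract-then-expand with HMAC-SHA256."""
    if salt is None:
        salt = b"\x00" * DIGEST_LENGTH  # RFC 5869 default when no salt is given
    prk = hmac.new(salt, master_secret, HASH).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < size:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:size]


master_secret = b"1326322983:646dc48"  # placeholder value, not a real secret

# No salt is involved, so this can be computed once at startup and cached.
sig_secret = hkdf(master_secret, None, b"SIGNING", DIGEST_LENGTH)


def token_secret(auth_token, token_salt):
    """Per-token secret; auth_token and token_salt are bytes."""
    derived = hkdf(master_secret, token_salt, auth_token, DIGEST_LENGTH)
    return base64.b64encode(derived)
```

The token secret depends on both the token and its salt, so it has to be recomputed for every token, unlike the cached signing key.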
Ops create secrets for each Node, and maintain for each cluster a file containing all secrets. The file is deployed on the Login Server and on each Service Node. The Login Server has the files for all clusters.
Each file is a CSV file called /var/moz/shared_secrets/CLUSTER, where CLUSTER is the name of the cluster.
Example:
phx1,1326322983:secret
phx2,1326322990:secret
...
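A file in this layout can be read with Python's csv module. The sketch below assumes one node per line, with the node name in the first column and one or more timestamp:secret entries in the remaining columns; the function name parse_secrets is hypothetical.

```python
import csv
import io


def parse_secrets(lines):
    """Map each node name to its list of 'timestamp:secret' entries."""
    secrets = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        node, entries = row[0], row[1:]
        secrets[node] = entries
    return secrets


# In-memory stand-in for a cluster file, using the example values above.
sample = io.StringIO("phx1,1326322983:secret\nphx2,1326322990:secret\n")
secrets = parse_secrets(sample)
```

With a real file, the same function can be given the file object returned by open(path, newline="").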
Secret Update Process
When an existing secret needs to be changed for whatever reason, Ops can add new secrets to the file.
The new secret is appended to the Node's line in each file:
phx1,1326322983:secret,1326324523:secret
phx2,1326322990:secret
...
The Service Nodes are updated first, then the Login Server, so that new tokens are immediately recognized by the Nodes.
The Service Node sorts the secrets by timestamp and tries the newest one first, then falls back to the next one if the token cannot be validated.
The Login Server always works with the newest secret, so it ignores older secrets when it creates tokens. Old secrets are pruned eventually.
The Login Server and Service Node applications should watch the files and reload them when they change.
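The newest-first fallback can be sketched as follows. For brevity this sketch signs with the timestamp:secret entry directly, whereas the design derives a signing key from it with HKDF; the function names and secret values are hypothetical.

```python
import hashlib
import hmac


def sort_newest_first(entries):
    """Order 'timestamp:secret' entries by timestamp, newest first."""
    return sorted(entries, key=lambda e: int(e.split(":", 1)[0]), reverse=True)


def find_valid_secret(body, signature, entries):
    """Return the first secret (newest first) whose HMAC-SHA256 matches, or None."""
    for entry in sort_newest_first(entries):
        expected = hmac.new(entry.encode("ascii"), body, hashlib.sha256).digest()
        if hmac.compare_digest(signature, expected):
            return entry
    return None


entries = ["1326322983:old-secret", "1326324523:new-secret"]
body = b'{"uid": 123}'
# A token issued before the update was signed with the older secret.
old_signature = hmac.new(b"1326322983:old-secret", body, hashlib.sha256).digest()
```

A Service Node would call find_valid_secret for each incoming token, while the Login Server would only ever sign with sort_newest_first(entries)[0].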
Pulling a secret
To invalidate a secret instantly, we add a new secret as described above but prune the old secrets right away, so that any tokens already issued are instantly rejected.
Backward Compatibility
The Login Server uses the same snode and LDAP servers, so both authentication systems can coexist during a transition period.
Infra/Scaling
On the Login Server
The flow is:
- the user asks for a token, with a BrowserID assertion
- the server verifies the assertion locally [CPU bound]
- the server calls the User DB [I/O Bound]
- the server calls the Node Assignment Server [I/O Bound] (optional)
- the server builds the token and sends it back [CPU bound]
- the user uses the node for the duration of the TTL (30 min)
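With a 30 min TTL, each active user requests a new token twice per hour. So 100k users mean 200k requests per hour on the Login Server, which is roughly 50 RPS (200,000 / 3,600 is about 56). The load scales linearly: roughly 500 RPS for 1M users, 5,000 RPS for 10M users, and 50,000 RPS for 100M users.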
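The estimate can be reproduced with a few lines of arithmetic. The sketch assumes that every user is continuously active and renews exactly once per TTL, which is a worst-case simplification; the function name login_rps is hypothetical.

```python
TTL_MINUTES = 30
SECONDS_PER_HOUR = 3600


def login_rps(users, ttl_minutes=TTL_MINUTES):
    """Average token requests per second if every user renews once per TTL."""
    requests_per_hour = users * (60 / ttl_minutes)
    return requests_per_hour / SECONDS_PER_HOUR


for users in (100_000, 1_000_000, 10_000_000, 100_000_000):
    print(f"{users:>11,} users -> {login_rps(users):,.0f} RPS")
```

The exact averages come out slightly above the rounded figures quoted in the text, and real traffic would also have peaks above the average.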
Deployment
- A Login Server is stateless, so we can deploy as many as we want and have Zeus load balance over them
- A Login Server sees all secrets, so it can be cross-cluster / cross-datacenter
- The shared secrets files can stay in memory -- updating the files should ping the app so we reload them
- The User DB is the current LDAP, and may evolve into a more specialised metadata DB later
On each Service Node
Flow:
- the server checks the token [CPU Bound]
- the server processes the request [Sync = I/O Bound]
Phase 1
[End of January? Need to check with ally]
End to end prototype with low-level scaling
- Fully defined API, including headers and errors
- Assigns Nodes
- Maintains Node state for a user (in the existing LDAP)
- Issues valid tokens
- Downs nodes if needed
Phase 2
[End of Q1?]
Scalable implementation of the above in place.
- Migration
- Operational support scripts (TBD)
- Logging and Metrics
Implementation details
- The Token Server web service is implemented using Cornice and Pyramid, and sends crypto work to a crypto service via zmq.
- The Crypto worker is a C++ program using cryptopp