CloudServices/Sagrada/TokenServer
= Goals =
So here's the challenge we face. Current login for sync looks like this:
- 1. you provide a username and password
- 2. we log into LDAP with that username and password and grab your sync node
- 3. we check the sync node against the URL you've accessed, and use that to configure where your data is stored.
This solution works great for centralized login. It's fast, has a minimum number of steps, and caches the data centrally. The system that does node-assignment is lightweight, since the client and server both cache the result, and has support for multiple applications with the /node/<app> API protocol.
However, this breaks horribly when we don't have centralized login, and adding BrowserID support to the SyncStorage protocol puts us in exactly that situation. We're going to get valid requests from users who don't have an account in LDAP. We won't even know, when they make their first request, whether the node-assignment server has ever heard of them.
So, we have a bunch of requirements for the system. Not all of them are must-haves, but they're all things we need to think about trading off in whatever system gets designed:
- need to support multiple services (not necessarily centrally)
- need to be able to assign users to different machines as a service scales out, or somehow distribute them
- need to consistently send a user back to the same server once they've been assigned
- need to give operations some level of control over how users are allocated
- need to provide some recourse if a particular node dies
- need to handle exhaustion attacks. For example, I could set up an RP that just auto-approved any username, then loop through users until all nodes were full.
- need support for future developments like bucketed assignment
- need the system to scale out with no practical upper bound on users or nodes
= Proposed design =
This design uses token-based authentication on each node: a node can verify the validity of a token without having to call a third party.
A token server is dedicated to a single service and knows all its nodes.
== Flow ==
Here's the proposed flow:
- the client asks for a node allocation, giving its browser id assertion
- the token server checks the browser id assertion
- the token server checks in a DB if the user is already allocated to a node.
- if the user is not allocated to a node, the token server picks one by selecting the node with the fewest users
- the token server creates a token using the user id, the node url, a timestamp and a secret string known only to the selected node and itself
- the client calls the assigned node, and the node is able, using its secret, to validate the token; if the token is invalid or outdated, the node returns a 401
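To make this concrete, here is a minimal sketch of the token-server side in Python. The verify_assertion call and the db helpers (get_node_for_user, assign_least_loaded_node, get_uid) are hypothetical names used only for illustration; the signing follows the token layout described below.

<source lang="python">
import hashlib
import hmac
import json
import time

def issue_token(assertion, db, node_secrets):
    """Illustrative sketch of the flow above; helper names are hypothetical."""
    # 1. check the BrowserID assertion with the authority
    email = verify_assertion(assertion)            # hypothetical verifier, returns the verified email

    # 2. look up (or create) the node allocation for this user
    node = db.get_node_for_user(email)             # hypothetical DB lookup
    if node is None:
        node = db.assign_least_loaded_node(email)  # pick the node with the fewest users

    # 3. build the token and sign it with the secret shared with that node
    token = {"email": email,
             "uid": db.get_uid(email),             # app-specific user id
             "node": node,
             "timestamp": time.time(),
             "ttl": 30}
    payload = json.dumps(token, sort_keys=True)
    token["signature"] = hmac.new(node_secrets[node].encode("ascii"),
                                  payload.encode("utf-8"),
                                  hashlib.sha1).hexdigest()
    return token
</source>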
== Node management ==
[Note: this is not an early requirement - to this point, ops has not been interested in interacting with the API, preferring to use scripts or work directly with the db. Given that we don't know what the final form of this will look like, this part should be deprioritized.]
There's an HTTP API to manage the list of nodes on a given token server. The API can be used by nodes or by an admin script.
- GET http://token-server/nodes : returns a list of managed nodes
- POST http://token-server/nodes : push a new node = url + public key (+ secret) [restricted access]
- PUT http://token-server/nodes/phx345 : update a node info [restricted access]
- ...
Secrets will be managed through operations. If they push a new secret to a box, they will be responsible for updating the central db (we may provide scripts to help manage this)
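For illustration only, driving that API from an admin script could look like the following; the payload fields, the node URL and the authentication mechanism are placeholders, not a settled contract.

<source lang="python">
import requests

TOKEN_SERVER = "http://token-server"   # placeholder base URL, as in the list above

# register a new node: url + public key (+ secret)  [restricted access, auth omitted here]
resp = requests.post(TOKEN_SERVER + "/nodes",
                     json={"url": "https://phx345.example.com",
                           "public_key": "<public key>",
                           "secret": "<128-char secret>"})
resp.raise_for_status()

# update an existing node, e.g. push a freshly rotated secret
resp = requests.put(TOKEN_SERVER + "/nodes/phx345",
                    json={"secret": "<new 128-char secret>"})
resp.raise_for_status()

# list all managed nodes
print(requests.get(TOKEN_SERVER + "/nodes").json())
</source>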
XXX also see https://wiki.mozilla.org/Services/NodeAssignment to grab back some stuff
== Tokens ==
The token is composed of:
- a timestamp
- a ttl value
- the user email used in the browser id assertion
- the app-specific user id (the user id integer in the case of sync)
- the node id (the url)
- an HMAC-SHA1 signature using a shared secret
Implementation example: https://github.com/mozilla-services/tokenserver/blob/master/crypto.py
Example:
<pre>
$ python crypto.py
Creating a secret
ae6c3407ccf354f4d029061a5de97b188791e078398256a1f78b1b47...b40f834e570f74d9987ac9aa9cc7fa9fa

========= SERVER ==========
Creating the signed token
{'node': 'phx345', 'uid': '123', 'timestamp': 1324654308.907832, 'ttl': 30, 'signature': '452671cf538528cc427e98d42c0fd43ebf285ae5', 'email': 'tarek@mozilla.com'}
creating a header with it
Authorization: MozToken {"node": "phx345", "uid": "123", "timestamp": 1324654308.907832, "ttl": 30, "signature": "452671cf538528cc427e98d42c0fd43ebf285ae5", "email": "tarek@mozilla.com"}

========= NODE ==========
extracting the token from the header
Authorization: MozToken {"node": "phx345", "uid": "123", "timestamp": 1324654308.907832, "ttl": 30, "signature": "452671cf538528cc427e98d42c0fd43ebf285ae5", "email": "tarek@mozilla.com"}
validating the signature
</pre>
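The linked crypto.py is the reference; the following is just a rough, self-contained sketch of the same idea (field names taken from the example above, exact serialization assumed) showing how a token can be signed and then validated on the node.

<source lang="python">
import hashlib
import hmac
import json
import time

def sign_token(token, secret):
    # sign every field except the signature itself
    payload = json.dumps({k: v for k, v in token.items() if k != "signature"},
                         sort_keys=True)
    return hmac.new(secret.encode("ascii"), payload.encode("utf-8"),
                    hashlib.sha1).hexdigest()

def check_token(token, secret, now=None):
    # the node re-computes the signature with its shared secret and checks the ttl
    if not hmac.compare_digest(sign_token(token, secret),
                               token.get("signature", "")):
        return False
    now = time.time() if now is None else now
    return now <= token["timestamp"] + token["ttl"]

secret = "shared-secret-between-node-and-token-server"   # stand-in for the real 128-char secret
token = {"node": "phx345", "uid": "123", "email": "tarek@mozilla.com",
         "timestamp": time.time(), "ttl": 30}
token["signature"] = sign_token(token, secret)
assert check_token(token, secret)
</source>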
[Trying to think of ways in which we might care about exposing uid.]
[Also email. Security may have an issue with that, as it's theoretically loggable. Need to talk to them.]
== Secrets ==
The token server has a DB listing of each node. For each node it has:
- its url
- a list of secrets
- the public key of the node
Each node has a unique secret it shares with the token server. A secret is an ASCII string of 128 characters.
Example of generating such string: https://github.com/mozilla-services/tokenserver/blob/master/crypto.py
The node is responsible for creating the secret and giving it to the token server. The node encrypts the secret using its private key and GNU Privacy Guard, then pushes it to the token server using a PUT call.
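A minimal way to produce such a 128-char ASCII string, in the spirit of the linked crypto.py (illustrative only):

<source lang="python">
import binascii
import os

def generate_secret(length=128):
    # 128 hex characters come from length/2 random bytes
    return binascii.hexlify(os.urandom(length // 2)).decode("ascii")

print(generate_secret())   # e.g. 'ae6c3407ccf354f4d029061a5de97b18...'
</source>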
=== Updating a secret ===
We may face the case where a secret needs to be changed.
For this, the token server maintains a list of 2 secrets for each node: one "old" secret and one "active" secret. Every time a new secret is pushed, the token server deletes the "old" secret, the "active" secret becomes the "old" secret, and the new one becomes "active".
The node also keeps the last two secrets it has generated.
When a token comes in, the node tries the active secret first, then falls back to the old secret.
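A sketch of that fallback on the node side, using the same HMAC-SHA1 scheme as in the tokens section (names are illustrative):

<source lang="python">
import hashlib
import hmac
import json

def compute_signature(token, secret):
    # same signing scheme as sketched in the tokens section
    payload = json.dumps({k: v for k, v in token.items() if k != "signature"},
                         sort_keys=True)
    return hmac.new(secret.encode("ascii"), payload.encode("utf-8"),
                    hashlib.sha1).hexdigest()

def validate_signature(token, active_secret, old_secret):
    # try the "active" secret first, then fall back to the "old" one
    for secret in (active_secret, old_secret):
        if secret and hmac.compare_digest(compute_signature(token, secret),
                                          token.get("signature", "")):
            return True
    return False
</source>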
== Node deactivation ==
When a node needs to be shut down,
- the backoff flag is set in the token db
- if a user asks for a new token for that node, the server returns a 403 + Retry-After: ttl + 1
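Roughly, on the server side that could look like this (the flag lookup and helper names are invented for the example):

<source lang="python">
TOKEN_TTL = 30   # ttl used in the token examples above

def handle_token_request(user, db):
    # reject token requests for a node that is being decommissioned
    node = db.get_node_for_user(user)        # hypothetical lookup in the token db
    if db.backoff_flag_is_set(node):         # hypothetical check of the backoff flag
        return 403, {"Retry-After": str(TOKEN_TTL + 1)}, "node is going away"
    return 200, {}, "issue a token as usual"
</source>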
== Backward compatibility ==
Older versions of the system will use completely different API entrypoints - the old /user api, and the 1.1 /sync api. Those will need to be maintained during the transition, though new clusters should spin up with only 2.0 support.
We should watch logs to study 1.1 falloff and consolidate those users through migration as they diminish.
However, there are a couple of points that need to be synced up:
- The database that assigns nodes needs to be shared between the two. We should add a column for "1.0 acceptable" and update the old system to only look at that column. Alternatively, we could work with ops to just send all old assignments to one cluster, in which case the db doesn't need to be shared.
- There will be a migration that moves all the user node data from LDAP to the tokenserver. However, we need to make sure that any subsequent migrations update this data. This ensures that a user with a pre-2 client and a post-2 client points at the same place, and that people moving to the new system will have the right node. We can't punt this, because if a node goes down post-migration, a user who switches over afterwards is stuck on it. (At the very least, we need to purge those nodes from the 2.0 db.)
- will need to migrate all user login data over to the browserid servers, but that's not relevant to tokenserver.
== Load ==
Sync is currently handling about 10M users, and each node is able to handle about 100K users.
This means we'll need 10M / 100K = 100 nodes to handle the current load on Sync.
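For reference, the arithmetic:

<source lang="python">
# back-of-the-envelope node count from the figures above
total_users = 10 * 1000 * 1000    # ~10M current Sync users
users_per_node = 100 * 1000       # ~100K users per node

print(total_users // users_per_node)   # -> 100
</source>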
=== On the central server ===
- Network: the cost of checking with the authority that the assertion is valid.
- CPU: each time a client asks for a new token, we need to generate it.
- Memory: we don't need to store anything in memory on the server side.
=== On Nodes ===
- Memory: there is no need to store anything in memory apart from the shared secret.
- CPU: each time a request is made, the node has to verify the token against the shared secret.
- Network: nothing.
= Phase 1 =
[End of January? Need to check with ally]
End-to-end prototype with low-level scaling:
- Fully defined API, including headers and errors
- Assigns Nodes
- Maintains Node state for a user
- Issues valid tokens
- Downs nodes if needed
= Phase 2 =
[End of Q1?]
Scalable implementation of the above in place.
- Migration
- Operational support scripts (TBD)
- Logging and Metrics