Sauropod

THIS PAGE IS A WORKING DRAFT
The page may be difficult to navigate, and some information on its subject might be incomplete and/or evolving rapidly.
If you have any questions or ideas, please add them as a new topic on the discussion page.

Sauropod Technical Specification

Sauropod is a secure storage system for user data. It employs encryption and secure key storage to enable least-privilege access, fine-grained user permissions, and a controlled and auditable process for administrative and automated data access.

To application developers, Sauropod presents a key-value storage API, where each user has a completely independent universe of keys. Applications gain access to a user's store by presenting a user credential, the generation and validation of which is external to the Sauropod system. The store may also, optionally, restrict access to a particular set of user keys based on the application making the access. Applications may extend the privileges on a particular object key according to sensible transitive principles: a user that can read a file can extend read permission to any other user, and similarly for writes. (XX support locking an item down as non-sharable?)
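
To make the shape of that API concrete, here is a toy in-memory model of the per-user key universe. The names (ToyStore, begin_session, put, get) are illustrative only, not part of any specified interface:

class ToyStore:
    """Toy in-memory model: every user gets an independent key namespace."""
    def __init__(self):
        self._data = {}   # user_id -> {key: value}

    def begin_session(self, user_id):
        # In the real system the application presents a credential, which
        # an external Credential Oracle resolves to a User Identifier.
        return ToySession(self._data.setdefault(user_id, {}))

class ToySession:
    def __init__(self, bucket):
        self._bucket = bucket

    def put(self, key, value):
        self._bucket[key] = value

    def get(self, key):
        return self._bucket[key]

store = ToyStore()
alice = store.begin_session("alice@example.com")
alice.put("prefs", b"dark-theme")
bob = store.begin_session("bob@example.com")
assert "prefs" not in store._data["bob@example.com"]   # independent universes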

Administrative and automated access is supported through "super-credentials". These allow developers and batch processes to obtain capabilities identical to those of a user for a limited time. There is no "super-user" that is allowed to access all records; instead, an administrator acquires the permissions of a user through an authenticated, auditable process.

(XX The desired implementation of Sauropod is of a key-value store with encrypted values, where the encryption keys are per-user keys that are wrapped with a small number of master secrets. Keys are only unwrapped inside the system. The API described to the client does not expose the details of the internal data protection scheme, but it is completely compatible with this internal representation)
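
As a sanity check that the client API is compatible with that internal representation, here is a minimal sketch of the wrapped-key layering, using Fernet from the Python cryptography package purely for illustration; the actual algorithms and key formats are unspecified:

from cryptography.fernet import Fernet

master_secret = Fernet.generate_key()      # known only to the Key Server
master = Fernet(master_secret)

k_d = Fernet.generate_key()                # per-value data key Kd
wrapped_k_d = master.encrypt(k_d)          # only the wrapped form is stored

# Values are stored as ciphertext under Kd.
ciphertext = Fernet(k_d).encrypt(b"user data for bucket d")

# Keys are only unwrapped inside the system: unwrap Kd, then decrypt.
plaintext = Fernet(master.decrypt(wrapped_k_d)).decrypt(ciphertext)
assert plaintext == b"user data for bucket d"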

Project Phasing

  • Phase Zero: In phase zero, a tentative API, including credentials, is implemented. What happens within the API is unspecified, and may not involve any cryptography to start with.
  • Phase One: In phase one, the session API is fully implemented. Callers are required to present user credentials or an administrative credential to access user data. The internal implementation is not encrypted, but uses row-level access control to enforce fine-grained permissions. The Access Server, Logging, and Credential Oracle are implemented fully; the Data Server is a non-encrypted database; the Key Server is not present. The Sharing API is not implemented. (XX how much of Administrative and Automated?)
  • Phase Two: In phase two, the Key Server is fully implemented, and the Data Server is modified to store encrypted data. The encrypted ACL system is implemented. The Sharing, Administrative, and Automated APIs are implemented.

Definition of Terms

  • Application: a process that is accessing user data on behalf of a user. Applications use Application Authentication to prove to the access server which process they are.
  • Access Server: a Sauropod internal process that handles requests from applications to access data and keys.
  • Data Server: a Sauropod internal process that maintains a table of user data. Each atom of user data has a bucket and a value.
  • Credential Oracle: (too cute? name?) An external process, configured as part of a Sauropod installation, which verifies a credential and translates it into a user identifier.
  • Credential: A string of bytes, presented by the application to the Access Server, which encodes the successful authentication of a user into the system. A credential could be a cookie (which would then be checked with a session server connected to the authentication system) or a directly-verifiable credential such as a BrowserID assertion or proof of an SSL client certificate handshake.
  • User Identifier: A string of bytes that represents a single user in the Sauropod system. Credentials can be converted into user identifiers by the Credential Oracle.
  • Key Server: a Sauropod internal process that maintains a list of per-value keys. All keys are wrapped with a Master Secret that is known only to the Key Server (or, better yet, locked away in a hardware module that only the Key Server can access). Every unique value has its own key; a key may be wrapped by more than one user key (if more than one user has access to it).
  • Logging Aggregator: A Sauropod internal process that collates the logs of the access, data, and key servers to provide a unified view of data access behavior. It may optionally run audit logic to detect anomalous access patterns.

Basic Flow of Control

In the course of processing a request from a user, an application needs to retrieve some data. As part of the request (or a session context connected to it), the application has a user credential.

The application begins a session with the access server by sending the user credential in a BeginSession request.

The Access Server validates the credential by consulting the Credential Oracle and creates a session associated with the User Identifier. A session identifier is returned to the application, which must be included with all subsequent requests.

The application then issues some number of requests to the Access Server, including the session identifier with each.

The Access Server authenticates the request to determine which application is making it. It then

  • Determines whether the application is allowed to access the requested bucket
  • In the Clear Data Model, verifies the credential and derives a User Identifier from it, and then determines whether that User Identifier has permission to access the requested bucket; if permission is granted, it performs the read or write
  • In the Encrypted Data Model, verifies the credential and derives a User Identifier from it, and then retrieves the ACL record and data ciphertext for that User Identifier at the requested bucket; the ACL is then passed to the Key Server for validation and decryption, and the key contained in the ACL is used to decrypt the ciphertext or encrypt a new ciphertext for the data.

The key server and data server only respond to requests from access servers. All key server and data server operations are logged.
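
A compressed, illustrative rendering of this flow follows; all names are placeholders, and the permission checks are stubbed out as comments:

import uuid

SESSIONS = {}   # session identifier -> User Identifier (Access Server state)

def credential_oracle(credential):
    # Stand-in for the external Credential Oracle: verify the credential
    # and translate it into a User Identifier.
    assert credential.startswith("user:")
    return credential[len("user:"):]

def begin_session(credential):
    user_id = credential_oracle(credential)
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = user_id
    return session_id

def handle_request(session_id, app_id, bucket, op):
    user_id = SESSIONS[session_id]
    # Here the Access Server would (1) check that app_id may touch this
    # bucket, then (2) consult the access bucket (Clear Data Model) or the
    # ACL record plus Key Server (Encrypted Data Model) before performing op.
    print("%s: %s %s as %s (logged)" % (app_id, op, bucket, user_id))

sid = begin_session("user:joe@example.com")
handle_request(sid, "appsync", "apps/installed", "read")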

Data Storage Model

(XX Need lots of feedback here).

  • Key-value pairs
  • Collections? Ordered lists?
  • Trees? Graphs?

Clear Data Model

In the Clear Data Model, the Data Server maintains two types of data:

  • A user data bucket is keyed on a bucket location, and contains the data.
  • A user access bucket is keyed on a bucket location and a user identifier, and contains a permission (level, or bitwise mask, of read/write/admin).
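
One plausible relational layout for these two bucket types, sketched with sqlite3; the table and column names are assumptions, not part of the specification:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE user_data (
    location TEXT PRIMARY KEY,       -- the bucket location
    value    BLOB                    -- the data itself, in the clear
);
CREATE TABLE user_access (
    location   TEXT,                 -- the bucket location
    identifier TEXT,                 -- the user identifier
    perms      INTEGER,              -- bitmask of read/write/admin
    PRIMARY KEY (location, identifier)
);
""")

READ, WRITE, ADMIN = 1, 2, 4
db.execute("INSERT INTO user_data VALUES (?, ?)", ("joe/apps", b"[...]"))
db.execute("INSERT INTO user_access VALUES (?, ?, ?)",
           ("joe/apps", "joe@example.com", READ | WRITE | ADMIN))

# Row-level access control: the read happens only if a matching access
# row grants the permission.
row = db.execute(
    "SELECT perms FROM user_access WHERE location=? AND identifier=?",
    ("joe/apps", "joe@example.com")).fetchone()
assert row and row[0] & READ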

Encrypted Data Model

In the Encrypted Data Model, the Data Server maintains two types of data:

  • A user data bucket is keyed on a bucket location, and contains the data encrypted with a data key (for location d, called Kd).
  • A user access bucket is keyed on a bucket location and a user identifier, and contains a tuple of (perms, (location, identifier, perms, Kd)_Kmaster). That is, the access control list (ACL) record, which uniquely identifies the ability of a particular user to access a particular bucket, and contains the key to read that bucket, is encrypted with a master secret.

The Key Server simply maintains a small number of master secrets. The Access Server retrieves ACL records from the data server and submits them to the Key Server, along with a credential. The Key Server verifies the credential, decrypts the ACL tuple, verifies that the identifier in the tuple matches the identifier of the credential, logs the access, and returns Kd to the Access Server. (NB the Key Server should be locked down so only the Access Server can talk to it)
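
A sketch of that Access Server / Key Server handshake, again using Fernet purely for illustration; the tuple layout follows the (perms, (location, identifier, perms, Kd)_Kmaster) description above, and the credential is reduced to the identifier derived from it:

import json
from cryptography.fernet import Fernet

MASTER = Fernet(Fernet.generate_key())   # held only by the Key Server

def make_acl(location, identifier, perms, k_d):
    sealed = MASTER.encrypt(json.dumps(
        [location, identifier, perms, k_d.decode()]).encode())
    return (perms, sealed)

def key_server_unwrap(acl, credential_identifier):
    # Key Server: decrypt the tuple, verify the identifier matches the
    # credential, log the access, and release Kd to the Access Server.
    location, identifier, perms, k_d = json.loads(MASTER.decrypt(acl[1]))
    assert identifier == credential_identifier, "ACL/credential mismatch"
    return k_d.encode()

k_d = Fernet.generate_key()
acl = make_acl("joe/apps", "joe@example.com", "rw", k_d)
assert key_server_unwrap(acl, "joe@example.com") == k_d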

((XX should access server include the intent of which permission it wants? would be for logging only))

Application Authentication

All application-level calls to the AccessServer must be authenticated **by the application**. The terms of this authentication are up to the implementation, but could include transport level or message level techniques (IP range pinning, IPsec, SSL/TLS, and API keys are all options). ((XX perhaps we could implement a couple))

Administrative Access

The system supports audited, fine-grained administrative access through the creation of "super-credentials". These are credentials that represent super-user access to a **single user's data**. The Credential Oracle is required to provide features to make this work, so the exact details are out of scope for this specification, but the general flow is:

  • Administrative user authenticates to the credentialing system, presenting a superuser authentication and a target user
  • Credentialing system produces a super-credential for the target user, tagged with audit trail metadata (for example, "super user Jane, accessing user Joe, to investigate bug #6143635, at 2:25 PM 10/15/2011").
  • The superuser then creates a session and issues commands as though he or she were the actual target user. The credential information is logged with all accesses.

The credential system should be designed to expire the credential after a reasonable interval, while still allowing the administrative user to finish his or her tasks.
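
A sketch of what super-credential issuance might look like; since the Credential Oracle's interface is out of scope, every name and field here is an assumption:

import time

def issue_super_credential(admin, target_user, reason, ttl_seconds=900):
    # The credential carries audit-trail metadata and an expiry; every
    # access made with it is logged with this context attached.
    return {
        "kind": "super-credential",
        "admin": admin,
        "target_user": target_user,    # scoped to a single user's data
        "reason": reason,
        "expires_at": time.time() + ttl_seconds,
    }

cred = issue_super_credential("jane", "joe", "investigate bug #6143635")
assert cred["expires_at"] > time.time()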

Batch/Automation Access

In a similar fashion to administrative access, the system supports batch mode access for automation and aggregate analysis. The credential oracle is required to support "automation credentials" for this purpose.

An automation credential is a credential that allows access to a set of users. The automated process must authenticate itself to the credential oracle and receive a credential, just as a superuser does. The process must assert the purpose of its access during this credentialing process, and the credential it receives is time- and scope-limited. ((XX details!)) For example, "Indexing batch process, handling user updates new since 10/14/2011-01:50-Z".

Once the automated system has a credential, it may use that credential to begin a session and access user data.

(XX If the system supports internal aggregate calculations, e.g. map-reduce or tree walking of data, the Access Server session could make use of the credential internally to access multiple user keys).

Provisioning to applications outside the trusted perimeter

If the application-level authentication is strong enough, the Access Server could accept inbound requests from applications outside the trusted computing perimeter. These applications would still be required to authenticate themselves and present a user credential. Transport-level encryption of the communication would be mandatory.

Possible topologies:

  • Traditional OAuth: The application authenticates with an API key and secret; the user credential is an OAuth Access Token (which was previously provisioned through an authorization flow)
  • BrowserID: The application authenticates with SSL/TLS or an API key and secret; the user credential is an identity assertion. ((XX what would the audience be - probably the accessing application?))


Questions

"The Vault" and "The Cloud"

Can we use this system to store non-decryptable user data without any issues? This would mean storing an ACL that contains not Kd but a NULL key - that is, a record indicating that we don't know how to read the document. Mozilla would still be responsible for maintaining the ACL, so that an authenticated user could extend read or write privileges to another user.

Trusted computing base, lifecycle questions:

  • Does the access server cache credential verification, or credential to user identifier results, or key decrypt results?
  • Currently we need to consult the oracle twice (and maybe thrice) - once to find which identifier to use for the ACL lookup (once in access server, or once in data server, or both?), and again in the keyserver to verify that the ACL's plaintext contains the identifier in the credential. Is that really necessary?

Data partitioning questions

  • Per-user? (identifier, datakey)
  • Flat per-user? /identifier/datakey
  • Flat global? /datakey

Anything per-user means that the app needs to know the user identifier to enable cross-user access. Anything global means that all users live in a shared namespace, which makes locality of storage harder to achieve. One option: if the identifier is not reversible to the user's identity, we could still get locality; avoiding trivial correlation means we need more than one identifier.

Collections? Linked data structures?

What's the right way to represent collections and graphs/trees? This probably depends a lot on the underlying persistence mechanism.

If it's Riak-like, we have the ability to perform tree reassembly inside the persistence layer; this would require decrypting references inside the DB or holding references externally. Can we live with that? We could expose a collection identifier outside of the encryption envelope, for example.

Resolution of user data to account

As currently written, the User Identifier is the only entry point into the database, and there is only one of them. There will be cases (mostly administrative) where a valid user will need to perform discovery based on other data -- for example, searching by givenName/familyName for a user account when the email address has been lost, in order to investigate a payment.

There is no efficient way to perform that query as the system is currently specified.

Strawman API

For fun, we specify the API using HTTP.

Versioning of the API

We punt on versioning for now by using DNS, e.g. https://v1.sauropod.mozilla.org

Caller Authentication

Every caller into the API has an API key and secret. These are used for Caller Authentication. The API key and secret are used to perform 2-legged OAuth-signed calls. There is no dance, just signing of the API call (much like Amazon S3).
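
A sketch of what such no-dance signing could look like, in the spirit of S3-style HMAC signatures rather than a literal OAuth 1.0 signature base string; the header names and string-to-sign are assumptions, not a specified wire format:

import hashlib, hmac, time

API_KEY, API_SECRET = "appsync", b"s3cr3t"

def sign_request(method, path, body=b""):
    timestamp = str(int(time.time()))
    string_to_sign = "\n".join([method, path, timestamp,
                                hashlib.sha256(body).hexdigest()])
    signature = hmac.new(API_SECRET, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    return {"X-Sauropod-Key": API_KEY,
            "X-Sauropod-Timestamp": timestamp,
            "X-Sauropod-Signature": signature}

headers = sign_request("GET", "/apps/appsync/users/joe/keys/apps")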

Session Initiation

POST /session/start
assertion={browserid_assertion}&audience={app_domain}

returns a session token and secret, which are used to sign subsequent requests

session_token={session_token}&session_secret={session_secret}&expires_at={expiration}

Set

PUT /apps/{app_id}/users/{user_id}/keys/{key}
{value}

Get

GET /apps/{app_id}/users/{user_id}/keys/{key}

returns the content of the data at that key, with content-type specified at upload time. Only if authorized, of course.
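
Putting the strawman calls together, a hypothetical client session might look like the following, using the Python requests library; the host name is the placeholder above, and the signing of subsequent requests with the session secret is elided:

import requests

BASE = "https://v1.sauropod.mozilla.org"
browserid_assertion = "<assertion obtained from the BrowserID flow>"

# Session initiation: POST the assertion and audience, then parse the
# form-encoded response into token/secret/expiry fields.
resp = requests.post(BASE + "/session/start",
                     data={"assertion": browserid_assertion,
                           "audience": "appsync.example.com"})
session = dict(pair.split("=", 1) for pair in resp.text.split("&"))

# Set, then get, a value in the user's key namespace. Real requests
# would be signed with session_token/session_secret (not shown).
url = BASE + "/apps/appsync/users/joe%40example.com/keys/installed"
requests.put(url, data=b'{"apps": []}',
             headers={"Content-Type": "application/json"})
value = requests.get(url).content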

Use Cases for Sauropod

OpenWebApps

The first use case for Sauropod is to store a user's list of installed "apps" for the OpenWebApps project. This use-case has the following requirements (all operations are per-user):

The code that appsync is using is located in appsync/storage/sql.py (the SQLDatabase object) - the basic API it uses is given below. It uses a few standard objects:

  • user: the username (could/should also be inferred from the authentication we used)
  • collection: a string we use to identify different kinds of objects (will generally be "apps", but we might want to store the list of clients in a different collection)
  • timestamps: integers (hundredths of a second since the epoch)

And there are these routines:

  • storage.delete(user, collection, client_id, reason=""): deletes everything in the collection and leaves behind a stub that says who (client_id) deleted the collection and optionally why (reason). Implemented in SQL with a table of deleted collections (values are a JSON serialization of the client_id and reason).
  • storage.get_applications(user, collection, since=0): returns all the applications (JSON objects, really) that have been updated since the since timestamp. The since time is our local timestamp, not something from the client's clock.
  • storage.add_applications(user, collection, applications): adds all the applications; applications is a list of JSONable objects/dicts. Our storage knows that application['origin'] must be unique, and so can delete old applications; Sauropod itself would presumably not know this. Though this doesn't return a timestamp, it probably should/could (something to match against since).
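
To pin down those semantics, here is a toy in-memory rendering of the same API; the real implementation is the SQLDatabase object in appsync/storage/sql.py, and this sketch only mirrors the behavior described above:

import json, time

class ToyStorage:
    def __init__(self):
        self._apps = {}      # (user, collection) -> {origin: app}
        self._deleted = {}   # (user, collection) -> deletion stub

    def _now(self):
        return int(time.time() * 100)   # hundredths of a second

    def delete(self, user, collection, client_id, reason=""):
        self._apps.pop((user, collection), None)
        self._deleted[(user, collection)] = json.dumps(
            {"client_id": client_id, "reason": reason})

    def add_applications(self, user, collection, applications):
        bucket = self._apps.setdefault((user, collection), {})
        for app in applications:
            # origin is unique, so a new version replaces the old one
            bucket[app["origin"]] = dict(app, _modified=self._now())
        return self._now()

    def get_applications(self, user, collection, since=0):
        bucket = self._apps.get((user, collection), {})
        return [a for a in bucket.values() if a["_modified"] >= since]

s = ToyStorage()
s.add_applications("joe", "apps", [{"origin": "https://app.example.com"}])
assert s.get_applications("joe", "apps")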

In addition, there are a few use-cases for aggregate data, not per-user. Ideally this aggregation would be done by Sauropod on the server side.

  1. Retrieve the number of installs of a particular app (without leaking information about users who have the app installed).
  2. Retrieve general statistics such as: installs/hour and uninstalls/hour.
  3. Retrieve information on how many apps have been installed from a particular app store in a given time period.