Sauropod

1,036 bytes added, 18:14, 19 October 2011
{{draft}}
= Sauropod Technical Specification =
Sauropod is a secure storage system for user data. It employs end-to-end encryption and secure key storage to enable least-privilege access, fine-grained user permissions, and a controlled and auditable process for administrative and automated data access.
To application developers, Sauropod presents a key-value storage API, where each user has a completely independent universe of keys. Applications gain access to a user's store by presenting a user credential, the generation and validation of which is external to the Sauropod system. The store may also, optionally, restrict access to a particular set of user keys based on the application making the access. Applications may extend the privileges on a particular object key according to sensible transitive principles: a user that can read a file can extend read permission to any other user, and similarly for writes. (XX support locking an item down as non-sharable?)
Administrative and automated access is supported through "super-credentials". These allow developers and batch processes to obtain capabilities identical to those of a user for a limited time. There is no "super-user" that is allowed to access all records; instead, an administrator acquires the permissions of a user through an authenticated, auditable process.
(XX The desired implementation of Sauropod is of a key-value store with encrypted values, where the encryption keys are per-user keys that are wrapped with a small number of master secrets. Keys are only unwrapped inside the system. The API described to the client does not expose the details of the internal data protection scheme, but it is completely compatible with this internal representation)
== Project Phasing ==
* ''Phase Zero'': In phase zero, a tentative API, including credentials, is implemented. What happens within the API is unspecified, and may not involve any cryptography to start with.
* ''Phase One'': In phase one, the session API is fully implemented. Callers are required to present user credentials or an administrative credential to access user data. The internal implementation is ''not'' encrypted, but uses row-level access control to enforce fine-grained permissions. The Access Server, Logging, and Credential Oracle are implemented fully; the Data Server is a non-encrypted database; the Key Server is not present. The Sharing API is not implemented. (XX how much of Administrative and Automated?)
* ''Phase Two'': In phase two, the Key Server is fully implemented, and the Data Server is modified to store encrypted data. The encrypted ACL system is implemented. The Sharing, Administrative, and Automated APIs are implemented.
== Definition of Terms ==
* Application: a process that is accessing user data on behalf of a user. Applications use Application Authentication to prove to the access server which process they are.
* Access Server: a Sauropod internal process that handles requests from applications to access data and keys.
* Data Server: a Sauropod internal process that maintains a table of user data. Each atom of user data has a bucket and a value.
* Credential Oracle: (too cute? name?) An external process, configured as part of a Sauropod installation, which verifies a credential and translates it into a user identifier.
* Credential: A string of bytes, presented by the application to the Access Server, which encodes the successful authentication of a user into the system. A credential could be a cookie (which would then be checked with a session server connected to the authentication system) or a directly-verifiable credential such as a BrowserID assertion or proof of SSL client certificate handshake.
* User Identifier: A string of bytes that represents a single user in the Sauropod system. Credentials can be converted into user identifiers by the Credential Oracle.
* Key Server: a Sauropod internal process that maintains a list of per-value keys. All keys are wrapped with a Master Secret that is known only to the Key Server (or, better yet, locked away in a hardware module that only the Key Server can access). Every unique value has its own key; a key may be wrapped by more than one user key (if more than one user has access to it).
* Logging Aggregator: A Sauropod internal process that collates the logs of the access, data, and key servers to provide a unified view of data access behavior. It may optionally run audit logic to detect anomalous access patterns.
== Basic Flow of Control ==
In the course of processing a request from a user, an application needs to retrieve some data. As part of the request (or a session context connected to it), the application has a user credential.
The application begins a session with the access server by sending the user credential in a BeginSession request.
The Access Server validates the credential by consulting the Credential Oracle and creates a session associated with the User Identifier. A session identifier is returned to the application, which must be included with all subsequent requests.
The application then issues some number of requests to the Access Server, including the session identifier with each.
The Access Server authenticates the request to determine which application is making the request. It then:
* Determines if the application is allowed to access the requested bucket
* In the Clear Data Model, verifies the credential and derives a User Identifier from it, and then determines whether that User Identifier has permissions to access the requested bucket; if permission is allowed performs the read or write
* In the Encrypted Data Model, verifies the credential and derives a User Identifier from it, and then retrieves the ACL record and data ciphertext for that User Identifier at the requested bucket; the ACL is then passed to the Key Server for validation and decryption, and the key contained in the ACL is used to decrypt the ciphertext or encrypt a new ciphertext for the data.
The key server and data server only respond to requests from access servers. All key server and data server operations are logged.
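The flow of control described above can be sketched roughly as follows. This is a minimal in-memory illustration of the Clear Data Model path only; the class names, method names (<tt>begin_session</tt>, <tt>put</tt>, <tt>get</tt>) and the per-application bucket table are assumptions made for the example, not part of the specification.

```python
import secrets


class CredentialOracle:
    """Hypothetical oracle: maps an opaque credential to a user identifier."""

    def __init__(self, known):
        self._known = dict(known)

    def verify(self, credential):
        # Returns None when the credential is not recognised.
        return self._known.get(credential)


class AccessServer:
    """Toy Access Server that keeps everything in memory (Clear Data Model)."""

    def __init__(self, oracle, app_buckets):
        self._oracle = oracle
        self._app_buckets = app_buckets  # app id -> set of buckets it may touch
        self._sessions = {}              # session id -> user identifier
        self._data = {}                  # (user identifier, bucket) -> value
        self.log = []                    # every operation is logged

    def begin_session(self, credential):
        user_id = self._oracle.verify(credential)
        if user_id is None:
            raise PermissionError("invalid credential")
        session_id = secrets.token_hex(16)
        self._sessions[session_id] = user_id
        return session_id

    def _authorize(self, app_id, session_id, bucket):
        # Step 1: is this application allowed to touch the bucket at all?
        if bucket not in self._app_buckets.get(app_id, set()):
            raise PermissionError("application may not access this bucket")
        # Step 2: resolve the session to the user it was created for.
        if session_id not in self._sessions:
            raise PermissionError("unknown session")
        return self._sessions[session_id]

    def put(self, app_id, session_id, bucket, value):
        user_id = self._authorize(app_id, session_id, bucket)
        self._data[(user_id, bucket)] = value
        self.log.append(("put", app_id, user_id, bucket))

    def get(self, app_id, session_id, bucket):
        user_id = self._authorize(app_id, session_id, bucket)
        self.log.append(("get", app_id, user_id, bucket))
        return self._data.get((user_id, bucket))
```

An application would call <tt>begin_session</tt> once with the user credential and then pass the returned session identifier with every <tt>put</tt> or <tt>get</tt>.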
== Data Storage Model ==
(XX Need lots of feedback here).
* Key-value pairs
* Collections? Ordered lists?
* Trees? Graphs?
== Clear Data Model ==
In the '''Clear Data Model''', the Data Server maintains two types of data:
* A ''user data bucket'' is keyed on a bucket location, and contains the ''data''.
* A ''user access bucket'' is keyed on a bucket location and a user identifier, and contains a permission (level, or bitwise mask, of read/write/admin).
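As a rough illustration of the two bucket types in the Clear Data Model, the sketch below uses the bitwise-mask option for permissions. The names <tt>Perm</tt> and <tt>ClearDataServer</tt> and the specific bit values are assumptions for the example only.

```python
from enum import IntFlag


class Perm(IntFlag):
    # Hypothetical bit assignments for the read/write/admin mask.
    READ = 1
    WRITE = 2
    ADMIN = 4


class ClearDataServer:
    def __init__(self):
        self.data = {}    # user data bucket: location -> data
        self.access = {}  # user access bucket: (location, user identifier) -> Perm

    def grant(self, location, user_id, perm):
        key = (location, user_id)
        self.access[key] = self.access.get(key, Perm(0)) | perm

    def allowed(self, location, user_id, perm):
        held = self.access.get((location, user_id), Perm(0))
        return (held & perm) == perm

    def read(self, location, user_id):
        if not self.allowed(location, user_id, Perm.READ):
            raise PermissionError("read denied")
        return self.data[location]

    def write(self, location, user_id, value):
        if not self.allowed(location, user_id, Perm.WRITE):
            raise PermissionError("write denied")
        self.data[location] = value
```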
== Encrypted Data Model ==
In the '''Encrypted Data Model''', the Data Server maintains two types of data:
* A ''user data bucket'' is keyed on a bucket location, and contains the ''data'' encrypted with a ''data key'' (for location d, called K<sub>d</sub>)
* A ''user access bucket'' is keyed on a bucket location and a user identifier, and contains a tuple of (perms, (location, identifier, perms, K<sub>d</sub>)_K<sub>master</sub>). That is, the access control list (ACL) record, which uniquely identifies the ability of a particular user to access a particular bucket, and contains the key to read that bucket, is encrypted with a master secret.
The Key Server simply maintains a small number of master secrets. The Access Server retrieves ACL records from the data server and submits them to the Key Server, along with a credential. The Key Server verifies the credential, decrypts the ACL tuple, verifies that the identifier in the tuple matches the identifier of the credential, logs the access, and returns K<sub>d</sub> to the Access Server. (NB the Key Server should be locked down so only the Access Server can talk to it)
((XX should access server include the intent of which permission it wants? would be for logging only))
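The Key Server check described above might look something like the following sketch. The wrapping scheme here (a SHA-256 keystream plus an HMAC tag) is only a standard-library stand-in so the example runs without extra dependencies; it is not a vetted cipher, and a real deployment would presumably use an authenticated encryption mode or a hardware module. The JSON field names and the <tt>wrap</tt>, <tt>unwrap</tt> and <tt>data_key</tt> functions are assumptions for the example.

```python
import hashlib
import hmac
import json
import os


def _subkeys(master):
    # Derive separate encryption and MAC keys from the master secret.
    return (hashlib.sha256(b"enc" + master).digest(),
            hashlib.sha256(b"mac" + master).digest())


def _keystream(key, nonce, length):
    # Toy keystream: SHA-256 over key, nonce and a counter. Illustration only.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def wrap(master, record):
    """Encrypt-then-MAC a JSON-serialisable ACL record under the master secret."""
    enc, mac = _subkeys(master)
    plaintext = json.dumps(record).encode()
    nonce = os.urandom(16)
    stream = _keystream(enc, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(mac, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unwrap(master, blob):
    enc, mac = _subkeys(master)
    nonce, body, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ACL record failed integrity check")
    stream = _keystream(enc, nonce, len(body))
    return json.loads(bytes(a ^ b for a, b in zip(body, stream)))


class KeyServer:
    def __init__(self, master, oracle):
        self._master = master
        self._oracle = oracle  # callable: credential -> user identifier or None
        self.log = []

    def data_key(self, credential, wrapped_acl):
        user_id = self._oracle(credential)
        if user_id is None:
            raise PermissionError("invalid credential")
        acl = unwrap(self._master, wrapped_acl)
        # The identifier inside the ACL must match the credential's identifier.
        if acl["identifier"] != user_id:
            raise PermissionError("ACL does not belong to this user")
        self.log.append((user_id, acl["location"], acl["perms"]))
        return bytes.fromhex(acl["key"])
```

The point of the sketch is the order of checks: verify the credential, decrypt the ACL, compare identifiers, log, and only then release K<sub>d</sub>.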
== Application Authentication ==
All application-level calls to the Access Server must be authenticated '''by the application'''. The terms of this authentication are up to the implementation, but could include transport level or message level techniques (IP range pinning, IPsec, SSL/TLS, and API keys are all options). ((XX perhaps we could implement a couple))
== Administrative Access ==
The system supports audited, fine-grained administrative access through the creation of "super-credentials". These are credentials which represent super-user access to a '''single user's data'''. The Credential Oracle is required to provide features to make this work, so the exact details are out of scope for this specification, but the general flow is:
* Administrative user authenticates to the credentialing system, presenting a superuser authentication and a target user
* Credentialing system produces a super-credential for the target user, tagged with audit trail metadata (for example, "super user Jane, accessing user Joe, to investigate bug #6143635, at 2:25 PM 10/15/2011").
* The superuser then creates a session and issues commands as though he or she were the actual target user. The credential information is logged with all accesses.
The credential system should be designed to expire the credential in a reasonable interval to allow the administrative user to finish his or her tasks.
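A super-credential as described above could be modelled along these lines. The field names, the <tt>issue_super_credential</tt> helper and the one-hour default lifetime are all assumptions for illustration; the specification leaves these details to the Credential Oracle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuperCredential:
    admin: str            # the administrator who is acting
    target_user: str      # the single user whose data may be accessed
    reason: str           # audit trail metadata
    issued_at: datetime
    expires_at: datetime

    def is_valid(self, now):
        return self.issued_at <= now < self.expires_at

    def audit_line(self):
        # Text that would be logged alongside every access.
        return (f"super user {self.admin}, accessing user {self.target_user}, "
                f"{self.reason}, at {self.issued_at.isoformat()}")


def issue_super_credential(admin, target_user, reason, now,
                           lifetime=timedelta(hours=1)):
    """Hypothetical issuing step; the admin is assumed to be authenticated already."""
    return SuperCredential(admin, target_user, reason, now, now + lifetime)
```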
== Batch/Automation Access ==
In a similar fashion to administrative access, the system supports batch mode access for automation and aggregate analysis. The credential oracle is required to support "automation credentials" for this purpose.
An automation credential is a credential that allows access to a set of users. The automated process must authenticate itself to the credential oracle and receive a credential, just like a superuser access. The process must assert the purpose of its access during this credentialing process, and the credential it receives is time- and scope-limited. ((XX details!)) For example, "Indexing batch process, handling user updates new since 10/14/2011-01:50-Z".
Once the automated system has a credential, it may use that credential to begin a session and access user data.
(XX If the system supports internal aggregate calculations, e.g. map-reduce or tree walking of data, the Access Server session could make use of the credential internally to access multiple user keys)
== Provisioning to applications outside the trusted perimeter ==
If the application-level authentication is strong enough, the Access Server could accept inbound requests from applications outside the trusted computing perimeter. These applications would still be required to authenticate themselves and present a user credential. Transport-level encryption of the communication would be mandatory.
Possible topologies:
* Traditional OAuth: The application authenticates with an API key and secret; the user credential is an OAuth Access Token (which was previously provisioned through an authorization flow)
* BrowserID: The application authenticates with SSL/TLS or an API key and secret; the user credential is an identity assertion. ((XX what would the audience be - probably the accessing application?))
=== Questions ===
==== "The Vault" and "The Cloud" ====
Can we use this system to store non-decryptable user data without any issues? This would mean storing an ACL that contains, not K<sub>d</sub>, but a NULL key - that is, a record indicating that we don't know how to read the document. Mozilla would still be responsible for maintaining the ACL, so that an authenticated user could extend read or write privileges to another user.
==== Trusted computing base, lifecycle questions: ====
* Does the access server cache credential verification, or credential to user identifier results, or key decrypt results?
* Currently we need to consult the oracle twice (and maybe thrice) - once to find which identifier to use for the ACL lookup (once in access server, or once in data server, or both?), and again in the keyserver to verify that the ACL's plaintext contains the identifier in the credential. Is that really necessary?
==== Data partitioning questions ====
* Per-user? (identifier, datakey)
* Flat per-user? /identifier/datakey
* Flat global? /datakey
Anything per user means that the app needs to know the user identifier to enable cross-user access. Anything global means that all users have to live in a shared namespace, which makes locality of storage harder to achieve. One option: if the identifier is not reversible to user identity, we could still get locality. Avoiding trivial correlation means we need more than one identifier.
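To make the three partitioning options concrete, here is a small sketch of how storage keys might be built under each. The function names are hypothetical, and the HMAC-based <tt>opaque_identifier</tt> is just one possible reading of "not reversible to user identity", not something the specification prescribes.

```python
import hashlib
import hmac


def per_user_key(identifier, datakey):
    # Per-user: the store indexes on the (identifier, datakey) pair itself.
    return (identifier, datakey)


def flat_per_user_key(identifier, datakey):
    # Flat per-user: a single path with the identifier as prefix.
    return f"/{identifier}/{datakey}"


def flat_global_key(datakey):
    # Flat global: every user shares one namespace.
    return f"/{datakey}"


def opaque_identifier(secret, user_identity):
    # Assumed approach: a keyed hash, so the stored prefix cannot be
    # reversed to the user identity without the secret.
    return hmac.new(secret, user_identity.encode(), hashlib.sha256).hexdigest()
```

With an opaque identifier as the prefix, all of one user's keys still share a prefix (so storage locality is possible), but the prefix alone does not reveal who the user is.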
==== Collections? Linked data structures? ====
What's the right way to represent collections and graphs/trees? This probably depends a lot on the underlying persistence mechanism.
If it's Riak-like, we have the ability to perform tree reassembly inside the persistence layer; this would require decrypting references inside the DB or holding references externally. Can we live with that? We could expose a collection identifier outside of the encryption envelope, for example.
==== Resolution of user data to account ====
As currently written, the User Identifier is the only entry point into the database, and there is only one of them. There will be cases (mostly administrative) where a valid user will need to perform discovery based on other data -- for example, to search based on givenName/familyName for a user account, when the email address has been lost, to investigate a payment.
There is no efficient way to perform that query as the system is currently specified.
= Strawman API =
For fun, we specify the API using HTTP.
== Versioning of the API ==
We punt on versioning for now by using DNS, e.g. <tt>https://v1.sauropod.mozilla.org</tt>
== Caller Authentication ==
Every caller into the API has an API key and secret. These are used for Caller Authentication. The API key and secret are used to perform 2-legged OAuth-signed calls. There is no dance, just signing of the API call (much like Amazon S3).
== Session Initiation ==
POST /session/start
{assertion=$browserid_assertion&audience=$app_domain}
returns a session token and secret, which are used to sign subsequent requests
session_token={session_token}&session_secret={session_secret}&expires_at={expiration}
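A client might handle the session response and sign later requests roughly as below. Parsing follows the form-encoded response shown above. The signing function is a simplified HMAC-SHA256 over method, path and body; it is an illustrative stand-in and does not reproduce the OAuth signature base string. Both function names are assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs


def parse_session_response(body):
    """Split the form-encoded session response into its three fields."""
    fields = {k: v[0] for k, v in parse_qs(body, strict_parsing=True).items()}
    return fields["session_token"], fields["session_secret"], fields["expires_at"]


def sign_request(session_secret, method, path, body=b""):
    # Simplified signature: HMAC-SHA256 over method, path and body.
    # A real client would follow the signing rules the server requires.
    message = b"\n".join([method.upper().encode(), path.encode(), body])
    return hmac.new(session_secret.encode(), message, hashlib.sha256).hexdigest()
```

The session token would be sent with each subsequent request, together with a signature computed from the session secret.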
== Set ==
PUT /app/{app_id}/users/{user_id}/keys/{key}
{value}
== Get ==
GET /apps/{app_id}/users/{user_id}/keys/{key}
returns the content of the data at that key, with the content-type specified at upload time. Only if authorized, of course.
= Use Cases for Sauropod =
== OpenWebApps ==
The first use case for Sauropod is to store a user's list of installed "apps" for the OpenWebApps project. This use case has the following requirements (all operations are per-user):
# Add an installed app. Apps are keyed by domain, and are unique. The value of an app record is an arbitrary JSON object.
# Retrieve the list of all installed apps; returns an array of app records.
# Modify an app record. Deleting an app is disallowed (a user never unpurchases an app but may choose to uninstall it, which is denoted by marking it as such in the app record).
# Retrieve the number of installs of a particular app (without leaking information about users who have the app installed). Ideally this aggregation would be done by Sauropod on the server side.
# Retrieve general statistics such as installs/hour and uninstalls/hour.
# Retrieve information on how many apps have been installed from a particular app store in a given time period.
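The first four requirements can be illustrated with a small in-memory sketch. The class and function names, the <tt>uninstalled</tt> flag, and the choice to count only records not marked as uninstalled are assumptions for the example; the use case itself does not fix these details.

```python
class InstalledApps:
    """One user's installed-app records, keyed by app domain."""

    def __init__(self):
        self._records = {}

    def add(self, domain, record):
        # Apps are unique per domain.
        if domain in self._records:
            raise KeyError(f"{domain} is already installed")
        self._records[domain] = dict(record)

    def all_records(self):
        return list(self._records.values())

    def modify(self, domain, **changes):
        self._records[domain].update(changes)

    def uninstall(self, domain):
        # Records are never deleted; uninstalling only marks the record.
        self.modify(domain, uninstalled=True)

    def has_active(self, domain):
        record = self._records.get(domain)
        return record is not None and not record.get("uninstalled", False)


def install_count(stores, domain):
    """Aggregate count that returns a number only, never the users involved."""
    return sum(1 for store in stores if store.has_active(domain))
```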