Changes

Sauropod

1,036 bytes added, 18:14, 19 October 2011

no edit summary

= Sauropod Technical Specification =

Sauropod is a secure storage system for user data. It employs end-to-end encryption and secure key storage to enable least-privilege access, fine-grained user permissioning, and a controlled, auditable process for administrative and automated data access.

To application developers, Sauropod presents a key-value storage API, where each user has a completely independent universe of keys. Applications gain access to a user's store by presenting a user credential, the generation and validation of which is external to the Sauropod system. The store may also, optionally, restrict access to a particular set of user keys based on the application making the access. Applications may extend the privileges on a particular object key according to sensible transitive principles: a user that can read a file can extend read permission to any other user, and similarly for writes. (XX support locking an item down as non-sharable?)

Administrative and automated access is supported through "super-credentials". These allow developers and batch processes to obtain capabilities identical to those of a user for a limited time. There is no "super-user" that is allowed to access all records; instead, an administrator acquires the permissions of a user through an authenticated, auditable process.

(XX The desired implementation of Sauropod is of a key-value store with encrypted values, where the encryption keys are per-user keys that are wrapped with a small number of master secrets. Keys are only unwrapped inside the system. The API described to the client does not expose the details of the internal data protection scheme, but it is completely compatible with this internal representation)

== Project Phasing ==

* ''Phase Zero'': In phase zero, a tentative API, including credentials, is implemented. What happens within the API is unspecified, and may not involve any cryptography to start with.

* ''Phase One'': In phase one, the session API is fully implemented. Callers are required to present user credentials or an administrative credential to access user data. The internal implementation is ''not'' encrypted, but uses row-level access control to enforce fine-grained permissions. The Access Server, Logging, and Credential Oracle are implemented fully; the Data Server is a non-encrypted database; the Key Server is not present. The Sharing API is not implemented. (XX how much of Administrative and Automated?)

* ''Phase Two'': In phase two, the Key Server is fully implemented, and the Data Server is modified to store encrypted data. The encrypted ACL system is implemented. The Sharing, Administrative, and Automated APIs are implemented.

== Definition of Terms ==

* Application: a process that is accessing user data on behalf of a user. Applications use Application Authentication to prove to the access server which process they are.

* Access Server: a Sauropod internal process that handles requests from applications to access data and keys.

* Data Server: a Sauropod internal process that maintains a table of user data. Each atom of user data has a bucket and a value.

* Credential Oracle: (too cute? name?) An external process, configured as part of a Sauropod installation, which verifies a credential and translates it into a user identifier.

* Credential: A string of bytes, presented by the application to the Access Server, which encodes the successful authentication of a user into the system. A credential could be a cookie (which would then be checked with a session server connected to the authentication system) or a directly-verifiable credential such as a BrowserID assertion or proof of SSL client certificate handshake.

* User Identifier: A string of bytes that represents a single user in the Sauropod system. Credentials can be converted into user identifiers by the Credential Oracle.

* Key Server: a Sauropod internal process that maintains a list of per-value keys. All keys are wrapped with a Master Secret that is known only to the Key Server (or, better yet, locked away in a hardware module that only the Key Server can access). Every unique value has its own key; a key may be wrapped by more than one user key (if more than one user has access to it).

* Logging Aggregator: A Sauropod internal process that collates the logs of the access, data, and key servers to provide a unified view of data access behavior. It may optionally run audit logic to detect anomalous access patterns.

== Basic Flow of Control ==

In the course of processing a request from a user, an application needs to retrieve some data. As part of the request (or a session context connected to it), the application has a user credential.

The application begins a session with the access server by sending the user credential in a BeginSession request.

The Access Server validates the credential by consulting the Credential Oracle and creates a session associated with the User Identifier. A session identifier is returned to the application, which must be included with all subsequent requests.

The application then issues some number of requests to the Access Server, including the session identifier with each.

The Access Server authenticates the request to determine which application is making the request. It then:

* Determines if the application is allowed to access the requested bucket.
* In the Clear Data Model, verifies the credential and derives a User Identifier from it, and then determines whether that User Identifier has permission to access the requested bucket; if permission is allowed, performs the read or write.
* In the Encrypted Data Model, verifies the credential and derives a User Identifier from it, and then retrieves the ACL record and data ciphertext for that User Identifier at the requested bucket; the ACL is then passed to the Key Server for validation and decryption, and the key contained in the ACL is used to decrypt the ciphertext or encrypt a new ciphertext for the data.

The Key Server and Data Server only respond to requests from Access Servers. All Key Server and Data Server operations are logged.
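The session flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the Sauropod implementation: the `CredentialOracle` and `AccessServer` classes, their method names, and the `cred:<user>` credential format are all hypothetical, and application authentication and per-bucket permission checks are left out.

```python
import secrets


class CredentialOracle:
    """Hypothetical stand-in: treats 'cred:<user>' as a valid credential."""

    def verify(self, credential: str) -> str:
        prefix = "cred:"
        if not credential.startswith(prefix) or len(credential) == len(prefix):
            raise PermissionError("invalid credential")
        return credential[len(prefix):]  # the User Identifier


class AccessServer:
    def __init__(self, oracle: CredentialOracle):
        self.oracle = oracle
        self.sessions = {}  # session identifier -> user identifier
        self.data = {}      # (user identifier, bucket) -> value
        self.log = []       # stand-in for the Logging Aggregator

    def begin_session(self, credential: str) -> str:
        # Validate the credential with the oracle, then open a session.
        user_id = self.oracle.verify(credential)
        session_id = secrets.token_hex(16)
        self.sessions[session_id] = user_id
        self.log.append(("begin_session", user_id))
        return session_id

    def _user_for(self, session_id: str) -> str:
        try:
            return self.sessions[session_id]
        except KeyError:
            raise PermissionError("unknown session") from None

    def put(self, session_id: str, bucket: str, value: str) -> None:
        user_id = self._user_for(session_id)
        self.data[(user_id, bucket)] = value
        self.log.append(("put", user_id, bucket))

    def get(self, session_id: str, bucket: str) -> str:
        user_id = self._user_for(session_id)
        self.log.append(("get", user_id, bucket))
        return self.data[(user_id, bucket)]


server = AccessServer(CredentialOracle())
sid = server.begin_session("cred:joe")
server.put(sid, "prefs", "dark-mode")
result = server.get(sid, "prefs")
```

Every operation is appended to a log, mirroring the requirement that all accesses are recorded; a real deployment would send these entries to the Logging Aggregator.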

== Data Storage Model ==

(XX Need lots of feedback here).

* Key-value pairs
* Collections? Ordered lists?
* Trees? Graphs?

== Clear Data Model ==

In the '''Clear Data Model''', the Data Server maintains two types of data:

* A ''user data bucket'' is keyed on a bucket location, and contains the ''data''.
* A ''user access bucket'' is keyed on a bucket location and a user identifier, and contains a permission (level, or bitwise mask, of read/write/admin).
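As a rough illustration of the two bucket types in the Clear Data Model, the sketch below uses plain dictionaries and the bitwise-mask variant of the permission field. The `Perm` flag values, the bucket locations, and the function names are assumptions made for the example; the specification does not fix any of them.

```python
from enum import IntFlag


class Perm(IntFlag):
    READ = 1
    WRITE = 2
    ADMIN = 4


# user data bucket: bucket location -> data
user_data = {"/joe/apps": '{"installed": []}'}

# user access bucket: (bucket location, user identifier) -> permission mask
user_access = {
    ("/joe/apps", "joe"): Perm.READ | Perm.WRITE | Perm.ADMIN,
    ("/joe/apps", "jane"): Perm.READ,
}


def allowed(location: str, user_id: str, needed: Perm) -> bool:
    """True if every bit in `needed` is set in the stored mask."""
    mask = user_access.get((location, user_id), Perm(0))
    return (mask & needed) == needed


def read(location: str, user_id: str) -> str:
    if not allowed(location, user_id, Perm.READ):
        raise PermissionError(f"{user_id} may not read {location}")
    return user_data[location]


def write(location: str, user_id: str, value: str) -> None:
    if not allowed(location, user_id, Perm.WRITE):
        raise PermissionError(f"{user_id} may not write {location}")
    user_data[location] = value
```

A user with no access record gets an empty mask, so the default is to deny.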

== Encrypted Data Model ==

In the '''Encrypted Data Model''', the Data Server maintains two types of data:

* A ''user data bucket'' is keyed on a bucket location, and contains the ''data'' encrypted with a ''data key'' (for location d, called Kd).
* A ''user access bucket'' is keyed on a bucket location and a user identifier, and contains a tuple of (perms, (location, identifier, perms, Kd)_Kmaster). That is, the access control list (ACL) record, which uniquely identifies the ability of a particular user to access a particular bucket, and contains the key to read that bucket, is encrypted with a master secret.

The Key Server simply maintains a small number of master secrets. The Access Server retrieves ACL records from the Data Server and submits them to the Key Server, along with a credential. The Key Server verifies the credential, decrypts the ACL tuple, verifies that the identifier in the tuple matches the identifier of the credential, logs the access, and returns Kd to the Access Server. (NB the Key Server should be locked down so only the Access Server can talk to it.)

((XX should access server include the intent of which permission it wants? would be for logging only))
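The ACL wrapping in the Encrypted Data Model, where the tuple (location, identifier, perms, Kd) is sealed under a master secret and the Key Server releases Kd only when the identifier matches, might look like the sketch below. The `seal`/`unseal` helpers are a toy encrypt-then-MAC construction built from the standard library purely for illustration; they are not the cipher Sauropod would use and are not production cryptography. All function names and the JSON record layout are assumptions, and credential verification through the Credential Oracle is elided (the caller passes the already-derived identifier).

```python
import hashlib
import hmac
import json
import os


def _subkey(key: bytes, label: bytes) -> bytes:
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:n]


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Toy encrypt-then-MAC. Illustration only, NOT production crypto."""
    nonce = os.urandom(16)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    ct = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag


def unseal(key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("record failed integrity check")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(ct))
    return bytes(a ^ b for a, b in zip(ct, stream))


K_MASTER = os.urandom(32)  # master secret, known only to the Key Server
key_server_log = []


def make_acl(location: str, identifier: str, perms: str, kd: bytes) -> bytes:
    """Build (location, identifier, perms, Kd)_Kmaster."""
    record = {"location": location, "identifier": identifier,
              "perms": perms, "kd": kd.hex()}
    return seal(K_MASTER, json.dumps(record).encode())


def key_server_unwrap(acl_blob: bytes, credential_identifier: str) -> bytes:
    """Decrypt the ACL, check the identifier, log the access, return Kd."""
    record = json.loads(unseal(K_MASTER, acl_blob))
    if record["identifier"] != credential_identifier:
        raise PermissionError("ACL identifier does not match credential")
    key_server_log.append((credential_identifier, record["location"]))
    return bytes.fromhex(record["kd"])


kd = os.urandom(32)                        # data key for location d
acl = make_acl("/joe/apps", "joe", "rw", kd)
ciphertext = seal(kd, b"secret data")      # what the Data Server stores
plaintext = unseal(key_server_unwrap(acl, "joe"), ciphertext)
```

Because the identifier lives inside the sealed record, a user cannot present another user's ACL record to obtain Kd.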

== Application Authentication ==

All application-level calls to the Access Server must be authenticated '''by the application'''. The terms of this authentication are up to the implementation, but could include transport-level or message-level techniques (IP range pinning, IPsec, SSL/TLS, and API keys are all options). ((XX perhaps we could implement a couple))

== Administrative Access ==

The system supports audited, fine-grained administrative access through the creation of "super-credentials". These are credentials which represent super-user access to a '''single user's data'''. The Credential Oracle is required to provide features to make this work, so the exact details are out of scope for this specification, but the general flow is:

* Administrative user authenticates to the credentialing system, presenting a superuser authentication and a target user.
* Credentialing system produces a super-credential for the target user, tagged with audit trail metadata (for example, "super user Jane, accessing user Joe, to investigate bug #6143635, at 2:25 PM 10/15/2011").
* The superuser then creates a session and issues commands as though he or she were the actual target user. The credential information is logged with all accesses.

The credential system should be designed to expire the credential in a reasonable interval to allow the administrative user to finish his or her tasks.
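A super-credential as described here is scoped to one target user, carries audit metadata, and expires. The sketch below shows one way those three properties could fit together. The `SuperCredential` fields, the one-hour default lifetime, and the function names are hypothetical; the specification leaves the details to the Credential Oracle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SuperCredential:
    admin: str           # who is acting
    target_user: str     # the single user whose data may be accessed
    reason: str          # audit trail metadata
    issued_at: datetime
    ttl: timedelta

    def is_valid(self, now: datetime) -> bool:
        return self.issued_at <= now < self.issued_at + self.ttl


audit_log = []


def issue_super_credential(admin, target_user, reason, now,
                           ttl=timedelta(hours=1)):
    cred = SuperCredential(admin, target_user, reason, now, ttl)
    audit_log.append(("issued", admin, target_user, reason))
    return cred


def access_as(cred: SuperCredential, user_id: str, bucket: str, now: datetime):
    """Check expiry and scope, then log the access with the credential info."""
    if not cred.is_valid(now):
        raise PermissionError("super-credential expired")
    if user_id != cred.target_user:
        raise PermissionError("super-credential is scoped to a single user")
    audit_log.append(("access", cred.admin, user_id, bucket, cred.reason))
    return (user_id, bucket)


now = datetime(2011, 10, 15, 14, 25, tzinfo=timezone.utc)
cred = issue_super_credential("jane", "joe", "investigate bug #6143635", now)
granted = access_as(cred, "joe", "prefs", now + timedelta(minutes=5))
```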

== Batch/Automation Access ==

In a similar fashion to administrative access, the system supports batch mode access for automation and aggregate analysis. The Credential Oracle is required to support "automation credentials" for this purpose.

An automation credential is a credential that allows access to a set of users. The automated process must authenticate itself to the Credential Oracle and receive a credential, just like a superuser access. The process must assert the purpose of its access during this credentialing process, and the credential it receives is time- and scope-limited. ((XX details!)) For example, "Indexing batch process, handling user updates new since 10/14/2011-01:50-Z".

Once the automated system has a credential, it may use that credential to begin a session and access user data.

(XX If the system supports internal aggregate calculations, e.g. map-reduce or tree walking of data, the Access Server session could make use of the credential internally to access multiple user keys).

== Provisioning to applications outside the trusted perimeter ==

If the application-level authentication is strong enough, the Access Server could accept inbound requests from applications outside the trusted computing perimeter. These applications would still be required to authenticate themselves and present a user credential. Transport-level encryption of the communication would be mandatory.

Possible topologies:

* Traditional OAuth: The application authenticates with an API key and secret; the user credential is an OAuth Access Token (which was previously provisioned through an authorization flow).
* BrowserID: The application authenticates with SSL/TLS or an API key and secret; the user credential is an identity assertion. ((XX what would the audience be - probably the accessing application?))

=== Questions ===

==== "The Vault" and "The Cloud" ====

Can we use this system to store non-decryptable user data without any issues? This would mean storing an ACL that contains, not Kd, but a NULL key - that is, a record saying that we don't know how to read the document. Mozilla would still be responsible for maintaining the ACL, so that an authenticated user could extend read or write privileges to another user.

==== Trusted computing base, lifecycle questions: ====

* Does the access server cache credential verification, or credential to user identifier results, or key decrypt results?
* Currently we need to consult the oracle twice (and maybe thrice) - once to find which identifier to use for the ACL lookup (once in access server, or once in data server, or both?), and again in the keyserver to verify that the ACL's plaintext contains the identifier in the credential. Is that really necessary?

==== Data partitioning questions ====

* Per-user? (identifier, datakey)
* Flat per-user? /identifier/datakey
* Flat global? /datakey

Anything per user means that app needs to know user identifier to enable cross-user access. Anything global means that all users have to live in a shared namespace, which makes locality of storage harder to achieve. One option: if identifier is not reversible to user identity, could still get locality. Avoiding trivial correlation means we need more than one identifier.
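To make the partitioning options concrete, the sketch below builds a storage key under each layout and shows one possible reading of the "non-reversible identifier" idea: deriving an opaque per-purpose identifier with a keyed hash, so that a user's records share a prefix (locality) without exposing the user identity, and different purposes yield unlinkable identifiers. The keyed-hash derivation, the `purpose` labels, and all function names are assumptions for illustration, not part of the specification.

```python
import hashlib
import hmac


def per_user_key(identifier: str, datakey: str) -> tuple:
    # Per-user: a composite (identifier, datakey) key.
    return (identifier, datakey)


def flat_per_user_key(identifier: str, datakey: str) -> str:
    # Flat per-user: one string namespace, prefixed by the identifier.
    return f"/{identifier}/{datakey}"


def flat_global_key(datakey: str) -> str:
    # Flat global: every user shares one namespace.
    return f"/{datakey}"


def opaque_identifier(secret: bytes, user_id: str, purpose: str) -> str:
    """Derive a storage identifier that cannot be reversed without `secret`."""
    message = f"{purpose}\0{user_id}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


secret = b"hypothetical-partitioning-secret"
storage_id = opaque_identifier(secret, "joe@example.com", "storage")
sharing_id = opaque_identifier(secret, "joe@example.com", "sharing")
key_a = flat_per_user_key(storage_id, "apps")
key_b = flat_per_user_key(storage_id, "prefs")
```

Both keys for the same user start with the same opaque prefix, so a store that shards by prefix keeps them together, while `storage_id` and `sharing_id` cannot be trivially correlated.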

==== Collections? Linked data structures? ====

What's the right way to represent collections and graphs/trees? This probably depends a lot on the underlying persistence mechanism.

If it's Riak-like, we have the ability to perform tree reassembly inside the persistence layer; this would require decrypting references inside the DB or holding references externally. Can we live with that? We could expose a collection identifier outside of the encryption envelope, for example.

==== Resolution of user data to account ====

As currently written, the User Identifier is the only entry point into the database, and there is only one of them. There will be cases (mostly administrative) where a valid user will need to perform discovery based on other data -- for example, to search based on givenName/familyName for a user account, when the email address has been lost, to investigate a payment.

There is no efficient way to perform that query as the system is currently specified.

= Strawman API =

For fun, we specify the API using HTTP.

== Versioning of the API ==

We punt on versioning for now by using DNS, e.g. <tt>https://v1.sauropod.mozilla.org</tt>

== Caller Authentication ==

Every caller into the API has an API key and secret. These are used for Caller Authentication. The API key and secret are used to perform 2-legged OAuth-signed calls. There is no dance, just signing of the API call (much like Amazon S3).

== Session Initiation ==

POST /session/start

{assertion=$browserid_assertion&audience=$app_domain}

returns a session token and secret, which are used to sign subsequent requests

session_token={session_token}&session_secret={session_secret}&expires_at={expiration}
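Signing a subsequent request with the returned session secret could look roughly like the sketch below. It uses HMAC-SHA1 over a simplified base string and is only an illustration of the idea; it does not implement the full OAuth 1.0 signature base string rules, and the function names and the choice of signed fields are assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_request(method: str, url: str, body: str,
                 session_token: str, session_secret: str) -> str:
    """Return a base64 HMAC-SHA1 signature over a simplified base string."""
    parts = (method.upper(), url, body, session_token)
    base = "&".join(quote(part, safe="") for part in parts)
    digest = hmac.new(session_secret.encode(), base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def verify_request(method: str, url: str, body: str, session_token: str,
                   session_secret: str, signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_request(method, url, body, session_token, session_secret)
    return hmac.compare_digest(expected, signature)


url = "https://v1.sauropod.mozilla.org/app/a1/users/u1/keys/prefs"
signature = sign_request("put", url, "dark-mode", "tok123", "s3cret")
ok = verify_request("PUT", url, "dark-mode", "tok123", "s3cret", signature)
tampered = verify_request("PUT", url, "light-mode", "tok123", "s3cret", signature)
```

Only the token travels with the request; the secret stays with the caller and the server, so a tampered body or URL fails verification.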

== Set ==

PUT /app/{app_id}/users/{user_id}/keys/{key}

{value}

== Get ==

GET /apps/{app_id}/users/{user_id}/keys/{key}

returns the content of the data at that key, with content-type specified at upload time. Only if authorized, of course.

= Use Cases for Sauropod =

== OpenWebApps ==

The first use case for Sauropod is to store a user's list of installed "apps" for the OpenWebApps projects. This use case has the following requirements (all operations are per-user):

# Add an installed app. Apps are keyed by domain, and are unique. The value of an app record is an arbitrary JSON object.
# Retrieve the list of all installed apps; returns an array of app records.
# Modify an app record. Deleting an app is disallowed (a user never unpurchases an app, but may choose to uninstall it, which is denoted by marking it as such in the app record).
# Retrieve the number of installs of a particular app (without leaking information about users who have the app installed). Ideally this aggregation would be done by Sauropod on the server side.
# Retrieve general statistics such as installs/hour and uninstalls/hour.
# Retrieve information on how many apps have been installed from a particular app store in a given time period.
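The first four OpenWebApps requirements can be sketched against a simple in-memory store, as below. The `InstalledApps` class, the `uninstalled` flag name, and the choice to exclude uninstalled apps from the install count are all assumptions made for the example; in Sauropod the per-user records would live behind the Access Server and the count would be a server-side aggregate.

```python
class InstalledApps:
    def __init__(self):
        self._apps = {}  # user id -> {domain: app record}

    def add_app(self, user_id: str, domain: str, record: dict) -> None:
        """Requirement 1: apps are keyed by domain and unique per user."""
        apps = self._apps.setdefault(user_id, {})
        if domain in apps:
            raise ValueError(f"{domain} already installed for {user_id}")
        apps[domain] = dict(record, domain=domain, uninstalled=False)

    def list_apps(self, user_id: str) -> list:
        """Requirement 2: return an array of app records."""
        return list(self._apps.get(user_id, {}).values())

    def mark_uninstalled(self, user_id: str, domain: str) -> None:
        """Requirement 3: records are never deleted, only flagged."""
        self._apps[user_id][domain]["uninstalled"] = True

    def install_count(self, domain: str) -> int:
        """Requirement 4: an aggregate that reveals no user identifiers."""
        return sum(
            1
            for apps in self._apps.values()
            if domain in apps and not apps[domain]["uninstalled"]
        )


store = InstalledApps()
store.add_app("joe", "example.org", {"name": "Example"})
store.add_app("jane", "example.org", {"name": "Example"})
store.mark_uninstalled("jane", "example.org")
count = store.install_count("example.org")
joe_apps = store.list_apps("joe")
```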

Anant

