Identity/AttachedServices/StorageServerProtocol: Difference between revisions

From MozillaWiki
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Summary ==
== Summary ==


This is a working proposal for the PiCL Storage API, to implement the concepts described in [[Identity/CryptoIdeas/04-Delta-Sync]].  It's a work in progress that will eventually obsolete [[Identity/AttachedServices/StorageProtocolZero]].
This is a working proposal for the PiCL Storage API, to implement the concepts described in [[Identity/CryptoIdeas/05-Queue-Sync]].   
 
It's a work in progress that will eventually obsolete [[Identity/AttachedServices/StorageProtocolZero]].
 
 
== Queue-Sync Data Model ==
 
More details at [[Identity/CryptoIdeas/05-Queue-Sync]].
 
Data is stored in independent named '''collections'''.  A collection is a key-value store mapping keys to '''records'''.  Each collection has a monotonically-increasing '''sequence number''' which is incremented whenever a record is changed, and provides the ability to request all '''changes''' since a given sequence number.
 
 
'''Collection''' objects have the following fields:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>
 
<tr><td>name</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this collection amongst all the user's data.  Collection
names may only contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>A monotonically-increasing integer that is incremented with each change to the
contents of the collection.</td></tr>
 
<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>A hash that uniquely identifies the last change to this collection.  It is
derived from the new sequence number, the previous changeid, and the details of the change that was made.</td></tr>
<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the current changeid.  Not used or
verified by the server, since it doesn't have the secret key.</td></tr>
 
</table>
 
 
 
'''Record''' objects have the following fields:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>
 
<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this record within the collection.  Keys may only contain
characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>
 
<tr><td>payload</td><td>urlsafe string, 256 KB</td><td>The value currently stored in this record.  Typically this would be encrypted and
signed by the client.</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The collection-level sequence number at which this record was last modified.</td></tr>
 
<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The collection-level changeid corresponding to the modification of this
record.  It is derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>
 
<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid for this record.  Not
used or verified by the server, since it doesn't have the secret key.</td></tr>
 
</table>
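
The naming rule shared by collection names and record keys can be checked with a short validator.  This is a minimal sketch in Python; the function name <code>is_valid_identifier</code> is hypothetical, and it assumes that 64 bytes is an upper bound and that empty identifiers are rejected, neither of which the proposal states explicitly.

```python
import re

# Urlsafe-base64 alphabet per the tables above: alphanumerics, underscore, hyphen.
# Assumption: 64 bytes is an upper bound and empty identifiers are not allowed.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_identifier(value: str) -> bool:
    """Return True if value is usable as a collection name or record key."""
    return _IDENTIFIER_RE.fullmatch(value) is not None
```

Because every allowed character is a single byte in ASCII, the character count checked here equals the byte count.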
 
 
 
'''Change''' objects are identical to '''record''' objects, except their payload field may have the value NULL to indicate a deletion
rather than an update:
 
<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>


== Delta-Sync Data Model ==
<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for the changed record within the collection.  Keys may only
contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>


The storage server hosts a number of independent named '''collections''' for each user.  Each collection is a key-value store whose contents can be atomically modified by the client. Each modification of a collection creates a new '''version''' with corresponding version identifier, which is a signed hash of the contents of the collection at that version.
<tr><td>payload</td><td>urlsafe string or null, 256 KB</td><td>The new value to be stored in the record, or null if the record is to
be deleted.  Typically this would be encrypted and signed by the client.</td></tr>
 
<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The new collection-level sequence number after this change is applied.</td></tr>
 
<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The new collection-level changeid corresponding to this change.  It is
derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>
 
<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid.  Not used or verified
by the server, since it doesn't have the secret key.</td></tr>
 
</table>
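
The changeid derivation and client-side signature described in the tables above might be sketched as follows.  The proposal leaves the hash sizes as XXX and does not fix a hash function or serialization, so SHA-256, the length-prefixed field encoding, the hex output, and the names <code>derive_changeid</code> and <code>sign_changeid</code> are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Optional


def derive_changeid(seqnum: int, prev_changeid: str, key: str,
                    payload: Optional[str]) -> str:
    """Hash the new seqnum, previous changeid, record key and new payload.

    Assumption: SHA-256 over length-prefixed UTF-8 fields; the proposal does
    not specify the hash function or the serialization.
    """
    h = hashlib.sha256()
    # A deletion (payload None) is encoded differently from an empty payload.
    fields = [str(seqnum), prev_changeid, key,
              "D" if payload is None else "U" + payload]
    for field in fields:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
        h.update(data)
    return h.hexdigest()  # hex digits stay within the urlsafe alphabet


def sign_changeid(secret_key: bytes, changeid: str) -> str:
    """Client-side HMAC over the changeid; the server never sees secret_key."""
    return hmac.new(secret_key, changeid.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Since each changeid takes the previous one as input, changing any earlier change alters every later changeid.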




More details at [[Identity/CryptoIdeas/04-Delta-Sync]].


== Authentication ==
== Authentication ==


To access the storage service, a client device must authenticate by providing a BrowserID assertion and a Device ID.  It will receive in exchange:
To access the storage service, a client device must authenticate by providing a BrowserID assertion and a Device ID.  It will receive  
in exchange:


* the current version number of each collection
* a short-lived id/key pair that can be used to authenticate subsequent requests using the Hawk request-signing scheme
* a short-lived id/key pair that can be used to authenticate subsequent requests with Hawk
* a mapping of collection names to access URLs
* a URL to which further requests should be directed




You can think of this as establishing a "login session" with the server, although we're also tunneling some basic metadata in order to reduce the number of round-trips.
You can think of this as establishing a "login session" with the server.  Access requests for a specific collection should then be directed
to the appropriate URL.


Example:
Example:
Line 32: Line 105:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
    <  "base_url": <user-specific access url>,
     <  "id": <hawk auth id>,
     <  "id": <hawk auth id>,
     <  "key": <hawk auth secret key>,
     <  "key": <hawk auth secret key>,
     <  "collections": {
     <  "collections": {
     <    "bookmarks": <version id for bookmarks collection>,
     <    "history": <access url for history collection>,
     <    "passwords": <version id for passwords collection>,
     <    "bookmarks": <access url for bookmarks collection>,
     <    <...etc...>
     <    <...etc...>
     <  }
     <  }
     <  }
     <  }


The user and device identity information is encoded in the hawk auth id, to avoid re-sending it on each request.  The server may also include additional state in this value, depending on the implementation.  It's opaque to the client.
The user and device identity information is encoded in the hawk auth id, to avoid re-sending it on each request.  The server may also  
include additional state in this value, depending on the implementation.  It's opaque to the client.


The base_url may include a unique identifier for the user, in order to improve RESTful-icity of the API.  Or it might point the client to a specific data-center which houses their write master.  It's opaque to the client.
The collection-specific access URLs may include a unique identifier for the user, in order to improve RESTful-icity of the API.  Or  
they might point the client to a specific data-center which houses their write master for each collection.  It's opaque to the client.


== Data Access ==
== Data Access ==


The client now makes Hawk-authenticated requests to the storage API under its assigned base_url. The following operations are available.
The client now makes Hawk-authenticated requests to a specific collection at its assigned access url.
The following operations are available on each collection.


=== GET <base-url> ===


Get the current version id for all collections.  This is the same data as returned in the session-establishment call above, but it may be useful if the client wants to refresh its view.  Example:
=== GET <collection-url> ===


     >  GET <base-url>
Get the current metadata for a collection: its name, seqnum and changeid.
Example:
 
     >  GET <collection-url>
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 60: Line 137:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "collections": {
     <  "name": "history",
     <     "bookmarks": <version id for bookmarks collection>,
     <   "seqnum": 123,
     <     "passwords": <version id for passwords collection>,
     <   "changeid": "HASH_OF_DETAILS_OF_THE_MOST_RECENT_CHANGE",
    <    <...etc...>
     <  "signature": "HMAC_SIGNATURE_OF_CHANGEID"
     <  }
     <  }
     <  }


=== GET <base-url>/<collection> ===


Get the current version id for a specific collection.  Example:
=== GET <collection-url>/records ===


     >  GET <base-url>/<collection>
Query parameters:  start, end, limit.
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag
 
 
Get the set of records currently contained in the collection.  For small collections, the full set
of records will be returned like so:
 
     >  GET <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 77: Line 162:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "version": <version id for this collection>
     <  "records": {
    <   "key1": { "payload": "payload1", "seqnum": 123, "changeid": "HASH1", "signature": "sig1" },
    <    "key2": { "payload": "payload2", "seqnum": 124, "changeid": "HASH2", "signature": "sig2" }
    <  }
     <  }
     <  }


=== GET <base-url>/<collection>/<version> ===


Get the contents of a specific version of a specific collection.  In the simplest case, we GET the full contents like so:
If there are a large number of records in the collection then the server may choose to paginate the result, returning only some of the
records in the initial response.  It will include the key "next" in the output to indicate that more records are available:


     >  GET <base-url>/<collection>/<version>
     >  GET <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 90: Line 178:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
    <  "next": "key3",
     <  "items": {
     <  "items": {
     <   "key1": "value1",
     <     "key1": <record1>,
     <   "key2": "value2",
     <     "key2": <record2>
    <    <..etc..>
     <  }
     <  }
     <  }
     <  }


However, clients will usually want to request a delta from a previous version.  They can do this by specifying the "from" parameter.  New or updated keys are represented with their value, while deleted keys are represented with a value of null.  Like so:
Clients can request the next batch using the 'start' query parameter:


     >  GET <base-url>/<collection>/<version>?from=<previous version>
     >  GET <collection-url>/records?start=key3
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 106: Line 194:
     <  {
     <  {
     <  "items": {
     <  "items": {
     <    "key1": "value1", // a key that was updated
     <    "key3": <record3>,
     <    "key2": null      // a key that was deleted
     <    "key4": <record4>
     <  }
     <  }
     <  }  
     <  }


To allow reliable transfer of a large number of items, both client and server may choose to paginate responses to this query.
When no "next" value is included in the response, the client knows that all available records have
been fetched.
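
A client-side loop that follows "next" until it is absent could look like this.  It is a sketch only: <code>fetch_all_records</code> and <code>fake_fetch_page</code> are hypothetical names, the HTTP request is replaced by a callable so the example stays self-contained, and the body is assumed to carry records under "items" as in the paginated examples.

```python
from typing import Callable, Dict, Optional

# A page is the decoded JSON body: {"items": {...}, "next": "<key>"}, where
# "next" is absent on the final page.
FetchPage = Callable[[Optional[str]], dict]


def fetch_all_records(fetch_page: FetchPage) -> Dict[str, object]:
    """Collect every record by following "next" until it is absent."""
    records: Dict[str, object] = {}
    start: Optional[str] = None  # None means "send no start parameter"
    while True:
        page = fetch_page(start)
        records.update(page.get("items", {}))
        start = page.get("next")
        if start is None:
            return records


def fake_fetch_page(start: Optional[str]) -> dict:
    """Stand-in for the server, returning at most two records per page."""
    data = {"key1": "r1", "key2": "r2", "key3": "r3", "key4": "r4"}
    keys = sorted(k for k in data if start is None or k >= start)
    page = {"items": {k: data[k] for k in keys[:2]}}
    if len(keys) > 2:
        page["next"] = keys[2]
    return page
```

Calling <code>fetch_all_records(fake_fetch_page)</code> makes two requests and returns all four records.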


The client may specify "first" as the key at which to (lexicographically) start the listing, and "upto" as the key at which to stop the listing.  It may also specify an integer "limit" to restrict the total number of keys sent at once.  The server may enforce a default value and/or upper-bound on "limit".
Records are always batched in lexicographic order of their keys, and clients are free to request an arbitrary key range using the  
'start' and 'end' parameters:


If the set of items is truncated, the server will send the response argument "next" to give the next available key in iteration order.  The client should make another request setting "first" equal to the provided value of "next" in order to fetch additional items.
    >  GET <collection-url>/records?start=key2&end=key3
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "items": {
    <    "key2": <record2>,
    <    "key3": <record3>
    <  }
    <  }


As an example, suppose that the client requests at most two items per response, and the collection contains items "key1", "key2" and "key3".  It would need to fetch them in two batches like so:
Clients may also choose to batch their requests by using the 'limit' query parameter.  As with server-driven batching, the output key
"next" will be used to indicate that more data is available:


     >  GET <base-url>/<collection>/<version>?limit=2
     >  GET <collection-url>/records?start=key2&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
Line 125: Line 226:
     <  Content-Type: application/json
     <  Content-Type: application/json
     <  {
     <  {
     <  "next": "key3",
     <  "next": "key4",
     <  "items": {
     <  "items": {
     <    "key1": "value1",
     <    "key2": <record2>,
     <    "key2": "value2"
     <    "key3": <record3>
     <  }
     <  }
     <  }
     <  }
     .
     .
     .
     .
     >  GET <base-url>/<collection>/<version>?first=key3&limit=2
     >  GET <collection-url>/records?start=key4&limit=2
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "items": {
    <    "key4": <record4>
    <   }
    <  }
 
 
Each server response will include an "ETag" header, formed from the combination of the current seqnum and changeid of the collection.
Clients can use this in combination with standard If-Match and If-None-Match headers to ensure that they're getting a consistent view
of the collection:
 
    > GET <collection-url>/records?start=key2&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
     .
     .
     <  200 OK
     <  200 OK
     <  Content-Type: application/json
     <  Content-Type: application/json
    <  ETag: 124-HASH2
     <  {
     <  {
    <  "next": "key4",
     <  "items": {
     <  "items": {
     <    "key3": "value3"
    <    "key2": <record2>,
     <    "key3": <record3>
    <  }
    <  }
    .
    .
    >  GET <collection-url>/records?start=key4&limit=2
    >  Authorization:  <hawk auth parameters>
    >  If-Match: 123-HASH
    .
    <  412 Precondition Failed
    <  ETag: 125-HASH3
 
 
XXX TODO: use of headers, versus returning seqnum/changeid in the response body?
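
The ETag values in the examples above follow a <code>&lt;seqnum&gt;-&lt;changeid&gt;</code> pattern.  Assuming that pattern is the intended format (the proposal only says the ETag is formed from the two values), building, parsing and checking one might be sketched as below; the function names are hypothetical.

```python
from typing import Tuple


def make_etag(seqnum: int, changeid: str) -> str:
    """Combine seqnum and changeid in the "<seqnum>-<changeid>" form."""
    return f"{seqnum}-{changeid}"


def parse_etag(etag: str) -> Tuple[int, str]:
    """Split on the first hyphen only; a changeid may itself contain hyphens."""
    seqnum, changeid = etag.split("-", 1)
    return int(seqnum), changeid


def check_if_match(if_match: str, seqnum: int, changeid: str) -> int:
    """Return the HTTP status a server might send for an If-Match precondition."""
    return 200 if if_match == make_etag(seqnum, changeid) else 412
```

Splitting on the first hyphen matters because the urlsafe alphabet allows hyphens inside a changeid, as in <code>120-OLD-HASH</code>.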
 
 
=== GET <collection-url>/records/<key> ===
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag
 
 
Get the specific record stored under the given key:
 
    >  GET <collection-url>/records/<key>
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  ETag: 123-HASH1
    <  {
    <  "key": <key>,
    <  "seqnum": 123,
    <  "changeid": "HASH1",
    <  "payload": "payload1"
     <  }
     <  }
     <  }
     <  }


This request supports standard etag behaviour to ensure that a consistent view of the collection is being read.


XXX TODO: There are several error cases that need to be distinguished, possibly by HTTP status code or possibly by some information in the error response body:


* The requested version is not known or no longer present on the server
=== GET <collection-url>/changes ===
* We can't generate a delta from the specified "from" version to the request version
* The specified "from" version is invalid (e.g. due to lost writes during a rollback)


=== POST <base-url>/<collection>/<version> ===
Query parameters: since, limit.


Creates a new version of a specific collection.  In the simplest case, we POST up the full contents of the new version like so:
Get the sequence of changes that have been made to the collection.  If the number of changes to be returned is small, they will be
returned all at once like so:


     >  POST <base-url>/<collection>/<version>
     >  GET <collection-url>/changes
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1",
    >    "key2": "value2",
    >    <..etc..>
    >  }
    >  }
     .
     .
     <  201 Created
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    { "seqnum": 0, "changeid": "HASH1", "signature": "sig1", "key": "key1", "payload": "payload1" },
    <    { "seqnum": 1, "changeid": "HASH2", "signature": "sig2", "key": "key2", "payload": "payload2" }
    <  ]
    <  }


The changeids and signatures on these changes form a hash chain which can be verified by the client.
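
Client-side verification of that hash chain could be sketched as follows.  The real changeid derivation is not specified here, so it is passed in as a callable and <code>toy_derive</code> is only a placeholder; HMAC-SHA256 for signatures and sequence numbers that increase by exactly one are likewise assumptions.

```python
import hashlib
import hmac
from typing import Callable, List, Optional

Derive = Callable[[int, str, str, Optional[str]], str]


def verify_chain(changes: List[dict], prev_seqnum: int, prev_changeid: str,
                 secret_key: bytes, derive: Derive) -> bool:
    """Check that each change correctly extends the chain from the previous one."""
    for change in changes:
        # Assumption: seqnums increase by exactly one per change.
        if change["seqnum"] != prev_seqnum + 1:
            return False
        expected = derive(change["seqnum"], prev_changeid,
                          change["key"], change["payload"])
        if expected != change["changeid"]:
            return False
        sig = hmac.new(secret_key, expected.encode("utf-8"),
                       hashlib.sha256).hexdigest()
        # Constant-time comparison for the secret-dependent value.
        if not hmac.compare_digest(sig.encode("utf-8"),
                                   change["signature"].encode("utf-8")):
            return False
        prev_seqnum, prev_changeid = change["seqnum"], change["changeid"]
    return True


def toy_derive(seqnum: int, prev: str, key: str, payload: Optional[str]) -> str:
    """Placeholder derivation, for illustration only."""
    text = repr((seqnum, prev, key, payload))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

A client would call <code>verify_chain</code> with the seqnum and changeid it last verified, so that a dropped, reordered or altered change is detected before anything is applied locally.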


However, clients will usually want to send just the changes from a previous version.  They can do this by specifying the "from" parameter.  New or updated keys are represented with their value, while deleted keys are represented with a value of null.  Like so:
If there are a large number of changes to be fetched then the server may choose to paginate the result, returning only some of the
changes in the initial response.  It will include the key "next" in the output to indicate that more changes are available:


     >  POST <base-url>/<collection>/<version>?from=<previous version>
     >  GET <collection-url>/changes
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1",  // a key to be updated
    >    "key2": null      // a key to be deleted
    >  }
    >  }
     .
     .
     <  201 Created
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "next": 3,
    <  "changes": [
    <    <change1>,
    <    <change2>
    <  ]
    <  }
 
Clients can request the next batch using the 'since' query parameter:


    >  GET <collection-url>/changes?since=3
    >  Authorization:  <hawk auth parameters>
    .
    <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    <change3>,
    <    <change4>
    <  ]
    <  }


To guard against intermittent or unreliable connections, the client can also send data in batches.  It can specify the argument "first" to indicate a key offset at which this batch begins, and the argument "upto" to specify a key offset at which this batch ends.  The server will spool all the incoming items until it sees a batch with no "upto" argument, then create the new version as an atomic unit.
Changes are always batched in sequence number order.  Clients are free to request changes starting at an arbitrary sequence number,
which is useful for pulling in just the things that have changed since a previous sync.
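
An incremental sync built on 'since' and "next" might look like the sketch below.  The names are hypothetical and the server is replaced by a stand-in function.  Based on the examples in this section, 'since' is assumed to be inclusive, i.e. <code>since=N</code> returns changes with seqnum N and above.

```python
from typing import Callable, Dict

# fetch_changes(since) returns the decoded body: {"changes": [...], "next": n}
FetchChanges = Callable[[int], dict]


def sync_changes(local: Dict[str, str], next_seqnum: int,
                 fetch_changes: FetchChanges) -> int:
    """Apply all changes from next_seqnum onwards; return the next seqnum to ask for."""
    since = next_seqnum
    while since is not None:
        body = fetch_changes(since)
        for change in body["changes"]:
            if change["payload"] is None:
                local.pop(change["key"], None)  # null payload means deletion
            else:
                local[change["key"]] = change["payload"]
            next_seqnum = change["seqnum"] + 1
        since = body.get("next")  # absent when no more changes are available
    return next_seqnum


def fake_fetch_changes(since: int) -> dict:
    """Stand-in for the server, returning at most two changes per response."""
    log = [
        {"seqnum": 1, "key": "key1", "payload": "a"},
        {"seqnum": 2, "key": "key2", "payload": "b"},
        {"seqnum": 3, "key": "key1", "payload": None},
    ]
    pending = [c for c in log if c["seqnum"] >= since]
    body = {"changes": pending[:2]}
    if len(pending) > 2:
        body["next"] = pending[2]["seqnum"]
    return body
```

The returned value is what the client would persist and pass as 'since' on its next sync.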


As an example, here is how the client might create a new version by sending items one at a time:
Clients may also choose to batch their requests by using the 'limit' query parameter.  As with server-driven batching, the output key
"next" will be used to indicate that more data is available:


     >  POST <base-url>/<collection>/<version>?upto=key2
     >  GET <collection-url>/changes?since=2&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key1": "value1"
    >  }
    >  }
     .
     .
     <  202 Accepted
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "next": 4,
    <  "changes": [
    <    <change2>,
    <    <change3>
    <  ]
    <  }
     .
     .
     .
     .
     >  POST <base-url>/<collection>/<version>?start=key2&upto=key3
     >  GET <collection-url>/changes?since=4&limit=2
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  {
    >  "items": {
    >    "key2": "value2"
    >  }
    >  }
     .
     .
     <  202 Accepted
     <  200 OK
    <  Content-Type: application/json
    <  {
    <  "changes": [
    <    <change4>
    <  ]
    <  }
 
The server is not required to keep the full change history from seqnum zero, and may periodically compact and garbage-collect the
stored data.  If the client requests changes since a seqnum that is no longer known to the server, it will receive an error:
 
    >  GET <collection-url>/changes?since=1
    >  Authorization:  <hawk auth parameters>
     .
     .
     .
     <  416 Requested Range Not Satisfiable
     >  POST <base-url>/<collection>/<version>?start=key3
 
 
XXX TODO: seriously, is there a good error code for this, or should we just tunnel errors in the body?
 
 
=== POST <collection-url>/records ===
 
Request headers: If-Match, If-None-Match
 
Response headers: ETag
 
Update or delete records in the collection. The request body must contain an array of change objects with properly-formed sequence
numbers and changeids, and it must be preconditioned with an If-Match or If-None-Match header:
 
     >  POST <collection-url>/records
     >  Authorization:  <hawk auth parameters>
     >  Authorization:  <hawk auth parameters>
    >  If-Match: 125-HASH1
     >  {
     >  {
     >   "items": {
     >   "changes": [
     >   "key3": "value3"
    >      {"key": "key1", "payload": "newpayload1", "seqnum": 126, "changeid": "NEWHASH1", "signature": "newsig1"},
     >   }
     >     {"key": "key2", "payload": null, "seqnum": 127, "changeid": "NEWHASH2", "signature": "newsig2"}
     >  }
     >   ]
     >  }  
     .
     .
     <  201 Created
     <  204 No Content


The server will apply each change in turn, checking that the seqnum and changeid hash chains are properly formed.  If they are not
then an error will be reported:


XXX TODO: There are several error cases that need to be distinguished, possibly by HTTP status code or possibly by some information in the error response body:
    >  POST <collection-url>/records
    >  Authorization: <hawk auth parameters>
    >  If-Match: 120-OLD-HASH
    >  {
    >    "changes": [
    >      {"key": "key1", "payload": "newpayload1", "seqnum": 121, "changeid": "NEWHASH1", "signature": "newsig1"},
    >      {"key": "key2", "payload": null, "seqnum": 122, "changeid": "NEWHASH2", "signature": "newsig2"}
    >    ]
    >  }
    .
    <  412 Precondition Failed
    <  ETag: 125-HASH1


* There was a conflicting write, so you can no longer create the requested version
* The requested version is invalid, e.g. wrong sequence number
* The specified "from" version is too old, so we can't use it as the start point of a delta
* The specified "from" version is invalid (e.g. due to lost writes during a rollback)
* The provided batches had holes, or were otherwise invalid
* The server forgot a previous batch and you'll have to start again


No content is returned in response to a POST.  The client has already calculated the new seqnum and changeid for the collection, so
there is no more useful information that the server can provide.
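
The server-side handling of this POST, a precondition check followed by a sequence check on each change, might be sketched as below.  This is not the proposal's implementation: the <code>Collection</code> class and <code>post_changes</code> are hypothetical, changeid verification is omitted because the hash derivation is unspecified, the 400 status for a malformed chain is an assumption, and validating the whole batch before applying anything is a design choice of this sketch.

```python
from typing import Dict, List


class Collection:
    """In-memory stand-in for one collection on the server."""

    def __init__(self) -> None:
        self.seqnum = 0
        self.changeid = ""
        self.records: Dict[str, str] = {}

    def etag(self) -> str:
        return f"{self.seqnum}-{self.changeid}"


def post_changes(coll: Collection, if_match: str, changes: List[dict]) -> int:
    """Return the HTTP status for a POST of change objects."""
    if if_match != coll.etag():
        return 412  # Precondition Failed: the client's view is stale
    # Validate the whole batch before touching any state.
    expected = coll.seqnum
    for change in changes:
        expected += 1
        if change["seqnum"] != expected:
            return 400  # assumption: the proposal does not pick a status here
    for change in changes:
        if change["payload"] is None:
            coll.records.pop(change["key"], None)  # null payload deletes
        else:
            coll.records[change["key"]] = change["payload"]
        coll.seqnum, coll.changeid = change["seqnum"], change["changeid"]
    return 204  # No Content
```

On a 412 the client would fetch the changes it is missing, recompute its seqnums and changeids on top of them, and retry.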


== Things To Think About ==
XXX TODO: since we're posting "change" objects, does it make more sense to direct this POST at <collection-url>/changes rather than at the records resource?


* How do people feel about the separate "login" step.  It's providing value to the server since it lets us tunnel some state information, but maybe it's not very nice from the client side?
=== POST <collection-url>/records/<key> ===
* Currently there's no explicit way for the server to track the current version held by each client.  We could add this in the initial handshake, or intuit it based on their activity.
* Is json the best format for this transfer, or could we come up with a more efficient representation?
* Should we add a way to retrieve specific keys, for real-time updating of just the important bits?




feedback from warner:
Update or delete a specific record in the collection.  The request body must contain a change object with properly-formed sequence
number and changeid, and it must be preconditioned with an If-Match or If-None-Match header:


  <warner> rfkelly: some random thoughts
    > POST <collection-url>/records/<key>
  <rfkelly> please :-)
    > Authorization:  <hawk auth parameters>
  <warner> there will be "shared collections" and "per-device collections", might be useful to have some metadata indicating which is which
    > If-Match: 125-HASH1
  <warner> something to indicate whether data is stored as class-A or class-B, although we've talked (without conclusion) on how to prevent the storage server from getting to make a downgrade attack
    > {
  <warner> might be good to store a key ID with each collection, so clients can discover when a key has been changed (and thus they shouldn't be surprised to get MAC failures when they try to decrypt the records)
    >   "payload": "newpayload1",
  <warner> garbage-collection when the password (and thus kB) is reset, pretty tricky
    >   "seqnum": 126,
  <rfkelly> could the keyID also double as the classA/classB indicator?
    >   "changeid": "NEWHASH1",
  <warner> GET base/collection/version?limit= needs a response code to indicate "we're done" versus "more is coming"
    >   "signature": "newsig1"
  <warner> yeah, probably
    > }
  <warner> keyID probably = hash(key)
    .
  <rfkelly> right
    < 204 No Content
  <warner> although, if that, (encKey,hmacKey,keyID) = HKDF(key) would be better
  <rfkelly> is "garbage collection" essentially "delete everything that was created with the old key"
  <rfkelly> ?
  <warner> POSTing batches: first= and upto= sounds good, using "upto not in args" requires that we can always detect a missing message, which might not be the case if we memcache the inbound batch (or if we write it to SQL but then SQL rolls back). Might be worth thinking about that part more than I did in my docs.
  <rfkelly> GET base/collection/version?limit= currently indicates doneness by presence/absence of the "next" key in the body; a response code would be better
  <warner> yeah, GC is that, although we probably need some care to make sure an out-of-date client doesn't manage to delete everything, or get into a delete-fight with a less-out-of-date client
  <warner> (might require seqnums in the keyids)
  <warner> ah, next= is fine, unless REST prefers a response code
  * warner gets down to Things To Think About
  <warner> I think the login step is fine, you probably don't want to be doing pubkey verification with every message
  <warner> it adds one RTT (plus sign, plus verify) per hour, or per whatever lifetime we use on the certs (maybe 12 hours?), which seems pretty reasonable
  <warner> but removes the verify time on every single server message
  <warner> ok, time to chat with chris about native-data stuff
  <warner> rfkelly: looks good overall, I think your list of outstanding questions matches my own


And more:


  <warner> rfkelly: hm, so it might be useful to put the "which keyids do I have data for" list in the verify-signature/issue-token handshake, and then if it changes later, revoke that token, so they must do a new handshake
The server will check that the seqnum and changeid hash chains are properly formed before applying the change. If they are not then
  <warner> rather than defining error responses for what happens when the data is moved from one class to another (or the class-B data is flushed) in between handshakes
an error will be reported:
  <rfkelly> interesting
  <rfkelly> basically tie your session to a set of metadata, and if the metadata changes you automatically get your session invaldiated
  <warner> yeah
  <rfkelly> the keyids thing, would it be a distinct keyid per collection, or some additional top-level metadata?
  * warner loves to eliminate error pathways
  <rfkelly> warner++
  <warner> probably one keyid per collection
  <warner> something like, "if you can see this account, you can get/set data for the following collections:.." and "to get the plaintext for collection X, you'll need keyid Y"
  <warner> hm
  <warner> well, the main hope is to not confuse clients who try to use the wrong key
  <warner> basically the only time a client should ever see an HMAC failure is when the server manages to corrupt some data
  <rfkelly> right
  <warner> or if the server is being intentionally malicious
  <warner> so there must be some earlier mechanism to indicate A-vs-B-vs-new-B
  <rfkelly> so I'm thinking of doing a bit more explicit "collection metadata" API; currently the only piece of metadata is hte version number, but now it might be (version, keyid, ...other stuff...?)
  <rfkelly> and let clients explicitly get/set/delete this blog to manage the collection state
  <warner> hm, yeah
  <warner> one thing I think we talked about a while ago was collection discovery
  <rfkelly> warner: I like the idea of using an opaque keyid to distinguish classA/classB, because it could prevent the server from learning what class the data is in
  <warner> yeah
  <rfkelly> the client just tries each key in turn until it finds the one that matches the keyid (like truecrypt does to discover the encryption parameters, IIRC)
  <warner> so, looking at this from the inside of the browser..
  <warner> some component or some plugin tells the PICL client "hey, I have data to sync. My data category is named "bookmarks" and this is a "one shared collection" kind of thing"
  <warner> vs one-collection-per-device
  <warner> also it says "this data is going into class-A" or B, probably according to what the user prefs asked for
  <warner> the data-category is unique to this component/plugin (maybe it's a domain name or URL, or GUID)
  <warner> then the PICL client derives some keys, and computes collection-id = hash(kA, category-name), or maybe hash(kB, category-name), for one-shared-collection types
  <warner> or hash(kA/kB, category-name, device-id) for one-collection-per-device
  <warner> so the server can't actually learn what category-name is, or device-id for that matter
  <warner> and then any one-collection-per-device category also needs a device-id-discovery mechanism
  <rfkelly> can it piggyback that from the devices list in the keyserver/idp/thingo
  <rfkelly> ?
  <warner> something like enc(key=hash(kA,category-name), data=device-id), and the server holds a set of the results
  <warner> hm
  <warner> yeah, that's probably better
  <rfkelly> ISTM that "these are my peer devices" is a higher-level concern than at this storage layer
  <warner> although when you add a device, the existing devices need to learn about it
  <rfkelly> it's no specific to a particular collection or datatype
  * warner nods




But [rfkelly] wonders, if the collection name is derived from a hash of its metadata, whether we need to include an explicit "keyid" at all on the server side.  Change the key?  Change the name of the collection. Doesn't make garbage-collection any easier though...
    >  POST <collection-url>/records/<key>
    >  Authorization:  <hawk auth parameters>
    >  If-Match: 120-OLD-HASH
    >  {
    >    "payload": "newpayload1",
    >    "seqnum": 126,
    >    "changeid": "NEWHASH1",
    >    "signature": "newsig1"
    >  }
    .
    < 412 Precondition Failed
    < ETag: 125-HASH1


need last-written and last-read timestamps, to enable garbage collection in some to-be-defined clever system

Latest revision as of 13:48, 18 June 2013


Summary

This is a working proposal for the PiCL Storage API, to implement the concepts described in Identity/CryptoIdeas/05-Queue-Sync.

It's a work in progress that will eventually obsolete Identity/AttachedServices/StorageProtocolZero.


Queue-Sync Data Model

More details at Identity/CryptoIdeas/05-Queue-Sync.

Data is stored in independent named collections. A collection is a key-value store mapping keys to records. Each collection has a monotonically-increasing sequence number which is incremented whenever a record is changed, and provides the ability to request all changes since a given sequence number.


Collection objects have the following fields:

<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>

<tr><td>name</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this collection amongst all the user's data. Collection names may only contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>

<tr><td>seqnum</td><td>integer, 8 bytes</td><td>A monotonically-increasing integer that is incremented with each change to the contents of the collection.</td></tr>

<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>A hash that uniquely identifies the last change to this collection. It is derived from the new sequence number, the previous changeid, and the details of the change that was made.</td></tr>

<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the current changeid. Not used or verified by the server, since it doesn't have the secret key.</td></tr>
</table>


Record objects have the following fields:

<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>

<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for this record within the collection. Keys may only contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>

<tr><td>payload</td><td>urlsafe string, 256 KB</td><td>The value currently stored in this record. Typically this would be encrypted and signed by the client.</td></tr>

<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The collection-level sequence number at which this record was last modified.</td></tr>

<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The collection-level changeid corresponding to the modification of this record. It is derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>

<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid for this record. Not used or verified by the server, since it doesn't have the secret key.</td></tr>
</table>


Change objects are identical to record objects, except their payload field may have the value NULL to indicate a deletion rather than an update:

<table>
<tr><th>Parameter</th><th>Type</th><th>Description</th></tr>

<tr><td>key</td><td>urlsafe string, 64 bytes</td><td>A unique identifier for the changed record within the collection. Keys may only contain characters from the urlsafe-base64 alphabet (i.e. alphanumerics, underscore and hyphen).</td></tr>

<tr><td>payload</td><td>urlsafe string or null, 256 KB</td><td>The new value to be stored in the record, or null if the record is to be deleted. Typically this would be encrypted and signed by the client.</td></tr>

<tr><td>seqnum</td><td>integer, 8 bytes</td><td>The new collection-level sequence number after this change is applied.</td></tr>

<tr><td>changeid</td><td>urlsafe string, XXX bytes</td><td>The new collection-level changeid corresponding to this change. It is derived from the new sequence number, the previous changeid, the record key, and the new record payload.</td></tr>

<tr><td>signature</td><td>urlsafe string, XXX bytes</td><td>A client-generated HMAC signature of the changeid. Not used or verified by the server, since it doesn't have the secret key.</td></tr>
</table>
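
As a rough illustration of how a client might build a change object from the fields above, here is a minimal Python sketch. The proposal does not yet fix the hash function, the input encoding, or the field sizes (they are still "XXX bytes"), so the use of SHA-256 over a compact JSON array, and the helper names compute_changeid, sign_changeid and make_change, are assumptions made purely for this example.

```python
import hashlib
import hmac
import json


def compute_changeid(seqnum, prev_changeid, key, payload):
    """Hash the inputs the data model lists for a changeid.

    ASSUMPTION: SHA-256 over a compact JSON array. The proposal leaves the
    hash function and the encoding of its inputs undefined.
    """
    material = json.dumps([seqnum, prev_changeid, key, payload],
                          separators=(",", ":")).encode("utf-8")
    # A hex digest only uses characters from the urlsafe alphabet.
    return hashlib.sha256(material).hexdigest()


def sign_changeid(secret_key, changeid):
    """Client-side HMAC of the changeid; the server never verifies it."""
    return hmac.new(secret_key, changeid.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def make_change(secret_key, prev_seqnum, prev_changeid, key, payload):
    """Build a change object; payload=None represents a deletion."""
    seqnum = prev_seqnum + 1
    changeid = compute_changeid(seqnum, prev_changeid, key, payload)
    return {
        "key": key,
        "payload": payload,
        "seqnum": seqnum,
        "changeid": changeid,
        "signature": sign_changeid(secret_key, changeid),
    }
```

Because each changeid folds in the previous one, a sequence of changes built this way forms a chain: altering any earlier change alters every later changeid.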


Authentication

To access the storage service, a client device must authenticate by providing a BrowserID assertion and a Device ID. It will receive in exchange:

  • a short-lived id/key pair that can be used to authenticate subsequent requests using the Hawk request-signing scheme
  • a mapping of collection names to access URLs


You can think of this as establishing a "login session" with the server. Access requests for a specific collection should then be directed to the appropriate URL.

Example:

   >  POST <server-url>
   >  {
   >   "assertion": <browserid assertion>,
   >   "device": <device UUID>
   >  }
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "id": <hawk auth id>,
   <   "key": <hawk auth secret key>,
   <   "collections": {
   <     "history": <access url for history collection>,
   <     "bookmarks": <access url for bookmarks collection>,
   <     <...etc...>
   <   }
   <  }

The user and device identity information is encoded in the Hawk auth id, so it does not need to be re-sent on each request. The server may also include additional state in this value, depending on the implementation. The value is opaque to the client.

The collection-specific access URLs may include a unique identifier for the user, in order to improve RESTful-icity of the API, or they may point the client to the specific data-center that houses the write master for each collection. Either way, the URLs are opaque to the client.
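
A client-side view of this exchange might look like the following Python sketch. It only builds the request body and decodes the response shown in the example; the Session class and the function names are hypothetical, and sending the request and Hawk-signing later requests are left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Session:
    """Decoded login response (hypothetical client-side type)."""
    hawk_id: str
    hawk_key: str
    collections: dict

    def url_for(self, name):
        # Access URLs are opaque: use them exactly as the server sent them.
        return self.collections[name]


def build_login_body(assertion, device_id):
    """JSON body for the initial POST to <server-url>."""
    return json.dumps({"assertion": assertion, "device": device_id})


def parse_login_response(body):
    """Turn the JSON response body into a Session."""
    data = json.loads(body)
    return Session(data["id"], data["key"], dict(data["collections"]))
```

Keeping the collection URLs in a plain lookup table reflects the point above: the client never constructs or parses them itself.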

Data Access

The client now makes Hawk-authenticated requests to a specific collection at its assigned access URL. The following operations are available on each collection.


GET <collection-url>

Get the current metadata for a collection: its name, seqnum and changeid. Example:

   >  GET <collection-url>
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "name": "history",
   <   "seqnum": 123,
   <   "changeid": "HASH_OF_DETAILS_OF_THE_MOST_RECENT_CHANGE",
   <   "signature": "HMAC_SIGNATURE_OF_CHANGEID"
   <  }


GET <collection-url>/records

Query parameters: start, end, limit.

Request headers: If-Match, If-None-Match

Response headers: ETag


Get the set of records currently contained in the collection. For small collections, the full set of records will be returned like so:

   >  GET <collection-url>/records
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "items": {
   <    "key1": { "payload": "payload1", "seqnum": 123, "changeid": "HASH1", "signature": "sig1" },
   <    "key2": { "payload": "payload2", "seqnum": 124, "changeid": "HASH2", "signature": "sig2" }
   <   }
   <  }


If there are a large number of records in the collection then the server may choose to paginate the result, returning only some of the records in the initial response. It will include the key "next" in the output to indicate that more records are available:

   >  GET <collection-url>/records
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "next": "key3",
   <   "items": {
   <     "key1": <record1>,
   <     "key2": <record2>
   <   }
   <  }

Clients can request the next batch using the 'start' query parameter:

   >  GET <collection-url>/records?start=key3
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "items": {
   <     "key3": <record3>,
   <     "key4": <record4>
   <   }
   <  }

When no "next" value is included in the response, the client knows that all available records have been fetched.
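
The pagination loop can be sketched in Python as follows. Here fetch_page is a hypothetical callable standing in for a Hawk-authenticated GET of the records resource that returns the decoded JSON body; the sketch reads the "items" and "next" keys used in the paginated examples above.

```python
def fetch_all_records(fetch_page):
    """Collect every record by following "next" until it is absent.

    fetch_page(start) is assumed to perform
    GET <collection-url>/records (start=None) or
    GET <collection-url>/records?start=<start>
    and return the decoded JSON response body.
    """
    records = {}
    start = None
    while True:
        page = fetch_page(start)
        records.update(page["items"])
        start = page.get("next")
        if start is None:
            # No "next" key: all available records have been fetched.
            return records
```

The same loop works whether batching is chosen by the server or requested by the client with 'limit', since both signal further data through "next".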

Records are always batched in lexicographic order of their keys, and clients are free to request an arbitrary key range using the 'start' and 'end' parameters:

   >  GET <collection-url>/records?start=key2&end=key3
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "items": {
   <     "key2": <record2>,
   <     "key3": <record3>
   <   }
   <  }

Clients may also choose to batch their requests by using the 'limit' query parameter. As with server-driven batching, the output key "next" will be used to indicate that more data is available:

   >  GET <collection-url>/records?start=key2&limit=2
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "next": "key4",
   <   "items": {
   <     "key2": <record2>,
   <     "key3": <record3>
   <   }
   <  }
   .
   .
   >  GET <collection-url>/records?start=key4&limit=2
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "items": {
   <     "key4": <record4>
   <   }
   <  }


Each server response will include an "ETag" header, formed from the combination of the current seqnum and changeid of the collection. Clients can use this in combination with standard If-Match and If-None-Match headers to ensure that they're getting a consistent view of the collection:

   >  GET <collection-url>/records?start=key2&limit=2
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  ETag: 124-HASH2
   <  {
   <   "next": "key4",
   <   "items": {
   <     "key2": <record2>,
   <     "key3": <record3>
   <   }
   <  }
   .
   .
   >  GET <collection-url>/records?start=key4&limit=2
   >  Authorization:  <hawk auth parameters>
   >  If-Match: 124-HASH2
   .
   <  412 Precondition Failed
   <  ETag: 125-HASH3
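
A small Python sketch of handling these ETag values, assuming the "<seqnum>-<changeid>" layout shown in the examples (the proposal does not formally define the format, and the function names are hypothetical):

```python
def format_etag(seqnum, changeid):
    """Combine a collection's seqnum and changeid as in the examples."""
    return "%d-%s" % (seqnum, changeid)


def parse_etag(etag):
    """Split an ETag into (seqnum, changeid).

    Only the first hyphen is treated as the separator, because a urlsafe
    changeid may itself contain hyphens.
    """
    seqnum, _, changeid = etag.partition("-")
    return int(seqnum), changeid


def consistent_read_headers(last_etag):
    """Headers asking the server to fail with 412 if the collection has
    changed since the response that carried last_etag."""
    return {"If-Match": last_etag}
```

A client paging through a collection would pass the ETag from its first response to every later page request, and restart the read if it receives a 412.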


XXX TODO: use of headers, versus returning seqnum/changeid in the response body?


GET <collection-url>/records/<key>

Request headers: If-Match, If-None-Match

Response headers: ETag


Get the specific record stored under the given key:

   >  GET <collection-url>/records/<key>
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  ETag: 123-HASH1
   <  {
   <   "key": <key>,
   <   "seqnum": 123,
   <   "changeid": "HASH1",
   <   "payload": "payload1"
   <  }

This request supports standard ETag behaviour to ensure that a consistent view of the collection is being read.


GET <collection-url>/changes

Query parameters: since, limit.

Get the sequence of changes that have been made to the collection. If the number of changes to be returned is small, they will be returned all at once like so:

   >  GET <collection-url>/changes
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "changes": [
   <     { "seqnum": 0, "changeid": "HASH1", "signature": "sig1", "key": "key1", "payload": "payload1" },
   <     { "seqnum": 1, "changeid": "HASH2", "signature": "sig2", "key": "key2", "payload": "payload2" }
   <   ]
   <  }

The changeids and signatures on these changes form a hash chain which can be verified by the client.
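
Client-side verification of that chain could look roughly like this Python sketch. The hash construction (SHA-256 over a compact JSON array of seqnum, previous changeid, key and payload) and the function names are assumptions for illustration only, since the proposal does not yet specify them; the caller supplies the seqnum and changeid it already trusts as the starting point.

```python
import hashlib
import hmac
import json


def expected_changeid(seqnum, prev_changeid, key, payload):
    # ASSUMPTION: SHA-256 over a compact JSON array of the derivation inputs.
    material = json.dumps([seqnum, prev_changeid, key, payload],
                          separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def verify_chain(changes, prev_seqnum, prev_changeid, secret_key):
    """Return True if every change extends the chain from the given start."""
    for change in changes:
        # Sequence numbers must advance by exactly one per change.
        if change["seqnum"] != prev_seqnum + 1:
            return False
        want = expected_changeid(change["seqnum"], prev_changeid,
                                 change["key"], change["payload"])
        if not hmac.compare_digest(change["changeid"], want):
            return False
        # Only clients hold secret_key, so only they can check signatures.
        sig = hmac.new(secret_key, want.encode("utf-8"),
                       hashlib.sha256).hexdigest()
        if not hmac.compare_digest(change["signature"], sig):
            return False
        prev_seqnum, prev_changeid = change["seqnum"], change["changeid"]
    return True
```

The server can perform the seqnum and changeid checks too, but the signature check is client-only because the server never holds the secret key.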

If there are a large number of changes to be fetched then the server may choose to paginate the result, returning only some of the changes in the initial request. It will include the key "next" in the output to indicate that more changes are available:

   >  GET <collection-url>/changes
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "next": 3,
   <   "changes": [
   <     <change1>,
   <     <change2>
   <   ]
   <  }

Clients can request the next batch using the 'since' query parameter:

   >  GET <collection-url>/changes?since=3
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "changes": [
   <     <change3>,
   <     <change4>
   <   ]
   <  }

Changes are always batched in sequence number order. Clients are free to request changes starting at an arbitrary sequence number, which is useful for pulling in just the things that have changed since a previous sync.
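
For illustration, a client could fold a fetched batch of changes into its local copy of the collection along these lines. This is a Python sketch with a hypothetical apply_changes helper; the local store is just a dict keyed by record key, and chain verification is assumed to have been done separately.

```python
def apply_changes(records, last_seqnum, changes):
    """Apply changes in order and return the new last-seen seqnum.

    A null (None) payload deletes the record; anything else replaces it.
    Changes at or below last_seqnum are skipped as already applied.
    """
    for change in changes:
        if change["seqnum"] <= last_seqnum:
            continue
        if change["payload"] is None:
            records.pop(change["key"], None)
        else:
            # Store the record fields without the redundant "key" entry.
            records[change["key"]] = {
                name: value for name, value in change.items() if name != "key"
            }
        last_seqnum = change["seqnum"]
    return last_seqnum
```

The returned seqnum is what the client would remember between syncs, so that its next request fetches only newer changes.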

Clients may also choose to batch their requests by using the 'limit' query parameter. As with server-driven batching, the output key "next" will be used to indicate that more data is available:

   >  GET <collection-url>/changes?since=2&limit=2
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "next": 4,
   <   "changes": [
   <     <change2>,
   <     <change3>
   <   ]
   <  }
   .
   .
   >  GET <collection-url>/changes?since=4&limit=2
   >  Authorization:  <hawk auth parameters>
   .
   <  200 OK
   <  Content-Type: application/json
   <  {
   <   "changes": [
   <     <change4>
   <   ]
   <  }

The server is not required to keep the full change history from seqnum zero, and may periodically compact and garbage-collect the stored data. If the client requests changes since a seqnum that is no longer known to the server, it will receive an error:

   >  GET <collection-url>/changes?since=1
   >  Authorization:  <hawk auth parameters>
   .
   <  416 Requested Range Not Satisfiable


XXX TODO: seriously, is there a good error code for this, or should we just tunnel errors in the body?


POST <collection-url>/records

Request headers: If-Match, If-None-Match

Response headers: ETag

Update or delete records in the collection. The request body must contain an array of change objects with properly-formed sequence numbers and changeids, and it must be preconditioned with an If-Match or If-None-Match header:

   >  POST <collection-url>/records
   >  Authorization:  <hawk auth parameters>
   >  If-Match: 125-HASH1
   >  {
   >    "changes": [
   >      {"key": "key1", "payload": "newpayload1", "seqnum": 126, "changeid": "NEWHASH1", "signature": "newsig1"},
   >      {"key": "key2", "payload": null, "seqnum": 127, "changeid": "NEWHASH2", "signature": "newsig2"}
   >    ]
   >  }
   .
   <  204 No Content

The server will apply each change in turn, checking that the seqnum and changeid hash chains are properly formed. If they are not then an error will be reported:

   >  POST <collection-url>/records
   >  Authorization:  <hawk auth parameters>
   >  If-Match: 120-OLD-HASH
   >  {
   >    "changes": [
   >      {"key": "key1", "payload": "newpayload1", "seqnum": 121, "changeid": "NEWHASH1", "signature": "newsig1"},
   >      {"key": "key2", "payload": null, "seqnum": 122, "changeid": "NEWHASH2", "signature": "newsig2"}
   >    ]
   >  }
   .
   <  412 Precondition Failed
   <  ETag: 125-HASH1


No content is returned in response to a POST. The client has already calculated the new seqnum and changeid for the collection, so there is no more useful information that the server can provide.
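
A client preparing such a batch might proceed as in this Python sketch. The helper names are hypothetical, the ETag is assumed to have the "<seqnum>-<changeid>" layout used in the examples, and the Hawk Authorization header is left to a separate signing step.

```python
import json


def build_batch_post(etag, changes):
    """Return (headers, body) for POST <collection-url>/records.

    Checks locally that the seqnums continue from the seqnum in the ETag,
    which is the condition the server enforces before applying the batch.
    """
    base_seqnum = int(etag.partition("-")[0])
    for offset, change in enumerate(changes, start=1):
        if change["seqnum"] != base_seqnum + offset:
            raise ValueError("seqnum %r breaks the sequence" % change["seqnum"])
    headers = {"If-Match": etag, "Content-Type": "application/json"}
    return headers, json.dumps({"changes": changes})


def interpret_post_response(status, headers):
    """Map the two documented outcomes onto simple results."""
    if status == 204:
        return ("applied", None)
    if status == 412:
        # The ETag tells the client how far the collection has moved on.
        return ("conflict", headers.get("ETag"))
    return ("error", None)
```

On a conflict, the client would typically fetch the changes it is missing, rebuild its own changes on top of the new seqnum and changeid, and try again.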

XXX TODO: since we're posting "change" objects, does it make more sense to direct this POST at <collection-url>/changes rather than at the records resource?

POST <collection-url>/records/<key>

Update or delete a specific record in the collection. The request body must contain a change object with properly-formed sequence number and changeid, and it must be preconditioned with an If-Match or If-None-Match header:

   >  POST <collection-url>/records/<key>
   >  Authorization:  <hawk auth parameters>
   >  If-Match: 125-HASH1
   >  {
   >    "payload": "newpayload1",
   >    "seqnum": 126,
   >    "changeid": "NEWHASH1",
   >    "signature": "newsig1"
   >  } 
   .
   <  204 No Content


The server will check that the seqnum and changeid hash chains are properly formed before applying the change. If they are not then an error will be reported:


   >  POST <collection-url>/records/<key>
   >  Authorization:  <hawk auth parameters>
   >  If-Match: 120-OLD-HASH
   >  {
   >    "payload": "newpayload1",
   >    "seqnum": 126,
   >    "changeid": "NEWHASH1",
   >    "signature": "newsig1"
   >  }
   .
   <  412 Precondition Failed
   <  ETag: 125-HASH1

No content is returned in response to a POST. The client has already calculated the new seqnum and changeid for the collection, so there is no more useful information that the server can provide.