Services/KeyValueStorage

From MozillaWiki
Jump to: navigation, search

Planning Questions

Scope

  • are we planning a full-stack solution, where we select a particular backend (e.g. riak/hbase/cassandra/membase/etc) then implement our API on top according to what it offers?
  • or are we planning to support multiple backends for apps with different tradeoffs?


Use Cases

  • data storage for mozilla appstore
  • It would be good to find and isolate some use cases in the existing Services apps.
    (see below for strawman implementation of SyncStorage backend using KVStore)


How simple can we get away with while still providing useful functionality?

  • maximum key size, value size?
  • key => single value?
  • key => set of values? (like e.g. memcached, riak with siblings enabled)
  • key + column => value? (like e.g. bigtable or cassandra)
  • keys in sorted order? (i.e. hash or btree? riak or hbase?)
  • bulk insert, update or delete operations?


Consistency

  • Always consistent, at cost of partition intolerance? (i.e. a "CA" system like membase)
  • Partition tolerant, at cost of eventual consistency? (i.e. a "AP" system like riak)
  • Is it possible to let the developer choose on a per-bucket level? On a per-request level?
    (e.g. riak's per-request CAP tuning: http://wiki.basho.com/Tunable-CAP-Controls-in-Riak.html)


Durability

  • flush-to-disk on each write?
  • periodic flush-to-disk in background?
  • Is it possible to let the developer choose on a per-bucket level? On a per-request level?
    (e.g. set "time before write" on each request, tbw=0 for immediate write, tbw=30 for 30-second write-behind)


Crypto

  • can it sensibly be done at this layer, or do we need to defer to the application?
  • encrypted keys would make key-ordering useless


Authentication to App

  • can AppKeys be used for authentication, e.g. some sort of request signing with the app key?


We will probably want some sort of partitioning or "buckets".

  • AppKey => list of buckets?
  • AppKey + UserID => list of buckets?
  • can a bucket be shared between multiple apps? multiple users?


Management features

  • explicit TTLs?
  • built-in quota system? having it built-in will probably be more efficient than enforcing it at a higher level.

Strawman Proposal

In order to guide the design process, I have developed a strawman refactoring of the server-storage code to use a KeyValueStore instead of an RDBMS. Uploaded to a user hg repo here: http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/

The proposed python API is defined in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvstore/__init__.py

The dummy KVS implementation is built on top of SQL and is available in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvstore/sql.py

The refactoring of the storage interface to use KVS is available in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvs.py

The design is based mostly on my high-level understanding of the functions provided by Riak, but I've also been thinking about how it might be implemented on top of other backend technology e.g. HBase or Cassandra. So it's definitely intended to be backend-agnostic as much as possible.

Key points of the API follow, in increasing level of sophistication (i.e. difficulty to implement).

Namespacing

Most KVStore implementations provide key namespacing via "collections" or "buckets", so that each bucket is an isolated set of key/value pairs.

Based on the needs of server-storage and of other potential applications, I think it makes the most sense to associate a bucket with an (appkey, user_id) pair - i.e. each application can map each user to a set of buckets.

Python API
    store = magically_get_kvstore("THIS IS MY APPKEY")

    store.create_bucket("ryan@rfk.id.au", "notes")

    store.list_buckets("ryan@rfk.d.au")
    =>  ["notes"]

    bucket = store.get_bucket("ryan@rfk.id.au", "notes")
HTTP API
    GET /appkey/notes
    => 404 Not Found

    PUT /appkey/notes
    => 204 No Content

    GET /appkey/
    => 200 OK
       Content-Type: text/json
       ["notes"]

    GET /appkey/notes
    => 200 OK


Some apps might find it useful to have an application-wide namespace, but I think it should be avoided if possible - as I understand it, much of our infrastructure is predicated on being able to shard on user-id to scale out storage and processing needs.

Note that the HTTP api doesn't specify the user in the URL, since this information will be provided in the auth credentials along with the request. Should we put it in the URL anyway for consistency?

Basic Key-Value Storage

This is the most primitive of all the functionality - a simple map from keys to values.  Nothing fancy, and probably too primitive to build interesting apps on it alone.

Python API
    bucket.put("my key","my exciting value")

    bucket.get("my key")
    => "my exciting value"

    bucket.delete("my key")

    bucket.get("my key")
    => KeyError
HTTP API
    PUT /appkey/bucketname/items/my%20key
    Content-Length: 17
    my exciting value


    GET /appkey/bucketname/items/my%20key
    =>  200 OK
        Content-Length: 17
        my exciting value

    GET /appkey/bucketname/items/otherkey
    =>  404 Not Found


Atomic Compare-and-Swap

Well, as atomic as is reasonable taking into account e.g. vector clocks etc.  This is provided in lieu of transactions so that the application can check that it's not e.g. deleting updates made by another process.

Python API
    item = bucket.getitem("my key")
    item.value    # the value previously stored
    item.version  # opqaue version identifier

    bucket.put("my key", "new value", ifmatch="WRONGVERSION")
    =>  VersionMismatchError

    bucket.put("my key", "new value", item.version)
    =>  OK


HTTP API
    GET /appkey/bucketname/items/my%20key
    =>  200 OK
        Content-Length: 17
        X-Weave-Version: XXXYYY
        my exciting value

    PUT /appkey/bucketname/items/my%20key
    X-Weave-If-Match:  WRONGVALUE
    Content-Length: 2
    Hi
    =>  412 Precondition Failed

    PUT /appkey/bucketname/items/my%20key
    X-Weave-If-Match:  XXXYYY
    Content-Length: 2
    Hi
    =>  204 No Content

We could treat the version number like etag and use standard HTTP headers for it, but I haven't check if this would violate anything in the RFC.  Riak uses a custom header for its vclock thingo.


Secondary Indexes

While we don't want to evolve back to a full-blown RDBMS, the server-storage code does need to be able to query its data based on some simple single-field indices - in particular it needs to perform queries based on last modification time or on a user-defined "sort order".

To assist, we let the application specify secondary indexes against which the data can be queries. It is responsible for constructing and maintaining these indexes manually. The KVStore only provides the ability to query an existing index, and to atomically update indexes when storing a new value.

Python API
    bucket.put("key1", "data1", indexes = {
        "modified": "2011-09-12"
    })

    bucket.put("key2", "data2", indexes = {
        "modified": "2011-09-13"
    })

    list(bucket.scan("modified"))
    =>  [("2011-09-12", "key1"), ("2011-09-13", "key2")]

    list(bucket.scan("modified", reverse=True))
    =>  [("2011-09-13", "key2"), ("2011-09-12", "key1")]
    

    list(bucket.scan("modified", max="2011-09-12"))
    =>  [("2011-09-12", "key1")]
HTTP API
    PUT /appkey/bucketname/items/key1
    Content-Length: 5
    X-Weave-Index-modified 2011-09-12
    data1
    =>  204 No Content

    PUT /appkey/bucketname/items/key2
    Content-Length: 5
    X-Weave-Index-modified 2011-09-13
    data2
    =>  204 No Content

    GET /appkey/bucketname/indexes/modified
    =>  200 OK
        Content-Type: text/json
        [["2011-09-12", "key1"], ["2011-09-13", "key2"]]

    GET /appkey/bucketname/indexes/modified?reverse=true
    =>  200 OK
        Content-Type: text/json
        [["2011-09-13", "key2"], ["2011-09-12", "key1"]]
    
    GET /appkey/bucketname/indexes/modified?max=2011-09-12
    =>  200 OK
        Content-Type: text/json
        [["2011-09-12", "key1"]]

Keys are always returned in sorted order. Accepted arguments to the scan would be "min", "max", "reverse" and "limit".

It may be quite challenging to implement this efficiently, but I think it's worth pursuing as it will make the KVStore much more widely useful. For example, the server-storage app can't be written without indexing of some sort. So we would either have a KVS/DB hybrid or push the indexes into the KVStore.

Furthermore, several popular key-value stores are working on secondary indexes as a built-in feature. The above API is inspired by how Riak does it, but Cassandra also offers something similar. Whether they provide a performant-enough implementation remains to be seen.

Future: Lightweight Meta-Data

If the KVStore being used to store large values, then it may be valuable to offer lightweight metadata alongside the main "body" data. This would make it possible to update just the metadata/indexes without re-uploading the entire body.

Python API
    bucket.put("key1", "lots of data here", meta = {
        "modified": "2011-09-12",
        "owner": "Ryan Kelly"
    }, indexes = {
        "modified": "2011-09-12"
    })

    item = bucket.getitem("key1")
    item.meta
    =>  {"modified": "2011-09-12", "owner": "Ryan Kelly"}

    #  Update metadata without changing the value
    bucket.put("key1", meta = {
        "modified": "2011-09-12",
        "owner": "Ryan Kelly"
    }, indexes = {
        "modified": "2011-09-12"
    })

Is it worth it? I don't know. It's a fair chunk of complication for the API.


Future: Explicit Conflict Resolution

If we build on a KVStore that has vector blocks (e.g. riak) then it might be worthwhile to expose any conflicts to the calling application, rather than going with last-write-wins. Again, it's a lost of complication in the API so I don't know that it's worth it at this stage.

But, we should be careful not to exclude the possibility.


Python API
    item = bucket.getitem("key1")

    # for simple use-cases we have a single "version" string
    item.version
    =>  "XXXYYY-AAABBB"

    #  but it actually identifies all conflicting versions
    item.versions.keys()
    =>  ["XXXYYY", "AAABBB"]

    # and you can get at each particular version if you need
    item.versions["XXXYYY"].value
    => "some value"

    item.versions["AAABBB"].value 
    => "some other value"

    # putting will resolve the conflict
    bucket.put("key1", "resolved", ifmatch=item.version)
    => OK