Services/KeyValueStorage
Planning Questions
Scope
- are we planning a full-stack solution, where we select a particular backend (e.g. riak/hbase/cassandra/membase/etc) then implement our API on top according to what it offers?
- or are we planning to support multiple backends for apps with different tradeoffs?
Use Cases
- data storage for mozilla appstore
- It would be good to find and isolate some use cases in the existing Services apps.
(see below for strawman implementation of SyncStorage backend using KVStore)
How simple an API can we get away with while still providing useful functionality?
- maximum key size, value size?
- key => single value?
- key => set of values? (like e.g. memcached, riak with siblings enabled)
- key + column => value? (like e.g. bigtable or cassandra)
- keys in sorted order? (i.e. hash or btree? riak or hbase?)
- bulk insert, update or delete operations?
Consistency
- Always consistent, at cost of partition intolerance? (i.e. a "CA" system like membase)
- Partition tolerant, at cost of eventual consistency? (i.e. an "AP" system like riak)
- Is it possible to let the developer choose on a per-bucket level? On a per-request level?
(e.g. riak's per-request CAP tuning: http://wiki.basho.com/Tunable-CAP-Controls-in-Riak.html)
Durability
- flush-to-disk on each write?
- periodic flush-to-disk in background?
- Is it possible to let the developer choose on a per-bucket level? On a per-request level?
(e.g. set "time before write" on each request, tbw=0 for immediate write, tbw=30 for 30-second write-behind)
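To make the write-behind idea concrete, here is a minimal in-memory sketch of what per-request durability tuning might look like. All names here (the class, the "tbw" keyword, the flush method) are hypothetical illustrations, not part of any existing backend's API:

```python
import time

class WriteBehindStore:
    """Hypothetical sketch of per-request durability tuning.

    tbw=0 flushes to "disk" immediately; tbw=N lets the write sit in a
    memory buffer for up to N seconds before being made durable.
    """

    def __init__(self):
        self.disk = {}      # stands in for durable storage
        self.buffer = {}    # key -> (value, flush_deadline)

    def put(self, key, value, tbw=0):
        if tbw == 0:
            self.disk[key] = value
        else:
            self.buffer[key] = (value, time.time() + tbw)

    def flush(self, now=None):
        # A background thread would call this periodically.
        now = time.time() if now is None else now
        for key, (value, deadline) in list(self.buffer.items()):
            if deadline <= now:
                self.disk[key] = value
                del self.buffer[key]

store = WriteBehindStore()
store.put("a", "1", tbw=0)    # durable immediately
store.put("b", "2", tbw=30)   # may sit in memory for up to 30 seconds
```

The point of exposing tbw per-request is that a single bucket could hold both data the app can afford to lose (tbw=30) and data it can't (tbw=0).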
Crypto
- can it sensibly be done at this layer, or do we need to defer to the application?
- encrypted keys would make key-ordering useless
Authentication to App
- can AppKeys be used for authentication, e.g. some sort of request signing with the app key?
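One plausible shape for AppKey-based request signing is a simple HMAC over the canonical request, with the server recomputing and comparing the signature. This is just a sketch of the idea; the scheme, the key, and the function names are all hypothetical:

```python
import hashlib
import hmac

# Hypothetical shared secret issued to the application.
APP_KEY = b"THIS IS MY APPKEY"

def sign_request(method, path, body, app_key):
    # Sign a canonical form of the request with the app's secret key.
    msg = b"\n".join([method.encode(), path.encode(), body])
    return hmac.new(app_key, msg, hashlib.sha256).hexdigest()

def verify_request(method, path, body, app_key, signature):
    # Server side: recompute and compare in constant time.
    expected = sign_request(method, path, body, app_key)
    return hmac.compare_digest(expected, signature)

sig = sign_request("PUT", "/appkey/notes/items/key1", b"data1", APP_KEY)
```

A real scheme would also need to cover headers and a timestamp/nonce to prevent replay, but the basic question stands: is possession of the AppKey sufficient as the authentication credential?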
We will probably want some sort of partitioning or "buckets".
- AppKey => list of buckets?
- AppKey + UserID => list of buckets?
- can a bucket be shared between multiple apps? multiple users?
Management features
- explicit TTLs?
- built-in quota system? having it built-in will probably be more efficient than enforcing it at a higher level.
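The efficiency argument for a built-in quota is that the store already knows the size of the old value at put() time, so usage can be tracked incrementally rather than recomputed by a higher layer. A minimal sketch (all names hypothetical):

```python
class QuotaExceededError(Exception):
    pass

class QuotaBucket:
    """Hypothetical sketch of a built-in per-bucket quota."""

    def __init__(self, quota_bytes):
        self.quota = quota_bytes
        self.used = 0
        self.items = {}

    def put(self, key, value):
        # Incremental accounting: subtract the old value's size, add the new.
        new_used = self.used - len(self.items.get(key, b"")) + len(value)
        if new_used > self.quota:
            raise QuotaExceededError(key)
        self.items[key] = value
        self.used = new_used

bucket = QuotaBucket(quota_bytes=10)
bucket.put("k", b"12345")
```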
Strawman Proposal
In order to guide the design process, I have developed a strawman refactoring of the server-storage code to use a KeyValueStore instead of an RDBMS. Uploaded to a user hg repo here: http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/
The proposed python API is defined in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvstore/__init__.py
The dummy KVS implementation is built on top of SQL and is available in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvstore/sql.py
The refactoring of the storage interface to use KVS is available in http://hg.mozilla.org/users/ryan_rfk.id.au/server-storage-kvs/file/tip/syncstorage/storage/kvs.py
The design is based mostly on my high-level understanding of the functionality provided by Riak, but I've also been thinking about how it might be implemented on top of other backend technologies, e.g. HBase or Cassandra. It's intended to be as backend-agnostic as possible.
Key points of the API follow, in increasing level of sophistication (i.e. difficulty to implement).
Namespacing
Most KVStore implementations provide key namespacing via "collections" or "buckets", so that each bucket is an isolated set of key/value pairs.
Based on the needs of server-storage and of other potential applications, I think it makes the most sense to associate a bucket with an (appkey, user_id) pair - i.e. each application can map each user to a set of buckets.
Python API
 store = magically_get_kvstore("THIS IS MY APPKEY")
 store.create_bucket("ryan@rfk.id.au", "notes")
 store.list_buckets("ryan@rfk.id.au")  =>  ["notes"]
 bucket = store.get_bucket("ryan@rfk.id.au", "notes")
HTTP API
 GET /appkey/notes
 => 404 Not Found
 
 PUT /appkey/notes
 => 204 No Content
 
 GET /appkey/
 => 200 OK
 Content-Type: text/json
 
 ["notes"]
 
 GET /appkey/notes
 => 200 OK
Some apps might find it useful to have an application-wide namespace, but I think it should be avoided if possible - as I understand it, much of our infrastructure is predicated on being able to shard on user-id to scale out storage and processing needs.
Note that the HTTP api doesn't specify the user in the URL, since this information will be provided in the auth credentials along with the request. Should we put it in the URL anyway for consistency?
Basic Key-Value Storage
This is the most primitive of all the functionality - a simple map from keys to values. Nothing fancy, and probably too primitive to build interesting apps on it alone.
Python API
 bucket.put("my key", "my exciting value")
 bucket.get("my key")  =>  "my exciting value"
 bucket.delete("my key")
 bucket.get("my key")  =>  KeyError
HTTP API
 PUT /appkey/bucketname/items/my%20key
 Content-Length: 17
 
 my exciting value
 
 GET /appkey/bucketname/items/my%20key
 => 200 OK
 Content-Length: 17
 
 my exciting value
 
 GET /appkey/bucketname/items/otherkey
 => 404 Not Found
Atomic Compare-and-Swap
Well, as atomic as is reasonable taking into account e.g. vector clocks etc. This is provided in lieu of transactions so that the application can check that it's not e.g. deleting updates made by another process.
Python API
 item = bucket.getitem("my key")
 item.value    # the value previously stored
 item.version  # opaque version identifier
 bucket.put("my key", "new value", ifmatch="WRONGVERSION")  =>  VersionMismatchError
 bucket.put("my key", "new value", ifmatch=item.version)    =>  OK
HTTP API
 GET /appkey/bucketname/items/my%20key
 => 200 OK
 Content-Length: 17
 X-Weave-Version: XXXYYY
 
 my exciting value
 
 PUT /appkey/bucketname/items/my%20key
 X-Weave-If-Match: WRONGVALUE
 Content-Length: 2
 
 Hi
 => 412 Precondition Failed
 
 PUT /appkey/bucketname/items/my%20key
 X-Weave-If-Match: XXXYYY
 Content-Length: 2
 
 Hi
 => 204 No Content
We could treat the version number like an ETag and use the standard HTTP headers for it, but I haven't checked whether this would violate anything in the RFC. Riak uses a custom header for its vclock thingo.
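The put/ifmatch API makes the usual compare-and-swap retry loop straightforward for applications. The sketch below uses a minimal in-memory stand-in for the proposed bucket API (the Bucket class and append_word helper are hypothetical, written just to demonstrate the loop):

```python
import itertools

class VersionMismatchError(Exception):
    pass

class Bucket:
    # In-memory stand-in for the proposed bucket API, just enough for the loop.
    def __init__(self):
        self._data = {}                  # key -> (value, version)
        self._versions = itertools.count()

    def getitem(self, key):
        value, version = self._data[key]
        return type("Item", (), {"value": value, "version": version})

    def put(self, key, value, ifmatch=None):
        if key in self._data and ifmatch != self._data[key][1]:
            raise VersionMismatchError(key)
        self._data[key] = (value, next(self._versions))

def append_word(bucket, key, word):
    # Standard compare-and-swap loop: read, modify, conditional write, retry.
    while True:
        item = bucket.getitem(key)
        try:
            bucket.put(key, item.value + " " + word, ifmatch=item.version)
            return
        except VersionMismatchError:
            continue  # another writer got in first; re-read and retry

bucket = Bucket()
bucket.put("greeting", "hello")
append_word(bucket, "greeting", "world")
```

This is exactly the pattern CAS is provided for: the application never blindly overwrites an update made by another process.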
Secondary Indexes
While we don't want to evolve back to a full-blown RDBMS, the server-storage code does need to be able to query its data based on some simple single-field indices - in particular it needs to perform queries based on last modification time or on a user-defined "sort order".
To assist, we let the application specify secondary indexes against which the data can be queried. The application is responsible for constructing and maintaining these indexes manually. The KVStore only provides the ability to query an existing index, and to atomically update indexes when storing a new value.
Python API
 bucket.put("key1", "data1", indexes={"modified": "2011-09-12"})
 bucket.put("key2", "data2", indexes={"modified": "2011-09-13"})
 list(bucket.scan("modified"))
   =>  [("2011-09-12", "key1"), ("2011-09-13", "key2")]
 list(bucket.scan("modified", reverse=True))
   =>  [("2011-09-13", "key2"), ("2011-09-12", "key1")]
 list(bucket.scan("modified", max="2011-09-12"))
   =>  [("2011-09-12", "key1")]
HTTP API
 PUT /appkey/bucketname/items/key1
 Content-Length: 5
 X-Weave-Index-modified: 2011-09-12
 
 data1
 => 204 No Content
 
 PUT /appkey/bucketname/items/key2
 Content-Length: 5
 X-Weave-Index-modified: 2011-09-13
 
 data2
 => 204 No Content
 
 GET /appkey/bucketname/indexes/modified
 => 200 OK
 Content-Type: text/json
 
 [["2011-09-12", "key1"], ["2011-09-13", "key2"]]
 
 GET /appkey/bucketname/indexes/modified?reverse=true
 => 200 OK
 Content-Type: text/json
 
 [["2011-09-13", "key2"], ["2011-09-12", "key1"]]
 
 GET /appkey/bucketname/indexes/modified?max=2011-09-12
 => 200 OK
 Content-Type: text/json
 
 [["2011-09-12", "key1"]]
Keys are always returned in sorted order. Accepted arguments to the scan would be "min", "max", "reverse" and "limit".
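To pin down the intended scan semantics, here is a small in-memory sketch of a bucket that maintains each index as a sorted list of (index_value, key) pairs and answers scans as range queries. The class is hypothetical; the put/scan signatures mirror the strawman API above:

```python
import bisect

class IndexedBucket:
    """Hypothetical sketch of index maintenance over plain key/value storage."""

    def __init__(self):
        self.items = {}
        self.index = {}  # index name -> sorted list of (index_value, key)

    def put(self, key, value, indexes=None):
        # In a real store, the item write and index updates must be atomic.
        self.items[key] = value
        for name, ivalue in (indexes or {}).items():
            entries = self.index.setdefault(name, [])
            # Drop any stale entry for this key before inserting the new one.
            entries[:] = [e for e in entries if e[1] != key]
            bisect.insort(entries, (ivalue, key))

    def scan(self, name, min=None, max=None, reverse=False, limit=None):
        # Range query over the sorted index; results are always in sorted order.
        entries = [e for e in self.index.get(name, [])
                   if (min is None or e[0] >= min)
                   and (max is None or e[0] <= max)]
        if reverse:
            entries.reverse()
        return entries[:limit]

bucket = IndexedBucket()
bucket.put("key1", "data1", indexes={"modified": "2011-09-12"})
bucket.put("key2", "data2", indexes={"modified": "2011-09-13"})
```

The hard part for a real backend is the comment in put(): keeping the index entries consistent with the stored value without cross-key transactions.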
It may be quite challenging to implement this efficiently, but I think it's worth pursuing as it will make the KVStore much more widely useful. For example, the server-storage app can't be written without indexing of some sort. So we would either have a KVS/DB hybrid or push the indexes into the KVStore.
Furthermore, several popular key-value stores are working on secondary indexes as a built-in feature. The above API is inspired by how Riak does it, but Cassandra also offers something similar. Whether they provide a performant-enough implementation remains to be seen.
Future: Lightweight Meta-Data
If the KVStore is being used to store large values, then it may be valuable to offer lightweight metadata alongside the main "body" data. This would make it possible to update just the metadata/indexes without re-uploading the entire body.
Python API
 bucket.put("key1", "lots of data here",
            meta={"modified": "2011-09-12", "owner": "Ryan Kelly"},
            indexes={"modified": "2011-09-12"})
 item = bucket.getitem("key1")
 item.meta  =>  {"modified": "2011-09-12", "owner": "Ryan Kelly"}
 # Update the metadata without changing the value.
 bucket.put("key1",
            meta={"modified": "2011-09-12", "owner": "Ryan Kelly"},
            indexes={"modified": "2011-09-12"})
Is it worth it? I don't know. It's a fair chunk of complication for the API.
Future: Explicit Conflict Resolution
If we build on a KVStore that has vector clocks (e.g. riak) then it might be worthwhile to expose any conflicts to the calling application, rather than going with last-write-wins. Again, it's a lot of complication in the API so I don't know that it's worth it at this stage.
But, we should be careful not to exclude the possibility.
Python API
 item = bucket.getitem("key1")
 # For simple use-cases we have a single "version" string.
 item.version  =>  "XXXYYY-AAABBB"
 # But it actually identifies all conflicting versions.
 item.versions.keys()  =>  ["XXXYYY", "AAABBB"]
 # And you can get at each particular version if you need to.
 item.versions["XXXYYY"].value  =>  "some value"
 item.versions["AAABBB"].value  =>  "some other value"
 # Putting will resolve the conflict.
 bucket.put("key1", "resolved", ifmatch=item.version)  =>  OK