Delta-Based Data Synchronization
- Brian Warner, Chris Karlof, Feb-2013
Summary: a scheme to efficiently synchronize browser data (bookmarks, passwords, etc) between multiple devices, with end-to-end encryption.
The goal of PICL is to provide a unified browser experience across multiple devices: all your browsers behave as extensions of the same collective tool. Changes you make on one device should be reflected on the other devices, and any device should be sufficient to give you access to all your data.
This is a specific instance of a data-synchronization problem, not unlike the problem that Git or Dropbox seeks to solve. The data in question contains things like a table of bookmarks, and the different devices are browsers in different places (or other tools which need access to the same data).
There are many aspects to this problem that must be solved, but merging tends to be the most challenging: what to do when multiple devices are changed at the same time. This problem could be eliminated by locking (which is not practical), or minimized by frequent synchronizing (to reduce the opportunities to get out of sync), but when devices can be offline for arbitrary periods of time, the problem cannot be avoided completely. If the user makes change A on one device, and change B on another device, the goal is to provide them with a dataset that looks as if they'd made both changes on the same device. If A and B touch different parts of the data, they should merge without fuss. If they modify the same thing, or A modifies an item that B deletes, then the result should at least be plausible, and should be as predictable and similar to the single-device case as possible.
The collection of data that must be merged depends upon the data type and how it is used. Bookmarks and passwords are conceptually "one big dataset": every device is reading and modifying the same tree or table of data. However "open tabs on other browsers" are currently read-only: each device must publish its open tabs, but merely allows the user to see the tabs from other machines. (In the future we might treat the set of open tabs as a common dataset to keep in sync, so opening a tab on one device would immediately open that same tab on all other devices, but that's not how we use browser today). The merge task is scoped to the collection, and is thus trivial for open-tabs where there is only one writer.
Two other goals are efficiency (transfer and store as little data as possible) and security (expose the user's data to as few machines and parties as possible).
Other constraints may include:
- Partial fetch: on very small devices, it may not be feasible to pull all bookmarks, and perhaps it should be possible to download only the most recently-used subset of them.
- Early access: when pulling a large dataset, it may be nice to provide user access to some frequently-used subset before the entire dataset is available
- Gentle on the server: users with large datasets should not cause database timeouts. Pagination might help, or breaking the dataset into smaller pieces.
Merge Strategies and Locations
We can take a lot of inspiration from Git. In the Git "central canonical repository" approach, "git push" never does any merging (it is only allowed to make fast-forward changes), and "git pull" is really a "git fetch" (which only makes fast-forward changes) plus a "git merge". So each client maintains a shadow copy of the server's current "canonical" version, and merges are performed against that shadow copy. This puts the burden of merge on the client, which must make the decisions before it can update the central repo.
We explore the client-side merge in this document.
An alternative approach would be to make the server retain a shadow copy of each device's state. To push changes, the client first updates the server's shadow (this is always a fast-forward). Then the server performs a merge with the current canonical dataset to produce a new canonical dataset, and sends this proposed version down to all devices. If the first device has made more local changes since updating the shadow, the proposed version is rejected, and the client starts the process again. This puts the burden of the merge on the server.
Git retains version history for the benefit of users, as any good revision-control system should. But it also helps with merging. By comparing the old and new versions of a repository, the tool can deduce the user's intended changes, which can then be applied to a different starting point. Most VCSes treat the revision as the primary state, but some (like Darcs) try to retain the changes themselves, and only build the state when necessary (like doing a checkout).
Git uses hash-based revision identifiers, with zero or more parent pointers, to capture an arbitrary DAG of revisions. This helps manage merging when repositories can be arbitrarily connected in any topology. Our task does not require such a general solution: we can assume a single central repository. So if we use client-side merging, we can use simple integers as revision numbers. (this may turn out to be insufficient, in which case we can fall back to git-style parent-revision-id lists).
We assume that each datatype has a canonical "native store", where the existing browser code expects to find it: e.g. the Places database for bookmarks. We expect that it is possible to enumerate the native store and to observe changes, and that our code can make a "write if not modified since" operation. This write operation must be atomic. The idea is that our code gets an event that says the user has changed the dataset (moving it to version X), then our code talks to the server and figures out merging stuff, then tells the native store "if you're still at version X, please make these changes". If the native store has moved on to version Y (because the user made another change), our code will update and merge and try again.
Our code will maintain a synchronizable representation of the native store using key-value pairs ("KVs"). The values in the pairs are then encrypted, and the resulting table of key-encryptedvalue pairs ("KEVs") is called a "revision". By sorting and hashing this table, we get a repeatable strong "version-ID". Clients then sign (or MAC) this version-ID (including a sequence number) to prevent the server from selectively omitting or replaying individual records.
Clients can compute deltas between these KEV sets to efficiently move the server (or another client) from one revision to another. The deltas contain two kinds of instructions: ADD/MODIFY(key, encryptedvalue) and DELETE(key). The deltas themselves are not signed, but the client which receives them can apply them to a copy of the datastore and compare the hash of the result against the signed hash in the version-ID, before accepting the delta for real.
Server Holds Current Version, and Sometimes Deltas
For each collection, the server holds the signed version-ID for the current version of the dataset, and the complete list of (encrypted) KVEs for that version. It may also retain deltas between various historical versions. At any time, the server can merge two deltas (from X->Y and Y->Z into X->Z), or delete deltas that it no longer thinks will be useful. It simply concatenates the ADD/MODIFY/DELETE instructions and discards all but the last instruction for each key.
Clients can always push or fetch the entire KVE set if the receiving side does not have a convenient starting point for a delta, at the expense of bandwidth.
Since the server will generally know about all client devices, it can remember what version is held by each client, and retain a delta from that client's version to the current one, or to some minimal set of intermediate steps. For example, if the server knows of devices that are at version 1, 5, 10, and 12, it could retain the full set of deltas (1->2,2->3,etc,11->12), or the minimal set (1->5,5->10,10->12), or it could decide that the oldest two devices are not worth providing a fast update path and merely retain 10->12. The older devices can still synchronize correctly, but will have to fetch the full KVE list. The newer devices will be able to fetch a small delta.
The sync client cannot remain connected to the server at all times, nor can it block UI changes to browser state (bookmarks, etc) until it holds a lock on all other copies of the data. As a result, the client may have local changes that do not match other changes on the server.
We can represent these changes on a two-axis graph. The native dataset is at version "A", and as the device accumulates local changes, "A" moves further to the left, away from the common basis version "B". The server's dataset is at version "C", and it moves upwards as the server receives changes.
"B" is used to keep track of the last server version that was incorporated into our local dataset "A" version. We assume that our sync code can perform atomic test-and-set modification operations on the native datastore (failing if the native store has changed since we last looked), but we do not assume that we get synchronous notice of native changes.
In the steady state, when all devices are connected and all versions merged, all three versions A+B+C are the same. This is the starting point ("State 1") of our merge state machine.
While in this happy place, our device may receive a message from the server (either a spontaneous push notification, or a response to a "am I up to date?" poll request) with a new "C" version. This moves our "C" version up the server-changes axis. If we can apply the new version to the native store, we update A and B to the new version and return to rest. If not (because the user made a change in the short window while we were processing the notification), we must merge, by moving to State 2.
State 2 is transient (it really only exists for code factoring purposes). When we enter State 2, we have unreconciled changes that must be merged. We apply a data-type -specifc three-way merge algorithm, using "B" as the common ancestor, and "A" and "C" as the left and right sides. This results in a new merged "A+" version that includes both sets of changes. We then try to update our native store with this version, which might fail if the user made changes during the merge process. If it fails, we abandon the merge results, update A to the new native datastore contents, and start the merge again. If the user stops clicking for a moment, the merge will eventually succeed, and update both A and the native datastore to the merged A+, and updates the basis revision "B" to the server version "C". The FSM then moves to State 3 to prepare to push data to the server.
State 3 is also used when the steady state 1 is disrupted by the user making a local change (editing a bookmark, etc). While in state 3, versions B and C are the same (we think we're up-to-date with the server), but A is different since it contains local changes that have not yet been presented to the server. We hang out in State 3 until we think we can+should contact the server (possibly for a long time if we're offline, or maybe just waiting for a polling timer to expire). Further changes to the native dataset are ignored: we'll include them all in the update message once we get online. If we weren't as offline as we thought and suddenly receive an "update C" message from the server, we move to State 2 to merge the results, then come back to State 3 with a new local version to push. When we finally think we're online, the client constructs a "please replace your version C with my version A" message and sends it to the server, then moves to State 4 to wait for a response.
The server will either accept A, or reject it (if some other device submitted an update first). In either case it will respond by sending down a new C value. This will equal "A" if the update was accepted, or some other new value (from the other device) if it was rejected.
The client waits in State 4 until the server responds or it gets impatient and times out. A timeout indicates we're offline and returns the FSM to State 3, where it will wait until we're online again. If we get a server response that indicates our "A" was accepted, we update B and C to match, and return to State 1 (all versions equal). If the server rejected our update, that indicates that we must merge the new server version with our local one, so we return to State 2 to do the merge.
While waiting for the server to respond in State 4, the user might make changes to the local datastore (moving our proposed "A1" to an even newer "A2"). We ignore these updates, and rely on the C==A check in the server response code to handle the changes. If the server accepted our "A1", it will send down a "C is now A1" message, but since A1!=A2, we'll treat this as if some other device managed to submit their own "A1" value while we were trying to submit "A2", and the FSM will switch to State 2 to three-way merge B->A1 and B->A2. This is a case where more history could help: the simple and correct resolution is to just accept A2 (since A2 is a descendant of A1), but without some ancestry information, the merge algorithm won't know that. If the merge code we write doesn't provide good results here, we should store some hints for this case.
In all cases, when the client sends a new version up to the server, or when the server sends a new version to the client, the sender provides a signed version-ID record, and the client checks the MAC on this record before accepting it. The client also asserts that the proposed new version contains a higher sequence number than the client's current version. This prevents the server from forging datasets or replaying older ones (the only remaining mischief is to pretend no new versions have been created).
The version data itself can be sent as either a full KVE set, or a delta (or sequence of deltas X->Y->Z) when the receiver knows the previous version. The sender should choose whichever approach is convenient and shortest (for deltas which delete the majority of keys, it may be faster to send the full list even if deltas are available). If deltas are used, the client must then copy its KVE set and apply the deltas, then compute the hash and assert that it matches the version-ID.
While the client nominally tracks three or four separate versions of the dataset (A, B, C, and A+ during a merge), implementations may optimize this by storing just one full revision (probably B) and merely record deltas for the others (B->A, B->C, A->A+).
One open question is how to best store this one full revision. For small datasets it can be held in RAM (although it must persist across reboots if the browser is stopped outside of State 1). Also it's slightly tricky (although not hard) to efficiently recompute the hash-based versionIDs of the variants (one approach is to keep both B and the deltas sorted, then walk the keys in B and the keys of the delta at the same time, including them in the hash only if they were deleted in the delta, and always including the newer version).
If the dataset only needs to travel in one direction (e.g. outbound-only "my open tabs" list, or inbound-only "tabs on other devices" lists), the state machine is much simpler:
Most of the time, the client and server can exchange deltas to get from one version of the dataset to another. They'll only need to send a full version when the client is so out-of-date that the server no longer remembers their earlier state (and thus cannot compute or accept a delta from it), or when the client is uploading their dataset for the first time (which is effectively a delta from an empty set).
To tolerate intermittent connections and overloaded servers, the protocol that transfers deltas should also let both sides paginate the data, sending it in smaller chunks.
When the client sends a new version to the server, it sends a series of update messages. Each message contains the endpoints of the delta, where the "from" is expressed as a sequence-number and version-hash, and the "to" is a signed seqnum+verhash. The "from" may be a special "EMPTY" symbol to indicate that this is the initial client upload, so the delta is applied to an empty set.
Each message also indicates the endpoints of the range of keys being sent. The "first=" parameter contains the key of the first element. "upto=" contains the first key that is higher than the last element. The message does not include any data for the "upto=" key: in mathematical terms, the range is [first,upto). For each message, "first=" must equal the previous message's "upto=" value. On the last message, "upto=" will be a special symbol called "END", which tells the server to assemble the messages into a delta, build the target version, and apply it to the server-side dataset.
Each message provokes an "ok" response if all is well. If the server is updated by some other client, the "from" in the message will not match the server's current version, and it returns an error response that includes the new current version. The client then abandons its transfer, fetches the new server version, merges, and starts again.
The server might also, for some reason, forget about the transfer during the process (e.g. if the server is assembling the delta in a memcached process which is then restarted). The server can return an error response that tells the client to re-send the messages starting from the beginning.
When the last message is accepted (i.e. this client successfully updated the server version), the response announces the new server version. The client then updates their view of the server's version ("C").
Fetching versions from the server starts with the client polling to find out the current version. Once it realizes that it needs to get the new version, it can ask the server for a delta from the client's most-recently-fetched version ("B" in the diagrams above) to the server's current version ("C"). If the server cannot provide a delta from B, the client tries again with a special marker called "EMPTY" for the old version: this tells the server to send the entire dataset (expressed as a delta from an empty set). If the server again refuses to help, this indicates that the client is out of date and must poll again (this poll should be rate-limited, as a very confused server could clobber itself by triggering unlimited polling).
The client asks for a batch of delta records starting at "first", which is initially set to a special symbol named "START". The server decides how many records to return, and includes a marker named "next" which is either the next key that should be fetched, or a symbol named "END". The client should keep fetching batches, always using the previous response's next= value in the subsequent request's first= parameter, until it gets next=END. When it receives the END, it has all of the deltas, and can assemble the version and apply/merge it into the local dataset.
Once the client has updated their local state, it should poll again, informing the server that they are no longer using the old version ("B"). The server may then forget about B and any deltas that use it (specifically, the server tracks which clients are at which versions, it can drop a version when no clients are using it, and only needs to retain deltas between versions that are in use).
If the server is updated while the client is fetching an old version, the server should continue to help the client with their transfer (retaining the old delta until the transfer is complete). Once the client finishes the partial update, it will poll again, and begin a transfer to the new state. Another option would be to abandon the fetch and start again with the new updated version. I'm not sure which is better: abandoning the fetch could make it hard for the client to catch up to rapid server-side changes.
Sync efficiency is improved when the native datastore maintains a stable ID for each entry, and modifies entries in place. With IDs, modifying the name of a bookmark can be distinguished from deleting the bookmark and adding one with the same URI but different name. This will also improve the fidelity of merge operations: it is clear how to merge one change (that modifies the bookmark's name) with another (that moves it to a different folder).
This section details how various browser native datastores are organized, and how they should be transformed into a set of key-value pairs.
With luck, a simple "take the latest value for each key" merge algorithm should give good results. But some datatypes may be improved with more sophisticated approaches, also described here.
Open tabs are not a shared dataset: each device uploads its own list of tabs independently. So no merging needs to take place. The client can use a single KV, rather than putting one tab into each KV entry, which contains a list (with one element per tab).
That one element should contain, at first, simply the URL of that tab's currently-displayed page. Later versions should include the full list of URLs (forward- and backward- history), the index into this list of the currently-displayed page, and a scroll offset for each page. (this information is recorded in sessionstore.js in the profile directory).
Using a single KV entry retains the order of displayed tabs.