canmove, Confirmed users
640
edits
No edit summary |
No edit summary |
||
| Line 134: | Line 134: | ||
* Hashes: content-addressing. | * Hashes: content-addressing. | ||
== Garbage collection == | |||
=== Assumptions === | |||
* It's OK to collect objects that aren't referenced by the current ref version. Clients can reupload collected objects if they're needed later (modulo race concerns), and clients don't have to walk history (they can merge with the current tree at the loss of some granularity). | |||
* But: we might not wish to collect all unreachable objects, because it can save client effort, and the longer the stretch of history the more likely it is that a client can follow a history chain. The primary purpose of GC is to *eventually* clean up, not reach zero overhead. (In that respect this is much like TTLs in Sync 1.1.) | |||
=== Three options === | |||
* Reference counting | |||
* Mark and sweep | |||
* Generations | |||
Reference counting would involve transactionally updating counts for all referenced objects when a ref revision is set or garbage collected. I think that rules it out, but it would allow for accuracy wrt whether a record is needed for a particular ref revision. | |||
Strict generation collection is essentially equivalent to reference counting: all records reachable from each ref revision are numbered with that revision's version (where the version is a server 'clock'). To do so involves touching each of those objects for every ref set, which is also undesirable. | |||
Mark and sweep delays this work until GC time: at GC time we'd walk the current revision's tree, look at the entire record set, and collect unreachable records. | |||
Lazy generation collection achieves an approximately time-based collection approach by decoupling record generations from ref revision generations: new records are given the generation of the forthcoming ref revision, and trees refer to the lowest version of any of their children (which implies a bottom-up upload). | |||
All objects with a version lower than the current ref revision's tree's minimum generation number can be automatically collected. The longer the stretch between the current and the oldest version, the more reusable objects stick around, but the greater the space usage. | |||
Efficiency can be improved by bumping record versions when trees are updated, which is like incremental marking. | |||
=== Flaws in this approach === | |||
* If you have one, old, untouched bookmark tree, all of its siblings that come and go will have higher version numbers, and won't ever be collected. For bookmarks this is unlikely to be a problem. There are solutions to detect garbage and force it to be exposed: e.g., tracking subtree counts to allow % garbage to be detected, and explicit re-marking of old records (requires touching trees). | |||
* There is a race between GC and new tree references. However, we don't expect clients to have an accurate set of known server records (and if they do, they can be GC-aware), so this is unlikely to be an issue in practice. A client that knows it's about to reuse a record can bump its generation number without reuploading the whole record, then upload new trees accordingly. | |||
=== Algorithm === | |||
; Upload : | |||
:# If necessary, discover or bump existing leaf object version numbers from server. | |||
:# Submit new leaf object references to the server. Record the lowest returned version number for each set of records contributing to a tree. | |||
:# Submit new trees, each referring to their lowest version number. Each tree will be assigned a version number itself, which will be used in the next layer of trees. | |||
:# Update root ref to topmost tree. It'll be given the new version number. Subsequent object uploads will receive the next generation number. | |||
; Garbage collect : | |||
:# Look at current ref, or earliest desired ref version (e.g., the earliest known ref for your client set), and find their lowest used version number. | |||
:# Delete all records with a lower version number. | |||
; Bump : | |||
:# Find each tree with a too-old lowest version. | |||
:# Walk each tree to find leaf objects hashes. | |||
:# Request that each of a set of hashes be allocated a new version number. | |||
:# Either by client action or through server support, update lowest-version for each modified tree. | |||