DXR Parallel Tree Indexing
Once we have request-time rendering in place and no longer have to drag folders full of static HTML around the filesystem, indexing trees in parallel becomes feasible. We'll need that if we're going to scale up to dozens of trees. (I'm also assuming Elasticsearch, abbreviated ES below.) Motivations:
- Reduce the total time to refresh all indexes so we can stay reasonably up to date.
- Don't let a broken build on one tree scuttle the indexing of the rest.
The config file is always going to exist, at least to point to the ES servers. These settings can stay there; changing them will require a WSGI restart:
- ES_HOSTS
- WWW_ROOT
ES aliases can handle atomic transitions between versions of a tree's indices.
But where do we keep the list of aliases and descriptive data about each tree? In a dedicated index that has 1 shard(?) and replicas on many nodes so queries are fast. The docs would look like this:
{name: "mozilla-central", alias_prefix: "dxr_hot_prod_mozilla-central", description: "Mozilla Central is a cool tree.", default: true}
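A minimal sketch of how the tree list might be represented, assuming Python on the webapp side. The replica count and the helper names (tree_index_settings, tree_doc, default_tree) are hypothetical; only the document fields come from the example above. The functions just build request bodies, so nothing here talks to a cluster.

```python
import json


def tree_index_settings(replicas=2):
    """Body for creating the tree-list index: one shard, several replicas.

    The replica count is a placeholder; the right value depends on cluster size.
    """
    return {"settings": {"number_of_shards": 1, "number_of_replicas": replicas}}


def tree_doc(name, alias_prefix, description, default=False):
    """One document per tree, mirroring the fields in the example doc."""
    return {
        "name": name,
        "alias_prefix": alias_prefix,
        "description": description,
        "default": default,
    }


def default_tree(docs):
    """Return the first doc flagged as default, or None if no tree is flagged."""
    for doc in docs:
        if doc.get("default"):
            return doc
    return None


doc = tree_doc(
    "mozilla-central",
    "dxr_hot_prod_mozilla-central",
    "Mozilla Central is a cool tree.",
    default=True,
)
payload = json.dumps(doc)  # what would be sent as the document body
```

Keeping these as plain dictionaries means the same structures can be serialized for ES or handed straight to templates.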
Ordinarily, I'd lean toward memcached for those, but we'd be introducing another server and another lib for just one value. However, it turns out that ES is fast. Locally, in a one-node cluster (so we'll have to re-bench more realistically to be sure), a simple search for all documents takes 1.374 ms per request on average at a concurrency of 3, or 0.458 ms when averaged across all concurrent requests:
    % ab -n 100 -c 3 'http://127.0.0.1:9200/dxr_test/tree/_search'
    Time per request:       1.374 [ms] (mean)
    Time per request:       0.458 [ms] (mean, across all concurrent requests)

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.1      0       0
    Processing:     1    1   0.2      1       2
    Waiting:        1    1   0.1      1       1
    Total:          1    1   0.2      1       2
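The two "Time per request" figures in the ab output differ by exactly the concurrency level, which is worth keeping in mind when quoting either one. A small sketch of the arithmetic; the function name is hypothetical:

```python
def per_request_ms(mean_across_concurrent_ms, concurrency):
    """Convert ab's 'across all concurrent requests' mean into the per-request mean."""
    return mean_across_concurrent_ms * concurrency


# 0.458 ms across 3 concurrent requests corresponds to about 1.374 ms per request.
latency_ms = per_request_ms(0.458, 3)
```

The smaller number describes throughput (one response completes every 0.458 ms), while the larger one is closer to what a single page view would wait.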
Indexing a tree would look like this:
- Make a new index called "dxr_hot_prod_formatversion_mozilla-central_somerandombits". To keep the name unique, we might prepend a timestamp and/or the machine name to the random bits, use a machine-local lock, or do something fancier.
- Index the tree into it.
- Deploy by updating (or creating) the "dxr_hot_prod_formatversion_mozilla-central" ES alias to point to the newly built ES index. (We'd have to worm the format version into someplace webapp-accessible so it would know what to sub in for "formatversion".)
- Update or insert the doc representing this tree (a PUT that uses the alias as the document id). Most of the time, nothing will change.
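The naming and alias-swap steps above can be sketched as follows. This assumes Python and the request-body shape of Elasticsearch's `_aliases` endpoint, which applies a list of add and remove actions in one atomic request. The function names and the choice of a UUID for the random bits are assumptions, not settled design.

```python
import uuid


def new_index_name(alias, random_bits=None):
    """Return '<alias>_<random bits>' so concurrent builds never collide on a name."""
    bits = random_bits or uuid.uuid4().hex[:12]
    return f"{alias}_{bits}"


def alias_swap_actions(alias, new_index, old_indices=()):
    """Build the body for POST /_aliases.

    The alias is removed from any old indices and added to the new one in the
    same request, so readers never see the alias missing or pointing at both.
    """
    actions = [{"remove": {"index": old, "alias": alias}} for old in old_indices]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


alias = "dxr_hot_prod_formatversion_mozilla-central"
index = new_index_name(alias, random_bits="abc123")
body = alias_swap_actions(alias, index, old_indices=[alias + "_oldbits"])
```

On the first build of a tree there are no old indices, so the body contains a single add action, which covers the "or creating" case.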
Deploying new DXR code would look like this:
- Grab all the alias prefixes out of the "tree" index.
- Check whether each alias (the prefix plus the format version of the code I'm thinking about deploying) exists.
- If all of them are there—that is, all trees have been built to be compatible with the new code—deploy the new webapp code.
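The deploy check above can be sketched as a pure function. This assumes Python; the names missing_aliases, safe_to_deploy, and alias_exists are hypothetical. How the format version combines with the alias prefix is not pinned down here, so the default alias_for below (prefix, underscore, version) is only a placeholder. In a real deployment, alias_exists might issue a HEAD request against the ES alias endpoint; here it is any callable, which keeps the logic testable without a cluster.

```python
def missing_aliases(alias_prefixes, format_version, alias_exists, alias_for=None):
    """Return the aliases the new code would need that do not exist yet."""
    if alias_for is None:
        # Placeholder naming rule; swap in whatever scheme is actually adopted.
        alias_for = lambda prefix, version: f"{prefix}_{version}"
    wanted = [alias_for(prefix, format_version) for prefix in alias_prefixes]
    return [alias for alias in wanted if not alias_exists(alias)]


def safe_to_deploy(alias_prefixes, format_version, alias_exists, alias_for=None):
    """True only when every tree has been built for the new format version."""
    return not missing_aliases(alias_prefixes, format_version, alias_exists, alias_for)


# Example with an in-memory stand-in for the cluster's alias list.
existing = {"dxr_hot_prod_mozilla-central_7"}
prefixes = ["dxr_hot_prod_mozilla-central", "dxr_hot_prod_comm-central"]
blocked_by = missing_aliases(prefixes, 7, existing.__contains__)
```

Returning the list of missing aliases, rather than only a boolean, makes it easy to report which trees are holding up a deploy.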