DXR Parallel Tree Indexing


Now that we have request-time rendering in place and no longer have to drag folders full of static HTML around the FS, indexing trees in parallel becomes feasible—something we'll need if we're going to scale up to dozens of trees. Motivations:

  1. Don't let a broken build on one tree scuttle the indexing of the rest.
  2. Reduce time to refresh the sum of all indexes so we can keep reasonably up to date.

The config file is always going to exist, at least to point to the ES servers, but we need only the original, user-edited one, not a second one generated as an FS artifact of indexing. The following settings, currently pulled from the generated file, will instead be read from the user-edited one. Changing them will require a WSGI restart.

   www_root
   es_hosts
   google_analytics_key
   default_tree
   max_thumbnail_size

ES aliases can handle atomic transitions between versions of a tree's indices.
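
As a minimal sketch of such a transition, the function below builds the body for a single POST to the Elasticsearch `_aliases` endpoint, which applies all of its actions atomically. The function name and the example index names are hypothetical; only the `actions` format comes from ES itself.

```python
import json


def alias_swap_body(alias, new_index, old_indices=()):
    """Build the body of a POST to /_aliases that repoints `alias`.

    All actions in one request are applied together, so a search never
    sees the alias pointing at nothing or at both old and new indices.
    """
    actions = [{"remove": {"index": old, "alias": alias}} for old in old_indices]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# Hypothetical index names, for illustration only.
body = alias_swap_body(
    "dxr_11_mozilla-central",
    new_index="dxr_11_mozilla-central_b2",
    old_indices=["dxr_11_mozilla-central_a1"],
)
print(json.dumps(body, indent=2))
```

On a tree's first indexing run there is nothing to remove, so `old_indices` is left empty and the request simply creates the alias.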

But how do we know which of the trees in the config file are actually indexed so far? IOW, what if someone adds a new tree, and it takes a while to index? Or what if somebody enables a new plugin for that tree, and we don't want to start showing filters for it until it's actually been used in an indexing run? We need to freeze certain attributes of a tree as they were at index time. We'll keep the list of these "frozen" attributes in a dedicated index that has 1 shard(?) and replicas all over the place so queries are fast. The docs would look like this:

   {id: '11/mozilla-central',  # "compound index" of format and name, to rule out any possibility of duplication
    name: "mozilla-central",
    format: 11,  # By storing the number here, we can just query for.... We don't have to worry about the alias template string going stale; the deploy script re-reads the config file each time.
    es_alias: "dxr_11_mozilla-central",  # in case es_alias changes in the conf file
    description: "Some tree",  # needed so new trees or edited descriptions can show up without a WSGI restart
    enabled_plugins: ["clang", "pygmentize"],
    generated_date: "2012-05-02 11:04:50",
    maybe some plugin config  # TODO: how do we pluggably serialize this? We don't need any of these yet, so it can wait.
    }
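
A rough sketch of how the indexer might assemble such a doc follows. The function name and signature are hypothetical, and returning the id separately from the body (rather than embedding it) is an assumption made for illustration. Plugin config is left out, since its serialization is still undecided.

```python
from datetime import datetime


def frozen_tree_doc(name, format_version, es_alias, description,
                    enabled_plugins, generated):
    """Return (doc_id, doc) describing one tree as it was at index time."""
    # Compound id of format and name, so each (format, tree) pair has one doc.
    doc_id = "%s/%s" % (format_version, name)
    doc = {
        "name": name,
        "format": format_version,
        "es_alias": es_alias,
        "description": description,
        "enabled_plugins": list(enabled_plugins),
        "generated_date": generated.strftime("%Y-%m-%d %H:%M:%S"),
    }
    return doc_id, doc


doc_id, doc = frozen_tree_doc(
    "mozilla-central", 11, "dxr_11_mozilla-central", "Some tree",
    ["clang", "pygmentize"], datetime(2012, 5, 2, 11, 4, 50))
print(doc_id)
print(doc)
```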

Ordinarily, I'd lean toward putting memcached in front of those, but we'd be introducing another server and another lib for just one value. However, it turns out that ES is fast. Locally, in a one-node cluster (i.e. we'll have to re-bench more realistically to be sure), a simple search for all documents takes 1.374 ms per request on average at a concurrency of 3, which works out to 0.458 ms when averaged across all concurrent requests:

   % ab -n 100 -c 3 'http://127.0.0.1:9200/dxr_test/tree/_search'
   
   Time per request:       1.374 [ms] (mean)
   Time per request:       0.458 [ms] (mean, across all concurrent requests)
   
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.1      0       0
   Processing:     1    1   0.2      1       2
   Waiting:        1    1   0.1      1       1
   Total:          1    1   0.2      1       2
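
The two "Time per request" figures are related by the concurrency level: ab derives the second one from total wall-clock time divided by the number of requests, so it equals the first divided by the `-c` value. A quick check using only the numbers from the output above:

```python
concurrency = 3              # from `-c 3`
mean_per_request_ms = 1.374  # first "Time per request" line

# Mean across all concurrent requests: the per-request mean spread
# over the requests that were in flight at the same time.
across_all_ms = mean_per_request_ms / concurrency
print(round(across_all_ms, 3))
```

So the 0.458 ms figure describes throughput, while a single request at this concurrency waits about 1.4 ms.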

Indexing a tree would look like this:

  1. Make a new index called "dxr_hot_prod_formatversion_mozilla-central_somerandombits".
  2. Index the tree into it.
  3. Deploy by updating (or creating) the "dxr_hot_prod_formatversion_mozilla-central" ES alias to point to the newly built ES index. (We'd have to worm the format version into someplace webapp-accessible so it would know what to sub in for "formatversion".)
  4. Update or insert the doc representing this tree (a PUT whose id is the alias). Most of the time, nothing in the doc will change.
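
The steps above can be sketched as a plan of operations. Everything here is illustrative: the function names and the step labels are hypothetical, and using a truncated `uuid4` for the "somerandombits" suffix is an assumption. No ES calls are made; the sketch only shows how the names relate.

```python
import uuid


def alias_name(format_version, tree):
    """Stable name that the webapp queries."""
    return "dxr_hot_prod_%s_%s" % (format_version, tree)


def new_index_name(format_version, tree):
    """Fresh index name: the alias name plus random bits, so each run is unique."""
    return "%s_%s" % (alias_name(format_version, tree), uuid.uuid4().hex[:8])


def indexing_plan(format_version, tree):
    """List the four steps, in order, for one indexing run of one tree."""
    alias = alias_name(format_version, tree)
    index = new_index_name(format_version, tree)
    return [
        ("create_index", index),        # 1. make a new index
        ("index_tree", index),          # 2. index the tree into it
        ("point_alias", alias, index),  # 3. deploy by repointing the alias
        ("put_tree_doc", alias),        # 4. upsert the tree doc, id = the alias
    ]


for step in indexing_plan(11, "mozilla-central"):
    print(step)
```

Because each run writes to its own index, a broken build on one tree leaves its alias pointing at the last good index and does not touch any other tree.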

Deploying new DXR code would look like this:

  1. Figure out the currently-deployed version, probably from just looking at the `format` file on the FS.
  2. Get the "tree" docs of both that (say, 11) and the new format version we want to deploy (12).
  3. If the version-12 tree docs cover at least the intersection of the version-11 tree docs and the trees from the config file, deploy the new webapp code. (This ensures that we never decrease tree coverage, even if someone deletes a tree from the config file or adds a new one. It also won't hold up deploys if version 11 of a tree never got around to being built.) Also, delete all the tree docs of version 11, just to keep the index from growing forever.
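
The check in step 3 is a plain set comparison. A minimal sketch, assuming each argument is a collection of tree names already fetched from the tree docs and the config file (the function name is hypothetical):

```python
def safe_to_deploy(old_trees, new_trees, config_trees):
    """True if deploying the new format version would not drop any tree.

    A tree is required only if it is both served by the old version and
    still listed in the config file.
    """
    required = set(old_trees) & set(config_trees)
    return required <= set(new_trees)


# "gone" was deleted from the config and "fresh" was never built for the
# old version, so neither one blocks the deploy.
print(safe_to_deploy(
    old_trees=["mozilla-central", "gone"],
    new_trees=["mozilla-central"],
    config_trees=["mozilla-central", "fresh"],
))
```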