DXR Storages
trilite, the trigram index which powers DXR searches, is extremely powerful and was ahead of its time, but I wouldn't mind getting off it, now that the rest of the world has started to catch up. Motivations:
- We have to maintain it.
- It's bound to SQLite, which imposes scalability challenges. It requires the webhead to be the one to execute the expensive queries. It can't parallelize beyond a single machine (or processor?).
- Contributors have trouble building it and end up having to mess with LD_LIBRARY_PATH.
- It cores on my development VM (but not for anybody else) when I pass it a bad regex, and I don't know why.
- We haven't implemented delete or update support in it, which we'd need for incremental indexing.
- A niggle: can't count total results without fetching them all or running a (possibly expensive) query twice. Lame.
Here are some alternatives. Let's go shopping.
Elasticsearch
Pros:
- Parallelization of even single queries
- Redundancy
- Caching of intermediate filter results. (It will transparently store the set of, say, all *.cpp files as a bitmap for quick set operations later. This means that common compound queries will likely be much faster than in any RDBMS.)
- Suggesters for "did you mean" and autocomplete
- Sparse attribute storage, which could come in handy for supporting new languages
- Scoring
- Extreme flexibility of indexing: index whatever chars we want, index them any number of ways (case-sensitive or not), and entirely too much more
Challenge: Fast Regex Matching
ES lacks out-of-the-box support for trigram-accelerated regex searches. It does offer some regexp query support which uses term indices and some clever enumeration to deliver non-naive performance, but I wonder if it would be enough. I don't see how you can use a term index to accelerate a regex that may span terms. I suppose you could index each line as a single term and then start every non-anchored regex search with .*, but that seems like a great way to have an enormous number of unique terms and be super slow.
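For comparison, here's roughly what leaning on the built-in regexp query would look like, assuming the official Python client and an invented index/field layout ("dxr", "line"). The pattern is matched against whole terms, which is why a substring-style search over line-sized terms needs the leading and trailing .* mentioned above:

    # Sketch only: index name, field name, and document layout are made up.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])
    results = es.search(index='dxr', body={
        'query': {
            'regexp': {
                # Anchored to the whole term, hence the .* on both ends:
                'line': '.*memcpy.*'
            }
        }
    })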
Here's how we could get ES up to trilite speeds without too much trouble. If we're willing to extract trigrams ourselves, we can easily tell ES to filter down to the docs containing those, then run a regex prosaically across them.

And it just so happens that Python's stdlib regex machinery is written in Python. (I couldn't believe it either.) Take a look at sre_compile.py and sre_parse.py. They literally build a Python list of instructions, like ['charset', 'groupref_ignore', ...], and then pass that to _sre.compile(), which is the only C speedup I see. Presumably, that returns a fast matcher of some kind, with guts in machine language.

So, we harness that existing regex parser, extract trigrams, and go to town. This actually gives us flexibility beyond what trilite provides, in that we have the option of running non-accelerated regex queries, foolish though that may be. And we have full control over the complexity of regexes that we allow, since that isn't buried in re2 any longer. At any rate, it's something to consider.
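A minimal sketch of the extraction idea, using the stdlib's own regex parser. Only the happy path is handled: maximal runs of plain literal characters are collected, and anything fancier (classes, repetition, alternation) simply ends the current run. The function names are invented for illustration:

    from sre_constants import LITERAL
    from sre_parse import parse

    def literal_runs(pattern):
        """Yield the runs of plain literal characters in a regex."""
        run = []
        for op, arg in parse(pattern):
            if op == LITERAL:
                run.append(chr(arg))
            else:
                if run:
                    yield ''.join(run)
                run = []
        if run:
            yield ''.join(run)

    def required_trigrams(pattern):
        """Return trigrams every match must contain. An empty set means we
        couldn't prove anything and would fall back to an unaccelerated scan."""
        grams = set()
        for run in literal_runs(pattern):
            for i in range(len(run) - 2):
                grams.add(run[i:i + 3])
        return grams

    print(required_trigrams(r'void \w+_handler'))

The resulting trigrams would presumably go into a terms filter over a field analyzed into trigrams (ES's ngram tokenizer can produce those), and the surviving docs get the real regex run across them.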
PostgreSQL
Postgres is the other contender. As of 9.3 it supports trigram-accelerated regex searches, but, sadly, the trigram extraction is hard-coded to consider only word-like chars. We'd probably have to hack the source to circumvent that.
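Concretely, the 9.3 feature is pg_trgm's GIN operator class accelerating the plain ~ operator. A sketch of what using it would look like, assuming psycopg2 and an invented "lines" table:

    # Sketch only: table and column names are made up.
    import psycopg2

    conn = psycopg2.connect('dbname=dxr')
    cur = conn.cursor()
    cur.execute('CREATE EXTENSION pg_trgm')
    # The GIN index on gin_trgm_ops is what lets the planner accelerate ~.
    # Only word-like chars contribute trigrams, per the limitation above.
    cur.execute('CREATE INDEX lines_trgm_idx ON lines '
                'USING gin (content gin_trgm_ops)')
    conn.commit()
    cur.execute('SELECT path, content FROM lines WHERE content ~ %s',
                (r'void \w+_handler',))
    for path, content in cur.fetchall():
        print(path, content)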
Pros:
- An easy port from our existing relational-based storage
- Out-of-the-box crazy-smart extraction of trigrams from regexes (see page 49), comparable to Google Code Search. We'd have to do this in Python (which could be fun) to use ES.
Trilite Unique Advantages
These are really due to re2.
- Ability to cap memory use for regex searches. Currently set at 8MB.
- Guaranteed linear-time searching (relative to corpus size), because it builds automata rather than backtracking, which is nice for fending off DoSes. [Actually, I'm not sure ES or PG don't do that as well. Lucene's RegexpQuery is a subclass of AutomatonQuery.]
Keeping Outboard Storages Synced
Deployment is easy right now because static HTML files and the SQLite DB are all in the same folder. We do an atomic move of the whole thing into place, and that's our deploy process. For a data store off on a different machine someplace, this gets slightly more (though not too) interesting. We can dump a random identifier into config.py in the instance and make that identifier part of the name of the ES (or whatever) index we build as part of that instance build. Just make sure the deploy script deletes the old index after it pushes the instance using the new one into production.
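A sketch of the unique-index-per-build idea, assuming the official ES Python client; the naming scheme and config handling are invented:

    import uuid
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])

    def build_index():
        """Create a uniquely named index for this instance build and return
        its name, which would get written into the instance's config.py."""
        name = 'dxr-%s' % uuid.uuid4().hex
        es.indices.create(index=name)
        return name

    def retire_old_index(old_index):
        """Once the instance pointing at the new index is in production,
        drop the index the previous instance was using."""
        if old_index:
            es.indices.delete(index=old_index)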
Ultimately, it would be nice to get the static HTML out of the instance, build pages dynamically at request time, and cache them. We should time our current page builds and see if they are short enough to feasibly do during the request.