DXR Storages

trilite, the trigram index which powers DXR searches, is extremely powerful and was ahead of its time, but I wouldn't mind getting off it, now that the rest of the world has started to catch up. Motivations:

  • We have to maintain it.
  • It's bound to SQLite, which imposes scalability challenges: the webhead has to execute the expensive queries itself, and it can't parallelize beyond a single machine (or even a single processor?).
  • Contributors have trouble building it and have to mess with LD_LIBRARY_PATH to get it to load.
  • It cores on my development VM (but not for anybody else) when I pass it a bad regex, and I don't know why.
  • We haven't implemented delete or update support in it, which we'd need for incremental indexing.
  • A niggle: it can't count total results without fetching them all or running a (possibly expensive) query twice. Lame.

Here are some alternatives. Let's go shopping.

Elasticsearch

Pros:

  • Line-based searching without icky hacks. (Nested queries have been a pain in the neck, but parent-child ones seem like a workable solution.)
  • Caching of intermediate filter results. (It will transparently store the set of, say, all *.cpp files as a bitmap for quick set operations later. This means that common compound queries will likely be much faster than in any RDBMS.)
  • Suggesters for "did you mean" and autocomplete
  • Sparse attribute storage, which could come in handy for supporting new languages
  • Extreme flexibility of indexing: index whatever chars we want, index them any number of ways (case-sensitive or not), and entirely too much more
  • Makes showing (syntax-colored) context around search results feasible—just keep the highlight offsets in the line doc.
  • Efficient counting of the total (not just returned) results, a common request
  • Facilitates generic-identifier searching, an MXR parity feature. (Either merge all the identifiers into a field on each line, or search the function/ref/etc. fields separately.)
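To make the filter-caching and flexible-indexing points concrete, here's a minimal sketch of what a filtered line query could look like. The field names ("content", "ext") and the ES 0.90-era filtered/and syntax are assumptions, not DXR's actual mapping; the point is that the term filter's bitmap gets cached, so repeated "only *.cpp files" narrowing is nearly free.

```python
def line_query(substring, ext=None):
    """Build an ES query body matching line docs by content substring,
    optionally narrowed to one (cached) file-extension filter.

    Field names are hypothetical; this is a shape sketch, not DXR code.
    """
    query = {'match': {'content': substring}}
    if ext is None:
        return {'query': query}
    return {
        'query': {
            'filtered': {
                'query': query,
                # ES caches this term filter's doc-set bitmap, so common
                # compound queries reuse it for quick set operations.
                'filter': {'term': {'ext': ext}},
            }
        }
    }
```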

Nice to have:

  • Parallelization of even single queries
  • Redundancy
  • Scoring

How do we represent 4-dimensional constructs, as in "Find uses of smoo() introduced in the last 2 weeks"? Maybe we could put a nested doc into the line with the function name and the timestamp when it was introduced. That kicks all the hard bits to the indexing phase, which can do some kind of magic with the VCS.
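A hypothetical shape for that idea, assuming the indexer can recover an "introduced" date from the VCS. Every name here (refs, introduced, path, number) is made up for illustration:

```python
from datetime import datetime, timedelta

# A line doc carrying nested ref docs stamped with the date the ref
# was introduced (all field names are assumptions, not DXR's mapping).
example_line_doc = {
    'path': 'widget/smoo.cpp',
    'number': 42,
    'content': '  smoo(1);',
    'refs': [{'name': 'smoo', 'introduced': '2013-09-01T00:00:00'}],
}

def recent_uses_query(name, days=14):
    """Nested query for refs to `name` introduced in the last `days` days."""
    since = (datetime.utcnow() - timedelta(days=days)).isoformat()
    return {
        'nested': {
            'path': 'refs',
            'query': {
                'bool': {
                    'must': [
                        {'term': {'refs.name': name}},
                        {'range': {'refs.introduced': {'gte': since}}},
                    ]
                }
            },
        }
    }
```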

Challenge: Fast Regex Matching

ES lacks out-of-the-box support for trigram-accelerated regex searches. It does offer some regexp query support which uses term indices and some clever enumeration to deliver non-naive performance, but I wonder if it would be enough. I don't see how you can use a term index to accelerate a regex that may span terms. I suppose you could index each line as a single term and then start every non-anchored regex search with .*, but that seems like a great way to have an enormous number of unique terms and be super slow.

Here's how we could get ES up to trilite speeds without too much trouble. If we're willing to extract trigrams ourselves, we could easily tell ES to filter down to the docs containing those, then run a regex prosaically across them. I've done some work toward this. Here's where I left off, and here are some passing tests.

This actually gives us flexibility beyond what trilite provides, in that we have the option of running non-accelerated regex queries, foolish though that may be. And we have full control over the complexity of regexes that we allow, since that isn't buried in re2 any longer.
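For flavor, here's a deliberately conservative sketch of the extraction step, built on the stdlib sre_parse.parse() mentioned in the roadmap below. It mines only top-level literal runs; anything fancier (classes, alternation, repetition, groups) contributes nothing, so an empty result means "can't accelerate, scan everything". A real implementation along re2's Prefilter lines would extract far more.

```python
import sre_parse

def required_trigrams(pattern):
    """Return trigrams that any match of `pattern` must contain.

    Conservative sketch: only top-level literal runs are mined. An empty
    set means no acceleration is possible for this pattern.
    """
    trigrams = set()
    run = []

    def flush():
        # Emit every trigram in the accumulated literal run, then reset.
        for i in range(len(run) - 2):
            trigrams.add(''.join(run[i:i + 3]))
        del run[:]

    for op, arg in sre_parse.parse(pattern):
        if op is sre_parse.LITERAL:
            run.append(chr(arg))
        else:
            flush()  # any non-literal breaks the run
    flush()
    return trigrams
```

So `required_trigrams(r'smoo\(')` yields {'smo', 'moo', 'oo('}, while `required_trigrams(r'a.c')` yields the empty set and falls back to an unfiltered scan.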

Tentative Roadmap

If we moved to ES, here's what our order of operations could be:

  1. Retool query machinery to run on ES and to be line-based. (If speed is awesome even with pathological regexes (unlikely), we can deploy here.)
    • Here’s what I want, eventually. We type a bunch of words into a search field. Along the way, it suggests completions that make identifier names. A search then looks for identifiers (which would now tend to be complete), content substrings, and path segments (or substrings, or sequences of segments?). We AND them together. OR support may come later. ES should be able to support all of that.
    • We could index the pathnames into each line, denormalizing, and always search on lines. That would make those easy to AND together. We don’t even need to mget the files afterward as with parent-child relationships, since every line contains the full path (but not icon or encoding—important? Probably not, if highlighting works, which it should without parent-child.). We could even support search-by-color: just index all the green stuff into an array stored as a separate property of the line. (Highlighting would probably have to be done app-side.)
  2. Build routine to extract trigrams from regexes. (There is no existing work apparent in Python. We could require re2 and call through to its Prefilter::Info::TakeMatch etc., but it doesn't look too hard to implement or too CPU-intense (when you start from the sre_parse.parse() in stdlib); I'd have to do some work in any case to bridge Python to that C routine; and fewer build steps, git submodules, and build-time checkouts make for a lower contributor support load.) Add trigram indices for lines and switch to a filtered query for regexes. Deploy.
  3. Get rid of the rest of the on-disk instance, embed necessary region and ref offsets and payloads into the ES index (out of band with the source code), and build pages at request time. Add caching if needed. Something like config.py might still hang around so we don't have to fetch trivial things like WWW_ROOT over a socket and so we know which ES hosts to connect to.
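The query shape step 2 aims at might look like the following sketch: filter candidates down to lines containing the extracted trigrams, then run the real regexp only over that subset. The "trigrams" field, and leaning on ES's regexp query (whose Lucene syntax differs from PCRE) for the re-check, are assumptions about the eventual mapping.

```python
def accelerated_regex_query(pattern, trigrams):
    """Filtered query: trigram prefilter first, regexp only on survivors.

    `trigrams` would come from our own extractor; the "trigrams" field
    is hypothetical. Note Lucene regexp syntax is not PCRE.
    """
    must = [{'term': {'trigrams': t}} for t in sorted(trigrams)]
    return {
        'query': {
            'filtered': {
                'query': {'regexp': {'content': pattern}},
                'filter': {'bool': {'must': must}},
            }
        }
    }
```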

PostgreSQL

Postgres is the other contender.

Pros:

Trilite Unique Advantages

These are really due to re2.

  • Ability to cap memory use for regex searches. Currently set at 8MB.

Guaranteed linear-time searching with relation to the corpus size—nice for fending off DOSes—is probably not a unique advantage anymore. PG and ES seem to use DFAs rather than backtracking for regex execution these days; the Lucene RegexpQuery is a subclass of AutomatonQuery.

Keeping Outboard Storages Synced

Deployment is easy right now because static HTML files and the SQLite DB are all in the same folder. We do an atomic move of the whole thing into place, and that's our deploy process. For a data store off on a different machine someplace, this gets slightly more (though not too) interesting. We can dump a random identifier into config.py in the instance and make that identifier part of the ES (or whatever) index we build as part of that instance-build. Just make sure the deploy script deletes the old index after it pushes the instance using the new one into production.
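With ES specifically, one way to do the random-identifier dance is to name each build's index uniquely, repoint a stable alias at it on deploy, and delete the old index afterward. The actions body below uses ES's real _aliases API format; the tree and alias names are made up for illustration.

```python
import uuid

def build_index_name(tree):
    """Unique index name tying an ES index to one instance build.

    Assumes one index per indexed tree; the naming scheme is invented.
    """
    return '%s-%s' % (tree, uuid.uuid4().hex)

def alias_swap(alias, old_index, new_index):
    """Actions body for POST /_aliases: atomically repoint `alias`.

    The old index is only detached here; the deploy script would
    delete it in a separate step once the new instance is live.
    """
    actions = []
    if old_index:
        actions.append({'remove': {'index': old_index, 'alias': alias}})
    actions.append({'add': {'index': new_index, 'alias': alias}})
    return {'actions': actions}
```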

Ultimately, it would be nice to get the static HTML out of the instance and build more dynamically, at request time, and cache. We should time our current page builds and see if they are short enough to feasibly do during the request.

Pro/Con Matrix

This helped me get a fresh perspective on my gut feelings about the suitability of each storage.

Request-time rendering

  • ES: Very good. Trivial to store all regions and/or refs in the line as nested documents and fetch them in one query.
  • PG: Fine. There's no reason this shouldn't be as fast as SQLite.
  • SQLite: It's a lot of queries (and dependent subqueries to turn file IDs into pathnames), because the data needed is decidedly non-rectangular. It does seem to be fast, however.

Searching by content

  • ES: Fine. We have to extract trigrams ourselves, but then ES has good trigram indices and automaton-based regex executors.
  • PG: Pretty. Do a simple regex search, which the server builds trigrams from all on its own, and do algebra on sets all you want.
  • SQLite: Ugly. Line-based searching requires either trilite code changes which degeneralize it or a 1-1 table to relate (file, line) to each trilite row. Performance is an unknown. Set math is easy, though.

Searching by structure

  • ES: Good? Model function definitions as child or nested docs of lines. Do highlighting app-side.
  • PG: Fine.
  • SQLite: Fine.

Sorting results by file attrs

  • ES: Bad. The only way is to inline the attr you want to sort by into the child document. That could be > an extra 1GB on moz-central for the pathnames + indexes on them.
  • PG: Good. A simple ORDER BY.
  • SQLite: Good. A simple ORDER BY.

Indexing structure

  • ES: Fine. The clang indexer's generate_callgraph() actually works around SQLite for some reason, loading the whole functions and variables tables into hashes. The inserts it does translate exactly to ES.
  • PG: Same as SQLite.
  • SQLite: Fine.

Data locality

  • ES: Good. Computation is done near the data.
  • PG: Good. Computation is done near the data.
  • SQLite: Weird. Computation is done on the webhead, away from the data, but it performs well because the data is on a super-expensive NAS.

Maintenance

  • ES: Fine. We have to maintain our own trigram extractor. Should be pretty stable, as regex syntax doesn't change.
  • PG: Annoying. We have to fork pg_trgm to make it recognize more chars. Then we have to get it loaded onto presumably our own DB cluster and update it whenever we fix a bug. I don't think we'd have to make changes regularly, but the world is full of surprises.
  • SQLite: Hard. We're the only ones maintaining trilite. It doesn't yet have update/delete support, nor does it store line numbers.

Contributor Impact

  • ES: Good. ES runs fine on OpenJDK. We can automate the installation of both OpenJDK and ES itself in the Vagrant VM, and it's easier to set up both than to get trilite to compile off-VM.
  • PG: Probably a little annoying to build. Need PG dev headers.
  • SQLite: Bad. It's hard for people to compile and persuade to load.

Future flexibility

  • ES: Limited join support requires denormalization, such as duplicating file paths into individual lines. What are the space implications?
  • PG: Good: flexible indexing, the ad hoc-ness of the relational model, an excellent query planner and optimizer, inline types like hstore and JSON, and mature MVCC for incremental indexing.
  • SQLite: Questionable: scaling, concurrency, and constraint checking for incremental indexing (which updates).

Result mixing and highlighting

  • ES: Very good. All the refs and regions can be right there, inline in the line doc. Good highlighting support. Should be very easy to render docs at request time.
  • PG: Meh. Would have to do a lot of separate queries for different kinds of results (pathname, content, structure). Would need a big JSON field on each line to efficiently get 1:n line:{structural element} results out of the DB. Otherwise, it'd be n queries for each of m results.
  • SQLite: Meh. Same as PG.

ANDing and ORing found subsets (so we can do "caller:smoo OR caller:bar")

  • ES: Very good if we don't use parent/child/nested/inner queries.
  • PG: Good.
  • SQLite: Good.