
DXR Storages

699 bytes added, 05:52, 24 February 2014
Elasticsearch: Add a tentative roadmap for moving to ES.
Here's how we could get ES up to trilite speeds without too much trouble. If we're willing to extract trigrams ourselves, we can easily tell ES to filter down to the docs containing those, then run the regex in the ordinary way across the survivors. And it just so happens that Python's stdlib regex machinery is largely written in Python. (I couldn't believe it either.) Take a look at sre_compile.py and sre_parse.py. They literally build a Python list of instructions, like <code>['charset', 'groupref_ignore', ...]</code>, and then pass that to <code>_sre.compile()</code>, which is the only C speedup in sight. Presumably that returns a fast matcher of some kind, with guts in machine language.

So we harness that existing regex parser, extract trigrams, and go to town. This actually gives us flexibility beyond what trilite provides, in that we have the option of running non-accelerated regex queries, foolish though that may be. And we have full control over the complexity of the regexes we allow, since that logic is no longer buried in <code>re2</code>. At any rate, it's something to consider.
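To make the trigram-extraction idea concrete, here's a minimal sketch that walks the instruction list sre_parse builds and collects trigrams from top-level literal runs. The function names (<code>literal_runs</code>, <code>required_trigrams</code>) are illustrative, not existing DXR code, and a real implementation would also have to descend into groups and treat alternations conservatively.

```python
# Walk the parse tree Python's own regex compiler builds (sre_parse) and
# pull out trigrams that any match is guaranteed to contain.
try:
    import sre_parse  # the stdlib parser used by re.compile()
except ImportError:  # renamed to a private module in newer Pythons
    from re import _parser as sre_parse


def literal_runs(pattern):
    """Yield runs of consecutive top-level literal characters in `pattern`."""
    run = []
    for op, arg in sre_parse.parse(pattern):
        if op is sre_parse.LITERAL:  # a single fixed character
            run.append(chr(arg))
        else:  # charset, repeat, group, etc. breaks the run
            if run:
                yield ''.join(run)
            run = []
    if run:
        yield ''.join(run)


def required_trigrams(pattern):
    """Trigrams every match of `pattern` must contain (from literal runs)."""
    grams = set()
    for run in literal_runs(pattern):
        for i in range(len(run) - 2):
            grams.add(run[i:i + 3])
    return grams
```

For example, <code>required_trigrams(r"abcd.*ef")</code> yields <code>{'abc', 'bcd'}</code>: the "ef" run is too short to contribute a trigram, and the <code>.*</code> contributes nothing, but any matching line must still contain "abc" and "bcd", so those are safe to require in the ES filter.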
 
=== Tentative Roadmap ===
 
If we moved to ES, here's what our order of operations could be:
 
# Retool the query machinery to run on ES and to be line-based. (If speed turns out to be acceptable even with pathological regexes, which is unlikely, we could deploy at this point.)
# Build routine to extract trigrams from regexes. Add trigram indices for lines and switch to a filtered query for regexes. Deploy.
# Get rid of the rest of the on-disk instance, embed necessary region and ref offsets and payloads into the ES index (out of band with the source code), and build pages at request time. Add caching if needed. Something like config.py might still hang around so we don't have to fetch trivial things like WWW_ROOT over a socket.
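As a sketch of step 2's query shape, here's what the filtered query could look like in the ES 1.x idiom current at the time of writing. The <code>trigrams</code> field name and the one-document-per-line schema are assumptions, not DXR's actual mapping; the app would still run the real regex over the candidate lines, since the trigram filter is necessary but not sufficient.

```python
# Build an ES 1.x filtered query that narrows line documents to those
# containing every required trigram. Field names are hypothetical.
def trigram_filtered_query(grams):
    """Return an ES query body requiring all trigrams in `grams`."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "bool": {
                        # One term filter per required trigram; all must hit.
                        "must": [{"term": {"trigrams": g}}
                                 for g in sorted(grams)],
                    }
                },
            }
        }
    }
```

The candidates that come back then get the compiled <code>_sre</code> matcher run over them in the request handler, which is exactly the division of labor trilite does internally.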
== PostgreSQL ==