DXR Storages
ES lacks out-of-the-box support for trigram-accelerated regex searches. It does offer [https://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html some regexp query support], which uses term indices and [https://lucene.apache.org/core/4_3_0/core/index.html?org%2Fapache%2Flucene%2Futil%2Fautomaton%2FRegExp.html some clever enumeration] to deliver non-naive performance, but I wonder whether that would be enough. I don't see how a term index can accelerate a regex that may span terms. I suppose you could index each line as a single term and then start every non-anchored regex search with <code>.*</code>, but that seems like a great way to end up with an enormous number of unique terms and be super slow.
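For concreteness, here is roughly what such a regexp query body looks like. This is a sketch, not DXR code: the <code>content</code> field name is made up, and <code>max_determinized_states</code> is the Elasticsearch/Lucene knob that bounds how large the compiled regex automaton may grow, which would be one way to cap the complexity of regexes we allow.

```python
import json

def regexp_query(field, pattern, max_states=10000):
    """Build an Elasticsearch regexp query body (hypothetical field name).

    max_determinized_states limits how big Lucene's determinized
    automaton for the pattern can get before the query is rejected,
    bounding worst-case cost.
    """
    return {
        "query": {
            "regexp": {
                field: {
                    "value": pattern,
                    "max_determinized_states": max_states,
                }
            }
        }
    }

print(json.dumps(regexp_query("content", "tri.*mer"), indent=2))
```

The catch remains that this query runs against individual terms, not whole lines, which is exactly the span-terms problem described above.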


Here's how we could get ES up to trilite speeds without too much trouble. If we're willing to extract trigrams ourselves, we could easily tell ES to filter down to the docs containing those, then run a regex prosaically across them. I've done some work toward this. Here's [https://github.com/erikrose/dxr/blob/trigrammer/dxr/trigrammer.py#L218 where I left off], and here are [https://github.com/erikrose/dxr/blob/trigrammer/tests/test_trigrammer.py some passing tests].
 
This actually gives us flexibility beyond what trilite provides, in that we have the option of running non-accelerated regex queries, foolish though that may be. And we have full control over the complexity of regexes that we allow, since that isn't buried in <code>re2</code> any longer.
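The trigram-prefilter idea can be sketched in a few lines of Python. This is a toy stand-in for the linked trigrammer work, not its actual algorithm: the extractor below only harvests trigrams from a leading literal run of the pattern, whereas a real extractor would walk the parsed regex tree and handle alternation, repetition, and so on.

```python
import re

def trigrams(s):
    """All overlapping 3-character substrings of a literal string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def required_trigrams(regex_source):
    """Naive extraction: take trigrams only from the pattern's leading
    literal characters, stopping at the first metacharacter. A real
    extractor would derive a boolean combination of trigrams from the
    whole parse tree."""
    literal = re.match(r'[A-Za-z0-9_ ]*', regex_source).group()
    return trigrams(literal)

def search(regex_source, docs):
    """Filter docs down to those containing every required trigram,
    then confirm candidates with an ordinary regex match."""
    needed = required_trigrams(regex_source)
    pattern = re.compile(regex_source)
    candidates = [d for d in docs if all(t in d for t in needed)]
    return [d for d in candidates if pattern.search(d)]

docs = ["def frobnicate():", "class Widget:", "frob = 1"]
print(search(r"frob\w*", docs))  # skips "class Widget:" without running the regex on it
```

With ES doing the "which docs contain these trigrams" step via a terms filter, only the final confirmation pass needs to touch document text.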


=== Tentative Roadmap ===