DXR Language-Independent Schema

As DXR gains the ability to index more languages, we're going to need a non-C++-centric schema to hold the data. I don't think the existing schema is horrible, but I'm sure there are C++-isms hiding out in there. This page is a place to enumerate those isms and develop alternatives.

Strawman: jcranmer's documentation schema preview

I started developing a schema for the purpose of adding Doxygen-like support to DXR. The original documentation schema focused on how to organize documentation and didn't consider the needs of other DXR features, so I've extended it slightly to meet those requirements.

The core of the schema is that everything is organized into "entities," an entity being anything that can be reasoned about in documentation (or a cross-reference). Entities break down into four main categories as far as documentation is concerned: files (self-explanatory), namespaces (things that contain other entities but don't have a well-defined single location), aggregates (things that contain other entities but have a meaningful definition), and leaves (things that don't contain other entities). Note that entities also have kind names to distinguish between different aggregates (e.g., C++ structs/classes/macros). All entities are resolved and named by fully-qualified names, which are unique... almost.
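To make that concrete, here's a minimal sketch in Python of what an entity record might look like. All of the names here are invented for illustration; none of them come from the actual schema.

  class Entity(object):
      """A sketch of an entity record; names are illustrative only."""

      # The four documentation-oriented categories described above.
      CATEGORIES = ('file', 'namespace', 'aggregate', 'leaf')

      def __init__(self, qualname, category, kind, location=None):
          assert category in Entity.CATEGORIES
          self.qualname = qualname  # fully-qualified name, e.g. 'mozilla::dom::Element'
          self.category = category  # how documentation organizes the entity
          self.kind = kind          # language-specific kind name: 'class', 'struct', ...
          self.location = location  # (file, line); None for namespaces, which
                                    # have no well-defined single location
          self.children = []        # populated for namespaces and aggregates

  # Entities are resolved by fully-qualified name, which is unique...
  # almost -- hence the merging question below.
  entities = {}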

XXX: Explain how to merge entities

From the standpoint of DXR's more advanced features, the documentation breakdown is insufficient. A more informative breakdown classifies entities into the following categories (sketched in code after the list):

  • Types -- entities that represent a declarable type of a variable, like typedefs, classes, structs, etc.
  • Macros -- entities that represent things that expand into code. C++ templates are a better example here than C macros. Parameters here don't refer to specific values (e.g., they can refer to types or AST subtrees). This means that the code expanded by a macro may define several types, functions, and so on.
  • Functions -- entities that can be called. Needed for the calltree support.
  • Variables -- entities that represent things that can take on values. Types may or may not be statically known (although, even in dynamic languages, we would still like to infer types if they are statically knowable).
  • Constants -- variables whose values are constant and known at compile-time (C/C++ enums, a subset of C macros, etc.)
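A companion sketch (again with invented names) of how a language indexer might map its kind names onto these five categories; the table here is a toy one for C++, and each language indexer would supply its own:

  def classify(kind):
      """Map a language-specific kind name to one of the five categories."""
      table = {
          'class': 'type', 'struct': 'type', 'typedef': 'type', 'enum': 'type',
          'template': 'macro', 'macro': 'macro',
          'function': 'function', 'method': 'function',
          'variable': 'variable', 'field': 'variable',
          'enumerator': 'constant',
      }
      return table[kind]

  print(classify('typedef'))     # -> 'type'
  print(classify('enumerator'))  # -> 'constant'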

Elasticsearch as a Storage Backend

I'd like to get off trilite eventually, for various reasons:

  • We have to maintain it.
  • It's bound to SQLite, which imposes scalability challenges: the webhead has to be the one to execute the expensive queries, and queries can't parallelize beyond a single machine (or even a single processor?).
  • Contributors have trouble building it and have to mess with LD_LIBRARY_PATH.

Elasticsearch is an attractive replacement, having advantages like...

  • Parallelization of even single queries
  • Redundancy
  • Already-implemented update and delete support
  • Caching of intermediate filter results. (It will transparently store the set of, say, all *.cpp files as a bitmap for quick set operations later; see the sketch after this list.)
  • Suggesters for "did you mean"-type functionality
  • Sparse attribute storage
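To illustrate the filter caching mentioned above, a hypothetical query body might look like the following. The ext field and the exact DSL shape are assumptions for this sketch, not something lifted from a real DXR index; the point is that ES caches the term filter's result bitmap, so later queries restricted to *.cpp files reuse it.

  import json

  query = {
      'query': {
          'filtered': {
              'query': {'match': {'content': 'nsCOMPtr'}},
              'filter': {'term': {'ext': 'cpp'}},  # cached as a bitmap
          }
      }
  }
  print(json.dumps(query, indent=2))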

However, it lacks, out of the box, support for trigram-accelerated regex searches. It does offer some regexp query support that uses term indices and some clever enumeration to deliver non-naive performance, but I wonder whether that would be enough: I don't see how you can use a term index to accelerate a regex that may span terms. I suppose you could index each line as a single term and then start every non-anchored regex search with .*, but that seems like a great way to end up with an enormous number of unique terms and be super slow.

Here's how we could get ES up to trilite speeds without too much trouble. If we're willing to extract trigrams ourselves, we can easily tell ES to filter down to the docs containing those, then run a regex prosaically across them. And it just so happens that Python's stdlib regex machinery is written in Python. (I couldn't believe it either.) Take a look at sre_compile.py and sre_parse.py. They literally build a Python list of instructions, like ['charset', 'groupref_ignore', ...], and then pass that to _sre.compile(), which is the only C speedup I see. Presumably, that returns a fast matcher of some kind, with guts in machine language.

So, we harness that existing regex parser, extract trigrams, and go to town. This actually gives us flexibility beyond what trilite provides, in that we have the option of running non-accelerated regex queries, foolish though that may be. And we have full control over the complexity of regexes that we allow, since that isn't buried in re2 any longer. At any rate, it's something to consider.
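A rough sketch of the idea, under some loud assumptions: the trigrams field and the query shape are invented, and the extractor walks only the top level of the parse tree, where a real one would recurse into groups and branches. sre_parse is the stdlib module mentioned above.

  import re
  import sre_parse
  from sre_constants import LITERAL

  def literal_runs(pattern):
      """Collect runs of consecutive literal chars every match must contain.

      Deliberately conservative: break the run at anything that isn't a
      plain literal (classes, repetition, alternation).
      """
      runs, run = [], []
      for op, arg in sre_parse.parse(pattern):
          if op == LITERAL:
              run.append(chr(arg))
          else:
              if run:
                  runs.append(''.join(run))
              run = []
      if run:
          runs.append(''.join(run))
      return runs

  def trigrams(pattern):
      """Break the literal runs into the trigrams to hand to ES."""
      return {run[i:i + 3]
              for run in literal_runs(pattern)
              for i in range(len(run) - 2)}

  def candidate_query(pattern):
      """Build a hypothetical ES body: only docs containing every trigram.

      Assumes we populate a 'trigrams' field ourselves at indexing time.
      """
      return {'query': {'bool': {'must': [
          {'term': {'trigrams': g}} for g in sorted(trigrams(pattern))
      ]}}}

  pattern = r'static_cast<\s*Foo'
  print(sorted(trigrams(pattern)))  # ['Foo', '_ca', 'ast', ...]
  # ES only narrows the candidates; the compiled regex remains the
  # authoritative check, run prosaically over each surviving line:
  matcher = re.compile(pattern)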

PostgreSQL

Postgres is the other contender. As of 9.3, it supports trigram-accelerated regex searches, but the trigram extraction is sadly hard-coded to word-like characters. We'd probably have to hack the source to circumvent that.
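For comparison, the Postgres route would look something like this sketch; the lines table and its columns are invented, and psycopg2 is assumed as the driver. pg_trgm's gin_trgm_ops operator class is what lets the planner use trigrams to accelerate the ~ (regex) operator in 9.3+.

  import psycopg2  # assumes a PostgreSQL 9.3+ server with pg_trgm available

  conn = psycopg2.connect('dbname=dxr')  # hypothetical database
  cur = conn.cursor()

  # The GIN operator class is what enables trigram acceleration of ~:
  cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
  cur.execute("CREATE INDEX lines_content_trgm "
              "ON lines USING gin (content gin_trgm_ops)")
  conn.commit()

  # The word-like-characters limitation means trigrams containing
  # punctuation never make it into the index, so a pattern like this is
  # accelerated only by its alphanumeric runs:
  cur.execute("SELECT path, line_number, content FROM lines "
              "WHERE content ~ %s",
              (r'static_cast<\w+>',))
  for path, line_number, content in cur.fetchall():
      print('%s:%d: %s' % (path, line_number, content))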