User:Jcranmer/DXR API proposal

DXR is currently heavily geared towards indexing C++ in the Mozilla source tree. Working with non-Mozilla or even non-C++ code currently requires a fair amount of hacking. This page is a proposal for the changes that need to be made to support a wider variety of programs.

For the most part, I am focusing on the back-end indexing steps here. Extensions would obviously need to support front-end web stuff as well, but since the HTML is statically generated for now, the major front-end APIs to deal with are direct queries and controlling what is returned when a link is clicked.

Extension points

As I see it, there are several extension points to consider:

  1. Custom build steps.
  2. Indexing multiple languages. This requires some care:
    • Languages like C++ or Java could most likely be indexed by effectively replacing the compiler and building
    • Languages like Perl or JavaScript (+ HTML) are not compiled, so one would need to run a tool on some list of files
    • In any case, indexed files also need to be processed for display, which essentially means producing a syntax-highlighted version of them.
  3. Different information. This also comes in many forms:
    1. Grep the stdout/stderr of the build: warning extraction would do this (see the sketch after this list).
    2. Entity extraction. This is what the basic DXR focuses on for now, finding the types of variables and functions.
    3. Other information extractable during a compile phase: for example, callgraph uses dehydra to figure out which functions a certain function calls
    4. Run a tool on the source files: this is basically the last step in the interpreted language case
    5. Run some more steps and process the results (both stdout/stderr and post-processing). gcov information is the example here
    6. Pull data from an external source, such as a bugzilla instance
    7. A merge of two DXR databases (e.g., one in release mode and the other in debug mode. Or per-OS configuration, etc.)
  4. Using different databases. SQLite may be easy to pass around, but if I'm indexing something of the scale of "all of KDE", I would probably want a more powerful database like MySQL
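
To illustrate the grep-the-build-output case (3.1), here is a minimal sketch that scans captured stdout/stderr for GCC-style "file:line:col: warning: message" diagnostics; the function and pattern are hypothetical, not existing DXR code.

import re

WARNING_RE = re.compile(
    r'^(?P<file>[^:\s]+):(?P<line>\d+):(?:\d+:)?\s*warning:\s*(?P<msg>.*)$')

def extract_warnings(build_output):
    """Yield (file, line, message) for each warning in the build output."""
    for line in build_output.splitlines():
        m = WARNING_RE.match(line)
        if m:
            yield m.group('file'), int(m.group('line')), m.group('msg')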

Build support

There appears to be a basic build procedure which consists of:

  1. Unpatch source
  2. Update source
  3. Clobber build
  4. Patch source
  5. Configure build
  6. Build
  7. Post-process

Basic steps to include universally (a driver sketch follows the list):

  • Unpatch: ??? how to do this? patch -R?
  • Update: svn, hg, git, bzr, and cvs update cover most cases; uscan does stuff for watching releases...
  • Clobber (allows both rm -rf and make {insert target})
  • PatchSource
  • StandardConfigure (./configure)
  • CmakeConfigure (cmake)
  • StandardMake (for build step)
  • CustomShell, CustomMake (for "other")
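
As noted above, here is a minimal sketch of driving a few of these steps; all of the names are made up rather than existing DXR API, and unpatch/update/patch and post-processing are elided.

import shutil
import subprocess

def clobber(objdir):
    # The `rm -rf` flavor; a `make <target>` flavor would shell out instead.
    shutil.rmtree(objdir, ignore_errors=True)

def standard_configure(srcdir):
    subprocess.check_call(['./configure'], cwd=srcdir)

def standard_make(srcdir):
    subprocess.check_call(['make'], cwd=srcdir)

def run_build(srcdir, objdir):
    # Steps 3, 5, and 6 of the basic build procedure.
    clobber(objdir)
    standard_configure(srcdir)
    standard_make(srcdir)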

The kinds of steps present depend on how language and analysis intersect with the build process. This is what I imagine a build step would look like in C++, API-wise:

#include <string>
#include <vector>

class Environment;

/* Build, Patch, Configure, Clobber, Update, ??? */
enum StepType { Build, Patch, Configure, Clobber, Update };

class BuildStep {
public:
  virtual ~BuildStep() {}

  /* Called to initialize */
  void setInformation(int stepNumber, const Environment &env);
  /* Get something from the config file */
  std::string getConfigOption(const std::string &option,
                              const std::string &defaultValue);
  /* Set an environment variable; NULL removes it */
  void setEnvironment(const std::string &option, const char *value);
  /* Runs a process, in cwd with the environment */
  void executeProcess(const std::string &executable, const std::string &cwd,
                      const std::vector<std::string> &args);

  /* What kind of step this is */
  virtual StepType getType() = 0;
  /* Actually run stuff */
  virtual void run() = 0;
};
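
To make this concrete, here is a rough sketch of what a specific step (StandardMake from the list above) might look like against a hypothetical Python mirror of this interface; none of these names are existing DXR API.

import os
import subprocess

class BuildStep:
    """Hypothetical Python rendering of the C++ interface above."""
    def set_information(self, step_number, env):
        self.step_number = step_number
        self.env = dict(env)

    def execute_process(self, executable, cwd, args):
        # Run the tool in cwd with this step's environment.
        subprocess.check_call([executable] + list(args), cwd=cwd, env=self.env)

class StandardMake(BuildStep):
    """The StandardMake step: the build phase is just `make`."""
    def get_type(self):
        return 'Build'

    def run(self):
        self.execute_process('make', os.getcwd(), [])

# Usage: the driver numbers the step and hands it the environment.
step = StandardMake()
step.set_information(6, os.environ)
step.run()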

Language support

I think a distinction needs to be made between compilable languages (those where we just need to replace the compiler with a static-analysis-aware one) and noncompilable languages (those where we have to run a tool on some list of source code). These names aren't exactly the best, since the true distinction is whether we can override the tool used via environment or configure options.

For maximum pluggability, it makes sense for each language to have an analysis driver that merely loads and runs multiple scripts (configurably, of course). This means that the default script would probably be something like static-checking.js.
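
A sketch of that driver, assuming a hypothetical config mapping and using Python scripts purely for illustration (the actual default would be something like the static-checking.js mentioned above):

import runpy

def run_analysis_scripts(config):
    # The driver knows nothing about the analyses themselves; it just
    # loads and runs whatever scripts the config lists.
    for script in config.get('analysis_scripts', ['static-checking.py']):
        runpy.run_path(script)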

What is a language responsible for?

  1. If it's compilable, it knows the environment variables it needs to set to run the tools.
  2. If it's not compilable, it knows which files to run the tool on (and the tool to run, of course). But how do we indicate in the build procedure that this needs to be done? Mozilla JS would probably need a full make jsexport in order to work properly; a smaller project like my extension could probably get by with a grep. We should optimize for the common case, though.
  3. Languages may require post processing.
  4. Languages know how to convert the source code to HTML.
  5. Languages know how to find the source files; these are not necessarily the files the tool was run on (*cough* jsexport *cough*). An interface covering these responsibilities is sketched after this list.
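
As promised, a sketch of an interface covering these responsibilities; this is hypothetical, not existing DXR code.

from abc import ABC, abstractmethod

class LanguagePlugin(ABC):
    def build_environment(self):
        """Compilable languages: environment variables that swap in the
        analysis-aware tools (responsibility 1)."""
        return {}

    def index_files(self, files):
        """Noncompilable languages: run the analysis tool on the given
        files (responsibility 2)."""

    def post_process(self):
        """Optional post-processing hook (responsibility 3)."""

    @abstractmethod
    def htmlify(self, path):
        """Produce the (lines, regions, links) tuple described under
        Source to HTML below (responsibility 4)."""

    @abstractmethod
    def find_sources(self, tree_root):
        """The files to display, which are not necessarily the files the
        tool ran on (responsibility 5)."""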

Ultimately, languages are about extracting entities from code. We probably need to define what entities we actually care about. This may be theoretically impossible.
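
If we do attempt a definition, the least common denominator might be a record as small as this; the fields are purely illustrative.

from dataclasses import dataclass

@dataclass
class Entity:
    kind: str   # e.g. 'type', 'function', 'variable'
    name: str   # the identifier as written in the source
    file: str   # where it was found...
    line: int   # ...and on which line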

Source to HTML

In terms of syntax highlighting, vim divides things into comment, constant (such as "asdf", true, 1230), identifier (not used in C++ highlighting, but basically known identifier names), statement (pretty much any keyword that's not a constant or a type), type (typenames and storage classes, etc.), and special (preprocessor mostly).

I'm not particularly attached to colors (I use a black background in my terminal, so the blue/yellow show up better), but, excluding identifier, the progression seems nice. If it's good enough to work with vim, it's probably good enough to be generalizable.

The language plugin would need a method to take a source file and HTMLify it. The result would be the following tuple:

(
  [line1, line2, line3, etc.],
  [((1,1), (36,35), COMMENT), ((37,1), (37,30), SPECIAL), etc.],
  [((37,11), (37,29), something), etc.]
)

The first entry is the lines of the file, without newlines: basically read().split('\n'). The second is the list of syntax-highlighting tokens: the first element in each tuple is the (line, col) start, the second the end, and the third the type (comment, special, keyword, constant, type). The last entry is basically the list of things to click on; I'm not sure what is actually passed back here, since I'm not sure how DXR handles the links. Clickable links must lie entirely within a line and must not overlap multiple syntax-highlighting regions, and syntax-highlighting elements must be disjoint regions.

Whoever calls the method will take the results and turn them into the standard HTML output. The header comes from a standard page template, and the left sidebar is produced from reading the database. The main body is produced by combining the three lists of results into one piece of HTML. In short, the language plugin itself doesn't need to know HTML.
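
For illustration, here is a sketch of that combining step for the main body; it assumes 1-based, inclusive (line, col) coordinates as in the tuple above, and it ignores the clickable-link list for brevity.

import html

def to_html_lines(lines, regions):
    # Split multi-line syntax regions into per-line fragments so that
    # each output line can be rendered independently.
    per_line = [[] for _ in lines]
    for (l1, c1), (l2, c2), kind in regions:
        for l in range(l1, l2 + 1):
            start = c1 if l == l1 else 1
            end = c2 if l == l2 else len(lines[l - 1])
            per_line[l - 1].append((start, end, kind))

    out = []
    for text, spans in zip(lines, per_line):
        pieces, col = [], 1
        for start, end, kind in sorted(spans):  # regions are disjoint
            pieces.append(html.escape(text[col - 1:start - 1]))
            pieces.append('<span class="%s">%s</span>'
                          % (kind.lower(), html.escape(text[start - 1:end])))
            col = end + 1
        pieces.append(html.escape(text[col - 1:]))
        out.append(''.join(pieces))
    return out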

One thing to note, though, is that there may need to be a notion of "linking" different syntax highlighters: for example, HTML pages can contain embedded JS. This may be sufficiently uncommon that it's better just to let the specific plugins worry about it.

Analysis support

Information support

Database support

The database plugin needs to be able to open a database for a specific tree. Each database needs to support schema modification as appropriate, and must also allow queries.
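
In interface terms, that might amount to something like the following sketch; it is hypothetical, with the idea that SQLite and MySQL backends would each implement it.

from abc import ABC, abstractmethod

class DatabasePlugin(ABC):
    @abstractmethod
    def open(self, tree_name):
        """Open (or create) the database for a specific tree."""

    @abstractmethod
    def ensure_table(self, name, columns):
        """Create or alter a table, so plugins can evolve the schema."""

    @abstractmethod
    def query(self, sql, params=()):
        """Run a parameterized query and return the rows."""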

Runtime considerations

Above all, using DXR should be simple. From a packaging perspective, the user would edit the config file to indicate information about the source (being able to say "you want one of these <n where n is a small number> setups for a tree" for most cases would be wonderful).
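
For example, the config file might end up looking something like this; every section, key, and value here is hypothetical.

[tree]
name     = my-extension
source   = /src/my-extension
setup    = autoconf-cxx   ; one of the small number of canned setups
database = sqlite         ; or mysql for something the scale of all of KDE

[build]
configure_args = --enable-debug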