DXR Storages

4,902 bytes added, 21:35, 6 March 2014
Add pro/con matrix.
Ultimately, it would be nice to get the static HTML out of the instance and build more dynamically, at request time, and cache. We should time our current page builds and see if they are short enough to feasibly do during the request.
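A rough way to take that measurement is sketched below. It is only an illustration: <code>build_page</code> stands in for whatever callable renders one file's page, and the 0.2-second budget is an assumed placeholder, not a number from our measurements.

```python
import time


def time_builds(build_page, paths, budget_seconds=0.2):
    """Time build_page(path) for each path.

    build_page is a hypothetical callable that renders one page, and
    budget_seconds is an assumed request-time budget.
    Returns (timings, slow): seconds per path, and the sorted list of
    paths whose build took longer than the budget.
    """
    timings = {}
    for path in paths:
        start = time.perf_counter()
        build_page(path)
        timings[path] = time.perf_counter() - start
    slow = sorted(p for p, t in timings.items() if t > budget_seconds)
    return timings, slow
```

Running this over a sample of real files would show whether the slow tail is small enough to cover with caching alone.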
 
== Pro/Con Matrix ==
This was supposed to help me get a fresh perspective on my gut feelings about the suitability of each storage.
 
<table cellspacing="0" cellpadding="5" border="1">
<tr>
<th>
Task
</th>
<th>
ES
</th>
<th>
PG
</th>
<th>
SQLite
</th>
</tr>
<tr>
<th>
Request-time rendering
</th>
<td>
Very good. Trivial to store all regions and/or refs in the Line as nested documents and fetch them in one query.
</td>
<td>
Fine. There's no reason this shouldn't be as fast as SQLite.
</td>
<td>
It's a lot of queries (and dependent subqueries to turn file IDs into pathnames), because the data needed is decidedly non-rectangular. It does seem to be fast, however.
</td>
</tr>
<tr>
<th>
Searching by content
</th>
<td>
Fine. We have to extract trigrams ourselves, but then ES has good trigram indices and automaton-based regex executors.
</td>
<td>
Pretty. Do a simple regex search, from which the server builds trigrams all on its own, and do all the set algebra you want.
</td>
<td>
Ugly. Line-based searching requires either trilite code changes which degeneralize it or a 1-1 table to relate (file, line) to each trilite row. Performance is an unknown. Set math is easy, though.
</td>
</tr>
<tr>
<th>
Searching by structure
</th>
<td>
Good? Model function definitions as child or nested docs of lines. Do highlighting app-side.
</td>
<td>
Fine
</td>
<td>
Fine
</td>
</tr>
<tr>
<th>
Sorting results by file attrs
</th>
<td>
Bad. The only way is to inline the attr you want to sort by into the child document. That could add more than 1GB on moz-central for the pathnames plus the indexes on them.
</td>
<td>
Good. A simple ORDER BY.
</td>
<td>
Good. A simple ORDER BY.
</td>
</tr>
<tr>
<th>
Indexing structure
</th>
<td>
Fine. The clang indexer's generate_callgraph() actually works <span class="" style="font-style: italic;">around</span> SQLite for some reason, loading the whole functions and variables tables into hashes. The inserts it does translate exactly to ES.
</td>
<td>
Same as SQLite.
</td>
<td>
Fine
</td>
</tr>
<tr>
<th>
Data locality
</th>
<td>
Good. Computation is done near the data.
</td>
<td>
Good. Computation is done near the data.
</td>
<td>
Weird. Computation is done on the webhead, away from the data, but it performs well because the data is on a super-expensive NAS.
</td>
</tr>
<tr>
<th>
Maintenance
</th>
<td>
Fine. We have to maintain our own trigram extractor. Should be pretty stable, as regex syntax doesn't change.
</td>
<td>
Annoying. We have to fork pg_trgm to make it recognize more chars. Then we have to get it loaded onto presumably our own DB cluster and update it whenever we fix a bug. I don't think we'd have to make changes regularly, but the world is full of surprises.
</td>
<td>
Hard. We're the only ones maintaining trilite. It doesn't yet have update/delete support, nor does it store line numbers.
</td>
</tr>
<tr>
<th>
Contributor Impact
</th>
<td>
Good. ES runs fine on OpenJDK. We can automate the installation of both OpenJDK and ES in the Vagrant VM, and setting up both is easier than getting trilite to compile off-VM.
</td>
<td>
Probably a little annoying to build. Need PG dev headers.
</td>
<td>
Bad. trilite is hard for people to compile and to persuade to load.
</td>
</tr>
<tr>
<th>
Future flexibility
</th>
<td>
Limited join support requires denormalization, such as duplicating file paths into individual lines. What are the space implications?
</td>
<td>
Good: flexible indexing, the ad hoc-ness of the relational model, an excellent query planner and optimizer, inline types like hstore and JSON, and mature MVCC for incremental indexing
</td>
<td>
Questionable: scaling, concurrency, and constraint checking for incremental indexing (which requires updates).
</td>
</tr>
<tr>
<th>
Result mixing and highlighting
</th>
<td>
Very good. All the refs and regions can be right there, inline in the line doc. Good highlighting support. Should be very easy to render docs at request time.
</td>
<td>
Meh. Would have to do a lot of separate queries for different kinds of results (pathname, content, structure). Would need to use a big JSON field on line to efficiently get 1:n results of line:{structural element} out of the DB. Otherwise, it'd be n queries for each of m results.
</td>
<td>
Meh. Same as PG.
</td>
</tr>
<tr>
<th>
ANDing and ORing found subsets (so we can do "caller:smoo OR caller:bar")
</th>
<td>
Very good if we don't use parent/child/nested/inner queries
</td>
<td>
Good
</td>
<td>
Good
</td>
</tr>
</table>
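To make the "Searching by content" and set-math rows concrete, here is a toy in-memory sketch of trigram-based line search. It is not how ES, pg_trgm, or trilite work internally, and it handles only literal needles rather than regexes; the function names and the <code>(path, line_number)</code> key shape are assumptions made up for the illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(lines):
    """Map each trigram to the set of (path, line_number) keys containing it.

    lines is a dict of {(path, line_number): line_content}.
    """
    index = defaultdict(set)
    for key, content in lines.items():
        for gram in trigrams(content):
            index[gram].add(key)
    return index


def search(index, lines, needle):
    """Return sorted keys of lines containing needle, a literal of 3+ chars."""
    grams = trigrams(needle)
    if not grams:
        raise ValueError("needle must be at least 3 characters long")
    # Intersect the posting sets to get candidates, then confirm each one,
    # because sharing every trigram does not guarantee a substring match.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(k for k in candidates if needle in lines[k])
```

Because each search returns plain keys, ORing or ANDing two result sets is ordinary set math, for example <code>set(search(index, lines, "smoo")) | set(search(index, lines, "bar("))</code>.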