24
edits
Mindboggler (talk | contribs) No edit summary |
Mindboggler (talk | contribs) |
||
| Line 53: | Line 53: | ||
== Detailed Design == | == Detailed Design == | ||
Classes | Classes<br> | ||
There are essentially five classes for the back-end: | There are essentially five classes for the back-end:<br> | ||
1) nsNavHistoryQuery | 1) nsNavHistoryQuery<br> | ||
2) nsNavFullTextIndexHelper | 2) nsNavFullTextIndexHelper<br> | ||
3) nsNavFullTextIndex | 3) nsNavFullTextIndex<br> | ||
4) nsNavFullTextTokenizer | 4) nsNavFullTextTokenizer<br> | ||
5) nsNavFullTextAnalyzer | 5) nsNavFullTextAnalyzer<br> | ||
nsNavHistoryQuery | nsNavHistoryQuery<br> | ||
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | ||
nsNavFullTextIndex | nsNavFullTextIndex<br> | ||
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | ||
Block is a struct used to encode and decode block. | Block is a struct used to encode and decode block. A block is variable length delta encoded. | ||
Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement | |||
struct Block { | struct Block {<br> | ||
: //The width of the field is 250 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.<br> | |||
: int encode(in int[] data, out byte[] encodedBlock) {<br> | |||
:: //How to encode more efficiently, any idea?<br> | |||
:: int[] bigEndian;<br> | |||
:: int k = 0;<br> | |||
:: data[i - 1] = 0;<br> | |||
:: for(int i = 0; i < data.length; i++) {<br> | |||
::: data[i] -= data[i - 1];<br> | |||
::: int j = 0;<br> | |||
} | ::: while(data[i] != 0) {<br> | ||
:::: bigEndian[j++] = data[i] % 128;<br> | |||
:::: data[i] /= 128;<br> | |||
::: }<br> | |||
::: for( ; j > 0; j--, k++) {<br> | |||
:::: encodedBlock[k] = (1 << 8) & bigEndian[j];<br> | |||
::: }<br> | |||
::: encodedBlock[k++] = bigEndian[0];<br> | |||
::: if (k > 255) <br> | |||
:::: return i - 1<br> | |||
::: }<br> | |||
::}<br> | |||
: void decode(in byte[] encodedBlock, out int[] data) {<br> | |||
:: int j = 0;<br> | |||
:: for(int i = 0; i < encodedBlock.length; i++) {<br> | |||
::: if (encodedBlock[i] && (1 << 8)) {<br> | |||
::: data[j] *= 128;<br> | |||
::: data[j] += encodedBlock[i] & ((1 << 8) - 1);<br> | |||
:: }<br> | |||
:: else {<br> | |||
::: data[j] += data[j - 1]; //Because it is delta encoded <br> | |||
::: data[j++] = encodedBlock[i];<br> | |||
:: }<br> | |||
: }<br> | |||
}<br> | |||
}<br> | |||
nsNavFullTextIndexHelper | AddDocument(connection, document, analyzer) {<br> | ||
AddDocument in two pass. Scan the document for terms and then invert. | |||
we have the doc id. We will require a hash map library to make it very efficient | |||
term is a hash map.<br> | |||
Usage: term['termname'] = array of pos. | |||
while(analyzer.hasTerms()) {<br> | |||
cTerm = analyzer.nextTerm();<br> | |||
term[cTerm.Name].add(cTerm.pos);<br> | |||
}<br> | |||
iterator i = term.iterator();<br> | |||
while(i.hasNext()) {<br> | |||
termName = i.next();<br> | |||
termPos[] = term[termName];<br> | |||
//hopefully sqlite caches query results, it will be inefficient otherwise.<br> | |||
if (term already in table) <br> | |||
retrieve the last record, update the record(if necessary split)<br> | |||
else<br> | |||
create a new record, if necessary cache it for performance; <br> | |||
TODO: Expand the algorithm, too generic here <br> | |||
if (term already in table) { <br> | |||
records = executeQuery("SELECT firstdoc, flags, block FROM postings WHERE word = “box” AND firstdoc >= (SELECT max (firstdoc) FROM postings WHERE word = “box” AND flags < 128)");<br> | |||
Block b = new Block();<br> | |||
}<br> | |||
else {<br> | |||
}<br> | |||
}<br> | |||
commit to the database<br> | |||
}<br> | |||
RemoveDocument() {<br> | |||
inherently inefficient because of the structure that we have adopted. Needs to be optimized.<br> | |||
}<br> | |||
nsNavFullTextIndexHelper<br> | |||
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works ** | This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works ** | ||
nsNavFullTextTokenizer | nsNavFullTextTokenizer<br> | ||
The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages. | The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages. | ||
nsNavFullTextAnalyzer | nsNavFullTextAnalyzer<br> | ||
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. | The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. | ||
edits