Places:Full Text Indexing

== Detailed Design ==


Classes<br>


There are essentially five classes for the back-end:<br>
1) nsNavHistoryQuery<br>
2) nsNavFullTextIndexHelper<br>
3) nsNavFullTextIndex<br>
4) nsNavFullTextTokenizer<br>
5) nsNavFullTextAnalyzer<br>


nsNavHistoryQuery<br>


This class is already implemented with all features except text search. The search mechanism is no different from what is described in http://developer.mozilla.org/en/docs/Places:Query_System. nsNavHistoryQuery::search performs a full-text search using the terms in nsNavHistoryQuery::searchTerms. Combined with the many other query options that can be specified, this is very flexible; conjunctive queries (where every term must match) can be executed.
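As an illustration of the caller side, here is a minimal sketch, not part of this design, that assumes the existing nsINavHistory* interfaces and a hypothetical search string "mozilla roadmap":

 #include "nsCOMPtr.h"
 #include "nsServiceManagerUtils.h"
 #include "nsString.h"
 #include "nsINavHistoryService.h"

 nsresult RunFullTextQuery()
 {
   nsCOMPtr<nsINavHistoryService> history =
     do_GetService("@mozilla.org/browser/nav-history-service;1");

   // Build a query whose searchTerms trigger the full-text path.
   nsCOMPtr<nsINavHistoryQuery> query;
   history->GetNewQuery(getter_AddRefs(query));
   query->SetSearchTerms(NS_LITERAL_STRING("mozilla roadmap"));

   nsCOMPtr<nsINavHistoryQueryOptions> options;
   history->GetNewQueryOptions(getter_AddRefs(options));

   // The result enumerates only the pages whose indexed text matches the terms.
   nsCOMPtr<nsINavHistoryResult> result;
   return history->ExecuteQuery(query, options, getter_AddRefs(result));
 }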


nsNavFullTextIndex<br>


This class interacts with the SQLite database. It implements the algorithms for adding a document to the index, removing a document from the index, and searching for a given term. The search function is used by the nsNavHistoryQuery::search function. See [2] for the algorithm used.
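For the conjunctive queries mentioned above, the per-term doc id lists returned by this search have to be intersected. A minimal sketch of that step, assuming the lists are already decoded and sorted (this helper is an illustration, not part of the design):

 #include <algorithm>
 #include <iterator>
 #include <vector>

 // Keep only the documents that appear in both per-term postings lists.
 std::vector<int> IntersectPostings(const std::vector<int>& aTermA,
                                    const std::vector<int>& aTermB)
 {
   std::vector<int> both;
   std::set_intersection(aTermA.begin(), aTermA.end(),
                         aTermB.begin(), aTermB.end(),
                         std::back_inserter(both));
   return both;
 }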


Block is a struct used to encode and decode a block. A block is variable-length delta encoded; variable-length delta encoding compresses very efficiently, balancing speed against storage requirements.


 struct Block {
   // The block field is 250 bytes wide. The return value is the number of
   // elements of |data| that were encoded into the out byte array; the caller
   // starts a new block with the remaining elements.
   int encode(in int[] data, out byte[] encodedBlock) {
     // TODO: how to encode more efficiently, any idea?
     int[] bigEndian;
     int k = 0;
     int prev = 0;
     for (int i = 0; i < data.length; i++) {
       // Delta against the previous value, then split the delta into
       // base-128 digits, least significant digit first.
       int delta = data[i] - prev;
       prev = data[i];
       int j = 0;
       do {
         bigEndian[j++] = delta % 128;
         delta /= 128;
       } while (delta != 0);
       // Emit the most significant digit first; every byte except the last
       // one carries the continuation bit (1 << 7).
       for (j--; j > 0; j--, k++)
         encodedBlock[k] = bigEndian[j] | (1 << 7);
       encodedBlock[k++] = bigEndian[0];
       if (k > 250)
         return i;  // element i overflowed the block; re-encode it in the next block
     }
     return data.length;
   }

   void decode(in byte[] encodedBlock, out int[] data) {
     int j = 0;
     int value = 0;
     for (int i = 0; i < encodedBlock.length; i++) {
       if (encodedBlock[i] & (1 << 7)) {
         // Continuation byte: accumulate seven more bits.
         value = value * 128 + (encodedBlock[i] & 127);
       }
       else {
         // Last byte of this value: finish it and undo the delta encoding.
         value = value * 128 + encodedBlock[i];
         data[j] = value + (j > 0 ? data[j - 1] : 0);
         j++;
         value = 0;
       }
     }
   }
 }
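For example, the doc ids 1000, 1003, 1007 are first delta encoded to 1000, 3, 4. Since 1000 = 7·128 + 104, it becomes the two bytes 0x87 (7 with the continuation bit set) and 0x68 (104), while 3 and 4 each fit in a single byte, so the whole list takes four bytes instead of twelve.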


 AddDocument(connection, document, analyzer) {
   // Two passes: scan the document for terms, then invert.
   // The doc id is already known. A hash map library is needed to make this
   // efficient; |term| is a hash map from term name to its array of positions,
   // i.e. term['termname'] = array of positions.
   while (analyzer.hasTerms()) {
     cTerm = analyzer.nextTerm();
     term[cTerm.Name].add(cTerm.pos);
   }
   iterator i = term.iterator();
   while (i.hasNext()) {
     termName = i.next();
     termPos[] = term[termName];
     // Hopefully SQLite caches query results; this will be inefficient otherwise.
     // TODO: Expand the algorithm, it is too generic here.
     if (term already in table) {
       // Retrieve the last (still-open) block for this term; 'box' stands in
       // for termName in this example query.
       records = executeQuery("SELECT firstdoc, flags, block FROM postings
                               WHERE word = 'box' AND firstdoc >=
                                 (SELECT max(firstdoc) FROM postings
                                  WHERE word = 'box' AND flags < 128)");
       Block b = new Block();
       // Decode the block, append this document's postings, re-encode it and
       // write it back, splitting into a new record if the block overflows.
     }
     else {
       // Create a new record; if necessary cache it for performance.
     }
   }
   commit to the database
 }

 RemoveDocument() {
   // Inherently inefficient because of the structure we have adopted.
   // Needs to be optimized.
 }
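The AddDocument sketch above queries a postings table that is not defined elsewhere on this page. A hypothetical schema consistent with that query, using the mozIStorageConnection API; the column meanings are assumptions:

 #include "mozIStorageConnection.h"
 #include "nsString.h"

 nsresult CreatePostingsTable(mozIStorageConnection* aConn)
 {
   // One row per (term, block); flags < 128 is assumed to mark the last,
   // still-open block of a term, and block holds up to 250 bytes of
   // variable-length delta encoded postings.
   return aConn->ExecuteSimpleSQL(NS_LITERAL_CSTRING(
       "CREATE TABLE IF NOT EXISTS postings ("
       "word TEXT NOT NULL, "
       "firstdoc INTEGER NOT NULL, "
       "flags INTEGER NOT NULL, "
       "block BLOB)"));
 }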
 
nsNavFullTextIndexHelper<br>


This class contains the mechanism for scheduling documents for indexing. It works much like nsNavHistoryExpire. The scheduling balances indexing against overall Firefox performance: the class has a timer which, when it expires, picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index it. When and how the timer is set dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works **
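A minimal sketch of how the timer could be wired up with the standard nsITimer interface; the delay, the method names, and IndexNextUnindexedPage are assumptions, not part of this design:

 #include "nsITimer.h"
 #include "nsComponentManagerUtils.h"

 static const PRUint32 kIndexDelayMs = 5000;  // assumed delay, not fixed by the design

 // Called whenever the user visits a page: (re)arm a one-shot timer so that
 // indexing only runs once the browser has been quiet for a while.
 void nsNavFullTextIndexHelper::OnPageVisited()
 {
   if (!mTimer)
     mTimer = do_CreateInstance("@mozilla.org/timer;1");
   mTimer->Cancel();
   mTimer->InitWithFuncCallback(IndexOneDocument, this, kIndexDelayMs,
                                nsITimer::TYPE_ONE_SHOT);
 }

 // Timer callback: index a single unindexed page from history.
 /* static */ void
 nsNavFullTextIndexHelper::IndexOneDocument(nsITimer* aTimer, void* aClosure)
 {
   nsNavFullTextIndexHelper* self =
     static_cast<nsNavFullTextIndexHelper*>(aClosure);
   // Hypothetical helper: picks one unindexed page from history and calls
   // nsNavFullTextIndex::addDocument on it.
   self->IndexNextUnindexedPage();
 }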


nsNavFullTextTokenizer<br>


The tokenizer is used by the nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class that splits the text into tokens. Together, the Tokenizer and Analyzer enable support for other languages.
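A hypothetical shape for the abstract class, consistent with the analyzer.hasTerms()/nextTerm() loop used in AddDocument above; the method names are assumptions:

 #include "nsString.h"
 #include "prtypes.h"

 // Abstract tokenizer: concrete subclasses implement language-specific rules
 // for splitting page text into tokens.
 class nsNavFullTextTokenizer
 {
 public:
   virtual ~nsNavFullTextTokenizer() {}
   // Feed the raw page text to the tokenizer.
   virtual void SetText(const nsAString& aText) = 0;
   // PR_FALSE once the text is exhausted.
   virtual PRBool HasMoreTokens() = 0;
   // Returns the next token and its position (token index) in the document.
   virtual void NextToken(nsAString& aToken, PRInt32* aPosition) = 0;
 };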


nsNavFullTextAnalyzer<br>


The Analyzer takes the tokens generated by the tokenizer and discards those that are not required for indexing, e.g. stop words such as "is", "a", "an", and "the".
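A sketch of how the analyzer could wrap the tokenizer sketched above and filter stop words; the class shape and stop-word list are assumptions, not part of the design:

 // Wraps a tokenizer and hands out only the tokens worth indexing.
 class nsNavFullTextAnalyzer
 {
 public:
   explicit nsNavFullTextAnalyzer(nsNavFullTextTokenizer* aTokenizer)
     : mTokenizer(aTokenizer), mPosition(0), mHaveTerm(PR_FALSE) { Advance(); }

   PRBool HasTerms() { return mHaveTerm; }

   // Hands out the current term and advances to the next indexable token.
   void NextTerm(nsAString& aTerm, PRInt32* aPosition)
   {
     aTerm = mTerm;
     *aPosition = mPosition;
     Advance();
   }

 private:
   static PRBool IsStopWord(const nsAString& aToken)
   {
     return aToken.EqualsLiteral("is") || aToken.EqualsLiteral("a") ||
            aToken.EqualsLiteral("an") || aToken.EqualsLiteral("the");
   }

   // Pull tokens until one survives the stop-word filter.
   void Advance()
   {
     mHaveTerm = PR_FALSE;
     while (mTokenizer->HasMoreTokens()) {
       mTokenizer->NextToken(mTerm, &mPosition);
       if (!IsStopWord(mTerm)) {
         mHaveTerm = PR_TRUE;
         return;
       }
     }
   }

   nsNavFullTextTokenizer* mTokenizer;
   nsString mTerm;
   PRInt32 mPosition;
   PRBool mHaveTerm;
 };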