Confirmed users
396
edits
Mindboggler (talk | contribs) mNo edit summary |
m (→Detailed Design: wikification) |
||
| Line 65: | Line 65: | ||
There are essentially five classes for the back-end: | There are essentially five classes for the back-end: | ||
# nsNavHistoryQuery | |||
# nsNavFullTextIndexHelper | |||
# nsNavFullTextIndex | |||
# nsNavFullTextAnalyzer | |||
nsNavHistoryQuery | ===nsNavHistoryQuery=== | ||
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | ||
| Line 76: | Line 76: | ||
The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned | The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned | ||
nsNavFullTextIndex | ===nsNavFullTextIndex=== | ||
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | ||
Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement. | Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement. | ||
<pre>struct Block { | |||
struct Block { | |||
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array. | //The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array. | ||
int encode(in int[] data, out byte[] encodedBlock) { | int encode(in int[] data, out byte[] encodedBlock) { | ||
| Line 122: | Line 121: | ||
AddDocument(connection, document, analyzer) { | AddDocument(connection, document, analyzer) { | ||
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos. | AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos. | ||
<pre> while(analyzer.hasTerms()) { | |||
cTerm = analyzer.nextTerm(); | cTerm = analyzer.nextTerm(); | ||
term[cTerm.Name].add(cTerm.pos); | term[cTerm.Name].add(cTerm.pos); | ||
| Line 175: | Line 174: | ||
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for. | Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for. | ||
<pre> e.g. Say we want to delete docid = 20. We got two records | |||
firstDoc = 18, block="blahblah" | firstDoc = 18, block="blahblah" | ||
firstDoc = 22, block="blahblah" | firstDoc = 22, block="blahblah" | ||
| Line 213: | Line 212: | ||
terms is something like: "mac apple panther". Basically a collection of terms | terms is something like: "mac apple panther". Basically a collection of terms | ||
The ranking algorithm as described in this document is used. | The ranking algorithm as described in this document is used. | ||
} | }</pre> | ||
nsNavFullTextIndexHelper | ===nsNavFullTextIndexHelper=== | ||
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. | This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. | ||
| Line 221: | Line 220: | ||
When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again. | When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again. | ||
nsNavFullTextAnalyzer | ===nsNavFullTextAnalyzer=== | ||
The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages. | The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages. | ||
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail. | The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail. | ||
nsNavFullTextTokenStream | ===nsNavFullTextTokenStream=== | ||
TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream: | TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream: | ||
# Tokenizer: The input is an inputstream. | |||
# TokenFilter: The input is another TokenStream. This works like pipes. | |||
== Front-End == | == Front-End == | ||