Places:Full Text Indexing

== Detailed Design ==


Classes<br>


There are essentially five classes for the back-end:<br>
1) nsNavHistoryQuery<br>
2) nsNavFullTextIndexHelper<br>
3) nsNavFullTextIndex<br>
4) nsNavFullTextTokenizer<br>
5) nsNavFullTextAnalyzer<br>


nsNavHistoryQuery<br>


This class is already implemented with all features except text search. The search mechanism is no different from what is described in http://developer.mozilla.org/en/docs/Places:Query_System. nsNavHistoryQuery::search performs a full-text search using the terms in nsNavHistoryQuery::searchTerms. Combined with the many other query options that can be specified, this is very flexible; conjunctive queries (where every term must match) can be executed.
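As an illustration of the caller side, here is a minimal sketch, not part of this design, that assumes the existing nsINavHistory* interfaces and a hypothetical search string "mozilla roadmap":

 #include "nsCOMPtr.h"
 #include "nsServiceManagerUtils.h"
 #include "nsString.h"
 #include "nsINavHistoryService.h"

 nsresult RunFullTextQuery()
 {
   nsCOMPtr<nsINavHistoryService> history =
     do_GetService("@mozilla.org/browser/nav-history-service;1");

   // Build a query whose searchTerms trigger the full-text path.
   nsCOMPtr<nsINavHistoryQuery> query;
   history->GetNewQuery(getter_AddRefs(query));
   query->SetSearchTerms(NS_LITERAL_STRING("mozilla roadmap"));

   nsCOMPtr<nsINavHistoryQueryOptions> options;
   history->GetNewQueryOptions(getter_AddRefs(options));

   // The result enumerates only the pages whose indexed text matches the terms.
   nsCOMPtr<nsINavHistoryResult> result;
   return history->ExecuteQuery(query, options, getter_AddRefs(result));
 }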


nsNavFullTextIndex<br>


This class interacts with the SQLite database. It implements the algorithms for adding a document to the index, removing a document from the index, and searching for a given term. The search function is used by the nsNavHistoryQuery::search function. See [2] for the algorithm used.
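For the conjunctive queries mentioned above, the per-term doc id lists returned by this search have to be intersected. A minimal sketch of that step, assuming the lists are already decoded and sorted (this helper is an illustration, not part of the design):

 #include <algorithm>
 #include <iterator>
 #include <vector>

 // Keep only the documents that appear in both per-term postings lists.
 std::vector<int> IntersectPostings(const std::vector<int>& aTermA,
                                    const std::vector<int>& aTermB)
 {
   std::vector<int> both;
   std::set_intersection(aTermA.begin(), aTermA.end(),
                         aTermB.begin(), aTermB.end(),
                         std::back_inserter(both));
   return both;
 }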


Block is a struct used to encode and decode a block. A block is variable-length delta encoded; variable-length delta encoding compresses very efficiently, balancing speed against storage requirements.


 struct Block {
   // The block field is 250 bytes wide. The return value is the number of
   // elements of |data| that were encoded into the out byte array; the caller
   // starts a new block with the remaining elements.
   int encode(in int[] data, out byte[] encodedBlock) {
     // TODO: how to encode more efficiently, any idea?
     int[] bigEndian;
     int k = 0;
     int prev = 0;
     for (int i = 0; i < data.length; i++) {
       // Delta against the previous value, then split the delta into
       // base-128 digits, least significant digit first.
       int delta = data[i] - prev;
       prev = data[i];
       int j = 0;
       do {
         bigEndian[j++] = delta % 128;
         delta /= 128;
       } while (delta != 0);
       // Emit the most significant digit first; every byte except the last
       // one carries the continuation bit (1 << 7).
       for (j--; j > 0; j--, k++)
         encodedBlock[k] = bigEndian[j] | (1 << 7);
       encodedBlock[k++] = bigEndian[0];
       if (k > 250)
         return i;  // element i overflowed the block; re-encode it in the next block
     }
     return data.length;
   }

   void decode(in byte[] encodedBlock, out int[] data) {
     int j = 0;
     int value = 0;
     for (int i = 0; i < encodedBlock.length; i++) {
       if (encodedBlock[i] & (1 << 7)) {
         // Continuation byte: accumulate seven more bits.
         value = value * 128 + (encodedBlock[i] & 127);
       }
       else {
         // Last byte of this value: finish it and undo the delta encoding.
         value = value * 128 + encodedBlock[i];
         data[j] = value + (j > 0 ? data[j - 1] : 0);
         j++;
         value = 0;
       }
     }
   }
 }
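For example, the doc ids 1000, 1003, 1007 are first delta encoded to 1000, 3, 4. Since 1000 = 7·128 + 104, it becomes the two bytes 0x87 (7 with the continuation bit set) and 0x68 (104), while 3 and 4 each fit in a single byte, so the whole list takes four bytes instead of twelve.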


 AddDocument(connection, document, analyzer) {
   // Two passes: scan the document for terms, then invert.
   // The doc id is already known. A hash map library is needed to make this
   // efficient; |term| is a hash map from term name to its array of positions,
   // i.e. term['termname'] = array of positions.
   while (analyzer.hasTerms()) {
     cTerm = analyzer.nextTerm();
     term[cTerm.Name].add(cTerm.pos);
   }
   iterator i = term.iterator();
   while (i.hasNext()) {
     termName = i.next();
     termPos[] = term[termName];
     // Hopefully SQLite caches query results; this will be inefficient otherwise.
     // TODO: Expand the algorithm, it is too generic here.
     if (term already in table) {
       // Retrieve the last (still-open) block for this term; 'box' stands in
       // for termName in this example query.
       records = executeQuery("SELECT firstdoc, flags, block FROM postings
                               WHERE word = 'box' AND firstdoc >=
                                 (SELECT max(firstdoc) FROM postings
                                  WHERE word = 'box' AND flags < 128)");
       Block b = new Block();
       // Decode the block, append this document's postings, re-encode it and
       // write it back, splitting into a new record if the block overflows.
     }
     else {
       // Create a new record; if necessary cache it for performance.
     }
   }
   commit to the database
 }

 RemoveDocument() {
   // Inherently inefficient because of the structure we have adopted.
   // Needs to be optimized.
 }
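The AddDocument sketch above queries a postings table that is not defined elsewhere on this page. A hypothetical schema consistent with that query, using the mozIStorageConnection API; the column meanings are assumptions:

 #include "mozIStorageConnection.h"
 #include "nsString.h"

 nsresult CreatePostingsTable(mozIStorageConnection* aConn)
 {
   // One row per (term, block); flags < 128 is assumed to mark the last,
   // still-open block of a term, and block holds up to 250 bytes of
   // variable-length delta encoded postings.
   return aConn->ExecuteSimpleSQL(NS_LITERAL_CSTRING(
       "CREATE TABLE IF NOT EXISTS postings ("
       "word TEXT NOT NULL, "
       "firstdoc INTEGER NOT NULL, "
       "flags INTEGER NOT NULL, "
       "block BLOB)"));
 }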
 
nsNavFullTextIndexHelper<br>


This class contains the mechanism for scheduling documents for indexing. It works much like nsNavHistoryExpire. The scheduling balances indexing against overall Firefox performance: the class has a timer which, when it expires, picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index it. When and how the timer is set dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works **
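A minimal sketch of how the timer could be wired up with the standard nsITimer interface; the delay, the method names, and IndexNextUnindexedPage are assumptions, not part of this design:

 #include "nsITimer.h"
 #include "nsComponentManagerUtils.h"

 static const PRUint32 kIndexDelayMs = 5000;  // assumed delay, not fixed by the design

 // Called whenever the user visits a page: (re)arm a one-shot timer so that
 // indexing only runs once the browser has been quiet for a while.
 void nsNavFullTextIndexHelper::OnPageVisited()
 {
   if (!mTimer)
     mTimer = do_CreateInstance("@mozilla.org/timer;1");
   mTimer->Cancel();
   mTimer->InitWithFuncCallback(IndexOneDocument, this, kIndexDelayMs,
                                nsITimer::TYPE_ONE_SHOT);
 }

 // Timer callback: index a single unindexed page from history.
 /* static */ void
 nsNavFullTextIndexHelper::IndexOneDocument(nsITimer* aTimer, void* aClosure)
 {
   nsNavFullTextIndexHelper* self =
     static_cast<nsNavFullTextIndexHelper*>(aClosure);
   // Hypothetical helper: picks one unindexed page from history and calls
   // nsNavFullTextIndex::addDocument on it.
   self->IndexNextUnindexedPage();
 }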


nsNavFullTextTokenizer<br>


The tokenizer is used by the nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class that splits the text into tokens. Together, the Tokenizer and Analyzer enable support for other languages.
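A hypothetical shape for the abstract class, consistent with the analyzer.hasTerms()/nextTerm() loop used in AddDocument above; the method names are assumptions:

 #include "nsString.h"
 #include "prtypes.h"

 // Abstract tokenizer: concrete subclasses implement language-specific rules
 // for splitting page text into tokens.
 class nsNavFullTextTokenizer
 {
 public:
   virtual ~nsNavFullTextTokenizer() {}
   // Feed the raw page text to the tokenizer.
   virtual void SetText(const nsAString& aText) = 0;
   // PR_FALSE once the text is exhausted.
   virtual PRBool HasMoreTokens() = 0;
   // Returns the next token and its position (token index) in the document.
   virtual void NextToken(nsAString& aToken, PRInt32* aPosition) = 0;
 };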


nsNavFullTextAnalyzer<br>


The Analyzer takes the tokens generated by the tokenizer and discards those that are not required for indexing, e.g. stop words such as "is", "a", "an", and "the".
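A sketch of how the analyzer could wrap the tokenizer sketched above and filter stop words; the class shape and stop-word list are assumptions, not part of the design:

 // Wraps a tokenizer and hands out only the tokens worth indexing.
 class nsNavFullTextAnalyzer
 {
 public:
   explicit nsNavFullTextAnalyzer(nsNavFullTextTokenizer* aTokenizer)
     : mTokenizer(aTokenizer), mPosition(0), mHaveTerm(PR_FALSE) { Advance(); }

   PRBool HasTerms() { return mHaveTerm; }

   // Hands out the current term and advances to the next indexable token.
   void NextTerm(nsAString& aTerm, PRInt32* aPosition)
   {
     aTerm = mTerm;
     *aPosition = mPosition;
     Advance();
   }

 private:
   static PRBool IsStopWord(const nsAString& aToken)
   {
     return aToken.EqualsLiteral("is") || aToken.EqualsLiteral("a") ||
            aToken.EqualsLiteral("an") || aToken.EqualsLiteral("the");
   }

   // Pull tokens until one survives the stop-word filter.
   void Advance()
   {
     mHaveTerm = PR_FALSE;
     while (mTokenizer->HasMoreTokens()) {
       mTokenizer->NextToken(mTerm, &mPosition);
       if (!IsStopWord(mTerm)) {
         mHaveTerm = PR_TRUE;
         return;
       }
     }
   }

   nsNavFullTextTokenizer* mTokenizer;
   nsString mTerm;
   PRInt32 mPosition;
   PRBool mHaveTerm;
 };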