Places:Full Text Indexing: Difference between revisions

Places:Full Text Indexing (view source)

Revision as of 06:25, 26 May 2007

4,073 bytes added , 26 May 2007

no edit summary

Mindboggler

24

edits

@@ Line 14: / Line 14: @@
 A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.
+TODO: Formatting. Learn how wiki formatting is done
+== Use Case ==
+Actor: User
+- Visit Page
+- Search
+- Clear History
+Actor: Browser
+- Expire Page
+The use cases above will be used to validate the design.
+== Database Design ==
+<nowiki>
+Word Table
+columnn		type		bytes	description
+word		varchar		<=100	term for indexing(Shouldn't it be unicode? how do i store unicode?)
+wordnum		integer		4	unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
+doc_count	integer		4	number of documents the word occurred in
+word_count	integer		4	number of occurrences of the word
+Posting Table
+column		type		bytes	description
+wordnum		integer		4	This is the foreign key matching that in the word table
+firstdoc	integer		4	lowest doc id referenced in the block
+flags		tinyint		1	indicates the block type, length of doc list, sequence number
+block 		varbinary	<=255	contains encoded document and/or position postings for the word
+To Do
+) We might need a table or two more for ranking related data
+) See how docid will referenced
+</nowiki>
+Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column
 == Detailed Design ==
+Classes
+There are essentially five classes for the back-end:
+) nsNavHistoryQuery
+) nsNavFullTextIndexHelper
+) nsNavFullTextIndex
+) nsNavFullTextTokenizer
+) nsNavFullTextAnalyzer
+nsNavHistoryQuery
+This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.
+nsNavFullTextIndex
+This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.
+Block is a struct used to encode and decode block. The structure of a block is as described in [2]
+struct Block {
+	int docCount;
+	int[] docId;
+	int[] docFreq;
+	int[][] docPos;
+	//Variable Length Encoding, compresses very efficiently balancing speed and storage requirement
+	void encode(out byte[] encodedBlock) //encodes the block from docCount, docId, docFreq and docPos
+	void decode(in byte[] encodedBlock); //Decodes the block and fills docCount, docId, docFreq and docPos
+	void addDoc(in int id, in int freq, in int[] pos) //Updates the internal structure
+	TODO: Fill the algorithm here
+}
+AddDocument(connection, document, analyzer) {
+	while(analyzer.hasTerms()) {
+		term = analyzer.nextTerm();
+		if (term already in table)
+			retrieve the last record, update the record(if necessary split) and commit to the database
+		else
+			create a new record, if necessary cache it for performance and commit to the database;
+		TODO: Expand the algorithm, too generic here
+	}
+}
+nsNavFullTextIndexHelper
+This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works **
+nsNavFullTextTokenizer
+The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages.
+nsNavFullTextAnalyzer
+The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded.
+== Front-End ==
+TODO: Write about the front-end design
+) nsNavQueryParser
 == References ==
 [1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
 [2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing: Difference between revisions

Places:Full Text Indexing (view source)

Revision as of 06:25, 26 May 2007

Navigation menu

Search