Places:Full Text Indexing: Difference between revisions

Jump to navigation Jump to search
no edit summary
(The complete design of Places: Full Text Indexing feature)
No edit summary
Line 14: Line 14:


A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.
A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.
TODO: Formatting. Learn how wiki formatting is done
== Use Case ==
Actor: User
- Visit Page
- Search
- Clear History
Actor: Browser
- Expire Page
The use cases above will be used to validate the design.
== Database Design ==
<nowiki>
Word Table
columnn type bytes description
word varchar <=100 term for indexing(Shouldn't it be unicode? how do i store unicode?)
wordnum integer 4 unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
doc_count integer 4 number of documents the word occurred in
word_count integer 4 number of occurrences of the word
Posting Table
column type bytes description
wordnum integer 4 This is the foreign key matching that in the word table
firstdoc integer 4 lowest doc id referenced in the block
flags tinyint 1 indicates the block type, length of doc list, sequence number
block varbinary <=255 contains encoded document and/or position postings for the word
To Do
1) We might need a table or two more for ranking related data
2) See how docid will referenced
</nowiki>
Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column


== Detailed Design ==
== Detailed Design ==
Classes
There are essentially five classes for the back-end:
1) nsNavHistoryQuery
2) nsNavFullTextIndexHelper
3) nsNavFullTextIndex
4) nsNavFullTextTokenizer
5) nsNavFullTextAnalyzer
nsNavHistoryQuery
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.
nsNavFullTextIndex
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.
Block is a struct used to encode and decode block. The structure of a block is as described in [2]
struct Block {
int docCount;
int[] docId;
int[] docFreq;
int[][] docPos;
//Variable Length Encoding, compresses very efficiently balancing speed and storage requirement
void encode(out byte[] encodedBlock) //encodes the block from docCount, docId, docFreq and docPos
void decode(in byte[] encodedBlock); //Decodes the block and fills docCount, docId, docFreq and docPos
void addDoc(in int id, in int freq, in int[] pos) //Updates the internal structure
TODO: Fill the algorithm here
}
AddDocument(connection, document, analyzer) {
while(analyzer.hasTerms()) {
term = analyzer.nextTerm();
if (term already in table)
retrieve the last record, update the record(if necessary split) and commit to the database
else
create a new record, if necessary cache it for performance and commit to the database;
TODO: Expand the algorithm, too generic here
}
}
nsNavFullTextIndexHelper
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works **
nsNavFullTextTokenizer
The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages.
nsNavFullTextAnalyzer
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded.
== Front-End ==
TODO: Write about the front-end design
1) nsNavQueryParser


== References ==
== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC
24

edits

Navigation menu