24
edits
Mindboggler (talk | contribs) (The complete design of Places: Full Text Indexing feature) |
Mindboggler (talk | contribs) No edit summary |
||
| Line 14: | Line 14: | ||
A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm. | A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm. | ||
TODO: Formatting. Learn how wiki formatting is done | |||
== Use Case == | |||
Actor: User | |||
- Visit Page | |||
- Search | |||
- Clear History | |||
Actor: Browser | |||
- Expire Page | |||
The use cases above will be used to validate the design. | |||
== Database Design == | |||
<nowiki> | |||
Word Table | |||
columnn type bytes description | |||
word varchar <=100 term for indexing(Shouldn't it be unicode? how do i store unicode?) | |||
wordnum integer 4 unique id. Integer works cause the number of unique words will be atmost a million. Non-english language? | |||
doc_count integer 4 number of documents the word occurred in | |||
word_count integer 4 number of occurrences of the word | |||
Posting Table | |||
column type bytes description | |||
wordnum integer 4 This is the foreign key matching that in the word table | |||
firstdoc integer 4 lowest doc id referenced in the block | |||
flags tinyint 1 indicates the block type, length of doc list, sequence number | |||
block varbinary <=255 contains encoded document and/or position postings for the word | |||
To Do | |||
1) We might need a table or two more for ranking related data | |||
2) See how docid will referenced | |||
</nowiki> | |||
Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column | |||
== Detailed Design == | == Detailed Design == | ||
Classes | |||
There are essentially five classes for the back-end: | |||
1) nsNavHistoryQuery | |||
2) nsNavFullTextIndexHelper | |||
3) nsNavFullTextIndex | |||
4) nsNavFullTextTokenizer | |||
5) nsNavFullTextAnalyzer | |||
nsNavHistoryQuery | |||
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | |||
nsNavFullTextIndex | |||
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | |||
Block is a struct used to encode and decode block. The structure of a block is as described in [2] | |||
struct Block { | |||
int docCount; | |||
int[] docId; | |||
int[] docFreq; | |||
int[][] docPos; | |||
//Variable Length Encoding, compresses very efficiently balancing speed and storage requirement | |||
void encode(out byte[] encodedBlock) //encodes the block from docCount, docId, docFreq and docPos | |||
void decode(in byte[] encodedBlock); //Decodes the block and fills docCount, docId, docFreq and docPos | |||
void addDoc(in int id, in int freq, in int[] pos) //Updates the internal structure | |||
TODO: Fill the algorithm here | |||
} | |||
AddDocument(connection, document, analyzer) { | |||
while(analyzer.hasTerms()) { | |||
term = analyzer.nextTerm(); | |||
if (term already in table) | |||
retrieve the last record, update the record(if necessary split) and commit to the database | |||
else | |||
create a new record, if necessary cache it for performance and commit to the database; | |||
TODO: Expand the algorithm, too generic here | |||
} | |||
} | |||
nsNavFullTextIndexHelper | |||
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works ** | |||
nsNavFullTextTokenizer | |||
The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages. | |||
nsNavFullTextAnalyzer | |||
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. | |||
== Front-End == | |||
TODO: Write about the front-end design | |||
1) nsNavQueryParser | |||
== References == | == References == | ||
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis | [1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis | ||
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC | [2]Using a relational database for full-text index, Steve Putz, Xerox PARC | ||
edits