Places:Full Text Indexing: Difference between revisions
Mindboggler (talk | contribs) (The complete design of Places: Full Text Indexing feature) |
Mindboggler (talk | contribs) No edit summary |
||
Line 14: | Line 14: | ||
A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm. | A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm. | ||
TODO: Formatting. Learn how wiki formatting is done | |||
== Use Case == | |||
Actor: User | |||
- Visit Page | |||
- Search | |||
- Clear History | |||
Actor: Browser | |||
- Expire Page | |||
The use cases above will be used to validate the design. | |||
== Database Design == | |||
<nowiki> | |||
Word Table | |||
columnn type bytes description | |||
word varchar <=100 term for indexing(Shouldn't it be unicode? how do i store unicode?) | |||
wordnum integer 4 unique id. Integer works cause the number of unique words will be atmost a million. Non-english language? | |||
doc_count integer 4 number of documents the word occurred in | |||
word_count integer 4 number of occurrences of the word | |||
Posting Table | |||
column type bytes description | |||
wordnum integer 4 This is the foreign key matching that in the word table | |||
firstdoc integer 4 lowest doc id referenced in the block | |||
flags tinyint 1 indicates the block type, length of doc list, sequence number | |||
block varbinary <=255 contains encoded document and/or position postings for the word | |||
To Do | |||
1) We might need a table or two more for ranking related data | |||
2) See how docid will referenced | |||
</nowiki> | |||
Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column | |||
== Detailed Design == | == Detailed Design == | ||
Classes | |||
There are essentially five classes for the back-end: | |||
1) nsNavHistoryQuery | |||
2) nsNavFullTextIndexHelper | |||
3) nsNavFullTextIndex | |||
4) nsNavFullTextTokenizer | |||
5) nsNavFullTextAnalyzer | |||
nsNavHistoryQuery | |||
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed. | |||
nsNavFullTextIndex | |||
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used. | |||
Block is a struct used to encode and decode block. The structure of a block is as described in [2] | |||
struct Block { | |||
int docCount; | |||
int[] docId; | |||
int[] docFreq; | |||
int[][] docPos; | |||
//Variable Length Encoding, compresses very efficiently balancing speed and storage requirement | |||
void encode(out byte[] encodedBlock) //encodes the block from docCount, docId, docFreq and docPos | |||
void decode(in byte[] encodedBlock); //Decodes the block and fills docCount, docId, docFreq and docPos | |||
void addDoc(in int id, in int freq, in int[] pos) //Updates the internal structure | |||
TODO: Fill the algorithm here | |||
} | |||
AddDocument(connection, document, analyzer) { | |||
while(analyzer.hasTerms()) { | |||
term = analyzer.nextTerm(); | |||
if (term already in table) | |||
retrieve the last record, update the record(if necessary split) and commit to the database | |||
else | |||
create a new record, if necessary cache it for performance and commit to the database; | |||
TODO: Expand the algorithm, too generic here | |||
} | |||
} | |||
nsNavFullTextIndexHelper | |||
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works ** | |||
nsNavFullTextTokenizer | |||
The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages. | |||
nsNavFullTextAnalyzer | |||
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. | |||
== Front-End == | |||
TODO: Write about the front-end design | |||
1) nsNavQueryParser | |||
== References == | == References == | ||
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis | [1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis | ||
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC | [2]Using a relational database for full-text index, Steve Putz, Xerox PARC |
Revision as of 06:25, 26 May 2007
Overview
Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"
Design Descision
A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.
CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.
SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.
A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.
A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.
TODO: Formatting. Learn how wiki formatting is done
Use Case
Actor: User - Visit Page - Search - Clear History
Actor: Browser - Expire Page
The use cases above will be used to validate the design.
Database Design
Word Table columnn type bytes description word varchar <=100 term for indexing(Shouldn't it be unicode? how do i store unicode?) wordnum integer 4 unique id. Integer works cause the number of unique words will be atmost a million. Non-english language? doc_count integer 4 number of documents the word occurred in word_count integer 4 number of occurrences of the word Posting Table column type bytes description wordnum integer 4 This is the foreign key matching that in the word table firstdoc integer 4 lowest doc id referenced in the block flags tinyint 1 indicates the block type, length of doc list, sequence number block varbinary <=255 contains encoded document and/or position postings for the word To Do 1) We might need a table or two more for ranking related data 2) See how docid will referenced Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column
Detailed Design
Classes
There are essentially five classes for the back-end: 1) nsNavHistoryQuery 2) nsNavFullTextIndexHelper 3) nsNavFullTextIndex 4) nsNavFullTextTokenizer 5) nsNavFullTextAnalyzer
nsNavHistoryQuery
This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.
nsNavFullTextIndex
This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.
Block is a struct used to encode and decode block. The structure of a block is as described in [2]
struct Block { int docCount; int[] docId; int[] docFreq; int[][] docPos; //Variable Length Encoding, compresses very efficiently balancing speed and storage requirement void encode(out byte[] encodedBlock) //encodes the block from docCount, docId, docFreq and docPos void decode(in byte[] encodedBlock); //Decodes the block and fills docCount, docId, docFreq and docPos void addDoc(in int id, in int freq, in int[] pos) //Updates the internal structure TODO: Fill the algorithm here }
AddDocument(connection, document, analyzer) { while(analyzer.hasTerms()) { term = analyzer.nextTerm(); if (term already in table) retrieve the last record, update the record(if necessary split) and commit to the database else create a new record, if necessary cache it for performance and commit to the database; TODO: Expand the algorithm, too generic here } }
nsNavFullTextIndexHelper
This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered. ** fill in here how it works **
nsNavFullTextTokenizer
The tokenizer is used by nsNavFullTextIndex::AddDocument function. Tokenizer is an abstract class. Tokenizer splits the text into tokens. Tokenizer and Analyzer enable support for other languages.
nsNavFullTextAnalyzer
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded.
Front-End
TODO: Write about the front-end design 1) nsNavQueryParser
References
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis [2]Using a relational database for full-text index, Steve Putz, Xerox PARC