Places: Full Text Indexing
== Overview ==


The Full Text Indexing feature will allow users to search for a word or phrase in the pages they have visited. The search query will be tightly integrated with Places's nsNavHistoryService. This tight integration will allow queries like "search for pages visited between 01/05/07 (dd/mm/yy) and 20/05/07 (dd/mm/yy) containing the word 'places'".


== Status ==

Work-in-progress patch in {{bug|377244}}.

== Design Decisions ==


A number of options were examined before proposing this design, including CLucene (as used by Flock), SQLite's FTS1 and FTS2 modules, a custom implementation using B+ Trees, and a plain relational database. The following text briefly describes the advantages and disadvantages of each approach.


CLucene is a full-text indexing engine that stores its index as B+ Trees in files. It uses a very efficient method for storage and retrieval and has excellent support for CJK languages. However, the Places system is a new addition to Firefox, so it is important that, during its initial stages, all the code that is written or used is flexible, small, and tightly integrated with it; tighter integration would allow future enhancements specific to Firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 modules are open-source implementations of full-text indexing integrated with SQLite. Both store text in B+ Trees, and the full-text index is accessed through SQL. However, there are a number of shortcomings for our usage: FTS2 is still at a stage where the API and data format might change without backward compatibility, FTS1 does not support custom tokenizers (which means no CJK support), and FTS1 stores the entire page, duplicating what is already stored in the cache.


A custom implementation using B+ Trees is a good option, but it would require an additional B+ Tree engine. Given the availability of an efficient algorithm for implementing full-text indexing on top of a relational database, that method is used instead.


A naive implementation of full-text indexing is very costly in terms of storage. Briefly: define a term as any word that appears in a page. A naive relational schema contains one table with two columns, term and term id, and a second table with two columns, term id and doc id (the id of the document the term appeared in). The GNU manuals were analyzed [1]: 5.15 MB of text containing 958,774 word occurrences, of which 27,554 are unique. The second table requires a row for every occurrence, so if term ids are stored as 4-byte integers, the first column alone requires 958,774 * 4 bytes, about 3.7 MB. A B+ Tree implementation is at least 3 MB more efficient. However, the encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation, and it leverages the capabilities of a relational database system while not losing too much in terms of storage and performance.
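The storage estimate above can be checked with a couple of lines of arithmetic (the corpus figures are the ones reported in [1]):

```javascript
// Corpus statistics for the GNU manuals, as reported in [1].
const totalOccurrences = 958774; // every word occurrence in 5.15 MB of text
const uniqueTerms = 27554;       // distinct words

// Naive posting table: one 4-byte term id per occurrence.
const termIdColumnBytes = totalOccurrences * 4;
console.log((termIdColumnBytes / (1024 * 1024)).toFixed(2) + " MB"); // ≈ 3.66 MB
```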


According to Scott Hess, the FTS developer, "It sounds like a lot of what you're discussing here matches what we did for fts1 in SQLite. fts2 was a great improvement in terms of performance, and has no breaking changes expected." Hence FTS2 is a great option. I have built Mozilla with SQLite and FTS2, and it was easy. Moreover, FTS2 integrates cleanly with SQLite, requiring no changes to sqlite3file.h or sqlite.def, which are essentially export lists. One can issue queries like
<pre>CREATE VIRTUAL TABLE history_index USING fts2(title, meta, content)</pre>
and the index gets created. To insert,
<pre>INSERT INTO history_index(title, meta, content) VALUES('some value', 'some value', 'some value')</pre>
and to search,
<pre>SELECT * FROM history_index WHERE content MATCH 'value'</pre>


== Use Case ==

Actor: User
* Visit Page
* Search
* Clear History

Actor: Browser
* Expire Page


The use cases above will be used to validate the design.
== Database Design ==

The url table acts as the document table; it will additionally contain the document length.
{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!column!!type!!bytes!!description
|-
|word||varchar||<=100||term being indexed (SQLite stores TEXT as Unicode, so non-ASCII terms are supported)
|-
|wordnum||integer||4||unique id; a 4-byte integer suffices because the number of unique words will be at most a few million
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||total number of occurrences of the word
|}
<br>
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}
To do:
# We might need one or two more tables to rank efficiently.
# SQLite does not have a varbinary datatype; the BLOB type will be used for the block column.
Note that the table structure is subject to change to improve efficiency; new tables might be added, and columns might be added or removed.
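As an illustrative sketch only (the tables above are the authority; varbinary is mapped to BLOB since SQLite has no varbinary type), the two tables could be declared as:
<pre>
CREATE TABLE word (
  word       TEXT,                -- term being indexed
  wordnum    INTEGER PRIMARY KEY, -- unique id for the term
  doc_count  INTEGER,             -- number of documents the word occurred in
  word_count INTEGER              -- total occurrences of the word
);

CREATE TABLE postings (
  wordnum  INTEGER REFERENCES word(wordnum),
  firstdoc INTEGER,               -- lowest doc id referenced in the block
  flags    INTEGER,               -- block type, doc-list length, sequence number
  block    BLOB                   -- encoded document and/or position postings
);
</pre>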


== Detailed Design ==

===Classes===

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except text search. The mechanism of searching is no different from what is described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);</pre>

The result will contain a list of URIs. A number of options can be specified in the query and options objects, making this very powerful. Conjunctive queries can also be executed with historyService.executeQueries and a list of queries as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms), which returns a list of URIs ranked according to the algorithm described in the SearchDocument(terms) function later in this document. The list of URIs is further filtered by the other parameters set in the query and options variables. In the case of the executeQueries method, the list is aggregated from the results of the multiple queries.

===nsNavFullTextIndex===


This class interacts with FTS2. It is responsible for executing the queries that insert content into the index and for generating the URIs of matching documents on a search request. This class is private and is not exposed outside. A search request also generates text snippets for the UI to display. The detailed algorithms below for adding a document, removing a document, and searching follow the storage model of [2]; the search function is used by nsNavHistoryQuery.
 
Block is a struct used to encode and decode a block. A block is variable-length delta encoded: each value is replaced by its difference from the previous value, and each delta is written as a sequence of base-128 digits, most significant first, with the high bit set on every digit except the last. Variable-length delta encoding compresses very efficiently, balancing speed and storage requirements.
<pre>struct Block {
// The width of the block field is 255 bytes. The return value is the number
// of elements of data that were encoded into the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
  byte digits[5];           // base-128 digits of one delta, least significant first
  int k = 0;
  int prev = 0;
  for (int i = 0; i < data.length; i++) {
    int delta = data[i] - prev;  // data is sorted, so delta >= 0
    prev = data[i];
    int j = 0;
    do {
      digits[j++] = delta % 128;
      delta /= 128;
    } while (delta != 0);
    if (k + j > 255)             // this delta does not fit in the block
      return i;
    while (--j > 0)              // all digits but the last carry the
      encodedBlock[k++] = (1 << 7) | digits[j];  // continuation bit
    encodedBlock[k++] = digits[0];
  }
  return data.length;
}
void decode(in byte[] encodedBlock, out int[] data) {
  int value = 0, prev = 0, j = 0;
  for (int i = 0; i < encodedBlock.length; i++) {
    value = value * 128 + (encodedBlock[i] & ((1 << 7) - 1));
    if (!(encodedBlock[i] & (1 << 7))) {  // high bit clear: last digit
      prev += value;                      // undo the delta encoding
      data[j++] = prev;
      value = 0;
    }
  }
}
}</pre>
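A minimal runnable sketch of the same scheme, in JavaScript for illustration only (the real implementation would be C++ operating on the BLOB column):

```javascript
// Variable-length delta encoding of a sorted list of integers.
// Each delta is emitted as base-128 digits, most significant first;
// every digit except the last has its high bit (0x80) set.
function encodeBlock(data) {
  const out = [];
  let prev = 0;
  for (const value of data) {
    let delta = value - prev;
    prev = value;
    const digits = [];
    do {
      digits.push(delta % 128);
      delta = Math.floor(delta / 128);
    } while (delta > 0);
    for (let j = digits.length - 1; j > 0; j--) out.push(0x80 | digits[j]);
    out.push(digits[0]);
  }
  return out;
}

function decodeBlock(bytes) {
  const data = [];
  let value = 0, prev = 0;
  for (const b of bytes) {
    value = value * 128 + (b & 0x7f);
    if (!(b & 0x80)) {  // high bit clear: last digit of this delta
      prev += value;    // undo the delta encoding
      data.push(prev);
      value = 0;
    }
  }
  return data;
}
```

Sorted doc ids like [18, 22, 300] become the deltas 18, 4, 278; only the last needs two bytes.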
 
 
AddDocument works in two passes: scan the document for terms, then invert. We already have the doc id. A hash map library is needed to make this efficient; term is a hash map from a term name to its array of positions.
<pre>AddDocument(connection, document, analyzer) {
  while (analyzer.hasTerms()) {
    cTerm = analyzer.nextTerm();
    term[cTerm.name].add(cTerm.pos);
  }
  iterator i = term.iterator();
  while (i.hasNext()) {
    termName = i.next();
    termPos[] = term[termName];
    // Hopefully SQLite caches query results; this will be inefficient otherwise.
    if (term already in word table) {
      record = executeQuery("SELECT firstdoc, flags, block FROM postings
                             WHERE word = termName
                             AND firstdoc =   // given as >= in [2]; = is correct in my opinion
                               (SELECT max(firstdoc) FROM postings
                                WHERE word = termName AND flags < 128)");
      // Refer to [2] for an explanation. Only one record is retrieved,
      // with flags == 0 or 0 < flags < 128.
      // When flags == 0, the block contains only a document list.
      if (record.flags == 0) {
        1) Decode the block.
        2) See if one more document fits in it.
        3) Yes? Add it:
           i)   Find the position list:
                positionList = executeQuery("SELECT firstdoc, flags, block FROM postings
                                             WHERE firstdoc =
                                               (SELECT max(firstdoc) FROM postings
                                                WHERE word = termName
                                                AND firstdoc >= record.firstdoc AND flags >= 128)");
           ii)  Add as many positions as possible to this block.
           iii) When the block is full, create a new row with firstdoc == currentdoc
                and flags == 128 or 129, depending on whether the previous firstdoc
                was the same as this one.
           iv)  Go to ii) if positions are left.
        4) No?
           i)   Create a new row with firstdoc = docid, flags = 2.
           ii)  To the block add the doc id, the doc frequency, and all the positions.
                Note that position listings are never split when flags == 2: we must
                try to fit the whole position listing in this block, which a small
                calculation shows is possible in the vast majority of cases.
           iii) In the rare case it is not possible, create two rows, one with
                flags == 0 and one with flags == 128.
      } else {
        // Slightly more complex: the block holds both the document list and the
        // position list. Decode the block and try to add the document id and all
        // the positions. If that is not possible, create two new rows, one with
        // flags == 0 and the other with flags == 128.
      }
      update the word count in the word table appropriately
    }
  }
  commit to the database
}</pre>
 
RemoveDocument is inherently inefficient because of the structure we have adopted, and needs to be optimized. The general algorithm revolves around finding the records whose firstdoc is immediately less than or equal to the doc id being removed.
<pre>RemoveDocument(docId) {
  // E.g. say we want to delete docId = 20 and we have two records:
  //   firstdoc = 18, block = "..."
  //   firstdoc = 22, block = "..."
  // We select the record with firstdoc = 18, immediately before docId = 20:
  SELECT word, firstdoc, block FROM postings
  WHERE firstdoc = (SELECT max(firstdoc) FROM postings
                    WHERE firstdoc <= docId)
  // This returns a number of records with flags == 0, 0 < flags < 128,
  // flags == 128, or flags > 128. For each record found:
  docAndPostingTable = Block::decode(block);
  if (docAndPostingTable contains docId) {
    when flags == 0:       // only a document list
      remove the document and its frequency from the block
      update the word count for the term in the word table
      update the delta coding for the other docs in the block
      if (docId == firstdoc) update firstdoc to the immediately following doc
      if no docs are left in this row, delete the row
    when 0 < flags < 128:  // both document list and position list
      remove the document from the block
      update the word count for the term in the word table
      update the delta coding for the other docs in the block
      remove all the positions of the document for this term from the block
      if (docId == firstdoc) update firstdoc to the immediately following doc
      if no docs are left in this row, delete the row
    when flags >= 128:     // only a position list
      remove all the positions corresponding to the doc
      update firstdoc for the record
      delete the record if the block is empty
  }
}</pre>
 
<pre>SearchDocument(terms) {
  // terms is a collection of terms, e.g. "mac apple panther".
  // The ranking algorithm described in
  // http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html
  // (Lucene's Similarity) is used.
}</pre>
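A simplified sketch of Lucene-style scoring (tf × idf per query term, summed per document; this omits Lucene's normalization and coordination factors):

```javascript
// Score documents for a multi-term query with a simplified tf-idf formula:
// score(doc) = sum over query terms of sqrt(tf) * (1 + ln(N / (1 + df))),
// where tf is the term's count in the doc, df its document frequency,
// and N the corpus size. docs maps docId -> { term -> occurrence count }.
function scoreDocuments(queryTerms, docs) {
  const n = Object.keys(docs).length;
  const df = {};
  for (const term of queryTerms) {
    df[term] = Object.values(docs).filter(d => term in d).length;
  }
  const scores = {};
  for (const [docId, termCounts] of Object.entries(docs)) {
    let score = 0;
    for (const term of queryTerms) {
      const tf = termCounts[term] || 0;
      if (tf > 0) score += Math.sqrt(tf) * (1 + Math.log(n / (1 + df[term])));
    }
    if (score > 0) scores[docId] = score;
  }
  return scores;
}
```

Documents containing more occurrences of a rarer query term rank higher; documents matching no term are omitted.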


===nsNavFullTextIndexHelper===


This class implements nsParserDataListener and is registered to listen for data. It is called every time data is available for a page request. After aggregating the data, it calls nsNavFullTextIndex::indexDocument(documentData), which executes the FTS2 queries.
 
===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works; the Lucene design enables support for multiple languages. The analyzer takes the tokens generated by the tokenizer and discards those that are not required for indexing (e.g. "is", "a", "an", "the"). This is enabled by the TokenStream classes; refer to nsNavFullTextTokenStream and the Lucene API for more detail.


===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods: next, reset and close. The next method returns a Token, a class representing a start offset, an end offset and the term text. A TokenStream needs an input to tokenize. There are two concrete classes:
# Tokenizer: the input is an input stream.
# TokenFilter: the input is another TokenStream; these compose like pipes.

===nsNavFullTextTokenizer===

These are a bunch of classes that will be registered with the FTS2 module, allowing custom tokenizers and support for different languages. Before the standard tokenizer runs, the stream has to pass through an HTML tag stripper.
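A toy sketch of that pipeline (an HTML tag stripper feeding a whitespace tokenizer that records token offsets; the real FTS2 tokenizer interface is a C API):

```javascript
// Strip HTML tags, then tokenize on whitespace, recording for each token
// its start offset, end offset and term text in the stripped stream.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, " ");
}

function tokenize(text) {
  const tokens = [];
  const re = /\S+/g;
  let m;
  while ((m = re.exec(text)) !== null) {
    tokens.push({ start: m.index, end: m.index + m[0].length, term: m[0].toLowerCase() });
  }
  return tokens;
}
```

For example, tokenize(stripTags("&lt;p&gt;Hello &lt;b&gt;Places&lt;/b&gt;&lt;/p&gt;")) yields the terms "hello" and "places".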


== Front-End ==

===nsNavQueryParser===

The function of this class is to break a query given in human language into a graph of TermQuery and BooleanQuery objects. BooleanQuery is a struct with two operands (each a TermQuery or a BooleanQuery) and an operator (AND, OR, NOT, XOR). The idea is eventually to support all the kinds of queries described in http://lucene.apache.org/java/docs/queryparsersyntax.html. nsNavHistoryQuery is used to execute the query struct; the result is a url_list, which is displayed using the view.
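A minimal sketch of parsing a boolean query string into TermQuery/BooleanQuery nodes (hypothetical structures named after the design; only infix operators with left associativity, no parentheses):

```javascript
// Parse "a AND b OR c" into a left-associative tree of BooleanQuery nodes.
// TermQuery and BooleanQuery here are plain objects standing in for the
// structs described in the design.
function parseQuery(input) {
  const tokens = input.trim().split(/\s+/);
  let node = { type: "TermQuery", term: tokens[0] };
  for (let i = 1; i + 1 < tokens.length; i += 2) {
    node = {
      type: "BooleanQuery",
      operator: tokens[i].toUpperCase(), // AND, OR, NOT, XOR
      left: node,
      right: { type: "TermQuery", term: tokens[i + 1] },
    };
  }
  return node;
}
```

For example, "mac AND apple OR panther" parses as (mac AND apple) OR panther.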

''Latest revision as of 23:30, 18 August 2008.''






== References ==

# Justin Zobel, Alistair Moffat, Ron Sacks-Davis. "An Efficient Indexing Technique for Full-Text Database Systems."
# Steve Putz. "Using a Relational Database for an Inverted Text Index." Xerox PARC.