MozillaWiki - User contributions [en]

Places:Full Text Indexing

2007-08-15T10:38:39Z

Mindboggler: /* Detailed Design */

Places:Full Text Indexing

2007-08-15T06:45:07Z

Mindboggler: /* nsNavFullTextIndex */

Places:Full Text Indexing

2007-08-15T06:37:32Z

Mindboggler:

== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. According Scott Hess, FTS developer, "It sounds like a lot of what you're discussing here matches what we did for fts1 in SQLite. fts2 was a great improvement in terms of performance, and has no breaking changes expected" Hence Fts2 is a great option. I have built mozilla with sqlite and fts2 and it was easy. Moreover, FTS2 is integrated nicely with SQLite requiring no change in sqlite3file.h and sqlite.def files which are essentially exports. One can give queries like
<pre> Create virtual table history_index using fts2(title, meta, content) </pre>
The index gets created. Further to insert,
<pre> insert into history_index(title, meta, content) values('some value', 'some value', 'some value') </pre>
And to search,
<pre> select * from history_index where content matches 'value'</pre>

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
===nsNavQueryParser===
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
# [http://www.vldb.org/conf/1992/P353.PDF An Efficient Indexing Technique for full-text database systems] Justin Jobel, Alistair Moffet, Ron Sacks Davis
# [http://www.n3labs.com/pdf/putz91using-inverted-DBMS.pdf Using a relational database for an Inverted Text index] Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-08-15T06:36:28Z

Mindboggler: /* Design Descision */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. According Scott Hess, FTS developer, "It sounds like a lot of what you're discussing here matches what we did for fts1 in SQLite. fts2 was a great improvement in terms of performance, and has no breaking changes expected" Hence Fts2 is a great option. I have built mozilla with sqlite and fts2 and it was easy. Moreover, FTS2 is integrated nicely with SQLite requiring no change in sqlite3file.h and sqlite.def files which are essentially exports. One can give queries like
<pre> Create virtual table history_index using fts2(title, meta, content) </pre>
The index gets created. Further to insert,
<pre> insert into history_index(title, meta, content) values('some value', 'some value', 'some value') </pre>
And to search,
<pre> select * from history_index where content matches 'value'</pre>

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
===nsNavQueryParser===
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
# [http://www.vldb.org/conf/1992/P353.PDF An Efficient Indexing Technique for full-text database systems] Justin Jobel, Alistair Moffet, Ron Sacks Davis
# [http://www.n3labs.com/pdf/putz91using-inverted-DBMS.pdf Using a relational database for an Inverted Text index] Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-08-15T03:02:06Z

Mindboggler: /* nsNavFullTextIndex */

Talk:Places:Full Text Indexing

2007-07-07T08:08:46Z

Mindboggler:

It sounds like a lot of what you're discussing here matches what we did for fts1 in SQLite. fts2 was a great improvement in terms of performance, and has no breaking changes expected. I'm considering folding in ability to index external content for fts3, but customers actually talking to me get some priority :-).

[shess at google dot com]

Have gone through fts2 again. Looks promising. FTS2 does not support indexing extenal content. Although, this is not ideal, we are still going for this looking at other benefits. Also, FTS3, Scott hints, will have an API similar to FTS2. Will update the wiki with update design very soon. However the basic tasks are as follows:

# Build Mozilla with FTS2 support
# Design Database Table
# Code for table generation in nsNavHistory.cpp
# nsNavFullTextIndexHelper will remain the same
# UI
# Tokenizer System
# Support for other Languages
# Performance
# Test

[vertex3d2004 at gmail dot com]

Talk:Places:Full Text Indexing

2007-07-07T08:03:35Z

Mindboggler:

It sounds like a lot of what you're discussing here matches what we did for fts1 in SQLite. fts2 was a great improvement in terms of performance, and has no breaking changes expected. I'm considering folding in ability to index external content for fts3, but customers actually talking to me get some priority :-).

[shess at google dot com]

Have gone through fts2 again. Looks promising. FTS2 does not support indexing extenal content. Although, this is not ideal, we are still going for this looking at other benefits. Also, FTS3, Scott hints, will have an API similar to FTS2. Will update the wiki with update design very soon. However the basic tasks are as follows:

1) Build Mozilla with FTS2 support
2) Design Database Table
3) Code for table generation in nsNavHistory.cpp
3) nsNavFullTextIndexHelper will remain the same
4) UI
5) Tokenizer System
6) Support for other Languages
7) Performance
8) Test

Places:Full Text Indexing

2007-06-13T05:22:25Z

Mindboggler: /* Front-End */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
===nsNavQueryParser===
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
# [http://www.vldb.org/conf/1992/P353.PDF An Efficient Indexing Technique for full-text database systems] Justin Jobel, Alistair Moffet, Ron Sacks Davis
# [http://www.n3labs.com/pdf/putz91using-inverted-DBMS.pdf Using a relational database for an Inverted Text index] Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-11T05:26:47Z

Mindboggler: /* References */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
# [http://www.vldb.org/conf/1992/P353.PDF An Efficient Indexing Technique for full-text database systems] Justin Jobel, Alistair Moffet, Ron Sacks Davis
# [http://www.n3labs.com/pdf/putz91using-inverted-DBMS.pdf Using a relational database for an Inverted Text index] Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T10:29:43Z

Mindboggler: /* nsNavHistoryQuery */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T10:29:00Z

Mindboggler: /* nsNavHistoryQuery */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsINavHistoryService);// no query parameters will get all history

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T10:27:29Z

Mindboggler: /* nsNavHistoryQuery */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"] .getService(Components.interfaces.nsINavHistoryService);// no query parameters will get all history

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</pre>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T10:25:41Z

Mindboggler: /* nsNavHistoryQuery */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System.

<pre>
var historyService = Components.classes["@mozilla.org/browser/nav-history-service;1"] .getService(Components.interfaces.nsINavHistoryService);// no query parameters will get all history

var options = historyService.getNewQueryOptions();
var query = historyService.getNewQuery();
query.searchTerms = "Mozilla Firefox";

// execute the query
var result = historyService.executeQuery(query, options);
</prev>

The result will contain a list of URI. A number of options can be specified in query and options making it very powerful. Conjunctive queries can be also be executed with historyService.executeQueries and a list of query as parameter.

Internally the function calls nsNavFullTextIndex::searchDocument(searchTerms) which returns a list of URI ranked according to algorithm described in SearchDocument(terms) function that will be described later in this document. The list of URI is further filtered by the other parameters set in query and options variable. In case of executeQueries method, the list is aggregated with results from multiple queries.

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T09:45:57Z

Mindboggler: /* Detailed Design */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially four classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.

The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-10T09:45:04Z

Mindboggler: /* Database Design */

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
 
{| border="1" cellpadding="2"
|+'''Posting Table'''
|-
!column!!type!!bytes!!description
|-
|wordnum||integer||4||This is the foreign key matching that in the word table
|-
|firstdoc||integer||4||lowest doc id referenced in the block
|-
|flags||tinyint||1||indicates the block type, length of doc list, sequence number
|-
|block||varbinary||<=255||contains encoded document and/or position postings for the word
|}

To Do
# We might need a table or two more for ranking efficiently
# Check if SQLite has varbinary datatype. There is a BLOB data type, I am sure.

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially five classes for the back-end:
# nsNavHistoryQuery
# nsNavFullTextIndexHelper
# nsNavFullTextIndex
# nsNavFullTextAnalyzer

===nsNavHistoryQuery===

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.

The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned

===nsNavFullTextIndex===

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document is used.
}</pre>

===nsNavFullTextIndexHelper===

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

===nsNavFullTextAnalyzer===

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

===nsNavFullTextTokenStream===

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
# Tokenizer: The input is an inputstream.
# TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-06-02T10:44:07Z

Mindboggler:

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

'''Actor: User''' 
- Visit Page 
- Search 
- Clear History 

'''Actor: Browser''' 
- Expire Page 

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

{| border="1" cellpadding="2"
|+'''Word Table'''
|-
!columnn!!type!!bytes!!description
|-
|word||varchar||<=100||term for indexing(Shouldn't it be unicode? how do i store unicode?)
|-
|wordnum||integer||4||unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
|-
|doc_count||integer||4||number of documents the word occurred in
|-
|word_count||integer||4||number of occurrences of the word
|}
Posting Table
column type bytes description
wordnum integer 4 This is the foreign key matching that in the word table
firstdoc integer 4 lowest doc id referenced in the block
flags tinyint 1 indicates the block type, length of doc list, sequence number
block varbinary <=255 contains encoded document and/or position postings for the word

To Do
1) We might need a table or two more for efficiently ranking
2) See how docid will referenced
3) check if there is varbinary. There is a BLOB

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially five classes for the back-end:
1) nsNavHistoryQuery
2) nsNavFullTextIndexHelper
3) nsNavFullTextIndex
4) nsNavFullTextAnalyzer

nsNavHistoryQuery.

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.

The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned

nsNavFullTextIndex

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.

struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document is used.
}

nsNavFullTextIndexHelper

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

nsNavFullTextAnalyzer

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

nsNavFullTextTokenStream

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
1) Tokenizer: The input is an inputstream.
2) TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-05-30T13:22:57Z

Mindboggler:

ToDO: Formatting, it is badly formatted. Any idea on how to effectively format code?
== Overview ==

Full Text Indexing feature will allow user to search for a word/phrase from the pages that he has visited. The search query will be tightly integrated with Places's nsNavHistoryService. The tighter integration will allow queries like "search for pages visited between 01/05/07(dd/mm/yy) to 20/05/07(dd/mm/yy) containing the word 'places'"

== Design Descision ==

A number of options were looked into before proposing this design. The options included implementing using CLucene(like flock), SQLite's FTS1 and FTS2 module, implementation using B+ Trees, using relational database etc.. The following text will briefly describe the advantage and disadvantage of all the implementation methods.

CLucene is a full-text indexing engine that stores the index as B+ Trees in files. It uses a very efficient method for storage and retrieval. It has an excellent support for CJK languages. The Places system is a new incorporation into firefox. Hence, it is important that during its initial stages all the code that is written or used is flexible, small and tightly integrated with it. Tighter integration would allow future enhancements specific to firefox. Hence this approach was dropped.

SQLite's FTS1 and FTS2 module are open source implementation of full-text indexing integrated with SQLite. FTS1 and FTS2 stores the text in B+ Trees and the access to the full-text index is using SQL. However, there are number of short-comings for our usage. FTS2 is still in stage where the API and data might change without backward compatibility. FTS1 does not have custom tokenizer which means no CJK support. Also, FTS1 stores the entire page duplicating what is aleady stored in the cache.

A custom implementation using B+ Tree is a very good option but however, it would require additional B+ Tree engine. In light of availability of an efficient algorithm for implementing full-text indexing using relational database, this method is used.

A naive implementation of full-text indexing is very costly in terms of storage. I'll briefly explain how it is so. Let us define term. A term is any word that appear in a page. So a relational database contains a table with two columns, term and id. Another table contains two columsn term id and doc id(id of the document the term appeard in). The GNU manuals were analyzed [1]. It is 5.15 Mb of text containing 958,774 occurrences of word out of which 27,554 are unique. But table 2 will require that every occurrence has a corresponding doc id. If term id were stored as int, the amount of space required to store the first column alone, would be 958,774 * 4 bytes, which is about 3 Mb. A B+ Tree implementation is atleast 3Mb more efficient. However a nice encoding scheme and storage model proposed by [2] is almost as efficient as a B+ Tree implementation. This algorithm also leverages the capabilites of relational database system while not losing too much in terms of storage and performance. Hence I propose to implement this algorithm.

== Use Case ==

Actor: User
- Visit Page
- Search
- Clear History

Actor: Browser
- Expire Page

The use cases above will be used to validate the design.

== Database Design ==

TODO: Check the url_table and put it here. The url_table acts as the document table. The url table will contain additionally the document length

Word Table
columnn type bytes description
word varchar <=100 term for indexing(Shouldn't it be unicode? how do i store unicode?)
wordnum integer 4 unique id. Integer works cause the number of unique words will be atmost a million. Non-english language?
doc_count integer 4 number of documents the word occurred in
word_count integer 4 number of occurrences of the word

Posting Table
column type bytes description
wordnum integer 4 This is the foreign key matching that in the word table
firstdoc integer 4 lowest doc id referenced in the block
flags tinyint 1 indicates the block type, length of doc list, sequence number
block varbinary <=255 contains encoded document and/or position postings for the word

To Do
1) We might need a table or two more for efficiently ranking
2) See how docid will referenced
3) check if there is varbinary. There is a BLOB

Note that the tables structure is subject to change to improve efficiency. New Tables might be formed and/or the table might add/remove column

== Detailed Design ==

Classes

There are essentially five classes for the back-end:
1) nsNavHistoryQuery
2) nsNavFullTextIndexHelper
3) nsNavFullTextIndex
4) nsNavFullTextAnalyzer

nsNavHistoryQuery.

This class is already implemented with all features except searching text. The mechanism of searching is no different from what described in http://developer.mozilla.org/en/docs/Places:Query_System. The result of nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful with the number of options that you can specify. Conjunctive queries can be executed.

The search function uses the searchTerms to call nsNavFullTextIndex::searchDocument(term). This returns a url list based on a certain ranking. This function further filters based on other criteria such as date etc.. provided in the query options. The filtered url list is returned

nsNavFullTextIndex

This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.

Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.

struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}

AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];

//hopefully sqlite caches query results, it will be inefficient otherwise.

if (term already in table) {
record = executeQuery("SELECT ﬁrstdoc, ﬂags, block FROM postings
WHERE word = termName
AND ﬁrstdoc == (Given as >= in [2], == is correct in my opinion)
(SELECT max (ﬁrstdoc) FROM postings
WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(ﬁrstdoc) FROM postings
WHERE word = termName
AND ﬁrstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.

iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}

RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.

e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;

query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty

}

SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document is used.
}

nsNavFullTextIndexHelper

This class contains mechanism for scheduling documents for indexing. It works pretty much like nsNavHistoryExpire. The scheduling balances indexing with overall firefox performance. It has a timer which when expires picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index. When and how the timer is set, dictates the overall efficiency. Whenever a user visits a page, the timer is triggered.

When user visits a page, the history service calls OnAddURI function of this class. This function starts the timer. When timer expires the call back function is called. The function checks If there are any more document to be indexed and it has not been a long while since the user called addURI. If so, then it resets the timer and waits to expire again.

nsNavFullTextAnalyzer

The analyzer design is similar to the way Lucene works. The Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated from the tokenizer and discards those that are not required for indexing. E.g. is, a, an the can be discarded. This is enabled by by tokenStream classes. Refer to nsNavFullTextTokenStream and Lucene API for more detail.

nsNavFullTextTokenStream

TokenStream is an abstract class with three methods next, reset and close. The next method returns a token. Token is a class representing startoffset, endoffset and termtext. TokenStream needs input to tokenize. There are two concrete class for TokenStream:
1) Tokenizer: The input is an inputstream.
2) TokenFilter: The input is another TokenStream. This works like pipes.

== Front-End ==
1) nsNavQueryParser
The function of this is to break the query given in human language into a graph of TermQuery and BooleanQuery. BooleanQuery is a struct with two operand(each a TermQuery or BooleanQuery) and an operator(AND, OR, NOT, XOR). Although the idea here is to implement all kind of queries as in: http://lucene.apache.org/java/docs/queryparsersyntax.html eventually. nsNavHistoryQuery is used to query using the query struct. The results of which is url_list which is displayed using view.

== References ==
[1] An Efficient Indexing Technique for full-text database systems: Justin Jobel, Alistair Moffet, Ron Sacks Davis
[2]Using a relational database for full-text index, Steve Putz, Xerox PARC

Places:Full Text Indexing

2007-05-26T15:43:48Z

Mindboggler: /* Detailed Design */

Places:Full Text Indexing

2007-05-26T06:25:59Z

Mindboggler:

Places:Full Text Indexing

2007-05-25T11:16:47Z

Mindboggler: The complete design of Places: Full Text Indexing feature

Places:Full Text Indexing

2007-05-25T06:43:05Z

Mindboggler:

Community:SummerOfCode07:Brainstorming

2007-03-24T14:42:01Z

Mindboggler: /* Suggestion List */

Projects with a confirmed mentor and approved by the Mozilla project SoC administrator will be moved to [[Community:SummerOfCode07]]. Potential students should look at that page to find project ideas for which we'd like submissions.

==Ground Rules==

* '''Be specific'''. It's hard to understand the impact of, or the size of, vague proposals.
* '''Consider size'''. The student has eight weeks to design, code, test and document the proposal. It needs to fill, but not overfill, that time.
* '''Do your research'''. Support the idea with well-researched links.
* '''Don't morph other people's ideas'''. If you have a related idea, place it next to the existing one, or add a comment.
* '''Insert only your own name into the Mentor column''', and then only if you are willing to take on the responsibility. Potential mentors [[Community:SummerOfCode07:Mentors|sign up here]].

([http://weblogs.mozillazine.org/gerv/archives/2006/05/making_a_soc_project_list.html More thoughts on making a good list])

==Suggestion List==

Last year's ideas: [[Community:SummerOfCode06|General]], [[Thunderbird:Summer_Of_Code_2006|Thunderbird]]

Please use this format for submitting ideas.
{| border="1" cellpadding="3" width="100%" valign="top"
|- align="center"
| style="background-color: #efefef;" | '''Title'''
| style="background-color: #efefef;" | '''Abstract - links to details/bugs/etc'''
| style="background-color: #efefef;" | '''Reporter'''
| style="background-color: #efefef;" | '''Mentor(s)'''
| style="background-color: #efefef;" | '''Comments'''
|-
|-
| valign="top" | Internal streamed audio player for Firefox
| valign="top" |This tool will play streamed audio files in firefox itself.( For example .ra files). Currently there must be external player (like real player) to play them.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |When user clicks to play a streamed audio file, I am suggesting two methods to play it. First one is view the downloader and supply a Firefox’s “internal player” to play it. Second one is supply a player options within tool bar. I don’t know the possibility of this project. Please comment on this. 
 
maxaeran: how does this fit with the WHAT-WG Audio object proposals? - Gerv
|-
| valign="top" | Extension for bookmarking and sharing scripts and extensions.
| valign="top" |An extension tying into a web-based tool - a "del.icio.us for extensions" that also allows users to load their preferred extensions on any firefox browser in seconds.
| valign="top" | Hivemya
| valign="top" |
| valign="top" |JS coded extension (w/ Open ID based accounts?). SQL-based bookmark accounts directly linking to XPIs, with support for RSS. Is extension auto-installation possible through RSS/JSON subscription? 
 
Who regularly wants to "load their preferred extensions on any firefox browser in seconds"? - Gerv
 
 
 
That would be a side-benefit. Basically, it's adding the same batch-installation that is currently being discussed in the Google Greasemonkey group, but for extensions at large. Once enough extensions are bookmarked and tagged, the social bookmarking system could be integrated into the Mozilla Add-Ons site.
 
 
I think it's an *important* idea because the quantity of extensions is going to increase ten-fold over the next year, and therefore there is much more of a need for (1) attention agents and (2) spam filters; both needs can be solved through a social bookmarking system. - Hivemya

|-
| valign="top" | Firefox Tab Grouping
| valign="top" | This feature will group the logically related tabs into related groups in Firefox into similar logical groups.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |These groups can be made by user or it can be done automatically ( can be configured ) This proposal is too vague - Gerv
|-
| valing="top" | Allow the option of passing URL to helper application instead of downloading
| valign="top" | See {{bug|225882}} and {{bug|137339}}
|
|
|
|-
| valign="top" | Metalink
| valign="top" | A simple XML format for downloads ({{bug|331979}}) that lists mirrors and checksums, along with other useful metadata such as mirror location. Listing multiple URLs for a file increases availability while the checksums guarantee integrity and let downloads be repaired automatically. You can also filter downloads by location and other things. This is currently supported by over ten download managers.
| valign="top" | Antini
| valign="top" |
| valign="top" |
|-
| valign="top" | Auto verify MD5/SHA1 hashes & PGP signatures
| valign="top" | Automatically verifying MD5/SHA1 hashes, and optionally PGP signatures, of downloads. When you have downloaded a file, the download manager should try to download filename.md5, filename.sha, filename.asc and run the associated tool on the downloaded file to verify. Mark the entry as red or something in the download manager, and change the Open link to Info link, if the file did not verify. The Info link would open a page explaining what is wrong. It could perhaps have a open or preferably just delete file button. More difficult case would be to get the md5/sha1 signature if it is just embedded on the page where the download link is, but you could try some heuristics... (see also bug 292481).
| valign="top" | HeikkiToivonen
| valign="top" |
| valign="top" | PGP signature support would probably be easiest to build on top of Enigmail extension. See Metalink which supports associating MD5/SHA1 hashes and PGP signatures with files, and [http://microformats.org/wiki/hash-examples hash microformat] for embedding within a page. Making three extra 404 hits on a website for each file downloaded is not a friendly thing to do (remember favicon.ico) - Gerv
|-
| valign="top" | Internal audio
| valign="top" | [https://bugzilla.mozilla.org/show_bug.cgi?id=92110 Allow Firefox to play WAV and AIFF audio files internally]
| valign="top" | schapel
| valign="top" |
| valign="top" | This would probably need to be done following the [http://www.whatwg.org/specs/web-apps/current-work/#sound WHAT-WG specs for the Audio() object] - Gerv
|-
| valign="top" | Memory Manager
| valign="top" | Try to implement an internal memory manager. It should, for example, pre-allocate about 10% of system ram memory and try to operate within that memory. All calls to "free()" should release memory to this global memory pool and all calls to "malloc()" must allocate memory from this memory pool. If properly implemented, we can even reduce the overheads that may arise due to such an implementation.
| valign="top" | [[User:Shyamk|Shyam]]
| valign="top" |
| valign="top" | A repost of the idea I posted [http://wiki.mozilla.org/Firefox/Feature_Brainstorming:Performance Here] (Firefox3 Brainstorming). Shyam: what qualifications do you have to mentor this project? - Gerv Grev: Replied to you by e-mail, and updated this [http://wiki.mozilla.org/Community:SummerOfCode07:Mentors wiki]. Needless to say, I {can/would like} to get dirty in implementing this along with the student (In case of time constraints). Shyam: mentors need to have Mozilla community experience. A mentor is not a co-worker by another name :-) - Gerv. Grev: Point taken ! Removed my name from mentor column. I can help the student who comes in to work on this as an outside contributor, and not as GSoC student, as I just graduated :-(
|-
| valign="top" | Image type finder
| valign="top" | Implement an image type finder as described in [https://bugzilla.mozilla.org/show_bug.cgi?id=18574#c672 this Bugzilla comment]
| valign="top" | schapel
| valign="top" |
| valign="top" |
|-
| valign="top" | Remote Cookies
| valign="top" | Write a Firefox extension that stores/retrieves cookies on a server instead of in the local cookies.txt file. This will enable Firefox users to use the same cookies on all their computers and Firefox profiles.
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | Think of never having to authenticate against all of your websites again! If the student runs out of time, I will write code to keep the server contents encrypted and the SSL delivery/retrieval mechanism. Student needs to write the GUI and web progress listener hooks. Why just cookies? Why not full remote profiles? Do you have an algorithm for handling merge conflicts? How does this relate to Google Sync? - Gerv
|-
| valign="top" | Broken Add-on Detector
| valign="top" | Let the user show a problem with the application which happens only with extensions enabled (fine in -safe-mode) and let the application search for the broken add-on/conflicting add-ons itself.
| valign="top" | [[User:Archaeopteryx|Archaeopteryx]]
| valign="top" |
| valign="top" | The user recognizes a problem with the application (Fx, Tb, ...) which does not happen in safe mode, so a wizard will demand him to perform the steps to reproduce in normal mode and for comparing in safe mode and try to find the problematic extension by disabling an extension, starting the app and testing and continue with the next extension. Basically, I think about the red code (broken translations and so on) or obvious problems with doubling the event handler which let the tab control keys jump two tabs instead of one. Finally, the problematic extension(s) should be disabled and the user informed about this action.
|-
| valign="top" | E-mail send/receive progress dialog
| valign="top" | Write a Thunderbird extension that displays a dialog showing the progress of e-mail send/receive, showing the total number of mails to process, their size and a progress bar.
| valign="top" | [[User:piecu|Bartosz Piec]]
| valign="top" |
| valign="top" | Look at Microsoft Outlook or Outlook Express for an example dialog What benefits does having such a dialog give us? - Gerv
|-
| valign="top" | Bugzilla: Duplicate Bug Detection
| valign="top" | Implement a system in Bugzilla that detects automatically that the user has likely entered a bug that is a duplicate of another bug, and display a list of bugs that this bug might be a duplicate of.
| valign="top" | [[User:MaxKanatAlexander|mkanat]]
| valign="top" |
| valign="top" | [https://launchpad.net/malone Malone] can do this now, although I'm not certain its code is actually open source. (Anyhow, GPL'ed code can't be included in Bugzilla, which uses the MPL.)
|-
| valign="top" | "Search as you Type in addressbar" extension
| valign="top" | This extension will search in local bookmark and History
| valign="top" | jigar shah
| valign="top" | [[User:jigarashah|jigar shah]]
| valign="top" | Most of the time user want to find a page on a particular website; say mozilla developer, He goes to that website and browses through all available links. If when he starts typing in addressbar he gets suggestions based on his bookmarks and History it will reduce his search time. This is easy to do in Firefox 3 since there are plans to add SQLLite in FF3. Don't know about possibilities for FF2.
|-
| valign="top" | ODF stylesheet support
| valign="top" | Extension using XSLT stylesheets to make ODF documents viewable in-browser
| valign="top" | Gerv
| valign="top" |
| valign="top" | [https://addons.mozilla.org/firefox/1888/ ODFReader] already exists, although it's quite simple, for OpenDocument Text only, and requires a stylesheet whose licensing isn't quite compatible with that of Mozilla. This project would enhance ODT support, and perhaps add support for ODS (spreadsheet) and ODP (presentation), such that these types could be reliably viewed in a pleasant (if not 100% accurate) way directly in the browser. A "Save" link or button would also be provided, for the potentially confused.
|-
| valign="top" | Firefox 2 Go.
| valign="top" | Write an extension that will allow users to sign into firefox anywhere in the world and have their history,bookmarks,browser settings, plugins(firefox profiles) automaticaly loaded into the browser. They will basicaly have a browser that will go anywhere they do. Ofcourse when they sign off everything will be removed if they wish.

Please post comments on this idea
| valign="top" | Peter Kemp (BCIT Student)
| valign="top" | Looking for mentor
| valign="top" | Why should you be limited to surfing your way and style only at home! What if you could travel anywear in the world, to any computer and your browser would be right there for you. It would Supply you all of your bookmarks, your browser settings, your history and even the plugins that you use everyday.

No long is firefox just a browser, but a travel companion.

Firefox, you travel, we follow. 
Comment(jigar):
Google Browser Sync extension already exist for this purpose [http://www.google.com/tools/firefox/browsersync/ Google Browser Sync]
 
Such an extension would have value if it were open source and usable with any storage backend, not just Google's - Gerv
 
Comment(Peter Kemp):
Thank you jigar, I didnt know about the browsersync.
 
(To Gerv):For the storage backend i was thinking about implementing it via XML files. What is your opinion for a backend storage ? Thank you
|-
| valign="top" | SVG as an image format
| valign="top" | One of the possibilities that having a native SVG implementation in the browser provides is being able to use SVG in contexts where normally a raster image would be used, such as <html:img> and CSS properties that accept images.

Someone interested in taking this on as a SoC project would need to be pretty familiar with the Mozilla codebase, as this involves getting bits of code that weren't originally planned to work with one another to play nicely together.
| valign="top" | tor
| valign="top" | tor
| valign="top" |

|-
| valign="top" | Merge the two existing French spelling dictionaries
| valign="top" | There are currently [https://addons.mozilla.org/thunderbird/dictionaries/?lang=fr two French spelling dictionaries] for MySpell. The first was made available from a former ISpell dictionary, and it was later "enhanced" by another group wanting to support new spellings only (1990 reform), although those are not mandatory. As a result, we have two dictionaries, but none of practical use (the first is outdated, the other is underlining perfectly valid words).

A possible implementation of this project would be to take the new dictionary, re-add the hundreds of words that were removed from the old one, and enhance it in other ways (for example, HunSpell allows you to remove some words from the spelling suggestions without underlining them). It might looks like a trivial task, but it is not the case. There were structural changes in the affix dictionary file which can't be resolved by a simple diff.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Create a new French dictionary (HunSpell) from scratch
| valign="top" | No matter how good the French spelling dictionary may become, it can't be shipped with Firefox or Thunderbird because of licensing issues (It's GPL only, Mozilla products are tri-licensed)[http://frenchmozilla.sourceforge.net/blog/index.php/2006/02/02/21-correction-orthographique-et-logiciels-mozilla explanation in French].
This proposal is to build a new French dictionary from scratch, taking advantage of the new features in [http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754 HunSpell]

Someone interested in taking this on as a SoC project would probably need to have a strong background in linguistics or a similar field.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Index visited pages. Allow query on it.
| valign="top" | People need to re-find the information that they have already found on the web. This mechanism is currently provided through bookmarks, history and the navigation buttons. Firefox 3 is set to include a number of features through “Places”. “Places” can be further enhanced by allowing user “word-search” the visited web pages. This project will add indexing capabilities to firefox and allow user queries on visited web pages. Thus helping the user find what they need to know. Reported as an enhancement for firefox3. {{bug|342913}}
| valign="top" | [http://wiki.mozilla.org/User:Mindboggler Kunal]
| valign="top" | Looking for mentor
| valign="top" |Places will add exciting capabilities to firefox. Indexing visited pages is a consistent demand seen in the wikis. Further users spend a lot of time on re-finding information on the web. A feature like this will enhance his user-experience. Sooner or later, competiting browsers will implement this feature. An early start on this will make firefox even more competitive compared to its rival browsers.
|}

Community:SummerOfCode07:Brainstorming

2007-03-24T14:38:24Z

Mindboggler: /* Suggestion List */

Projects with a confirmed mentor and approved by the Mozilla project SoC administrator will be moved to [[Community:SummerOfCode07]]. Potential students should look at that page to find project ideas for which we'd like submissions.

==Ground Rules==

* '''Be specific'''. It's hard to understand the impact of, or the size of, vague proposals.
* '''Consider size'''. The student has eight weeks to design, code, test and document the proposal. It needs to fill, but not overfill, that time.
* '''Do your research'''. Support the idea with well-researched links.
* '''Don't morph other people's ideas'''. If you have a related idea, place it next to the existing one, or add a comment.
* '''Insert only your own name into the Mentor column''', and then only if you are willing to take on the responsibility. Potential mentors [[Community:SummerOfCode07:Mentors|sign up here]].

([http://weblogs.mozillazine.org/gerv/archives/2006/05/making_a_soc_project_list.html More thoughts on making a good list])

==Suggestion List==

Last year's ideas: [[Community:SummerOfCode06|General]], [[Thunderbird:Summer_Of_Code_2006|Thunderbird]]

Please use this format for submitting ideas.
{| border="1" cellpadding="3" width="100%" valign="top"
|- align="center"
| style="background-color: #efefef;" | '''Title'''
| style="background-color: #efefef;" | '''Abstract - links to details/bugs/etc'''
| style="background-color: #efefef;" | '''Reporter'''
| style="background-color: #efefef;" | '''Mentor(s)'''
| style="background-color: #efefef;" | '''Comments'''
|-
|-
| valign="top" | Internal streamed audio player for Firefox
| valign="top" |This tool will play streamed audio files in firefox itself.( For example .ra files). Currently there must be external player (like real player) to play them.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |When user clicks to play a streamed audio file, I am suggesting two methods to play it. First one is view the downloader and supply a Firefox’s “internal player” to play it. Second one is supply a player options within tool bar. I don’t know the possibility of this project. Please comment on this. 
 
maxaeran: how does this fit with the WHAT-WG Audio object proposals? - Gerv
|-
| valign="top" | Extension for bookmarking and sharing scripts and extensions.
| valign="top" |An extension tying into a web-based tool - a "del.icio.us for extensions" that also allows users to load their preferred extensions on any firefox browser in seconds.
| valign="top" | Hivemya
| valign="top" |
| valign="top" |JS coded extension (w/ Open ID based accounts?). SQL-based bookmark accounts directly linking to XPIs, with support for RSS. Is extension auto-installation possible through RSS/JSON subscription? 
 
Who regularly wants to "load their preferred extensions on any firefox browser in seconds"? - Gerv
 
 
 
That would be a side-benefit. Basically, it's adding the same batch-installation that is currently being discussed in the Google Greasemonkey group, but for extensions at large. Once enough extensions are bookmarked and tagged, the social bookmarking system could be integrated into the Mozilla Add-Ons site.
 
 
I think it's an *important* idea because the quantity of extensions is going to increase ten-fold over the next year, and therefore there is much more of a need for (1) attention agents and (2) spam filters; both needs can be solved through a social bookmarking system. - Hivemya

|-
| valign="top" | Firefox Tab Grouping
| valign="top" | This feature will group the logically related tabs into related groups in Firefox into similar logical groups.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |These groups can be made by user or it can be done automatically ( can be configured ) This proposal is too vague - Gerv
|-
| valing="top" | Allow the option of passing URL to helper application instead of downloading
| valign="top" | See {{bug|225882}} and {{bug|137339}}
|
|
|
|-
| valign="top" | Metalink
| valign="top" | A simple XML format for downloads ({{bug|331979}}) that lists mirrors and checksums, along with other useful metadata such as mirror location. Listing multiple URLs for a file increases availability while the checksums guarantee integrity and let downloads be repaired automatically. You can also filter downloads by location and other things. This is currently supported by over ten download managers.
| valign="top" | Antini
| valign="top" |
| valign="top" |
|-
| valign="top" | Auto verify MD5/SHA1 hashes & PGP signatures
| valign="top" | Automatically verifying MD5/SHA1 hashes, and optionally PGP signatures, of downloads. When you have downloaded a file, the download manager should try to download filename.md5, filename.sha, filename.asc and run the associated tool on the downloaded file to verify. Mark the entry as red or something in the download manager, and change the Open link to Info link, if the file did not verify. The Info link would open a page explaining what is wrong. It could perhaps have a open or preferably just delete file button. More difficult case would be to get the md5/sha1 signature if it is just embedded on the page where the download link is, but you could try some heuristics... (see also bug 292481).
| valign="top" | HeikkiToivonen
| valign="top" |
| valign="top" | PGP signature support would probably be easiest to build on top of Enigmail extension. See Metalink which supports associating MD5/SHA1 hashes and PGP signatures with files, and [http://microformats.org/wiki/hash-examples hash microformat] for embedding within a page. Making three extra 404 hits on a website for each file downloaded is not a friendly thing to do (remember favicon.ico) - Gerv
|-
| valign="top" | Internal audio
| valign="top" | [https://bugzilla.mozilla.org/show_bug.cgi?id=92110 Allow Firefox to play WAV and AIFF audio files internally]
| valign="top" | schapel
| valign="top" |
| valign="top" | This would probably need to be done following the [http://www.whatwg.org/specs/web-apps/current-work/#sound WHAT-WG specs for the Audio() object] - Gerv
|-
| valign="top" | Memory Manager
| valign="top" | Try to implement an internal memory manager. It should, for example, pre-allocate about 10% of system ram memory and try to operate within that memory. All calls to "free()" should release memory to this global memory pool and all calls to "malloc()" must allocate memory from this memory pool. If properly implemented, we can even reduce the overheads that may arise due to such an implementation.
| valign="top" | [[User:Shyamk|Shyam]]
| valign="top" |
| valign="top" | A repost of the idea I posted [http://wiki.mozilla.org/Firefox/Feature_Brainstorming:Performance Here] (Firefox3 Brainstorming). Shyam: what qualifications do you have to mentor this project? - Gerv Grev: Replied to you by e-mail, and updated this [http://wiki.mozilla.org/Community:SummerOfCode07:Mentors wiki]. Needless to say, I {can/would like} to get dirty in implementing this along with the student (In case of time constraints). Shyam: mentors need to have Mozilla community experience. A mentor is not a co-worker by another name :-) - Gerv. Grev: Point taken ! Removed my name from mentor column. I can help the student who comes in to work on this as an outside contributor, and not as GSoC student, as I just graduated :-(
|-
| valign="top" | Image type finder
| valign="top" | Implement an image type finder as described in [https://bugzilla.mozilla.org/show_bug.cgi?id=18574#c672 this Bugzilla comment]
| valign="top" | schapel
| valign="top" |
| valign="top" |
|-
| valign="top" | Remote Cookies
| valign="top" | Write a Firefox extension that stores/retrieves cookies on a server instead of in the local cookies.txt file. This will enable Firefox users to use the same cookies on all their computers and Firefox profiles.
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | Think of never having to authenticate against all of your websites again! If the student runs out of time, I will write code to keep the server contents encrypted and the SSL delivery/retrieval mechanism. Student needs to write the GUI and web progress listener hooks. Why just cookies? Why not full remote profiles? Do you have an algorithm for handling merge conflicts? How does this relate to Google Sync? - Gerv
|-
| valign="top" | Broken Add-on Detector
| valign="top" | Let the user show a problem with the application which happens only with extensions enabled (fine in -safe-mode) and let the application search for the broken add-on/conflicting add-ons itself.
| valign="top" | [[User:Archaeopteryx|Archaeopteryx]]
| valign="top" |
| valign="top" | The user recognizes a problem with the application (Fx, Tb, ...) which does not happen in safe mode, so a wizard will demand him to perform the steps to reproduce in normal mode and for comparing in safe mode and try to find the problematic extension by disabling an extension, starting the app and testing and continue with the next extension. Basically, I think about the red code (broken translations and so on) or obvious problems with doubling the event handler which let the tab control keys jump two tabs instead of one. Finally, the problematic extension(s) should be disabled and the user informed about this action.
|-
| valign="top" | E-mail send/receive progress dialog
| valign="top" | Write a Thunderbird extension that displays a dialog showing the progress of e-mail send/receive, showing the total number of mails to process, their size and a progress bar.
| valign="top" | [[User:piecu|Bartosz Piec]]
| valign="top" |
| valign="top" | Look at Microsoft Outlook or Outlook Express for an example dialog What benefits does having such a dialog give us? - Gerv
|-
| valign="top" | Bugzilla: Duplicate Bug Detection
| valign="top" | Implement a system in Bugzilla that detects automatically that the user has likely entered a bug that is a duplicate of another bug, and display a list of bugs that this bug might be a duplicate of.
| valign="top" | [[User:MaxKanatAlexander|mkanat]]
| valign="top" |
| valign="top" | [https://launchpad.net/malone Malone] can do this now, although I'm not certain its code is actually open source. (Anyhow, GPL'ed code can't be included in Bugzilla, which uses the MPL.)
|-
| valign="top" | "Search as you Type in addressbar" extension
| valign="top" | This extension will search in local bookmark and History
| valign="top" | jigar shah
| valign="top" | [[User:jigarashah|jigar shah]]
| valign="top" | Most of the time user want to find a page on a particular website; say mozilla developer, He goes to that website and browses through all available links. If when he starts typing in addressbar he gets suggestions based on his bookmarks and History it will reduce his search time. This is easy to do in Firefox 3 since there are plans to add SQLLite in FF3. Don't know about possibilities for FF2.
|-
| valign="top" | ODF stylesheet support
| valign="top" | Extension using XSLT stylesheets to make ODF documents viewable in-browser
| valign="top" | Gerv
| valign="top" |
| valign="top" | [https://addons.mozilla.org/firefox/1888/ ODFReader] already exists, although it's quite simple, for OpenDocument Text only, and requires a stylesheet whose licensing isn't quite compatible with that of Mozilla. This project would enhance ODT support, and perhaps add support for ODS (spreadsheet) and ODP (presentation), such that these types could be reliably viewed in a pleasant (if not 100% accurate) way directly in the browser. A "Save" link or button would also be provided, for the potentially confused.
|-
| valign="top" | Firefox 2 Go.
| valign="top" | Write an extension that will allow users to sign into firefox anywhere in the world and have their history,bookmarks,browser settings, plugins(firefox profiles) automaticaly loaded into the browser. They will basicaly have a browser that will go anywhere they do. Ofcourse when they sign off everything will be removed if they wish.

Please post comments on this idea
| valign="top" | Peter Kemp (BCIT Student)
| valign="top" | Looking for mentor
| valign="top" | Why should you be limited to surfing your way and style only at home! What if you could travel anywear in the world, to any computer and your browser would be right there for you. It would Supply you all of your bookmarks, your browser settings, your history and even the plugins that you use everyday.

No long is firefox just a browser, but a travel companion.

Firefox, you travel, we follow. 
Comment(jigar):
Google Browser Sync extension already exist for this purpose [http://www.google.com/tools/firefox/browsersync/ Google Browser Sync]
 
Such an extension would have value if it were open source and usable with any storage backend, not just Google's - Gerv
 
Comment(Peter Kemp):
Thank you jigar, I didnt know about the browsersync.
 
(To Gerv):For the storage backend i was thinking about implementing it via XML files. What is your opinion for a backend storage ? Thank you
|-
| valign="top" | SVG as an image format
| valign="top" | One of the possibilities that having a native SVG implementation in the browser provides is being able to use SVG in contexts where normally a raster image would be used, such as <html:img> and CSS properties that accept images.

Someone interested in taking this on as a SoC project would need to be pretty familiar with the Mozilla codebase, as this involves getting bits of code that weren't originally planned to work with one another to play nicely together.
| valign="top" | tor
| valign="top" | tor
| valign="top" |

|-
| valign="top" | Merge the two existing French spelling dictionaries
| valign="top" | There are currently [https://addons.mozilla.org/thunderbird/dictionaries/?lang=fr two French spelling dictionaries] for MySpell. The first was made available from a former ISpell dictionary, and it was later "enhanced" by another group wanting to support new spellings only (1990 reform), although those are not mandatory. As a result, we have two dictionaries, but none of practical use (the first is outdated, the other is underlining perfectly valid words).

A possible implementation of this project would be to take the new dictionary, re-add the hundreds of words that were removed from the old one, and enhance it in other ways (for example, HunSpell allows you to remove some words from the spelling suggestions without underlining them). It might looks like a trivial task, but it is not the case. There were structural changes in the affix dictionary file which can't be resolved by a simple diff.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Create a new French dictionary (HunSpell) from scratch
| valign="top" | No matter how good the French spelling dictionary may become, it can't be shipped with Firefox or Thunderbird because of licensing issues (It's GPL only, Mozilla products are tri-licensed)[http://frenchmozilla.sourceforge.net/blog/index.php/2006/02/02/21-correction-orthographique-et-logiciels-mozilla explanation in French].
This proposal is to build a new French dictionary from scratch, taking advantage of the new features in [http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754 HunSpell]

Someone interested in taking this on as a SoC project would probably need to have a strong background in linguistics or a similar field.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Index visited pages. Allow query on it.
| valign="top" | This will go into the places work. Reported as an enhancement for firefox3. {{bug|342913}}
| valign="top" | [http://wiki.mozilla.org/User:Mindboggler Kunal]
| valign="top" |
| valign="top" |
|}

Community:SummerOfCode07:Brainstorming

2007-03-24T14:34:41Z

Mindboggler: /* Suggestion List */

Projects with a confirmed mentor and approved by the Mozilla project SoC administrator will be moved to [[Community:SummerOfCode07]]. Potential students should look at that page to find project ideas for which we'd like submissions.

==Ground Rules==

* '''Be specific'''. It's hard to understand the impact of, or the size of, vague proposals.
* '''Consider size'''. The student has eight weeks to design, code, test and document the proposal. It needs to fill, but not overfill, that time.
* '''Do your research'''. Support the idea with well-researched links.
* '''Don't morph other people's ideas'''. If you have a related idea, place it next to the existing one, or add a comment.
* '''Insert only your own name into the Mentor column''', and then only if you are willing to take on the responsibility. Potential mentors [[Community:SummerOfCode07:Mentors|sign up here]].

([http://weblogs.mozillazine.org/gerv/archives/2006/05/making_a_soc_project_list.html More thoughts on making a good list])

==Suggestion List==

Last year's ideas: [[Community:SummerOfCode06|General]], [[Thunderbird:Summer_Of_Code_2006|Thunderbird]]

Please use this format for submitting ideas.
{| border="1" cellpadding="3" width="100%" valign="top"
|- align="center"
| style="background-color: #efefef;" | '''Title'''
| style="background-color: #efefef;" | '''Abstract - links to details/bugs/etc'''
| style="background-color: #efefef;" | '''Reporter'''
| style="background-color: #efefef;" | '''Mentor(s)'''
| style="background-color: #efefef;" | '''Comments'''
|-
|-
| valign="top" | Internal streamed audio player for Firefox
| valign="top" |This tool will play streamed audio files in firefox itself.( For example .ra files). Currently there must be external player (like real player) to play them.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |When user clicks to play a streamed audio file, I am suggesting two methods to play it. First one is view the downloader and supply a Firefox’s “internal player” to play it. Second one is supply a player options within tool bar. I don’t know the possibility of this project. Please comment on this. 
 
maxaeran: how does this fit with the WHAT-WG Audio object proposals? - Gerv
|-
| valign="top" | Extension for bookmarking and sharing scripts and extensions.
| valign="top" |An extension tying into a web-based tool - a "del.icio.us for extensions" that also allows users to load their preferred extensions on any firefox browser in seconds.
| valign="top" | Hivemya
| valign="top" |
| valign="top" |JS coded extension (w/ Open ID based accounts?). SQL-based bookmark accounts directly linking to XPIs, with support for RSS. Is extension auto-installation possible through RSS/JSON subscription? 
 
Who regularly wants to "load their preferred extensions on any firefox browser in seconds"? - Gerv
 
 
 
That would be a side-benefit. Basically, it's adding the same batch-installation that is currently being discussed in the Google Greasemonkey group, but for extensions at large. Once enough extensions are bookmarked and tagged, the social bookmarking system could be integrated into the Mozilla Add-Ons site.
 
 
I think it's an *important* idea because the quantity of extensions is going to increase ten-fold over the next year, and therefore there is much more of a need for (1) attention agents and (2) spam filters; both needs can be solved through a social bookmarking system. - Hivemya

|-
| valign="top" | Firefox Tab Grouping
| valign="top" | This feature will group the logically related tabs into related groups in Firefox into similar logical groups.
| valign="top" | maxaeran
| valign="top" |
| valign="top" |These groups can be made by user or it can be done automatically ( can be configured ) This proposal is too vague - Gerv
|-
| valing="top" | Allow the option of passing URL to helper application instead of downloading
| valign="top" | See {{bug|225882}} and {{bug|137339}}
|
|
|
|-
| valign="top" | Metalink
| valign="top" | A simple XML format for downloads ({{bug|331979}}) that lists mirrors and checksums, along with other useful metadata such as mirror location. Listing multiple URLs for a file increases availability while the checksums guarantee integrity and let downloads be repaired automatically. You can also filter downloads by location and other things. This is currently supported by over ten download managers.
| valign="top" | Antini
| valign="top" |
| valign="top" |
|-
| valign="top" | Auto verify MD5/SHA1 hashes & PGP signatures
| valign="top" | Automatically verifying MD5/SHA1 hashes, and optionally PGP signatures, of downloads. When you have downloaded a file, the download manager should try to download filename.md5, filename.sha, filename.asc and run the associated tool on the downloaded file to verify. Mark the entry as red or something in the download manager, and change the Open link to Info link, if the file did not verify. The Info link would open a page explaining what is wrong. It could perhaps have a open or preferably just delete file button. More difficult case would be to get the md5/sha1 signature if it is just embedded on the page where the download link is, but you could try some heuristics... (see also bug 292481).
| valign="top" | HeikkiToivonen
| valign="top" |
| valign="top" | PGP signature support would probably be easiest to build on top of Enigmail extension. See Metalink which supports associating MD5/SHA1 hashes and PGP signatures with files, and [http://microformats.org/wiki/hash-examples hash microformat] for embedding within a page. Making three extra 404 hits on a website for each file downloaded is not a friendly thing to do (remember favicon.ico) - Gerv
|-
| valign="top" | Internal audio
| valign="top" | [https://bugzilla.mozilla.org/show_bug.cgi?id=92110 Allow Firefox to play WAV and AIFF audio files internally]
| valign="top" | schapel
| valign="top" |
| valign="top" | This would probably need to be done following the [http://www.whatwg.org/specs/web-apps/current-work/#sound WHAT-WG specs for the Audio() object] - Gerv
|-
| valign="top" | Memory Manager
| valign="top" | Try to implement an internal memory manager. It should, for example, pre-allocate about 10% of system ram memory and try to operate within that memory. All calls to "free()" should release memory to this global memory pool and all calls to "malloc()" must allocate memory from this memory pool. If properly implemented, we can even reduce the overheads that may arise due to such an implementation.
| valign="top" | [[User:Shyamk|Shyam]]
| valign="top" |
| valign="top" | A repost of the idea I posted [http://wiki.mozilla.org/Firefox/Feature_Brainstorming:Performance Here] (Firefox3 Brainstorming). Shyam: what qualifications do you have to mentor this project? - Gerv Grev: Replied to you by e-mail, and updated this [http://wiki.mozilla.org/Community:SummerOfCode07:Mentors wiki]. Needless to say, I {can/would like} to get dirty in implementing this along with the student (In case of time constraints). Shyam: mentors need to have Mozilla community experience. A mentor is not a co-worker by another name :-) - Gerv. Grev: Point taken ! Removed my name from mentor column. I can help the student who comes in to work on this as an outside contributor, and not as GSoC student, as I just graduated :-(
|-
| valign="top" | Image type finder
| valign="top" | Implement an image type finder as described in [https://bugzilla.mozilla.org/show_bug.cgi?id=18574#c672 this Bugzilla comment]
| valign="top" | schapel
| valign="top" |
| valign="top" |
|-
| valign="top" | Remote Cookies
| valign="top" | Write a Firefox extension that stores/retrieves cookies on a server instead of in the local cookies.txt file. This will enable Firefox users to use the same cookies on all their computers and Firefox profiles.
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | [[User:ericjung|Eric H. Jung]]
| valign="top" | Think of never having to authenticate against all of your websites again! If the student runs out of time, I will write code to keep the server contents encrypted and the SSL delivery/retrieval mechanism. Student needs to write the GUI and web progress listener hooks. Why just cookies? Why not full remote profiles? Do you have an algorithm for handling merge conflicts? How does this relate to Google Sync? - Gerv
|-
| valign="top" | Broken Add-on Detector
| valign="top" | Let the user show a problem with the application which happens only with extensions enabled (fine in -safe-mode) and let the application search for the broken add-on/conflicting add-ons itself.
| valign="top" | [[User:Archaeopteryx|Archaeopteryx]]
| valign="top" |
| valign="top" | The user recognizes a problem with the application (Fx, Tb, ...) which does not happen in safe mode, so a wizard will demand him to perform the steps to reproduce in normal mode and for comparing in safe mode and try to find the problematic extension by disabling an extension, starting the app and testing and continue with the next extension. Basically, I think about the red code (broken translations and so on) or obvious problems with doubling the event handler which let the tab control keys jump two tabs instead of one. Finally, the problematic extension(s) should be disabled and the user informed about this action.
|-
| valign="top" | E-mail send/receive progress dialog
| valign="top" | Write a Thunderbird extension that displays a dialog showing the progress of e-mail send/receive, showing the total number of mails to process, their size and a progress bar.
| valign="top" | [[User:piecu|Bartosz Piec]]
| valign="top" |
| valign="top" | Look at Microsoft Outlook or Outlook Express for an example dialog What benefits does having such a dialog give us? - Gerv
|-
| valign="top" | Bugzilla: Duplicate Bug Detection
| valign="top" | Implement a system in Bugzilla that detects automatically that the user has likely entered a bug that is a duplicate of another bug, and display a list of bugs that this bug might be a duplicate of.
| valign="top" | [[User:MaxKanatAlexander|mkanat]]
| valign="top" |
| valign="top" | [https://launchpad.net/malone Malone] can do this now, although I'm not certain its code is actually open source. (Anyhow, GPL'ed code can't be included in Bugzilla, which uses the MPL.)
|-
| valign="top" | "Search as you Type in addressbar" extension
| valign="top" | This extension will search in local bookmark and History
| valign="top" | jigar shah
| valign="top" | [[User:jigarashah|jigar shah]]
| valign="top" | Most of the time user want to find a page on a particular website; say mozilla developer, He goes to that website and browses through all available links. If when he starts typing in addressbar he gets suggestions based on his bookmarks and History it will reduce his search time. This is easy to do in Firefox 3 since there are plans to add SQLLite in FF3. Don't know about possibilities for FF2.
|-
| valign="top" | ODF stylesheet support
| valign="top" | Extension using XSLT stylesheets to make ODF documents viewable in-browser
| valign="top" | Gerv
| valign="top" |
| valign="top" | [https://addons.mozilla.org/firefox/1888/ ODFReader] already exists, although it's quite simple, for OpenDocument Text only, and requires a stylesheet whose licensing isn't quite compatible with that of Mozilla. This project would enhance ODT support, and perhaps add support for ODS (spreadsheet) and ODP (presentation), such that these types could be reliably viewed in a pleasant (if not 100% accurate) way directly in the browser. A "Save" link or button would also be provided, for the potentially confused.
|-
| valign="top" | Firefox 2 Go.
| valign="top" | Write an extension that will allow users to sign into firefox anywhere in the world and have their history,bookmarks,browser settings, plugins(firefox profiles) automaticaly loaded into the browser. They will basicaly have a browser that will go anywhere they do. Ofcourse when they sign off everything will be removed if they wish.

Please post comments on this idea
| valign="top" | Peter Kemp (BCIT Student)
| valign="top" | Looking for mentor
| valign="top" | Why should you be limited to surfing your way and style only at home! What if you could travel anywear in the world, to any computer and your browser would be right there for you. It would Supply you all of your bookmarks, your browser settings, your history and even the plugins that you use everyday.

No long is firefox just a browser, but a travel companion.

Firefox, you travel, we follow. 
Comment(jigar):
Google Browser Sync extension already exist for this purpose [http://www.google.com/tools/firefox/browsersync/ Google Browser Sync]
 
Such an extension would have value if it were open source and usable with any storage backend, not just Google's - Gerv
 
Comment(Peter Kemp):
Thank you jigar, I didnt know about the browsersync.
 
(To Gerv):For the storage backend i was thinking about implementing it via XML files. What is your opinion for a backend storage ? Thank you
|-
| valign="top" | SVG as an image format
| valign="top" | One of the possibilities that having a native SVG implementation in the browser provides is being able to use SVG in contexts where normally a raster image would be used, such as <html:img> and CSS properties that accept images.

Someone interested in taking this on as a SoC project would need to be pretty familiar with the Mozilla codebase, as this involves getting bits of code that weren't originally planned to work with one another to play nicely together.
| valign="top" | tor
| valign="top" | tor
| valign="top" |

|-
| valign="top" | Merge the two existing French spelling dictionaries
| valign="top" | There are currently [https://addons.mozilla.org/thunderbird/dictionaries/?lang=fr two French spelling dictionaries] for MySpell. The first was made available from a former ISpell dictionary, and it was later "enhanced" by another group wanting to support new spellings only (1990 reform), although those are not mandatory. As a result, we have two dictionaries, but none of practical use (the first is outdated, the other is underlining perfectly valid words).

A possible implementation of this project would be to take the new dictionary, re-add the hundreds of words that were removed from the old one, and enhance it in other ways (for example, HunSpell allows you to remove some words from the spelling suggestions without underlining them). It might looks like a trivial task, but it is not the case. There were structural changes in the affix dictionary file which can't be resolved by a simple diff.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Create a new French dictionary (HunSpell) from scratch
| valign="top" | No matter how good the French spelling dictionary may become, it can't be shipped with Firefox or Thunderbird because of licensing issues (It's GPL only, Mozilla products are tri-licensed)[http://frenchmozilla.sourceforge.net/blog/index.php/2006/02/02/21-correction-orthographique-et-logiciels-mozilla explanation in French].
This proposal is to build a new French dictionary from scratch, taking advantage of the new features in [http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754 HunSpell]

Someone interested in taking this on as a SoC project would probably need to have a strong background in linguistics or a similar field.
| valign="top" | Benoit / [http://frenchmozilla.sourceforge.net/ The FrenchMozilla team]
| valign="top" |
| valign="top" |

|-
| valign="top" | Index visited pages. Allow query on it.
| valign="top" | https://bugzilla.mozilla.org/show_bug.cgi?id=342913
| valign="top" | http://wiki.mozilla.org/User:Mindboggler
| valign="top" |
| valign="top" |
|}