TODO: Improve the formatting. Any ideas on how to effectively format code?
== Overview ==
== Use Case ==
== Database Design ==
TODO: Check the url_table and include it here. The url_table acts as the document table; it will additionally contain the document length.
Word Table
 column   type   bytes   description
To Do
1) We might need a table or two more for efficient ranking.
2) See how docid will be referenced.
3) Check whether there is a varbinary type. There is a BLOB type.
Note that the table structure is subject to change to improve efficiency. New tables might be formed and/or tables might add/remove columns.
== Detailed Design ==
Classes
There are essentially five classes for the back-end:
1) nsNavHistoryQuery
2) nsNavFullTextIndexHelper
3) nsNavFullTextIndex
4) nsNavFullTextAnalyzer
5) nsNavFullTextTokenStream
nsNavHistoryQuery
This class is already implemented with all features except text search. The search mechanism is no different from what is described in http://developer.mozilla.org/en/docs/Places:Query_System. nsNavHistoryQuery::search will do a full-text search with nsNavHistoryQuery::searchTerms. This is very powerful given the number of options that can be specified; conjunctive queries can be executed.
The search function uses searchTerms to call nsNavFullTextIndex::searchDocument(term), which returns a URL list based on a certain ranking. The search function then filters this list further based on the other criteria (date etc.) provided in the query options, and the filtered URL list is returned.
nsNavFullTextIndex
This class interacts with the SQLite database. It implements the algorithms for adding a document to the index, removing a document from the index, and searching for a given term. The search function is used by nsNavHistoryQuery::search. See [2] for the algorithm used.
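As a rough sketch, an interface consistent with this description might look as follows. The signatures are assumptions for illustration, not a finalized API; only the three operations themselves come from this design.

 // Sketch only: these signatures are assumptions, not the finalized API.
 class nsNavFullTextIndex {
 public:
   // Index one visited page; two-pass algorithm described under AddDocument below.
   nsresult AddDocument(mozIStorageConnection* aConn, PRInt64 aDocId,
                        nsNavFullTextAnalyzer* aAnalyzer);

   // Drop a document from the index; see RemoveDocument below.
   nsresult RemoveDocument(mozIStorageConnection* aConn, PRInt64 aDocId);

   // Single-term lookup used by nsNavHistoryQuery::search; returns ranked doc ids.
   nsresult SearchDocument(const nsAString& aTerm, nsTArray<PRInt64>& aDocIds);
 };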
Block is a struct used to encode and decode a block. A block is variable-length delta encoded. Variable-length delta encoding compresses very efficiently, balancing speed and storage requirements.
 // A block field is at most 255 bytes wide. encode() returns the number of
 // elements of data that were encoded into encodedBlock, and reports the number
 // of bytes written through bytesUsed (decode() needs it). Input values must be
 // sorted ascending (doc ids / positions) so that every delta is non-negative.
 struct Block {
   enum { kBlockSize = 255 };

   // How to encode more efficiently, any idea?
   int encode(const int* data, int length,
              unsigned char* encodedBlock, int& bytesUsed) {
     int k = 0;     // write position in encodedBlock
     int prev = 0;  // previous value, for delta coding
     for (int i = 0; i < length; i++) {
       int delta = data[i] - prev;
       prev = data[i];
       // Split the delta into base-128 digits, least significant first.
       unsigned char digits[5];
       int j = 0;
       do {
         digits[j++] = delta % 128;
         delta /= 128;
       } while (delta != 0);
       if (k + j > kBlockSize) {
         bytesUsed = k;
         return i;  // block full: only the first i elements fit
       }
       // Emit the digits most significant first; the high bit means
       // "more bytes follow", the final byte has the high bit clear.
       while (j > 1)
         encodedBlock[k++] = 0x80 | digits[--j];
       encodedBlock[k++] = digits[0];
     }
     bytesUsed = k;
     return length;
   }

   // Returns the number of integers recovered into data.
   int decode(const unsigned char* encodedBlock, int length, int* data) {
     int j = 0;     // write position in data
     int value = 0; // delta being accumulated
     for (int i = 0; i < length; i++) {
       if (encodedBlock[i] & 0x80) {
         value = value * 128 + (encodedBlock[i] & 0x7F);  // continuation byte
       } else {
         value = value * 128 + encodedBlock[i];           // final byte of delta
         data[j] = value + (j > 0 ? data[j - 1] : 0);     // undo delta coding
         j++;
         value = 0;
       }
     }
     return j;
   }
 };
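A quick round trip with the struct above (the id values are made up for illustration):

 int docIds[] = { 18, 22, 30, 400 };   // sorted ascending
 unsigned char block[Block::kBlockSize];
 int bytesUsed = 0;
 Block codec;
 int n = codec.encode(docIds, 4, block, bytesUsed);  // n == 4, bytesUsed == 5
 int out[4];
 codec.decode(block, bytesUsed, out);  // out is { 18, 22, 30, 400 } again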
 AddDocument(connection, document, analyzer) {
   // AddDocument works in two passes: scan the document for terms, then invert.
   // We have the doc id. We will require a hash map library to make this very
   // efficient. term is a hash map; usage: term['termname'] = array of positions.
   while (analyzer.hasTerms()) {
     cTerm = analyzer.nextTerm();
     term[cTerm.Name].add(cTerm.pos);
   }
   iterator i = term.iterator();
   while (i.hasNext()) {
     termName = i.next();
     termPos[] = term[termName];
     // Hopefully sqlite caches query results; it will be inefficient otherwise.
     if (term already in table) {
       record = executeQuery("SELECT firstdoc, flags, block FROM postings
                              WHERE word = termName
                              AND firstdoc ==   -- given as >= in [2]; == is correct in my opinion
                                (SELECT max(firstdoc) FROM postings
                                 WHERE word = termName AND flags < 128)");
       // Refer to [2] for an explanation of this.
       // Only one record is retrieved, with flags == 0 or flags between 0 and 128.
       // When flags == 0, the block contains only the document list.
       if (flags == 0) {
         1) Decode the block.
         2) See if one more document can be fitted into it.
         3) Yes? Add to it:
            i) Find the position list:
               positionList = executeQuery("SELECT firstdoc, flags, block FROM postings
                                            WHERE firstdoc =
                                              (SELECT max(firstdoc) FROM postings
                                               WHERE word = termName
                                               AND firstdoc >= record.firstdoc AND flags >= 128)");
            ii) Try to add as many positions as possible to this block.
            iii) When the block is full, create a new row with firstdoc == currentdoc
                 and flags == 128 or 129, depending on whether the previous firstdoc
                 was the same as this one.
            iv) Go to ii) if there are positions left.
         4) No?
            i) Create a new row with firstdoc = docid, flags = 2.
            ii) To the block add the doc id and doc frequency, and then all the positions.
                Note that position listings are never split when flags == 2; we must try
                to fit the whole position listing in this block. In 99% of cases this
                should be possible (a small calculation will confirm it).
            iii) In the rare case that it is not possible, create two rows, one with
                 flags == 0 and one with flags == 128.
       }
       else {
         // This case is slightly more complex in that the same block holds both the
         // document list and the position list. We have to decode the block and try to
         // add the document id and all the positions to the position list. This might
         // not be possible, in which case we create two new rows, one with flags == 0
         // and the other with flags == 128.
       }
       update the word count for the term in the word table appropriately
     }
   }
   commit to the database
 }
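For reference, the pseudocode above assumes a postings table roughly like the following. This is a sketch only: the column set (word, firstdoc, flags, block) comes straight from the queries above, but the exact types depend on the varbinary/BLOB question in the To Do list.

 CREATE TABLE postings (
   word     TEXT,     -- the term
   firstdoc INTEGER,  -- first doc id covered by this block
   flags    INTEGER,  -- 0: doc list only; 0 < flags < 128: doc list plus positions;
                      -- flags >= 128: positions only (129: same firstdoc as previous row)
   block    BLOB      -- up to 255 bytes of variable-length delta-coded ids/positions
 );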
 RemoveDocument(docId) {
   // Inherently inefficient because of the structure we have adopted; needs to be
   // optimized. The general algorithm revolves around finding the record whose
   // firstDoc is immediately less than or equal to the docId we are removing.
   // E.g. say we want to delete docId = 20 and we have two records:
   //   firstDoc = 18, block = "blahblah"
   //   firstDoc = 22, block = "blahblah"
   // So we select the record with firstDoc = 18, the one immediately before docId = 20.
   Query to achieve this:
     SELECT word, firstdoc, block FROM postings
     WHERE firstdoc = (SELECT max(firstdoc) FROM postings
                       WHERE firstdoc <= docId)
   This returns a number of records with flags == 0, 0 < flags < 128, flags == 128 or flags > 128.
   For each record found, do the following:
     docAndPostingTable = Block::decode(block);  // decode the block
     docAndPostingTable.find(docId)              // check whether the decoded block contains our docId
     if docId found, check the flags:
       when flags == 0 (only a document list):
         remove the document and its frequency from the block
         update the word count for the term in the word table
         update the delta coding for the other docs in the block
         if docId == firstDoc, update firstDoc to the immediately following doc
         if no more docs are left in this row, delete the row
       when 0 < flags < 128 (both document list and posting table):
         remove the document from the block
         update the word count for the term in the word table
         update the delta coding for the other docs in the block
         remove all the postings of the document for the term in the block
         if docId == firstDoc, update firstDoc to the immediately following doc
         if no more docs are left in the row, delete the row
       when flags >= 128 (only a postings table):
         remove all the postings corresponding to the doc
         update firstDoc for the record
         delete the record if the block is empty
 }
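A sketch of the flags == 0 branch using the Block struct from above. Here record, blockLength and deleteRow are made-up placeholders, and the per-doc frequency handling is omitted for brevity:

 // Decode, drop docId, re-encode. encode() recomputes the deltas,
 // which is the "update the delta coding" step.
 int ids[Block::kBlockSize];
 int n = codec.decode(record.block, record.blockLength, ids);
 int m = 0;
 for (int i = 0; i < n; i++)
   if (ids[i] != docId)
     ids[m++] = ids[i];                // keep every doc except docId
 if (m == 0)
   deleteRow(record);                  // no docs left in this row
 else {
   codec.encode(ids, m, record.block, record.blockLength);
   record.firstDoc = ids[0];           // handles the docId == firstDoc case
 }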
 SearchDocument(terms) {
   // terms is something like "mac apple panther" -- basically a collection of terms.
   // The ranking algorithm as described in this document is used; a conjunctive-search
   // sketch follows below.
 }
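Conjunctive queries were mentioned above; one way a multi-term search could work is to intersect the per-term doc lists. In this sketch, searchTerm is a made-up stand-in for the single-term lookup described above (assumed to return doc ids sorted ascending), and the ranking step is left out:

 #include <algorithm>
 #include <iterator>
 #include <string>
 #include <vector>

 std::vector<int> searchTerm(const std::string& term);  // assumed single-term lookup

 std::vector<int> searchDocument(const std::vector<std::string>& terms) {
   std::vector<int> result = searchTerm(terms[0]);
   for (size_t t = 1; t < terms.size(); t++) {
     std::vector<int> docs = searchTerm(terms[t]);
     std::vector<int> merged;
     std::set_intersection(result.begin(), result.end(),
                           docs.begin(), docs.end(),
                           std::back_inserter(merged));
     result.swap(merged);              // keep docs containing every term so far
   }
   return result;                      // ranking is applied to this list
 }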
nsNavFullTextIndexHelper
This class contains the mechanism for scheduling documents for indexing. It works much like nsNavHistoryExpire: the scheduling balances indexing against overall Firefox performance. It has a timer which, when it expires, picks an unindexed document from the history and calls nsNavFullTextIndex::addDocument to index it. When and how the timer is set dictates the overall efficiency. Whenever the user visits a page, the timer is triggered.
When the user visits a page, the history service calls the OnAddURI function of this class, which starts the timer. When the timer expires, the callback function is called. The callback checks whether there are more documents to be indexed and whether the user has called addURI recently; if so, it resets the timer and waits for it to expire again, as sketched below.
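A sketch of that scheduling logic. kDelayMs, kQuietPeriodUs, the member names and the helper calls are assumptions, and error handling is omitted:

 // OnAddURI: called by the history service on every visit; (re)arms the timer.
 void nsNavFullTextIndexHelper::OnAddURI() {
   mLastAddURI = PR_Now();
   StartTimer(kDelayMs);
 }

 // Timer callback: index only while the user is quiet, so that indexing does
 // not compete with browsing.
 void nsNavFullTextIndexHelper::OnTimerFired() {
   if (!HasUnindexedDocuments())
     return;                          // backlog drained, stay idle
   if (PR_Now() - mLastAddURI < kQuietPeriodUs) {
     StartTimer(kDelayMs);            // user recently active: wait some more
     return;
   }
   mIndex->AddDocument(mConnection, NextUnindexedDocId(), mAnalyzer);
   StartTimer(kDelayMs);              // keep going until the backlog is empty
 }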
nsNavFullTextAnalyzer
The analyzer design is similar to the way Lucene works; the Lucene design enables support for multiple languages.
The Analyzer takes the tokens generated by the tokenizer and discards those that are not required for indexing, e.g. "is", "a", "an", "the" can be discarded. This is enabled by the TokenStream classes. Refer to nsNavFullTextTokenStream and the Lucene API for more detail.
nsNavFullTextTokenStream
TokenStream is an abstract class with three methods: next, reset and close. The next method returns a Token; Token is a class representing a start offset, an end offset and the term text. A TokenStream needs input to tokenize. There are two concrete classes of TokenStream (sketched after this list):
1) Tokenizer: the input is an input stream.
2) TokenFilter: the input is another TokenStream. This works like pipes.
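A sketch of the class shape, modeled on the Lucene API this design references. The stop-word filter is an illustrative TokenFilter, not a committed class, and the stop-word list is a placeholder:

 class Token {
 public:
   PRInt32 startOffset, endOffset;
   nsString termText;
 };

 class nsNavFullTextTokenStream {           // abstract
 public:
   virtual ~nsNavFullTextTokenStream() {}
   virtual PRBool Next(Token& aToken) = 0;  // PR_FALSE when exhausted
   virtual void Reset() = 0;
   virtual void Close() = 0;
 };

 // Example TokenFilter: wraps another TokenStream and drops stop words.
 class StopWordFilter : public nsNavFullTextTokenStream {
 public:
   StopWordFilter(nsNavFullTextTokenStream* aInput) : mInput(aInput) {}
   PRBool Next(Token& aToken) {
     while (mInput->Next(aToken)) {
       if (!IsStopWord(aToken.termText))
         return PR_TRUE;                    // keep this token
     }
     return PR_FALSE;
   }
   void Reset() { mInput->Reset(); }
   void Close() { mInput->Close(); }
 private:
   static PRBool IsStopWord(const nsString& aWord) {
     // Illustrative only; a real list would be language-dependent.
     return aWord.EqualsLiteral("is") || aWord.EqualsLiteral("a") ||
            aWord.EqualsLiteral("an") || aWord.EqualsLiteral("the");
   }
   nsNavFullTextTokenStream* mInput;
 };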
== Front-End ==
1) nsNavQueryParser
Its function is to break a query given in human language into a graph of TermQuery and BooleanQuery objects. BooleanQuery is a struct with two operands (each a TermQuery or a BooleanQuery) and an operator (AND, OR, NOT, XOR); a sketch follows below. The idea is to eventually implement all the kinds of queries described in http://lucene.apache.org/java/docs/queryparsersyntax.html. nsNavHistoryQuery is used to run the query struct; the result is a url_list, which is displayed using the view.
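As a sketch, the query graph structs might look like this (the field names are assumptions; the operand/operator shape is from the description above):

 struct TermQuery {
   nsString term;                     // a single search term
 };

 enum BooleanOperator { OP_AND, OP_OR, OP_NOT, OP_XOR };

 // A node of the query graph: each operand is either a TermQuery leaf or a
 // nested BooleanQuery, so e.g. "mac AND (apple OR panther)" becomes
 // BooleanQuery(OP_AND, mac, BooleanQuery(OP_OR, apple, panther)).
 struct BooleanQuery {
   BooleanOperator op;
   struct Operand {
     PRBool isTerm;
     TermQuery term;                  // used when isTerm
     BooleanQuery* boolQuery;         // used otherwise
   } left, right;
 };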
== References ==
[1] Justin Zobel, Alistair Moffat, Ron Sacks-Davis. An Efficient Indexing Technique for Full-Text Database Systems.
[2] Steve Putz. Using a Relational Database for Full-Text Index. Xerox PARC.