Places:Full Text Indexing: Difference between revisions

m
mNo edit summary
Line 63: Line 63:
===nsNavFullTextIndex===
===nsNavFullTextIndex===


This class interacts with SQLite database. This implements the algorithm for adding document to the index, removing document from the index and search for a given term. This search function is used by nsNavHistoryQuery::search function. Look at [2] for the algorithm used.
This class interacts with FTS2. It is responsible for executing queries that will generate  URI of matching documents. It will also generate text snippets to be displayed for UI to display
 
Block is a struct used to encode and decode block. A block is variable length delta encoded. Variable Length delta Encoding, compresses very efficiently balancing speed and storage requirement.
<pre>struct Block {
//The width of the field is 255 bytes. The return value is an int indicating number of elements in the data that were encoded in the out byte array.
int encode(in int[] data, out byte[] encodedBlock) {
//How to encode more efficiently, any idea?
int[] bigEndian;
int k = 0;
data[i - 1] = 0;
for(int i = 0; i < data.length; i++) {
data[i] -= data[i - 1];
int j = 0;
while(data[i] != 0) {
bigEndian[j++] = data[i] % 128;
data[i] /= 128;
}
for( ; j > 0; j--, k++) {
encodedBlock[k] = (1 << 8) & bigEndian[j];
}
encodedBlock[k++] = bigEndian[0];
if (k > 255)
return i - 1
}
}
void decode(in byte[] encodedBlock, out int[] data) {
int j = 0;
for(int i = 0; i < encodedBlock.length; i++) {
if (encodedBlock[i] && (1 << 8)) {
data[j] *= 128;
data[j] += encodedBlock[i] & ((1 << 8) - 1);
}
else {
data[j] += data[j - 1]; //Because it was delta encoded
data[j++] = encodedBlock[i];
}
}
}
}
 
 
AddDocument(connection, document, analyzer) {
AddDocument in two pass. Scan the document for terms and then invert. We have the doc id. We will require a hash map library to make it very efficient. Term is a hash map. Usage: term['termname'] = array of pos.
<pre> while(analyzer.hasTerms()) {
cTerm = analyzer.nextTerm();
term[cTerm.Name].add(cTerm.pos);
}
iterator i = term.iterator();
while(i.hasNext()) {
termName = i.next();
termPos[] = term[termName];
//hopefully sqlite caches query results, it will be inefficient otherwise.
 
if (term already in table) {
record = executeQuery("SELECT firstdoc, flags, block FROM postings
WHERE word = termName
AND firstdoc == (Given as >= in [2], == is correct in my opinion)
  (SELECT max (firstdoc) FROM postings
    WHERE word = termName and flags < 128)");
//Refer to [2] for explanation of this.
//only one record is retrieved with flags == 0 or flags between 0 and 128
//when flag == 0, the block contains only document list
if (flag == 0) {
1) Decode the block
2) See if one more document can be fitted in this.
3) Yes? add to it
i) find the position list
positionList = executeQuery("SELECT firstdoc, flags, block FROM positings
where firstdoc =
SELECT max(firstdoc) FROM postings
WHERE word = termName
AND firstdoc >= record.firstdoc AND flags >= 128
ii) Try to add as many position in this block
iii) when the block is done, create a new row, with firstdoc == currentdoc
and flags == 128 or 129 depending on whether the prev firstdoc was same as this.
 
iv) goto ii) if there are more left.
4) no?
i) create a new row with firstdoc = docid, flags=2
ii) To the block  add, doc id and doc freq. And all the pos. Note the position listings are never split when flag == 2.
We must try to fit all the position listing in this block.99% of the case this should be possible. Make a small calculation, you'll find out i am correct
iii) The rare case it is not possible? create two rows one with flags==0 and flags==128
}
else {
//This is slightly complex in that there is both document list and position list in the same block. We have to decode the block. Try to add document id and and all the position to the position list. This might not be possible. And we'll have to create two new rows, one with flags == 0 and other with flags == 128
}
update the word count in the word list table appropriately
}
}
commit to the database
}
 
RemoveDocument() {
Inherently inefficient because of the structure that we have adopted. Needs to be optimized. The general algorithm revolves around finding the record whose firstDoc is immediately less or same as the docId we are searching for.
 
<pre> e.g. Say we want to delete docid = 20. We got two records
firstDoc = 18, block="blahblah"
firstDoc = 22, block="blahblah"
So we select the record with docid = 18 that is immediately before docid = 20;
query to achieve this: SELECT word, firstdoc, block FROM postings where
firstdoc = SELECT max(firstdoc) FROM postings where
firstdoc < docIdWeAreSearchingFor
This returns a number of records with flag == 0 or 0 < flag < 128 or flag == 128 or flag > 128.
for each record we found do the following:
docAndPostingTable = decodeBlock(block);
we have just decoded the block using Block::decodeBlock().
docAndPostingTable.find(docId) //we check the decode block if it contains docId of our interest.
if docId found
Check the flag
when flag == 0 //only document list
remove the document and freq from the block
update the word count for the term in word table
update the delta coding for other doc in the block.
if docId == firstDoc update firstDoc with the immediately following doc
if no more docs left in this row, delete the row
when 0 < flag < 128, contains both document list and posting table
remove the document from the block.
update the word count for the term in word table
update the delta coding for other doc in the block.
remove all the postings of the document for the term in the block
if (docId == firstDoc) update firstDoc with immediately following doc
if no more docs left in the row, delete the row
when flag >= 128 //only postings table
remove all the postings corresponding to the doc
update the firstDoc for the record
delete the record if block is empty
}
 
SearchDocument(terms) {
terms is something like: "mac apple panther". Basically a collection of terms
The ranking algorithm as described in this document http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/search/Similarity.html is used.
}</pre>


===nsNavFullTextIndexHelper===
===nsNavFullTextIndexHelper===
24

edits