Loci

From MozillaWiki

Loci is a work-in-progress add-on that indexes the content of pages visited in the browser.

This project's main goals are to:

  1. Create a generic worker queue that processes a page's DOM through different tasks (e.g. text, metadata, images).
  2. Provide an API to search the browser history by its content.

Steps

  1. Access webpage.
  2. Copy text version of the page DOM.
  3. Store it in the file system.
  4. Add tasks to the Worker Queue.
  5. Workers go through Worker Queue and process the data.
  6. Store the metadata obtained from worker.

Code repo

The code repository can be found here: https://github.com/mozilla/loci

The Worker Queue

The worker queue is meant to be a generic platform for processing page data.

Schema

Schema for the worker queue

Class Diagram

Class diagram for the worker queue

Sequence Diagram

Sequence diagram for the worker queue

Copying the DOM

The DOM is copied when the user leaves the current page or the URL changes. By doing this, we can take into account dynamic changes to the DOM (like the ones made by JavaScript frameworks).

To do this, we inject a framescript in every open window. This script will detect when the page is closed and will send a message to a subscribed listener (the DOMFetcher is responsible for injecting the script and attaching the listener).

Task processing

Each type of task has its own Task Processor. A Task Processor manages a queue of tasks and its workers.
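As a rough illustration of a Task Processor that owns one queue and its workers, here is a minimal Python sketch. The add-on itself would run inside the browser, so this only shows the shape of the idea. The class name, the `handler` callable, `num_workers` and the `results` dictionary are all assumptions made for the example, not names taken from the Loci code.

```python
import queue
import threading


class TaskProcessor:
    """Owns the queue for one task type and the workers that drain it."""

    def __init__(self, task_type, handler, num_workers=2):
        self.task_type = task_type
        self.handler = handler      # hypothetical callable: payload -> result
        self.results = {}           # task_id -> result (or the raised exception)
        self._queue = queue.Queue()
        self._lock = threading.Lock()
        for _ in range(num_workers):
            threading.Thread(target=self._run, daemon=True).start()

    def add_task(self, task_id, payload):
        self._queue.put((task_id, payload))

    def _run(self):
        while True:
            task_id, payload = self._queue.get()
            try:
                outcome = self.handler(payload)
            except Exception as exc:  # keep the worker alive on handler errors
                outcome = exc
            with self._lock:
                self.results[task_id] = outcome
            self._queue.task_done()

    def wait(self):
        """Block until every queued task has been processed."""
        self._queue.join()


# Hypothetical "text" task: count whitespace-separated tokens.
processor = TaskProcessor("text", handler=lambda dom: len(dom.split()))
processor.add_task(1, "<p>hello world</p>")
processor.add_task(2, "one two three")
processor.wait()
```

One processor per task type keeps a slow task type (for example image processing) from blocking the others.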

The Task Router is responsible for receiving the message from the framescript and doing the following:

  • Checking whether the page should be processed:
      * Do we already have a task for this page?
      * Is the page already stored, and is its cache still valid?
      * Does the page file exist?
  • Writing the DOM string to the file system.
  • Adding an entry to the Pages table for the specific URL.
  • Creating the tasks and adding them to their Task Processors.
  • Adding the tasks to the Tasks table.
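The routing steps above can be sketched in Python with an in-memory SQLite database. This is one possible reading, not the add-on's actual code: the table columns, the `CACHE_SECONDS` lifetime, the SHA-1 file naming and the function names are all assumptions, and handing the tasks to their Task Processors is left out.

```python
import hashlib
import os
import sqlite3
import tempfile
import time

CACHE_SECONDS = 24 * 60 * 60  # assumed cache lifetime; the source gives no value


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pages (url TEXT PRIMARY KEY, path TEXT, stored_at REAL);
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, url TEXT, type TEXT,
                            status TEXT DEFAULT 'pending');
    """)
    return conn


def should_process(conn, url, now):
    # Do we already have a (pending) task for this page?
    if conn.execute("SELECT 1 FROM tasks WHERE url = ? AND status = 'pending'",
                    (url,)).fetchone():
        return False
    # Is the page stored, is its cache still valid, and does its file exist?
    page = conn.execute("SELECT path, stored_at FROM pages WHERE url = ?",
                        (url,)).fetchone()
    if page:
        path, stored_at = page
        if now - stored_at < CACHE_SECONDS and os.path.exists(path):
            return False
    return True


def route(conn, dom_dir, url, dom, task_types, now=None):
    now = time.time() if now is None else now
    if not should_process(conn, url, now):
        return []
    # Write the DOM string to the file system.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(dom_dir, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(dom)
    # Add (or refresh) the Pages entry, then record one task per task type.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (url, path, now))
    task_ids = []
    for task_type in task_types:
        cur = conn.execute("INSERT INTO tasks (url, type) VALUES (?, ?)",
                           (url, task_type))
        task_ids.append(cur.lastrowid)
    conn.commit()
    return task_ids


conn = open_db()
dom_dir = tempfile.mkdtemp()
first = route(conn, dom_dir, "https://example.org/", "<p>hi</p>", ["text", "metadata"])
second = route(conn, dom_dir, "https://example.org/", "<p>hi</p>", ["text", "metadata"])
```

The second call returns an empty list because the page still has pending tasks, so nothing is written twice.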

Once per day, when the browser is idle, the Task Cleaner is called to remove finished tasks from the Tasks table and to delete the corresponding DOM files from the file system. This only happens for pages whose tasks of every type have finished.
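A minimal sketch of the cleaning pass, assuming a Pages table with `url` and `path` columns and a Tasks table with a `status` column whose finished value is `'done'`. Those names are hypothetical, and the daily idle-time scheduling is not shown.

```python
import os
import sqlite3
import tempfile


def clean_finished(conn):
    """Remove tasks and DOM files for pages whose tasks are all finished."""
    rows = conn.execute("""
        SELECT p.url, p.path FROM pages AS p
        WHERE EXISTS (SELECT 1 FROM tasks AS t WHERE t.url = p.url)
          AND NOT EXISTS (SELECT 1 FROM tasks AS t
                          WHERE t.url = p.url AND t.status != 'done')
    """).fetchall()
    for url, path in rows:
        if os.path.exists(path):
            os.remove(path)
        conn.execute("DELETE FROM tasks WHERE url = ?", (url,))
    conn.commit()
    return [url for url, _ in rows]


# Demo data: page "a" is fully processed, page "b" still has a pending task.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT PRIMARY KEY, path TEXT);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, url TEXT, type TEXT, status TEXT);
""")
dom_dir = tempfile.mkdtemp()
for name in ("a", "b"):
    path = os.path.join(dom_dir, name + ".html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("<p>dom</p>")
    conn.execute("INSERT INTO pages VALUES (?, ?)", (name, path))
conn.executemany(
    "INSERT INTO tasks (url, type, status) VALUES (?, ?, ?)",
    [("a", "text", "done"), ("a", "metadata", "done"),
     ("b", "text", "done"), ("b", "metadata", "pending")],
)
cleaned = clean_finished(conn)
```

Page "b" keeps both its DOM file and its tasks, including the finished one, until its remaining task completes.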

Full-text Indexing

Indexing activity diagram

Full-text indexing is one of the use cases for the worker queue.

The DOM content is simplified using the Readability.js module. This operation is destructive, which is why we need a copy of the page DOM. After this, we remove all HTML tags and keep only the text of the page.
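The tag-stripping step can be illustrated with Python's standard `html.parser` module. This sketch covers only the "remove tags, keep text" part, not Readability.js, and the class and function names are made up for the example. It joins text fragments with spaces, so a word split by inline markup would come out as two tokens.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = ("<article><h1>Loci</h1><p>Index <b>page</b> content.</p>"
          "<script>var x = 1;</script></article>")
plain = html_to_text(sample)
```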

Stop words can be filtered out. However, as discussed in section 2.2.2 of An Introduction to Information Retrieval: "[...] for most modern IR systems, the additional cost of including stop words is not that big – neither in terms of index size nor in terms of query processing time".

Storage uses the SQLite FTS4 virtual table extension module, which is already embedded in the browser. Some normalization and Porter stemming can be done by the `porter` tokenizer included in the extension.
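To show the effect of the `porter` tokenizer, here is a small sketch using Python's `sqlite3` module, assuming the underlying SQLite build has FTS4 enabled. The table name `pages_fts` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# tokenize=porter applies Porter stemming at index time and at query time.
conn.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts4(url, body, tokenize=porter)"
)
conn.executemany(
    "INSERT INTO pages_fts (url, body) VALUES (?, ?)",
    [
        ("https://example.org/a", "Indexing the content of visited pages"),
        ("https://example.org/b", "Workers process queued tasks"),
    ],
)
# "indexed" and "Indexing" share the stem "index", so the first page matches.
rows = conn.execute(
    "SELECT url FROM pages_fts WHERE body MATCH ?", ("indexed",)
).fetchall()
```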

The matchinfo function provided by the FTS4 extension returns important statistics about the results of a query. With this information we can implement ranking algorithms such as BM25.
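As a sketch of how such a ranking could be built, the Python function below computes a BM25 score from a `matchinfo(table, 'pcnalx')` blob and is registered as an SQL function. The `k1` and `b` defaults, the single-column scoring and the `log(1 + ...)` form of the inverse document frequency are choices made for this example, and the `docs` table is hypothetical.

```python
import math
import sqlite3
import struct


def bm25(matchinfo_blob, column=0, k1=1.2, b=0.75):
    """Score one row from a matchinfo(table, 'pcnalx') blob for one column."""
    # The blob is an array of 32-bit unsigned integers in native byte order.
    ints = struct.unpack("=%dI" % (len(matchinfo_blob) // 4), matchinfo_blob)
    phrases, columns, total_docs = ints[0], ints[1], ints[2]   # p, c, n
    avg_len = ints[3 + column] or 1                            # a
    doc_len = ints[3 + columns + column]                       # l
    x_base = 3 + 2 * columns                                   # start of x
    score = 0.0
    for phrase in range(phrases):
        offset = x_base + 3 * (phrase * columns + column)
        term_freq = ints[offset]            # hits in this row
        docs_with_term = ints[offset + 2]   # rows with at least one hit
        if term_freq == 0:
            continue
        idf = math.log(
            1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
        )
        norm = term_freq + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * term_freq * (k1 + 1) / norm
    return score


conn = sqlite3.connect(":memory:")
conn.create_function("bm25", 1, bm25)
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body, tokenize=porter)")
conn.executemany("INSERT INTO docs (body) VALUES (?)", [
    ("loci indexes the content of pages",),
    ("pages pages pages and more pages about indexing pages",),
    ("workers process queued tasks",),
])
ranked = conn.execute(
    "SELECT docid, bm25(matchinfo(docs, 'pcnalx')) AS score "
    "FROM docs WHERE docs MATCH ? ORDER BY score DESC",
    ("pages",),
).fetchall()
```

The row that mentions the term more often ranks first, with the length normalization damping the advantage of the longer text.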

If we don't need to show snippets of the resulting text, we can save a lot of storage by using the content= option. This lets us store only the index in this database, keeping just a docid reference (which could be the page GUID in Places, for example).
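A minimal sketch of a contentless FTS4 table (`content=""`), again through Python's `sqlite3` and assuming FTS4 is available. Because docid is an integer rowid, the example uses made-up integer page ids; mapping them to a page record is left to the caller. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# content="" keeps only the full-text index; the original text is not stored.
conn.execute(
    'CREATE VIRTUAL TABLE page_index USING fts4(content="", body, tokenize=porter)'
)
# A contentless table needs an explicit docid on insert.
conn.executemany(
    "INSERT INTO page_index (docid, body) VALUES (?, ?)",
    [
        (101, "Indexing the content of visited pages"),
        (202, "Workers process queued tasks"),
    ],
)
# Only docid can be read back; the body text itself is not retrievable.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT docid FROM page_index WHERE page_index MATCH ?", ("indexing",)
    )
]
```

The trade-off is that the text cannot be returned from this table, which is why the option only fits when snippets are not needed.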

If we need more complex indexing and search, we can look into using Thunderbird's Gloda.