User:Beckley/Indexed Search Proposal

From MozillaWiki
THIS PAGE IS A WORKING DRAFT
The page may be difficult to navigate, and some information on its subject might be incomplete and/or evolving rapidly.
If you have any questions or ideas, please add them as a new topic on the discussion page.

This is a proposal for adding indexed search functionality to the Thunderbird email client.

Rationale

The current Search Messages feature in Thunderbird is very slow for users who have accumulated a normal amount of messages. Each time a user initiates a search, the individual mailbox files are opened and scanned for the matching text. This can take tens of seconds, or even minutes, to complete. Users have become accustomed to Internet search engines that provide near-instantaneous results, and if the entire Web can be searched that quickly, then we can do the same for a user's mail store.

Providing instant results will require the use of an indexing engine. Most recent operating systems come with an indexing engine, or make one available as a free download. However, there are reasons why it would be beneficial to include an indexing engine as part of Thunderbird. We will look at both solutions here.
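
To make the difference concrete, here is a minimal sketch contrasting a per-query scan with a lookup in a prebuilt inverted index. The sample messages and all function names are made up for illustration. A real engine would also handle persistence, incremental updates, and ranking.

```python
import re
from collections import defaultdict

# Hypothetical sample data: message id -> message text.
MESSAGES = {
    1: "Quarterly budget review is on Friday",
    2: "Are you free for lunch at noon?",
    3: "The travel budget was approved",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def linear_search(messages, word):
    """Non-indexed search: every message is re-read on every query."""
    word = word.lower()
    return sorted(mid for mid, text in messages.items() if word in tokenize(text))

def build_index(messages):
    """Build the inverted index once: word -> set of message ids containing it."""
    index = defaultdict(set)
    for mid, text in messages.items():
        for token in tokenize(text):
            index[token].add(mid)
    return index

def indexed_search(index, word):
    """Answer a query with one dictionary lookup instead of a full scan."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(MESSAGES)
print(linear_search(MESSAGES, "budget"))
print(indexed_search(INDEX, "budget"))
```

Both functions return the same message ids. The scan costs time proportional to the size of the mail store on every query, while the index pays that cost once, when it is built.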

Using the OS Indexing Engine

Recent OSes have indexing engines included or available for download. Windows Vista comes with Windows Search, which is available for XP as well. Mac OS X has had Spotlight for the last couple of versions. For Linux there are Beagle, Tracker, Strigi, Recoll, and a number of others. There is even Google Desktop Search, which has versions for all three OSes.

OS-based desktop search provides its own user interface for searching, and is able to search email, as well as user documents, web browsing history, and other files stored on the user's computer. To get a good user experience between these services and Thunderbird some integration work is required. That integration has already been written for Spotlight, and is nearly complete for Windows Search. However, the user interface in OS-based desktop search is not optimized for searching email.

The OS-based desktop search components all have an API for programmatically searching their indexed stores. So one approach is to take advantage of that database when performing a search inside Thunderbird. The win for this route is that the index already exists and doesn't have to be duplicated (an index is generally around 25% of the size of the data it indexes). The disadvantages are numerous, though:

  • Have to filter out other data in the index that isn't email
  • Each indexing engine has different capabilities, which leads to least-common-denominator solutions or differences between platforms
  • Not all OSes come with an indexing engine (Windows XP, Linux), so the user would have to download and configure one
  • Glue code needs to be written/maintained to keep a single interface in the front-end (there are existing APIs that do this, Xesam is an example, but they don't support all of the indexing engines we would want)
  • The indexing engine can get disabled, have settings changed, or get upgraded to an incompatible state, making it unusable

Even though OS-based indexing engines are not well suited for search inside Thunderbird, they still are useful for users who want to perform search outside of Thunderbird and have their email show up in the results.

Incorporating an Indexing Engine

As mentioned in the section above, a better route is to include an indexing engine inside Thunderbird. That way we can control it and ensure that it is present, enabled, and compatible. There are a number of FOSS indexing engines available for use, but a few in particular stand out: C-Lucene, SQLite Full Text Search, Ferret and Sphinx. They're discussed below, but ideally the design for integrating indexed full-text search would be flexible enough to allow different engines to be plugged in.

Architectural Approach

Regardless of what indexing engine is used, we have to figure out how those engines are fed data, queried, etc.
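
One possible shape for this, sketched under assumptions rather than proposed as a design: the front end talks to a small engine-neutral interface, and each engine gets an adapter. The names `SearchBackend`, `InMemoryBackend`, and `search_mail` are hypothetical. The in-memory class is only a toy stand-in for a real engine adapter.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Hypothetical engine-neutral interface the front end would code against."""

    @abstractmethod
    def add(self, message_id, fields):
        """Feed one message (a dict of field name -> text) to the engine."""

    @abstractmethod
    def remove(self, message_id):
        """Drop a deleted message from the index."""

    @abstractmethod
    def query(self, text):
        """Return the ids of messages matching every word in text."""

class InMemoryBackend(SearchBackend):
    """Toy stand-in for a real engine; matches whole lowercase words."""

    def __init__(self):
        self._words = {}  # message_id -> set of words across all fields

    def add(self, message_id, fields):
        words = set()
        for value in fields.values():
            words.update(value.lower().split())
        self._words[message_id] = words

    def remove(self, message_id):
        self._words.pop(message_id, None)

    def query(self, text):
        wanted = set(text.lower().split())
        return sorted(mid for mid, words in self._words.items() if wanted <= words)

def search_mail(backend, text):
    """Front-end code depends only on the SearchBackend interface."""
    return backend.query(text)

backend = InMemoryBackend()
backend.add(1, {"subject": "Budget review", "body": "numbers due friday"})
backend.add(2, {"subject": "Lunch", "body": "free at noon"})
print(search_mail(backend, "budget"))
```

Swapping engines would then mean writing another `SearchBackend` subclass, with no change to the code that calls `search_mail`.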

Lucene

Lucene is an indexing engine from the Apache project. It was originally written in Java, but has been ported to many other languages, including (for our purposes) C/C++. Lucene is considered the de facto standard text indexing engine, and is used in thousands of projects. Its capabilities include multiple fields, boolean operators, wildcards, fuzzy matches, proximity searching, multiple-word phrases, and more.
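
As an illustration of those capabilities, the snippet below lists example query strings in the syntax of Lucene's classic query parser. The field names `subject` and `sender` are hypothetical, and it is an assumption, not something verified here, that the C/C++ port accepts exactly the same syntax.

```python
# Example query strings in Lucene's classic query-parser syntax.
# The field names (subject, sender) are hypothetical, chosen for illustration.
LUCENE_QUERY_EXAMPLES = {
    "fields": "subject:budget AND sender:alice",
    "boolean": "budget AND NOT draft",
    "wildcard": "budg*",
    "fuzzy": "budgte~",
    "proximity": '"budget review"~5',
    "phrase": '"quarterly budget review"',
}

for capability, query in LUCENE_QUERY_EXAMPLES.items():
    print(f"{capability:10} {query}")
```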

Java Lucene (the original) is not being considered because of the complexities of bundling Java software with Thunderbird.

C-Lucene is a possibility. Flock has used C-Lucene, and has XPCOM bindings for it (currently under GPL). C-Lucene is under the Apache license.

TBD: What was the Flock experience w/ C-Lucene, and how much does it apply to us?

TBD: What's involved in building stemmers for more languages for C-Lucene?

SQLite Full Text Search

SQLite FTS is the full-text indexer that is part of SQLite.

Advantages:

  • part of SQLite, which we already bundle and ship.

Disadvantages:

  • FTS currently works only for text stored in the SQLite database. Gloda currently doesn't store any of the message content, so either FTS needs to be taught how to index "foreign" text, or we need to bring more text into a SQLite database.
  • current stemming is done through the IBM ICU library, which is reportedly very large.
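
The first disadvantage can be seen in a small sketch using Python's bundled `sqlite3` module: the message text has to be inserted into an FTS virtual table before it can be matched. The table name, column names, and sample rows are made up. The sketch also assumes the local SQLite build includes at least one FTS module, trying `fts5`, then `fts4`, then `fts3`, and skips the demo if none is present.

```python
import sqlite3

def create_fts_table(conn):
    """Create the full-text table with whichever FTS module this build provides."""
    for module in ("fts5", "fts4", "fts3"):
        try:
            conn.execute(
                f"CREATE VIRTUAL TABLE messages USING {module}(subject, body)"
            )
            return module
        except sqlite3.OperationalError:
            continue  # module not compiled in; try the next one
    return None

conn = sqlite3.connect(":memory:")
module = create_fts_table(conn)

hits = []
if module is not None:
    # The text must be copied into SQLite before FTS can index it.
    conn.executemany(
        "INSERT INTO messages (subject, body) VALUES (?, ?)",
        [
            ("Q3 budget review", "Please send the revised numbers by Friday."),
            ("Lunch?", "Are you free at noon?"),
            ("Re: travel", "The budget for the trip was approved."),
        ],
    )
    # MATCH against the table name searches all indexed columns.
    rows = conn.execute(
        "SELECT subject FROM messages WHERE messages MATCH ?", ("budget",)
    )
    hits = [row[0] for row in rows]

print(module, hits)
```

The query matches the word in either the subject or the body, which is convenient, but the full message text now lives in the database as well as in the mailbox files.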


Ferret

Ferret started off as a Ruby port of Lucene, but has evolved to be a thin Ruby wrapper around a reportedly clean and well architected C library, which could be wrapped with XPCOM.

Advantages:

  • MIT license IIRC
  • small
  • stemmers for many languages & multiple encodings already there

Disadvantages:

Sphinx

Sphinx is a full-text indexing engine targeted at MySQL. It gets some positive reviews.

Advantages:

  • At least two languages supported (so i18n is possible, if not already done)

Disadvantages:

  • May contain MySQL-specific code?
  • Currently GPL