Firefox/Projects/FTS and Awesomebar
This project will attempt to make Storage text searches faster, focusing on the awesomebar.
SQLite supports indexed full text search (FTS) through its fts extensions. Indexed FTS makes text searches fast: Rather than looking at every record in the database to see if it contains a search string, target records are found by comparing the search string against an index.
A good first target feature is the awesomebar. It is highly visible, and since it makes extensive use of text searches, it stands to benefit from this work.
(Note that full text search does not mean that we store the entire text of pages. We will use it to store page titles, URLs, tags, etc., as we already do now, only lookup will be fast.)
- Champion, lead: adw
STALLED, GOING DOWN, PEOPLE SCREAMING
After a few weeks of exploration, it's not clear as I had hoped it would be that FTS can provide faster awesomebar searches on the desktop. The awesomebar's default SQL query (one of several it performs) makes good use of indexes, and while I don't fully understand why, its execution seems to take constant time independent of data set size. Prefix searches with FTS are slow; to mitigate, I investigated partitioning the FTS table into many tables, each containing no more than 1000 rows. Individual table lookups were fast with this scheme, but the overhead of searching dozens of tables conspired to make the aggregate query slower than the awesomebar's default query. Perhaps there is a way to game this tradeoff between partitioning and overhead that more tinkering would reveal.
There may be other areas of Firefox where FTS can help -- search terms in the Places query API, for example.
On hold for now.
Week of 2010/03/15
- Progress in hooking up the awesomebar to FTS for the purpose of performance testing.
- Exact matches are quite fast, i.e., matching whole words. Finding the records in a places database with 34000 history entries that match a whole word is on the order of 10 ms. Noticably faster than the current awesomebar.
- However, prefix searches, i.e., searching for words that start with a given string, can be quite slow. For common prefixes, they can be even slower than a full table scan. (See also this mail.) Prefixing searching is of course the real-world case.
- If prefix searches are really slower than full table scans, then this project is done.
- So I'm currently trying ways of optimizing prefix searches. Partitioning the FTS tables and doing multiple queries is showing some promise.
Week of 2010/03/08
- Gecko has some facilities for i18n word boundary analysis under the intl/lwbrk/ directory.
- nsISemanticUnitScanner is the only interface there, and nsSemanticUnitScanner is its only implementation. nsSemanticUnitScanner is derived from nsSampleWordBreaker, which as its name implies is not robust. It supports ASCII but as far as I can tell has incomplete support for CJK and Thai and no support for other scripts. Manual testing shows that it does not tokenize Japanese satisfactorily.
- There's nsIWordBreaker, but its only implementation is nsSampleWordBreaker.
- There are some other files. There are several line breakers in the src directory. There's another word breaker, but it's for Thai text only. There's a smattering of files related to JIS encoding.
- Thunderbird 3 does FTS with a custom tokenizer in the mailnews/extensions/fts3/ directory.
- According to the readme, the tokenizer "supports CJK indexing using bi-gram. So you have to use bi-gram search string if you wanto to search CJK character." There is no mention of other scripts.
- I chatted with asuth about it, asked how well their tokenizer serves their users and whether he could recommend it to us.
- What's in their tree right now is mainly just improvements to CJK tokenization.
- There's a pending patch on bug 525537 that handles case and accent folding, but that's only an improvement to the Porter stemmer. Before we (Firefox) even think about stemming, we need to competently support i18n text in the first place. I also think that stemming, while definitely a good thing in general, is more useful for Thunderbird, where they are indexing the entire bodies of emails. We don't currently plan to index bodies of Web pages. I would expect stemming to be less useful for page titles and especially things like URLs, tags, and keywords.
- He too looked at our libintl library recently for the aforementioned bug and agreed that it's not a featureful or actively developed library to rely on.
- When I asked if they'd considered bringing in ICU or any other outside libraries, he mentioned that Mozilla had shied away from bringing in ICU before.
- He said their plan was to reconsider later if they should move to something more featureful, like Lucene. Their tokenizer improvements are due entirely to one (awesome) Japanese contributor.
- He noted that they don't see many non-English user complaints about their search. I'm not sure if that's a useful metric for us, and there are some complaints in the bug linked above.
- They don't expect to do a lot more work on their tokenizer anytime soon.
- So. The i18n word boundary analysis work done around Mozilla doesn't seem sufficient for us. We could make it sufficient by brute force. That would require expertise that Mozilla to my knowledge does currently not possess. Even if we did, or if we called out to the Mozilla community at large, my impression is that writing a comprehensive, well tested i18n word boundary analysis library is a large undertaking.
- If we called out to the open source community, however, I think we'd have better luck, specifically because it's already written these libraries! asuth mentioned Lucene, an Apache text retrieval library. SQLite supports the ICU library's tokenizer out of the box. ICU is large, established, and widely used.
- So, I investigated using ICU for tokenization.
- ICU's license is the ICU License, which is the MIT license with additional clauses about documentation and promotion. I'm pretty sure it's compatible with our tri-license, but I'm no lawyer...
- ICU is designed to be modular. Here's a picture showing API dependencies. We need "BreakIterator", which is inside "Common library", which depends on the "Data library." There are many compilation options that allow you to pick only the pieces you need. (1, 2, 3.) I haven't investigated them yet, but the important point is that we don't need all of ICU to support i18n tokenization.
- I was able to build our SQLite with ICU support. I'm building and linking against an ICU build outside of our tree, because I don't want to focus on the grunt work of building and linking inside the tree and build system right now. I assume doing so is not technically impossible...
- Contrary to the API dependency picture I linked to above, I had to link against three ICU libraries: common library, data library, and i18n library. Since I built using default options, I presume I built more than I needed, including the i18n lib, or perhaps I misunderstood the tokenizer dependencies. More investigation needed. It built as a collection of shared libraries: the common library is 1.3 MiB, the data library is 16 MiB, and the i18n library is 1.9 MiB. The size of our SQLite library is not changed, since it links dynamically against these ICU libraries. Again, this is all by default, since I haven't yet played with compile and link options. The docs say the the data lib can instead be built as a data file, not a lib, and Marco had the idea of the client's downloading the data in the background when needed instead of bundling it with the install.
- Next I would like to do some tests to determine its potential for improving awesomebar searches.
- Make awesomebar results come back from the database faster.
- Improving user-facing features in Firefox other than the awesomebar. There is definitely potential, but that's for follow-up work.
- Pulling in external i18n libraries for the sake of pulling in external i18n libraries.
- Improving our own i18n library for the sake of improving our own i18n library.
Dates in the future are only estimates.
- 2010/03 - Investigate i18n tokenizers
- 2010/03 - Test potential perf impact on awesomebar searches
- 2010/?? - Integrate awesomebar with SQLite's fts extension
- Testing to make sure awesomebar functionality and certainly perf is not regressed.
- We have weak support for tokenizers that work well on scripts other than English, CJK, and Thai. Assuming preliminary testing shows that FTS is worth the costs, I'll have to pull in an outside lib or get people to write a bunch of new code. (Discussion)
- Since this project is broadly defined -- improving FTS all the way to using FTS in the awesomebar -- none?
- Will require manual testing of the awesomebar.
- Maybe we can set up some automated harness to time awesomebar searches.