User:Adw/FTS
From MozillaWiki
- mar 8 2010
  - fts in sqlite
    - module is called fts3 - http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
    - thread about l10n and fts tokenizer: http://old.nabble.com/FTS3-Unicode-support-td15078601.html
    - fts3 tokenizers readme: http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
    - see my sep 10 2009 note also
    - i don't see how to do fts in the face of multi-lingual data for us. a user can have pages of any language in his history. we would need a way of tokenizing a string whose language is unknown, or at best hinted at by the encoding of the web page it came from. we don't even have a way to tokenize for a known lang other than english.
    - could do it for urls, though. tokenize on punctuation.
  - what would google do? looked at chromium source. relevant parts:
    - HistoryContentsProvider is the class that provides data to their autocomplete. it calls HistoryBackend::QueryHistory, which calls HistoryBackend::QueryHistoryFTS if the query contains text. (branches to HistoryBackend::QueryHistoryBasic otherwise.)
    - HistoryBackend::QueryHistoryFTS in trunk/src/chrome/browser/history/history_backend.h: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.h&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=2&ct=rc
    - comment above QueryHistoryFTS: "The FTS version queries the text_database, then merges with the history DB."
    - impl is here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=1&ct=rc&l=1100
    - it queries the text database mentioned in the comment. what is it, and how is it made? the exact call is |text_database_->GetTextMatches()|.
    - text_database_ is a TextDatabaseManager ptr. they have sources named text_database.{h,cc} and text_database_manager.{h,cc}. comment above text_database_ says "Full text database manager".
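- side note: the "tokenize urls on punctuation" idea could be sketched like this (a hypothetical helper, not anything in the tree; it just treats every run of ascii letters/digits as a token):

```python
import re

def tokenize_url(url):
    """split a url into search tokens on punctuation (a naive sketch)."""
    # every run of letters/digits becomes a token; all punctuation and
    # separators ("://", ".", "?", "=", ...) are token boundaries.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", url)]

tokens = tokenize_url("http://www.sqlite.org/cvstrac/wiki?p=FtsUsage")
# tokens → ["http", "www", "sqlite", "org", "cvstrac", "wiki", "p", "ftsusage"]
```

  this sidesteps the language problem entirely, since urls are mostly ascii, but it obviously doesn't help for titles or page text.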
- comment at the top of text_database_manager.h:
    // Manages a set of text databases representing different time
    // periods. This will page them in and out as necessary, and will
    // manage queries for times spanning multiple databases.
    //
    // It will also keep a list of partial changes, such as page adds and
    // title and body sets, all of which come in at different times for a
    // given page. When all data is received or enough time has elapsed
    // since adding, the indexed data will be comitted.
    //
    // This allows us to minimize inserts and modifications, which are
    // slow for the full text database, since each page's information is
    // added exactly once.
    //
    // Note: be careful to delete the relevant entries from this
    // uncommitted list when clearing history or this information may get
    // added to the database soon after the clear.
- TextDatabaseManager::GetTextMatches: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database_manager.cc&l=401
- so they partition these text dbs by date. comment from TextDatabaseManager::GetTextMatches:
    // Iterate over the databases from the most recent backwards.
- TextDatabaseManager::GetTextMatches calls each db's TextDatabase::GetTextMatches, defined here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=293
- aha, finally we get to sqlite:
    "SELECT url, title, time, offsets(pages), body "
    "FROM pages LEFT OUTER JOIN info ON pages.rowid = info.rowid "
    "WHERE pages MATCH ? AND time >= ? AND time < ? "
    "ORDER BY time DESC "
    "LIMIT ?"));
- TextDatabase::AddPageData i'm guessing does the insertions:
    // Add to the pages table.
    "INSERT INTO pages (url, title, body) VALUES (?,?,?)"
    // Add to the info table with the same rowid.
    "INSERT INTO info (rowid, time) VALUES (?,?)"
  http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=196
- question is, how are they handling non-ascii data? - aha!
TextDatabase::CreateTables:
    // FTS table of page contents.
    if (!db_.DoesTableExist("pages")) {
      if (!db_.Execute("CREATE VIRTUAL TABLE pages USING fts2("
                       "TOKENIZE icu,"
                       "url LONGVARCHAR,"
                       "title LONGVARCHAR,"
                       "body LONGVARCHAR)"))
        return false;
    }
    // Non-FTS table containing URLs and times so we can efficiently find
    // them using a regular index (all FTS columns are special and are
    // treated as full-text-search, which is not what we want when
    // retrieving this data).
    if (!db_.DoesTableExist("info")) {
      // Note that there is no point in creating an index over time. Since
      // we must always query the entire FTS table (it can not efficiently do
      // subsets), we will always end up doing that first, and joining the
      // info table off of that.
      if (!db_.Execute("CREATE TABLE info(time INTEGER NOT NULL)"))
        return false;
    }
- "icu" stands for "International Components for Unicode"
- icu is an extension to sqlite: http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
- important: icu is a library, but it's not a library provided by sqlite. sqlite's icu facilities are merely glue to integrate the icu lib with sqlite. see comment here: http://www.hwaci.com/cgi-bin/sqlite/artifact/850e9a3656
- shawn's comment in the thread linked above suggests that we aren't likely to include icu in firefox... huh, chromium must include it then. looks like it: http://src.chromium.org/svn/trunk/src/base/third_party/icu, although the readme says it's just a subset. oh, here? http://src.chromium.org/svn/trunk/deps/third_party/icu42 http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
- sqlite's icu-based tokenizer is here: http://www.hwaci.com/cgi-bin/sqlite/artifact/ac494aed69 can we port it to gecko, ugh? does icu provide things that gecko doesn't? these questions might be answered in bug 472764. (a quick search for "port" showed that they talked about porting other impls at least, but nothing concrete came of it afaict.)
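- to make sure i understand chromium's two-table design (an fts table joined to a plain table by rowid), here's a sketch with python's sqlite3. fts4 and the default tokenizer stand in for fts2 + icu, since that's what a stock sqlite build provides; table and column names follow chromium's:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# searchable text goes in the fts table; non-text data (time) goes in a
# plain "info" table tied to it by rowid, which sqlite indexes for free.
db.execute("CREATE VIRTUAL TABLE pages USING fts4(url, title, body)")
db.execute("CREATE TABLE info(time INTEGER NOT NULL)")

def add_page(url, title, body, time):
    # insert into the fts table, then reuse its rowid for the info row,
    # keeping the two tables in sync (TextDatabase::AddPageData's trick)
    cur = db.execute("INSERT INTO pages (url, title, body) VALUES (?,?,?)",
                     (url, title, body))
    db.execute("INSERT INTO info (rowid, time) VALUES (?,?)",
               (cur.lastrowid, time))

add_page("http://example.org/a", "mozilla fts notes", "full text search", 100)
add_page("http://example.org/b", "unrelated", "nothing here", 200)

# the same shape as TextDatabase::GetTextMatches' query
rows = db.execute(
    "SELECT url, title, time FROM pages "
    "LEFT OUTER JOIN info ON pages.rowid = info.rowid "
    "WHERE pages MATCH ? AND time >= ? AND time < ? "
    "ORDER BY time DESC LIMIT ?", ("search", 0, 1000, 10)).fetchall()
# rows → [("http://example.org/a", "mozilla fts notes", 100)]
```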
- looks like icu has facilities for getting and iterating over code points, which i think means characters? word iterator also, which can detect word boundaries: http://icu-project.org/apiref/icu4c/ubrk_8h.html
- i don't think anything like that exists in gecko. surely the smart people in bug 472764 would have known.
  - nsIEntityConverter.idl?
  - what is this? http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#64
      { "Word Breaker", NS_WBRK_CID,
        NS_WBRK_CONTRACTID, nsSampleWordBreakerConstructor},
  - nsISemanticUnitScanner
  - this defines a class called nsIWordBreaker, but there's no idl: http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h an instance of it is exposed as a static method on nsContentUtils: http://mxr.mozilla.org/mozilla-central/source/content/base/public/nsContentUtils.h#587 i wonder if there's no idl because it's not implemented. there's intl/lwbrk/src/nsSampleWordBreaker.cpp which looks half-assed.
- i wonder if the icu tokenizer simply tokenizes on code points (assuming that translates into characters), because when i type the "go" part of "nihongo" into chrome's address bar in kanji, my visit to wikipedia's nihongo page shows up. (same for "ji" in ja.wikipedia's shokuji page.) but then that would mean that english words get tokenized into chars also, and they don't... chrome also indexes the text of pages, so maybe it's pulling out those chars from there.
- oh:
    class nsSemanticUnitScanner : public nsISemanticUnitScanner
                                , public nsSampleWordBreaker
- testing chromium's tokenization: http://doraku.asahi.com/lifestyle/pet/gallery_091201.html?ref=kanren page title has this run of katakana: スーパードッグカーニバル (super dog carnival). スーパー is matched. ドッグ is matched, but the カ is also highlighted. カーニバル is not matched.
- mar 9 2010
  - fts
    - conclusion: gecko includes a less-than-half-assed i18n tokenizer, nsISemanticUnitScanner. it supports ascii and half-assedly supports cjk.
(e.g., it doesn't differentiate between kanji in chinese langs and kanji in japanese langs; there's even a note in the code (nsSampleWordBreaker.cpp) about breaking words per kanji in chinese but using bigger units in japanese, but the japanese case is not implemented, so all individual kanji are treated as words.) no other langs. icu is a "big" library because it's competent, unlike gecko's i18n support, which is shoddy and scattershot. either someone has to do the immense work of re-implementing icu's facilities in gecko, or we use icu. both tasks are too big for me to lift. the first is stupid, since the open source community has already done that work. the second is politics, but it's the better option. we will never have fts unless someone who knows how to work the levers better than i do drives the effort.
- should look at:
  - porting or including icu's tokenization bits, but i suspect they have deep dependencies on large parts of the rest of the lib.
  - marco looked at their web site, points out that icu is meant to be configurable and modularizable. maybe we can include the word break code after all.
    - http://source.icu-project.org/repos/icu/icu/trunk/readme.html
    - http://source.icu-project.org/repos/icu/icu/trunk/readme.html#HowToPackage
    - http://userguide.icu-project.org/packaging
    - http://userguide.icu-project.org/design#TOC-API-Dependencies
    - http://icu-project.org/repos/icu/icuhtml/trunk/design/modularization/icu_modularization_overview.html
    - http://userguide.icu-project.org/boundaryanalysis
  - note: icu is sometimes called icu4c to distinguish the c/c++ version of the lib. (there's also icu4j, a java version.)
  - chromium's changes to icu, including word-breaking changes: http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
  - chromium's dir structure re: icu: icu has only 3 top-level dirs, as_is, packaging, and source. they removed the first two since they don't need them. they added all dirs except source.
public contains some dirs they moved out of source. see readme.google. so they have: http://src.chromium.org/svn/trunk/deps/third_party/icu42 which is the root of the tarball that icu provides.
- including other tokenizers. marco mentioned boost has some.
- is thunderbird's fts search really ascii and cjk only? what happens for users of other langs, like arabic?
- mar 10 2010
  - continuing fts
    - looking at other non-icu tokenizers like marco suggested yesterday
      - boost has tokenizers, but they're simple and i18n-unaware. http://www.boost.org/doc/libs?view=categorized
    - looking at state-of-the-art fts in thunderbird
      - apparently still using bi-grams. i don't know how well it works in practice for non-cjk text. should ask andrew sutherland.
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/fts3_porter.c
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/README.mozilla
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/public/nsIFts3Tokenizer.idl
      - chat with andrew sutherland (asuth) in #maildev, see chatlog today. upshot: they're not planning more work on their tokenizer. my thoughts: all the progress so far has been due to makoto. as it exists now, it's not good enough for firefox.
    - trying to hack together icu, sqlite, and firefox just to get it running:
      - downloaded and untarred the icu source to intl/icu.
      - copied the source to a directory on my desktop and built there. followed the instructions in the tarball's readme but did not do the install step.
      - copied source/lib/libicudata.42.1.dylib and libicuuc.42.1.dylib from the desktop build into obj-debug/dist/MinefieldDebug.app/Contents/MacOS/components/. i have no idea what i'm doing.
      - copied source/common/unicode/platform.h from the desktop build to the source that i dropped into the mozilla tree. that file is generated from platform.h.in when building.
- modified db/sqlite3/src/Makefile.in by adding at the very end of the file (because i think the include $(topsrcdir)/config/rules.mk is what contains whatever cflags are set by the build system):
    CFLAGS += -I$(topsrcdir)/intl/icu/source/common -I$(topsrcdir)/intl/icu/source/i18n/
  probably could/should (for the time being) have just pointed these at my desktop build.
- also modified that makefile to include -DSQLITE_ENABLE_ICU=1.
- that made the compile work, but now i get linker errors: undefined symbols in sqlite3.o. hmm. i have an icu dylib from building it. how do i get sqlite3.o to not need to statically link against a static icu lib? the other option is to do just that: make an icu static lib and link sqlite3.o against it.
- sqlite's icu extension readme says the easiest way is to build the extension as a dynamically loadable sqlite extension. i guess that is exactly what i am not doing by using the SQLITE_ENABLE_ICU amalgamation switch. sdwilsh says moz storage doesn't expose the sqlite dynamic extension loading api though, but there's a bug for it.
- perhaps just to get started i can make an icu static lib and link sqlite3.o against it.
- libmozsqlite3.dylib did not change in size after my changes.
- sqlite is borked on start up; i get this error:
    nsNativeModuleLoader::LoadModule("/Users/adw/trees/mozilla-central/obj-debug/dist/MinefieldDebug.app/Contents/MacOS/components/libstoragecomps.dylib") - load FAILED, rv: 80004005, error: Unknown error: -2804
- noticed that the last step in make -C db/sqlite3/src is:
    /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/lib
    /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/bin
  i.e., installing dylibs into obj-debug/dist/lib and dist/bin, so i copied the three dylibs i linked sqlite3.o against to the same dirs (from obj-debug):
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/bin/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/bin/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/bin/
- didn't fix the error. how is libstoragecomps.dylib involved?
- noticed dist/MinefieldDebug.app/Contents/MacOS also contains the dylibs, so copied them there, but same error.
- noticed dist/MinefieldDebug.app/Contents/MacOS/components also contains the dylibs, so copied them there, but that didn't work either.
- i wonder if it's some symbol location issue, like libstoragecomps.dylib was built against libsqlite3.dylib, and now i've changed libsqlite3.dylib. but i thought these dylibs weren't linked against each other?
- make -C storage, storage/build, etc. didn't fix it.
- ah: i did make -C browser/app, removed compatibility.ini et al. in my profdir, and ran the build with DYLD_LIBRARY_PATH=/Users/adw/Desktop/icu4c-4_2_1/source/lib/. i'm not sure if the first two steps were necessary, but it worked after setting DYLD_LIBRARY_PATH. so the problem was that libstoragecomps.dylib (i think? or firefox?) couldn't find those three dylibs.
- note that libicui18n.dylib was necessary in the sqlite3.o linking stage, because some collation functions were being called. that's even though the chart on the design web page said that the common lib, the one containing the word breaking code, depended on only the data lib. maybe one of the many config/compile options will make the collation stuff unnecessary.
- followup (mar 11): the error console was broken in my build, so i pulled the source, blew away obj-debug, and rebuilt. icu still works. no copying of dylibs is necessary when you set DYLD_LIBRARY_PATH to point to the dylibs, wherever on disk they may be. also did not copy the icu source into the tree; i just point -I at my icu build on the desktop.
- mar 11 2010
  - continuing fts
    - mfinkle says fennec uses our autocomplete search provider "in a wrapped way". hopefully fts will help them a lot, and i can claim to have improved fennec performance.
    - toolkit/components/places/src/nsPlacesAutoComplete.js constructs the queries (both the text of the queries and the stmts themselves).
    - sqlite custom function autocomplete_match is where the text matching actually occurs.
toolkit/components/places/src/SQLFunctions.h, toolkit/components/places/src/SQLFunctions.cpp#52
- that function is registered with the db connection here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp#1038
- MatchAutoCompleteFunction::OnFunctionCall is the entry point: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/SQLFunctions.cpp#207
- so we end up calling CaseInsensitiveFindInReadable (or a related platform string-searching function) for every row in the results against title, tags, etc. => autocomplete_match needs to be replaced?
- sqlite fts3 readme: http://www.sqlite.org/cvstrac/rlog?f=sqlite/ext/fts3/README.syntax wiki page: http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
- what would google do?
  - it looks like chromium does the fts tables lookup first, gets all the results in memory, and then does a second lookup into the history db (HistoryBackend::QueryHistoryFTS).
  - so they aren't worried about fitting all fts results in mem? they restrict the search to "present_databases_" and inside that by date (TextDatabaseManager::GetTextMatches). does "present" mean present in mem?
  - TextDatabaseManager::GetTextMatches yields a vector of TextDatabase::Match'es, which is just a simple struct containing title, url, time, "match positions", and snippet
  - then HistoryBackend::QueryHistoryFTS for-loops over those matches, looking up each in a db (i assume the history db) by the url in the match.
  - this gives me more confidence that i don't need to bend over backwards to try to do everything in sql. if it's good enough for them. but, brett wilson's name is in the chromium comments, so.
- hmm, does our current impl (autocomplete_match?) also compute positions within the string of the match? which is how the right substrings in the awesomebar dropdown are highlighted? i'd need to replicate that... can fts do it for me?
- is every column in an fts table fts indexed?
the FtsUsage wiki page isn't clear, so i assume so. chromium doesn't have to worry about it because they (i'm guessing) key off of urls; their key is also text to be searched. if i'm wrong, then they're looking up records by a non-indexed column, which is very un-google-like. [looking at HistoryBackend::URLQuerier::GetRowForURL, they have a "urls" table, and they simply SELECT some columns WHERE url=:url. using that table like an index, i guess.]
- if every col is indexed... probably not ok to stuff foreign keys into an fts table. chromium's text databases keep an auxiliary "info" table, and they're using rowid to tie that table to the "pages" fts table. yeah, comment about the design here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=19 quoting:
    // We do joins across these two tables by using their internal rowids,
    // which we keep in sync between the two tables. The internal rowid is
    // the only part of an FTS table that is indexed like a normal table,
    // and the index over it is free since sqlite always indexes the
    // internal rowid.
  (i bet brett wilson wrote that.)
- so i could do something similar; instead of times, the auxiliary table would keep track of foreign keys into moz_places and moz_bookmarks. kind of a pain in the ass though, and it seems like whenever you use fts tables you'd need to do this.
- looks like chromium has a "main" history db and an "archived" history db, judging by HistoryBackend::URLQuerier::GetRowForURL.
- can we use fts to improve find-in-page? probably not, since you'd have to crawl the page to index it first. but then subsequent finds could be faster... something to think about.
- mar 12 2010
  - from the icu faq (http://userguide.icu-project.org/icufaq):
      What is the performance difference between UTF-8 and UTF-16? Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint.
      UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South Asian scripts. There is no memory difference for Latin extensions, Greek, Cyrillic, Hebrew, and Arabic.
  - shrinking icu
    - good page on the data lib: http://userguide.icu-project.org/icudata - section Customizing ICU's Data Library - quoting:
        Reducing the size of ICU's data by eliminating unneeded resources can make sense on small systems with limited or no disk, but for desktop or server systems there is no real advantage to trimming. ICU's data is memory mapped into an application's address space, and only those portions of the data actually being used are ever paged in, so there are no significant RAM savings. As for disk space, with the large size of today's hard drives, saving a few MB is not worth the bother.
  - fts
    - rowid info
      - general: http://www.sqlite.org/lang_createtable.html#rowid
      - re: inserts: http://www.sqlite.org/autoinc.html
      - there don't appear to be any weirdnesses about giving rowids while inserting.
      - quoting http://www.sqlite.org/fts3.html:
          The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
    - wait, what the fuck is this? http://www.sqlite.org/fts3.html a giant treatise on fts3. why hasn't this come up before? hmm, when was it written? it uses a strange syntax for specifying tokenizers.
    - fts index optimization:
        Executing an SQL statement of the form "INSERT INTO <fts3-table>(<fts3-table>) VALUES('optimize')" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index.
        This can be an expensive operation, but may speed up future queries.
    - re: perf of fts, quote from http://www.sqlite.org/fts3.html:
        Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
    - hmm, you can match against cols, so could i stuff non-fts'ed stuff in an fts table (like foreign keys) and just not use that col in queries? probably it'll be indexed though.
- mar 16 2010
  - fts
    - autocomplete_match
      - aVisitCount is only used to short-circuit and negate the match if it's 0 and aSearchBehavior & HISTORY.
      - similar with aTyped, if aSearchBehavior & TYPED.
      - similar with aBookmark, if aSearchBehavior & BOOKMARK.
      - similar with aTags, if aSearchBehavior & TAG.
    - IMPORTANT!! read Appendix A of http://www.sqlite.org/fts3.html before committing to any impl! quoting:
        CREATE VIRTUAL TABLE documents USING fts3(title, content);
        SELECT title FROM documents WHERE documents MATCH <query> ORDER BY rank(matchinfo(document)) DESC OFFSET 0 LIMIT 10
        The SQL query in the example above uses less CPU than the first example in this section, but still has a non-obvious performance problem. SQLite satisfies this query by retrieving the value of the "title" column and matchinfo data from the FTS3 module for every row matched by the user's query before it sorts and limits the results. Because of the way SQLite's virtual table interface works, retrieving the value of the "title" column requires loading the entire row from disk (including the "content" field, which may be quite large). This means that if the user's query matches several thousand documents, many megabytes of "title" and "content" data may be loaded from disk into memory even though they will never be used for any purpose.
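    - the fix for that pitfall is to select only docids from the fts table first, then fetch the display columns by docid, so sqlite never loads the big content rows it won't show. a sketch of the two-step pattern with python's sqlite3 and toy data (fts4 here, assuming an fts-enabled build):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE documents USING fts4(title, content)")
db.executemany("INSERT INTO documents (title, content) VALUES (?,?)",
               [("linux kernel notes", "pretend this body is megabytes..."),
                ("cooking", "no match here")])

# step 1: match against the fts index and keep only the docids; no row
# content is pulled in for rows we'll never display.
docids = [r[0] for r in db.execute(
    "SELECT docid FROM documents WHERE documents MATCH ? LIMIT 10",
    ("linux",))]

# step 2: fetch just the columns we display, looked up by docid
# (the rowid/docid index comes for free, per the quote above).
titles = [db.execute("SELECT title FROM documents WHERE docid = ?",
                     (d,)).fetchone()[0] for d in docids]
# titles → ["linux kernel notes"]
```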
- mar 17 2010
  - fts
    - modifying places autocomplete code
      - nsPlacesAutoComplete.prototype.startSearch is the entry point.
      - nsPlacesAutoComplete.prototype._getSearch just discards the search terms that are equal to those dumb "*", "@", "#", etc. tokens the user can type to restrict the search to tags, title, etc. returns { tokens, query }, tokens = search terms that weren't discarded, query = this._getBoundSearchQuery(this._matchBehavior, aTokens).
      - _getBoundSearchQuery's query is:
          // We use more optimized queries for restricted searches, so we will
          // always return the most restrictive one to the least restrictive one
          // if more than one token is found.
          let query = this._hasBehavior("tag") ? this._tagsQuery :
                      this._hasBehavior("bookmark") ? this._bookmarkQuery :
                      this._hasBehavior("typed") ? this._typedQuery :
                      this._hasBehavior("history") ? this._historyQuery :
                      this._defaultQuery;
        so in the common case -- no dumb restricting tokens -- the query returned from _getBoundSearchQuery and _getSearch is _defaultQuery.
      - _defaultQuery is simply SQL_BASE with the additional conditions clause replaced with the empty string.
      - so i should add the fts lookup step in startSearch after _getSearch returns the tokens...?
      - AUTOCOMPLETE_MATCH is also used by _adaptiveQuery - need to change/override _getBoundAdaptiveQuery(), _defaultQuery. _getBoundKeywordQuery(tokens) doesn't use autocomplete_match.
      - nsPlacesAutoComplete.prototype.handleCompletion searches again if there are no results, so it's not just startSearch that starts a search!
    - what would google do? they synchronously execute the fts MATCH stmt: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=323 this isn't run off the main thread, is it? hmm, if they take pains to keep the text index in mem, and it looks like they do, that would work ok.
    - should the fts tables live in the same db as the tables they index? or a separate db?
what are the pros and cons? try it both ways and measure?
      - if we ever shard the places db, we should use a separate db.
      - the text db should be in mem at all times. how would keeping all tables in the same db affect that?
    - i'm getting a crash when i copy my own places.sqlite to my test profile. the crash comes when either the fts tables are built or when the match is done; not sure which, since the latter happens right after the former the first time the awesomebar is typed in. console output:
        Assertion failed: (iStart<=iEnd), function icuNext, file /Users/adw/trees/mozilla-central/db/sqlite3/src/sqlite3.c, line 110806.
        Abort trap
      googled, found this: http://www.sqlite.org/cvstrac/tktview?tn=3963,39 , a terse sqlite bug ticket that mentions a problem with icuNext and chinese text but doesn't mention a crash specifically. says icuNext should use U16_NEXT, not U8_NEXT. and indeed, chromium's fts2 uses U16_NEXT, not U8_NEXT: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/third_party/sqlite/ext/fts2/fts2_icu.c&q=file:fts2_icu.c%20package:src.chromium.org/svn/trunk&sa=N&cd=1&ct=rc&l=201 so i modified line 110799 in our sqlite3.c source and it worked. took about 1.5 or 2 minutes to build the indexes though.
    - what would google do? look at how chromium caches sql stmts: they have a GetCachedStatement() that takes an id and a string of sql. i presume if the stmt is not yet cached it becomes cached with the call, which is why the sql is given? they use a macro that generates an id based on the file and line that the stmt appears on. the nice thing is they don't have a giant list of stmts at the tops of their files with long identifiers, and the sql is next to the code that uses it. http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/app/sql/connection.h&q=package:%22src.chromium.org/svn/trunk%22%20GetCachedStatement&sa=N&cd=2&ct=rc&l=210
    - idea: since our metric is frecency, it would behoove us to keep the most frecent stuff in mem and in a small db. sharding?
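    - the GetCachedStatement idea, sketched in python (names here are invented, not chromium's actual api; chromium's macro builds the id from __FILE__ and __LINE__, and python's sqlite3 already caches compiled statements internally, so this only mimics the bookkeeping):

```python
import sqlite3

class Connection:
    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._sql_by_id = {}

    def get_cached_statement(self, stmt_id, sql, params=()):
        # the sql string is passed on every call so that whichever call
        # site runs first under a given id populates the cache; the sql
        # can then live next to the code that uses it instead of in a
        # big list of named statements at the top of the file.
        cached = self._sql_by_id.setdefault(stmt_id, sql)
        return self._db.execute(cached, params)

conn = Connection(":memory:")
conn.get_cached_statement(("setup", 1), "CREATE TABLE t(x INTEGER)")
conn.get_cached_statement(("insert", 2), "INSERT INTO t VALUES (?)", (42,))
row = conn.get_cached_statement(("query", 3), "SELECT x FROM t").fetchone()
# row → (42,)
```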
- mar 18 2010
  - fts
    - important: creating statements for each invocation of startSearch is a real perf hit -- as opposed to making a stmt beforehand and reusing it each time. maybe it depends on the query, but i noticed when i wasn't reusing a stmt i could have been, the ui was actually jerky, and i wondered why since the exec is done async. i reused the stmt and it seems to have cleared up.
    - prefix search terms are incredibly slow... converting the search string in the awesomebar to prefix terms makes it even slower than a linear scan -- even with the ORDER BY term removed! i'd better be doing something wrong.
    - ideas for doing postfix searches: http://osdir.com/ml/sqlite-users/2009-08/msg00150.html e.g., store reversed words in the index also.
    - doing some manual time testing in the javascript console. this query takes ~930ms:
        "SELECT title " +
        "FROM moz_places_fts " +
        "WHERE moz_places_fts MATCH 'm*' " +
        "LIMIT 10 "
      this takes 2ms:
        "SELECT title " +
        "FROM moz_places_fts " +
        "WHERE moz_places_fts MATCH 'm' " +
        "LIMIT 10 "
      is prefix searching really this slow? with these LIMITs, these times:
        limit:     1 | ms:  927
                  10 |      930
                 100 |      952
                1000 |      999
               10000 |     1627
      same table but for no-prefix, i.e., "MATCH 'm'":
        limit:     1 | ms:    1
                  10 |        2
                 100 |       10
                1000 |       17
               10000 |       18
      for comparison, this query:
        "SELECT title " +
        "FROM moz_places " +
        "WHERE title LIKE 'm%' " +
        "LIMIT 100 "
      has these numbers:
        limit:     1 | ms:    0
                  10 |        1
                 100 |        6
                1000 |       78
               10000 |      320
      how in the hell is this slower than a linear scan?
    - what would google do?
      - they only do prefix searches when strlen >= 3.
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=246
      - the '*' suffix is appended here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=114
      - indeed, when i search for "w" and "wi" in chrome, wikipedia entries don't show up. (actually one does on "w", but i wonder if it's unrelated, since it doesn't come up for "wi", and the word "wikipedia" is not highlighted in the dropdown.) several come up when i enter "wik".
    - so i tried testing 'moz*' instead of 'm*', and it's still slower than the "LIKE 'm%'" query.
    - huh? chromium uses fts only for history, not bookmarks? what is QueryParser::DoesQueryMatch? called by BookmarkIndex::AddMatchToResults. comment in BookmarkIndex::AddMatchToResults:
        // Check that the result matches the query. The previous search
        // was a simple per-word search, while the more complex matching
        // of QueryParser may filter it out. For example, the query
        // ["thi"] will match the bookmark titled [Thinking], but since
        // ["thi"] is quoted we don't want to do a prefix match.
      so they do two passes? http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/bookmarks/bookmark_index.cc&q=package:%22src.chromium.org/svn/trunk%22%20DoesQueryMatch&sa=N&cd=3&ct=rc&l=107
    - re: slow prefix searches: http://www.mail-archive.com/sqlite-users@sqlite.org/msg29729.html - quote:
        One reason you might find full-table-scans to sometimes be faster than fts is if fts ends up essentially looking at most documents. In that case, the fts stuff isn't going to be particularly efficient. There may be ways to take advantage of this entirely within fts, so that you can use a single syntax and have it automatically figure out how to optimize the query.
    - perf depends on the query and the documents.
i did some of the tests above with 'foo*' and "LIKE 'foo%'" instead of moz, and fts was faster this time. only 16 rows match 'foo%', but 2617 match 'moz%'. great, common terms are slower. (but when i think about my own use of the awesomebar, the terms i use are not common, i think -- using common terms isn't helpful, since they give lots of results.)
- mar 19 2010
  - fts
    - idea: do prefix searches only for the final term in the string. e.g., if the user types in "mozilla fire", the string given to sqlite is "mozilla fire*", not "mozilla* fire*". i18n issues?
- mar 23 2010
  - fts
    - _executeQueries just hands the stmts to executeAsync. _processRow assumes that it's called with each result row in order, but that doesn't appear to be guaranteed. in fact, when i search for "food" with my fts changes, one row is out of order. is this a problem without my changes? _processRow should keep an ordered list, which means it has to wait for all results before it can show any...
    - => the order of queries passed to _executeQueries is important? is shawn passing in stmt A before B because he's exploiting the fact that all results of A are guaranteed to have higher frecencies than B? if so, i'll have to partition fts tables by frecency, which means updates will really suck...
    - when timing a stock build, beware that it does a second-pass search! see handleCompletion.
    - maybe i should back up and time raw storage queries, autocomplete_search vs. fts. no awesomebar, no nsPlacesAutocomplete, no joining with other tables. and try with a very large db; maybe mine's too small to see the benefit.
- mar 25 2010
  - fts
    - timing autocomplete_match vs.
fts in raw sql stmts - methodology - places.sqlite of my "tmp" profile, which was copied over from my default profile a few weeks ago - count(moz_places) = 33831 - 34 fts tables of size 1000 rows (except for 34th table) - fts tables are in places.sqlite - all search strings are suffixed with "*" in the fts cases to perform prefix searches - obj-opt-shared - search string: "f" ("f*" for fts) - fts - num results: 16064 - sync - first result: 1-5ms after start - last result: 130ms after start (first runs after app start or other query are consistently longer: 500ms, 244, 249) - async - first: 3-4 - last: 170 (first runs: 241) - autocomplete_match - num results: 17974 - sync - first: 0-1 - last: 310-325 (first runs: 341) - async - first: 1 - last: 350 (first runs: no diff) - search string: "foo" ("foo*" for fts) - fts - num results: 64 - sync - first: 0-2 - last: 15 (first runs: 44) - async - first: 4-5 - last: 16-17 (first runs: 46) - autocomplete_match - num results: 81 - sync - first: 5 - last: 245 (first runs: 281) - async - first: 70 - last: 250 (first runs: 279) - search string: "m" ("m*" for fts) - fts - num results: 13326 - sync - first: 2 - last: 113 (first runs: 122, 171) - async - first: 3-4 - last: 130 (first runs: 628) - autocomplete_match - num results: 20304 - sync - first: 0 - last: 310 (first runs: 343) - async - first: 1 - last: 350 (first runs: 389) - search string: "moz" ("moz*" for fts) - fts - num results: 5108 - sync - first: 0-1 - last: 41 (first runs: 93) - async - first: 3-4 - last: 50-60 (first runs: 120, 109) - autocomplete_match - num results: 12683 - sync - first: 0 - last: 295-300 (first runs: 331) - async - first: 0-1 - last: 301-309 (first runs: 358) - search string: "mozilla" ("mozilla*" for fts) - fts - num results: 5031 - sync - first: 1 - last: 41 (first runs: 100) - async - first: 2-4 - last: 50-60 (first runs: 107) - autocomplete_match - num results: 12598 - sync - first: 0 - last: 295-300 (first runs: 335) - async - first: 0-1 - 
last: 300-328 (first runs: 353) - search string: "google" ("google*" for fts) - fts - num results: 4148 - sync - first: 0-1 - last: 35 (first runs: 87) - async - first: 2-3 - last: 45 (first runs: 222) - autocomplete_match - num results: 5332 - sync - first: 1 - last: 252-260 (first runs: 285) - async - first: 2-3 - last: 260 (first runs: 300) - so fts is consistently faster. why is using stock nsPlacesAutoComplete stripped to the bone sometimes faster? - it's the LIMIT clause. when i add a limit clause, raw sql times match the times i was seeing with nsPlacesAutoComplete. even when i add an ORDER BY, as nsPlacesAutoComplete uses. the ORDER BY must be using indexes -- it couldn't be so fast otherwise... yeah, when i order by title instead, which i presume is not indexed, time is much greater. - one thing the fts version does wrong: i LIMIT each stmt, but i'm running 34 stmts, so in total i'm getting 34 * limit results. - yeah, looks like this is where i'm falling behind. when i restrict the entire query to a single fts table/stmt, i beat the linear scan. - it's not the overhead of executing 34 separate stmts, because when i do all 34 in the same stmt, it's still slower.
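the "34 * limit" problem above can be sketched like this. a minimal sketch, assuming an SQLite build with fts enabled; the partition count, table names, and priority order are hypothetical stand-ins for the real 34 fts tables. the fix shown is the one chromium's date-partitioned text databases effectively use: walk the partitions in priority order and stop as soon as the overall limit is satisfied, instead of collecting partition_count * limit rows.

```python
# Sketch of the "34 * limit" problem: LIMITing each partitioned fts
# table separately can return partition_count * limit rows in total.
# Fix: iterate partitions in priority order (chromium walks its
# date-partitioned text dbs most-recent-first) and stop at the overall
# limit. Partition count and names here are hypothetical.
import sqlite3

NUM_PARTITIONS = 3  # stand-in for the 34 fts tables

conn = sqlite3.connect(":memory:")
for i in range(NUM_PARTITIONS):
    conn.execute(f"CREATE VIRTUAL TABLE pages_{i} USING fts4(url)")
    conn.executemany(
        f"INSERT INTO pages_{i} (url) VALUES (?)",
        [(f"http://moz{i}-{j}.example",) for j in range(5)],
    )

def search(term, limit):
    """Query partitions in order, asking each for only the rows still
    needed, so the union never exceeds `limit`."""
    results = []
    for i in range(NUM_PARTITIONS):
        remaining = limit - len(results)
        if remaining <= 0:
            break
        cur = conn.execute(
            f"SELECT url FROM pages_{i} WHERE pages_{i} MATCH ? LIMIT ?",
            (f"{term}*", remaining),
        )
        results.extend(row[0] for row in cur)
    return results
```

this only works if partition order agrees with result priority (dates for chromium, frecency buckets in the partition-by-frecency idea above); otherwise all partitions must be queried and the union sorted before trimming to the limit.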