User:Adw/FTS
From MozillaWiki
- mar 8 2010
  - fts in sqlite
    - module is called fts3 - http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
    - thread about l10n and fts tokenizer: http://old.nabble.com/FTS3-Unicode-support-td15078601.html
    - fts3 tokenizers readme: http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
    - see my sep 10 2009 note also
    - i don't see how to do fts in the face of multi-lingual data for us. a user can have pages of any language in his history. we would need a way of tokenizing a string whose language is unknown, or at best hinted at by the encoding of the web page it came from. we don't even have a way to tokenize for a known lang other than english.
    - could do it for urls, though. tokenize on punctuation.
  - what would google do? looked at chromium source. relevant parts:
    - HistoryContentsProvider is the class that provides data to their autocomplete. it calls HistoryBackend::QueryHistory, which calls HistoryBackend::QueryHistoryFTS if the query contains text. (branches to HistoryBackend::QueryHistoryBasic otherwise.)
    - HistoryBackend::QueryHistoryFTS in trunk/src/chrome/browser/history/history_backend.h: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.h&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=2&ct=rc
    - comment above QueryHistoryFTS: "The FTS version queries the text_database, then merges with the history DB."
    - impl is here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=1&ct=rc&l=1100
    - it queries the text database mentioned in the comment. what is it, and how is it made? the exact call is |text_database_->GetTextMatches()|.
    - text_database_ is a TextDatabaseManager ptr. they have sources named text_database.{h,cc} and text_database_manager.{h,cc}. comment above text_database_ says "Full text database manager".
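- side note: the "tokenize urls on punctuation" idea could be sketched like this (a hypothetical helper, not anything in the tree; it just treats every run of ascii letters/digits as a token):

```python
import re

def tokenize_url(url):
    """split a url into search tokens on punctuation (a naive sketch)."""
    # every run of letters/digits becomes a token; all punctuation and
    # separators ("://", ".", "?", "=", ...) are token boundaries.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", url)]

tokens = tokenize_url("http://www.sqlite.org/cvstrac/wiki?p=FtsUsage")
# tokens → ["http", "www", "sqlite", "org", "cvstrac", "wiki", "p", "ftsusage"]
```

  this sidesteps the language problem entirely, since urls are mostly ascii, but it obviously doesn't help for titles or page text.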
- comment at the top of text_database_manager.h:
    // Manages a set of text databases representing different time
    // periods. This will page them in and out as necessary, and will
    // manage queries for times spanning multiple databases.
    //
    // It will also keep a list of partial changes, such as page adds and
    // title and body sets, all of which come in at different times for a
    // given page. When all data is received or enough time has elapsed
    // since adding, the indexed data will be comitted.
    //
    // This allows us to minimize inserts and modifications, which are
    // slow for the full text database, since each page's information is
    // added exactly once.
    //
    // Note: be careful to delete the relevant entries from this
    // uncommitted list when clearing history or this information may get
    // added to the database soon after the clear.
- TextDatabaseManager::GetTextMatches: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database_manager.cc&l=401
- so they partition these text dbs by date. comment from TextDatabaseManager::GetTextMatches:
    // Iterate over the databases from the most recent backwards.
- TextDatabaseManager::GetTextMatches calls each db's TextDatabase::GetTextMatches, defined here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=293
- aha, finally we get to sqlite:
    "SELECT url, title, time, offsets(pages), body "
    "FROM pages LEFT OUTER JOIN info ON pages.rowid = info.rowid "
    "WHERE pages MATCH ? AND time >= ? AND time < ? "
    "ORDER BY time DESC "
    "LIMIT ?"));
- TextDatabase::AddPageData i'm guessing does the insertions:
    // Add to the pages table.
    "INSERT INTO pages (url, title, body) VALUES (?,?,?)"
    // Add to the info table with the same rowid.
    "INSERT INTO info (rowid, time) VALUES (?,?)"
  http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=196
- question is, how are they handling non-ascii data? - aha!
TextDatabase::CreateTables:
    // FTS table of page contents.
    if (!db_.DoesTableExist("pages")) {
      if (!db_.Execute("CREATE VIRTUAL TABLE pages USING fts2("
                       "TOKENIZE icu,"
                       "url LONGVARCHAR,"
                       "title LONGVARCHAR,"
                       "body LONGVARCHAR)"))
        return false;
    }
    // Non-FTS table containing URLs and times so we can efficiently find
    // them using a regular index (all FTS columns are special and are
    // treated as full-text-search, which is not what we want when
    // retrieving this data).
    if (!db_.DoesTableExist("info")) {
      // Note that there is no point in creating an index over time. Since
      // we must always query the entire FTS table (it can not efficiently do
      // subsets), we will always end up doing that first, and joining the
      // info table off of that.
      if (!db_.Execute("CREATE TABLE info(time INTEGER NOT NULL)"))
        return false;
    }
- "icu" stands for "International Components for Unicode"
- icu is an extension to sqlite: http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
- important: icu is a library, but it's not a library provided by sqlite. sqlite's icu facilities are merely glue to integrate the icu lib with sqlite. see comment here: http://www.hwaci.com/cgi-bin/sqlite/artifact/850e9a3656
- shawn's comment in the thread linked above suggests that we aren't likely to include icu in firefox... huh, chromium must include it then. looks like it: http://src.chromium.org/svn/trunk/src/base/third_party/icu, although the readme says it's just a subset. oh, here? http://src.chromium.org/svn/trunk/deps/third_party/icu42 http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
- sqlite's icu-based tokenizer is here: http://www.hwaci.com/cgi-bin/sqlite/artifact/ac494aed69 can we port it to gecko, ugh? does icu provide things that gecko doesn't? these questions might be answered in bug 472764. (a quick search for "port" showed that they talked about porting other impls at least, but nothing concrete came of it afaict.)
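- to make sure i understand chromium's two-table design (an fts table joined to a plain table by rowid), here's a sketch with python's sqlite3. fts4 and the default tokenizer stand in for fts2 + icu, since that's what a stock sqlite build provides; table and column names follow chromium's:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# searchable text goes in the fts table; non-text data (time) goes in a
# plain "info" table tied to it by rowid, which sqlite indexes for free.
db.execute("CREATE VIRTUAL TABLE pages USING fts4(url, title, body)")
db.execute("CREATE TABLE info(time INTEGER NOT NULL)")

def add_page(url, title, body, time):
    # insert into the fts table, then reuse its rowid for the info row,
    # keeping the two tables in sync (TextDatabase::AddPageData's trick)
    cur = db.execute("INSERT INTO pages (url, title, body) VALUES (?,?,?)",
                     (url, title, body))
    db.execute("INSERT INTO info (rowid, time) VALUES (?,?)",
               (cur.lastrowid, time))

add_page("http://example.org/a", "mozilla fts notes", "full text search", 100)
add_page("http://example.org/b", "unrelated", "nothing here", 200)

# the same shape as TextDatabase::GetTextMatches' query
rows = db.execute(
    "SELECT url, title, time FROM pages "
    "LEFT OUTER JOIN info ON pages.rowid = info.rowid "
    "WHERE pages MATCH ? AND time >= ? AND time < ? "
    "ORDER BY time DESC LIMIT ?", ("search", 0, 1000, 10)).fetchall()
# rows → [("http://example.org/a", "mozilla fts notes", 100)]
```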
- looks like icu has facilities for getting and iterating over code points, which i think means characters? word iterator also, which can detect word boundaries: http://icu-project.org/apiref/icu4c/ubrk_8h.html
- i don't think anything like that exists in gecko. surely the smart people in bug 472764 would have known.
  - nsIEntityConverter.idl?
  - what is this? http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#64
      { "Word Breaker", NS_WBRK_CID,
        NS_WBRK_CONTRACTID, nsSampleWordBreakerConstructor},
  - nsISemanticUnitScanner
  - this defines a class called nsIWordBreaker, but there's no idl: http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h an instance of it is exposed as a static method on nsContentUtils: http://mxr.mozilla.org/mozilla-central/source/content/base/public/nsContentUtils.h#587 i wonder if there's no idl because it's not implemented. there's intl/lwbrk/src/nsSampleWordBreaker.cpp which looks half-assed.
- i wonder if the icu tokenizer simply tokenizes on code points (assuming that translates into characters), because when i type the "go" part of "nihongo" into chrome's address bar in kanji, my visit to wikipedia's nihongo page shows up. (same for "ji" in ja.wikipedia's shokuji page.) but then that would mean that english words get tokenized into chars also, and they don't... chrome also indexes the text of pages, so maybe it's pulling out those chars from there.
- oh:
    class nsSemanticUnitScanner : public nsISemanticUnitScanner
                                , public nsSampleWordBreaker
- testing chromium's tokenization: http://doraku.asahi.com/lifestyle/pet/gallery_091201.html?ref=kanren page title has this run of katakana: スーパードッグカーニバル (super dog carnival). スーパー is matched. ドッグ is matched, but the カ is also highlighted. カーニバル is not matched.
- mar 9 2010
  - fts
    - conclusion: gecko includes a less-than-half-assed i18n tokenizer, nsISemanticUnitScanner. it supports ascii and half-assedly supports cjk.
(e.g., it doesn't differentiate between kanji in chinese langs and kanji in japanese langs; there's even a note in the code (nsSampleWordBreaker.cpp) about breaking words per kanji in chinese but using bigger units in japanese, but the japanese case is not implemented, so all individual kanji are treated as words.) no other langs. icu is a "big" library because it's competent, unlike gecko's i18n support, which is shoddy and scattershot. either someone has to do the immense work of re-implementing icu's facilities in gecko, or we use icu. both tasks are too big for me to lift. the first is stupid, since the open source community has already done that work. the second is politics, but it's the better option. we will never have fts unless someone who knows how to work the levers better than i do drives the effort.
- should look at:
  - porting or including icu's tokenization bits, but i suspect they have deep dependencies on large parts of the rest of the lib.
  - marco looked at their web site, points out that icu is meant to be configurable and modularizable. maybe we can include the word break code after all.
    - http://source.icu-project.org/repos/icu/icu/trunk/readme.html
    - http://source.icu-project.org/repos/icu/icu/trunk/readme.html#HowToPackage
    - http://userguide.icu-project.org/packaging
    - http://userguide.icu-project.org/design#TOC-API-Dependencies
    - http://icu-project.org/repos/icu/icuhtml/trunk/design/modularization/icu_modularization_overview.html
    - http://userguide.icu-project.org/boundaryanalysis
  - note: icu is sometimes called icu4c to distinguish the c/c++ version of the lib. (there's also icu4j, a java version.)
  - chromium's changes to icu, including word-breaking changes: http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
  - chromium's dir structure re: icu: icu has only 3 top-level dirs, as_is, packaging, and source. they removed the first two since they don't need them. they added all dirs except source.
public contains some dirs they moved out of source. see readme.google. so they have: http://src.chromium.org/svn/trunk/deps/third_party/icu42 which is the root of the tarball that icu provides.
- including other tokenizers. marco mentioned boost has some.
- is thunderbird's fts search really ascii and cjk only? what happens for users of other langs, like arabic?
- mar 10 2010
  - continuing fts
    - looking at other non-icu tokenizers like marco suggested yesterday
      - boost has tokenizers, but they're simple and i18n-unaware. http://www.boost.org/doc/libs?view=categorized
    - looking at state-of-the-art fts in thunderbird
      - apparently still using bi-grams. i don't know how well it works in practice for non-cjk text. should ask andrew sutherland.
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/fts3_porter.c
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/README.mozilla
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/public/nsIFts3Tokenizer.idl
      - chat with andrew sutherland (asuth) in #maildev, see chatlog today. upshot: they're not planning more work on their tokenizer. my thoughts: all the progress so far has been due to makoto. as it exists now, it's not good enough for firefox.
    - trying to hack together icu, sqlite, and firefox just to get it running:
      - downloaded and untarred the icu source to intl/icu.
      - copied the source to a directory on my desktop and built there. followed the instructions in the tarball's readme but did not do the install step.
      - copied source/lib/libicudata.42.1.dylib and libicuuc.42.1.dylib from the desktop build into obj-debug/dist/MinefieldDebug.app/Contents/MacOS/components/. i have no idea what i'm doing.
      - copied source/common/unicode/platform.h from the desktop build to the source that i dropped into the mozilla tree. that file is generated from platform.h.in when building.
- modified db/sqlite3/src/Makefile.in by adding at the very end of the file (because i think the include $(topsrcdir)/config/rules.mk is what contains whatever cflags are set by the build system):
    CFLAGS += -I$(topsrcdir)/intl/icu/source/common -I$(topsrcdir)/intl/icu/source/i18n/
  probably could/should (for the time being) have just pointed these at my desktop build.
- also modified that makefile to include -DSQLITE_ENABLE_ICU=1.
- that made the compile work, but now i get linker errors: undefined symbols in sqlite3.o. hmm. i have an icu dylib from building it. how do i get sqlite3.o to not need to statically link against a static icu lib? the other option is to do just that: make an icu static lib and link sqlite3.o against it.
- sqlite's icu extension readme says the easiest way is to build the extension as a dynamically loadable sqlite extension. i guess that is exactly what i am not doing by using the SQLITE_ENABLE_ICU amalgamation switch. sdwilsh says moz storage doesn't expose the sqlite dynamic extension loading api though, but there's a bug for it.
- perhaps just to get started i can make an icu static lib and link sqlite3.o against it.
- libmozsqlite3.dylib did not change in size after my changes.
- sqlite is borked on start up; i get this error:
    nsNativeModuleLoader::LoadModule("/Users/adw/trees/mozilla-central/obj-debug/dist/MinefieldDebug.app/Contents/MacOS/components/libstoragecomps.dylib") - load FAILED, rv: 80004005, error: Unknown error: -2804
- noticed that the last step in make -C db/sqlite3/src is:
    /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/lib
    /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/bin
  i.e., installing dylibs into obj-debug/dist/lib and dist/bin, so i copied the three dylibs i linked sqlite3.o against to the same dirs (from obj-debug):
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/lib/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/bin/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/bin/
    cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/bin/
- didn't fix the error. how is libstoragecomps.dylib involved?
- noticed dist/MinefieldDebug.app/Contents/MacOS also contains the dylibs, so copied them there, but same error.
- noticed dist/MinefieldDebug.app/Contents/MacOS/components also contains the dylibs, so copied them there, but that didn't work either.
- i wonder if it's some symbol location issue, like libstoragecomps.dylib was built against libsqlite3.dylib, and now i've changed libsqlite3.dylib. but i thought these dylibs weren't linked against each other?
- make -C storage, storage/build, etc. didn't fix it.
- ah: i did make -C browser/app, removed compatibility.ini et al. in my profdir, and ran the build with DYLD_LIBRARY_PATH=/Users/adw/Desktop/icu4c-4_2_1/source/lib/. i'm not sure if the first two steps were necessary, but it worked after setting DYLD_LIBRARY_PATH. so the problem was that libstoragecomps.dylib (i think? or firefox?) couldn't find those three dylibs.
- note that libicui18n.dylib was necessary in the sqlite3.o linking stage, because some collation functions were being called. that's even though the chart on the design web page said that the common lib, the one containing the word breaking code, depended on only the data lib. maybe one of the many config/compile options will make the collation stuff unnecessary.
- followup (mar 11): the error console was broken in my build, so i pulled the source, blew away obj-debug, and rebuilt. icu still works. no copying of dylibs is necessary when you set DYLD_LIBRARY_PATH to point to the dylibs, wherever on disk they may be. also did not copy the icu source into the tree; i just point -I at my icu build on the desktop.
- mar 11 2010
  - continuing fts
    - mfinkle says fennec uses our autocomplete search provider "in a wrapped way". hopefully fts will help them a lot, and i can claim to have improved fennec performance.
    - toolkit/components/places/src/nsPlacesAutoComplete.js constructs the queries (both the text of the queries and the stmts themselves).
    - sqlite custom function autocomplete_match is where the text matching actually occurs.
toolkit/components/places/src/SQLFunctions.h, toolkit/components/places/src/SQLFunctions.cpp#52
- that function is registered with the db connection here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp#1038
- MatchAutoCompleteFunction::OnFunctionCall is the entry point: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/SQLFunctions.cpp#207
- so we end up calling CaseInsensitiveFindInReadable (or a related platform string-searching function) for every row in the results against title, tags, etc. => autocomplete_match needs to be replaced?
- sqlite fts3 readme: http://www.sqlite.org/cvstrac/rlog?f=sqlite/ext/fts3/README.syntax wiki page: http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
- what would google do?
  - it looks like chromium does the fts tables lookup first, gets all the results in memory, and then does a second lookup into the history db (HistoryBackend::QueryHistoryFTS).
  - so they aren't worried about fitting all fts results in mem? they restrict the search to "present_databases_" and inside that by date (TextDatabaseManager::GetTextMatches). does "present" mean present in mem?
  - TextDatabaseManager::GetTextMatches yields a vector of TextDatabase::Match'es, which is just a simple struct containing title, url, time, "match positions", and snippet
  - then HistoryBackend::QueryHistoryFTS for-loops over those matches, looking up each in a db (i assume the history db) by the url in the match.
  - this gives me more confidence that i don't need to bend over backwards to try to do everything in sql. if it's good enough for them. but, brett wilson's name is in the chromium comments, so.
- hmm, does our current impl (autocomplete_match?) also compute positions within the string of the match? which is how the right substrings in the awesomebar dropdown are highlighted? i'd need to replicate that... can fts do it for me?
- is every column in an fts table fts indexed?
the FtsUsage wiki page isn't clear, so i assume so. chromium doesn't have to worry about it because they (i'm guessing) key off of urls; their key is also text to be searched. if i'm wrong, then they're looking up records by a non-indexed column, which is very un-google-like. [looking at HistoryBackend::URLQuerier::GetRowForURL, they have a "urls" table, and they simply SELECT some columns WHERE url=:url. using that table like an index, i guess.]
- if every col is indexed... probably not ok to stuff foreign keys into an fts table. chromium's text databases keep an auxiliary "info" table, and they're using rowid to tie that table to the "pages" fts table. yeah, comment about the design here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=19 quoting:
    // We do joins across these two tables by using their internal rowids,
    // which we keep in sync between the two tables. The internal rowid is
    // the only part of an FTS table that is indexed like a normal table,
    // and the index over it is free since sqlite always indexes the
    // internal rowid.
  (i bet brett wilson wrote that.)
- so i could do something similar; instead of times, the auxiliary table would keep track of foreign keys into moz_places and moz_bookmarks. kind of a pain in the ass though, and it seems like whenever you use fts tables you'd need to do this.
- looks like chromium has a "main" history db and an "archived" history db, judging by HistoryBackend::URLQuerier::GetRowForURL.
- can we use fts to improve find-in-page? probably not, since you'd have to crawl the page to index it first. but then subsequent finds could be faster... something to think about.
- mar 12 2010
  - from the icu faq (http://userguide.icu-project.org/icufaq):
      What is the performance difference between UTF-8 and UTF-16? Most of the time, the memory throughput of the hard drive and RAM is the main performance constraint.
      UTF-8 is 50% smaller than UTF-16 for US-ASCII, but UTF-8 is 50% larger than UTF-16 for East and South Asian scripts. There is no memory difference for Latin extensions, Greek, Cyrillic, Hebrew, and Arabic.
  - shrinking icu
    - good page on the data lib: http://userguide.icu-project.org/icudata - section Customizing ICU's Data Library - quoting:
        Reducing the size of ICU's data by eliminating unneeded resources can make sense on small systems with limited or no disk, but for desktop or server systems there is no real advantage to trimming. ICU's data is memory mapped into an application's address space, and only those portions of the data actually being used are ever paged in, so there are no significant RAM savings. As for disk space, with the large size of today's hard drives, saving a few MB is not worth the bother.
  - fts
    - rowid info
      - general: http://www.sqlite.org/lang_createtable.html#rowid
      - re: inserts: http://www.sqlite.org/autoinc.html
      - there don't appear to be any weirdnesses about giving rowids while inserting.
      - quoting http://www.sqlite.org/fts3.html:
          The rowid of an FTS3 table behaves in the same way as the rowid column of an ordinary SQLite table, except that the values stored in the rowid column of an FTS3 table remain unchanged if the database is rebuilt using the VACUUM command. For FTS3 tables, "docid" is allowed as an alias along with the usual "rowid", "oid" and "_oid_" identifiers. Attempting to insert or update a row with a docid value that already exists in the table is an error, just as it would be with an ordinary SQLite table.
    - wait, what the fuck is this? http://www.sqlite.org/fts3.html a giant treatise on fts3. why hasn't this come up before? hmm, when was it written? it uses a strange syntax for specifying tokenizers.
    - fts index optimization:
        Executing an SQL statement of the form "INSERT INTO <fts3-table>(<fts3-table>) VALUES('optimize')" causes FTS3 to merge all existing index b-trees into a single large b-tree containing the entire index.
        This can be an expensive operation, but may speed up future queries.
    - re: perf of fts, quote from http://www.sqlite.org/fts3.html:
        Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
    - hmm, you can match against cols, so could i stuff non-fts'ed stuff in an fts table (like foreign keys) and just not use that col in queries? probably it'll be indexed though.
- mar 16 2010
  - fts
    - autocomplete_match
      - aVisitCount is only used to short-circuit and negate the match if it's 0 and aSearchBehavior & HISTORY.
      - similar with aTyped, if aSearchBehavior & TYPED.
      - similar with aBookmark, if aSearchBehavior & BOOKMARK.
      - similar with aTags, if aSearchBehavior & TAG.
    - IMPORTANT!! read Appendix A of http://www.sqlite.org/fts3.html before committing to any impl! quoting:
        CREATE VIRTUAL TABLE documents USING fts3(title, content);
        SELECT title FROM documents WHERE documents MATCH <query> ORDER BY rank(matchinfo(document)) DESC OFFSET 0 LIMIT 10
        The SQL query in the example above uses less CPU than the first example in this section, but still has a non-obvious performance problem. SQLite satisfies this query by retrieving the value of the "title" column and matchinfo data from the FTS3 module for every row matched by the user's query before it sorts and limits the results. Because of the way SQLite's virtual table interface works, retrieving the value of the "title" column requires loading the entire row from disk (including the "content" field, which may be quite large). This means that if the user's query matches several thousand documents, many megabytes of "title" and "content" data may be loaded from disk into memory even though they will never be used for any purpose.
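    - the fix for that pitfall is to select only docids from the fts table first, then fetch the display columns by docid, so sqlite never loads the big content rows it won't show. a sketch of the two-step pattern with python's sqlite3 and toy data (fts4 here, assuming an fts-enabled build):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE documents USING fts4(title, content)")
db.executemany("INSERT INTO documents (title, content) VALUES (?,?)",
               [("linux kernel notes", "pretend this body is megabytes..."),
                ("cooking", "no match here")])

# step 1: match against the fts index and keep only the docids; no row
# content is pulled in for rows we'll never display.
docids = [r[0] for r in db.execute(
    "SELECT docid FROM documents WHERE documents MATCH ? LIMIT 10",
    ("linux",))]

# step 2: fetch just the columns we display, looked up by docid
# (the rowid/docid index comes for free, per the quote above).
titles = [db.execute("SELECT title FROM documents WHERE docid = ?",
                     (d,)).fetchone()[0] for d in docids]
# titles → ["linux kernel notes"]
```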
- mar 17 2010
  - fts
    - modifying places autocomplete code
      - nsPlacesAutoComplete.prototype.startSearch is the entry point.
      - nsPlacesAutoComplete.prototype._getSearch just discards the search terms that are equal to those dumb "*", "@", "#", etc. tokens the user can type to restrict the search to tags, title, etc. returns { tokens, query }, tokens = search terms that weren't discarded, query = this._getBoundSearchQuery(this._matchBehavior, aTokens).
      - _getBoundSearchQuery's query is:
          // We use more optimized queries for restricted searches, so we will
          // always return the most restrictive one to the least restrictive one
          // if more than one token is found.
          let query = this._hasBehavior("tag") ? this._tagsQuery :
                      this._hasBehavior("bookmark") ? this._bookmarkQuery :
                      this._hasBehavior("typed") ? this._typedQuery :
                      this._hasBehavior("history") ? this._historyQuery :
                      this._defaultQuery;
        so in the common case -- no dumb restricting tokens -- the query returned from _getBoundSearchQuery and _getSearch is _defaultQuery.
      - _defaultQuery is simply SQL_BASE with the additional conditions clause replaced with the empty string.
      - so i should add the fts lookup step in startSearch after _getSearch returns the tokens...?
      - AUTOCOMPLETE_MATCH is also used by _adaptiveQuery - need to change/override _getBoundAdaptiveQuery(), _defaultQuery. _getBoundKeywordQuery(tokens) doesn't use autocomplete_match.
      - nsPlacesAutoComplete.prototype.handleCompletion searches again if there are no results, so it's not just startSearch that starts a search!
    - what would google do? they synchronously execute the fts MATCH stmt: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=323 this isn't run off the main thread, is it? hmm, if they take pains to keep the text index in mem, and it looks like they do, that would work ok.
    - should the fts tables live in the same db as the tables they index? or a separate db?
what are the pros and cons? try it both ways and measure?
      - if we ever shard the places db, we should use a separate db.
      - the text db should be in mem at all times. how would keeping all tables in the same db affect that?
    - i'm getting a crash when i copy my own places.sqlite to my test profile. the crash comes when either the fts tables are built or when the match is done; not sure which, since the latter happens right after the former the first time the awesomebar is typed in. console output:
        Assertion failed: (iStart<=iEnd), function icuNext, file /Users/adw/trees/mozilla-central/db/sqlite3/src/sqlite3.c, line 110806.
        Abort trap
      googled, found this: http://www.sqlite.org/cvstrac/tktview?tn=3963,39 , a terse sqlite bug ticket that mentions a problem with icuNext and chinese text but doesn't mention a crash specifically. says icuNext should use U16_NEXT, not U8_NEXT. and indeed, chromium's fts2 uses U16_NEXT, not U8_NEXT: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/third_party/sqlite/ext/fts2/fts2_icu.c&q=file:fts2_icu.c%20package:src.chromium.org/svn/trunk&sa=N&cd=1&ct=rc&l=201 so i modified line 110799 in our sqlite3.c source and it worked. took about 1.5 or 2 minutes to build the indexes though.
    - what would google do? look at how chromium caches sql stmts: they have a GetCachedStatement() that takes an id and a string of sql. i presume if the stmt is not yet cached it becomes cached with the call, which is why the sql is given? they use a macro that generates an id based on the file and line that the stmt appears on. the nice thing is they don't have a giant list of stmts at the tops of their files with long identifiers, and the sql is next to the code that uses it. http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/app/sql/connection.h&q=package:%22src.chromium.org/svn/trunk%22%20GetCachedStatement&sa=N&cd=2&ct=rc&l=210
    - idea: since our metric is frecency, it would behoove us to keep the most frecent stuff in mem and in a small db. sharding?
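    - the GetCachedStatement idea, sketched in python (names here are invented, not chromium's actual api; chromium's macro builds the id from __FILE__ and __LINE__, and python's sqlite3 already caches compiled statements internally, so this only mimics the bookkeeping):

```python
import sqlite3

class Connection:
    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._sql_by_id = {}

    def get_cached_statement(self, stmt_id, sql, params=()):
        # the sql string is passed on every call so that whichever call
        # site runs first under a given id populates the cache; the sql
        # can then live next to the code that uses it instead of in a
        # big list of named statements at the top of the file.
        cached = self._sql_by_id.setdefault(stmt_id, sql)
        return self._db.execute(cached, params)

conn = Connection(":memory:")
conn.get_cached_statement(("setup", 1), "CREATE TABLE t(x INTEGER)")
conn.get_cached_statement(("insert", 2), "INSERT INTO t VALUES (?)", (42,))
row = conn.get_cached_statement(("query", 3), "SELECT x FROM t").fetchone()
# row → (42,)
```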
- mar 18 2010
  - fts
    - important: creating statements for each invocation of startSearch is a real perf hit -- as opposed to making a stmt beforehand and reusing it each time. maybe it depends on the query, but i noticed when i wasn't reusing a stmt i could have been, the ui was actually jerky, and i wondered why since the exec is done async. i reused the stmt and it seems to have cleared up.
    - prefix search terms are incredibly slow... converting the search string in the awesomebar to prefix terms makes it even slower than a linear scan -- even with the ORDER BY term removed! i'd better be doing something wrong.
    - ideas for doing postfix searches: http://osdir.com/ml/sqlite-users/2009-08/msg00150.html e.g., store reversed words in the index also.
    - doing some manual time testing in the javascript console. this query takes ~930ms:
        "SELECT title " +
        "FROM moz_places_fts " +
        "WHERE moz_places_fts MATCH 'm*' " +
        "LIMIT 10 "
      this takes 2ms:
        "SELECT title " +
        "FROM moz_places_fts " +
        "WHERE moz_places_fts MATCH 'm' " +
        "LIMIT 10 "
      is prefix searching really this slow? with these LIMITs, these times:
        limit:     1 | ms:  927
                  10 |      930
                 100 |      952
                1000 |      999
               10000 |     1627
      same table but for no-prefix, i.e., "MATCH 'm'":
        limit:     1 | ms:    1
                  10 |        2
                 100 |       10
                1000 |       17
               10000 |       18
      for comparison, this query:
        "SELECT title " +
        "FROM moz_places " +
        "WHERE title LIKE 'm%' " +
        "LIMIT 100 "
      has these numbers:
        limit:     1 | ms:    0
                  10 |        1
                 100 |        6
                1000 |       78
               10000 |      320
      how in the hell is this slower than a linear scan?
    - what would google do?
      - they only do prefix searches when strlen >= 3.
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=246
      - the '*' suffix is appended here: http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=114
      - indeed, when i search for "w" and "wi" in chrome, wikipedia entries don't show up. (actually one does on "w", but i wonder if it's unrelated, since it doesn't come up for "wi", and the word "wikipedia" is not highlighted in the dropdown.) several come up when i enter "wik".
    - so i tried testing 'moz*' instead of 'm*', and it's still slower than the "LIKE 'm%'" query.
    - huh? chromium uses fts only for history, not bookmarks? what is QueryParser::DoesQueryMatch? called by BookmarkIndex::AddMatchToResults. comment in BookmarkIndex::AddMatchToResults:
        // Check that the result matches the query. The previous search
        // was a simple per-word search, while the more complex matching
        // of QueryParser may filter it out. For example, the query
        // ["thi"] will match the bookmark titled [Thinking], but since
        // ["thi"] is quoted we don't want to do a prefix match.
      so they do two passes? http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/bookmarks/bookmark_index.cc&q=package:%22src.chromium.org/svn/trunk%22%20DoesQueryMatch&sa=N&cd=3&ct=rc&l=107
    - re: slow prefix searches: http://www.mail-archive.com/sqlite-users@sqlite.org/msg29729.html - quote:
        One reason you might find full-table-scans to sometimes be faster than fts is if fts ends up essentially looking at most documents. In that case, the fts stuff isn't going to be particularly efficient. There may be ways to take advantage of this entirely within fts, so that you can use a single syntax and have it automatically figure out how to optimize the query.
    - perf depends on the query and the documents.
i did some of the tests above with 'foo*' and "LIKE 'foo%'" instead of moz, and fts was faster this time. only 16 rows match 'foo%', but 2617 match 'moz%'. great, common terms are slower. (but when i think about my own use of the awesomebar, the terms i use are not common, i think -- using common terms isn't helpful, since they give lots of results.)
- mar 19 2010
  - fts
    - idea: do prefix searches only for the final term in the string. e.g., if the user types in "mozilla fire", the string given to sqlite is "mozilla fire*", not "mozilla* fire*". i18n issues?
- mar 23 2010
  - fts
    - _executeQueries just hands the stmts to executeAsync. _processRow assumes that it's called with each result row in order, but that doesn't appear to be guaranteed. in fact, when i search for "food" with my fts changes, one row is out of order. is this a problem without my changes? _processRow should keep an ordered list, which means it has to wait for all results before it can show any...
    - => the order of queries passed to _executeQueries is important? is shawn passing in stmt A before B because he's exploiting the fact that all results of A are guaranteed to have higher frecencies than B? if so, i'll have to partition fts tables by frecency, which means updates will really suck...
    - when timing a stock build, beware that it does a second-pass search! see handleCompletion.
    - maybe i should back up and time raw storage queries, autocomplete_search vs. fts. no awesomebar, no nsPlacesAutocomplete, no joining with other tables. and try with a very large db; maybe mine's too small to see the benefit.
- mar 25 2010
  - fts
    - timing autocomplete_match vs.
fts in raw sql stmts - methodology - places.sqlite of my "tmp" profile, which was copied over from my default profile a few weeks ago - count(moz_places) = 33831 - 34 fts tables of size 1000 rows (except for 34th table) - fts tables are in places.sqlite - all search strings are suffixed with "*" in the fts cases to perform prefix searches - obj-opt-shared - search string: "f" ("f*" for fts) - fts - num results: 16064 - sync - first result: 1-5ms after start - last result: 130ms after start (first runs after app start or other query are consistently longer: 500ms, 244, 249) - async - first: 3-4 - last: 170 (first runs: 241) - autocomplete_match - num results: 17974 - sync - first: 0-1 - last: 310-325 (first runs: 341) - async - first: 1 - last: 350 (first runs: no diff) - search string: "foo" ("foo*" for fts) - fts - num results: 64 - sync - first: 0-2 - last: 15 (first runs: 44) - async - first: 4-5 - last: 16-17 (first runs: 46) - autocomplete_match - num results: 81 - sync - first: 5 - last: 245 (first runs: 281) - async - first: 70 - last: 250 (first runs: 279) - search string: "m" ("m*" for fts) - fts - num results: 13326 - sync - first: 2 - last: 113 (first runs: 122, 171) - async - first: 3-4 - last: 130 (first runs: 628) - autocomplete_match - num results: 20304 - sync - first: 0 - last: 310 (first runs: 343) - async - first: 1 - last: 350 (first runs: 389) - search string: "moz" ("moz*" for fts) - fts - num results: 5108 - sync - first: 0-1 - last: 41 (first runs: 93) - async - first: 3-4 - last: 50-60 (first runs: 120, 109) - autocomplete_match - num results: 12683 - sync - first: 0 - last: 295-300 (first runs: 331) - async - first: 0-1 - last: 301-309 (first runs: 358) - search string: "mozilla" ("mozilla*" for fts) - fts - num results: 5031 - sync - first: 1 - last: 41 (first runs: 100) - async - first: 2-4 - last: 50-60 (first runs: 107) - autocomplete_match - num results: 12598 - sync - first: 0 - last: 295-300 (first runs: 335) - async - first: 0-1 - 
last: 300-328 (first runs: 353) - search string: "google" ("google*" for fts) - fts - num results: 4148 - sync - first: 0-1 - last: 35 (first runs: 87) - async - first: 2-3 - last: 45 (first runs: 222) - autocomplete_match - num results: 5332 - sync - first: 1 - last: 252-260 (first runs: 285) - async - first: 2-3 - last: 260 (first runs: 300) - so fts is consistently faster. why is using stock nsPlacesAutoComplete stripped to the bone sometimes faster? - it's the LIMIT clause. when i add a limit clause, raw sql times match the times i was seeing with nsPlacesAutoComplete. even when i add an ORDER BY, as nsPlacesAutoComplete uses. the ORDER BY must be using indexes -- it couldn't be so fast otherwise... yeah, when i order by title instead, which i presume is not indexed, time is much greater. - one thing the fts version does wrong: i LIMIT each stmt, but i'm running 34 stmts, so in total i'm getting 34 * limit results. - yeah, looks like this is where i'm falling behind. when i restrict the entire query to a single fts table/stmt, i beat the linear scan. - it's not the overhead of executing 34 separate stmts, because when i do all 34 in the same stmt, it's still slower.
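the "34 * limit" problem above can be sketched like this. a minimal sketch, assuming an SQLite build with fts enabled; the partition count, table names, and priority order are hypothetical stand-ins for the real 34 fts tables. the fix shown is the one chromium's date-partitioned text databases effectively use: walk the partitions in priority order and stop as soon as the overall limit is satisfied, instead of collecting partition_count * limit rows.

```python
# Sketch of the "34 * limit" problem: LIMITing each partitioned fts
# table separately can return partition_count * limit rows in total.
# Fix: iterate partitions in priority order (chromium walks its
# date-partitioned text dbs most-recent-first) and stop at the overall
# limit. Partition count and names here are hypothetical.
import sqlite3

NUM_PARTITIONS = 3  # stand-in for the 34 fts tables

conn = sqlite3.connect(":memory:")
for i in range(NUM_PARTITIONS):
    conn.execute(f"CREATE VIRTUAL TABLE pages_{i} USING fts4(url)")
    conn.executemany(
        f"INSERT INTO pages_{i} (url) VALUES (?)",
        [(f"http://moz{i}-{j}.example",) for j in range(5)],
    )

def search(term, limit):
    """Query partitions in order, asking each for only the rows still
    needed, so the union never exceeds `limit`."""
    results = []
    for i in range(NUM_PARTITIONS):
        remaining = limit - len(results)
        if remaining <= 0:
            break
        cur = conn.execute(
            f"SELECT url FROM pages_{i} WHERE pages_{i} MATCH ? LIMIT ?",
            (f"{term}*", remaining),
        )
        results.extend(row[0] for row in cur)
    return results
```

this only works if partition order agrees with result priority (dates for chromium, frecency buckets in the partition-by-frecency idea above); otherwise all partitions must be queried and the union sorted before trimming to the limit.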