User:Adw/FTS

- mar 8 2010
  - fts in sqlite
    - module is called fts3
    - http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
    - thread about l10n and fts tokenizer:
      http://old.nabble.com/FTS3-Unicode-support-td15078601.html
    - fts3 tokenizers readme:
      http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
    - see my sep 10 2009 note also
    - i don't see how to do fts in the face of multi-lingual data for us.  a
      user can have pages of any language in his history.  would need a way of
      tokenizing a string whose language is unknown or at best hinted at by
      the encoding of the web page it came from.  we don't even have a way to
      tokenize for a known lang other than english.
      - could do it for urls, though.  tokenize on punctuation.
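        e.g., something dumb like this (naive ascii-only sketch with a made-up
        delimiter set; real urls would need percent-escape and idn handling):
          #include <stdio.h>
          #include <string.h>

          int main(void) {
            char url[] = "http://www.sqlite.org/cvstrac/wiki?p=FtsUsage";
            /* treat the punctuation that shows up in urls as delimiters */
            const char *delims = ":/?&=._-#%+~";
            for (char *tok = strtok(url, delims); tok != NULL;
                 tok = strtok(NULL, delims))
              printf("%s\n", tok);   /* http, www, sqlite, org, ... */
            return 0;
          }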
    - what would google do?  looked at chromium source.  relevant parts:
      - HistoryContentsProvider is the class that provides data to their
        autocomplete.  it calls HistoryBackend::QueryHistory, which calls
        HistoryBackend::QueryHistoryFTS if the query contains text.  (branches
        to HistoryBackend::QueryHistoryBasic otherwise.)
      - HistoryBackend::QueryHistoryFTS in
        trunk/src/chrome/browser/history/history_backend.h:
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.h&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=2&ct=rc
      - comment above QueryHistoryFTS: "The FTS version queries the
        text_database, then merges with the history DB."
      - impl is here:
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/history_backend.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryHistoryFTS&sa=N&cd=1&ct=rc&l=1100
      - it queries the text database mentioned in the comment.  what is it,
        how's it made?  the exact call is |text_database_->GetTextMatches()|.
      - text_database_ is a TextDatabaseManager ptr.  they have sources named
        text_database.{h,cc} and text_database_manager.{h,cc}.  comment above
        text_database_ says "Full text database manager".
      - comment at the top of text_database_manager.h:
        // Manages a set of text databases representing different time
        // periods. This will page them in and out as necessary, and will
        // manage queries for times spanning multiple databases.
        //
        // It will also keep a list of partial changes, such as page adds and
        // title and body sets, all of which come in at different times for a
        // given page. When all data is received or enough time has elapsed
        // since adding, the indexed data will be comitted.
        //
        // This allows us to minimize inserts and modifications, which are
        // slow for the full text database, since each page's information is
        // added exactly once.
        //
        // Note: be careful to delete the relevant entries from this
        // uncommitted list when clearing history or this information may get
        // added to the database soon after the clear.
      - TextDatabaseManager::GetTextMatches:
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database_manager.cc&l=401
      - so they partition these text dbs by date.  comment from
        TextDatabaseManager::GetTextMatches:
        // Iterate over the databases from the most recent backwards.
      - TextDatabaseManager::GetTextMatches calls each db's
        TextDatabase::GetTextMatches, defined here:
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=293
      - aha, finally we get to sqlite:
        "SELECT url, title, time, offsets(pages), body "
            "FROM pages LEFT OUTER JOIN info ON pages.rowid = info.rowid "
            "WHERE pages MATCH ? AND time >= ? AND time < ? "
            "ORDER BY time DESC "
            "LIMIT ?"));
      - TextDatabase::AddPageData i'm guessing does the insertions:
        // Add to the pages table.
        "INSERT INTO pages (url, title, body) VALUES (?,?,?)"
        // Add to the info table with the same rowid.
        "INSERT INTO info (rowid, time) VALUES (?,?)"
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&d=3&l=196
      - question is, how are they handling non-ascii data?
      - aha!  TextDatabase::CreateTables
        // FTS table of page contents.
        if (!db_.DoesTableExist("pages")) {
          if (!db_.Execute("CREATE VIRTUAL TABLE pages USING fts2("
                           "TOKENIZE icu,"
                           "url LONGVARCHAR,"
                           "title LONGVARCHAR,"
                           "body LONGVARCHAR)"))
            return false;
        }
        // Non-FTS table containing URLs and times so we can efficiently find
        // them using a regular index (all FTS columns are special and are
        // treated as full-text-search, which is not what we want when
        // retrieving this data).
        if (!db_.DoesTableExist("info")) {
          // Note that there is no point in creating an index over time. Since
          // we must always query the entire FTS table (it can not efficiently do
          // subsets), we will always end up doing that first, and joining the
          // info table off of that.
          if (!db_.Execute("CREATE TABLE info(time INTEGER NOT NULL)"))
            return false;
        }
    - "icu" stands for "International Components for Unicode"
    - icu is an extension to sqlite:
      http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
    - important: icu is a library, but it's not a library provided by sqlite.
      sqlite's icu facilities are merely glue to integrate the icu lib with
      sqlite.  see comment here:
      http://www.hwaci.com/cgi-bin/sqlite/artifact/850e9a3656
      shawn's comment in the thread linked above suggests that we aren't
      likely to include icu in firefox...  huh, chromium must include it then.
      looks like it:
      http://src.chromium.org/svn/trunk/src/base/third_party/icu, although the
      readme says it's just a subset.  oh, here?
      http://src.chromium.org/svn/trunk/deps/third_party/icu42
      http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
    - sqlite's icu-based tokenizer is here:
      http://www.hwaci.com/cgi-bin/sqlite/artifact/ac494aed69
      could port it to gecko, ugh.  does icu provide things that gecko doesn't?
      these questions might be answered in bug 472764.  (a quick search for
      "port" showed that they talked about porting other impls at least, but
      nothing concrete came of it afaict.)
    - looks like icu has facilities for getting and iterating over code
      points, which i think means characters?  word iterator also, which can
      detect word boundaries:
      http://icu-project.org/apiref/icu4c/ubrk_8h.html
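      rough sketch of what using it looks like (icu4c's c api; i haven't
      compiled this against a real build):
        #include <stdio.h>
        #include <unicode/ubrk.h>
        #include <unicode/ustring.h>

        int main(void) {
          UErrorCode status = U_ZERO_ERROR;
          UChar text[64];
          int32_t len;
          /* icu works on utf-16, so convert from utf-8 first */
          u_strFromUTF8(text, 64, &len, "full text search", -1, &status);
          UBreakIterator *bi =
              ubrk_open(UBRK_WORD, "en_US", text, len, &status);
          if (U_FAILURE(status)) return 1;
          /* walk the word boundaries; text[start..end) is one unit */
          int32_t start = ubrk_first(bi);
          for (int32_t end = ubrk_next(bi); end != UBRK_DONE;
               start = end, end = ubrk_next(bi))
            printf("token at %d..%d\n", (int)start, (int)end);
          ubrk_close(bi);
          return 0;
        }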
    - i don't think anything like that exists in gecko.  surely the smart
      people in bug 472764 would have known.
      - nsIEntityConverter.idl?
      - what is this?
        http://mxr.mozilla.org/mozilla-central/source/intl/build/nsI18nModule.cpp#64
        64   { "Word Breaker", NS_WBRK_CID,
        65     NS_WBRK_CONTRACTID, nsSampleWordBreakerConstructor},
      - nsISemanticUnitScanner
      - this defines a class called nsIWordBreaker, but there's no idl:
        http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/public/nsIWordBreaker.h
        an instance of it is exposed as a static method on nsContentUtils:
        http://mxr.mozilla.org/mozilla-central/source/content/base/public/nsContentUtils.h#587
        i wonder if there's no idl because it's not implemented.  there's
        intl/lwbrk/src/nsSampleWordBreaker.cpp which looks half-assed.
    - i wonder if the icu tokenizer simply tokenizes on code points (assuming
      that translates into characters), because when i type the "go" part of
      "nihongo" into chrome's address bar in kanji, my visit to wikipedia's
      nihongo page shows up.  (same for "ji" in ja.wikipedia's shokuji page.)
      but then that would mean that english words get tokenized into chars
      also, and they don't... chrome also indexes the text of pages, so maybe
      it's pulling out those chars from there.
    - oh:
      class nsSemanticUnitScanner : public nsISemanticUnitScanner
                                  , public nsSampleWordBreaker
    - testing chromium's tokenization:
      http://doraku.asahi.com/lifestyle/pet/gallery_091201.html?ref=kanren
      page title has this run of katakana:
      スーパードッグカーニバル (super dog carnival)
      スーパー is matched.  ドッグ is matched, but the カ is also highlighted.
      カーニバル is not matched.

- mar 9 2010
  - fts
    - conclusion: gecko includes a less-than-half-assed i18n tokenizer,
      nsISemanticUnitScanner.  it supports ascii and half-assedly supports
      cjk.  (e.g., it doesn't differentiate between kanji in chinese langs and
      kanji in japanese langs; there's even a note in the code
      (nsSampleWordBreaker.cpp) about breaking words per kanji in chinese but
      using bigger units in japanese.  the japanese case is not implemented,
      though, so all individual kanji are treated as words.)  no other
      langs.  icu is a "big" library because it's competent, unlike gecko's
      i18n support, which is shoddy and scattershot.  either someone has to do
      the immense work of re-implementing icu's facilities in gecko, or we use
      icu.  both tasks are too big for me to lift.  the first is stupid, since
      the open source community has already done that work.  the second is
      politics, but it's the better option.  we will never have fts unless
      someone who knows how to work the levers better than i do drives the
      effort.
    - should look at:
      - porting or including icu's tokenization bits, but i suspect they have
        deep dependencies on large parts of the rest of the lib.
        - marco looked at their web site, points out that icu is meant to be
          configurable, modularizable.  maybe we can include the word break
          code after all.
        - http://source.icu-project.org/repos/icu/icu/trunk/readme.html
        - http://source.icu-project.org/repos/icu/icu/trunk/readme.html#HowToPackage
        - http://userguide.icu-project.org/packaging
        - http://userguide.icu-project.org/design#TOC-API-Dependencies
        - http://icu-project.org/repos/icu/icuhtml/trunk/design/modularization/icu_modularization_overview.html
        - http://userguide.icu-project.org/boundaryanalysis
        - note: icu is sometimes called icu4c to distinguish the c/c++ version
          of the lib.  (there's also icu4j, a java version.)
        - chromium's changes to icu, including word-breaking changes:
          http://src.chromium.org/svn/trunk/deps/third_party/icu42/README.google
        - chromium's dir structure re: icu:  the icu tarball has only 3
          top-level dirs, as_is, packaging, and source.  they removed the first
          two since they don't need them.  every other top-level dir in their
          copy was added by them.  public contains some dirs they moved out of
          source.  see readme.google.  so they have:
            http://src.chromium.org/svn/trunk/deps/third_party/icu42
          which is the root of the tarball that icu provides.
      - including other tokenizers.  marco mentioned boost has some.
      - is thunderbird's fts search really ascii and cjk only?  what happens
        for users of other langs, like arabic?

- mar 10 2010
  - continuing fts
    - looking at other non-icu tokenizers like marco suggested yesterday
      - boost has tokenizers but they're simple and i18n-unaware.
        http://www.boost.org/doc/libs?view=categorized
    - looking at state-of-the-art fts in thunderbird
      - apparently still using bi-grams.  i don't know how well it works in
        practice for non-cjk text.  should ask andrew sutherland.
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/fts3_porter.c
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src/README.mozilla
      - http://mxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/public/nsIFts3Tokenizer.idl
      - chat with andrew sutherland (asuth) in #maildev, see chatlog today.
        upshot: they're not planning more work on their tokenizer.  my thoughts:
        all the progress so far has been due to makoto.  as it exists now, it's
        not good enough for firefox.
  - trying to hack together icu, sqlite, firefox just to get it running:
    - downloaded, untarred the icu source to intl/icu.
    - copied the source to a directory on my desktop and built there.  followed
      instructions in the tarball's readme but did not do the install step.
    - copied source/lib/libicudata.42.1.dylib and libicuuc.42.1.dylib from
      desktop build into obj-debug/dist/MinefieldDebug.app/Contents/MacOS/
      components/.  i have no idea what i'm doing.
    - copied source/common/unicode/platform.h from the desktop build into the
      source that i dropped into the mozilla tree.  that file is generated
      from platform.h.in when building.
    - modified db/sqlite3/src/Makefile.in by adding this at the very end of the
      file (because i think the include of $(topsrcdir)/config/rules.mk is what
      sets whatever cflags the build system uses):
        CFLAGS +=-I$(topsrcdir)/intl/icu/source/common -I$(topsrcdir)/intl/icu/source/i18n/
      probably could/should (for the time being) have just pointed these at my
      desktop build.
    - also modified that makefile to include -DSQLITE_ENABLE_ICU=1.
    - that made the compile work, but now i get linker errors, undefined
      symbols in sqlite3.o.  hmm.  i have icu dylibs from building it.  how do
      i get sqlite3.o to link against those instead of needing a static icu
      lib?  the other option is to do just that: make an icu static lib and
      link sqlite3.o against it.
    - sqlite's icu extension readme says easiest way is to build the extension
      as a dynamically loadable sqlite extension.  i guess that is exactly what
      i am not doing by using the SQLITE_ENABLE_ICU amalgamation switch.
      sdwilsh says moz storage doesn't expose the sqlite dynamic extension
      loading api though, but there's a bug for it.
    - perhaps just to get started i can make an icu static lib, link against
      sqlite3.o.
    - libmozsqlite3.dylib did not change in size after my changes.
    - sqlite is borked on start up, get this error:
        nsNativeModuleLoader::LoadModule("/Users/adw/trees/mozilla-central/obj-debug/dist/MinefieldDebug.app/Contents/MacOS/components/libstoragecomps.dylib") - load FAILED, rv: 80004005, error:
        Unknown error: -2804
    - noticed that the last step in make -C db/sqlite3/src is:
        /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/lib
        /Users/adw/trees/mozilla-central/obj-debug/config/nsinstall -L /Users/adw/trees/mozilla-central/obj-debug/db/sqlite3/src -m 755 libmozsqlite3.dylib ../../../dist/bin
      i.e., installing dylibs into obj-debug/dist/lib and dist/bin, so i copied
      the three dylibs i linked sqlite3.o against to the same dirs:
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/lib/
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/lib/
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/lib/
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicudata.dylib dist/bin/
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicui18n.dylib dist/bin/
        adw@host-3-121:~/trees/mozilla-central/obj-debug$ cp ~/Desktop/icu4c-4_2_1/source/lib/libicuuc.dylib dist/bin/
    - didn't fix error.  how is libstoragecomps.dylib involved?
    - noticed dist/MinefieldDebug.app/Contents/MacOS also contains the dylibs,
      so copied them there, but same error.
    - noticed dist/MinefieldDebug.app/Contents/MacOS/components also contains
      the dylibs, so copied them there, but that didn't work either.
    - i wonder if it's some symbol location issue, like libstoragecomps.dylib
      was built against libsqlite3.dylib, and now i've changed libsqlite3.dylib.
      but i thought these dylibs weren't linked against each other?
    - make -C storage, storage/build, etc. didn't fix.
    - ah, i did make -C browser/app, removed compatibility.ini et al. in my
      profdir, and ran the build with
      DYLD_LIBRARY_PATH=/Users/adw/Desktop/icu4c-4_2_1/source/lib/.
      i'm not sure if the first two steps were necessary, but it worked after
      setting DYLD_LIBRARY_PATH.  so the problem was libstoragecomps.dylib
      (i think?  or firefox?) couldn't find those three dylibs.
    - note that libicui18n.dylib was necessary in the sqlite3.o linking stage,
      because some collation functions were being called.  that's even though
      the chart on the design web page said that the common lib, the one
      containing the word breaking code, depends only on the data lib.  maybe
      one of the many config/compile options will make the collation stuff
      unnecessary.
    - followup (mar 11): the error console was broken in my build, so i pulled
      the source, blew away obj-debug, and rebuilt.  icu still works.  no
      copying of dylibs necessary when you set DYLD_LIBRARY_PATH to point to
      dylibs, wherever on disk they may be.  also did not copy icu source into
      tree; i just point -I at my icu build on the desktop.

- mar 11 2010
  - continuing fts
    - mfinkle says fennec uses our autocomplete search provider "in a wrapped
      way".  hopefully fts will help them a lot, and i can claim to have
      improved fennec performance.
    - toolkit/components/places/src/nsPlacesAutoComplete.js constructs the
      queries (both text of the queries and the stmts themselves).
    - sqlite custom function autocomplete_match is where the text matching
      actually occurs.
      toolkit/components/places/src/SQLFunctions.h
      toolkit/components/places/src/SQLFunctions.cpp#52
    - that function is registered with the db connection here:
      http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp#1038
    - MatchAutoCompleteFunction::OnFunctionCall is the entry point:
      http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/SQLFunctions.cpp#207
    - so we end up calling CaseInsensitiveFindInReadable (or a related platform
      string-searching function) for every row in the results against title,
      tags, etc.  => autocomplete_match needs to be replaced?
    - sqlite fts3 readme:
      http://www.sqlite.org/cvstrac/rlog?f=sqlite/ext/fts3/README.syntax
      wiki page:
      http://www.sqlite.org/cvstrac/wiki?p=FtsUsage
    - what would google do?
      - it looks like chromium does the fts tables lookup first, gets all the
        results in memory, and then does a second lookup into the history db
        (HistoryBackend::QueryHistoryFTS).
      - so they aren't worried about fitting all fts results in mem?  they
        restrict the search to "present_databases_" and inside that by date
        (TextDatabaseManager::GetTextMatches).  does "present" mean present in
        mem?
      - TextDatabaseManager::GetTextMatches yields a vector of
        TextDatabase::Match'es, which is just a simple struct containing title,
        url, time, "match positions", and snippet
      - then HistoryBackend::QueryHistoryFTS for-loops over those matches,
        looking up each in a db (i assume history db) by the url in the match.
      - this gives me more confidence that i don't need to bend over backwards
        to try and do everything in sql.  if it's good enough for them.  but,
        brett wilson's name is in the chromium comments, so.
    - hmm, does our current impl (autocomplete_match?) also compute the
      positions of the match within the string?  that's how the right
      substrings in the awesomebar dropdown are highlighted.  i'd need to
      replicate that... can fts do it for me?
    - is every column in an fts table fts indexed?  the FtsUsage wiki page isn't
      clear, so i assume so.  chromium doesn't have to worry about it because
      they (i'm guessing) key off of urls; their key is also text to be
      searched.  if i'm wrong, then they're looking up records by non-indexed
      column, which is very un-google-like.  [looking at
      HistoryBackend::URLQuerier::GetRowForURL, they have a "urls" table, and
      they simply SELECT some columns WHERE url=:url.  using that table like
      an index i guess.]
    - if every col is indexed... probably not ok to stuff foreign keys into fts
      table.  chromium's text databases keep an auxiliary "info" table, and
      they're using rowid to tie that table to the "pages" fts table.  yeah,
      comment about the design here:
      http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=19
      quoting:
        // We do joins across these two tables by using their internal rowids,
        // which we keep in sync between the two tables. The internal rowid is
        // the only part of an FTS table that is indexed like a normal table,
        // and the index over it is free since sqlite always indexes the
        // internal rowid.
      (i bet brett wilson wrote that.)
    - so i could do something similar, except instead of times the auxiliary
      table would keep foreign keys into moz_places and moz_bookmarks.  kind of
      a pain in the ass though, and it seems like you'd need to do this
      whenever you use fts tables.
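      the paired insert would look something like this (hypothetical names,
      sqlite c api, untested):
        #include <sqlite3.h>

        static void index_page(sqlite3 *db, sqlite3_int64 place_id,
                               const char *title, const char *body) {
          sqlite3_stmt *stmt = NULL;
          sqlite3_prepare_v2(db,
              "INSERT INTO moz_places_fts(title, body) VALUES(?, ?)",
              -1, &stmt, NULL);
          sqlite3_bind_text(stmt, 1, title, -1, SQLITE_STATIC);
          sqlite3_bind_text(stmt, 2, body, -1, SQLITE_STATIC);
          sqlite3_step(stmt);
          sqlite3_finalize(stmt);
          /* reuse the fts row's rowid for the aux row, chromium-style */
          sqlite3_int64 rowid = sqlite3_last_insert_rowid(db);
          sqlite3_prepare_v2(db,
              "INSERT INTO moz_places_fts_aux(rowid, place_id) VALUES(?, ?)",
              -1, &stmt, NULL);
          sqlite3_bind_int64(stmt, 1, rowid);
          sqlite3_bind_int64(stmt, 2, place_id);
          sqlite3_step(stmt);
          sqlite3_finalize(stmt);
        }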
  - looks like chromium has a "main" history db and an "archived" history db,
    judging by HistoryBackend::URLQuerier::GetRowForURL.
  - can use fts to improve find-in-page?  probably not, since you'd have to
    crawl the page to index it first.  but then subsequent finds could be
    faster... something to think about.

- mar 12 2010
  - from the icu faq (http://userguide.icu-project.org/icufaq):
    What is the performance difference between UTF-8 and UTF-16?
    Most of the time, the memory throughput of the hard drive and RAM is the
    main performance constraint. UTF-8 is 50% smaller than UTF-16 for US-ASCII,
    but UTF-8 is 50% larger than UTF-16 for East and South Asian scripts. There
    is no memory difference for Latin extensions, Greek, Cyrillic, Hebrew, and
    Arabic.
  - shrinking icu
    - good page on the data lib: http://userguide.icu-project.org/icudata
      - section Customizing ICU's Data Library
        - quoting:
          Reducing the size of ICU's data by eliminating unneeded resources can
          make sense on small systems with limited or no disk, but for desktop
          or server systems there is no real advantage to trimming. ICU's data
          is memory mapped into an application's address space, and only those
          portions of the data actually being used are ever paged in, so there
          are no significant RAM savings. As for disk space, with the large size
          of today's hard drives, saving a few MB is not worth the bother.
  - fts
    - rowid info
      - general: http://www.sqlite.org/lang_createtable.html#rowid
      - re: inserts: http://www.sqlite.org/autoinc.html
      - don't appear to be any weirdness about supplying rowids while inserting.
      - quoting http://www.sqlite.org/fts3.html
        The rowid of an FTS3 table behaves in the same way as the rowid column
        of an ordinary SQLite table, except that the values stored in the rowid
        column of an FTS3 table remain unchanged if the database is rebuilt
        using the VACUUM command. For FTS3 tables, "docid" is allowed as an
        alias along with the usual "rowid", "oid" and "_oid_"
        identifiers. Attempting to insert or update a row with a docid value
        that already exists in the table is an error, just as it would be with
        an ordinary SQLite table.
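      - so, per that, instead of an aux table i could maybe set the fts
        table's docid to the moz_places id directly at insert time and skip
        the mapping entirely.  sketch (hypothetical names, untested):
          #include <sqlite3.h>

          static void index_place(sqlite3 *db, sqlite3_int64 place_id,
                                  const char *title) {
            sqlite3_stmt *stmt = NULL;
            sqlite3_prepare_v2(db,
                "INSERT INTO moz_places_fts(docid, title) VALUES(?, ?)",
                -1, &stmt, NULL);
            sqlite3_bind_int64(stmt, 1, place_id);  /* moz_places.id */
            sqlite3_bind_text(stmt, 2, title, -1, SQLITE_STATIC);
            sqlite3_step(stmt);  /* errors if that docid already exists */
            sqlite3_finalize(stmt);
          }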
    - wait, what the fuck is this?
      http://www.sqlite.org/fts3.html
      a giant treatise on fts3.  why hasn't this come up before?  hmm, when was
      it written?  it uses a strange syntax for specifying tokenizers.
      - fts index optimization:
        Executing an SQL statement of the form "INSERT INTO
        <fts3-table>(<fts3-table>) VALUES('optimize')" causes FTS3 to merge all
        existing index b-trees into a single large b-tree containing the entire
        index. This can be an expensive operation, but may speed up future
        queries.
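        concretely, for a (hypothetical) table named moz_places_fts:
          sqlite3_exec(db,
              "INSERT INTO moz_places_fts(moz_places_fts) "
              "VALUES('optimize')", NULL, NULL, NULL);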
    - re: perf of fts, quote from http://www.sqlite.org/fts3.html:
      Then either of the two queries below may be executed to find the number of
      documents in the database that contain the word "linux" (351). Using one
      desktop PC hardware configuration, the query on the FTS3 table returns in
      approximately 0.03 seconds, versus 22.5 for querying the ordinary table.
    - hmm, you can match against cols, so could i stuff non-fts'ed stuff in an
      fts table (like foreign keys) and just not use that col in queries?
      probably it'll be indexed though.
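      fts3 does have a col-restricted query syntax, "col:term", which would at
      least keep a foreign-key col out of queries even if it's still indexed.
      sketch (hypothetical names, untested):
        sqlite3_stmt *stmt = NULL;
        /* match only against title; the place_id col never enters into it */
        sqlite3_prepare_v2(db,
            "SELECT docid FROM moz_places_fts "
            "WHERE moz_places_fts MATCH 'title:mozilla'",
            -1, &stmt, NULL);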

- mar 16 2010
  - fts
    - autocomplete_match
      - aVisitCount only used to short-circuit and negate match if it's 0 and
        aSearchBehavior & HISTORY.
      - similar with aTyped, if aSearchBehavior & TYPED.
      - similar with aBookmark, if aSearchBehavior & BOOKMARK.
      - similar with aTags, if aSearchBehavior & TAG.
    - IMPORTANT!!  read Appendix A of http://www.sqlite.org/fts3.html before
      committing to any impl!  quoting:
      CREATE VIRTUAL TABLE documents USING fts3(title, content);
      SELECT title FROM documents
        WHERE documents MATCH <query>
        ORDER BY rank(matchinfo(documents)) DESC
        LIMIT 10 OFFSET 0
      The SQL query in the example above uses less CPU than the first example in
      this section, but still has a non-obvious performance problem. SQLite
      satisfies this query by retrieving the value of the "title" column and
      matchinfo data from the FTS3 module for every row matched by the user's
      query before it sorts and limits the results. Because of the way SQLite's
      virtual table interface works, retrieving the value of the "title" column
      requires loading the entire row from disk (including the "content" field,
      which may be quite large). This means that if the user's query matches
      several thousand documents, many megabytes of "title" and "content" data
      may be loaded from disk into memory even though they will never be used
      for any purpose.
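      my reading: select only docid in the MATCH query, limit that, then fetch
      the big columns for just the surviving rows.  sketch, leaving out the
      rank() sorting from the example (hypothetical names, untested):
        #include <stdio.h>
        #include <sqlite3.h>

        static void search(sqlite3 *db, const char *query) {
          sqlite3_stmt *match = NULL, *fetch = NULL;
          /* cheap pass: docids only, so the big columns are never loaded */
          sqlite3_prepare_v2(db,
              "SELECT docid FROM documents WHERE documents MATCH ? LIMIT 10",
              -1, &match, NULL);
          /* expensive pass runs for at most 10 rows */
          sqlite3_prepare_v2(db,
              "SELECT title FROM documents WHERE docid = ?",
              -1, &fetch, NULL);
          sqlite3_bind_text(match, 1, query, -1, SQLITE_STATIC);
          while (sqlite3_step(match) == SQLITE_ROW) {
            sqlite3_bind_int64(fetch, 1, sqlite3_column_int64(match, 0));
            if (sqlite3_step(fetch) == SQLITE_ROW)
              printf("%s\n", sqlite3_column_text(fetch, 0));
            sqlite3_reset(fetch);
          }
          sqlite3_finalize(match);
          sqlite3_finalize(fetch);
        }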

- mar 17 2010
  - fts
    - modifying places autocomplete code
      - nsPlacesAutoComplete.prototype.startSearch is entry point.
      - nsPlacesAutoComplete.prototype._getSearch just discards the search terms
        that are equal to those dumb "*", "@", "#", etc. tokens the user can
        type to restrict the search to tags, title, etc.  returns
        { tokens, query }, tokens = search terms that weren't discarded,
        query = this._getBoundSearchQuery(this._matchBehavior, aTokens).
      - _getBoundSearchQuery's query is:
          // We use more optimized queries for restricted searches, so we will
          // always return the most restrictive one to the least restrictive one
          // if more than one token is found.
          let query = this._hasBehavior("tag") ? this._tagsQuery :
                      this._hasBehavior("bookmark") ? this._bookmarkQuery :
                      this._hasBehavior("typed") ? this._typedQuery :
                      this._hasBehavior("history") ? this._historyQuery :
                      this._defaultQuery;
        so in the common case -- no dumb restricting tokens -- the query
        returned from _getBoundSearchQuery and _getSearch is _defaultQuery.
      - _defaultQuery is simply SQL_BASE with the additional conditions clause
        replaced with the empty string.
      - so i should add the fts lookup step in startSearch after _getSearch
        returns the tokens...?
      - AUTOCOMPLETE_MATCH is also used by _adaptiveQuery
      - need to change/override _getBoundAdaptiveQuery(), _defaultQuery.
        _getBoundKeywordQuery(tokens) doesn't use autocomplete_match.
      - nsPlacesAutoComplete.prototype.handleCompletion searches again if there
        are no results, so it's not just startSearch that starts!
    - what would google do?  they synchronously execute the fts MATCH stmt:
      http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/text_database.cc&q=package:%22src.chromium.org/svn/trunk%22%20URLDatabase&d=3&l=323
      this isn't run off the main thread, is it?  hmm, if they take pains to
      keep the text index in mem, and it looks like they do, that would work
      ok.
    - should the fts tables live in the same db as the tables they index?  or
      separate db?  what are the pros and cons?  try it both ways, measure?
      - if we ever shard the places db, should use separate db.
      - text db should be in mem at all times.  how would keeping all tables
        in same db affect that?
    - i'm getting a crash when i copied my own places.sqlite to my test
      profile.  crash comes when either the fts tables are built or when the
      match is done, not sure which, since the latter happens right after the
      former the first time the awesomebar is typed in.  text console output:
        Assertion failed: (iStart<=iEnd), function icuNext, file /Users/adw/trees/mozilla-central/db/sqlite3/src/sqlite3.c, line 110806.
        Abort trap
      googled, found this http://www.sqlite.org/cvstrac/tktview?tn=3963,39 ,
      a terse sqlite bug ticket that mentions a problem with icuNext and
      chinese text but doesn't specifically mention a crash.  says icuNext
      should use U16_NEXT, not U8_NEXT.  and indeed, chromium's fts2 uses
      U16_NEXT, not U8_NEXT:
      http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/third_party/sqlite/ext/fts2/fts2_icu.c&q=file:fts2_icu.c%20package:src.chromium.org/svn/trunk&sa=N&cd=1&ct=rc&l=201
      so i modified line 110799 in our sqlite3.c source and it worked.  took
      about 1.5 or 2 minutes to build the indexes though.
  - what would google do?  look at how chromium caches sql stmts: they have a
    GetCachedStatement() that takes an id and string of sql.  i presume if the
    stmt is not yet cached it becomes cached with the call, which is why the
    sql is given?  they use a macro that generates an id based on file and line
    that the stmt appears on.  nice thing is they don't have this giant list of
    stmts at the tops of their files with long identifiers, and sql is next to
    the code that uses it.
    http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/app/sql/connection.h&q=package:%22src.chromium.org/svn/trunk%22%20GetCachedStatement&sa=N&cd=2&ct=rc&l=210
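    my guess at the shape of the trick (not their code; fixed-size cache just
    to sketch it):
      #include <string.h>
      #include <sqlite3.h>

      static struct { const char *file; int line; sqlite3_stmt *stmt; }
          cache[64];
      static int cache_len = 0;

      static sqlite3_stmt *get_cached_stmt(sqlite3 *db, const char *file,
                                           int line, const char *sql) {
        for (int i = 0; i < cache_len; i++)
          if (cache[i].line == line && strcmp(cache[i].file, file) == 0)
            return cache[i].stmt;          /* prepared on an earlier call */
        sqlite3_stmt *stmt = NULL;
        sqlite3_prepare_v2(db, sql, -1, &stmt, NULL);
        if (stmt && cache_len < 64) {
          cache[cache_len].file = file;
          cache[cache_len].line = line;
          cache[cache_len].stmt = stmt;
          cache_len++;
        }
        return stmt;
      }

      /* the macro keys the cache off the call site, so the sql string can
         sit right where it's used */
      #define GET_CACHED_STMT(db, sql) \
        get_cached_stmt(db, __FILE__, __LINE__, sql)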
  - idea: since our metric is frecency, it would behoove us to keep in mem and
    in a small db the most frecent stuff.  sharding?

- mar 18 2010
  - fts
    - important: creating statements for each invocation of startSearch is
      really a perf hit -- as opposed to making a stmt beforehand and reusing it
      each time.  maybe it depends on the query, but i noticed that when i
      wasn't reusing a stmt i could have been reusing, the ui was actually
      jerky, and i wondered why, since the exec is done async.  reusing the
      stmt seems to have cleared it up.
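      for reference, the reuse pattern at the raw sqlite level (mozstorage
      wraps this, but it's the same idea; hypothetical names):
        #include <sqlite3.h>

        static sqlite3_stmt *g_search = NULL;  /* prepared once, kept */

        static void on_search(sqlite3 *db, const char *term) {
          if (g_search == NULL)
            sqlite3_prepare_v2(db,
                "SELECT title FROM moz_places_fts "
                "WHERE moz_places_fts MATCH ?",
                -1, &g_search, NULL);
          sqlite3_bind_text(g_search, 1, term, -1, SQLITE_TRANSIENT);
          while (sqlite3_step(g_search) == SQLITE_ROW)
            ;  /* hand the row to the dropdown */
          sqlite3_reset(g_search);  /* ready for the next keystroke */
          sqlite3_clear_bindings(g_search);
        }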
    - prefix search terms are incredibly slow... converting the search string
      in the awesomebar to be prefix terms makes it even slower than linear
      scan -- even with the ORDER BY term removed!  i'd better be doing
      something wrong.
    - ideas for doing postfix searches:
      http://osdir.com/ml/sqlite-users/2009-08/msg00150.html
      e.g., store reversed words also in index.
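      rough sketch of the reversed-words idea (naive, ascii-only, hypothetical
      schema):
        #include <string.h>

        /* index each term twice -- once reversed into a second col -- so a
           suffix search "*fix" becomes a prefix search 'rterm:xif*' */
        static void reverse(char *s) {
          for (size_t i = 0, j = strlen(s); j-- > i; i++) {
            char t = s[i]; s[i] = s[j]; s[j] = t;
          }
        }
        /* index time: INSERT INTO terms(term, rterm) VALUES('prefix','xiferp')
           query time: ... WHERE terms MATCH 'rterm:xif*' */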
    - doing some manual time testing in javascript console.  this query takes
      ~930ms:
               "SELECT title " +
               "FROM moz_places_fts " +
               "WHERE moz_places_fts MATCH 'm*' " +
               "LIMIT 10 "
      this takes 2ms:
               "SELECT title " +
               "FROM moz_places_fts " +
               "WHERE moz_places_fts MATCH 'm' " +
               "LIMIT 10 "
      is prefix searching really this slow?  with these LIMITs, these times:
        limit: 1 | ms: 927
              10 |     930
             100 |     952
            1000 |     999
           10000 |    1627
      same table but for no-prefix, i.e., "MATCH 'm'":
        limit: 1 | ms: 1
              10 |     2
             100 |    10
            1000 |    17
           10000 |    18
      for comparison, this query:
               "SELECT title " +
               "FROM moz_places " +
               "WHERE title LIKE 'm%' " +
               "LIMIT 100 "
      has these numbers:
        limit: 1 | ms: 0
              10 |     1
             100 |     6
            1000 |    78
           10000 |   320
      how in the hell is this slower than a linear scan?
    - what would google do?
      - they only do prefix searches when strlen >= 3.
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=246
      - '*' suffix appended here:
        http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/history/query_parser.cc&q=package:%22src.chromium.org/svn/trunk%22%20QueryNodeList&sa=N&cd=1&ct=rc&l=114
      - indeed, when i search for "w" and "wi" in chrome, wikipedia entries
        don't show up.  (actually one does on "w", but i wonder if it's
        unrelated, since it doesn't come up for "wi", and the word "wikipedia"
        is not highlighted in the dropdown.)  several come up when i enter
        "wik".
      - so i tried testing 'moz*' instead of 'm*', and it's still slower than
        the "LIKE 'm%'" query.
    - huh?  chromium uses fts only for history, not bookmarks?  what is
      QueryParser::DoesQueryMatch?  called by BookmarkIndex::AddMatchToResults.
      comment in BookmarkIndex::AddMatchToResults:
        // Check that the result matches the query.  The previous search
        // was a simple per-word search, while the more complex matching
        // of QueryParser may filter it out.  For example, the query
        // ["thi"] will match the bookmark titled [Thinking], but since
        // ["thi"] is quoted we don't want to do a prefix match.
      so they do two passes?
      http://www.google.com/codesearch/p?hl=en#h0RrPvyPu-c/chrome/browser/bookmarks/bookmark_index.cc&q=package:%22src.chromium.org/svn/trunk%22%20DoesQueryMatch&sa=N&cd=3&ct=rc&l=107
    - re: slow prefix searches:
      http://www.mail-archive.com/sqlite-users@sqlite.org/msg29729.html
      - quote:
        One reason you might find full-table-scans to sometimes be faster than
        fts is if fts ends up essentially looking at most documents.  In that
        case, the fts stuff isn't going to be particularily efficient.  There
        may be ways to take advantage of this entirely within fts, so that you
        can use a single syntax and have it automatically figure out how to
        optimize the query.
    - perf depends on query and documents.  i did some of the tests above with
      'foo*' and 'LIKE foo%' instead of moz, and fts was faster this time.
      only 16 rows match 'foo%', but 2617 match 'moz%'.  great, common terms
      are slower.  (but when i think about my own use of the awesomebar, the
      terms i use are not common i think -- using common terms isn't helpful,
      since they give lots of results.)

- mar 19 2010
  - fts
    - idea: do prefix searches only for the final term in the string.  e.g., if
      the user types in "mozilla fire", the string given to sqlite is
      "mozilla fire*", not "mozilla* fire*".  i18n issues?

- mar 23 2010
  - fts
    - _executeQueries just hands the stmts to executeAsync.  _processRow assumes
      that it's called with each result row in order, but that doesn't appear to
      be guaranteed.  in fact when i search for "food" with my fts changes, one
      row is out of order.  is this a problem without my changes?  _processRow
      should keep an ordered list, which means it has to wait for all results
      before it can show any... => the order of queries passed to
      _executeQueries is important?  is shawn passing in stmt A before B because
      he's exploiting the fact that all results of A are guaranteed to have
      higher frecencies than B?  if so, i'll have to partition fts tables by
      frecency, which means updates will really suck...
    - when timing a stock build, beware that it does a second-pass search!
      see handleCompletion.
    - maybe i should back up and time raw storage queries, autocomplete_match
      vs. fts.  no awesomebar, no nsPlacesAutocomplete, no joining with other
      tables.  and try with a very large db, maybe mine's too small to see
      benefit.

- mar 25 2010
  - fts
    - timing autocomplete_match vs. fts in raw sql stmts
      - methodology
        - places.sqlite of my "tmp" profile, which was copied over from my
          default profile a few weeks ago
        - count(moz_places) = 33831
        - 34 fts tables of size 1000 rows (except for 34th table)
        - fts tables are in places.sqlite
        - all search strings are suffixed with "*" in the fts cases to perform
          prefix searches
        - obj-opt-shared
      - search string: "f" ("f*" for fts)
        - fts
          - num results: 16064
          - sync
            - first result: 1-5ms after start
            - last result: 130ms after start (first runs after app start or
              other query are consistently longer: 500ms, 244, 249)
          - async
            - first: 3-4
            - last: 170 (first runs: 241)
        - autocomplete_match
          - num results: 17974
          - sync
            - first: 0-1
            - last: 310-325 (first runs: 341)
          - async
            - first: 1
            - last: 350 (first runs: no diff)
      - search string: "foo" ("foo*" for fts)
        - fts
          - num results: 64
          - sync
            - first: 0-2
            - last: 15 (first runs: 44)
          - async
            - first: 4-5
            - last: 16-17 (first runs: 46)
        - autocomplete_match
          - num results: 81
          - sync
            - first: 5
            - last: 245 (first runs: 281)
          - async
            - first: 70
            - last: 250 (first runs: 279)
      - search string: "m" ("m*" for fts)
        - fts
          - num results: 13326
          - sync
            - first: 2
            - last: 113 (first runs: 122, 171)
          - async
            - first: 3-4
            - last: 130 (first runs: 628)
        - autocomplete_match
          - num results: 20304
          - sync
            - first: 0
            - last: 310 (first runs: 343)
          - async
            - first: 1
            - last: 350 (first runs: 389)
      - search string: "moz" ("moz*" for fts)
        - fts
          - num results: 5108
          - sync
            - first: 0-1
            - last: 41 (first runs: 93)
          - async
            - first: 3-4
            - last: 50-60 (first runs: 120, 109)
        - autocomplete_match
          - num results: 12683
          - sync
            - first: 0
            - last: 295-300 (first runs: 331)
          - async
            - first: 0-1
            - last: 301-309 (first runs: 358)
      - search string: "mozilla" ("mozilla*" for fts)
        - fts
          - num results: 5031
          - sync
            - first: 1
            - last: 41 (first runs: 100)
          - async
            - first: 2-4
            - last: 50-60 (first runs: 107)
        - autocomplete_match
          - num results: 12598
          - sync
            - first: 0
            - last: 295-300 (first runs: 335)
          - async
            - first: 0-1
            - last: 300-328 (first runs: 353)
      - search string: "google" ("google*" for fts)
        - fts
          - num results: 4148
          - sync
            - first: 0-1
            - last: 35 (first runs: 87)
          - async
            - first: 2-3
            - last: 45 (first runs: 222)
        - autocomplete_match
          - num results: 5332
          - sync
            - first: 1
            - last: 252-260 (first runs: 285)
          - async
            - first: 2-3
            - last: 260 (first runs: 300)
    - so in raw stmts fts is consistently faster.  why, then, was stock
      nsPlacesAutoComplete stripped to the bone sometimes faster?
      - it's the LIMIT clause.  when i add a limit clause, raw sql times match
        the times i was seeing with nsPlacesAutoComplete.  even when i add an
        ORDER BY, as nsPlacesAutoComplete uses.  the ORDER BY must be using
        indexes -- it couldn't be so fast otherwise... yeah, when i order by
        title instead, which i presume is not indexed, time is much greater.
      - one thing the fts version does wrong: i LIMIT each stmt, but i'm running
        34 stmts, so in total i'm getting 34 * limit results.
      - yeah, looks like this is where i'm falling behind.  when i restrict the
        entire query to a single fts table/stmt, i beat the linear scan.
      - it's not the overhead of executing 34 separate stmts, because when i do
        all 34 in the same stmt, it's still slower.