Places:AutoComplete

(Please comment by clicking "discussion" above, and not by editing this page.)


The Places History AutoComplete uses a SQL database to find pages to display to the user for a given search query. This page describes how results are processed and given to the front-end (e.g., Firefox 3), along with various optimizations to speed things up.


__TOC__


= Basic Work Flow =


The general outline of how a search query incrementally finds results is as follows:
# Initialize data structures
# Process search query
# Search database and save matches
# Update search status
# Continue searching if necessary


== Initialize data structures ==


Searches are started through the nsIAutoCompleteSearch::StartSearch interface, where we get the search query and a listener to report results to. Here we standardize the query internally by trimming spaces, switching to lowercase, etc., to simplify matching it against other queries.
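
The trim-and-lowercase step is simple enough to sketch. The helper below is purely illustrative and assumes plain std::string input with ASCII-only case folding; the real code works with Mozilla's string classes and is Unicode-aware.

<pre>
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not the actual Places code): standardize the search
// string by trimming surrounding whitespace and lowercasing it.
std::string NormalizeSearchString(const std::string& aQuery)
{
  const char* kWhitespace = " \t\r\n";
  std::string::size_type start = aQuery.find_first_not_of(kWhitespace);
  if (start == std::string::npos)
    return std::string();  // the query was nothing but whitespace
  std::string::size_type end = aQuery.find_last_not_of(kWhitespace);
  std::string query = aQuery.substr(start, end - start + 1);

  // Lowercase (ASCII only in this sketch).
  std::transform(query.begin(), query.end(), query.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return query;
}
</pre>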


Here we initialize data to store the listener, new results, and information about bookmarks/livemarks. We additionally reset previous search information like the status of the search, the search position/offset, and the matching URLs.


== Process search query ==


The search query is a plain string that can be split into multiple words by spaces. Typically this is for multi-word searching, but special tokens are also processed to change the behavior of the search.


As the query is split into individual words/tokens, any special tokens are recorded as behavior changes and removed from the query, so we don't try matching them against the title/URL.
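
As a rough sketch of this step (illustrative only; the token characters and flag names below are made up for the example and are not the actual Places restrict tokens), the query can be parsed like this:

<pre>
#include <sstream>
#include <string>
#include <vector>

struct ParsedQuery {
  std::vector<std::string> tokens;   // words matched against title/URL/tags
  bool restrictToVisited = false;    // behavior flags set by special tokens
  bool matchURLOnly = false;
};

// Split the query on spaces; special tokens only toggle behavior and are
// dropped so they are never matched against page titles or URLs.
ParsedQuery ParseSearchQuery(const std::string& aQuery)
{
  ParsedQuery parsed;
  std::istringstream words(aQuery);
  std::string word;
  while (words >> word) {
    if (word == "^") {                // hypothetical "visited pages only" token
      parsed.restrictToVisited = true;
    } else if (word == "@") {         // hypothetical "match the URL only" token
      parsed.matchURLOnly = true;
    } else {
      parsed.tokens.push_back(word);
    }
  }
  return parsed;
}
</pre>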


== Search database and save matches ==


The bulk of the history processing is in this stage, and it consists mostly of executing a SQL query and filtering results.


=== Query database ===


The queries themselves are fairly simple in that they blindly fetch all pages that fall within the current chunk (a fixed chunk size with an advancing chunk offset). One important property of the results is that they come back in frecency order: the ordering is precomputed ahead of time, and the database has an index on the frecency column.
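
A minimal sketch of what such a chunked query could look like, using the plain SQLite C API against a simplified stand-in table and made-up chunking constants (the actual implementation goes through mozStorage and the Places schema):

<pre>
#include <sqlite3.h>
#include <cstdio>

static const int kChunkSize = 100;  // hypothetical chunk size

// Fetch one chunk of candidate pages; ordering relies on the precomputed,
// indexed frecency column, so no ranking happens at query time.
bool FetchChunk(sqlite3* db, int aChunkOffset)
{
  const char* kSql =
      "SELECT url, title, frecency FROM pages "
      "ORDER BY frecency DESC LIMIT ? OFFSET ?";

  sqlite3_stmt* stmt = nullptr;
  if (sqlite3_prepare_v2(db, kSql, -1, &stmt, nullptr) != SQLITE_OK)
    return false;

  sqlite3_bind_int(stmt, 1, kChunkSize);
  sqlite3_bind_int(stmt, 2, aChunkOffset);

  while (sqlite3_step(stmt) == SQLITE_ROW) {
    // Each row would be handed to the filtering stage described below.
    std::printf("%s | %s\n",
                reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)),
                reinterpret_cast<const char*>(sqlite3_column_text(stmt, 1)));
  }

  sqlite3_finalize(stmt);
  return true;
}
</pre>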


=== Filter results ===
 
The filtering of results is not handled by SQL because we do fairly complex matching against titles/tags/URLs, with behavior that varies according to the user's preferences.
 
For each page returned by the database query, we get a number of properties/attributes such as the page title, URL, tags, and visit count. For a page to match a query, every token must match somewhere in the title, URL, or tags. Depending on the preferences, the filtering matches query tokens on word boundaries, anywhere, or at the beginning of each of those properties.
 
Special tokens and behavior preferences can additionally filter results by forcing URL matches or allowing only visited pages. These filters are also applied while checking each search token against each page.
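
A simplified sketch of this filtering (ASCII-only and purely illustrative; the real matching also supports matching only at the beginning of a field and is driven by preferences): a page matches only if every query token is found in its title, URL, or tags.

<pre>
#include <cctype>
#include <string>
#include <vector>

struct PageInfo {
  std::string url;
  std::string title;
  std::string tags;
};

// Look for aToken in aText, optionally requiring it to start on a word boundary.
static bool FindInString(const std::string& aToken, const std::string& aText,
                         bool aOnWordBoundary)
{
  std::string::size_type pos = aText.find(aToken);
  while (pos != std::string::npos) {
    bool onBoundary =
        pos == 0 || !std::isalnum(static_cast<unsigned char>(aText[pos - 1]));
    if (!aOnWordBoundary || onBoundary)
      return true;
    pos = aText.find(aToken, pos + 1);
  }
  return false;
}

// Every token must match somewhere in the title, URL, or tags.
bool PageMatchesQuery(const PageInfo& aPage,
                      const std::vector<std::string>& aTokens,
                      bool aOnWordBoundary)
{
  for (const std::string& token : aTokens) {
    if (!FindInString(token, aPage.title, aOnWordBoundary) &&
        !FindInString(token, aPage.url, aOnWordBoundary) &&
        !FindInString(token, aPage.tags, aOnWordBoundary))
      return false;
  }
  return true;
}
</pre>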
 
=== Store matches ===
 
Once a match is determined, we package it into a result that has the URL, title, favicon, and style. The title is overloaded to also carry the page's tags, which are concatenated to the page/bookmark title and separated by an en dash.
 
The style lets the front-end know whether the result is a plain page, a bookmarked page, etc.
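
For illustration, composing the display title and style could look roughly like this (the style keywords and helper names are assumptions for the sketch, not the exact values the front-end expects):

<pre>
#include <string>

// Join the page/bookmark title and its tags with an en dash, as described above.
std::string MakeDisplayTitle(const std::string& aTitle, const std::string& aTags)
{
  if (aTags.empty())
    return aTitle;
  return aTitle + " \u2013 " + aTags;  // "\u2013" is the en dash separator
}

// Pick a style keyword so the front-end can tell result types apart.
std::string MakeResultStyle(bool aIsBookmarked, bool aHasTags)
{
  if (aHasTags)
    return "tag";
  if (aIsBookmarked)
    return "bookmark";
  return "favicon";  // assumed keyword for a plain history page
}
</pre>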
 
== Update search status ==
 
After a chunk is finished, we report to the listener whether we've found results and whether we're going to continue searching.
 
This is done by setting the result status on the nsIAutoCompleteResult object and delivering the results to the nsIAutoCompleteListener.
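
Conceptually, the status boils down to combining "did we find anything" with "are we still searching". The enum below is only an illustrative stand-in for the real result-status constants:

<pre>
// Illustrative stand-in for the result statuses reported to the listener.
enum class SearchStatus {
  SuccessOngoing,  // found results, more chunks to come
  NoMatchOngoing,  // nothing yet, but still searching
  Success,         // found results, search finished
  NoMatch          // nothing found, search finished
};

SearchStatus ComputeSearchStatus(bool aHasResults, bool aWillContinue)
{
  if (aWillContinue)
    return aHasResults ? SearchStatus::SuccessOngoing
                       : SearchStatus::NoMatchOngoing;
  return aHasResults ? SearchStatus::Success : SearchStatus::NoMatch;
}
</pre>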
 
== Continue searching if necessary ==
 
If we have enough search results, we stop searching; otherwise, we keep going until we run out of pages to process. The continuation is scheduled with a timer that waits briefly before processing the next chunk.
 
Additionally, if the search behavior is to match on word boundaries first, it's possible for the search to start back at the beginning (from the first chunk) and process all the pages again without forcing a match on word boundaries. This is to ensure we show word boundary matches first even if a match-anywhere page has a higher frecency.
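
A rough sketch of this continuation decision (the constants, struct, and function names are made up for the example; the real code schedules the next chunk on a timer):

<pre>
#include <cstddef>

static const std::size_t kMaxResults = 25;  // hypothetical result cap
static const int kChunkSize = 100;          // hypothetical chunk size

struct SearchState {
  int chunkOffset = 0;          // where the next database chunk starts
  std::size_t resultCount = 0;  // matches found so far
  bool onWordBoundary = true;   // first pass only accepts word-boundary matches
};

// Decide what to do after finishing one chunk; returns true if another
// chunk should be scheduled.
bool ContinueSearch(SearchState& aState, bool aChunkWasLast)
{
  if (aState.resultCount >= kMaxResults)
    return false;  // enough results; stop searching

  if (!aChunkWasLast) {
    aState.chunkOffset += kChunkSize;  // move on to the next chunk
    return true;
  }

  // Ran out of pages: if the first pass required word-boundary matches,
  // start over from the first chunk and match anywhere, so word-boundary
  // matches are still shown first.
  if (aState.onWordBoundary) {
    aState.onWordBoundary = false;
    aState.chunkOffset = 0;
    return true;
  }
  return false;
}
</pre>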
 
= Optimizations =
 
There are a number of optimizations to help speed up finding matching pages. They tend to rely on the fact that the user types the query incrementally, i.e., an additional letter is added to the end of the previous query; a rough sketch of how that can be exploited follows the list below.
 
# Stop searching when there are enough results
# Reuse existing results when filtering more
# Don't search if it won't find anything
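
The sketch below illustrates the last two items under the assumption that the new query simply extends the previous one (all names are hypothetical): extending a query can only narrow its matches, so previous matches can be re-filtered instead of querying the database again, and an extension of a query that found nothing in all of history cannot find anything either.

<pre>
#include <string>
#include <vector>

struct PreviousSearch {
  std::string query;                 // the query we last searched for
  std::vector<std::string> matches;  // URLs that matched it
  bool exhaustedHistory = false;     // true if that search covered all pages
};

// True if aNewQuery just adds characters to the end of the previous query.
bool IsExtensionOf(const std::string& aNewQuery, const PreviousSearch& aPrev)
{
  return aNewQuery.size() >= aPrev.query.size() &&
         aNewQuery.compare(0, aPrev.query.size(), aPrev.query) == 0;
}

// "Reuse existing results": re-filter the previous matches instead of
// running the database query again.
bool CanReusePreviousMatches(const std::string& aNewQuery,
                             const PreviousSearch& aPrev)
{
  return IsExtensionOf(aNewQuery, aPrev) && aPrev.exhaustedHistory;
}

// "Don't search if it won't find anything": an extended query matches a
// subset of the previous matches, so an exhaustive empty result means the
// new search can be skipped entirely.
bool SearchCanFindAnything(const std::string& aNewQuery,
                           const PreviousSearch& aPrev)
{
  return !(IsExtensionOf(aNewQuery, aPrev) && aPrev.exhaustedHistory &&
           aPrev.matches.empty());
}
</pre>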
 
== Stop searching when there are enough results ==
== Reuse existing results when filtering more ==
== Don't search if it won't find anything ==
