Browser History:Redirects

From MozillaWiki

Rationale

The current history implementation does not handle redirects well. If you visit http://google.com, you will get redirected to http://www.google.com. The redirect page does not have a favicon or a title, and only clutters history views with unhelpful information.

This type of problem will be much more significant when planned enhancements are made with the places system. For example, if there is an easy-to-use bookmark button that shows the bookmark state of the current page, this button will be broken if a bookmark redirects. You will also not get proper favicons associated with the bookmark, and any notes or other annotations on the page will be ambiguous.

We will therefore need to pay special attention to redirects and provide some type of canonicalization to URLs to determine if they are bookmarked or what we should display in history views.

Note that this problem is not generally solvable. First, some types of redirects are not detectable as redirects, such as navigations done by JavaScript on the web page. Second, it is always possible to construct cases where the correct answer is not well-defined. For example, redirect destinations that change over time, multiple pages redirecting to the same page, multiple levels of redirects, and annotations attached to any of these pages can combine so that it is unclear what set of data should be shown for the page. As a result, the goal is to do something that catches reasonable cases of redirection and doesn't give answers that are clearly wrong.

Scenarios

Link Permanent Redirect

This case happens when the user follows a link that is redirected by one or more 301 (moved permanently) redirects.

This is the easy case: the canonical URL is the redirected version. All interesting information, such as the title and favicon, is stored on the destination, so we should just show the destination in history.

Bookmark Permanent Redirect

If you follow a bookmark and get a permanent redirect, should the bookmark be updated automatically? This might be annoying in some cases, especially since some wireless proxies requiring logon (like in hotels) are misconfigured and give permanent redirects to the login page. It would also be a security problem in the case of a malicious proxy, which would be free to rewrite any bookmark you follow. See bugs 103610, 8648, and 213467.

Typed Permanent Redirect

This is probably the most common case. The user types a common URL, such as "http://google.com/" or "http://amazon.com/" and is 301 (moved permanently) redirected to "http://www.google.com/" or "http://www.amazon.com/exec/obidos/subst/home/home.html".

The approaches, advantages, and drawbacks will be the same as for the bookmarked case, except we don't need to worry about updating the bookmark.

Temporary redirects

In these cases, the source URL is probably the "best" one, since the resource will presumably move again.

  • Treat the source as canonical. This opens a vulnerability: a malicious site could construct temporary redirects to a good site and be noted as canonical. Bookmarks or subsequent visits to the good site might then be associated with the malicious site, which is free to change its content. This approach also has problems for titles and favicons, which will be associated with the destination page. Keeping track of the appropriate attributes may be challenging.
  • Show the destination only. This would treat the redirect just like a permanent redirect, but we shouldn't update any bookmarks. The danger is that if the user tries to revisit the page by clicking in history, the resource may have moved again.
  • Possibly the best compromise is to treat the source URL as canonical only if the destination has never been visited from somewhere else. If the destination is visited from somewhere else, we undo it and treat the destination as canonical. Bookmarks on the destination should always be assigned to the destination page regardless of canonicalization. This gets around the security problems (at worst, the user sees a funny URL in history, and then only for pages they've never seen before), while preserving most of the functionality for legitimate temporary redirects.
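The compromise in the last bullet can be sketched as a small decision function. This is only an illustration in Python, not code from the history system: the `Visit` record, its `redirected_from` field, and `canonical_url` are hypothetical names, and bookmark handling is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Visit:
    url: str
    # URL that temporarily redirected to this visit, or None when the page
    # was reached another way (link, typed URL, bookmark).
    redirected_from: Optional[str] = None


def canonical_url(source: str, destination: str, visits: List[Visit]) -> str:
    """Choose which end of a temporary redirect to treat as canonical."""
    for visit in visits:
        # A visit to the destination that did not come through this source
        # means the user knows the destination independently.
        if visit.url == destination and visit.redirected_from != source:
            return destination
    return source
```

With only redirected visits on record the source stays canonical; as soon as the destination is reached by any other route, the function switches to the destination.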

Approach

Redirect information will be stored in the transition type of each visit. The current history API only has a boolean flag for redirects. It would be nice to differentiate permanent and temporary redirects; we will have to look into whether this information is available from the docshell.

History View

Coalescing redirects will be an optional portion of history queries, so callers can get the raw visit data if desired.

When the history system gives query results by visit, we will hide the sources of redirects and only show destinations. For all visits X that result in redirects to Y, we will remove X and add Y (if Y isn't already in the result set). We then repeat this operation until there are no redirected visits left, which handles multiple redirect hops.

This handles many cases: multiple redirected destinations from X, multiple redirect sources for Y. It also means that if you do a query and it matches the source, you'll get the destination even if the query doesn't match the destination. There are some strange results, however:

  • In rare cases, the user will be manually looking for a given host name or other attribute that occurs in the source but not the destination. This attribute may not appear in the result set.
  • If the query is "all pages on host Z" and one of those results is a redirect to another host, it will be surprising to see another host in the result set.
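The replace-and-repeat pass described above can be sketched as follows. This is a simplified model, not the planned implementation: it works on URLs rather than individual visits, the `redirects` mapping (source URL to destination URLs) is an assumed input, and the loop guard is an addition the text does not mention.

```python
from typing import Dict, FrozenSet, List


def coalesce_redirects(results: List[str], redirects: Dict[str, List[str]]) -> List[str]:
    """Replace every redirect source in `results` with its final destinations."""
    coalesced: List[str] = []

    def resolve(url: str, path: FrozenSet[str]) -> None:
        targets = redirects.get(url, [])
        if not targets or url in path:
            # Either a real destination, or a redirect loop: keep the URL
            # (once) instead of following it forever.
            if url not in coalesced:
                coalesced.append(url)
            return
        for target in targets:
            resolve(target, path | {url})

    for url in results:
        resolve(url, frozenset())
    return coalesced
```

A source that matched the query is dropped and its destination appears instead, even if the destination did not match the query itself, which is exactly the surprising behavior listed in the bullets.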

When queries for visited web pages (rather than individual visits) are done, it is less clear what to do. Since redirects can change over time, it may not be clear which redirect to use and which pages deserve to be in the result. The ideal solution depends on the expected uses of non-visit-based history queries, which are not yet clear. One solution is to not coalesce redirects at all. The other is, for a page X, to substitute Y for X if some visit to X (or perhaps only the most recent one) was a redirect to Y. This will be wrong if X's redirect changes, but it should handle most cases.

Bookmark Queries

In the history view or in the main page if a "bookmark" toggle button is added, we will need to quickly answer the question "is this page or an equivalent page bookmarked?"

First, let us discard one case: when the page in question redirects to a bookmarked page. In this case, the user has specifically bookmarked the destination page and probably does not consider the source page to be the same. Furthermore, the source page's redirect may have changed, so the equivalence of the URLs may not even be true any more.

Therefore, the question becomes: "has the page in question been redirected from any bookmarked URL?" This question can be answered by querying history, although the possibility of multiple redirects makes this more difficult because we may have to execute multiple statements.
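One way to picture this lookup is a backwards walk over recorded redirects. The sketch below is an in-memory illustration in Python, not the SQL the history system would run: `redirect_sources` (destination URL to the set of URLs that redirected to it) and `bookmarks` are assumed inputs, and each loop iteration stands in for one additional statement.

```python
from typing import Dict, Set


def is_effectively_bookmarked(
    url: str,
    redirect_sources: Dict[str, Set[str]],
    bookmarks: Set[str],
) -> bool:
    """True if `url` or any page that redirected to it is bookmarked."""
    pending = [url]
    seen = {url}
    while pending:
        current = pending.pop()
        if current in bookmarks:
            return True
        # Walk one redirect hop backwards; multiple hops need multiple rounds.
        for source in redirect_sources.get(current, ()):
            if source not in seen:
                seen.add(source)
                pending.append(source)
    return False
```

The walk only goes from destinations back to sources, so a page that merely redirects to a bookmarked page is not reported as bookmarked, matching the case discarded above.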

This approach could cause problems if the user clears their history. For example, if they create a bookmark for "google.com", clear their history (removing all redirect information) and manually type in "www.google.com" the equivalence will not be detected.

This approach is also very inefficient. If we want to mark all bookmarked items in a history view, we'll have to execute at least one query per row just to determine whether it has been bookmarked. A simple optimization would be to add a boolean column in history that indicates whether this URL has ever been a redirect destination. This way, we can quickly discard most history items, and only do these queries for the minority of pages that have been redirected.

An additional level of optimization would store the URL of the redirect source in a column of the history table. If we ever need to add another redirect source and the redirect column is already full and does not match, we can store some known string that indicates there is more than one source. This saves a little time because the next-most-common case, where there is only one redirect source, can use a simple query over history rather than a join over all visits. This optimization may or may not be worth it.
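A minimal sketch of how such a column might be maintained and consulted, assuming a hypothetical sentinel value `MULTIPLE_SOURCES` and hypothetical helper names; the actual column layout is not decided in this document.

```python
from typing import Optional

# Hypothetical marker meaning "more than one redirect source is known".
MULTIPLE_SOURCES = "<multiple>"


def merge_redirect_source(stored: Optional[str], new_source: str) -> str:
    """Return the new column value after seeing a redirect from `new_source`."""
    if stored is None or stored == new_source:
        return new_source
    # The column is already full and does not match: fall back to the marker.
    return MULTIPLE_SOURCES


def lookup_strategy(stored: Optional[str]) -> str:
    """Describe how much work a bookmark check needs for this history row."""
    if stored is None:
        return "skip"           # never a redirect destination
    if stored == MULTIPLE_SOURCES:
        return "join-visits"    # must consult the full visit table
    return "single-lookup"      # check the one stored source directly
```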

Capturing redirects and visit paths

Visit paths

The new history system would like to track visit trails, with information about how each visit occurred. For example, we would like to know that the user typed the URL for site A, went from A to B by following a link, was automatically redirected to C, and then followed a link on C to open D in a new window.

Currently, this information cannot be tracked accurately. When the user follows links, the history system has the referrer, but does not have any other data. For example, if the user follows A->B->C and D->B->F in separate windows, we'll get confused because B was open in two windows at once and we can't figure out where it came from. This is not a critical case to handle correctly, but it would be nice to do.

More critical is that redirects cannot be tracked. If a user follows a link from A to B, but B redirects to C, we'll get two messages: (B, redirect=true, parent=A) and (C, redirect=false, parent=A). No information is available to the history system that B redirected to C.

We would also like to be able to track the types of transitions: regular link following, bookmark opening, typing, and redirects. There would also be other flags, such as opened in new window and opened in new tab.
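As an illustration of the data this implies, the visit trail from the start of this section (A typed, B by link, C by redirect, D by link in a new window) could be recorded like this. All type and field names here are hypothetical; the document does not define a schema.

```python
from dataclasses import dataclass
from enum import Enum, Flag
from typing import List, Optional


class Transition(Enum):
    LINK = 1
    BOOKMARK = 2
    TYPED = 3
    REDIRECT = 4


class OpenedIn(Flag):
    NONE = 0
    NEW_WINDOW = 1
    NEW_TAB = 2


@dataclass
class Visit:
    visit_id: int
    url: str
    transition: Transition
    from_visit: Optional[int] = None   # ID of the visit that led here
    opened_in: OpenedIn = OpenedIn.NONE


trail = [
    Visit(1, "http://a.example/", Transition.TYPED),
    Visit(2, "http://b.example/", Transition.LINK, from_visit=1),
    Visit(3, "http://c.example/", Transition.REDIRECT, from_visit=2),
    Visit(4, "http://d.example/", Transition.LINK, from_visit=3,
          opened_in=OpenedIn.NEW_WINDOW),
]


def path_to(visit_id: int, visits: List[Visit]) -> List[str]:
    """Follow from_visit links back to the start of the trail."""
    by_id = {visit.visit_id: visit for visit in visits}
    path = []
    current = by_id.get(visit_id)
    while current is not None:
        path.append(current.url)
        current = by_id.get(current.from_visit)
    return list(reversed(path))
```

Linking visits by ID rather than by referrer URL is what would let the redirect from B to C be recorded as such.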

How the "typed" flag works

The current system keeps track of whether URLs in history were typed. This is used to boost the ranking of those pages in autocomplete. When the user types a URL into the browser, the JavaScript code in the browser calls the history system and notifies it that the URL has been typed. The history system will set the flag on the corresponding history entry (if it is not already set).

If there is no history entry for the typed URL, a new one is created marked typed but hidden and never visited. When the document is loaded, the history system is notified of the visit and it unhides the entry and saves the visit date, preserving the typed flag. This way, typing an invalid URL that is never loaded will not fill the history and autocomplete lists with junk.
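The hidden-until-visited behavior can be modeled in a few lines. This is a toy model in Python with hypothetical class and method names, not the history service's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HistoryEntry:
    url: str
    typed: bool = False
    hidden: bool = True                 # new entries stay hidden until visited
    last_visit: Optional[float] = None


class History:
    def __init__(self) -> None:
        self.entries: Dict[str, HistoryEntry] = {}

    def _entry(self, url: str) -> HistoryEntry:
        if url not in self.entries:
            self.entries[url] = HistoryEntry(url)
        return self.entries[url]

    def mark_typed(self, url: str) -> None:
        # Creates a hidden, never-visited entry if none exists yet.
        self._entry(url).typed = True

    def add_visit(self, url: str, when: float) -> None:
        entry = self._entry(url)
        entry.hidden = False            # the page really loaded: show it
        entry.last_visit = when         # the typed flag is left untouched

    def visible_urls(self) -> List[str]:
        return [e.url for e in self.entries.values() if not e.hidden]
```

A mistyped URL that never loads only ever reaches `mark_typed`, so it stays hidden and out of autocomplete.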

With this setup, we have no way of knowing when a specific visit was typed. Because the typed and the visit notifications arrive asynchronously from different components, we have no way of associating specific typed flags with specific visits.

One approach would be to keep a list of the N most recent typed notifications. When a visit comes through, associate that visit with the matching typed notification and remove the notification from the queue. We would also like to add a bookmark notification, equivalent to the typed flag, for when the user follows a bookmark.
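A bounded queue of this kind might look like the following sketch. The class name, method names, and default capacity are assumptions; the document does not fix N.

```python
from collections import deque


class RecentTyped:
    """Remembers the N most recent typed URLs until a visit claims them."""

    def __init__(self, capacity: int = 8) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._urls = deque(maxlen=capacity)

    def note_typed(self, url: str) -> None:
        self._urls.append(url)

    def consume(self, url: str) -> bool:
        """Return True (and forget the notification) if `url` was typed."""
        try:
            self._urls.remove(url)
        except ValueError:
            return False
        return True
```

When a visit arrives, the history system would call `consume` with the visited URL and record the visit as typed only if it returns True. A second queue of the same shape could serve bookmark notifications.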

How redirecting notifications work from the DocShell/DocLoader level

nsDocLoader implements nsIChannelEventSink, which gets channel event messages. When it gets OnChannelRedirect, it has both the new and the old channels. It sets the STATE_IS_DOCUMENT flag if the channel has its LOAD_DOCUMENT_URI flag set, and fires a state change notification.

The state change notifies all WebProgressListeners that have been attached to the DocLoader. The DocShell, which is derived from the DocLoader, is itself registered as an event listener. It therefore receives the OnStateChange event with the source channel (no information is available about the new destination channel or URI) and checks the STATE_IS_DOCUMENT and STATE_REDIRECTING flags. When these flags are set, it calls AddToGlobalHistory with a redirect flag of true.

META refreshes are a little more complicated. We would like to be able to detect meta refreshes with a time less than X (where X is something like 20 seconds). These get loaded with LOAD_FLAGS_REPLACE_HISTORY set. Apparently, this can be set for other types of documents as well, and I'm unclear at the moment how to detect meta refreshes, or where to get the refresh time (we probably don't want to worry about refreshes with very long timeouts). In any case, we don't know that the page will refresh until it does, meaning that when this happens we'll want to go back and hide the source page from global history.

Example: I have a web page refresh.html that has a META refresh to www/demo.html. My web server will rewrite requests from www to www.mydomain.com.

  • Type "www/refresh.html"
  • Add "www/refresh.html" to global history as a redirect
  • Get an OnChannelRedirect to "www.mydomain.com/refresh.html"
  • Add "www.mydomain.com/refresh.html" to global history as a normal page.
  • Get a load for "www/demo.html" with LOAD_FLAGS_REPLACE_HISTORY
  • Add "www/demo.html" to global history as a redirect
  • Get an OnChannelRedirect to "www.mydomain.com/demo.html", LOAD_FLAGS_REPLACE_HISTORY is still set.
  • Add "www.mydomain.com/demo.html" to global history as a normal page.

It seems that it is not possible to get a frame's session history entry? This will be necessary to get the proper visit for these items.

Proposal

Add a virtual function "OnRedirectStateChange" to nsDocLoader. This function would be called manually from OnChannelRedirect with full information on the source and destination channels. DocShell will provide an implementation of this function, and we can move the redirect handling code from OnStateChange to there.
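The shape of this change can be modeled outside Gecko. The sketch below uses Python classes as stand-ins for nsDocLoader and DocShell; the method names mirror the ones above, but the signatures and the use of plain URI strings instead of channels are assumptions made for illustration.

```python
from typing import List, Tuple


class DocLoader:
    def on_channel_redirect(self, old_uri: str, new_uri: str) -> None:
        # New hook: called directly, while both ends of the redirect are known.
        self.on_redirect_state_change(old_uri, new_uri)
        # Existing notification: listeners only learn about the source.
        self.on_state_change(old_uri)

    def on_redirect_state_change(self, old_uri: str, new_uri: str) -> None:
        """Overridable hook; the base class does nothing."""

    def on_state_change(self, uri: str) -> None:
        """Stand-in for the existing state change notification."""


class DocShell(DocLoader):
    def __init__(self) -> None:
        self.recorded_redirects: List[Tuple[str, str]] = []

    def on_redirect_state_change(self, old_uri: str, new_uri: str) -> None:
        # Redirect handling moves here, where the destination is available.
        self.recorded_redirects.append((old_uri, new_uri))
```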

We could add a new redirect function to the existing nsIGlobalHistory2 or add it in a new interface nsIGlobalHistory3. Adding to the existing interface is cleaner, but would force embedders to update their history code. A new interface could be optional (docshell would fall back on AddURI if history didn't QI to nsIGlobalHistory3) so embedders would not have to change, but it would make history more difficult to understand. Adding an additional method is probably preferable; the GeckoFlags function was successfully added without too much pain to GlobalHistory2 in Firefox 1.5.

Session history entries would be extended to support nsIWritablePropertyBag. When the history system gets visit notifications, it can store necessary state on the corresponding entry (probably just a 64-bit visit ID number). This way, it can associate a given call with a specific referring visit, even if the same document is open in more than one docshell.
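To show why a per-entry visit ID resolves the two-window ambiguity, here is a toy model. A plain dict stands in for the property bag, and the property name and classes are hypothetical.

```python
from itertools import count
from typing import Dict, Optional, Tuple

# Hypothetical property name under which the visit ID is stored.
VISIT_ID_KEY = "history-visit-id"


class SessionHistoryEntry:
    def __init__(self, url: str) -> None:
        self.url = url
        self.properties: Dict[str, int] = {}   # stands in for the property bag


class GlobalHistory:
    def __init__(self) -> None:
        self._ids = count(1)
        # visit ID -> (URL, ID of the referring visit or None)
        self.visits: Dict[int, Tuple[str, Optional[int]]] = {}

    def add_visit(
        self,
        entry: SessionHistoryEntry,
        referrer: Optional[SessionHistoryEntry] = None,
    ) -> int:
        visit_id = next(self._ids)
        from_visit = None
        if referrer is not None:
            # Read the referring visit from the entry, not from its URL.
            from_visit = referrer.properties.get(VISIT_ID_KEY)
        self.visits[visit_id] = (entry.url, from_visit)
        entry.properties[VISIT_ID_KEY] = visit_id
        return visit_id
```

With A->B->C in one window and D->B->F in another, each window holds its own entry for B with its own visit ID, so C and F point back at different visits even though the referrer URL is identical.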

Global history would maintain queues of URLs for typed and bookmarked flags. When new URLs come through that have no referrer, it would check these queues to see if the URL had either of these flags set. This requires no changes to non-history system code, but it could be confused in some unusual cases (for example, if the user types a URL and follows a bookmark for the same URL in quick succession). The side effect of getting it wrong is very minor.

For dealing with meta refreshes, it might be sufficient to detect LOAD_FLAGS_REPLACE_HISTORY in docshell::AddToGlobalHistory and provide some indication to history about this predicament. If the time between the original page and the replacement is low, history can assume the replacement is due to a refresh and mark it accordingly. This might also catch some cases where scripts cause the page to redirect somewhere else, which would be nice.
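The timing heuristic might be as simple as the sketch below. The function name, its parameters, and the 20-second default (borrowed from the "something like 20 seconds" mentioned earlier) are assumptions, not a decided design.

```python
# Assumed threshold, taken from the "something like 20 seconds" figure above.
REFRESH_THRESHOLD_SECONDS = 20.0


def looks_like_refresh(
    replace_history: bool,
    original_loaded_at: float,
    replacement_loaded_at: float,
    threshold: float = REFRESH_THRESHOLD_SECONDS,
) -> bool:
    """Guess whether a history-replacing load was a quick refresh."""
    if not replace_history:
        # Only loads flagged as replacing the session history entry qualify.
        return False
    elapsed = replacement_loaded_at - original_loaded_at
    return 0 <= elapsed < threshold
```

When this returns True, history could mark the replacement as a redirect-like transition and hide the original page.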

Review of changes in docshell code:

  • New nsIGlobalHistory3 interface defined in docshell/base or add to nsIGlobalHistory2.
  • New virtual function OnRedirectStateChange in docloader.
  • Move HTTP 30X redirect code from OnStateChange to OnRedirectStateChange.
  • Change docshell AddToGlobalHistory to use the new AddURIFrom history function.
  • Implement nsIWritablePropertyBag on nsISHistoryEntry.
  • Tell GlobalHistory when a load is due to a refresh. This might be provided as a flag to the new AddURIFrom which is called from AddToGlobalHistory.