Phishing Protection: Design Documentation


Note: Main content is at: http://wiki.mozilla.org/Phishing_Protection


Phishing Protection Design Documentation

Fritz Schneider,
Niels Provos (niels at google.com),
Raphael Moll (raphael at google.com),
Monica Chew (mmc at google.com),
Brian Rakowski (bdr at google.com)

Objective

Phishing protection in Firefox that:

  • has near-zero false positives (perhaps: average user should get one false positive every 2-3 years)
  • has a usable, effective user interface
  • preserves privacy if the user so desires

Threat Model

By a phishing attack we mean a web page mimicking the appearance of a well-known e-commerce or financial web page for the purpose of stealing personal information. At this time we're primarily concerned with passive adversaries, those that are not actively trying to subvert our countermeasures. Additionally, we concern ourselves here only with "regular" web-based attacks, and not, for example, attacks carried out by accessing the user's local machine or DNS server.

One could broaden the definition of phishing to include shady retailers and the like, but making judgements about such pages is significantly harder than the more restricted case above.

Non-goals

Non-goals include approaches other than detect-and-warn or detect-and-prevent (e.g., trusted browser paths, two-factor authentication, mutual authentication protocols, etc.), techniques we can't launch in the near term, perfect coverage, and anything outside of our threat model, including (especially including) highly active adversaries.

Background

This project is an outgrowth of the Safe Browsing extension released by Google. It was released to Mozilla under the MPL so that it could be incorporated into Firefox 2.

Note that it is pure coincidence that Internet Explorer 7 will have (has) phishing protection.

Overview

Our current approach is blacklist-based. Blacklisting requires timely updates in order to be effective, but has the advantages of being simple and easy to deploy. If the shortcomings of a blacklisting approach offend you, consider that blacklisting would be a necessary component of any more comprehensive approach, so if you like, you can think of this initial version as the first step on the road to something much better.

The extension has an "enhanced protection" option that controls whether the extension uses remote or local blacklists.

With enhanced protection disabled, the client keeps a local blacklist of phishing pages and consults it for every URL the browser requests. This list is updated periodically from an update server, which sends either a diff against the client's current blacklist or a new, full blacklist (whichever is smaller).

With enhanced protection enabled, the client looks URLs up in a remote blacklist hosted on a lookup server (optionally encrypting the query for increased security). The enhanced protection mode provides better coverage because the server blacklist is up-to-the-minute fresh, as well as potentially more comprehensive because providers might be forced to prune client blacklists due to size.

When the client detects that a blacklisted page has loaded, the page is disabled, greyed out, and the user is shown a warning. The user has the option to navigate away or to continue to use the page. The extension also affords users a way to report both false positives and false negatives if they so desire.

At the moment the update and lookup servers reside at Google, but of course the provider should be user-selectable. Google's blacklists currently are built from commercial sources as well as internal Google sources.

Detailed Design

Warning Dialog UI

The hard part of a feature like this is not identifying phishing pages, but rather figuring out how to present this information to users. With this in mind, Google did usability testing of a multitude of potential UIs. In the process we learned:

  • It's hard but possible to create a warning dialog users will pay attention to
  • When a phishing warning is encountered, users have three primary reactions: a strong desire to continue to use the page, a near-panicked desire to get away from the page, and a desire to see the villains brought to justice. In cases where users heeded the warning, they typically also needed a sense of closure, so it's important to give it to them (hence the "report to google" option).
  • We originally showed the warning only when the user began interacting with a problematic page, but it quickly became clear that users prefer we just disable the page as it is loading.
  • Warnings of "suspicious" pages (pages we're not positive are phishing, but that might be) were by and large ineffective. Users either ignored the warning or said they'd prefer us to make a judgment for them, and that they'd trust us on it. Such warnings probably don't give the user enough information to go on; perhaps they'd be more effective if accompanied by a way for users to do a "background check" on the page.

Here's the current UI. This is by no means final; we're still doing testing. Additionally there are obvious branding issues here that need to be considered. But this is the basic idea.

[Image: Antiphishingui.png (screenshot of the current warning UI)]

Client (code in Firefox)

The client code is object-oriented and relies heavily on closures and custom JS abstractions.

Before getting into specifics, let's clarify some nomenclature. A browser window is a user-visible browser window, aka the object corresponding to browser.xul's <window> onto which we overlay. A tabbed browser is an object corresponding to a <tabbrowser> tag, of which there is one per browser window. A browser is a tab, one of potentially many browsers within a tabbed browser.

Major Abstractions

The extension has the following four major components (a structural sketch follows the list).

  • Phishing Warden: One phishing warden exists for the whole application. It monitors HTTP fetch requests and determines whether the page requested is problematic. It makes this determination, as we've said, by looking the URL up in either a remote or local blacklist.

    You can imagine in the future having additional wardens for other kinds of problematic content, for example a Spyware Warden.
  • Controller: There is one controller per browser window. A controller bound to a browser window is responsible for receiving and interpreting user actions and UI activity ("a new tab opened", etc.) within that window.
  • Browser View: Each controller has-a browser view, meaning that there is one browser view per browser window. The browser view is the clearinghouse for state about problem pages within the window with which its controller is associated. It embodies the logic of determining when to display a warning message to users, and mediates cases where multiple problems are evident in a single web page at the same time (e.g., it has two phishy frames). It also keeps track of problem documents so it knows when to hide warnings or clear state, for example if the user switches tabs or navigates away from a problem page.
  • Displayer: A displayer embodies how to display a warning, and what the content of that warning is. A warden has-a displayer factory for each context in which a message might be displayed. Right now there is only one kind of displayer (the kind that displays a warning after a page has begun loading), but you can imagine in the future that it might have other displayers, for example one that is used when the user is clicking a link.
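
To make the has-a relationships concrete, here is a minimal structural sketch. It is written in Python purely for illustration; the real code is JavaScript living in the extension, and the class and method names below are invented for this sketch rather than the actual identifiers in the codebase.

# Structural sketch only; names are illustrative, not the real identifiers.

class AfterLoadDisplayer:
    """Knows how and what to display once a problem page has begun loading."""
    def show(self, browser, url):
        print("WARNING: suspected phishing page:", url)

class PhishingWarden:
    """One per application: decides whether a requested URL is problematic."""
    def __init__(self, blacklist_lookup):
        self.blacklist_lookup = blacklist_lookup          # remote or local, per user pref
        self.displayer_factories = {"after-load": AfterLoadDisplayer}

    def is_problematic(self, url):
        return self.blacklist_lookup(url)

class BrowserView:
    """One per browser window: tracks problem documents, decides when to warn."""
    def __init__(self, warden):
        self.warden = warden
        self.problem_docs = set()

    def on_page_load(self, browser, url):
        if self.warden.is_problematic(url):
            self.problem_docs.add(url)
            self.warden.displayer_factories["after-load"]().show(browser, url)

class Controller:
    """One per browser window: receives UI/navigation events for that window."""
    def __init__(self, warden):
        self.browser_view = BrowserView(warden)           # controller has-a browser view

    def handle_page_load(self, browser, url):
        self.browser_view.on_page_load(browser, url)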

Other, less significant abstractions include:

  • TabbedBrowserWatcher: There is no single technique you can use to receive all relevant page or navigation-related notifications. Additionally, different kinds of notifications give you different kinds of information, often leaving you with insufficient information with which to work (e.g., you know there's a new request, but you have no way of knowing which DOMWindow or even browser is responsible for it).

    For this reason each controller has-a TabbedBrowserWatcher that adds a level of indirection to the whole mess. The tabwatcher flattens all the load/unload- and tab-related events in a browser window by attaching to the tabbedbrowser, browsers, documents, and frames and sending sane notifications to the controller about what's happening. It also provides utilities for locating Documents that have a particular URL loaded, the browser that has a particular Document loaded, and the like.
  • ListManager: There is a single ListManager responsible for downloading and updating white and black lists when enhanced protection is disabled. The actual deserialization (from the update server's response) is handled by a db update XPCOM service, as is serialization (from an updated table to local disk). The lists are stored in a MozStorage (sqlite) database. To avoid being flagged by some anti-virus software, URLs are ROT13'ed before being placed on disk (see the sketch after this list). The database is stored in the user's profile directory and the client receives updates once every 30 minutes.
  • TRTables: Since white and blacklists can be expressed in many different ways (lookup tables by domain, lookup tables by URL, etc.), we have a simple abstraction that knows how to look URLs up in tables of these various kinds. For each type of white/blacklist (map), we have a TRTable that knows how to look items up in a list of that type. A TRTable is-a Map with a specialized exists() method that knows the semantics of the name/value pairs it stores. These tables are stored in a MozStorage database.
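
As a small illustration of the ROT13 step mentioned in the ListManager item above, here is a sketch in Python. In the extension the serialization is done by the db update XPCOM service; this only shows the obfuscation itself, and the function names are made up.

import codecs

def obfuscate_for_disk(url):
    # ROT13 remaps only ASCII letters; digits and punctuation pass through.
    # This is obfuscation to avoid anti-virus false positives, not encryption.
    return codecs.encode(url, "rot13")

def deobfuscate_from_disk(stored):
    return codecs.decode(stored, "rot13")

print(obfuscate_for_disk("http://payments.g00gle.com/"))   # uggc://cnlzragf.t00tyr.pbz/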

Execution Context

There are many serious disadvantages to overlaying application js into the context of the browser window (browser.xul). Aside from cluttering the global namespace, overlaying application code prevents two versions of the same library from being used at the same time, makes sharing state across windows harder, and wastes resources.

For these reasons, the extension runs in its own context. This is achieved by having a bootstrap loader that is a self-registering XPCOM component. This component uses the subscript loader to load application js files into its context. Only small amounts of "glue" code exist in the browser window context, and these pieces of glue essentially just call into the application context to do the heavy lifting.

Client Backoff

Providing the data on the server for updates and lookups requires a fair amount of resources. To help maintain a high quality of service, it may be necessary for the update and lookup servers to ask the client to make less frequent requests. To handle this, the client watches for HTTP timeouts or errors from the server, and if too many errors occur, it increases the time between requests. If remote lookups start to fail, we fall back on using the tables provided during update requests.

The first update request happens at a random interval between 0-5 minutes after the browser starts. The second update request happens between 15-45 minutes later. After that, each update is once every 30 minutes.

If the client receives an error during update, it tries again in a minute. If it receives three errors in a row, it skips updates until at least 60 minutes have passed before trying again. If it then receives another (4th) error, it skips updates for the next 180 minutes, and if it receives another (5th) error, it skips updates for the next 360 minutes. It will continue to check once every 360 minutes until the server responds with a success message. The current implementation doesn't change the 30-minute timer interval; it just skips updates until the backoff time has elapsed.

A lookup request happens on page load if the user has opted into remote checking. If a lookup request fails, we automatically fall back on a local table. If there are 3 lookup failures in a 10 minute period, we skip lookups during the next 10 minutes. Each successive lookup failure increases the wait by (2*last wait + 10 minutes). The maximum wait before trying again is 360 minutes. As mentioned above, if we're not doing lookups, we query the local lists instead.

In both the update requests and lookup requests, once the server starts to send successful HTTP replies, the error stats are reset.
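
Here is a minimal sketch of the backoff schedule described above. The function names are invented, and the lookup formula is an interpretation of "increases the wait by (2*last wait + 10 minutes)" as "the new wait becomes 2 * last wait + 10", capped at 360 minutes; the original phrasing is ambiguous, so treat that reading as an assumption.

def update_skip_minutes(consecutive_errors):
    """Minutes to skip updates after N consecutive update errors, per the
    schedule above: 1st/2nd error -> retry in a minute, 3rd -> 60 minutes,
    4th -> 180 minutes, 5th and beyond -> 360 minutes."""
    if consecutive_errors <= 0:
        return 0                     # healthy: stay on the normal 30-minute timer
    if consecutive_errors < 3:
        return 1
    return {3: 60, 4: 180}.get(consecutive_errors, 360)

def next_lookup_wait_minutes(last_wait_minutes):
    """Lookup backoff after 3 failures in a 10-minute period. Interpreting
    the wait growth as new_wait = 2 * last_wait + 10 minutes, capped at
    360 minutes; this exact formula is an assumption."""
    return min(2 * last_wait_minutes + 10, 360)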

The Server Side

The provider (Google) currently exposes four interfaces: lookup, update, report, and getkey.

Protocol4

Note: this is not a specification; it's just a high-level description.

Before discussing the specifics of the server-side interfaces, we describe "protocol 4", a very simple format for describing name/value pairs that we use in several places.

A protocol4 message is a newline-separated sequence of name/value pairs. Each line consists of the name, the value length, and the value itself, all separated by colons. For example:

pi:4:3.14\n
fritz:33:issickofdynamicallytypedlanguages\n

Most clients should simply ignore the length -- it's a historical artifact and not necessary (and even dangerous) in most circumstances. A client should skip lines with names it does not understand, and should be tolerant of blank lines.
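
A minimal protocol4 reader might look like the following sketch (Python used for illustration; the function name is made up). It ignores the length field and tolerates blank or malformed lines, per the guidance above.

def parse_protocol4(body):
    """Parse a protocol4 message into a dict of name -> value."""
    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(":", 2)          # name : length : value
        if len(parts) != 3:
            continue                        # tolerate lines we don't understand
        name, _length, value = parts
        fields[name] = value
    return fields

print(parse_protocol4("pi:4:3.14\nfritz:33:issickofdynamicallytypedlanguages\n"))
# {'pi': '3.14', 'fritz': 'issickofdynamicallytypedlanguages'}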

Lookup Server

Note: this is not a specification; it's just a high-level description.

Lookup requests are GETs to a lookup server. In their unencrypted form they include a client and a q parameter. The client is used to identify who is making the request (e.g., "firefox" or "someextension") and the q contains the uri-encoded URL to look up. For example:

http://.../lookup?client=foo&q=http%3A%2F%2Fwww.google.com%2F

In their encrypted form (see below), the query parameter holding the URL is encrypted and the request URL contains extra parameters holding the meta-information required to decrypt the query. For example:

http://.../lookup?client=foo&nonce=1338535465&wrkey=MToSZ063o42vIwicBoO9SLTG
     &encparams=m2etCO%2BBEAeTEs5IDr%2BHeCSAk0R2OexXRVW1h9mUYH59&

When the server receives a query, it canonicalizes the URL and looks it up in its lists. It responds with a protocol4 message. The response is blank if the provider has no information about the URL, or contains the name "phishy" with a value of "1" if the URL is blacklisted. For example:

phishy:1:1

For various reasons, the lookup server might also respond with the name "rekey" with value of "1", telling the client to fetch a new encryption key.

The lookup server presumably has whitelists in addition to its blacklists, and might have blacklists in various forms, for example as regular expressions to match against the URLs of tricky phishers.

The client sends the full URL, including all query parameters. This is necessary because the base URL is often not sufficient to express the page that is dangerous.
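
To illustrate the unencrypted form, here is a hedged sketch of a lookup round trip in Python, reusing the parse_protocol4 helper sketched earlier. The server URL and client name are placeholders, not real endpoints, and the encrypted variant (described below) is not shown.

from urllib.parse import urlencode
from urllib.request import urlopen

LOOKUP_SERVER = "http://lookupserver.example/lookup"     # placeholder endpoint

def remote_lookup(url, client="foo"):
    query = urlencode({"client": client, "q": url})      # urlencode URI-encodes q
    with urlopen(LOOKUP_SERVER + "?" + query) as response:
        body = response.read().decode("utf-8")
    fields = parse_protocol4(body)                       # helper sketched earlier
    if fields.get("rekey") == "1":
        return "rekey"                                   # client should fetch a new key
    return fields.get("phishy") == "1"                   # blank body -> not blacklisted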

Update Server

Note: this is not a specification; it's just a high-level description.

The client can download and update various kinds of tables (lists) from an update server. Each table has a name with three components: provider-type-format. The provider is just a name used to identify where the list comes from. The type indicates whether the list is a white or blacklist. The format indicates how URLs should be looked up in the list, for example the list might contain domains, hosts, or URLs. For example:

goog-black-url      // A blacklist from Google; lookups should be by URL
acme-white-domain   // A whitelist of domains from Acme, Inc.; lookups by domain

Tables are versioned with major and minor numbers. The major version is currently 1 and describes the wire format (see below), i.e., how the table is serialized. The minor number is the version of the list. When providers add new items to a list or take items out of it, they increment the minor version number.

The client keeps a list of tables it knows about, as well as the version it has of each. To request an update from a provider, the client does a GET to the update server and expresses its tables and versions as a query parameter like: version=type:major:minor[,type:major:minor]*. For example:

http://.../update?client=foo&version=goog-black-url:1:432,acme-white-domain:1:32

The server responds with updates to all tables in the wire format. For each table, the response includes either a completely new table or a diff between the client's version of the table and the most current version, whichever is smaller.
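
Here is a sketch of how a client might assemble the version query parameter for an update request (Python for illustration; the server URL is a placeholder).

from urllib.parse import quote

UPDATE_SERVER = "http://updateserver.example/update"     # placeholder endpoint

def build_update_url(tables, client="foo"):
    """tables maps table name -> (major, minor) version currently held."""
    version = ",".join("%s:%d:%d" % (name, major, minor)
                       for name, (major, minor) in tables.items())
    return "%s?client=%s&version=%s" % (UPDATE_SERVER, client, quote(version, safe=":,"))

print(build_update_url({"goog-black-url": (1, 432), "acme-white-domain": (1, 32)}))
# http://updateserver.example/update?client=foo&version=goog-black-url:1:432,acme-white-domain:1:32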

Wire Format

Note: this is not a specification; it's just a high-level description.

The serialized form of the tables is called our "wire format." It's the format the update server responds with, as well as the form used by the extension when it serializes tables to disk.

The wire format is a simple line-oriented protocol. It consists of a sequence of sections consisting of a header line like [type major.minor [update]] followed by lines of data comprising the table described by the header. If the "update" token appears in the header line, the data following constitute an update to the client's existing table. Else the data specify a full, new table.

Data lines start with a + or -. A plus indicates an addition to the table and is followed by a tab-separated key/value pair. A minus means to remove a key from the table and is followed by the key itself.

An example update response is:

[goog-black-url 1.372 update]
+http://payments.g00gle.com/   1
+http://www.ovrture.com/givemeallyourmoney.htm   1 
+http://www.microfist.com/foo?bar=x   1 
-http://www.gewgul.com/index.html
-http://yah0o.com/login.shtml
...

[acme-white-domain 1.13]
+google.com 1
+slashdot.org 1
+amazon.co.uk 1
...

In this example, the client has some version of goog-black-url prior to 372 and the server is telling the client to bring itself up to version 372 by applying the adds and deletes that follow. The client has some version of acme-white-domain earlier than version 13, but the diff would be longer than the entire version 13 table, so it is sending a complete replacement.
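
A sketch of a wire-format reader for responses like the one above (Python for illustration). The "..." lines are skipped only because they are elisions in the example, and the tolerance for space-separated key/value pairs is an assumption based on the example's formatting.

import re

def parse_wire_format(text):
    """Parse a wire-format response into a list of table dicts holding the
    header fields plus the adds (key -> value) and removes (keys)."""
    tables, current = [], None
    header_re = re.compile(r"\[(\S+) (\d+)\.(\d+)( update)?\]$")
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line == "...":            # "..." is just an elision above
            continue
        match = header_re.match(line)
        if match:
            name, major, minor, update = match.groups()
            current = {"name": name, "major": int(major), "minor": int(minor),
                       "is_update": update is not None, "adds": {}, "removes": []}
            tables.append(current)
        elif line.startswith("+") and current is not None:
            # spec says tab-separated key/value; tolerate runs of spaces too
            key, _, value = line[1:].strip().partition("\t")
            if not value:
                key, _, value = key.partition(" ")
            current["adds"][key.strip()] = value.strip()
        elif line.startswith("-") and current is not None:
            current["removes"].append(line[1:].strip())
    return tables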

The data lines are opaque to the wire format. They come in some format that the extension knows how to use, based on the table type. The wire format reader/writer only knows how to build tables consisting of a map from names to values; it knows nothing of the semantics of the names or values. The user of the tables (the TRTable, in the extension's case) is aware of the semantics of the data, based on the type encoded in the table's name.

To illustrate, in the example above the wire format reader would build an acme-white-domain table from the name/value pairs. This table is part of a TRTable, an object that knows how to do lookups. It knows that because of the table's type ("domain") it can do lookups on the data by parsing the domain from a URL and checking to see if the map has a value "1" for that domain. If so, it's in this whitelist. If not, it isn't.

More complicated types of tables than just domain-, host-, and URL-lookup are possible. For example, a table could map hosts to regular expressions matching phishy pages on the host in question. The wire format reader/writer doesn't need to know anything about this; only the TRTables do.
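
To make the domain-table example above concrete, here is a sketch of a TRTable-style exists() check for a "-domain" table (Python for illustration). The two-label domain heuristic is a simplification, and a real TRTable reads from MozStorage rather than a plain dict.

from urllib.parse import urlsplit

def domain_table_exists(table, url):
    """Parse the host out of the URL and check whether the map holds "1"
    for its domain. The two-label heuristic below mishandles registries
    like .co.uk; real domain parsing is more involved."""
    host = urlsplit(url).hostname or ""
    domain = ".".join(host.split(".")[-2:])
    return table.get(domain) == "1"

acme_white_domain = {"google.com": "1", "slashdot.org": "1", "amazon.co.uk": "1"}
print(domain_table_exists(acme_white_domain, "http://www.google.com/search?q=x"))   # True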

Report Requests

Note: this is not a specification; it's just a high-level description.

In enhanced protection mode the extension reports back interesting phishing-related events to the provider for analysis. For example, it might report that the user declined the warning on a blacklisted page, a cue to the provider that the page might be a false positive (or that the warning is ineffective). For example:

http://.../report?client=foo&evts=phishdecline&evtd=http://somephishydomain.com/login.html

Cookies are stripped from report requests.
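
A sketch of how such a report ping could be assembled (Python for illustration; the server URL is a placeholder, and "phishdecline" is the only event name shown in this document).

from urllib.parse import urlencode

REPORT_SERVER = "http://reportserver.example/report"     # placeholder endpoint

def build_report_url(event, url, client="foo"):
    # Note: urlencode percent-escapes the reported URL; the example above
    # shows it unescaped, so the exact encoding is an assumption.
    return REPORT_SERVER + "?" + urlencode({"client": client, "evts": event, "evtd": url})

print(build_report_url("phishdecline", "http://somephishydomain.com/login.html"))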

GetKey Requests

Note: this is not a specification; it's just a high-level description.

The extension generates a shared secret with the provider by making an HTTPS getkey request. This request looks like:

https://.../getkey?client=foo

And the server responds with a protocol4 message like:

clientkey:24:f/eBhklsZNmOFwelcs0aJg==
wrappedkey:24:MTpGMWDxuDpGg6KDlKKksVdV

These values are used to encrypt lookup requests when in enhanced protection mode (see below).

The Need for Data Collection

Service providers (well, we at Google, anyway) need information to improve our coverage and accuracy. This is a fact, and not some smoke-and-mirrors attempt to violate privacy. If someone does not wish to contribute their data to this effort, fine, but we really need to give those people who do want to contribute the opportunity to do so.

To be clear, there are two kinds of data that are important for us to collect if we want to improve the service:

  1. explicit user reports of phishing sites (and false positives). User reports are a very valuable source of information for us, and it's a requirement that we're able to collect such reports. At the moment the extension adds an item to the Tools menu that enables users to make such reports, but something more appropriate is probably warranted going forward.

  2. automatic reports of "interesting" phishing-related events. In enhanced protection mode (and only in enhanced protection mode) the extension sends cookieless pings to the provider when certain "events" occur. At the moment these reports are generated when the user lands on a blacklisted page, when the user accepts or declines the warning dialog, and when the user navigates away from a phishing page. The information transmitted is what happened, and the URL (and, as we've said, no cookies).

    We use this information to understand what's happening to the users of this feature. How often do people hit these sites? What sites are hit most often? How often do they actually heed the warning? etc.

URL Canonicalization

URL canonicalization is a thorny problem. Encoding is slightly less so. At minimum, there is ample opportunity for the URLs provided by third parties to be encoded differently than by the browser. For example:

  • hex encodings may be upper or lowercase
  • spaces in query parameters might be hex-encoded or plus-encoded
  • the origin of a URL affects its encoding (e.g., a URL with a space taken from an HTML document will retain the space, but the same URL copied from the urlbar of a browser after it has been clicked will generally have a %20)
  • canonicalization or encoding might have been done by the third party

We solve the encoding problem, but not the canonicalization problem. We repeatedly URL-unescape a URL until it has no more hex-encodings, then we escape it once. Yes, this can map several distinct URLs onto the same string, but these cases are rare, and happen primarily in query params. But taking this approach solves a multitude of other potential problems.
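
A sketch of the unescape-until-stable-then-escape-once step (Python for illustration). The set of characters left unescaped below is an assumption; the authoritative canonicalization rules are in the server spec.

from urllib.parse import quote, unquote

def normalize_encoding(url):
    """Repeatedly unescape until no more %XX escapes resolve, then escape
    the result once."""
    previous = None
    while previous != url:
        previous, url = url, unquote(url)
    return quote(url, safe="/:?&=#@+")

print(normalize_encoding("http://host/%2570ath"))        # doubly-encoded -> http://host/path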

Additionally, we canonicalize the hostname as mentioned in the server spec. Enchash lookups involve truncating the hostname at 5 dots. URL and domain table lookups do not do any truncation.

Relationship to Existing Products

SafeBrowsing was released as an extension on labs.google.com and has also been integrated with the Google Toolbar for Firefox. What's the planned relationship between these other incarnations of the product and the extension as it lives in Firefox? Glad you asked.

The stand-alone SafeBrowsing extension is essentially end-of-lifed. We will continue to release bugfixes, but new development has moved elsewhere (either into Firefox or into the Toolbar).

If SB is included in some form in Firefox, canonical development moves into the Moz cvs source tree and we remove/disable SB from the Toolbar as soon as there is a public Firefox release that includes it. If SB isn't included in Firefox, it will continue to live in the Toolbar, and new development will most likely happen in-house.

No matter where it lives, the SB code tries to play nicely with other incarnations of itself. The standalone extension claims compatibility only with Firefox 1.5. SB in the Toolbar is not enabled in 1.0, and disables itself if it notices that the standalone extension is present. If SB is included in Firefox, versions of the Toolbar that include SB will not claim compatibility with versions of Firefox that also include it.

Security

Remote Lookups

In enhanced protection mode, we're shipping the full URLs of every document the user loads off to the lookup server. It would be problematic to have this request go in the clear, but at the same time it would be too expensive to make every lookup request over HTTPS. As a compromise, we use HTTPS to bootstrap a shared secret between the client and server, and use this secret to protect the user's URLs in transit.

Specifically:

  1. Server generates a secret key KS
  2. Client starts up and requests a new key KC from the server via HTTPS
  3. Server generates KC and WrappedKey, which is KC encrypted with KS
  4. Server responds with KC and WrappedKey
  5. When client wants to encrypt a URL, it encrypts it with KC and sends the ciphertext along with WrappedKey to the server via HTTP
  6. Server decrypts WrappedKey with KS to KC, and uses KC to decrypt the URL

The current implementation uses ARC4 for encryption, a 128-bit KC, and MD5(nonce || KC) to derive the actual encryption key, where the nonce is chosen differently for every encryption.
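
The following is a minimal sketch of that scheme in Python: RC4 keyed with MD5(nonce || KC). The nonce size and format, the ciphertext encoding, and how wrappedkey is attached to the request are assumptions; see the TODO below for the precise algorithm and test vectors.

import base64
import hashlib
import os

def rc4(key, data):
    """Plain RC4/ARC4 stream cipher (encryption and decryption are identical)."""
    state = list(range(256))
    j = 0
    for i in range(256):                                  # key scheduling
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    out = bytearray()
    i = j = 0
    for byte in data:                                     # keystream generation
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)

def encrypt_lookup_url(url, client_key, nonce=None):
    """Derive a per-request key as MD5(nonce || KC), then RC4-encrypt the URL."""
    if nonce is None:
        nonce = os.urandom(8)              # nonce size/format is an assumption
    session_key = hashlib.md5(nonce + client_key).digest()
    ciphertext = rc4(session_key, url.encode("utf-8"))
    return nonce, base64.b64encode(ciphertext).decode("ascii")

# KC taken from the getkey example above (base64 "f/eBhklsZNmOFwelcs0aJg==", 16 bytes)
client_key = base64.b64decode("f/eBhklsZNmOFwelcs0aJg==")
nonce, encparams = encrypt_lookup_url("http://www.google.com/", client_key)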

TODO: specify the algorithm more precisely, and give test vectors

Active Adversaries

TODO: discussion of what they can do, what we can do

Snippet from Firefox bug 340061, which was filed before the stated "non-goals" in this wiki were discovered. The bug has been closed for that reason, but the brief narrative of the logical circumvention has been copied from there.


Allowing javascript redirects from a blacklisted site gives the attacker the ability to keep changing the ultimate URL which serves the content to make sure it isn't on the blacklist.

If the scam email (for example) points people to http://mydevserver/fish.htm, then this URL could be added to the blacklist. The scammers could then change fish.htm to redirect to haddock.htm, which could be responded to by adding all of http://mydevserver/* to the blacklist.

But even then the attacker could change fish.htm again to redirect people to http://someotherserver/fish.htm. They would always be able to keep changing the URL of the eventual page to make sure it wasn't on the blacklist, but keep the same entry point at http://mydevserver/fish.htm. Given that attackers get access to the blacklist at the same time as everyone else, they could easily make sure that the page was almost always not blacklisted (provided they have enough URLs to use, which they usually do).



The simple solution to this is to block until an answer has been received from the blacklist lookup, but this would obviously cause unacceptable delays to all pages (in the case of a remote lookup instead of local blacklist). Is there a decent compromise?

Future Work

  • Once we are up and running with Firefox 2.0, we would also like for other client applications to integrate with Phishing Protection. Note that applications that do not have a written agreement with Google are forbidden from using Phishing Protection data at this time. The only acceptable use of the data is via a browser addon distributed by Google.

TODO: more discussion
TODO: discussion of heuristics