Gecko:Effective TLD Service

From MozillaWiki
Revision as of 12:21, 12 April 2009 by Agitos

What is the Effective TLD Service, and why do we need one?

The Effective TLD Service examines a hostname passed to it and determines the longest portion that should be treated as though it were a top-level domain (TLD). Although technically the TLD for a hostname is the last dot-portion of the name (such as .com or .org), many domains (such as co.uk) function as though they were TLDs, allocating any number of more specific, essentially unrelated names beneath them. Put another way, .uk is a TLD, but nobody is allowed to register a domain directly under .uk; the effective TLDs are ac.uk, co.uk, and so on. We wouldn't want to allow any site in *.co.uk to set a cookie for the entire co.uk domain, so it's important to be able to identify which higher-level domains function as effective TLDs.

Other components may also find this service useful. For instance, Places wants to allow users to group history results by domain, which is more useful when it has better knowledge of what constitutes a logical domain.

The domain file

The service obtains its information about effective TLDs from a text file which must be in the following format:

  • The file should be named effective_tld_names.dat
    • A file with that name in the application's "res" directory will always be used.
    • In addition, if a file with the same name is found in the user's profile directory, its contents will also be used, as though they were appended to the system file.
  • It should use the UTF-8 text encoding. Since UTF-8 is a superset of 7-bit ASCII (i.e., plain text with no diacritics or other special characters), a 7-bit ASCII file is also acceptable.
  • Rules will be normalized by nsIDNService::Normalize, which implements RFC 3454. For ASCII, that means they're not case-sensitive; other normalizations are applied for other characters.
  • It should contain one effective-TLD rule per line.
    • Rules are only read up to the first whitespace (space or tab), allowing comments after each rule if desired.
    • An entire line may also be marked as a comment by starting it with two forward slashes (//). Any line starting with // will be completely ignored.
    • Similarly, blank lines will be ignored.
  • Each rule should list the entire TLD-like domain name, with the subdomain portions separated by dots (.) as usual. A leading dot is optional.
  • The wildcard character * (asterisk) will match any valid sequence of characters within a single level of a hostname.
    • Wildcards may only appear as the entire most specific level of a rule. That is, a wildcard must come at the beginning of a line and must be followed by a dot.
  • An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule. An exception rule will be used instead of any other matching rule.
  • True TLDs, i.e. domains with no dots, are never needed in the list, but they may be included for completeness.
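The format rules above can be sketched as a small line parser. This is only an illustration, not the service's own code: the function name parseRuleLine and the shape of the returned object are hypothetical, and toLowerCase() is a stand-in for nsIDNService::Normalize that only covers the ASCII case.

```javascript
// Hypothetical helper, not part of the service: turn one line of
// effective_tld_names.dat into a rule object, or null if it holds no rule.
function parseRuleLine(line) {
  // Lines starting with "//" are comments and are ignored entirely.
  if (line.startsWith("//")) return null;

  // A rule is read only up to the first space or tab.
  let rule = line.split(/[ \t]/, 1)[0];
  if (rule === "") return null; // blank line

  // A leading "!" marks an exception rule.
  const isException = rule.startsWith("!");
  if (isException) rule = rule.slice(1);

  // A wildcard must be the entire most specific level, followed by a dot.
  const isWildcard = rule.startsWith("*.");
  if (isWildcard) {
    rule = rule.slice(2);
  } else if (rule.startsWith(".")) {
    rule = rule.slice(1); // a leading dot is optional
  }

  // Anything left over that is empty or still contains "*" is malformed.
  if (rule === "" || rule.includes("*")) {
    throw new Error("Invalid rule: " + line);
  }

  // Assumption: lowercasing approximates the real normalization for ASCII.
  return { name: rule.toLowerCase(), isException, isWildcard };
}
```

For instance, under these assumptions parseRuleLine("*.hokkaido.jp") yields a wildcard rule named "hokkaido.jp", and a line holding only a // comment yields null.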

Example

(This is only an example of a valid rule format. The rules listed here are not necessarily correct or complete.)

com  // not needed, but acceptable
uk
*.uk
be
ac.be
jp
ac.jp
// Hosts in .hokkaido.jp can't set cookies below level 4...
*.hokkaido.jp
*.tokyo.jp
// ...except hosts in pref.hokkaido.jp, which can set level 3.
!pref.hokkaido.jp
!city.shizuoka.jp
!metro.tokyo.jp

Interpretation

A few additional comments are in order to fully explain how the effective-TLD service interprets the file:

  • Hostnames are also normalized according to RFC 3454 before they are matched against the rules. Thus, among other things, the matching is not case-sensitive.
  • If a hostname matches more than one rule in the file, the longest matching rule (the one with the most levels) will be used.
  • Although exception rules are typically used to override wildcard rules, that is not a requirement. If an exception matches, it will be used, whether or not there is any other rule for it to override.
  • If no rule matches, the first-level domain (the portion of the hostname after its last dot, or the entire hostname if it contains no dots) will be used, whether or not it appears in the file.
  • Longer names do not implicitly include their shorter components. That is, if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk" will match "bar.baz.uk", but "baz.uk" will only match the default first-level domain "uk".

Examples

Given the effective-TLD file shown above, the effective TLDs and their lengths for the following hostnames are:

  • mozilla.org => org => 3
  • cam.ac.uk => ac.uk => 5
  • something.hokkaido.jp => something.hokkaido.jp => 21
  • pref.hokkaido.jp => hokkaido.jp => 11
  • foo.pref.hokkaido.jp => hokkaido.jp => 11
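The interpretation rules and the results listed above can be reproduced with a short matching sketch. It is an illustration under stated assumptions, not the service's implementation: the function names are hypothetical, the rule sets are hand-copied from the example file, lowercasing stands in for RFC 3454 normalization, and the behaviour of an exception rule (the effective TLD is the exception minus its most specific level) is inferred from the pref.hokkaido.jp result.

```javascript
// Rules from the example file, grouped by kind (comments already stripped).
const exactRules = new Set(["com", "uk", "be", "ac.be", "jp", "ac.jp"]);
// Wildcard rules are stored without their leading "*.".
const wildcardRules = new Set(["uk", "hokkaido.jp", "tokyo.jp"]);
// Exception rules are stored without their leading "!".
const exceptionRules = new Set([
  "pref.hokkaido.jp", "city.shizuoka.jp", "metro.tokyo.jp",
]);

// Hypothetical sketch: return the effective TLD of a hostname.
function sketchEffectiveTLD(hostname) {
  const labels = hostname.toLowerCase().split(".");
  const suffix = (i) => labels.slice(i).join(".");

  // An exception rule is used instead of any other matching rule.
  // Assumption: it yields the exception minus its most specific level.
  for (let i = 0; i < labels.length - 1; i++) {
    if (exceptionRules.has(suffix(i))) return suffix(i + 1);
  }

  // Otherwise the longest matching rule wins, so try long suffixes first.
  for (let i = 0; i < labels.length; i++) {
    if (exactRules.has(suffix(i))) return suffix(i);
    // "*.x" matches exactly one extra level in front of "x".
    if (i + 1 < labels.length && wildcardRules.has(suffix(i + 1))) {
      return suffix(i);
    }
  }

  // No rule matched: fall back to the first-level domain.
  return labels[labels.length - 1];
}

// Length in bytes of the effective TLD, measured in UTF-8.
function sketchEffectiveTLDLength(hostname) {
  return new TextEncoder().encode(sketchEffectiveTLD(hostname)).length;
}
```

Because longer names do not implicitly include their shorter components, the lookup checks each suffix of the hostname separately instead of walking up from a matched rule.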

Interface

The service provides a single scriptable function:

    /**
     * getEffectiveTLDLength
     *
     * Finds the length of the effective TLD of a hostname.  An effective TLD
     * is the highest-level domain under which individual domains may be
     * registered, and may therefore contain one or more dots.  For example,
     * the effective TLD for "www.bbc.co.uk" is "co.uk", because the .uk TLD
     * does not allow the registration of domains at the second level ("bbc.uk"
     * is forbidden).  Similarly, the effective TLD of "developer.mozilla.com"
     * is "com".
     *
     * The hostname will be normalized using nsIDNService::Normalize, which
     * follows RFC 3454.  getEffectiveTLDLength() will fail, generating an
     * error, if the hostname contains characters that are invalid in URIs.
     *
     * @param   aHostname   The hostname to be analyzed, in UTF-8
     *
     * @returns the number of bytes that the longest identified effective TLD
     *          (TLD or TLD-like higher-level subdomain) occupies, not including
     *          the leading dot:
     *              bugzilla.mozilla.org -> org -> 3
     *              theregister.co.uk -> co.uk -> 5
     *              mysite.us -> us -> 2
     */
    PRInt32 getEffectiveTLDLength(in AUTF8String aHostname);

To call the service from JavaScript, use the getService function. (The #include "nsIEffectiveTLDService.h" line is only needed by C++ callers.)

var tld 
    = Components.classes["@mozilla.org/network/effective-tld-service;1"]
                .getService(Components.interfaces.nsIEffectiveTLDService);

// hostname is the host to analyze, e.g. "www.bbc.co.uk"
var tldLength = tld.getEffectiveTLDLength(hostname);
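Since the service returns only a length, a caller has to slice the hostname itself. The following is a minimal sketch of that step, assuming an ASCII hostname (so that bytes and JavaScript string positions coincide). The helper name splitHostname and its return shape are hypothetical, and "logicalDomain" here simply means the effective TLD plus one more level.

```javascript
// Hypothetical helper: split an ASCII hostname using the length that
// getEffectiveTLDLength() returned for it.
function splitHostname(hostname, tldLength) {
  // The effective TLD is the last tldLength characters of the hostname.
  const effectiveTLD = hostname.slice(hostname.length - tldLength);

  // Everything before the effective TLD and its leading dot.
  const restLength = Math.max(0, hostname.length - tldLength - 1);
  const rest = hostname.slice(0, restLength);

  // The logical domain adds the level directly above the effective TLD.
  // If the hostname is itself an effective TLD, there is no such level.
  const lastLabel = rest === "" ? "" : rest.split(".").pop();
  const logicalDomain = lastLabel === "" ? null : lastLabel + "." + effectiveTLD;

  return { effectiveTLD, logicalDomain };
}
```

With a length of 5 for "www.bbc.co.uk", this gives "co.uk" as the effective TLD and "bbc.co.uk" as the logical domain. A cookie check along the lines described at the top of this page could then refuse any cookie domain for which logicalDomain is null.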

--PamG 11:52, 7 June 2006 (PDT)