Gecko:Effective TLD Service
Revision as of 18:34, 7 June 2006
What is the Effective TLD Service, and why do we need one?
The Effective TLD Service examines a hostname passed to it and determines the longest portion that should be treated as though it were a top-level domain (TLD). Although technically the TLD for a hostname is the last dot-portion of the name (such as .com or .org), many domains (such as co.uk) function as though they were TLDs, allocating any number of more specific, essentially unrelated names beneath them. Put another way, .uk is a TLD, but nobody is allowed to register a domain directly under .uk; the effective TLDs are ac.uk, co.uk, and so on. We wouldn't want to allow any site in *.co.uk to set a cookie for the entire co.uk domain, so it's important to be able to identify which higher-level domains function as effective TLDs.
Other components may also find this service useful. For instance, Places wants to allow users to group history results by domain, which is more useful when it has better knowledge of what constitutes a logical domain.
The domain file
The service obtains its information about effective TLDs from a text file which must be in the following format:
- The file should be named effective_tld_names.dat
- If a file with that name is found in the user's profile directory, it will be used.
- Otherwise, the file from the application's "res" directory will be used.
- It should use the UTF-8 text encoding. Since UTF-8 is a superset of 7-bit ASCII (i.e., plain text with no diacritics or other special characters), a 7-bit ASCII file is also acceptable.
- Rules will be normalized by nsIDNService::Normalize, which implements RFC 3454. For ASCII, that means they're not case-sensitive; other normalizations are applied for other characters.
- It should contain one effective-TLD rule per line.
- Rules are only read up to the first whitespace (space or tab), allowing comments after each rule if desired.
- An entire line may also be marked as a comment by starting it with two forward slashes (//). Any line starting with // will be completely ignored.
- Similarly, blank lines will be ignored.
- Each rule should list the entire TLD-like domain name, with the subdomain portions separated by dots (.) as usual. A leading dot is optional.
- The wildcard character * (asterisk) will match any valid sequence of characters.
- Wildcards may only appear as an entire level of a rule. That is, they must be surrounded by dots (or implicit dots, at the beginning of a line).
- An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule. An exception rule will be used instead of any other matching rule.
- True TLDs, i.e. domains with no dots, are never needed in the list, but they may be included for completeness.
Example
com              // not needed, but acceptable
uk
*.uk
be
ac.be
jp
ac.jp
// Hosts in .hokkaido.jp can't set cookies below level 4...
*.hokkaido.jp
*.tokyo.jp
// ...except hosts in pref.hokkaido.jp, which can set level 3.
!pref.hokkaido.jp
!city.shizuoka.jp
!metro.tokyo.jp
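The format rules above can be sketched as a small parser. This is only an illustration, not the service's actual code: the function name parseEffectiveTldFile and the shape of the returned objects are hypothetical, NFKC normalization plus lowercasing stands in for nsIDNService::Normalize, and silently dropping rules with a misplaced wildcard is an assumption about error handling.

```javascript
// Hypothetical helper: turn the text of an effective-TLD file into rule objects.
function parseEffectiveTldFile(text) {
  const rules = [];
  for (const line of text.split(/\r?\n/)) {
    // Lines starting with // are comments and are ignored entirely.
    if (line.startsWith("//")) continue;
    // A rule is read only up to the first space or tab.
    let rule = line.split(/[ \t]/, 1)[0];
    // Blank lines (and lines with no rule before the whitespace) are skipped.
    if (rule === "") continue;
    // A leading ! marks an exception rule.
    const isException = rule.startsWith("!");
    if (isException) rule = rule.slice(1);
    // A leading dot is optional.
    if (rule.startsWith(".")) rule = rule.slice(1);
    // Assumption: NFKC + lowercase approximates the RFC 3454 normalization.
    rule = rule.normalize("NFKC").toLowerCase();
    const labels = rule.split(".");
    // A wildcard must make up an entire level; this sketch drops other uses.
    if (labels.some((label) => label.includes("*") && label !== "*")) continue;
    rules.push({ labels, isException });
  }
  return rules;
}
```

Each returned object keeps the rule's levels in written order, so a later matching step can walk them from the end of the name.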
Interpretation
A few additional comments are in order to fully explain how the effective-TLD service interprets the file:
- Hostnames are also normalized according to RFC 3454 before they are matched against the rules. Thus, among other things, the matching is not case-sensitive.
- If a hostname matches one or more exception rules, the shortest matching exception will be used.
- If a hostname matches more than one non-exception rule in the file, the longest matching rule (the one with the most levels) will generally be used, unless an exception rule also matches.
- However, a shorter rule that contains no wildcards will be used in preference to a longer wildcard rule, and a rule that falls back to a wildcard match at a higher level (closer to the beginning of the name) will be used in preference to one that uses a wildcard at a lower level.
- Specifically, the service begins matching hostnames one level at a time, starting from the end of the name. At each level, a rule that is an exact match is always used in preference to a rule that only matches because of a wildcard. Therefore, a longer rule might be "missed" if it requires using a wildcard at a level for which an exact match also exists.
- Although exception rules are typically used to override wildcard rules, that is not a requirement. If an exception matches, it will be used, whether or not there is any other rule for it to override.
- If no rule matches, the first-level domain (the portion of the hostname after its last dot, or the entire hostname if it contains no dots) will be used, whether or not it appears in the file.
- Longer names do not implicitly include their shorter components. That is, if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk" will match "bar.baz.uk", but "baz.uk" will only match the default first-level domain "uk".
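The level-by-level matching described in this list can be sketched with a tree of rules keyed by label. This is one possible reading, not the Gecko implementation: buildRuleTree and effectiveTld are hypothetical names, plain lowercasing stands in for the RFC 3454 normalization, and the treatment of a matching exception (the effective TLD is the exception rule minus its leftmost level) is inferred from the comments in the example file.

```javascript
// Hypothetical sketch: store rules in a tree, walking each rule from its last label.
function buildRuleTree(ruleStrings) {
  const newNode = () => ({ children: new Map(), isRule: false, isException: false });
  const root = newNode();
  for (const raw of ruleStrings) {
    const isException = raw.startsWith("!");
    const labels = (isException ? raw.slice(1) : raw).toLowerCase().split(".").reverse();
    let node = root;
    for (const label of labels) {
      if (!node.children.has(label)) node.children.set(label, newNode());
      node = node.children.get(label);
    }
    node.isRule = true;
    node.isException = isException;
  }
  return root;
}

// Return the effective TLD of hostname according to the rule tree.
function effectiveTld(root, hostname) {
  const labels = hostname.toLowerCase().split(".");
  let node = root;
  let tldLabels = 1; // default when nothing matches: the first-level domain
  for (let i = labels.length - 1; i >= 0; i--) {
    // An exact match at this level always beats a wildcard; no backtracking.
    const next = node.children.get(labels[i]) || node.children.get("*");
    if (!next) break;
    node = next;
    const depth = labels.length - i;
    if (node.isException) {
      // The first (shortest) exception reached wins. Assumption: it yields
      // the exception rule minus its leftmost level.
      tldLabels = Math.max(1, depth - 1);
      break;
    }
    // Intermediate levels are not rules unless listed explicitly.
    if (node.isRule) tldLabels = depth;
  }
  return labels.slice(labels.length - tldLabels).join(".");
}

const exampleTree = buildRuleTree([
  "com", "uk", "*.uk", "be", "ac.be", "jp", "ac.jp",
  "*.hokkaido.jp", "*.tokyo.jp",
  "!pref.hokkaido.jp", "!city.shizuoka.jp", "!metro.tokyo.jp",
]);
console.log(effectiveTld(exampleTree, "cam.ac.uk"));
```

Because the walk never backtracks, a longer rule that needs a wildcard at a level where an exact match exists is skipped, which is the "missed" case noted above.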
Examples
Given the effective-TLD file shown above, the effective TLDs and their lengths (in bytes) for the following hostnames are:
- mozilla.org => org => 3
- cam.ac.uk => ac.uk => 5
- something.hokkaido.jp => something.hokkaido.jp => 21
- pref.hokkaido.jp => hokkaido.jp => 11
- foo.pref.hokkaido.jp => hokkaido.jp => 11
Interface
The service provides a number of functions:
PRInt32 getEffectiveTLDLengthForHost(in AUTF8String aHostname);
AUTF8String getEffectiveTLDForHost(in AUTF8String aHostname);
PRInt32 getPrivateDomainLengthForHost(in AUTF8String aHostname);
AUTF8String getPrivateDomainForHost(in AUTF8String aHostname);
PRInt32 getEffectiveTLDLengthForURI(in nsIURI aURI);
AUTF8String getEffectiveTLDForURI(in nsIURI aURI);
PRInt32 getPrivateDomainLengthForURI(in nsIURI aURI);
AUTF8String getPrivateDomainForURI(in nsIURI aURI);
These functions vary along three axes: they operate on either a string hostname or an nsIURI object; they extract either the effective TLD or the first level of the private domain (that is, the effective TLD plus one additional subdomain level); and they return either the length of the result in bytes or the string result itself.
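The relationship between the effective TLD, the private domain, and the byte lengths can be sketched as follows. Both helpers are hypothetical and not part of nsIEffectiveTLDService; returning null when the hostname has no level beyond its effective TLD is an assumption, since that case is not described here.

```javascript
// Hypothetical helper: the private domain is the effective TLD plus one more level.
function privateDomain(hostname, effectiveTld) {
  const suffix = "." + effectiveTld;
  // Assumption: no private domain exists if nothing precedes the effective TLD.
  if (!hostname.endsWith(suffix)) return null;
  const rest = hostname.slice(0, hostname.length - suffix.length);
  // Keep only the level immediately before the effective TLD.
  return rest.slice(rest.lastIndexOf(".") + 1) + suffix;
}

// Hypothetical helper: length of a string in bytes when encoded as UTF-8.
function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

console.log(privateDomain("www.cam.ac.uk", "ac.uk"), utf8Length("ac.uk"));
```

For ASCII hostnames the byte length equals the character count; the two differ only for names containing non-ASCII characters.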
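To call the service, use the getService function. C++ callers include the interface header:
#include "nsIEffectiveTLDService.h"
From JavaScript:
// Obtain the service through XPCOM.
var tld = Components.classes["@mozilla.org/network/effective-tld-service;1"]
                    .getService(Components.interfaces.nsIEffectiveTLDService);
// hostname is a string such as "cam.ac.uk".
var hostLength = tld.getEffectiveTLDLengthForHost(hostname);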
--PamG 17:25, 2 June 2006 (PDT)