Gecko:Effective TLD Service
Revision as of 21:54, 27 September 2007
What is the Effective TLD Service, and why do we need one?
The Effective TLD Service examines a hostname passed to it and determines the longest portion that should be treated as though it were a top-level domain (TLD). Although technically the TLD for a hostname is the last dot-portion of the name (such as .com or .org), many domains (such as co.uk) function as though they were TLDs, allocating any number of more specific, essentially unrelated names beneath them. Put another way, .uk is a TLD, but nobody is allowed to register a domain directly under .uk; the effective TLDs are ac.uk, co.uk, and so on. We wouldn't want to allow any site in *.co.uk to set a cookie for the entire co.uk domain, so it's important to be able to identify which higher-level domains function as effective TLDs.
Other components may also find this service useful. For instance, Places wants to allow users to group history results by domain, which is more useful when it has better knowledge of what constitutes a logical domain.
The domain file
The service obtains its information about effective TLDs from a text file which must be in the following format:
- The file should be named effective_tld_names.dat
  - A file with that name in the application's "res" directory will always be used.
  - In addition, if a file with the same name is found in the user's profile directory, its contents will also be used, as though they were appended to the system file.
- It should use the UTF-8 text encoding. Since UTF-8 is a superset of 7-bit ASCII (i.e., plain text with no diacritics or other special characters), a 7-bit ASCII file is also acceptable.
  - Rules will be normalized by nsIDNService::Normalize, which implements RFC 3454. For ASCII, that means they're not case-sensitive; other normalizations are applied for other characters.
- It should contain one effective-TLD rule per line.
  - Rules are only read up to the first whitespace (space or tab), allowing comments after each rule if desired.
  - An entire line may also be marked as a comment by starting it with two forward slashes (//). Any line starting with // will be completely ignored.
  - Similarly, blank lines will be ignored.
- Each rule should list the entire TLD-like domain name, with the subdomain portions separated by dots (.) as usual. A leading dot is optional.
- The wildcard character * (asterisk) will match any valid sequence of characters.
  - Wildcards may only appear as the entire most specific level of a rule. That is, a wildcard must come at the beginning of a line and must be followed by a dot.
- An exclamation mark (!) at the start of a rule marks an exception to a previous wildcard rule. An exception rule will be used instead of any other matching rule.
- True TLDs, i.e. domains with no dots, are never needed in the list, but they may be included for completeness.
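As a rough illustration of the format rules above, the following JavaScript sketch parses the text of a rule file into rule objects. It is not the service's actual parser: the function name parseEffectiveTldRules and the shape of the returned objects are made up for this example, toLowerCase() stands in for nsIDNService::Normalize (which it only approximates for ASCII rules), and skipping a line that begins with whitespace is an assumption.

```javascript
// Hypothetical helper, not the service's real parser.
function parseEffectiveTldRules(text) {
  const rules = [];
  for (const line of text.split(/\r?\n/)) {
    // A line starting with // is a comment and is ignored entirely.
    if (line.startsWith("//")) continue;
    // A rule is read only up to the first space or tab.
    let rule = line.split(/[ \t]/, 1)[0];
    // Blank lines (and, by assumption, lines starting with whitespace) carry no rule.
    if (rule === "") continue;
    // A leading ! marks an exception rule.
    const exception = rule.startsWith("!");
    if (exception) rule = rule.slice(1);
    // A leading dot is optional.
    if (rule.startsWith(".")) rule = rule.slice(1);
    // ASCII-only stand-in for RFC 3454 normalization (assumption).
    const labels = rule.toLowerCase().split(".");
    // A wildcard must be the entire first label and must be followed by a dot.
    const wildcard = labels[0] === "*";
    const misplaced = labels.some(
      (label, i) => label.includes("*") && !(i === 0 && label === "*")
    );
    if (misplaced || (wildcard && labels.length < 2)) {
      throw new Error("Invalid wildcard in rule: " + rule);
    }
    rules.push({ exception, wildcard, labels });
  }
  return rules;
}

// Usage: list the rule names found in a small file.
const parsed = parseEffectiveTldRules("com // acceptable\n*.uk\n!pref.hokkaido.jp\n");
console.log(parsed.map((r) => (r.exception ? "!" : "") + r.labels.join(".")));
```

Rejecting a misplaced wildcard with an error is a choice made for this sketch; the text above does not say how the service reacts to a malformed rule.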
Example
com // not needed, but acceptable
uk
*.uk
be
ac.be
jp
ac.jp
// Hosts in .hokkaido.jp can't set cookies below level 4...
*.hokkaido.jp
*.tokyo.jp
// ...except hosts in pref.hokkaido.jp, which can set level 3.
!pref.hokkaido.jp
!city.shizuoka.jp
!metro.tokyo.jp
Interpretation
A few additional comments are in order to fully explain how the effective-TLD service interprets the file:
- Hostnames are also normalized according to RFC 3454 before they are matched against the rules. Thus, among other things, the matching is not case-sensitive.
- If a hostname matches one or more exception rules, the shortest matching exception (the one with the fewest levels) will be used.
- If a hostname matches more than one non-exception rule in the file, the longest matching rule (the one with the most levels) will generally be used, unless an exception rule also matches.
- However, a shorter rule that contains no wildcards will be used in preference to a longer wildcard rule, and a rule that falls back to a wildcard match at a higher level (closer to the beginning of the name) will be used in preference to one that uses a wildcard at a lower level.
- Specifically, the service begins matching hostnames one level at a time, starting from the end of the name. At each level, a rule that is an exact match is always used in preference to a rule that only matches because of a wildcard. Therefore, a longer rule might be "missed" if it requires using a wildcard at a level for which an exact match also exists.
- Although exception rules are typically used to override wildcard rules, that is not a requirement. If an exception matches, it will be used, whether or not there is any other rule for it to override.
- If no rule matches, the first-level domain (the portion of the hostname after its last dot, or the entire hostname if it contains no dots) will be used, whether or not it appears in the file.
- Longer names do not implicitly include their shorter components. That is, if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk" will match "bar.baz.uk", but "baz.uk" will only match the default first-level domain "uk".
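The level-by-level matching described above can be sketched in JavaScript as a walk over a tree of labels. This is an illustration only, not the service's implementation: buildRuleTree, effectiveTld and effectiveTldLength are hypothetical names, rules are assumed to be already normalized strings, and the handling of an exception (the effective TLD is taken to be the levels matched before the exception's own label) is inferred from the examples on this page rather than stated in the rules.

```javascript
// Hypothetical sketch; rule strings are assumed to be normalized already
// (lowercase, no comments, no leading dot).
function buildRuleTree(ruleStrings) {
  const newNode = () => ({ children: new Map(), isRule: false, isException: false });
  const root = newNode();
  for (const raw of ruleStrings) {
    const isException = raw.startsWith("!");
    // Store labels from the end of the name, since matching starts there.
    const labels = (isException ? raw.slice(1) : raw).split(".").reverse();
    let node = root;
    for (const label of labels) {
      if (!node.children.has(label)) node.children.set(label, newNode());
      node = node.children.get(label);
    }
    if (isException) node.isException = true;
    else node.isRule = true;
  }
  return root;
}

function effectiveTld(root, hostname) {
  const labels = hostname.toLowerCase().split(".");
  let node = root;
  let matched = 0; // levels walked so far, counted from the end of the name
  let best = 1;    // default: the first-level domain
  for (let i = labels.length - 1; i >= 0; i--) {
    // An exact match at this level always beats a wildcard match.
    const child = node.children.get(labels[i]) || node.children.get("*");
    if (!child) break;
    if (child.isException) {
      // Assumption: an exception yields the levels matched before it.
      best = Math.max(matched, 1);
      break;
    }
    matched++;
    // Only levels where a rule actually ends count as a match.
    if (child.isRule) best = matched;
    node = child;
  }
  return labels.slice(labels.length - best).join(".");
}

// Length in UTF-8 bytes, mirroring what the service reports.
function effectiveTldLength(root, hostname) {
  return new TextEncoder().encode(effectiveTld(root, hostname)).length;
}

// Usage with a few of the rules from the example file.
const tree = buildRuleTree(["jp", "*.hokkaido.jp", "!pref.hokkaido.jp"]);
console.log(effectiveTld(tree, "foo.pref.hokkaido.jp"));
```

Because the walk commits to the exact child whenever one exists, it also reproduces the case mentioned above in which a longer wildcard rule is missed.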
Examples
Given the effective-TLD file shown above, the effective TLDs and their lengths (in bytes) for the following hostnames are:
- mozilla.org => org => 3
- cam.ac.uk => ac.uk => 5
- something.hokkaido.jp => something.hokkaido.jp => 21
- pref.hokkaido.jp => hokkaido.jp => 11
- foo.pref.hokkaido.jp => hokkaido.jp => 11
Interface
The service provides a single scriptable function:
/**
 * getEffectiveTLDLength
 *
 * Finds the length of the effective TLD of a hostname. An effective TLD
 * is the highest-level domain under which individual domains may be
 * registered, and may therefore contain one or more dots. For example,
 * the effective TLD for "www.bbc.co.uk" is "co.uk", because the .uk TLD
 * does not allow the registration of domains at the second level ("bbc.uk"
 * is forbidden). Similarly, the effective TLD of "developer.mozilla.com"
 * is "com".
 *
 * The hostname will be normalized using nsIDNService::Normalize, which
 * follows RFC 3454. getEffectiveTLDLength() will fail, generating an
 * error, if the hostname contains characters that are invalid in URIs.
 *
 * @param aHostname The hostname to be analyzed, in UTF-8
 *
 * @returns the number of bytes that the longest identified effective TLD
 *          (TLD or TLD-like higher-level subdomain) occupies, not including
 *          the leading dot:
 *              bugzilla.mozilla.org -> org -> 3
 *              theregister.co.uk -> co.uk -> 5
 *              mysite.us -> us -> 2
 */
PRInt32 getEffectiveTLDLength(in AUTF8String aHostname);
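To call the service from JavaScript, obtain it with the getService function:
// From C++, #include "nsIEffectiveTLDService.h" and obtain the service with do_GetService instead.
var tld = Components.classes["@mozilla.org/network/effective-tld-service;1"]
                    .getService(Components.interfaces.nsIEffectiveTLDService);
// Length in bytes of the effective TLD of hostname (e.g. 5 for "theregister.co.uk").
var tldLength = tld.getEffectiveTLDLength(hostname);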
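Since the service returns only a length, callers have to slice the hostname themselves. The sketch below shows one way to turn the value returned by getEffectiveTLDLength into the effective TLD and the name registered directly beneath it. The helper splitByEffectiveTld and the property names effectiveTld and baseDomain are hypothetical and not part of the service; the code works on UTF-8 bytes because the interface documents the length as a byte count.

```javascript
// Hypothetical helper: split a hostname using the byte length that
// getEffectiveTLDLength reported for it.
function splitByEffectiveTld(hostname, tldByteLength) {
  const bytes = new TextEncoder().encode(hostname);
  if (tldByteLength <= 0 || tldByteLength > bytes.length) {
    throw new RangeError("length does not fit hostname");
  }
  // The effective TLD is the last tldByteLength bytes of the hostname.
  const effectiveTld = new TextDecoder().decode(
    bytes.subarray(bytes.length - tldByteLength)
  );
  // Everything before the effective TLD, without the separating dot.
  const prefix = hostname
    .slice(0, hostname.length - effectiveTld.length)
    .replace(/\.$/, "");
  // The hostname is itself an effective TLD: nothing is registered under it.
  if (prefix === "") return { effectiveTld, baseDomain: null };
  // Keep only the label directly under the effective TLD.
  const lastLabel = prefix.slice(prefix.lastIndexOf(".") + 1);
  return { effectiveTld, baseDomain: lastLabel + "." + effectiveTld };
}

// Usage: 5 is the length documented above for a host under co.uk.
console.log(splitByEffectiveTld("www.bbc.co.uk", 5));
```

A cookie check along the lines described in the introduction could then compare a requested cookie domain against effectiveTld and refuse it when the two are equal.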
Links
Gecko:Effective TLD List - team page for the effective TLD list
--PamG 11:52, 7 June 2006 (PDT)