Gecko:Effective TLD Service

From MozillaWiki
== What is the TLD Service, and why do we need one? ==
[http://www.publicsuffix.org/ publicsuffix.org] is the home of the Public Suffix List, and has the most up-to-date documentation on the list format and usage.
 
The TLD service examines a hostname passed to it and determines the longest portion that should be treated as a top-level domain (TLD).  Although technically the TLD for a hostname is the last dot-portion of the name (such as .com or .org), many domains (such as ac.uk) function as though they were TLDs, allocating any number of more specific, essentially unrelated names beneath them.  We wouldn't want to allow any site in *.ac.uk to set a cookie for the entire ac.uk domain, so it's important to be able to identify which subdomains function as effective TLDs.
 
In the remainder of this document, "TLD" will refer to any TLD or TLD-like domain: anything we wouldn't want to treat as a single entity when setting or retrieving cookies.
 
Incidentally, other components may also find this service useful.  For instance, [[Places]] wants to allow users to group history results by domain, which is more useful when it has better knowledge of what constitutes a logical domain.
 
== The domain file ==
 
The service obtains its information about TLDs from a text file in the following format:
 
* The file should be named '''tld_names.dat'''
** If a file with that name is found in the user's profile directory, it will be used.
** Otherwise, the file from the application's "res" directory will be used.
* It should use the UTF-8 text encoding.  Since UTF-8 is a superset of 7-bit ASCII (i.e., plain text with no diacritics or other special characters), a 7-bit ASCII file is also acceptable.
* It should contain one TLD rule per line.
** Rules are only read up to the first whitespace (space or tab), allowing comments after each rule if desired.
** An entire line may also be marked as a comment by starting it with two forward slashes ('''//'''). Any line starting with // will be completely ignored.
** Similarly, blank lines will be ignored.
* Each TLD rule should list the entire TLD-like domain name, with the subdomain portions separated by dots ('''.''') as usual.  A leading dot is optional.
* The wildcard character '''*''' (asterisk) will match any valid sequence of characters.
** Wildcards may only appear as an entire level of a TLD.  That is, they must be surrounded by dots (or implicit dots, at the beginning of a line).
* An exclamation mark ('''!''') at the start of a rule marks an exception to a previous wildcard rule.  An exception rule will be used instead of any other matching rule.
* True TLDs, i.e. domains with no dots, are never needed in the list, but they may be included for completeness.
 
=== Example ===
 
<pre>
com  // not needed, but acceptable
uk
*.uk
be
ac.be
jp
ac.jp
// Hosts in .hokkaido.jp can't set cookies below level 4...
*.hokkaido.jp
*.tokyo.jp
// ...except hosts in pref.hokkaido.jp, which can set level 3.
!pref.hokkaido.jp
!city.shizuoka.jp
!metro.tokyo.jp
</pre>
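
The format rules above can be sketched as a small parser. This is only an illustration, not the service's actual code: <code>parseTldRules</code> and the shape of the returned objects are hypothetical, and silently skipping a rule whose wildcard is not a whole level is an assumption, since this page does not say how malformed rules are handled.

```javascript
// Hypothetical helper: turn the text of a tld_names.dat-style file
// into a list of { labels, isException } objects.
function parseTldRules(text) {
  const rules = [];
  for (const line of text.split(/\r?\n/)) {
    // Blank lines and whole-line comments are ignored.
    if (line.trim() === "" || line.startsWith("//")) continue;
    // A rule is read only up to the first space or tab.
    let rule = line.split(/[ \t]/)[0];
    if (rule === "") continue;
    // A leading "!" marks an exception rule.
    const isException = rule.startsWith("!");
    if (isException) rule = rule.slice(1);
    // A leading dot is optional.
    if (rule.startsWith(".")) rule = rule.slice(1);
    const labels = rule.split(".");
    // Assumption: skip rules whose wildcard is not an entire level.
    if (labels.some((l) => l.includes("*") && l !== "*")) continue;
    rules.push({ labels, isException });
  }
  return rules;
}

const sample = [
  "com  // not needed, but acceptable",
  "uk",
  "*.uk",
  "// Hosts in .hokkaido.jp can't set cookies below level 4...",
  "*.hokkaido.jp",
  "",
  "!pref.hokkaido.jp",
].join("\n");

const rules = parseTldRules(sample);
for (const r of rules) {
  console.log((r.isException ? "exception: " : "rule: ") + r.labels.join("."));
}
```

Keeping each rule as a list of labels, rather than a single string, makes it easy to compare rules against a hostname one level at a time.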
 
== Interpretation ==
 
A few additional comments are in order to fully explain how the TLD service interprets the file:
 
* If a hostname matches one or more exception rules, the ''shortest'' matching exception will be used.
* If a hostname matches more than one non-exception rule in the file, the ''longest'' matching rule (the one with the most levels) will generally be used, unless an exception rule also matches.
* However, a shorter rule that contains no wildcards will be used in preference to a longer wildcard rule, and a rule that falls back to a wildcard match at a higher level (closer to the beginning of the name) will be used in preference to one that uses a wildcard at a lower level.
** Specifically, the service matches hostnames one level at a time, starting from the end of the name.  At each level, a rule that is an exact match is always used in preference to a rule that only matches because of a wildcard.  Therefore, a longer rule might be "missed" if it requires using a wildcard at a level for which an exact match also exists.
* Although exception rules are typically used to override wildcard rules, that is not a requirement.  If an exception matches, it will be used, whether or not there is any other rule for it to override.
* If no rule matches, the first-level domain (the portion of the hostname after its last dot, or the entire hostname if it contains no dots) will be used, whether or not it appears in the file.
* Longer names do not implicitly include their shorter components.  That is, if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk" will match "bar.baz.uk", but "baz.uk" will only match the default first-level domain "uk".
 
=== Examples ===
 
Given the TLD file shown above,
 
* mozilla.org => org => 3
* cam.ac.uk => ac.uk => 5
* something.hokkaido.jp => something.hokkaido.jp => 21
* pref.hokkaido.jp => hokkaido.jp => 11
* foo.pref.hokkaido.jp => hokkaido.jp => 11
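
The interpretation rules and the examples above can be reproduced with a small label trie walked from the end of the hostname. This is a sketch under assumptions, not the service's implementation: <code>buildTrie</code> and <code>effectiveTld</code> are hypothetical names, the rules are passed in already stripped of comments, and the treatment of exceptions (the TLD is the exception rule minus its leftmost label) is inferred from the pref.hokkaido.jp example rather than stated explicitly on this page.

```javascript
// Hypothetical sketch: build a trie keyed by labels, last label first.
function buildTrie(ruleStrings) {
  const newNode = () => ({ children: new Map(), isRule: false, isException: false });
  const root = newNode();
  for (let rule of ruleStrings) {
    const isException = rule.startsWith("!");
    if (isException) rule = rule.slice(1);
    let node = root;
    // Insert labels from the end of the name ("jp" before "hokkaido").
    for (const label of rule.split(".").reverse()) {
      if (!node.children.has(label)) node.children.set(label, newNode());
      node = node.children.get(label);
    }
    if (isException) node.isException = true;
    else node.isRule = true;
  }
  return root;
}

function effectiveTld(root, hostname) {
  const labels = hostname.split(".").reverse();
  let node = root;
  let levels = 1; // default: the first-level domain, even if no rule matches
  for (let i = 0; i < labels.length; i++) {
    // An exact match at this level always wins over a wildcard.
    const child = node.children.get(labels[i]) || node.children.get("*");
    if (!child) break;
    if (child.isException) {
      // Walking from the end reaches the shortest exception first.
      // Inferred from the examples: drop the exception's leftmost label.
      levels = Math.max(i, 1);
      break;
    }
    if (child.isRule) levels = i + 1;
    node = child;
  }
  return labels.slice(0, levels).reverse().join(".");
}

const root = buildTrie([
  "com", "uk", "*.uk", "be", "ac.be", "jp", "ac.jp",
  "*.hokkaido.jp", "*.tokyo.jp",
  "!pref.hokkaido.jp", "!city.shizuoka.jp", "!metro.tokyo.jp",
]);

for (const host of ["mozilla.org", "cam.ac.uk", "something.hokkaido.jp",
                    "pref.hokkaido.jp", "foo.pref.hokkaido.jp"]) {
  const tld = effectiveTld(root, host);
  // For ASCII hostnames the string length equals the byte length.
  console.log(host + " => " + tld + " => " + tld.length);
}
```

Because intermediate trie nodes are not marked as rules, a rule such as "bar.baz.uk" does not make "baz.uk" match anything; it falls back to the default first-level domain, as described above.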


== Interface ==


The only public, scriptable function provided by the TLD service is <code>getHostTLDLength</code>. For the implementation, see [http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/nsEffectiveTLDService.cpp nsEffectiveTLDService.cpp].
<pre>
    /**
    * getHostTLDLength
    *
    * Finds the longest known top-level domain name at the end of a hostname.
    *
    * @param  hostname  The hostname to be analyzed, in UTF-8
    *
    * @returns the number of bytes that the longest identified TLD/subdomain
    *          occupies, not including the leading dot:
    *              bugzilla.mozilla.org -> org -> 3
    *              theregister.co.uk -> co.uk -> 5
    *              mysite.us -> us -> 2
    */
    PRInt32 getHostTLDLength(in ACString aHostname);
</pre>
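
A caller will typically use the returned length to split a hostname into its TLD and the name registered directly beneath it, for example when deciding how widely a cookie may be set or when grouping history by domain. The helper below is a hypothetical illustration, not part of the service. It assumes an ASCII hostname, so that the byte length returned by the service equals the JavaScript string length.

```javascript
// Hypothetical helper: return the TLD plus one more label, or null
// if the hostname is itself a TLD. Assumes an ASCII hostname.
function baseDomain(hostname, tldLength) {
  // Index of the dot immediately before the TLD.
  const dotBeforeTld = hostname.length - tldLength - 1;
  if (dotBeforeTld < 0) return null; // nothing precedes the TLD
  // Find the dot before the preceding label, if there is one.
  const prevDot = hostname.lastIndexOf(".", dotBeforeTld - 1);
  return hostname.slice(prevDot + 1);
}

console.log(baseDomain("bugzilla.mozilla.org", 3)); // mozilla.org
console.log(baseDomain("theregister.co.uk", 5));    // theregister.co.uk
console.log(baseDomain("mysite.us", 2));            // mysite.us
console.log(baseDomain("co.uk", 5));                // null
```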
 
To call the service, use the getService function:
 
<pre>
// Obtain the TLD service through XPCOM.
var tld
    = Components.classes["@mozilla.org/network/tld-service;1"]
                .getService(Components.interfaces.nsITLDService);

// Example hostname (UTF-8); any hostname string can be passed here.
var hostname = "bugzilla.mozilla.org";
var hostLength = tld.getHostTLDLength(hostname);  // 3, the length of "org"
</pre>

--[[User:PamG|PamG]] 17:42, 26 May 2006 (PDT)

== Links ==

* [https://publicsuffix.org/learn/ Registered Domain Libraries] - Libraries in C, PHP, Perl, Ruby, Go, Python and other languages, as well as other documented use cases that rely on the effective TLD list to calculate registered domain names

Latest revision as of 00:55, 26 July 2019
