| Line 1: | Line 1: | ||
This page explains the plan for changing the mechanism by which Firefox decides whether to display a given IDN domain label (a domain name is made up of one or more labels, separated by dots) in its Unicode or Punycode form. | |||
This page | |||
==Background== | ==Background== | ||
| Line 13: | Line 11: | ||
Our current algorithm is to display as Unicode all IDN labels within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in. | Our current algorithm is to display as Unicode all IDN labels within TLDs on our [http://www.mozilla.org/projects/security/tld-idn-policy-list.html whitelist], and display as Punycode otherwise. We check the anti-spoofing policies of a registry before adding their TLD to the whitelist. The TLD operator must apply directly (they cannot be nominated by another person), and on several occasions we have required policy updates or implementation as a condition of getting in. | ||
We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ?) | We also have a character blacklist - characters we will never display under any circumstances. This includes those which could be used to spoof the separators "/" and ".", and invisible characters. (XXX Do we need to update this to remove some of those, like ZWJ/ZWNJ, for IDNA2008?) | ||
===Need For Change=== | ===Need For Change=== | ||
| Line 38: | Line 36: | ||
The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label | The plan is to augment our whitelist with something based on ascertaining whether all the characters in a label | ||
all come from the same script, or are from one of a limited and defined number of allowable combinations. The | all come from the same script, or are from one of a limited and defined number of allowable combinations. The | ||
hope is that any intra-script near-homographs will be recognisable to people who understand that script. | hope is that any intra-script near-homographs will be recognisable to people who understand that script. To | ||
be clear: if a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following | |||
algorithm will apply. | |||
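The order of decisions described above (whitelist first, script checks second, Punycode as the fallback) can be sketched as follows. This is only an illustration, not Firefox code: the function name <code>choose_display_form</code>, its parameters, and the <code>passes_script_checks</code> callback are hypothetical, and the character blacklist is left out.

```python
from typing import Callable, Set


def choose_display_form(
    label: str,
    tld: str,
    whitelisted_tlds: Set[str],
    passes_script_checks: Callable[[str], bool],
) -> str:
    """Return the form of `label` to show: Unicode or Punycode.

    All names here are hypothetical; `passes_script_checks` stands in
    for the single-script / allowed-combination test described in the
    surrounding text.
    """
    # Pure ASCII labels have no Punycode form to fall back to.
    if label.isascii():
        return label
    # A whitelisted TLD means Unicode is displayed unconditionally.
    if tld.lower() in whitelisted_tlds:
        return label
    # Otherwise the script-based restrictions decide.
    if passes_script_checks(label):
        return label
    # Fallback: the ASCII-compatible (Punycode) form with its "xn--" prefix.
    return "xn--" + label.encode("punycode").decode("ascii")
```

For example, with an empty whitelist and a check that always fails, <code>choose_display_form("bücher", "example", set(), lambda l: False)</code> returns the Punycode form, while the same call with <code>{"example"}</code> as the whitelist returns the Unicode label unchanged.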
[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36], | [http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36], | ||
which is about Unicode and security, defines a "Moderately Restrictive" profile | which is about Unicode and security, defines a "Moderately Restrictive" profile. It says | ||
the following (with edits for clarity): | the following (with edits for clarity): | ||
| Line 65: | Line 65: | ||
[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives | [http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives | ||
a definition for how we detect whether a string is "single script". | a definition for how we detect whether a string is "single script". Some Common or Inherited characters | ||
are only used in a small number of scripts (but more than one). Mark Davis writes: | |||
"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions, | |||
to provide data about characters which are only used in a few (but more than one) script. | |||
The sample code in #39 should be updated to include that, so handling such cases." | |||
Additional checks: | |||
* Display as Punycode any label which uses more than one numbering system. | |||
* Display as Punycode any label which contains both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (Should be < 16k of data for a simple binary test.) | |||
* Display as Punycode any label which contains a sequence of the same nonspacing mark. | |||
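Two of the additional checks above can be sketched with nothing more than the Unicode Character Database properties exposed by Python's <code>unicodedata</code> module. This is a rough illustration under stated assumptions, not the planned implementation: the function names are made up, a numbering system is identified by the code point of its zero digit, and a "sequence of the same nonspacing mark" is taken to mean the same mark appearing twice in a row. The simplified/traditional Chinese check is omitted because it needs Unihan data that the standard library does not ship.

```python
import unicodedata


def uses_mixed_numbering_systems(label: str) -> bool:
    """True if the decimal digits in `label` come from more than one system.

    Assumption: each numbering system is identified by the code point of
    its zero digit, computed as the digit's code point minus its value.
    """
    zero_code_points = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":  # decimal digit
            zero_code_points.add(ord(ch) - unicodedata.decimal(ch))
    return len(zero_code_points) > 1


def has_repeated_nonspacing_mark(label: str) -> bool:
    """True if the same nonspacing mark (category Mn) occurs twice in a row."""
    previous = None
    for ch in label:
        if ch == previous and unicodedata.category(ch) == "Mn":
            return True
        previous = ch
    return False
```

A label failing either test would be shown as Punycode. For instance, <code>uses_mixed_numbering_systems("12\u0663")</code> is true because the label mixes ASCII digits with an Arabic-Indic digit, and <code>has_repeated_nonspacing_mark("a\u0301\u0301")</code> is true because the combining acute accent is doubled.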
We will retain the whitelist as well, because a) removing it might break | We will retain the whitelist as well, because a) removing it might break | ||
| Line 80: | Line 90: | ||
===Possible Issues and Open Questions=== | ===Possible Issues and Open Questions=== | ||
The | The following issues are open, but should not block initial implementation. | ||
Suggestion from TR#39: | |||
* Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv] | * Check to see that all the characters are in the sets of exemplar characters for at least one language in the Unicode Common Locale Data Repository. [XXX What does this mean? -- Gerv] | ||
Also: | Also: | ||
* Should we document our character hard-blacklist as part of this exercise? Are any characters in it legal in IDNA2008? | * Should we document our character hard-blacklist as part of this exercise? It's already visible in the prefs. Are any characters in it legal in IDNA2008 anyway? | ||
* Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this... | * Do we want to allow the user to choose between multiple "restriction levels", or have a hidden pref? There are significant downsides to allowing this... | ||
* Do we ever want to display errors other than just by using Punycode? | * Do we ever want to display errors other than just by using Punycode? I suggest not... | ||
* Should we add Armenian to the list of scripts which cannot mix with Latin? | * Should we add Armenian to the list of scripts which cannot mix with Latin? | ||
===Downsides=== | ===Downsides=== | ||
| Line 117: | Line 119: | ||
===Transition=== | ===Transition=== | ||
In between adopting this plan and shipping a Firefox with | |||
the restrictions implemented, we | the restrictions implemented, we will admit into the whitelist any | ||
TLD whose anti-spoofing policies at registration time were at least as strong as | TLD whose anti-spoofing policies <i>at registration time</i> were at least as strong as | ||
those outlined above. | those outlined above. | ||