
IDN Display Algorithm

104 bytes removed, 13:59, 30 January 2012
The plan is to augment our whitelist with something based on ascertaining whether the characters in a label
all come from the same script, or are from one of a limited and defined number of allowable combinations. The
hope is that any intra-script near-homographs will be recognisable to people who understand that script.

We will retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above.

==Algorithm==

If a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following
algorithm will apply.
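
As a rough illustration of that top-level decision, here is a minimal Python sketch. The function name <code>display_form</code>, the <code>label_is_safe</code> callback and the example whitelist contents are all hypothetical, and the sketch assumes the decision is made once for the whole domain rather than label by label.

```python
def display_form(domain, tld_whitelist, label_is_safe):
    """Return "unicode" or "punycode" for a domain given as Unicode text.

    tld_whitelist and label_is_safe are hypothetical inputs: the first is a
    set of whitelisted TLDs, the second a callable implementing the
    per-label checks for TLDs that are not whitelisted.
    """
    labels = domain.lower().rstrip(".").split(".")
    if labels[-1] in tld_whitelist:
        # Whitelisted TLDs are always shown as Unicode.
        return "unicode"
    if all(label_is_safe(label) for label in labels):
        return "unicode"
    return "punycode"


# Illustrative whitelist only; not the real list.
print(display_form("b\u00fccher.de", {"de"}, lambda label: False))
```
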
[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],
which is about Unicode and security, defines a "Moderately Restrictive" profile. It says the following (with edits for clarity):
<blockquote>
No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile] (defined for us by the IDNA2008 standard, [http://tools.ietf.org/html/rfc5892 RFC 5892]).

All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations:
* Common + Inherited + Latin + Han + Hiragana + Katakana; or
* Common + Inherited + Latin + Han + Bopomofo; or
* Common + Inherited + Latin + Han + Hangul; or
* Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee
</blockquote>
 
This system would allow any single script, and also most scripts + Latin, which
is a common mixing, plus script mixings common in the Far East where
they use multiple scripts at once.
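
The script-combination rule quoted above could be sketched in Python roughly as follows. This is only an illustration under loud assumptions: Python's standard library does not expose the Unicode Script property, so the hypothetical helper <code>script_of</code> guesses a script from the first word of the character name and treats all marks as Inherited and all other non-letters as Common. A real implementation would use the Unicode script data instead, and would also apply the Identifier Profile check, which is not shown here.

```python
import unicodedata

# Multi-script combinations allowed by the quoted profile.
CJK_COMBINATIONS = [
    {"LATIN", "HAN", "HIRAGANA", "KATAKANA"},
    {"LATIN", "HAN", "BOPOMOFO"},
    {"LATIN", "HAN", "HANGUL"},
]
# Scripts that may not be mixed with Latin.
NO_LATIN_MIX = {"CYRILLIC", "GREEK", "CHEROKEE"}


def script_of(ch):
    """Crude stand-in for the Unicode Script property (hypothetical helper)."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "INHERITED"  # approximation: treat every mark as Inherited
    if not category.startswith("L"):
        return "COMMON"     # approximation: digits, hyphen and so on
    first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return "HAN" if first_word == "CJK" else first_word


def is_moderately_restrictive(label):
    """Apply only the script-combination part of the quoted rule."""
    scripts = {script_of(ch) for ch in label} - {"COMMON", "INHERITED"}
    if len(scripts) <= 1:
        return True  # Common + Inherited + at most one script
    if any(scripts <= combo for combo in CJK_COMBINATIONS):
        return True
    others = scripts - {"LATIN"}
    return ("LATIN" in scripts and len(others) == 1
            and not (others & NO_LATIN_MIX))
```

With this sketch, a label mixing Latin and Han passes, while a Latin label containing a single Cyrillic letter (for example <code>"p\u0430ypal"</code>) fails.
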
 
The Identifier Profile is defined for us by the IDNA2008 standard,
[http://tools.ietf.org/html/rfc5892 RFC 5892]. So when we upgrade to IDNA 2008 (a separate discussion),
that should hopefully eliminate a large number of non-alphabet characters for us.
[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives
"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions,
to provide data about characters which are only used in a few (but more than one) script.
The sample code in #39 should be updated to include that, so handling such cases." We should take this enhancement when the data becomes available; in the meantime, Common and Inherited characters are permitted without restriction.
Additional checks:
* Display as Punycode labels which use more than one numbering system (we would need a list of numbering systems in Unicode)
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (This should be < 16k of data for a simple binary test.)
* Display as Punycode labels which have sequences of the same nonspacing mark (we would need a list of, and a name of the class containing, such marks)

I think that this would make us display a superset of the IDN domains that the other browsers display, in a way which was consistent across all copies of Firefox (maintaining the certainty which is a benefit of the current system) and which was pretty safe from spoofing.
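
Two of the additional checks can be approximated with Python's standard <code>unicodedata</code> module, as a sketch rather than a definitive implementation. The function names are hypothetical. Instead of an explicit list of numbering systems, the first function identifies a numbering system by the code point of its zero digit, relying on decimal digits being encoded as contiguous runs of 0 to 9. The second assumes that "sequences of the same nonspacing mark" means the same mark (general category Mn) appearing twice in a row. The simplified/traditional Chinese check needs Unihan data and is not shown.

```python
import unicodedata


def uses_mixed_numbering_systems(label):
    """True if the label's decimal digits come from more than one digit run."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Subtracting the digit's value gives the code point of the zero
            # of its run, which serves as an identifier for the system.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) > 1


def has_repeated_nonspacing_mark(label):
    """True if the same nonspacing mark (category Mn) occurs twice in a row."""
    return any(
        a == b and unicodedata.category(a) == "Mn"
        for a, b in zip(label, label[1:])
    )
```

A label for which either function returns true would be displayed as Punycode.
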
===Possible Issues and Open Questions===
If problems arose in the future (e.g. with homographs between a particular
script and Latin), our response would be that in the end, it is up to registries to make sure that their customers
cannot rip each other off. Browsers can put some technical restrictions in place,
but we are not in a position to do this job for them while still maintaining
a level playing field for non-Latin scripts on the web. The registries are the only people in a position to implement the proper checking here. For our part, we want to make sure we don't treat non-Latin scripts as second-class citizens.
===Transition===