Changes

IDN Display Algorithm

104 bytes removed, 13:59, 30 January 2012

no edit summary

The plan is to augment our whitelist with a check on whether the characters in a label all come from the same script, or from one of a limited and defined number of allowable combinations. The hope is that any intra-script near-homographs will be recognisable to people who understand that script.

~~To be clear:~~ We will retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above.

==Algorithm==

If a TLD is in the whitelist, we will unconditionally display Unicode. If it is not, the following algorithm will apply.

[http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts Unicode Technical Report 36],

~~which is about Unicode and security,~~ defines a "Moderately Restrictive" profile. It says the following (with edits for clarity):

<blockquote>
No characters in the label can be outside of the [http://www.unicode.org/reports/tr39/#Identifier_Characters Identifier Profile] (defined for us by the IDNA2008 standard, [http://tools.ietf.org/html/rfc5892 RFC 5892]).

All characters in each label must be from Common + Inherited + a single script, or from one of the following combinations:

* Common + Inherited + Latin + Han + Hiragana + Katakana; or
* Common + Inherited + Latin + Han + Bopomofo; or
* Common + Inherited + Latin + Han + Hangul; or
* Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee

</blockquote>
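The script-mixing rule quoted above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the browser's implementation: Python's standard library does not expose the Unicode Script property, so the hypothetical helper <code>guess_script</code> approximates it from the character name, and the Identifier Profile check is left out entirely. The names <code>ALLOWED_COMBINATIONS</code>, <code>LATIN_MIX_EXCLUDED</code> and <code>is_moderately_restrictive</code> are likewise made up for this example.

```python
import unicodedata

# The three explicitly allowed multi-script combinations (Common and
# Inherited are always permitted, so they are not listed here).
ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]
# Scripts that may not be mixed with Latin under the last rule.
LATIN_MIX_EXCLUDED = {"Cyrillic", "Greek", "Cherokee"}


def guess_script(ch):
    """Rough stand-in for the Unicode Script property (an assumption of
    this sketch): real code would use the Scripts.txt data instead."""
    name = unicodedata.name(ch, "")
    category = unicodedata.category(ch)
    if category.startswith("M") and name.startswith("COMBINING"):
        return "Inherited"
    if category[0] not in "LM":
        # Digits, hyphens and other non-letters are treated as Common.
        return "Common"
    first_word = name.split(" ")[0]
    # CJK ideographs belong to the Han script; otherwise use the first
    # word of the character name, e.g. "CYRILLIC" -> "Cyrillic".
    return {"CJK": "Han"}.get(first_word, first_word.title())


def is_moderately_restrictive(label):
    """Apply the single-script / allowed-combination rule to one label."""
    scripts = {guess_script(ch) for ch in label} - {"Common", "Inherited"}
    if len(scripts) <= 1:
        return True
    if any(scripts <= combo for combo in ALLOWED_COMBINATIONS):
        return True
    if len(scripts) == 2 and "Latin" in scripts:
        (other,) = scripts - {"Latin"}
        return other not in LATIN_MIX_EXCLUDED
    return False
```

With this sketch, a Latin label containing a single Cyrillic letter fails the check, while Latin mixed with Han or with Devanagari passes.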

~~This system would allow any single script, and also most scripts + Latin, which~~

~~is a common mixing, plus script mixings common in the Far East where~~

~~they use multiple scripts at once.~~

~~The Identifier Profile is defined for us by the IDNA2008 standard,~~

~~[http://tools.ietf.org/html/rfc5892 RFC 5892]. So when we upgrade to IDNA 2008 (a separate discussion),~~

~~that should hopefully eliminate a large number of non-alphabet characters for us.~~

[http://www.unicode.org/reports/tr39/#Mixed_Script_Detection Unicode Technical Report 39] gives

"The Unicode Consortium in U6.1 (due out soon) is adding the property Script_Extensions,

to provide data about characters which are only used in a few (but more than one) script.

The sample code in #39 should be updated to include that, so handling such cases." We should take this enhancement when the data becomes available; in the meantime, Common and Inherited characters are permitted without restriction.

Additional checks:

* Display as Punycode labels which use more than one numbering system (we would need a list of numbering systems in Unicode)
* Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters, using the Unihan data in the Unicode Character Database. (~~Should~~ should be < 16k of data for a simple binary test.)
* Display as Punycode labels which have sequences of the same nonspacing mark (we would need a list of, or the name of a class containing, all such marks)

~~We will retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN domain would be shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above.~~

~~I think that this would make us display a superset of the IDN domains that the other browsers display, in a way which was consistent across all copies of Firefox (maintaining the certainty which is a benefit of the current system) and which was pretty safe from spoofing.~~
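Two of the additional checks can be approximated with Python's standard <code>unicodedata</code> module, as in the sketch below. It rests on several assumptions: "numbering system" is taken to mean a run of decimal digits (category Nd), "nonspacing mark" is taken to mean category Mn, and "sequences of the same mark" is read as the same mark twice in a row. The simplified/traditional Chinese check is omitted because it needs the Unihan data. All function names, and the <code>script_rules_ok</code> parameter standing for the result of the script-mixing rule, are hypothetical.

```python
import unicodedata


def uses_multiple_numbering_systems(label):
    """True if the label's decimal digits come from more than one digit run."""
    zeros = set()
    for ch in label:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits are encoded as contiguous runs 0-9, so
            # subtracting the digit value gives the code point of that
            # numbering system's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) > 1


def has_repeated_nonspacing_mark(label):
    """True if the same nonspacing mark (category Mn) occurs twice in a row."""
    return any(
        a == b and unicodedata.category(a) == "Mn"
        for a, b in zip(label, label[1:])
    )


def display_form(label, tld_on_whitelist, script_rules_ok):
    """Return the label as Unicode, or as Punycode if a check fails.

    script_rules_ok is supplied by the caller and stands for the outcome
    of the script-mixing rule, which this sketch does not implement.
    """
    if tld_on_whitelist:
        return label
    safe = (
        script_rules_ok
        and not uses_multiple_numbering_systems(label)
        and not has_repeated_nonspacing_mark(label)
    )
    if safe:
        return label
    return "xn--" + label.encode("punycode").decode("ascii")
```

Keeping each check as a separate predicate means a further check, such as the Chinese one, could be added later without changing the others.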

===Possible Issues and Open Questions===

If problems arose in the future (e.g. with homographs between a particular

script and Latin), ~~we would need to quickly issue a robust response making the point that~~ our response would be that, in the end, it is up to registries to make sure that their customers

cannot rip each other off. Browsers can put some technical restrictions in place,

but we are not in a position to do this job for them while still maintaining

a level playing field for non-Latin scripts on the web. The registries are the only people in a position to implement the proper checking here. For our part, we want to make sure we don't treat non-Latin scripts as second-class citizens.

===Transition===

Gerv

