User:Waldo/Internationalization API: Difference between revisions

Revision as of 01:38, 17 April 2013

730 bytes added, 17 April 2013

talk a little about BCP47 in the concepts introduction, talk more about language tags

Waldo

@@ Line 9: / Line 9: @@
 == Key concepts ==
-...talk about language tags and their structure and what's encoded in them, collators, date formats, and how all the stuff is implemented using what ICU primitives...copiously link to BCP47...
+Most of the concepts used by the Internationalization API are defined in [http://tools.ietf.org/html/bcp47 BCP 47], a living aggregation of RFCs that specifies how languages are identified and matched using language tags.  (The set may change over time as its member RFCs are obsoleted and replaced.)  Full details on concepts should generally be looked up there: ECMA-402 defines most underlying concepts only by reference.
+...talk about collators, date formats, and how all the stuff is implemented using what ICU primitives...copiously link to BCP47...
 === Language tags ===
-Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The components of a language tag are the language and optionally a script, region, and variations that might exist within these; an optional private-use component may also be included at the end.  Each component is alphanumeric and case-insensitive.  The components are joined by hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.).  For precise details of language tag structure, see [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47].
+Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The main components of a language tag are the language and, optionally, a script, a region, and variants that might exist within these.  Optional extension components may follow, each permitting inclusion of extra structured data (usually to contextualize a use of the language tag).  Finally, an optional private-use component may include implementation-defined data.  All components are case-insensitive ASCII alphanumerics.  The components are joined into a language tag using hyphens; individual components can be distinguished by position, length, and internal syntax (a leading digit, a single-character prefix, etc.).  The precise details of language tag structure are quite complex, and they include a list of irregular forms kept for legacy compatibility.  See [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47] for all the gory details.
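To make "distinguished by position, length, and internal syntax" concrete, here is a rough sketch of splitting a well-formed tag into its components.  It is an illustration only, not SpiderMonkey's or ICU's code: the function name <code>splitLanguageTag</code> and the shape of the returned object are made up for this example, subtags are lowercased for simplicity, and the irregular legacy tags are deliberately not handled.

```javascript
// Illustrative sketch (hypothetical helper, not SpiderMonkey code).
// Splits a well-formed, non-irregular BCP 47 tag by the shape of each subtag.
function splitLanguageTag(tag) {
  const subtags = tag.toLowerCase().split("-");
  const n = subtags.length;
  const parts = {
    language: undefined, extlangs: [], script: undefined, region: undefined,
    variants: [], extensions: [], privateUse: undefined,
  };
  let i = 0;
  const nextMatches = (re) => i < n && re.test(subtags[i]);

  // A tag may consist of a private-use component alone ("x-...").
  if (subtags[0] !== "x") {
    // Language: 2-3 letters, or 5-8 letters (4 letters is reserved).
    if (!nextMatches(/^([a-z]{2,3}|[a-z]{5,8})$/)) {
      throw new RangeError("bad language subtag in " + tag);
    }
    parts.language = subtags[i++];
    // Extended language: up to three 3-letter subtags after a short language.
    while (parts.language.length <= 3 && parts.extlangs.length < 3 &&
           nextMatches(/^[a-z]{3}$/)) {
      parts.extlangs.push(subtags[i++]);
    }
    // Script: exactly 4 letters.
    if (nextMatches(/^[a-z]{4}$/)) parts.script = subtags[i++];
    // Region: 2 letters or 3 digits.
    if (nextMatches(/^([a-z]{2}|[0-9]{3})$/)) parts.region = subtags[i++];
    // Variants: 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
    while (nextMatches(/^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/)) {
      parts.variants.push(subtags[i++]);
    }
    // Extensions: a one-character singleton (anything but "x"), then 1+ subtags.
    while (nextMatches(/^[0-9a-wy-z]$/)) {
      const extension = [subtags[i++]];
      while (nextMatches(/^[a-z0-9]{2,8}$/)) extension.push(subtags[i++]);
      if (extension.length < 2) throw new RangeError("empty extension in " + tag);
      parts.extensions.push(extension.join("-"));
    }
  }
  // Private use: "x" followed by one or more subtags, always last.
  if (i < n && subtags[i] === "x") {
    const rest = subtags.slice(i + 1);
    if (rest.length === 0 || !rest.every((s) => /^[a-z0-9]{1,8}$/.test(s))) {
      throw new RangeError("bad private-use component in " + tag);
    }
    parts.privateUse = subtags.slice(i).join("-");
    i = n;
  }
  if (i !== n) throw new RangeError("unexpected subtag in " + tag);
  return parts;
}
```

With this sketch, <code>splitLanguageTag("nan-Hant-TW")</code> yields language <code>nan</code>, script <code>hant</code>, and region <code>tw</code>; no lookup table is needed, because each subtag's length and position identify its role.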
+One subcomponent worth noting specifically is the ''Unicode extension component'', which lives within the extension component.  The Unicode extension component has the basic form <code>"-u(-[a-z0-9]{2,8})+"</code>, with precise details in [https://tools.ietf.org/html/rfc6067 RFC 6067].  It permits specifying additional details such as sort order, numbering system, and calendar system.  For example, <code>de-DE-u-co-phonebk</code> requests German phone-book collation.
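The basic form quoted above translates directly into a regular expression.  The following sketch uses it to pull the keywords out of a tag; the names <code>UNICODE_EXTENSION</code> and <code>unicodeExtensionKeywords</code> are invented for this example, and it assumes the RFC 6067 layout in which each two-character subtag is a key and the longer subtags after it are that key's type.  It is not how SpiderMonkey or ICU do this, and for brevity it does not guard against <code>-u-</code> appearing inside a private-use component.

```javascript
// Illustrative sketch (hypothetical names, not SpiderMonkey code).
// The basic form of the Unicode extension component, case-insensitive.
const UNICODE_EXTENSION = /-u(?:-[a-z0-9]{2,8})+/i;

// Returns { attributes, keywords } for the tag's Unicode extension,
// or null if the tag has none.
function unicodeExtensionKeywords(tag) {
  const match = UNICODE_EXTENSION.exec(tag);
  if (match === null) return null;

  // "-u-co-phonebk" splits into ["", "u", "co", "phonebk"]; drop the first two.
  const subtags = match[0].toLowerCase().split("-").slice(2);
  const attributes = [];
  const keywords = {};
  let key = null;
  for (const subtag of subtags) {
    if (subtag.length === 2) {
      // A two-character subtag starts a new keyword.
      key = subtag;
      keywords[key] = [];
    } else if (key === null) {
      // Longer subtags before the first key are attributes.
      attributes.push(subtag);
    } else {
      keywords[key].push(subtag);
    }
  }
  return { attributes, keywords };
}
```

For <code>de-DE-u-co-phonebk-nu-latn</code> this returns the keywords <code>co</code> (collation) with type <code>phonebk</code> and <code>nu</code> (numbering system) with type <code>latn</code>.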
 SpiderMonkey mostly ignores the language, script, region, and variant components of a language tag.  It will pass these components to ICU in language tags provided by the user, but it generally doesn't examine them, or do much of interest with them.  The one exception is for ''old-style language tags''.  '''XXX add details about the old-style mapping code in Intl.js, and why ICU doesn't perform that mapping itself'''
-SpiderMonkey ''does'', however, sometimes have to (very briefly) care about the extension component of a language tag.  The extension component may include ''Unicode extensions'' that specify things like the particular collation (sorting) algorithm to use (phone-book name sorting versus dictionary order, numeric versus lexical for numbers <nowiki>[</nowiki>1 12 100 or 1 100 12<nowiki>]</nowiki>), the numbering system to use when formatting a number, and so on.  Some ECMA-402 algorithms require locales be considered with a Unicode extension component removed, so SpiderMonkey must sometimes remove them before continuing with a provided language tag.
+SpiderMonkey ''does'', however, sometimes have to (very briefly) care about a Unicode extension component of a language tag, but only to remove it.  ECMA-402 often has better-structured means of specifying the same information, so its algorithms require that the Unicode extension component be removed before processing continues.
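Removing the Unicode extension component can be sketched as a single regular-expression replacement.  This is an illustration under stated assumptions, not the code in Intl.js: the name <code>removeUnicodeExtension</code> is made up here, and the input is assumed to be a well-formed tag.  The one subtlety handled is that a private-use component may legally contain <code>-u-</code>, so everything from <code>x-</code> onward is left untouched.

```javascript
// Illustrative sketch (hypothetical helper, not the Intl.js implementation).
// Strips the Unicode extension component from a well-formed language tag.
function removeUnicodeExtension(tag) {
  // Find where a private-use component starts, if there is one.
  const privateUseStart = tag.search(/(^|-)x-/i);
  const main = privateUseStart === -1 ? tag : tag.slice(0, privateUseStart);
  const privateUse = privateUseStart === -1 ? "" : tag.slice(privateUseStart);

  // Remove "-u" plus its subtags from the main part only.
  return main.replace(/-u(?:-[a-z0-9]{2,8})+/i, "") + privateUse;
}
```

Under these assumptions, <code>removeUnicodeExtension("de-DE-u-co-phonebk")</code> gives <code>de-DE</code>, while <code>en-x-u-foo</code> comes back unchanged because its <code>u</code> subtag is private-use data.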
 === ... ===