User:Waldo/Internationalization API: Difference between revisions

Talk more about ICU, split key concepts into separate concepts/operations sections
No edit summary
Line 7: Line 7:
The Internationalization API introduces one new global property: <code>Intl</code>.  This property is an object with various properties corresponding to various sub-APIs: collation (sorting), number formatting, and date/time formatting.  (More capabilities will be added in future Internationalization API updates.)  The localization APIs from ES5 have been reformulated to use the localization capabilities provided by the Internationalization API.  Generally, however, it's preferable to use the Internationalization API directly, as this is more efficient by permitting caching of the structures needed to perform each operation.
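The efficiency point can be illustrated with a minimal sketch.  The sample data is hypothetical, and the sketch assumes an engine in which <code>Intl</code> is available.  The legacy method has to resolve the locale on every call, while a collator created once can be reused for every comparison:

```javascript
// Hypothetical sample data, chosen only for illustration.
const names = ["Zoë", "adam", "Émile", "bob"];

// Legacy ES5-style call: the locale is looked up again for each comparison.
const legacySorted = names.slice().sort(function (a, b) {
  return a.localeCompare(b, "en");
});

// Direct use of the Internationalization API: the collator is built once,
// and its bound compare function is reused for the whole sort.
const collator = new Intl.Collator("en");
const intlSorted = names.slice().sort(collator.compare);

console.log(legacySorted, intlSorted);
```

Both calls should produce the same order; only the amount of repeated setup work differs.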


== Key concepts ==
== Concepts ==


Most of the concepts used by the Internationalization API are defined in [http://tools.ietf.org/html/bcp47 BCP 47]: a living aggregation of a set of RFCs (the set may change over time as RFCs in the set are obsoleted and replaced) specifying internationalization mechanics.  Full details on concepts should generally be looked up there: ECMA-402 defines most underlying concepts only by reference.
...talk about collators, date formats, and how all the stuff is implemented using what ICU primitives...copiously link to BCP47...


=== Language tags ===


Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The main components of a language tag are the language and optionally a script, region, and variations that might exist within these.  An extension component follows, permitting inclusion of extra structured data (usually to contextualize a use of the language tag).  Finally, an optional private-use component may include implementation-defined data.  All components are alphanumeric and case-insensitive ASCII.  The components are joined into a language tag using hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.).  The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons.  See [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47] for all the gory details.
Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The main components of a language tag are the language and optionally a script, region, and variations that might exist within these.  An extension component follows, permitting inclusion of extra structured data (usually to contextualize a use of the language tag).  Finally, an optional private-use component may include arbitrary data (this is for the use of webpages -- not for SpiderMonkey's use).  All components are alphanumeric and case-insensitive ASCII.  The components are joined into a language tag using hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.).  The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons.  See [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47] for all the gory details.
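As a rough illustration of case-insensitivity and hyphen-joined components, the following sketch uses <code>Intl.getCanonicalLocales</code>.  That function was added to ECMA-402 after the first edition, so its availability is an assumption here; <code>isStructurallyValid</code> is a hypothetical helper name, not part of any API.

```javascript
// Canonicalization normalizes case: language lowercase, script titlecase,
// region uppercase.  The input tags are arbitrary examples.
const canonical = Intl.getCanonicalLocales(["EN-us", "SR-cyrl-rs", "Und"]);
console.log(canonical);

// Hypothetical helper: reports whether a string is a structurally valid
// language tag, relying on the RangeError thrown for malformed input.
function isStructurallyValid(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch (e) {
    if (e instanceof RangeError) {
      return false;
    }
    throw e;
  }
}

console.log(isStructurallyValid("en-US")); // hyphen-joined components
console.log(isStructurallyValid("en_US")); // underscore is not a valid separator
```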


One particular subcomponent worth noting specifically is the ''Unicode extension component'', living within the extension component.  The Unicode extension component has the basic form <code>"-u(-[a-z0-9]{2,8})+"</code>, with precise details in [https://tools.ietf.org/html/rfc6067 RFC 6067].  The Unicode component permits specifying additional details about sort order, numeric system, calendar system, and others.
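A small sketch of the Unicode extension in use, assuming an engine whose collator honors the <code>kn</code> (numeric ordering) key; the tags are arbitrary examples:

```javascript
// Plain tag: digits are compared character by character, so "5" sorts after "30".
const plain = new Intl.Collator("en");

// Same language with a Unicode extension requesting numeric ordering.
const numericByTag = new Intl.Collator("en-u-kn-true");

console.log(plain.compare("5", "30") > 0);          // character-wise ordering
console.log(numericByTag.compare("5", "30") < 0);   // numeric ordering
console.log(numericByTag.resolvedOptions().numeric); // reflects the kn key
```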


SpiderMonkey mostly ignores the language, script, region, and variant components of a language tag.  It will pass these components to ICU in language tags provided by the user, but it generally doesn't examine them, or do much of interest with them.  The one exception is for ''old-style language tags''.  '''XXX add details about the old-style mapping code in Intl.js, and why ICU doesn't perform that mapping itself'''
SpiderMonkey mostly ignores the language, script, region, variant, and private-use components of a language tag.  It will pass these components to ICU in language tags provided by the user, but it generally doesn't examine them or do much of interest with them.  The one exception is for ''old-style language tags''.  '''XXX add details about the old-style mapping code in Intl.js, and why ICU doesn't perform that mapping itself'''


SpiderMonkey ''does'', however, sometimes have to (very briefly) care about a Unicode extension component of a language tag -- but only to remove it.  ECMA-402 often has better-structured means of specifying the same information, and so its algorithms require the Unicode extension component be removed before processing continues.
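The removal step can be sketched with the basic form given above.  This is not SpiderMonkey's actual code: <code>stripUnicodeExtension</code> is a hypothetical name, and the sketch assumes a well-formed tag in which <code>-u-</code> does not occur inside a private-use component.

```javascript
// Matches "-u" followed by one or more 2-8 character alphanumeric subtags,
// i.e. the basic form -u(-[a-z0-9]{2,8})+, case-insensitively.
const UNICODE_EXTENSION = /-u(?:-[a-z0-9]{2,8})+/i;

// Hypothetical helper: returns the tag with its Unicode extension removed.
// Single-letter subtags (other extensions, private use) end the match,
// so anything after the Unicode extension is preserved.
function stripUnicodeExtension(tag) {
  return tag.replace(UNICODE_EXTENSION, "");
}

console.log(stripUnicodeExtension("de-DE-u-co-phonebk"));
console.log(stripUnicodeExtension("de-u-co-phonebk-x-priv"));
console.log(stripUnicodeExtension("en-US"));
```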
Line 30: Line 28:


Date formatting requires knowledge of the time zone.  Time zone names are specified by the [http://www.iana.org/time-zones IANA Time Zone Database].
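The effect of the time zone on formatting can be seen in a short sketch.  The instant and the zone names are arbitrary choices; support for IANA names other than <code>UTC</code> is an assumption about the engine.

```javascript
// One instant in time: 2013-01-01 00:30 UTC.
const instant = new Date(Date.UTC(2013, 0, 1, 0, 30));

// Hypothetical helper that formats the calendar date of an instant
// as seen in a given time zone.
function dateIn(timeZone) {
  return new Intl.DateTimeFormat("en-US", {
    year: "numeric",
    month: "2-digit",
    day: "2-digit",
    timeZone: timeZone
  }).format(instant);
}

const inUTC = dateIn("UTC");
const inLosAngeles = dateIn("America/Los_Angeles");

// The same instant falls on different calendar days in the two zones.
console.log(inUTC, inLosAngeles);
```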
== Operations ==
ECMA-402 in its first iteration exposes various locale-sensitive operations.


=== Collation ===


Collation is the process of sorting a list of strings, according to particular rules.  Different locales sort text differently.  Locales may also sort differently in different contexts: dictionary sort order versus phonebook sort order, say.  Sorting also may or may not take into account numeric value: <code><nowiki>[</nowiki>1, 30, 5<nowiki>]</nowiki></code> order versus <code><nowiki>[</nowiki>1, 5, 30<nowiki>]</nowiki></code> order.
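A minimal sketch of the numeric distinction, using the <code>numeric</code> option of <code>Intl.Collator</code>; the input list is an arbitrary example:

```javascript
const items = ["30", "5", "1"];

// Default collation compares digit characters one at a time.
const byCharacter = items.slice().sort(new Intl.Collator("en").compare);

// Numeric collation compares runs of digits by their numeric value.
const byValue = items.slice().sort(
  new Intl.Collator("en", { numeric: true }).compare
);

console.log(byCharacter); // ["1", "30", "5"]
console.log(byValue);     // ["1", "5", "30"]
```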


== Internationalization in SpiderMonkey ==


SpiderMonkey includes significant support for the Internationalization API.  The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU.  This is an optional component of a SpiderMonkey build; support may be turned on using the <code>--enable-intl-api</code> configuration option.  Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, but the <code>Intl</code> object isn't added to global objects, and the legacy <code>toLocale*String</code> methods are implemented using SpiderMonkey's old <code>JSLocaleCallbacks</code> interface.  Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.
SpiderMonkey includes significant support for the Internationalization API.  The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU.  This is an optional component of a SpiderMonkey build; support may be turned on using the <code>--enable-intl-api</code> configuration option.  Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, to prevent bitrot; ICU interfaces of note are stubbed out in this configuration with methods that do nothing but assert.  The most important differences in a SpiderMonkey build without Internationalization are that the <code>Intl</code> object isn't added to global objects and the legacy <code>toLocale*String</code> methods are implemented using SpiderMonkey's old <code>JSLocaleCallbacks</code> interface.  Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.
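Because <code>Intl</code> may be absent from the global object, script code that must run in both configurations can feature-detect it.  A minimal sketch follows; <code>formatNumber</code> is a hypothetical helper name, and the fallback output depends on whatever the embedding's legacy implementation produces.

```javascript
// Hypothetical helper: uses the Internationalization API when the Intl
// object exists, and falls back to the legacy toLocaleString otherwise.
function formatNumber(value, locale) {
  if (typeof Intl === "object" && Intl !== null &&
      typeof Intl.NumberFormat === "function") {
    return new Intl.NumberFormat(locale).format(value);
  }
  // Legacy path: no control over the locale in a build without Intl.
  return value.toLocaleString();
}

console.log(formatNumber(1234.5, "en-US"));
```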


The Internationalization API is enabled by default in SpiderMonkey when embedded in Firefox builds.
Line 45: Line 47:
==== ICU ====


International Components for Unicode is a library implementing collation, formatting, and other locale-sensitive functions.  It provides the underlying functionality used in implementing Internationalization.  ICU is imported in {{source|intl/icu}}.
 
ICU's source code is large and sprawling, which is hardly surprising for a project more than 15 years old.  {{source|intl/icu/source/common/unicode/}} is probably the most interesting directory from SpiderMonkey's point of view, as it contains the public headers and interfaces used by SpiderMonkey.  Each header contains copious documentation of the behavior of the functions, enums, and other interfaces it declares.  The documentation isn't always perfectly clear, but quite often it's enough to learn how to use the functionality without having to read the implementation.


==== Integration ====
Line 68: Line 72:


Tests live in {{source|js/src/tests/test402}}, an unmodified import of the ECMA-402 test suite.  '''XXX Explain how the tests are run, how they're skipped in no-<code>Intl</code> builds, how to update them, how to contribute to them, how we disable/annotate any tests we don't pass'''
== Structures ==
...talk about collators, date formats, and how all the stuff is implemented using what ICU primitives...copiously link to BCP47...


=== Known bugs and issues ===