User:Waldo/Internationalization API: Difference between revisions

Line 15: Line 15:
Language tags are defined in [http://tools.ietf.org/html/bcp47 BCP 47]: a living aggregation of a set of RFCs (the set may change over time as RFCs in the set are obsoleted and replaced) specifying their structure, their default interpretation, and extension mechanisms. The standard is complemented by the [http://www.iana.org/assignments/language-subtag-registry IANA Language Subtag Registry].
Language tags are defined in [http://tools.ietf.org/html/bcp47 BCP 47]: a living aggregation of a set of RFCs (the set may change over time as RFCs in the set are obsoleted and replaced) specifying their structure, their default interpretation, and extension mechanisms. The standard is complemented by the [http://www.iana.org/assignments/language-subtag-registry IANA Language Subtag Registry].


Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The main components of a language tag are the language and optionally a script, region, and variations that might exist within these.  An extension component may follow, permitting inclusion of extra structured data (usually to contextualize a use of the language tag).  Finally, an optional private-use component may include arbitrary data (this is for the use of webpages -- not for SpiderMonkey's use).  All components are alphanumeric and case-insensitive ASCII.  The components are joined into a language tag using hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.).  The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons.  See [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47] for all the gory details.
Every operation is performed in terms of locales, specified as [http://tools.ietf.org/html/bcp47#section-2.1 language tags]: <code>en-US</code>, <code>nan-Hant-TW</code>, <code>und</code>, and so on.  The main components of a language tag are the language and optionally a script, region, and variations that might exist within these.  An extension component may follow, permitting inclusion of extra structured data (usually to contextualize a use of the language tag).  Finally, an optional private-use component may include arbitrary data (this is for the use of webpages, not for SpiderMonkey's use).  All components are alphanumeric and case-insensitive ASCII.  The components are joined into a language tag using hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.).  The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons.  See [http://tools.ietf.org/html/bcp47#section-2.1 BCP 47] for all the gory details.


One particular subcomponent worth noting specifically is the ''Unicode extension component'', living within the extension component.  The Unicode extension component has the basic form <code>"-u(-[a-z0-9]{2,8})+"</code>, with precise details in [https://tools.ietf.org/html/rfc6067 RFC 6067].  The Unicode component permits specifying additional details about sort order, numeric system, calendar system, and others.
One particular subcomponent worth noting specifically is the ''Unicode extension component'', living within the extension component.  The Unicode extension component has the basic form <code>"-u(-[a-z0-9]{2,8})+"</code>, with precise details in [https://tools.ietf.org/html/rfc6067 RFC 6067].  The Unicode component permits specifying additional details about sort order, numeric system, calendar system, and others.
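As a rough illustration of the basic form quoted above, the following sketch pulls the Unicode extension out of a language tag with a regular expression. The helper name <code>findUnicodeExtension</code> is hypothetical, and the pattern is only the approximate form from the text: it does not implement the full RFC 6067 grammar and does not treat a private-use section specially.

```javascript
// Approximate form of a Unicode extension, as quoted in the text above.
// The "i" flag reflects that language tags are case-insensitive.
const UNICODE_EXTENSION = /-u(-[a-z0-9]{2,8})+/i;

// Hypothetical helper: returns the matched extension text, or null if the
// tag does not appear to contain a Unicode extension.
function findUnicodeExtension(tag) {
  const match = UNICODE_EXTENSION.exec(tag);
  return match === null ? null : match[0];
}

// A tag requesting phonebook collation carries an extension...
const withExtension = findUnicodeExtension("de-DE-u-co-phonebk");
// ...while a plain language-region tag does not.
const withoutExtension = findUnicodeExtension("en-US");
```

A real implementation would parse the whole tag according to BCP 47 rather than pattern-match a fragment of it.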
Line 45: Line 45:
=== Date formatting ===
=== Date formatting ===


Locale-sensitive date formatting in ES5 admits only a single implementation-defined format with a fixed set of components.  ECMA-402 enhances date formatting to allow the selection of components to be customized -- month, day, and year, for example..  The way in which these components will be displayed in the final format string is locale-dependent, as different locales write out dates in different ways.  (For example, a date might be formatted as "September 24, 2012" for en-US and as "24 Sept. 2012" for fr-FR.)  Moreover, various styles may be chosen for the components included in the format: "narrow", "short", "long", "2-digit", and "numeric".  (Exactly which of these styles are supported and what they look like is also implementation-dependent.)  These styles feed into the final computation of an appropriate pattern to use to generate a final string.
Locale-sensitive date formatting in ES5 admits only a single implementation-defined format with a fixed set of components.  ECMA-402 enhances date formatting to allow the selection of components to be customized: month, day, and year, for example.  The way in which these components will be displayed in the final format string is locale-dependent, as different locales write out dates in different ways.  (For example, a date might be formatted as "September 24, 2012" for en-US and as "24 Sept. 2012" for fr-FR.)  Moreover, various styles may be chosen for the components included in the format: "narrow", "short", "long", "2-digit", and "numeric".  (Exactly which of these styles are supported and what they look like is also implementation-dependent.)  These styles feed into the final computation of an appropriate pattern to use to generate a final string.
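A minimal sketch of component selection, using the standard <code>Intl.DateTimeFormat</code> constructor and its <code>year</code>, <code>month</code>, <code>day</code>, and <code>timeZone</code> options. The exact output strings depend on the implementation and its locale data, so none are asserted here.

```javascript
// 24 September 2012, expressed in UTC so the result does not depend on
// the local time zone of the machine running the example.
const date = new Date(Date.UTC(2012, 8, 24));

// Select which components appear and the style of each one.
const options = { year: "numeric", month: "long", day: "numeric", timeZone: "UTC" };

// The same components are arranged differently for each locale.
const enUS = new Intl.DateTimeFormat("en-US", options).format(date);
const frFR = new Intl.DateTimeFormat("fr-FR", options).format(date);
```

Changing <code>month</code> to <code>"short"</code> or <code>"2-digit"</code> selects a different style for that one component while leaving the others alone.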


== Internationalization in SpiderMonkey ==
== Internationalization in SpiderMonkey ==


SpiderMonkey includes significant support for the Internationalization API.  The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU.  This is an optional component of a SpiderMonkey build; support may be turned on using the <code>--enable-intl-api</code> configuration option.  Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, to prevent bitrot; ICU interfaces of note are stubbed out in this configuration with methods that do nothing but assert.  The most important differences in a SpiderMonkey build without Internationalization are that the <code>Intl</code> object isn't added to global objects and the legacy <code>toLocale*String</code> methods are implemented using SpiderMonkey's old <code>JSLocaleCallbacks</code> interface.  Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.
SpiderMonkey includes significant support for the Internationalization API.  The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU.  This is an optional component of a SpiderMonkey build; support may be turned on using the <code>--enable-intl-api</code> configuration option.  Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, to prevent bit rot; ICU interfaces of note are stubbed out in this configuration with methods that do nothing but assert.  The most important differences in a SpiderMonkey build without Internationalization are that the <code>Intl</code> object isn't added to global objects and the legacy <code>toLocale*String</code> methods are implemented using SpiderMonkey's old <code>JSLocaleCallbacks</code> interface.  Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.
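Because the <code>Intl</code> object may be absent from a build, script that wants to run in both configurations can feature-detect it. The following is a sketch only; <code>formatNumber</code> is a hypothetical helper name, not something SpiderMonkey provides.

```javascript
// Hypothetical helper: use the Internationalization API when the Intl
// object exists, otherwise fall back to the legacy toLocaleString method.
function formatNumber(value, locale) {
  if (typeof Intl === "object" && Intl !== null &&
      typeof Intl.NumberFormat === "function") {
    return new Intl.NumberFormat(locale).format(value);
  }
  // Legacy path: the format is implementation-defined and the locale
  // argument cannot be honored.
  return value.toLocaleString();
}

const formatted = formatNumber(1234.5, "en-US");
```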


The Internationalization API is enabled by default in SpiderMonkey when embedded in desktop Firefox builds.
The Internationalization API is enabled by default in SpiderMonkey when embedded in desktop Firefox builds.
Line 87: Line 87:
The majority of the self-hosted code implementing Internationalization is in {{source|js/src/builtin/Intl.js}}.  This file defines the functions exposed on the various <code>Intl.*</code> constructor functions and the various <code>Intl.*.prototype</code> objects.
The majority of the self-hosted code implementing Internationalization is in {{source|js/src/builtin/Intl.js}}.  This file defines the functions exposed on the various <code>Intl.*</code> constructor functions and the various <code>Intl.*.prototype</code> objects.


Internationalization in various cases requires keeping around large data tables: to record the minor units of supported currency codes, to record language tag mappings, and so on. Some of this data lives in {{source|js/src/builtin/IntlData.js}} and is generated by {{source|js/src/builtin/make_intl_data.py}}.  This script downloads original (large) plaintext databases, parses them, and extracts in the proper format the data used by Internationalization.  Updating this static data &mdash; which should happen any time the underlying databases receive an update &mdash; should be as simple as rerunning the script. '''XXX Link to the mailing lists to track to learn when updates occur!'''
Internationalization in various cases requires keeping around large data tables: to record the minor units of supported currency codes, to record language tag mappings, and so on. Some of this data lives in {{source|js/src/builtin/IntlData.js}} and is generated by {{source|js/src/builtin/make_intl_data.py}}.  This script downloads original (large) plaintext databases, parses them, and extracts in the proper format the data used by Internationalization.  Updating this static data, which should happen any time the underlying databases receive an update, should be as simple as rerunning the script. See [[#Care_and_feeding_of_the_Internationalization_API|Care and feeding of the Internationalization API]].


==== Intrinsic functions ====
==== Intrinsic functions ====


Self-hosted code calls into various intrinsics to access ICU functionality.  The full list of Internationalization intrinsics is (necessarily, at the moment -- this will probably change eventually) in {{source|js/src/vm/SelfHosting.cpp}}, but the intrinsics themselves are implemented in {{source|js/src/builtin/Intl.cpp}}.
Self-hosted code calls into various intrinsics to access ICU functionality.  The full list of Internationalization intrinsics is (necessarily, at the moment; this will probably change eventually) in {{source|js/src/vm/SelfHosting.cpp}}, but the intrinsics themselves are documented in {{source|js/src/builtin/Intl.h}} and implemented in {{source|js/src/builtin/Intl.cpp}}.


==== Natively-implemented functions ====
==== Natively-implemented functions ====
Line 99: Line 99:
==== Tests ====
==== Tests ====


Tests live in {{source|js/src/tests/test402}}, an unmodified import of the ECMA-402 test suite. The tests are run during the normal jstests/jsreftest suite. '''XXX Is this true yet?'''
Tests live in two directories:
* {{source|js/src/tests/test402}} has an unmodified import of Test402, the ECMA-402 conformance test suite, supplemented by jstests-specific browser.js and shell.js files.
* {{source|js/src/tests/Intl}} has additional implementation-dependent tests, especially for the integration with the ICU library. They verify aspects that are not specified in ECMA-402 but matter for actual use, such as which locales are supported and whether the functions actually exhibit properly localized behavior.


Internationalization tests are treated as a third-party import. Contributions to them should go through '''XXX how to contribute'''.
Both sets of tests are run as part of the normal jstests/jsreftest suite if the <code>Intl</code> object is present. For Test402 tests, this condition is defined by an entry in {{source|js/src/tests/jstests.list}}; for Intl tests by normal jstests comments in the individual test files. Conformance tests that the SpiderMonkey implementation doesn't pass yet are disabled by additional entries in {{source|js/src/tests/jstests.list}}.


As we may not have fully-correctly implemented ECMA-402 at any point, or a bug might be found before a test is committed, we require a mechanism to mark an ECMA-402 test as failing without requiring that marking be upstreamed. '''XXX what is this mechanism?'''
When writing new tests, consider whether they verify conformance with the ECMA-402 specification or implementation-dependent behavior:
* Conformance tests should be written in [http://wiki.ecmascript.org/doku.php?id=test262:test_case_format Test262 format] and [http://wiki.ecmascript.org/doku.php?id=test262:submission_process contributed to Test402] (Test402 uses the harness and repository of [http://wiki.ecmascript.org/doku.php?id=test262:test262 Test262], the conformance test suite for the ECMAScript Language Specification). Mozilla has a contribution agreement with Ecma in place; the people authorized to make contributions are Brendan Eich, Dave Herman, Allen Wirfs-Brock, Jeff Walden, and Jason Orendorff.
* Tests for implementation-dependent behavior have to go into the {{source|js/src/tests/Intl}} directory.


In builds with ECMA-402 support disabled, these tests are skipped. '''XXX how?'''
See below for information on [[#Conformance_tests|updating the imported copy of the conformance tests]].


=== Implementation ===
=== Implementation ===


ECMA-402 currently exposes <code>Intl.Collator</code>, <code>Intl.DateTimeFormatter</code>, and <code>Intl.NumberFormatter</code> objects.  The spec also permits initializing an existing object as one of these, for a small wrinkle.  The fundamental ICU data structures providing the relevant functionality are <code>UCollator*</code>, <code>UNumberFormat*</code>, and <code>UDateFormat*</code>, opaque pointers all.  Instances are created using <code>u{col,num,date}_open</code>, passing in appropriate arguments.  For objects ''created'' by the constructor, the pointer is stored in a reserved slot as a private value.  For objects merely ''initialized'' by the constructor, the ICU data structures must be (inefficiently!) created anew every time.  (This difference should not be observable, except through performance-timing, because the only structures consulted to create the ICU structure are internal ones , operations on which aren't observable.)
ECMA-402 currently exposes <code>Intl.Collator</code>, <code>Intl.DateTimeFormat</code>, and <code>Intl.NumberFormat</code> objects.  The spec also permits initializing an existing object as one of these, for a small wrinkle.  The fundamental ICU data structures providing the relevant functionality are <code>UCollator*</code>, <code>UNumberFormat*</code>, and <code>UDateFormat*</code>, opaque pointers all.  Instances are created using <code>u{col,num,dat}_open</code>, passing in appropriate arguments.  For objects ''created'' by the constructor, the pointer is stored in a reserved slot as a private value.  For objects merely ''initialized'' by the constructor, the ICU data structures must be (inefficiently!) created anew every time.  (This difference should not be observable, except through performance-timing, because the only structures consulted to create the ICU structure are internal ones, operations on which aren't observable.)


Every object initialized as an Intl object has an associated set of internal properties.  In ECMA-402 these properties are represented using ES5's traditional double-bracket notation: <code><nowiki>[[calendar]]</nowiki></code>, <code><nowiki>[[initializedIntlObject]]</nowiki></code>, and so on.  The "ideal" means of implementing these properties would probably be ES6 private names, but they're not stable or well-understood enough to be specified yet (let alone implemented).  In the meantime we associate ECMA-402 internal properties with objects using a weak map.  Any object initialized as an <code>Intl</code> object has an internal <code><nowiki>[[initializedIntlObject]]</nowiki></code> property.  This is implemented by placing all such objects as keys in a weak map (<code>internalsMap</code> in <code>builtin/Intl.js</code>).  The corresponding value is an ''internals object''.
Every object initialized as an Intl object has an associated set of internal properties.  In ECMA-402 these properties are represented using ES5's traditional double-bracket notation: <code><nowiki>[[calendar]]</nowiki></code>, <code><nowiki>[[initializedIntlObject]]</nowiki></code>, and so on.  The "ideal" means of implementing these properties would probably be ES6 private names, but they're not stable or well-understood enough to be specified yet (let alone implemented).  In the meantime we associate ECMA-402 internal properties with objects using a weak map.  Any object initialized as an <code>Intl</code> object has an internal <code><nowiki>[[initializedIntlObject]]</nowiki></code> property.  This is implemented by placing all such objects as keys in a weak map (<code>internalsMap</code> in <code>builtin/Intl.js</code>).  The corresponding value is an ''internals object''.
Line 115: Line 119:
Checking whether an object has been initialized as an <code>Intl</code> object is encapsulated by the <code>isInitializedIntlObject</code> method in {{source|js/src/builtin/Intl.js}}.  The <code>getInternals</code> function in the same file is used to encapsulate weak map access to an internals object.  These methods ensure the weak map mechanism is only an implementation detail encoded in a very few places.
Checking whether an object has been initialized as an <code>Intl</code> object is encapsulated by the <code>isInitializedIntlObject</code> method in {{source|js/src/builtin/Intl.js}}.  The <code>getInternals</code> function in the same file is used to encapsulate weak map access to an internals object.  These methods ensure the weak map mechanism is only an implementation detail encoded in a very few places.


Internals objects are objects with null <code><nowiki>[[Prototype]]</nowiki></code>, with properties corresponding to the other internal properties on the object, named naturally -- "calendar", "initializedDateTimeFormat", and so on (no brackets).  Accessing any internal property is simply a matter of doing <code>internals.calendar</code>: this is safe because, with the <code><nowiki>[[Prototype]]</nowiki></code> nulled out, property accesses can't touch any script-visible state.  Internal properties are added and set during the initialization process.  They are lazily consulted to construct an ICU structure when collation/formatting/etc. actually occurs in the <code>js::intl_CompareStrings</code>, <code>js::intl_FormatNumber</code>, and <code>js::intl_FormatDateTime</code> functions.  (Although not ''directly'' there, but rather in sub-methods called when the ICU structure isn't cached, or when the object was initialized as an <code>Intl</code> object but wasn't actually one -- see again the "inefficiently" bit above.)
Internals objects are objects with null <code><nowiki>[[Prototype]]</nowiki></code>, with properties corresponding to the other internal properties on the object, named naturally: "calendar", "initializedDateTimeFormat", and so on (no brackets).  Accessing any internal property is simply a matter of doing <code>internals.calendar</code>: this is safe because, with the <code><nowiki>[[Prototype]]</nowiki></code> nulled out, property accesses can't touch any script-visible state.  Internal properties are added and set during the initialization process.  They are lazily consulted to construct an ICU structure when collation/formatting/etc. actually occurs in the <code>js::intl_CompareStrings</code>, <code>js::intl_FormatNumber</code>, and <code>js::intl_FormatDateTime</code> functions.  (Although not ''directly'' there, but rather in sub-methods called when the ICU structure isn't cached, or when the object was initialized as an <code>Intl</code> object but wasn't actually one; see again the "inefficiently" bit above.)
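The weak map arrangement described above can be sketched in ordinary JavaScript. This is an illustration under assumptions, not the code in <code>builtin/Intl.js</code>: the names <code>internalsMap</code>, <code>isInitializedIntlObject</code>, and <code>getInternals</code> come from the text, while <code>initializeIntlObject</code> and the error messages are made up for the example, and the real self-hosted code uses intrinsics rather than the script-visible <code>WeakMap</code>.

```javascript
// Maps each initialized object to its internals object.
const internalsMap = new WeakMap();

// Hypothetical initializer: creates the internals object and registers it.
function initializeIntlObject(obj) {
  if (internalsMap.has(obj)) {
    throw new TypeError("object is already initialized as an Intl object");
  }
  // A null [[Prototype]] means property lookups on the internals object
  // can never reach script-visible state such as Object.prototype.
  const internals = Object.create(null);
  internals.initializedIntlObject = true;
  internalsMap.set(obj, internals);
  return internals;
}

// Encapsulates the "has this been initialized?" check.
function isInitializedIntlObject(obj) {
  return internalsMap.has(obj);
}

// Encapsulates weak map access to the internals object.
function getInternals(obj) {
  const internals = internalsMap.get(obj);
  if (internals === undefined) {
    throw new TypeError("object is not an initialized Intl object");
  }
  return internals;
}

// Usage: internal properties are plain properties, named without brackets.
const formatter = {};
initializeIntlObject(formatter).calendar = "gregory";
const calendar = getInternals(formatter).calendar;
const internalsHavePrototype = Object.getPrototypeOf(getInternals(formatter)) !== null;
```

Keeping all map access behind these few functions is what lets the weak map remain an implementation detail.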
 
=== Known issues ===
 
ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent.  ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale.  Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported.  See <code>getNumberingSystems</code> in {{source|js/src/builtin/Intl.cpp}}.  If ICU ever provides more comprehensive information here, we should probably use it.
 
The ICU interface that exposes a locale's default numbering system (see above) is C++, which (see below) means it's not stable.  There's an [http://bugs.icu-project.org/trac/ticket/10039 issue] on file to add a C API for this.  Until that's implemented and we use it, be careful about ICU upgrades.
 
The means for representing internal properties ''may'' not be cross-global-correct.  Technically if I do <code>var obj = {}; Intl.Collator(obj); otherWindow.Intl.Collator(obj);</code> the second initialization should throw, because internal properties adhere to the object.  The current structuring of the weak map mechanism, however, uses one weak map per global.  So that example likely "succeeds" now, where it actually shouldn't.  This probably is unlikely to be simply stumbled upon, but it's an issue.  Recent self-hosting work may allow us to not clone the internals-mapping behavior into every global object that uses <code>Intl</code> stuff, which would solve this issue.  I (Jeff) should look into this at some point, and poke Till for review on a patch if it pans out (given he implemented the relevant self-hosting improvements).


=== Care and feeding of the Internationalization API ===
=== Care and feeding of the Internationalization API ===
Line 131: Line 127:
==== ICU ====
==== ICU ====


ICU has major releases once or twice a year, and minor releases as needed. Releases are announced on the [https://lists.sourceforge.net/lists/listinfo/icu-announce icu-announce mailing list]. Each release includes the latest versions of the CLDR locale data, the IANA time zone database, and the ISO 4217 currency data, so it's generally worth it for Mozilla to update its copy each time. As of April 2013, upgrades are unfortunately blocked by [http://bugs.icu-project.org/trac/ticket/10043 ICU bug 10043].
ICU has major releases once or twice a year, and minor releases as needed. Releases are announced on the [https://lists.sourceforge.net/lists/listinfo/icu-announce icu-announce mailing list]. Each release includes the latest versions of the CLDR locale data, the IANA time zone database, and the ISO 4217 currency data, so it's generally worth it for Mozilla to update its copy each time. As of April 2013, upgrades are unfortunately blocked by [http://bugs.icu-project.org/trac/ticket/10043 ICU bug 10043]. To import the latest version, use the {{source|intl/update-icu.sh}} script.


Bugs in ICU should be reported into the [http://bugs.icu-project.org/trac/ ICU bug database]. Bug fixes can be [http://site.icu-project.org/processes/contribute contributed]; as of April 2013, one contribution is in progress.
Bugs in ICU should be reported into the [http://bugs.icu-project.org/trac/ ICU bug database]. Bug fixes can be [http://site.icu-project.org/processes/contribute contributed]; as of April 2013, one contribution is in progress ({{bug|866359}}).


==== Language subtag registry ====
==== Language subtag registry ====


The IANA language subtag registry is updated around 4 times a year. Releases are announced on the [https://mm.icann.org/mailman/listinfo/ietf-languages-announcements ietf-languages-announcements mailing list]. Updates are usually not urgent. Changes require updating ICU and, if language tag mappings are involved, the mapping tables in IntlData.js.
The IANA language subtag registry is updated around 4 times a year. Releases are announced on the [https://mm.icann.org/mailman/listinfo/ietf-languages-announcements ietf-languages-announcements mailing list]. Updates are usually not urgent. Changes require updating ICU (see above) and, if language tag mappings are involved, the mapping tables in IntlData.js, using {{source|js/src/builtin/make_intl_data.py}}.


==== Time zone database ====
==== Time zone database ====
Line 147: Line 143:
* Changes to offsets and rules for existing time zones require updating the time zone data within ICU. For changes that Mozilla doesn't consider urgent, just wait for the next ICU release. For urgent changes, you can update just the [http://userguide.icu-project.org/datetime/timezone#TOC-Updating-the-Time-Zone-Data ICU time zone data].
* Changes to offsets and rules for existing time zones require updating the time zone data within ICU. For changes that Mozilla doesn't consider urgent, just wait for the next ICU release. For urgent changes, you can update just the [http://userguide.icu-project.org/datetime/timezone#TOC-Updating-the-Time-Zone-Data ICU time zone data].


* Changes to time zone names (new names or renames) require updating the time zone data within ICU and at the same time the time zone name mappings in IntlData.js (once bug 837961 has been implemented).
* Changes to time zone names (new names or renames) require updating the time zone data within ICU and at the same time the time zone name mappings in IntlData.js (once {{bug|837961}} has been implemented).


==== Currency changes ====
==== Currency list ====


The ISO 4217 currency code list is updated around twice a year. Releases are announced in a [http://www.currency-iso.org/en/home/amendments/newsletter.html newsletter]. Changes might be urgent, although politicians seem to understand the impact of changing currencies better than that of changing time zones. Changes require updating ICU and, if the minor unit value for a currency was/is different from 2, the currencyDigits table in Intl.js (IntlData.js once bug 843758 is fixed).
The ISO 4217 currency code list is updated around twice a year. Releases are announced in a [http://www.currency-iso.org/en/home/amendments/newsletter.html newsletter]. Changes might be urgent, although politicians seem to understand the impact of changing currencies better than that of changing time zones. Changes require updating ICU and, if the minor unit value for a currency was/is different from 2, the currencyDigits table in Intl.js (IntlData.js once {{bug|843758}} is fixed).
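The reason only currencies with a minor unit other than 2 need table entries can be seen in a small lookup sketch. The table below is an illustrative three-entry subset, not the real <code>currencyDigits</code> data, and the function is a stand-in written for this example.

```javascript
// Illustrative subset only: currencies whose ISO 4217 minor unit is not 2.
const currencyDigitsTable = new Map([
  ["BHD", 3],
  ["JPY", 0],
  ["KWD", 3],
]);

// Stand-in lookup: currencies missing from the table default to 2 digits.
function currencyDigits(code) {
  const key = code.toUpperCase();
  return currencyDigitsTable.has(key) ? currencyDigitsTable.get(key) : 2;
}

const yenDigits = currencyDigits("JPY");
const dollarDigits = currencyDigits("USD");
```

A currency that is added or changed with a minor unit of exactly 2 therefore needs no table update at all.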


=== Other random details ===
==== Conformance tests ====


....anything?...
The Test402 conformance test suite can be updated by Ecma members with contribution agreements at any time. Updates should be announced on the [https://mail.mozilla.org/listinfo/test262-discuss test262-discuss mailing list]. To import the updates, run the {{source|js/src/tests/update-test402.sh}} script.
 
=== Known issues ===
 
As of April 2013, most known issues in the SpiderMonkey implementation of the Internationalization API are referenced by {{bug|837963}}.
 
ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent.  ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale.  Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported.  See <code>getNumberingSystems</code> in {{source|js/src/builtin/Intl.cpp}}.  If ICU ever provides more comprehensive information here, we should probably use it.
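From script, the numbering system an implementation actually settled on can be observed through <code>resolvedOptions()</code>. The sketch below requests the <code>arab</code> numbering system through the Unicode extension; whether a given implementation honors it for a given locale is exactly the implementation-dependent question discussed above.

```javascript
// Request Arabic-Indic digits for an English locale via the -u-nu- key.
const requested = new Intl.NumberFormat("en-u-nu-arab");

// The numbering system the implementation agreed to use.
const resolved = requested.resolvedOptions().numberingSystem;

// Formatting uses the digits of the resolved numbering system.
const sample = requested.format(123);
```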
 
The ICU interface that exposes a locale's default numbering system (see above) is C++, which (see below) means it's not stable.  There's an [http://bugs.icu-project.org/trac/ticket/10039 issue] on file to add a C API for this.  Until that's implemented and we use it, be careful about ICU upgrades.
 
The means for representing internal properties ''may'' not be cross-global-correct.  Technically if I do <code>var obj = {}; Intl.Collator(obj); otherWindow.Intl.Collator(obj);</code> the second initialization should throw, because internal properties adhere to the object.  The current structuring of the weak map mechanism, however, uses one weak map per global.  So that example likely "succeeds" now, where it actually shouldn't.  This is unlikely to be stumbled upon by accident, but it's an issue.  Recent self-hosting work may allow us to not clone the internals-mapping behavior into every global object that uses <code>Intl</code> stuff, which would solve this issue.  I (Jeff) should look into this at some point, and poke Till for review on a patch if it pans out (given he implemented the relevant self-hosting improvements).
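The cross-global problem can be modeled without two real windows. In this sketch, <code>makeGlobal</code> is a hypothetical stand-in for a global object that carries its own private weak map, mirroring the one-map-per-global structure described above; it is not SpiderMonkey code.

```javascript
// Hypothetical stand-in for a global: each one owns a separate weak map.
function makeGlobal() {
  const internalsMap = new WeakMap();
  return {
    initialize(obj) {
      if (internalsMap.has(obj)) {
        throw new TypeError("object is already initialized");
      }
      internalsMap.set(obj, Object.create(null));
    },
  };
}

const thisWindow = makeGlobal();
const otherWindow = makeGlobal();
const obj = {};

thisWindow.initialize(obj);

// The other global's map has never seen obj, so this does not throw,
// even though the object has already been initialized once.
let secondInitThrew = false;
try {
  otherWindow.initialize(obj);
} catch (e) {
  secondInitThrew = true;
}

// Within a single global the check works as intended.
let repeatInitThrew = false;
try {
  thisWindow.initialize(obj);
} catch (e) {
  repeatInitThrew = true;
}
```

Sharing a single map across globals would make the second call throw as well.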