User:Waldo/Internationalization API

From MozillaWiki

Introduction

ECMAScript has long had rudimentary localization support. ES5 defines toLocaleString methods (found on various objects like Array.prototype, Number.prototype, and Date.prototype); toLocaleLowerCase and toLocaleUpperCase on String.prototype; and toLocaleDateString and toLocaleTimeString on Date.prototype. Each method acts only with respect to the user's current locale, and each method provides no control over output formatting. The spec algorithms are woefully under-defined. As a practical matter localization support in ES5 is useless.

The ECMAScript Internationalization API (standard ECMA-402, introduction) significantly extends these capabilities, giving ECMAScript genuinely useful localization support. Output can be customized by choosing which components to include and how each one is formatted. The locale used for a formatting operation is customizable, and output formatting is intelligently determined in accordance with the locale. The API additionally provides comparison functions useful for locale-sensitively sorting strings, according to how the sorted list is used (for example, sorting names in phone book order, versus sorting them in dictionary order), considering or ignoring capitalization, accents, and so on.

The Internationalization API introduces one new global property: Intl. This property is an object with various properties corresponding to various sub-APIs: collation (sorting), number formatting, and date/time formatting. (More capabilities will be added in future Internationalization API updates.) The localization APIs from ES5 have been reformulated to use the localization capabilities provided by the Internationalization API. Generally, however, it's preferable to use the Internationalization API directly: a collator or formatter object can cache the structures needed to perform each operation, so repeated operations are more efficient.
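As a rough illustration of the caching point, the sketch below formats the same numbers once through a reusable Intl.NumberFormat object and once through the reformulated toLocaleString. The sample values and variable names are made up for the example, and passing a locale argument to toLocaleString assumes an engine that implements the ECMA-402 reformulation.

```javascript
// Sample data; the values are arbitrary.
const prices = [1234.5, 99, 1000000];

// Create the formatter once. The engine can keep whatever locale data
// and ICU structures it needs attached to this object.
const formatter = new Intl.NumberFormat("en-US");
const viaIntl = prices.map((n) => formatter.format(n));

// The reformulated ES5 method yields the same strings, but the locale
// and options have to be resolved again on every call.
const viaLegacy = prices.map((n) => n.toLocaleString("en-US"));

console.log(viaIntl);   // [ '1,234.5', '99', '1,000,000' ]
console.log(viaLegacy); // same strings as viaIntl
```

Whether the second form is measurably slower depends on the engine; the point is only that the first form gives the implementation an obvious place to cache.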

Concepts

The Internationalization API relies on a number of other standards to define identifiers for languages/locales, currencies, and time zones. Full details should generally be looked up there: ECMA-402 defines most underlying concepts only by reference.

Language tags

Language tags are defined in BCP 47: a living aggregation of a set of RFCs (the set may change over time as RFCs in the set are obsoleted and replaced) specifying their structure, their default interpretation, and extension mechanisms. The standard is complemented by the IANA Language Subtag Registry.

Every operation is performed in terms of locales, specified as language tags: en-US, nan-Hant-TW, und, and so on. The main components of a language tag are the language and optionally a script, region, and variations that might exist within these. An extension component may follow, permitting inclusion of extra structured data (usually to contextualize a use of the language tag). Finally, an optional private-use component may include arbitrary data (this is for the use of webpages — not for SpiderMonkey's use). All components are alphanumeric and case-insensitive ASCII. The components are joined into a language tag using hyphens; individual components can be distinguished by length and internal syntax (length, prefix, etc.). The precise details of language tag structure are quite complex, and they include a list of irregular forms for legacy compatibility reasons. See BCP 47 for all the gory details.
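From script, the structural checks and case normalization can be observed roughly as follows. This is a minimal sketch: resolveTag is a hypothetical helper written for this example, and it assumes the engine has data for en-US.

```javascript
// Hypothetical helper: ask a collator which locale it resolved to.
// resolvedOptions().locale reports the tag in canonical form.
function resolveTag(tag) {
  return new Intl.Collator(tag).resolvedOptions().locale;
}

// Language tags are case-insensitive; the canonical form uses a
// lowercase language and an uppercase region.
const canonical = resolveTag("EN-us");
console.log(canonical); // "en-US"

// Components are joined with hyphens, so an underscore makes the tag
// structurally invalid and the constructor throws a RangeError.
let rejected = false;
try {
  resolveTag("en_US");
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true
```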

One particular subcomponent worth noting specifically is the Unicode extension component, living within the extension component. The Unicode extension component has the basic form "-u(-[a-z0-9]{2,8})+", with precise details in RFC 6067. The Unicode component permits specifying additional details about sort order, numeric system, calendar system, and others.

SpiderMonkey verifies structural validity of language tags and brings them into a canonical form but generally doesn't interpret the components of a language tag. Instead SpiderMonkey passes the tag to ICU, and ICU interprets the components.

One exception is for old-style language tags: a small set of tags for languages that, according to BCP 47, don't have a default script, but that are commonly used without the script code because older standards (RFC 1766 and RFC 3066) didn't recognize script codes. The implementation maps such language tags to their modern equivalents so that they can be found in the lists of available locales provided by ICU.

Another exception is when determining language fallback, as required by ECMA-402. When a requested language isn't supported, the language tag's internal structure implicitly encodes a list of fallbacks to try instead. For example, the tag en-US suggests a fallback to en.
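The fallback idea can be sketched as repeatedly dropping the last subtag until a supported tag is found. The function below is written for illustration in the spirit of ECMA-402's best-available-locale lookup; it is not SpiderMonkey's code, and the list of available locales is invented.

```javascript
// Return the longest prefix of `locale` (cut at subtag boundaries) that
// appears in `availableLocales`, or undefined if there is none.
function bestAvailableLocale(availableLocales, locale) {
  let candidate = locale;
  while (true) {
    if (availableLocales.includes(candidate)) {
      return candidate;
    }
    let pos = candidate.lastIndexOf("-");
    if (pos === -1) {
      return undefined;
    }
    // If the remaining tail would be a single-character subtag
    // (such as the "u" that introduces an extension), drop it too.
    if (pos >= 2 && candidate[pos - 2] === "-") {
      pos -= 2;
    }
    candidate = candidate.substring(0, pos);
  }
}

// Invented list of supported locales, for demonstration only.
const available = ["en", "de", "zh-Hant"];
console.log(bestAvailableLocale(available, "en-US"));      // "en"
console.log(bestAvailableLocale(available, "zh-Hant-TW")); // "zh-Hant"
console.log(bestAvailableLocale(available, "fr-FR"));      // undefined
```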

The last exception is that, again as required by ECMA-402, SpiderMonkey removes the Unicode extension component of a language tag from the base language tag during processing. The key-value pairs in the Unicode extension are compared separately against the feature set supported by the language found, and the language tag sans Unicode extension is used by ICU after the feature set is determined.
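Separating the Unicode extension from the base tag might look something like the sketch below. It uses the basic pattern quoted earlier, splitUnicodeExtension is a hypothetical name, and the sketch makes no attempt to handle private-use sections or irregular tags.

```javascript
// Basic form of the Unicode extension sequence. Matching ignores case
// because language tags are case-insensitive.
const UNICODE_EXTENSION = /-u(-[a-z0-9]{2,8})+/i;

// Hypothetical helper: split a tag into the part without the Unicode
// extension and the extension sequence itself ("" if there is none).
function splitUnicodeExtension(tag) {
  const match = UNICODE_EXTENSION.exec(tag);
  if (match === null) {
    return { base: tag, extension: "" };
  }
  const end = match.index + match[0].length;
  return {
    base: tag.slice(0, match.index) + tag.slice(end),
    extension: match[0],
  };
}

const parts = splitUnicodeExtension("de-DE-u-co-phonebk");
console.log(parts.base);      // "de-DE"
console.log(parts.extension); // "-u-co-phonebk"
console.log(splitUnicodeExtension("en-US").base); // "en-US"
```

The base tag is what gets matched against the available locales; the key-value pairs in the extension (here co and phonebk) are checked separately against what the chosen locale supports.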

Currency codes

Formatting a number to display as currency depends upon the particular currency used, so currency codes play a role in number formatting. For example, one correct formatting for one hundred dollars USD is "$100.00" (two decimal places), while one hundred Japanese yen would be "¥100" (no decimal places). Additional characteristics determined by currency, besides decimal place count, include the currency symbol and a "long" name ("US dollar" and so on). Currency codes are three letters, traditionally capitalized. The ISO 4217 maintenance agency publishes the full list.
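The dollar and yen examples can be reproduced with Intl.NumberFormat along these lines. This is a small sketch: formatCurrency is a helper invented for the example, and the en-US locale is an arbitrary choice.

```javascript
// Hypothetical helper: format an amount in the given currency, using
// en-US conventions for digits and separators.
function formatCurrency(amount, currency) {
  return new Intl.NumberFormat("en-US", {
    style: "currency",
    currency: currency,
  }).format(amount);
}

const usd = formatCurrency(100, "USD");
const jpy = formatCurrency(100, "JPY");
console.log(usd); // "$100.00"
console.log(jpy); // "¥100"

// The number of decimal places comes from the currency, not the locale.
const jpyDigits = new Intl.NumberFormat("en-US", {
  style: "currency",
  currency: "JPY",
}).resolvedOptions().minimumFractionDigits;
console.log(jpyDigits); // 0
```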

Time zone names

Date and time formatting requires knowledge of the time zone. Time zone names are specified by the IANA Time Zone Database.
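A short sketch of how the time zone affects output follows. UTC is the one name every implementation has to accept; the use of "Asia/Tokyo" assumes an engine that accepts IANA names in the timeZone option, and "Mars/Olympus_Mons" is simply a made-up name that no database contains.

```javascript
// 2012-09-24 23:30 UTC, which is already the next day in Tokyo.
const instant = new Date(Date.UTC(2012, 8, 24, 23, 30));

const components = { year: "numeric", month: "long", day: "numeric" };

const inUtc = new Intl.DateTimeFormat("en-US", {
  ...components,
  timeZone: "UTC",
}).format(instant);
console.log(inUtc); // "September 24, 2012"

// Assumes the engine supports IANA time zone names here.
const inTokyo = new Intl.DateTimeFormat("en-US", {
  ...components,
  timeZone: "Asia/Tokyo",
}).format(instant);
console.log(inTokyo); // "September 25, 2012"

// An unknown time zone name is rejected with a RangeError.
let unknownRejected = false;
try {
  new Intl.DateTimeFormat("en-US", { timeZone: "Mars/Olympus_Mons" });
} catch (e) {
  unknownRejected = e instanceof RangeError;
}
console.log(unknownRejected); // true
```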

Operations

ECMA-402 in its first iteration exposes various locale-sensitive operations. Future editions will likely expose more operations.

Collation

Collation is the process of sorting a list of strings, according to particular rules. Different locales sort text differently. Locales may also sort differently in different contexts: dictionary sort order versus phonebook sort order, say. Sorting also may or may not take into account the numeric value of strings: ["1", "30", "5"] order versus ["1", "5", "30"] order.
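For example, the numeric distinction and the option to ignore case and accents can be seen with Intl.Collator roughly like this (a minimal sketch; the sample strings are arbitrary):

```javascript
const items = ["30", "1", "5"];

// Default collation compares the strings character by character.
const plainOrder = items.slice().sort(new Intl.Collator("en").compare);
console.log(plainOrder); // [ '1', '30', '5' ]

// With numeric collation, runs of digits compare by numeric value.
const numericOrder = items
  .slice()
  .sort(new Intl.Collator("en", { numeric: true }).compare);
console.log(numericOrder); // [ '1', '5', '30' ]

// "base" sensitivity ignores differences in case and accents.
const baseOnly = new Intl.Collator("en", { sensitivity: "base" });
const sameBase = baseOnly.compare("resume", "Résumé");
console.log(sameBase); // 0

// A context-specific order is requested through the language tag.
// Whether phonebook order is actually available depends on locale data.
const phonebook = new Intl.Collator("de-u-co-phonebk");
console.log(typeof phonebook.compare); // "function"
```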

Number formatting

Number formatting is simple if one only wishes for a decimal format. But in many contexts, simple decimal formatting is undesirable. Numbers displayed as currencies, percents, and decimals will display differently in different locales. Currencies pose additional problems, as different currencies format fractional values to different numbers of decimal places. And for sufficiently large numbers, grouping separators (in en-US, at thousands places as a comma; in many European locales, at thousands places as periods; other, more exotic forms exist) may be desirable. ECMA-402 permits customization of all these formatting choices.
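A few of these choices, sketched with Intl.NumberFormat. The sample value is arbitrary, and the German output is only what one would expect from an engine that ships German locale data.

```javascript
const value = 1234567.891;

// Plain decimal formatting with en-US grouping separators.
const grouped = new Intl.NumberFormat("en-US").format(value);
console.log(grouped); // "1,234,567.891"

// Grouping can be switched off.
const ungrouped = new Intl.NumberFormat("en-US", {
  useGrouping: false,
}).format(value);
console.log(ungrouped); // "1234567.891"

// The number of fraction digits can be pinned.
const twoPlaces = new Intl.NumberFormat("en-US", {
  minimumFractionDigits: 2,
  maximumFractionDigits: 2,
}).format(value);
console.log(twoPlaces); // "1,234,567.89"

// Percent style multiplies by 100 and appends the locale's percent sign.
const percent = new Intl.NumberFormat("en-US", { style: "percent" }).format(0.256);
console.log(percent); // "26%"

// Same value, German conventions (requires German locale data).
console.log(new Intl.NumberFormat("de-DE").format(value));
```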

Date formatting

Locale-sensitive date formatting in ES5 admits only a single implementation-defined format with a fixed set of components. ECMA-402 enhances date formatting to allow the selection of components to be customized (month, day, and year, for example). The way in which these components will be displayed in the final format string is locale-dependent, as different locales write out dates in different ways. (For example, a date might be formatted as "September 24, 2012" for en-US and as "24 Sept. 2012" for fr-FR.) Moreover, various styles may be chosen for the components included in the format: "narrow", "short", "long", "2-digit", and "numeric". (Exactly which of these styles are supported and what they look like is also implementation-dependent.) These styles feed into the final computation of an appropriate pattern to use to generate a final string.
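Component selection and styles might be exercised like this. It is a sketch only: the time zone is pinned to UTC so the result doesn't depend on the machine running it, and the fr-FR line is printed without a stated result because the exact abbreviation depends on the locale data in use.

```javascript
const date = new Date(Date.UTC(2012, 8, 24, 12, 0));

// Month, day, and year, with the month spelled out.
const longForm = new Intl.DateTimeFormat("en-US", {
  year: "numeric", month: "long", day: "numeric", timeZone: "UTC",
}).format(date);
console.log(longForm); // "September 24, 2012"

// The same components in two-digit style.
const shortForm = new Intl.DateTimeFormat("en-US", {
  year: "2-digit", month: "2-digit", day: "2-digit", timeZone: "UTC",
}).format(date);
console.log(shortForm); // "09/24/12"

// Only month and year were requested, so the day is left out.
const monthYear = new Intl.DateTimeFormat("en-US", {
  year: "numeric", month: "short", timeZone: "UTC",
}).format(date);
console.log(monthYear); // "Sep 2012"

// Same components, French conventions (output depends on locale data).
console.log(new Intl.DateTimeFormat("fr-FR", {
  year: "numeric", month: "short", day: "numeric", timeZone: "UTC",
}).format(date));
```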

Internationalization in SpiderMonkey

SpiderMonkey includes significant support for the Internationalization API. The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU. This is an optional component of a SpiderMonkey build; support may be turned on using the --enable-intl-api configuration option. Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, to prevent bit rot; ICU interfaces of note are stubbed out in this configuration with methods that do nothing but assert. The most important differences in a SpiderMonkey build without Internationalization are that the Intl object isn't added to global objects and the legacy toLocale*String methods are implemented using SpiderMonkey's old JSLocaleCallbacks interface. Features and capabilities of the Internationalization API itself are implemented in both C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.

The Internationalization API is enabled by default in SpiderMonkey when embedded in desktop Firefox builds.

We anticipate that over time the Internationalization API will be enabled by default in all SpiderMonkey builds, possibly with different underlying implementations to support resource-constrained environments.

Code organization

ICU

International Components for Unicode is a library implementing collation, formatting, and other locale-sensitive functions. It provides the underlying functionality used in implementing Internationalization. ICU is imported in intl/icu.

ICU's source code is relatively huge and sprawling: hardly surprising for a 15+ year old project supporting the complete Unicode character set, hundreds of locales, and core operating system functionality for IBM, Google, and Apple. The ICU team provides API documentation, a user guide with detailed introductions to all its functionality, mailing lists for support and announcements, and a bug database.

ICU provides both C and C++ APIs, but the only stable interfaces are C APIs marked as stable. (C++ APIs are considered uniformly unstable. This extends even to interfaces defined entirely in public ICU headers, such as ICU smart pointers.) Given that some people reasonably want to use SpiderMonkey with a system ICU, this means we're generally limited to only the stable C API. (In one case we have to use the C++ API to access functionality; see known issues below.) Unfortunately, this also means we have to hand-roll our own smart pointer for managing ICU resources.

Most ICU functions indicate errors through an error code outparam, and they check the existing value in that outparam before proceeding: if it already records a failure, the call does nothing. Thus a sequence of ICU calls can occur without error-checking right up until the end, where a single U_FAILURE(status) will suffice to handle all errors that might occur. For example:

ucol_setAttribute(coll, UCOL_STRENGTH, uStrength, &status);
ucol_setAttribute(coll, UCOL_CASE_LEVEL, uCaseLevel, &status);
ucol_setAttribute(coll, UCOL_ALTERNATE_HANDLING, uAlternate, &status);
ucol_setAttribute(coll, UCOL_NUMERIC_COLLATION, uNumeric, &status);
ucol_setAttribute(coll, UCOL_NORMALIZATION_MODE, uNormalization, &status);
ucol_setAttribute(coll, UCOL_CASE_FIRST, uCaseFirst, &status);
if (U_FAILURE(status)) {
    ucol_close(coll);
    JS_ReportErrorNumber(cx, js_GetErrorMessage, NULL, JSMSG_INTERNAL_INTL_ERROR);
    return NULL;
}

Integration

The Intl object is integrated into the global object through code in js/src/builtin/Intl.cpp and js/src/builtin/Intl.h. js_InitIntlClass performs this operation when it's called during global object bootstrapping, in concert with various other initialization methods in the same file and in js/src/vm/GlobalObject.cpp. There's some particular trickiness here, as the various Intl.* constructors aren't global classes, yet need to participate in the reserved-slot constructor/prototype system used by Object, Function, Array, and so on to implement "using the original value of Intl.Collator.prototype" and "as if by new Intl.NumberFormat()" and similar.

Self-hosted code

The majority of the self-hosted code implementing Internationalization is in js/src/builtin/Intl.js. This file defines the functions exposed on the various Intl.* constructor functions and the various Intl.*.prototype objects.

Internationalization in various cases requires keeping around large data tables: to record the minor units of supported currency codes, to record language tag mappings, and so on. Some of this data lives in js/src/builtin/IntlData.js and is generated by js/src/builtin/make_intl_data.py. This script downloads original (large) plaintext databases, parses them, and extracts in the proper format the data used by Internationalization. Updating this static data — which should happen any time the underlying databases receive an update — should be as simple as rerunning the script. See Care and feeding of the Internationalization API.

Intrinsic functions

Self-hosted code calls into various intrinsics to access ICU functionality. The full list of Internationalization intrinsics is (necessarily, at the moment — this will probably change eventually) in js/src/vm/SelfHosting.cpp, but the intrinsics themselves are documented in js/src/builtin/Intl.h and implemented in js/src/builtin/Intl.cpp.

Natively-implemented functions

All the constructor functions are implemented in C++ in js/src/builtin/Intl.cpp. These need to call into enough C++ code to create the necessary ICU data structures for it to be worth keeping them in C++.

Tests

Tests live in two directories:

  • js/src/tests/test402 has an unmodified import of Test402, the ECMA-402 conformance test suite, supplemented by jstests-specific browser.js and shell.js files.
  • js/src/tests/Intl has additional implementation-dependent tests, especially for the integration with the ICU library. They verify aspects that are not specified in ECMA-402 but matter for actual use, such as which locales are supported and whether the functions actually exhibit properly localized behavior.

Both sets of tests are run as part of the normal jstests/jsreftest suite if the Intl object is present. For Test402 tests, this condition is defined by an entry in js/src/tests/jstests.list; for Intl tests by normal jstests comments in the individual test files. Conformance tests that the SpiderMonkey implementation doesn't pass yet are disabled by additional entries in js/src/tests/jstests.list.

When writing new tests, consider whether they verify conformance with the ECMA-402 specification or implementation-dependent behavior:

  • Conformance tests should be written in Test262 format and contributed to Test402 (Test402 uses the harness and repository of Test262, the conformance test suite for the ECMAScript Language Specification). Mozilla has a contribution agreement with Ecma in place; Brendan Eich, Dave Herman, Allen Wirfs-Brock, Jeff Walden, and Jason Orendorff are authorized to make contributions.
  • Tests for implementation-dependent behavior have to go into the js/src/tests/Intl directory.

See below for information on updating the imported copy of the conformance tests.

Implementation

ECMA-402 currently exposes Intl.Collator, Intl.DateTimeFormat, and Intl.NumberFormat objects. The spec also permits initializing an existing object as one of these, which adds a small wrinkle. The fundamental ICU data structures providing the relevant functionality are UCollator*, UNumberFormat*, and UDateFormat*, opaque pointers all. Instances are created using ucol_open, unum_open, and udat_open respectively, passing in appropriate arguments. For objects created by the constructor, the pointer is stored in a reserved slot as a private value. For objects merely initialized by the constructor, the ICU data structures must be (inefficiently!) created anew every time. (This difference should not be observable, except through performance-timing, because the only structures consulted to create the ICU structure are internal ones, operations on which aren't observable.)

Every object initialized as an Intl object has an associated set of internal properties. In ECMA-402 these properties are represented using ES5's traditional double-bracket notation: [[calendar]], [[initializedIntlObject]], and so on. The "ideal" means of implementing these properties would probably be ES6 private names, but they're not stable or well-understood enough to be specified yet (let alone implemented). In the meantime we associate ECMA-402 internal properties with objects using a weak map. Any object initialized as an Intl object has an internal [[initializedIntlObject]] property. This is implemented by placing all such objects as keys in a weak map (internalsMap in builtin/Intl.js). The corresponding value is an internals object.

Checking whether an object has been initialized as an Intl object is encapsulated by the isInitializedIntlObject function in js/src/builtin/Intl.js. The getIntlObjectInternals and (less preferred) getInternals functions in the same file encapsulate weak map access to an internals object. Together these functions ensure the weak map mechanism is only an implementation detail encoded in a very few places.

Internals objects are objects with null [[Prototype]] and the properties type, lazyData, and internalProps. This structure permits internals objects to be lazily initialized. Initially, type is "partial"; lazy initialization changes this to "Collator", "DateTimeFormat", or "NumberFormat" and sets lazyData to the information necessary to compute full initialization info; finally, first use fully initializes, converting lazyData into an internalProps object containing the actual ECMA-402-defined internal properties. (For more details on this scheme, see initializeIntlObject and adjacent functions, as well as the class-specific initialization methods, in js/src/builtin/Intl.js.)

The internalProps object stores the internal properties (other than [[initializedIntlObject]]) of the object, named naturally without brackets: "calendar", "initializedDateTimeFormat", and so on. Accessing any internal property is simply a matter of doing internals.calendar: this is safe because, with the [[Prototype]] nulled out, property accesses can't touch any script-visible state. These internal properties are lazily computed and then used to construct an ICU structure when collation/formatting/etc. actually occurs in the js::intl_CompareStrings, js::intl_FormatNumber, and js::intl_FormatDateTime functions. (Strictly speaking, the construction happens not in those functions themselves but in sub-methods they call when the ICU structure isn't cached, or when the object was initialized as an Intl object but wasn't actually one; see again the "inefficiently" bit above.)
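To make the weak-map-plus-lazy-initialization scheme concrete, here is a simplified sketch. It is not the code in js/src/builtin/Intl.js: the names internalsMap, initializeIntlObject, isInitializedIntlObject, and getInternals are borrowed from the description above, but their bodies here are guesses at the general shape, and setLazyData and resolveCollatorInternals are helpers invented for this example.

```javascript
// Maps each initialized object to its internals object.
const internalsMap = new WeakMap();

// Mark `obj` as initialized and give it a "partial" internals object.
function initializeIntlObject(obj) {
  const internals = Object.create(null); // null [[Prototype]]
  internals.type = "partial";
  internals.lazyData = null;
  internals.internalProps = null;
  internalsMap.set(obj, internals);
  return internals;
}

function isInitializedIntlObject(obj) {
  return internalsMap.has(obj);
}

// Invented helper: record the type and the data needed to finish later.
function setLazyData(internals, type, lazyData) {
  internals.type = type;
  internals.lazyData = lazyData;
}

// Invented helper standing in for the real (much larger) resolution step.
function resolveCollatorInternals(lazyData) {
  const props = Object.create(null);
  props.locale = lazyData.requestedLocales[0] || "und";
  props.usage = lazyData.usage || "sort";
  return props;
}

// First use converts lazyData into internalProps; later uses reuse it.
function getInternals(obj) {
  const internals = internalsMap.get(obj);
  if (internals.internalProps === null) {
    internals.internalProps = resolveCollatorInternals(internals.lazyData);
    internals.lazyData = null;
  }
  return internals.internalProps;
}

const collatorLike = {};
setLazyData(initializeIntlObject(collatorLike), "Collator", {
  requestedLocales: ["en-US"],
  usage: "search",
});
console.log(isInitializedIntlObject(collatorLike)); // true
console.log(isInitializedIntlObject({}));           // false
console.log(getInternals(collatorLike).usage);      // "search"
```

Because both the internals object and the properties object have a null prototype, reading a property from them never consults Object.prototype, so script can't interfere with the lookup.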

Care and feeding of the Internationalization API

The behavior of the Internationalization API reflects the real world in various ways, the real world changes, and so the implementation needs to be updated from time to time.

ICU

ICU has major releases once or twice a year, and minor releases as needed. Releases are announced on the icu-announce mailing list. Each release includes the latest versions of the CLDR locale data, the IANA time zone database, and the ISO 4217 currency data, so it's generally worth it for Mozilla to update its copy each time. To import the latest version, use the intl/update-icu.sh script. Doing so will likely require updating Mozilla's set of local ICU patches -- a tedious process the burden of which we attempt to minimize by upstreaming patches whenever possible (and only patching locally with good reason).

Bugs in ICU should be reported into the ICU bug database. Bug fixes can be contributed; as of February 2014, one contribution is in progress (bug 866359).

Language subtag registry

The IANA language subtag registry is updated around 4 times a year. Releases are announced on the ietf-languages-announcements mailing list. Updates are usually not urgent. Changes require updating ICU (see above) and, if language tag mappings are involved, the mapping tables in IntlData.js, using js/src/builtin/make_intl_data.py.

Time zone database

The IANA time zone database is updated around 20 times a year. Releases are announced on the tz-announce mailing list. Many of these updates are quite urgent for the countries affected, because politicians in a number of countries like to tinker with time zone rules on short notice without thinking about the impact.

There are two kinds of changes:

  • Changes to offsets and rules for existing time zones require updating the time zone data within ICU. For changes that Mozilla doesn't consider urgent, just wait for the next ICU release. For urgent changes, you can update just the ICU time zone data.
  • Changes to time zone names (new names or renames) require updating the time zone data within ICU and at the same time the time zone name mappings in IntlData.js (once bug 837961 has been implemented).

Currency list

The ISO 4217 currency code list is updated around twice a year. Releases are announced in a newsletter. Changes might be urgent, although politicians seem to understand the impact of changing currencies better than that of changing time zones. Changes require updating ICU and, if the minor unit value for a currency was/is different from 2, the currencyDigits table in Intl.js (IntlData.js once bug 843758 is fixed).
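The idea behind a table that only lists exceptions to the two-digit default can be sketched as follows. The four entries are a small sample of ISO 4217 minor units, not the real table, and getCurrencyDigits is a hypothetical helper rather than the function in Intl.js.

```javascript
// Sample entries only: currencies whose minor unit is not 2.
const currencyDigits = {
  BHD: 3,
  CLP: 0,
  JPY: 0,
  KWD: 3,
};

// Hypothetical lookup: fall back to 2 for anything not in the table.
function getCurrencyDigits(currency) {
  const code = currency.toUpperCase();
  return Object.prototype.hasOwnProperty.call(currencyDigits, code)
    ? currencyDigits[code]
    : 2;
}

console.log(getCurrencyDigits("JPY")); // 0
console.log(getCurrencyDigits("bhd")); // 3
console.log(getCurrencyDigits("USD")); // 2
```

With this layout, a change to a currency whose minor unit is 2 both before and after the change needs no edit to the table, which is why only the other cases call for an update.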

Conformance tests

The Test402 conformance test suite can be updated by Ecma members with contribution agreements at any time. Updates should be announced on the test262-discuss mailing list. To import the updates, run the js/src/tests/update-test402.sh script.

SpiderMonkey won't always pass all official tests, so a mechanism for marking tests as failing is needed. The mechanism by which JS tests are run is to generate test lists, then process extra jstests.list files already present and merge in their changes. The resulting data determines what tests will be packaged up, to be run as J builds on tinderbox. Thus to skip internationalization tests that fail, we list and skip them in js/src/tests/jstests.list. One benefit of this is that we can fix such failures without having to rerun the import script and without having to change failing tests themselves. (Arguably jstests.list used this way is a kludge, to be sure. But it's not too bad as hacks go.)

Known issues

As of April 2013, most known issues in the SpiderMonkey implementation of the Internationalization API are referenced by bug 837963.

ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent. ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale. Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported. See getNumberingSystems in js/src/builtin/Intl.cpp. If ICU ever provides more comprehensive information here, we should probably use it.
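From script, the numbering system in effect can be inspected and overridden through the Unicode extension, roughly as below. This sketch assumes that "arab" (Arabic-Indic digits) is among the numbering systems the engine accepts for en-US.

```javascript
// The default numbering system for en-US is Latin digits.
const defaultSystem = new Intl.NumberFormat("en-US")
  .resolvedOptions().numberingSystem;
console.log(defaultSystem); // "latn"

// Request a different numbering system with the "nu" key.
const arabicDigits = new Intl.NumberFormat("en-US-u-nu-arab");
const requestedSystem = arabicDigits.resolvedOptions().numberingSystem;
console.log(requestedSystem); // "arab"

const formatted = arabicDigits.format(123);
console.log(formatted); // "١٢٣"
```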

The ICU interface that exposes a locale's default numbering system (see above) is C++, which (as noted in the ICU section under Code organization) means it's not stable. There's an issue on file to add a C API for this. Until that's implemented and we use it, be careful about ICU upgrades.

The means for representing internal properties may not be cross-global-correct. Technically if I do var obj = {}; Intl.Collator(obj); otherWindow.Intl.Collator(obj); the second initialization should throw, because internal properties adhere to the object. The current structuring of the weak map mechanism, however, uses one weak map per global. So that example likely "succeeds" now, where it actually shouldn't. This is unlikely to be stumbled upon by accident, but it's still an issue. Recent self-hosting work may allow us to not clone the internals-mapping behavior into every global object that uses Intl stuff, which would solve this issue. I (Jeff) should look into this at some point, and poke Till for review on a patch if it pans out (given he implemented the relevant self-hosting improvements).

The size of ICU, especially its impact on download size, is an issue for Mozilla distribution. A document that Norbert wrote during the discussion of including ICU in desktop Firefox is somewhat out of date, but summarizes and comments on a number of ideas for reducing the size of ICU: Implementation Options for ECMAScript Internationalization API in SpiderMonkey.