User:Waldo/Internationalization API

Revision as of 23:36, 16 April 2013 by Waldo

Introduction

ECMAScript has long had rudimentary localization support. ES5 defines toLocaleString methods (found on various objects like Array.prototype, Number.prototype, and Date.prototype); toLocaleLowerCase and toLocaleUpperCase on String.prototype; and toLocaleDateString and toLocaleTimeString on Date.prototype. Each method acts only with respect to the user's current locale, and none provides any control over output formatting. The spec algorithms are woefully under-defined. As a practical matter, localization support in ES5 is useless.

The ECMAScript Internationalization API (ECMA-402) significantly extends these capabilities, providing genuinely useful means of localization to ECMAScript. Output may be customized by requesting which components to include and how each should be formatted. The locale used for a formatting operation is customizable, and output formatting is intelligently determined in accordance with the locale. The API additionally provides means of locale-sensitively sorting data, according to the type of that data (for example, sorting names in phone book order, versus sorting them in dictionary order), considering or ignoring capitalization, accents, and so on.
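
As a rough illustration of the capabilities described above, the sketch below formats a number and a date for an explicitly chosen locale and compares strings with collators of different sensitivity. The locales, option values, sample strings, and variable names are chosen purely for illustration; exact output strings depend on the locale data available to the engine.

```javascript
// Number formatting: the caller picks the locale and the presentation.
const priceFormat = new Intl.NumberFormat("en-US", {
  style: "currency",
  currency: "USD",
});
const formattedPrice = priceFormat.format(1234.5);

// Date formatting: the options say which components appear and in what form.
const dateFormat = new Intl.DateTimeFormat("en-US", {
  year: "numeric",
  month: "long",
  day: "numeric",
  timeZone: "UTC",
});
const formattedDate = dateFormat.format(new Date(Date.UTC(2013, 3, 16)));

// Collation: "base" sensitivity ignores both case and accents, while
// "accent" sensitivity ignores case but still distinguishes accents.
const ignoreCaseAndAccents = new Intl.Collator("en", { sensitivity: "base" });
const ignoreCaseOnly = new Intl.Collator("en", { sensitivity: "accent" });

const sameLetters = ignoreCaseAndAccents.compare("resume", "Résumé") === 0;
const accentsDiffer = ignoreCaseOnly.compare("resume", "Résumé") !== 0;

// A collator's compare function is bound to the collator, so it can be
// passed straight to Array.prototype.sort.
const sortedNames = ["Zoe", "adam", "Émile", "bob"].sort(
  ignoreCaseAndAccents.compare
);

console.log(formattedPrice, formattedDate, sameLetters, accentsDiffer, sortedNames);
```

Passing the locale explicitly, rather than relying on the user's current locale, is what distinguishes these calls from the ES5 methods.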

The Internationalization API introduces one new global property: Intl. This property is an object whose properties correspond to the various sub-APIs: collation (sorting), number formatting, and date/time formatting. (More capabilities will be added in future Internationalization API updates.) The localization APIs from ES5 have been reformulated to use the localization capabilities provided by the Internationalization API. Generally, however, it's preferable to use the Internationalization API directly: this is more efficient because it permits caching the structures needed to perform each operation.
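
The efficiency point can be sketched as follows. Both paths are expected to produce the same strings, but the direct path constructs one formatter and reuses it, whereas each legacy call has to look up (or recreate) the structures it needs. The sample values and variable names are illustrative assumptions, and whether an engine caches anything behind the legacy methods is an implementation detail.

```javascript
const values = [1234.5, 0.25, 1000000];

// Reformulated ES5 method: the locale is passed on every call, and the
// engine must find or build the formatting structures each time.
const viaLegacy = values.map((value) => value.toLocaleString("en-US"));

// Direct use of the API: the formatter is built once and then reused.
const numberFormat = new Intl.NumberFormat("en-US");
const viaIntl = values.map((value) => numberFormat.format(value));

console.log(viaLegacy, viaIntl);
```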

Internationalization in SpiderMonkey

SpiderMonkey includes significant support for the Internationalization API. The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU. This is an optional component of a SpiderMonkey build; support may be turned on using the --enable-intl-api configuration option. Most of SpiderMonkey's Internationalization code gets built even when the API is disabled, but the Intl object isn't added to global objects, and the legacy toLocale*String methods are implemented using SpiderMonkey's old JSLocaleCallbacks interface. Features and capabilities of the Internationalization API itself are implemented both in C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.

The Internationalization API is enabled by default in SpiderMonkey when embedded in Firefox builds.

Code organization

ICU

International Components for Unicode (ICU) is a library implementing collation, formatting, and other locale-sensitive functions. It provides the underlying functionality used in implementing Internationalization. ICU is imported into intl/icu.

Integration

The Intl object is integrated into the global object through code in js/src/builtin/Intl.cpp and js/src/builtin/Intl.h. js_InitIntlClass performs this operation when it's called during global object bootstrapping, in concert with various other initialization methods in the same file and in js/src/vm/GlobalObject.cpp. There's some particular trickiness here, as the various Intl.* constructors aren't global classes, yet need to participate in the reserved-slot constructor/prototype system used by Object, Function, Array, and so on to implement "using the original value of Object.prototype" and "as if by new Array()" and similar.

Self-hosted code

The majority of the self-hosted code implementing Internationalization is in js/src/builtin/Intl.js. This file defines the functions exposed on the various Intl.* constructor functions and the various Intl.*.prototype objects.

Internationalization in various cases requires keeping around large data tables: to record the set of supported currency codes, to record mappings for language tags (hyphenated strings describing locales and various options), and so on. This data lives in js/src/builtin/IntlData.js and is generated by js/src/builtin/make_intl_data.py. This script downloads the original (large) plaintext databases, parses them, and extracts the data used by Internationalization in the proper format. Updating this static data (which should happen any time the underlying databases receive an update) should be as simple as rerunning the script. XXX Link to the mailing lists to track to learn when updates occur!
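
For readers unfamiliar with language tags, the sketch below shows how they look from the script side: tags are matched case-insensitively and reported back in canonical form, and structurally invalid tags are rejected. The specific tags are illustrative assumptions, and the sketch exercises only the public API, not the generated tables themselves.

```javascript
// Tags are case-insensitive on input; the resolved locale is canonicalized.
const resolvedLocale = new Intl.NumberFormat("EN-us").resolvedOptions().locale;

// supportedLocalesOf reports which of the requested tags the
// implementation can serve.
const supported = Intl.NumberFormat.supportedLocalesOf(["en-US"]);

// A structurally invalid tag (underscore instead of hyphen) is rejected
// with a RangeError.
let rejectedInvalidTag = false;
try {
  new Intl.NumberFormat("en_US");
} catch (error) {
  rejectedInvalidTag = error instanceof RangeError;
}

console.log(resolvedLocale, supported, rejectedInvalidTag);
```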

Intrinsic functions

Self-hosted code calls into various intrinsics to access ICU functionality. The full list of Internationalization intrinsics is (necessarily) in js/src/vm/SelfHosting.cpp, but the intrinsics themselves are implemented in js/src/builtin/Intl.cpp.

Natively-implemented functions

All the constructor functions are implemented in C++ in js/src/builtin/Intl.cpp. Creating the necessary ICU data structures requires enough calls into C++ code that it's worth keeping the constructors themselves in C++.

Tests

Tests live in js/src/tests/test402, an unmodified import of the ECMA-402 test suite. XXX Explain how the tests are run, how they're skipped in no-Intl builds, how to update them, how to contribute to them, how we disable/annotate any tests we don't pass

Key concepts

...talk about collators, date formats, and how all the stuff is implemented using what ICU primitives...

Known bugs and issues

ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent. ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale. Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported. See getNumberingSystems in js/src/builtin/Intl.cpp. If ICU ever provides more comprehensive information here, we should probably use it.
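
From script, the effect of this decision is visible through resolvedOptions. The sketch below requests a non-default numbering system with the "nu" Unicode extension key and inspects what the implementation resolved. The choice of "arab" is an illustrative assumption; an implementation that does not consider it supported for the locale falls back to the locale's default.

```javascript
// The default numbering system for en-US is "latn".
const defaultSystem =
  new Intl.NumberFormat("en-US").resolvedOptions().numberingSystem;

// Request Arabic-Indic digits through the "nu" extension key.
const requestedFormat = new Intl.NumberFormat("en-US-u-nu-arab");
const resolvedSystem = requestedFormat.resolvedOptions().numberingSystem;

// If the request was honored, the digits in this string are Arabic-Indic.
const digits = requestedFormat.format(2013);

console.log(defaultSystem, resolvedSystem, digits);
```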

The ICU interface that exposes a locale's default numbering system (see above) is C++, which (see below) means it's not stable. There's an issue on file to add a C API for this. Until that's implemented and we use it, be careful about ICU upgrades.

Other random details

ICU provides both C and C++ APIs, but only the C API is considered stable. Given that some people reasonably want to use SpiderMonkey with a system ICU, this means we're generally limited to only the stable C API. Unfortunately, this also means we have to hand-roll our own smart pointer for managing ICU resources.

Most ICU functions indicate errors through an error code outparam. Such functions also check the existing value in that outparam before proceeding, and return immediately if it already indicates failure. Thus a sequence of ICU calls can run without intermediate error-checking, and a single U_FAILURE(status) check at the end suffices to handle all errors that might have occurred.