User:Waldo/Internationalization API

Revision as of 22:44, 16 April 2013 by Waldo (Add more information)

Introduction

ECMAScript has long had rudimentary localization support. ES5 defines toLocaleString methods (found on various objects such as Array.prototype, Number.prototype, and Date.prototype); toLocaleLowerCase and toLocaleUpperCase on String.prototype; and toLocaleDateString and toLocaleTimeString on Date.prototype. Each method acts only with respect to the user's current locale, and none provides any control over output formatting. The spec algorithms are woefully under-defined. As a practical matter, localization support in ES5 is useless.
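As a rough illustration of the limitation described above, the following sketch calls the ES5 methods with no arguments. The sample values and variable names are made up for this example. The resulting strings depend entirely on the host's locale and engine, so no particular output is shown.

```javascript
// ES5-style calls: no locale argument, no formatting options.
// The sample values are arbitrary and chosen only for illustration.
const amount = 1234567.891;
const when = new Date(Date.UTC(2013, 3, 16, 22, 44));

const es5Outputs = {
  number: amount.toLocaleString(),
  date: when.toLocaleDateString(),
  time: when.toLocaleTimeString(),
  // Array.prototype.toLocaleString calls toLocaleString on each element.
  list: [amount, when].toLocaleString(),
  upper: "title".toLocaleUpperCase(),
};

// Every value is a string whose exact shape is chosen by the host.
console.log(es5Outputs);
```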

The ECMAScript Internationalization API (ECMA-402) significantly extends these capabilities to provide genuinely useful localization in ECMAScript. Output may be customized by requesting which components to include and how each should be formatted. The locale used for a formatting operation is customizable, and output formatting is intelligently determined in accordance with that locale. The API additionally provides locale-sensitive sorting that can take into account the type of data being sorted (for example, sorting names in phone book order versus dictionary order) and can consider or ignore capitalization, accents, and so on.
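The capabilities just described can be sketched roughly as follows. The date, the sample names, and all variable names are invented for this example. Exact formatted strings and sort orders depend on the locale data available to the engine, so they are logged rather than stated.

```javascript
// One fixed instant, formatted with explicitly requested components.
const instant = new Date(Date.UTC(2013, 3, 16, 22, 44));
const components = { year: "numeric", month: "long", day: "numeric", timeZone: "UTC" };

// Same components, different locales: ordering and month names follow the locale.
const inEnglish = new Intl.DateTimeFormat("en-US", components).format(instant);
const inGerman = new Intl.DateTimeFormat("de-DE", components).format(instant);

// Collation that ignores case and accents: only base letters are compared.
const baseOnly = new Intl.Collator("en", { sensitivity: "base" });
const sameBaseLetters = baseOnly.compare("resume", "Résumé") === 0;

// Collation that treats case and accents as significant.
const strict = new Intl.Collator("en", { sensitivity: "variant" });
const distinctWhenStrict = strict.compare("resume", "Résumé") !== 0;

// German dictionary order versus phone book order. The phone book variant is
// requested through the "co" Unicode extension key in the language tag.
const names = ["Goldmann", "Götz", "Goethe"];
const dictionaryOrder = names.slice().sort(new Intl.Collator("de").compare);
const phoneBookOrder = names.slice().sort(new Intl.Collator("de-u-co-phonebk").compare);

console.log(inEnglish, "|", inGerman);
console.log(sameBaseLetters, distinctWhenStrict);
console.log(dictionaryOrder, phoneBookOrder);
```

If the engine has no phone book collation data for German, the last two arrays may simply come out identical.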

The Internationalization API introduces one new global property: Intl. This property is an object whose properties correspond to the various sub-APIs: collation (sorting), number formatting, and date/time formatting. (More capabilities will be added in future Internationalization API updates.) The localization APIs from ES5 have been reformulated to use the localization capabilities provided by the Internationalization API. Generally, however, it's preferable to use the Internationalization API directly: this is more efficient because it permits caching the structures needed to perform each operation.
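A minimal sketch of the two styles mentioned above, assuming a made-up list of prices and the "de-DE" locale purely for illustration. Both produce the same strings. The difference is that the second builds a formatter once and reuses it.

```javascript
// Arbitrary sample data for illustration.
const prices = [1234.5, 99.99, 0.5, 1000000];

// Reformulated ES5 method: convenient, but every call names the locale again
// and the engine may have to set up its formatting state each time.
const viaToLocaleString = prices.map((price) => price.toLocaleString("de-DE"));

// Direct use of the API: construct the formatter once, then reuse it.
const germanNumbers = new Intl.NumberFormat("de-DE");
const viaFormatter = prices.map((price) => germanNumbers.format(price));

// Both approaches yield identical output.
const identical = viaToLocaleString.every((text, i) => text === viaFormatter[i]);
console.log(identical, viaFormatter);
```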

Internationalization in SpiderMonkey

SpiderMonkey includes significant support for the Internationalization API. The fundamental primitives used to implement the API are provided by an in-tree imported copy of ICU in intl/icu. This is an optional component of a SpiderMonkey build; support may be turned on using the --enable-intl-api configuration option. The Internationalization API is enabled by default in Firefox builds. Features and capabilities of the API itself are implemented both in C++ and in self-hosted JavaScript that accesses ICU functionality through various intrinsic functions in the self-hosting global.
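Because the API is an optional build component, script that may run on a build without it can check for Intl before using it. The helper name formatNumber and its fallback behavior are assumptions made for this sketch, not part of SpiderMonkey.

```javascript
// Hypothetical helper: use Intl.NumberFormat when the build provides it,
// otherwise fall back to plain, locale-independent string conversion.
function formatNumber(value, locale) {
  const hasIntl =
    typeof Intl === "object" && Intl !== null && typeof Intl.NumberFormat === "function";
  if (hasIntl) {
    return new Intl.NumberFormat(locale).format(value);
  }
  return String(value);
}

console.log(formatNumber(1234.5, "en-US"));
```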

Code organization

ICU

International Components for Unicode (ICU) is a library implementing collation, formatting, and other locale-sensitive functions. It provides the underlying functionality used in implementing Internationalization.

Integration

The Intl object is integrated into the global object through code in js/src/builtin/Intl.cpp and js/src/builtin/Intl.h. js_InitIntlClass performs this operation when it's called during global object bootstrapping, in concert with various other initialization methods in the same file and in js/src/vm/GlobalObject.cpp. There's some particular trickiness here: the various Intl.* constructors aren't global classes, yet they need to participate in the reserved-slot constructor/prototype system used by Object, Function, Array, and so on to implement spec language such as "using the original value of Object.prototype" and "as if by new Array()".

Self-hosted code

The majority of the self-hosted code implementing Internationalization is in js/src/builtin/Intl.js. This file defines the functions exposed on the various Intl.* constructor functions and the various Intl.*.prototype objects.

Internationalization in various cases requires keeping around large data tables: to record the set of supported currency codes, to record mappings for language tags (hyphenated strings describing locales and various options), and so on. This data lives in js/src/builtin/IntlData.js and is generated by js/src/builtin/make_intl_data.py. This script downloads the original (large) plaintext databases, parses them, and extracts the data used by Internationalization in the proper format. Updating this static data, which should happen any time the underlying databases receive an update, should be as simple as rerunning the script.
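From script, the effect of such tables is visible in how language tags and currency codes are treated. The sketch below shows only that observable behavior. It says nothing about how IntlData.js is laid out or which values come from it rather than from ICU, and the helper name isStructurallyValidTag is invented for the example.

```javascript
// Language tags are canonicalized before use: casing is normalized.
const canonicalTags = Intl.NumberFormat.supportedLocalesOf(["EN-us"]);

// Hypothetical helper: a structurally invalid tag is rejected with a RangeError.
function isStructurallyValidTag(tag) {
  try {
    Intl.NumberFormat.supportedLocalesOf([tag]);
    return true;
  } catch (error) {
    if (error instanceof RangeError) {
      return false;
    }
    throw error;
  }
}

// Per-currency data determines the default number of fraction digits.
function defaultFractionDigits(currency) {
  const formatter = new Intl.NumberFormat("en-US", { style: "currency", currency });
  return formatter.resolvedOptions().maximumFractionDigits;
}

console.log(canonicalTags);
console.log(isStructurallyValidTag("de-u-co-phonebk"), isStructurallyValidTag("en_US"));
console.log(defaultFractionDigits("USD"), defaultFractionDigits("JPY"));
```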

Intrinsic functions

Self-hosted code calls into various intrinsics to access ICU functionality. The full list of Internationalization intrinsics is (necessarily) in js/src/vm/SelfHosting.cpp, but the intrinsics themselves are implemented in js/src/builtin/Intl.cpp.

Natively-implemented functions

All the constructor functions are implemented in C++ in js/src/builtin/Intl.cpp.

Key concepts

...

Known bugs and issues

ECMA-402 says that the supported numbering systems for a locale are (unsurprisingly) locale-dependent. ICU exposes the default numbering system for a locale via a C++ API, but otherwise it pretends any numbering system can be used by any locale. Thus SpiderMonkey's implementation says that the default numbering system is supported (obviously), and it says a handful of common decimal numbering systems are supported. See getNumberingSystems in js/src/builtin/Intl.cpp. If ICU ever provides more comprehensive information here, we should probably use it.
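From script, the numbering system in effect can be inspected through resolvedOptions. The following sketch requests the "arab" numbering system for English through the "nu" Unicode extension key. The choice of locale and numbering system is arbitrary, and whether the request is honored depends on what the implementation reports as supported, so the resolved value is logged rather than assumed.

```javascript
// The default numbering system for a locale.
const defaultSystem = new Intl.NumberFormat("en").resolvedOptions().numberingSystem;

// Request a different numbering system via the "nu" extension key.
const requested = new Intl.NumberFormat("en-u-nu-arab");
const resolvedSystem = requested.resolvedOptions().numberingSystem;

// If the request was not honored, the formatter uses the locale's default.
const honored = resolvedSystem === "arab";
const sample = requested.format(2013);

console.log(defaultSystem, resolvedSystem, honored, sample);
```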

The ICU interface that exposes a locale's default numbering system (see above) is C++, which (see below) means it's not stable. There's an issue on file to add a C API for this.

Other random details

ICU provides both C and C++ APIs, but only the C API is considered stable. Given that some people reasonably want to use SpiderMonkey with a system ICU, this means we're generally limited to only the stable C API. Unfortunately, this also means we have to hand-roll our own smart pointers for managing ICU resources.