Implementation Options for ECMAScript Internationalization API in SpiderMonkey
Norbert Lindenberg, 2013-01-07
- What’s the ECMAScript Internationalization API?
- Implementation based on bundled ICU
- Implementation based on OS support
- Implementation in Google Chrome
- Tentative recommendations
What’s the ECMAScript Internationalization API?
The API is defined by standard ECMA-402, developed by Ecma TC39, the group that also maintains the ECMAScript Language Specification. Mozilla is a very active participant in TC39 (Brendan Eich, Allen Wirfs-Brock, David Herman), and sponsored development of the Internationalization API (Norbert Lindenberg).
The API provides JavaScript applications with core internationalization features to let them better support the user’s language and culture and provide a more consistent localized experience for their users. Until now, it has been very difficult for web application designers to do something as simple as sort names correctly according to the user’s language. The new standard changes this: It provides string comparison for sorting, number and currency formatting (such as “1.234,56 €” for German), and date and time formatting capabilities (such as 2012年12月12日 for Japanese).
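As a rough illustration of the three capabilities named above, the following sketch uses the Intl.Collator, Intl.NumberFormat, and Intl.DateTimeFormat constructors defined by ECMA-402. The sample names and values are made up for illustration, and the results shown in comments assume an implementation that carries German and Japanese locale data.

```javascript
// String comparison for sorting: German collation places "Ö" next to "O",
// whereas a plain code-unit sort would move "Özdemir" behind "Weber".
const names = ["Weber", "Özdemir", "Ochs"];
const germanCollator = new Intl.Collator("de");
const sortedNames = [...names].sort(germanCollator.compare);

// Number and currency formatting for German.
const euroFormat = new Intl.NumberFormat("de-DE", {
  style: "currency",
  currency: "EUR",
});
const price = euroFormat.format(1234.56);

// Date formatting for Japanese. The time zone is pinned to UTC so that the
// result does not depend on where the code runs.
const japaneseDateFormat = new Intl.DateTimeFormat("ja-JP", {
  year: "numeric",
  month: "long",
  day: "numeric",
  timeZone: "UTC",
});
const formattedDate = japaneseDateFormat.format(new Date(Date.UTC(2012, 11, 12)));

console.log(sortedNames);   // Ochs, Özdemir, Weber
console.log(price);         // 1.234,56 € (the space is a no-break space)
console.log(formattedDate); // 2012年12月12日
```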
An important aspect of the API is that applications can choose the language, and are not bound to the localization of the browser or the operating system (implementations of the API determine which languages are supported). This addresses the large number of multilingual users who go to web sites in different languages, or users who have to use browsers that aren’t localized for their native language. In addition, the API lets applications tailor the results to their specific needs, e.g., specify the currency with which numbers are displayed, select the date-time components used in a date format, or ignore punctuation in sorting.
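The tailoring options mentioned here might look as follows in application code. This is a minimal sketch; the locales, amounts, and strings are arbitrary examples, and the set of locales returned by supportedLocalesOf depends on the implementation.

```javascript
// The application, not the browser or OS, picks the languages it wants,
// and can ask the implementation which of them it supports.
const requestedLocales = ["ja-JP", "de-DE"];
const supportedLocales = Intl.DateTimeFormat.supportedLocalesOf(requestedLocales);

// Specify the currency with which a number is displayed.
const yenAmount = new Intl.NumberFormat("en-US", {
  style: "currency",
  currency: "JPY",
}).format(5000);

// Select the date-time components used in a date format (month and year only).
const monthAndYear = new Intl.DateTimeFormat("en-US", {
  year: "numeric",
  month: "long",
  timeZone: "UTC",
}).format(new Date(Date.UTC(2012, 11, 12)));

// Ignore punctuation in sorting: the hyphen no longer distinguishes the strings.
const looseCollator = new Intl.Collator("en", { ignorePunctuation: true });
const treatedAsEqual = looseCollator.compare("co-op", "coop") === 0;

console.log(supportedLocales); // implementation-dependent subset of the request
console.log(yenAmount);        // ¥5,000
console.log(monthAndYear);     // December 2012
console.log(treatedAsEqual);   // true
```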
Google Chrome is the first browser to ship with an implementation of the API – it’s prefixed in Chrome 23, unprefixed in Chrome 24 (currently beta). Microsoft demoed an implementation at the Unicode Conference in October 2012, but hasn’t announced release plans yet. Plans for Safari and Opera are unknown.
Implementation based on bundled ICU
The solution that’s easiest to implement and provides the highest level of functionality is using ICU and bundling it with the Firefox download. ICU (International Components for Unicode) is a comprehensive open-source internationalization library supported primarily by IBM, Google, and Apple. An ICU-based implementation of the ECMAScript Internationalization API for Firefox is under development; a current build for Mac OS X is available.
Issues with bundling ICU
Concerns have been raised about the increase in download size, the increase in mass storage size, and the increase in RAM use caused by ICU.
- Mass storage size: A full build of ICU 50.1.1 (the version released in December 2012) takes about 23.8 MB on Mac OS X (1 MB = 1024 KB = 1,048,576 B), distributed over 7 libraries. Of that, 19.8 MB are data, 4 MB code.
- Download size: A zip file with these libraries takes 9.5 MB. The compression for download data may differ somewhat, but is likely to result in a similar size increase. On the Mac, where all libraries are built and bundled in both 32- and 64-bit versions, the impact would double.
- RAM use: There was concern that all of the code and data might be loaded into memory, and so RAM use might be the same as mass storage size, although a smart operating system should page in only what’s needed.
Download size is seen as a problem for user acquisition, as users may cancel a download that takes too long. However, Mozilla doesn’t appear to have numbers on the correlation between download sizes and cancellations. After going through the Firefox download experience on Windows and comparing it to the Chrome download experience, I suspect that the number of security warnings, dialogs, and cancel buttons on the way might also have a significant impact.
RAM use could be a problem on Firefox OS and Android, which have to run with very limited memory. However, operating systems typically don’t load complete libraries into RAM; they page them in as needed (possibly preloading some proactively). For desktop systems, RAM use was not seen as a big issue (Justin Lebar).
Mass storage size in itself wasn’t seen as a serious issue; however, it’s the easiest one to measure and can serve as a proxy for download size (the compression ratio seems to be about 3:1).
Steps taken to mitigate the issues
The following steps to reduce mass storage and download size have already been implemented (size numbers are for Mac OS X):
- Turning off major chunks of functionality that’s not needed for the internationalization API, and removing the associated data from the build, reduced the number of libraries to 3, and the combined mass storage size to 12.1 MB (9.3 MB data, 2.9 MB code).
- Removing collation rules (which aren’t required for the functionality of the API) reduced the mass storage size by another 680 KB.
- Building ICU as static libraries increased the size of libmoz.js by 10.9 MB, 584 KB less than the sum of the libraries thus no longer needed. Unfortunately, ICU doesn’t support static libraries when building with MSVC on Windows, and I haven’t gotten them to work on Linux.
With these steps, the increase in download size for Mac OS X (which still includes each library twice) is 6.7 MB (from 47.4 MB to 54.1 MB).
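The size figures quoted so far can be cross-checked with a little arithmetic. The sketch below assumes that the 10.9 MB increase applies to each of the two architectures bundled in the Mac OS X download; the text does not say so explicitly.

```javascript
// Figures quoted in the text, in MB, for Mac OS X.
const fullBuildMB = 23.8;       // full ICU 50.1.1 build on disk
const fullBuildZipMB = 9.5;     // the same libraries in a zip file
const staticIncreaseMB = 10.9;  // growth of the JavaScript library with static ICU
const architectures = 2;        // assumption: 32-bit and 64-bit copies both grow by 10.9 MB
const downloadBeforeMB = 47.4;
const downloadAfterMB = 54.1;

// Compression ratio of the full build.
const fullBuildRatio = fullBuildMB / fullBuildZipMB;

// Download growth after the mitigation steps, and the implied compression ratio.
const downloadIncreaseMB = downloadAfterMB - downloadBeforeMB;
const reducedBuildRatio = (staticIncreaseMB * architectures) / downloadIncreaseMB;

console.log(fullBuildRatio.toFixed(1));     // 2.5
console.log(downloadIncreaseMB.toFixed(1)); // 6.7
console.log(reducedBuildRatio.toFixed(1));  // 3.3
```

Under that assumption both ratios land between 2.5:1 and 3.3:1, in line with the rough 3:1 estimate given earlier.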
Possible additional steps
The following steps could be taken to reduce the size further, but involve either product trade-offs or major engineering effort:
- Remove locale data for languages that Mozilla doesn’t localize for. About a third of ICU locales could be removed, but savings are going to be significantly less than a third because some core locales (Chinese, Japanese) have much bigger locale data than most others. The product trade-off is that users may want to access web applications in their native languages even if the browser is not localized into those languages.
- Strip out unused time zone names. The locale data contains up to six names per time zone and language; at most four of them can be used by the internationalization API. This could probably be accomplished by adding and implementing a new option for the resource compiler, similar to how it already can strip out collation rules. This is an issue of engineering effort.
- Strip out currency names that are unlikely to be used. Few applications need all currency names for all currencies in all languages; it might be sufficient to have names for the most important currencies per locale. The product trade-off is that some applications, such as the Yahoo Finance currency converter, do deal with large numbers of currencies and would receive currency codes as fallbacks. This would also be a significant engineering effort.
- Package locale data as a data file rather than a library. On the Mac, code libraries are built and bundled as both 32-bit and 64-bit binaries; for data that’s not necessary. ICU supports this, although it requires changes to the applications (Firefox etc.), not just the low-level libraries directly using ICU. This is a smaller engineering effort, but without the next step it benefits Mac OS X only.
- Download data on demand. This is attractive because users generally need locale data only for a few locales. However, the API is synchronous and can’t wait until data is downloaded, and once it has produced results indicating that it doesn’t support a locale it shouldn’t start supporting that locale for the same client seconds later (there’s a user experience proposal). We also don’t know whether the time when a user stumbles onto a Chinese web page that requests Chinese collation is really the best time to download the data – at that time the user may be roaming in China with a U.S. data plan. There may be options in between, e.g., download ICU data after the browser itself but without waiting for a request. A solution might also address similar issues with dictionaries and hyphenation tables. This is a product trade-off, and would also be a significant engineering effort.
- Apply special-purpose compression to the ICU data. Not clear if this would save a significant percentage of download size, but it sounds like an interesting research project.
- Reimplement the relevant parts of ICU in JavaScript. This sounds like a major engineering effort with uncertain outcome.
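The consistency constraint on downloading data on demand can be made concrete with a small model. The sketch below is purely hypothetical: the class name LocaleDataRegistry, its methods, and the notion of a client id are invented for illustration and do not describe any existing or planned Firefox code. It answers synchronously, freezes the answer per client, and lets newly arrived data become visible only to clients that start later.

```javascript
// Hypothetical model: locale support is answered synchronously and stays
// stable for a given client (e.g., a page) even if data arrives meanwhile.
class LocaleDataRegistry {
  constructor(installedLocales) {
    this.installed = new Set(installedLocales);
    this.snapshots = new Map();       // client id -> frozen set of locales
    this.pendingDownloads = new Set(); // locales to fetch in the background
  }

  // Synchronous query; never waits for a download.
  supports(clientId, locale) {
    if (!this.snapshots.has(clientId)) {
      // The first query from a client freezes what that client will ever see.
      this.snapshots.set(clientId, new Set(this.installed));
    }
    const supported = this.snapshots.get(clientId).has(locale);
    if (!supported && !this.installed.has(locale)) {
      this.pendingDownloads.add(locale); // fetch later, when product policy allows
    }
    return supported;
  }

  // Called when a background download has completed.
  dataArrived(locale) {
    this.installed.add(locale);
    this.pendingDownloads.delete(locale);
  }
}

const registry = new LocaleDataRegistry(["en", "de"]);
const firstAnswer = registry.supports("page-1", "zh");    // false, queues a download
registry.dataArrived("zh");
const repeatedAnswer = registry.supports("page-1", "zh"); // still false for this client
const laterClient = registry.supports("page-2", "zh");    // true for a new client
console.log(firstAnswer, repeatedAnswer, laterClient);    // false false true
```

Whether and when the queued download should actually run is the product question raised above; the model deliberately leaves it open.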
Steps proposed that don’t help
A few steps have been proposed, but will not help:
- Use ICU to replace internationalization code in other Mozilla subsystems. While using ICU to implement internationalization of other subsystems is a good idea in general, it won’t address the size issue because the bulk of that existing code is for functionality that the internationalization API doesn’t need (such as encoding conversion) and that’s therefore removed from the current build.
- Provide locale data only for the UI language. A big advantage of the API is that it allows users to access applications that use a different language than the browser, and allows such applications to provide a consistent localized experience. Multilingual users are majorities in many countries, and a large minority even in the U.S.
- Build the browser for Mac OS X as a 64-bit-only binary, with support for 32-bit plugins. There are still Macs running 10.6 in 32-bit mode.
Implementation based on OS support
The ECMAScript Internationalization API can also be implemented on top of the internationalization APIs provided by the operating system. Some operating systems include ICU; for others an adaptation to other APIs is necessary.
Using OS implementations of ICU
The following caveats apply whenever ICU is used as a system library:
- ICU has separate interfaces for C and C++. There’s no guarantee for binary compatibility of the C++ interfaces, so these can’t be used when relying on a system library. My current implementation uses the C API where possible, but there’s one function that it needs that has no C equivalent: NumberingSystem::createInstance. The implementation could work around this by formatting a known number and checking the resulting string against the known numbering systems.
- ICU by default renames functions to include the ICU version number. Operating systems that want to offer ICU as a system library for application use have to turn off this feature. Not all do.
- The newest ICU functions used, ucol_getKeywordValuesForLocale and ucal_getKeywordValuesForLocale, were introduced in ICU 4.2, which shipped in July 2009. Another, udatpg_getBestPattern, was added in ICU 3.8, December 2007. Operating systems using older versions of ICU would not be supported.
- Like all software, ICU has bugs that get fixed over time. An implementation relying on a system version of ICU may encounter bugs that the current version doesn’t have anymore.
- Unlike most other software, ICU includes a huge set of locale data that gets extended and improved all the time. An implementation relying on a system version of ICU likely will find support for fewer locales, and less complete locale data for some supported locales, than one bundling the current version.
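The workaround suggested above for the missing C equivalent of NumberingSystem::createInstance, formatting a known number and comparing the result against known numbering systems, can be sketched as follows. The sketch is written in JavaScript for brevity and uses Intl.NumberFormat as a stand-in for a formatter obtained through ICU's C API; the table of zero digits is a small illustrative subset, and the function names are hypothetical.

```javascript
// Zero digits of a few numbering systems (illustrative subset only).
const ZERO_DIGITS = {
  latn: "0",
  arab: "\u0660",
  deva: "\u0966",
  thai: "\u0E50",
};

// formatNumber is any function that formats a number for some locale.
// Formatting the known value 0 reveals which digit set the locale uses.
function detectNumberingSystem(formatNumber) {
  const sample = formatNumber(0);
  for (const [name, zeroDigit] of Object.entries(ZERO_DIGITS)) {
    if (sample.includes(zeroDigit)) {
      return name;
    }
  }
  return undefined; // not one of the systems in the table
}

// Stand-in for a formatter created through the C API.
const formatterFor = (locale) => (value) => new Intl.NumberFormat(locale).format(value);

console.log(detectNumberingSystem(formatterFor("en-US")));        // latn
console.log(detectNumberingSystem(formatterFor("en-u-nu-arab"))); // arab
console.log(detectNumberingSystem(formatterFor("th-u-nu-thai"))); // thai
```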
Firefox OS
Someone mentioned that the B2G sources include ICU, and after downloading all 9271 MB of those sources I can confirm: it’s there. I haven’t seen an actual build yet, but given that any OS needs some internationalization support, I assume ICU is actually used. It’s a somewhat old version, 4.6 from December 2010, but not too old for our purposes. Since Firefox OS is targeting low-end smartphones, it’s most important here to not waste resources, so we should use what’s there. SpiderMonkey is part of the OS here, so it might be possible to use C++ APIs. We should look into upgrading to a more modern version though.
Android
Android includes ICU, but only for use by system applications. Mozilla could ask Google to add ICU to the NDK to make it available for downloaded Firefox. I hear Adobe is interested in this as well, and there’s a chance it could happen. On the other hand, Mozilla also wants OEMs and carriers to ship Firefox on devices. In that case, it may be possible to treat Firefox as a system application and give it access to ICU even without changes to the NDK.
Windows
Windows provides its own internationalization API, unrelated to ICU. The ECMAScript Internationalization API was designed to accommodate weaknesses in the Windows API, but building on it will lead to some loss in functionality compared to ICU: No way to implement full time zone support; calendars limited to the traditional calendar for each locale plus Gregorian – no Islamic calendar combined with English; minimal set of supported date and time formats. It may also be difficult to obtain all the information about supported functionality that the ECMAScript Internationalization API requires.
Mac OS X
Mac OS X uses ICU internally, but its interfaces aren’t provided, indicating it’s not for applications to use. It’s technically possible to link against the library, but risky. In Mac OS 10.7 the library is in /usr/lib, not in /usr/library as stated in the article. The recommended internationalization APIs for Mac OS X (UCCreateCollator, NSNumberFormatter, CFNumberFormatter, NSDateFormatter, CFDateFormatter) don’t seem to support some of the options specified in the ECMAScript Internationalization API, such as numeric sorting or selection of the numbering system.
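From an application's point of view, a gap such as missing numeric sorting or numbering system selection would surface through resolvedOptions, which reports what an implementation actually applied. The sketch below shows how a page might check this; the helper name honoredOptions is hypothetical, and the results in the comments are what an ICU-backed implementation is expected to report.

```javascript
// Request numeric sorting and the Arabic-Indic numbering system, then ask
// the implementation which of the two it actually honored.
function honoredOptions(locale) {
  const collator = new Intl.Collator(locale, { numeric: true });
  const numberFormat = new Intl.NumberFormat(locale + "-u-nu-arab");
  return {
    numericSorting: collator.resolvedOptions().numeric === true,
    numberingSystem: numberFormat.resolvedOptions().numberingSystem,
  };
}

const honored = honoredOptions("en");
console.log(honored.numericSorting);  // true with an ICU-backed implementation
console.log(honored.numberingSystem); // arab with an ICU-backed implementation

// With numeric sorting honored, digit runs compare by value, not character by character.
const numericCollator = new Intl.Collator("en", { numeric: true });
const sortedItems = ["item10", "item2", "item1"].sort(numericCollator.compare);
console.log(sortedItems); // item1, item2, item10
```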
Desktop Linux
Desktop Linux distributions often include ICU, and some RPMs are available. Versions may be rather old, though: for example, CentOS 6.3 (released July 2012) includes ICU 4.2.1 (released July 2009). They may also be unsuitable for application use because of function renaming (as they are in CentOS).
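The version and renaming caveats listed earlier for system ICU libraries can be combined into a simple suitability check. The sketch below takes the introduction versions of the three functions from the text; the helper names and the shape of the input object are hypothetical, and the last two sample inputs are invented for illustration.

```javascript
// ICU versions in which the functions named in the text were introduced.
const REQUIRED_FUNCTIONS = {
  ucol_getKeywordValuesForLocale: "4.2",
  ucal_getKeywordValuesForLocale: "4.2",
  udatpg_getBestPattern: "3.8",
};

// Compare dotted version strings numerically, component by component.
function compareVersions(a, b) {
  const partsA = a.split(".").map(Number);
  const partsB = b.split(".").map(Number);
  const length = Math.max(partsA.length, partsB.length);
  for (let i = 0; i < length; i++) {
    const difference = (partsA[i] || 0) - (partsB[i] || 0);
    if (difference !== 0) {
      return Math.sign(difference);
    }
  }
  return 0;
}

// The newest required function determines the minimum usable ICU version.
const minimumVersion = Object.values(REQUIRED_FUNCTIONS).reduce(
  (newest, version) => (compareVersions(version, newest) > 0 ? version : newest)
);

// A system ICU is usable only if it is new enough and exports unrenamed functions.
function systemIcuUsable({ version, renamesFunctions }) {
  return !renamesFunctions && compareVersions(version, minimumVersion) >= 0;
}

console.log(minimumVersion); // 4.2
// CentOS 6.3 as described above: new enough, but functions are renamed.
console.log(systemIcuUsable({ version: "4.2.1", renamesFunctions: true }));  // false
// Hypothetical systems for comparison.
console.log(systemIcuUsable({ version: "4.6", renamesFunctions: false }));   // true
console.log(systemIcuUsable({ version: "3.8", renamesFunctions: false }));   // false
```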
Implementation in Google Chrome
Google has decided to go with bundling ICU for all platforms. Dave Mandelin has collected some data:
- A Chrome install folder takes up about 200 MB on my machine (Win 7). This includes a 100 MB chrome.7z file that the installer apparently keeps around, so 100 MB for the software itself but apparently they feel good about taking up 200 MB. :-)
- A Firefox install folder takes up about 45 MB.
- Chrome ships 14 DLLs in its top-level directory for a total of 63 MB.
- Firefox ships 26 DLLs in its top-level directory for a total of 32 MB.
- Chrome ships ICU, in the form of icudt.dll, 9.5 MB.
- Speaking of optional/new-ish features, Pepper Flash is a 12 MB DLL (in the PepperFlash dir, thus not even counted in the 63 MB above).
- In a test of cold startup, Chrome loaded about 500k of icudt.dll in 170 ms. Their total startup time on this machine varies from 3.5-10s. So ICU accounts for 1.7-5% of their total cold startup time.
- Firefox cold startup is of the same (binary) order of magnitude.
- Warm startups are about 10x faster, and given that the code is already in memory, the effect of a new library on warm startup time is probably even less (proportionally) than it is on cold.
The above certainly suggests to me that adding ICU is not a big deal. The biggest risk is probably still the increased download size. The Firefox stub installer, still in progress IIUC, will probably make download size matter less.
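The startup percentages quoted above follow directly from the measured numbers, as this small check shows (values taken from the list; nothing else is assumed).

```javascript
// Measurements quoted above for Chrome cold startup on one Windows 7 machine.
const icuLoadMs = 170;                     // time spent loading part of icudt.dll
const totalStartupRangeMs = [3500, 10000]; // fastest and slowest total startup

// Share of total startup time attributable to ICU, in percent.
const icuSharePercent = totalStartupRangeMs.map(
  (totalMs) => (icuLoadMs / totalMs) * 100
);

console.log(icuSharePercent.map((share) => share.toFixed(1))); // 4.9 and 1.7
```

The share is largest, just under 5%, when total startup is fastest, and falls to 1.7% for the slowest startup.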
Tentative recommendations
Given available information, I recommend using different solutions for different platforms:
- Mac OS X: Use bundled ICU. Big downloads are a fact of life for Macintosh users. Operating system and application updates routinely run into hundreds of megabytes; a few recent ones have gone beyond 1 GB. When it comes to hard disk space, 20 MB are the equivalent of ten minutes of music downloaded from the iTunes music store, or ten photos taken with an iPhone.
- Linux: Use bundled ICU. The situation here is likely similar to the one on Mac, and the chances of finding a usable ICU library in the OS are uneven.
- Windows: Use bundled ICU. Mitigate the user acquisition issue by improving the download and installation experience, i.e., by reducing the number of steps and dialogs with “cancel” buttons. This is admittedly a tough one, because Windows is the main OS for low-end desktop and laptop computers, and widely used in countries where broadband isn’t the standard yet.
- Android: Ask Google to make the ICU API available as part of the NDK, and provide the ECMAScript Internationalization API only on Android systems that provide the ICU API, implementing it on top of that API. Download and mass storage size matter a lot more on Android than on desktop systems. Google is a strong supporter of both ICU and the ECMAScript Internationalization API, and I understand other parties have requested making ICU available, so there’s a reasonable chance this would happen. Issue: Would this change apply to all versions of Android or only to future ones?
- Firefox OS: Use the OS implementation of ICU, upgrading this implementation to a current version at the same time. Since Mozilla controls the entire system, there’s no point in having two separate implementations.