An announcement from the Unicode Consortium describes new sponsorship for the CLDR Project and its Locale Data Markup Language (LDML), designed to facilitate standardized methods for software globalization. The project is now organized under the Unicode Locale Technical Committee (LTC).
The Common Locale Data Repository (CLDR) "provides a general XML format for the exchange of locale information for use in application and system software development, combined with a public repository for a common set of locale data generated in that format."
A locale, as described in the Draft Unicode Technical Standard, is "an id that refers to a set of user preferences that tend to be shared across significant swathes of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services."
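As an informal illustration of the id model described above, a locale id of the common language[_territory][_variant] shape might be decomposed as follows; the parser and field names are a hypothetical sketch for illustration, not part of the LDML specification:

```python
# Hypothetical sketch: split a locale id of the common
# language[_territory][_variant] form into its parts.
# LDML defines the authoritative id syntax; this is only illustrative.
def parse_locale_id(locale_id: str) -> dict:
    parts = locale_id.split("_")
    return {
        "language": parts[0],
        "territory": parts[1] if len(parts) > 1 else None,
        "variant": "_".join(parts[2:]) or None,
    }

print(parse_locale_id("de_DE"))  # {'language': 'de', 'territory': 'DE', 'variant': None}
print(parse_locale_id("en"))     # {'language': 'en', 'territory': None, 'variant': None}
```

A service receiving such an id can then fall back from the more specific form (language plus territory) to the plain language when no territory-specific data exists.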
An LDML specification has been produced by the Free Standards Group's LADE Workgroup, with support from workgroup founding members IBM, Sun and OpenOffice.org; the project was chartered "to devise a general XML format for the exchange of linguistically and culturally sensitive (locale) information for use in application and system development, and to gather, store, and make available data. With LDML, for example, collation rules can be exchanged, allowing two implementations to exchange a specification of collation. Using the same specification, two different implementations will achieve the same results in comparing strings."
Common Locale Data Repository (CLDR) Project
"Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data. But there remain differences in the locale data used by different systems.
Common, recommended practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those differences are simply gratuitous; all are within acceptable limits for human beings, but they yield different results. In many other cases there are outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collations cause not only ordering differences, but also different query results! That is, with a query for customers with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned...
There are a number of steps that can be taken to improve the situation. The first is to provide an XML format for locale data interchange. This provides a common format for systems to interchange data so that they can get the same results. The second is to gather up locale data from different systems, and compare that data to find any differences. The third is to provide an online repository for such data. The fourth is to have an open process for reconciling differences between the locale data used on different systems and validating the data, to come up with a useful, common, consistent base of locale data.
This document [Locale Data Markup Language - LDML] describes one of those pieces, an XML format for the communication of locale data. With it, for example, collation rules can be exchanged, allowing two implementations to exchange a specification of collation. Using the same specification, the two implementations will achieve the same results in comparing strings.
People do not have to subscribe to this model to use the data, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses. The first issue is basic: what is a locale? In this document, a locale is an id that refers to a set of user preferences that tend to be shared across significant swathes of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services.
Locale data is not cast in stone: the data used on someone's machine may generally reflect the US format, for example, but preferences can typically be set to override particular items, such as setting the date format to 2002.03.15, or using metric vs. Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may also want to remember the user's timezone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher, etc.), music preference, religion, party affiliation, favorite charity, etc.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, etc.). The format in this document does not attempt to collect together all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy...
The Common Locale Data Repository comparison charts provide comparisons between locale data from different sources. The files are organized by locale. In each file, the first three columns identify the data item, while the subsequent columns contain data. There may be different numbers of columns per locale, based on the available comparison data. The Common data is in the first data column, with other data sources following (where available). The latter sources are generated with public APIs..." See the main chart (an index of all comparison charts) and the collation diff files. [adapted from the 'Introduction' to the Draft Unicode Technical Standard #35]
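The collation scenario quoted above, in which the same "between" query returns different lists on systems with different sort orders, can be sketched in a few lines. Both "collators" here are hypothetical stand-ins (raw code-point order versus an accent-insensitive tailoring), not actual CLDR data:

```python
import unicodedata

def collation_a(name: str) -> str:
    """Hypothetical system A: raw code-point order."""
    return name

def collation_b(name: str) -> str:
    """Hypothetical system B: accent- and case-insensitive tailoring."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

customers = ["Abbot, Cosmo", "Arnold, James", "Ängström, Pia", "Zeta, Co"]

def between(names, lo, hi, key):
    """The 'customers between lo and hi' query, under a given collation."""
    return sorted((n for n in names if key(lo) <= key(n) <= key(hi)), key=key)

result_a = between(customers, "Abbot, Cosmo", "Arnold, James", collation_a)
result_b = between(customers, "Abbot, Cosmo", "Arnold, James", collation_b)
# Under A, "Ängström, Pia" sorts after "Arnold, James" and is excluded;
# under B it sorts between "Abbot" and "Arnold" and is included:
# the same query, different lists.
```

Exchanging a single collation specification in LDML, so that both systems build their sort keys from the same tailoring, is precisely what removes this discrepancy.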
About the Unicode Consortium
"The Unicode Consortium is a non-profit organization originally founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The Unicode Consortium actively develops standards in the area of internationalization including defining the behavior and relationships between Unicode characters. The Consortium cooperates with W3C and ISO and has liaison status "C" with ISO/IEC/JTC 1/SC2/WG2, which is responsible for refining the specification and expanding the character set of ISO/IEC 10646.
Members of the Consortium include major computer corporations, software producers, database vendors, research institutions, international agencies, various user groups, and interested individuals. A white paper outlining the overall value of a Unicode membership to an organization is available separately.
The Consortium's Directors and Officers come from a variety of organizations, representing a wide spectrum of text-encoding and computing applications.
The Consortium maintains a set of official policies with regard to patents, the stability of the Standard, and Unicode's trademarks, logo and copyright material. It also makes available many Unicode Encoded logos that can be freely displayed on web sites to indicate that a page (or collection of pages) is encoded in Unicode. The Consortium's technical activities are conducted by the Unicode Technical Committee, which is responsible for the creation, maintenance, and quality of the Unicode Standard..." [extracted from the About document]
- Announcement 2004-04-21: "Unicode Consortium Sponsors Locale Data Project."
- "Locale Data Markup Language (LDML)." Draft Unicode Technical Standard #35. Version 1.1. Edited by Mark Davis. 2004-04-19. [cache]
- LDML XML DTDs corresponding to Draft Unicode Technical Standard #35:
- Common Locale Data Repository (CLDR) Project web site
- CLDR Repository Access. Provides documentation on how to download or browse the CLDR repository using WebCVS or a local CVS client.
- CLDR Comparison Charts
- CLDR main comparison chart
- CLDR collation diff files
- CLDR Data Formats. This resource provides background information on the way that locale information is formatted.
- Unicode Home Page
- The Unicode Consortium web site
- Earlier news:
- "OpenI18N Releases Locale Data Markup Language Specification (LDML) Version 1.0." News story 2003-07-03.
- Announcement 2003-06-24: "OpenI18N Releases LDML Version 1.0"
- "OpenI18N Announces Common XML Locale Specification." News story 2002-11-09.
- Free Standards Group web site. "The Free Standards Group develops standards and tools that help developers focus on adding value to their software, rather than spending time dealing with verification and porting issues. The group also coordinates testing and certification programs that verify software compliance with existing standards."
- See also: "XML and Unicode" - General reference document.
- "Markup and Multilingualism" - General reference document.