Cover Pages: IETF Draft on Language Tags Defines Mechanism for Private Use Extension.

Note 2004-01-23: An updated document "Tags for Identifying Languages" should also be consulted, as corrections/additions are provided relative to some content and references below.

Note 2003-11-19: An updated 'draft-langtags-phillips-davis-01' version of Tags for Languages was released on November 18, 2003. Updates will be made to this news story soon.

An initial public draft of Tags for Languages presented to the IETF Network Working Group builds upon the current IETF RFC 3066 Tags for the Identification of Languages and defines additional mechanisms for private use extension. The Internet Draft also clarifies how private use, registered values, and matching interact.

Identifiers known as language tags are authorized for use in XML and many related computing technologies that need to support language-sensitive and locale-based processing. Current practice regarding the creation, registration, and use of language tags is in a considerable state of confusion and "mess," in the experience of localization experts and software engineers. The goal of the new draft is to work toward a new IETF RFC that replaces RFC 3066.

The proposed syntax for construction of a language tag provides for designation of language, script, region, variant, and arbitrary extension (using name/value pairs). Under the new proposal, "all 4-letter subtags are interpreted as ISO 15924 alpha-4 script codes from ISO 15924, or subsequently assigned by the ISO 15924 maintenance agency or governing standardization bodies, denoting the script or writing system used in conjunction with this language. All 2-letter and 3-letter subtags are interpreted as ISO 3166 alpha-2 (or alpha-3) country codes from ISO 3166, or subsequently assigned by the ISO 3166 maintenance agency or governing standardization bodies, denoting the area to which this language variant relates. Region tags must occur after any script tags and before any variants or extensions."
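The length-based interpretation quoted above can be illustrated with a minimal sketch. It assumes a simple tag whose first subtag is the language, and it labels each later subtag purely by its length; the function name `classify_subtags` and the label strings are hypothetical, and nothing here checks the subtags against the actual ISO 639, ISO 15924, or ISO 3166 code lists.

```python
from typing import List, Tuple


def classify_subtags(tag: str) -> List[Tuple[str, str]]:
    """Label each subtag of a hyphen-separated tag by position and length."""
    subtags = tag.split("-")
    # Assumption: the first subtag is always the language code.
    labels = [("language", subtags[0].lower())]
    for sub in subtags[1:]:
        is_letters = sub.isascii() and sub.isalpha()
        if is_letters and len(sub) == 4:
            # 4-letter subtags are read as ISO 15924 script codes.
            labels.append(("script", sub.title()))
        elif is_letters and len(sub) in (2, 3):
            # 2- and 3-letter subtags are read as ISO 3166 region codes.
            labels.append(("region", sub.upper()))
        else:
            # Anything else (variants, extensions) is left unclassified here.
            labels.append(("other", sub))
    return labels


print(classify_subtags("zh-Hans-CN"))
# [('language', 'zh'), ('script', 'Hans'), ('region', 'CN')]
```

Because the subtag type follows from its length alone, a consumer can tell script from region without consulting a registry of complete tags.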

A further goal of the new RFC is to provide for stable language tags even in the face of ISO instability. "To maintain backwards compatibility, there are two provisions to account for [v -01 text: potential] instability in ISO 639, 3166, and 15924 codes: (1) Ambiguity - in the event that one of these ISO standards reassigns a code that was previously assigned to a different value, the new use of the code will not be permitted and the IANA registry, as soon as practical, will register a surrogate value for the new code, based on the year that the new code assignment was made. (2) Stability - all other ISO codes are valid, even if they have been deprecated; where a new equivalent code has been defined, implementations should treat these tags as identical."
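The stability provision ("implementations should treat these tags as identical") might be approximated as follows. This is a sketch under stated assumptions: the alias table is sample data based on well-known ISO 639 replacements (iw/he, in/id, ji/yi) and is not taken from the draft, and the helper names are hypothetical.

```python
# Sample alias table (assumption, not from the draft): deprecated ISO 639
# codes mapped to the codes that replaced them.
DEPRECATED_LANGUAGE_CODES = {"iw": "he", "in": "id", "ji": "yi"}


def canonical_language(tag: str) -> str:
    """Replace a deprecated primary language subtag with its newer equivalent."""
    primary, sep, rest = tag.partition("-")
    primary = primary.lower()
    primary = DEPRECATED_LANGUAGE_CODES.get(primary, primary)
    return primary + sep + rest


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags case-insensitively after mapping deprecated codes."""
    return canonical_language(a).lower() == canonical_language(b).lower()


print(same_language_tag("iw-IL", "he-IL"))  # True
```

Both the deprecated and the current code stay valid as input; only the comparison treats them as the same language.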

Bibliographic Information

Tags for Languages. By Addison P. Phillips (webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group. Initial public draft. Reference: 'draft-langtags-phillips-davis-00.txt'. November 2003, expires November 2004. 13 pages. Original source: http://www.ietf.org/internet-drafts/draft-langtags-phillips-davis-00.txt (text). See also the text version, cache and source PDF.

Language Tag Issues

Background to the new Internet Draft Tags for Languages can be found in the IETF-languages Archives for the IETF Language tag discussion group, created "for the discussion of matters related to RFC 1766/RFC 3066 language tags, including but not limited to the registration of new tags." Some of the commonly referenced issues are summarized below, beginning with an excerpt from "Language Code Issues" written by Mark Davis.

In the context of a discussion about locale (localization), Davis says:

"The current situation with regards to language codes and locale codes is a bit of a mess. We are looking at migrating ICU to use RFC 3066 instead of ISO 639 alone, and find a number of problems. Our goal is to identify the language codes and region codes used in practice in Windows and other platforms, and account for any missing ones so that people can communicate locale information correctly, without loss of data because of missing or mismatched language codes. The issues that we turned up, or were pointed out to us by others, are of interest not only for ICU but also to a broader audience..."

"The usual basis for a locale code is a combination of language code and region code, sometimes augmented with variants. The language code could either be drawn from ISO 639 or from RFC 3066, but there are problems with both. RFC 3066 codes are actually a superset of ISO 639 codes..."

"As far as we are concerned (as a completely practical matter) two languages are different if they require substantially different localized resources. Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this is not the principle used in ISO 639, which has the fairly unproductive notion that only spoken language matters (it is also not completely consistent about this, however). If the use of languages happens to correspond to region boundaries expressed as ISO 3166 country codes, then we can use RFC 3066 to express the difference, but in many cases this is not true. For example, both simplified and traditional Chinese are used in Hong Kong S.A.R.; both Cyrillic and Latin are used in Serbia, Azerbaijan, and Uzbekistan; Indic languages are customarily written in different scripts, etc..."

"ISO 15924 contains script codes that could be used, in some cases, to distinguish future language codes in these cases. Unfortunately, RFC 3066 does not permit the productive use of script codes for these cases; each example has to be separately registered, which can take quite a while..."

"ISO 15924 does not yet help with simplified Chinese vs. traditional Chinese. ISO 15924 does allow for variant scripts, such as Latf for the Fraktur variant of Latin, or Latg for the Gaelic variant. We would need variant codes for Chinese (Hani) corresponding to simplified and traditional (such as Hans and Hant) if script codes were used to address the above problem..."

"[We could] start the ball rolling on a 3066bis that would allow the productive use of script codes, allowing the formats in addition to the current RFC 3066 formats: <iso_639_code> "-" <iso_15924_code> and <iso_639_code> "-" <iso_15924_code> "-" <iso_3166_code>... We do need this eventually, for matching purposes, and since the ways in which languages can be used with different scripts is not closed. Roozbeh points out, for example, that az-Arab and uz-Arab are both used. But there was a reason that RFC 3066 allowed productive use of Language + Region codes: it saves a good deal of time and effort to not have to register all the combinations that anyone wants. Many different languages can be written in multiple scripts, and if one counts all the regions that they could be used in, it would be far better to make this a generative mechanism, like it is with ISO Language + Region codes..."

"Part of the fuzziness around this whole topic is that people have very slippery notions of what distinguishes a language code vs. a locale code. The problem is that both are somewhat nebulous concepts. In practice, many people use RFC 3066 codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because RFC 3066 includes an explicit region (country) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives an RFC 3066 code, it uses it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably..." ["Language Code Issues"]

Short list of "language tag" issues:

Neither ISO 639-1 nor ISO 639-2 referenced by RFC 3066 contains a complete list of languages sufficient for language identification in various application domains. ISO 639-2 lists 480+ languages, but there are more than 6000 natural/human languages.

The ISO process for registering new languages is too slow and (arguably) too restrictive, forcing language engineers to make do with the (arguably) under-specified mechanisms in RFC 3066 for creating new codes (principal options include: [1] "i-" <iana_registered_code> [2] "x-" <private_use_code> [3] <iso_639_code> [4] <iso_639_code> "-" <iso_3166_code>).

There is some disagreement about the role that script (and writing system) should play in the design of language tags. In principle, it is understood that some (national) languages have more than one official script (e.g., Azerbaijani written in both Latin and Cyrillic), and that any language can be written in any script. Not everyone agrees that binding language+script into a fixed (static, non-decomposable) language tag is a good idea because it requires too many permutations.

A "serious issue for most language tagging systems" is that the ISO codes used to compose them are not stable. See, for example, the decision by the maintenance agency for ISO 3166 to re-assign "cs" (formerly Czechoslovakia) to Serbia and Montenegro, and the IAB letter to ICANN. UTC Resolution 96-M5 of August 26, 2003 appealed to ISO to "rescind the re-assignment of the code 'cs' to Serbia and Montenegro at the earliest opportunity available, to minimize the impact [and to] change the policy to allow the re-use of codes only after a long period of time, such as 100 years..."

The practice of creating language tags based upon language+countryCode is problematic because ISO 3166 "Codes for the Representation of Names of Countries" is incomplete. More profoundly, the boundaries for political jurisdictions such as "country" do not match boundaries for language and language use, often based upon "region" geographically/culturally defined. Regions sometimes involve multiple countries.

There is no agreement about the optimal factoring of linguistic and non-linguistic features in the creation of language tags. Description of variants is often based upon dialect, orthography (spelling practices, e.g., British vs. American English), vocabulary-based sublanguage variant, time period, etc.

Whatever the approach to language tag design, almost all identification systems overload a single language attribute with separable, distinct features; examples are XML's reserved xml:lang attribute and HTTP Accept-Language.

The tag registry process is unsatisfactory to software developers and localization experts because it requires the registration of fixed-string complete tags as units, not adequately supporting generative approaches. [According to a note from Addison Phillips, Mark Davis recently registered several script-tagged variants, but to use country codes with those variations would have required an additional 20 or 30 registrations.]

There is widespread confusion about language vs. locale, with requirements for user (UI) preferences and geography being merged with 'real' linguistic features that drive language-based information processing (spellchecking, hyphenation, soundex matching, thesaurus lookups, collation rules, morphological parsing, speech synthesis, etc). See the summary in "Appendix D: Language and Locale IDs," from the Locale Data Markup Language specification.
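The registry issue in the list above (fixed-string tags registered as units versus a generative approach) comes down to a multiplication. The following sketch simply composes tags from independent code lists. The sample codes are illustrative only and no claim is made that every generated combination is in real use; `generate_tags` is a hypothetical helper.

```python
from itertools import product
from typing import List, Sequence


def generate_tags(languages: Sequence[str],
                  scripts: Sequence[str],
                  regions: Sequence[str]) -> List[str]:
    """Compose lang-script and lang-script-region tags from separate code lists."""
    tags = [f"{lang}-{script}" for lang, script in product(languages, scripts)]
    tags += [f"{lang}-{script}-{region}"
             for lang, script, region in product(languages, scripts, regions)]
    return tags


# Illustrative inputs: 2 languages, 3 scripts, 2 regions.
tags = generate_tags(["az", "uz"], ["Latn", "Cyrl", "Arab"], ["AZ", "UZ"])
print(len(tags))  # 18 = 2*3 + 2*3*2
```

Under a registration-per-tag regime each of those strings would need its own registry entry; under a generative syntax only the component codes need to exist.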

Comparison of ABNF Grammar Rules

Language tag syntax for RFC 3066 (in ABNF) is:

Language-Tag = Primary-subtag *( "-" Subtag )
Primary-subtag = 1*8ALPHA
Subtag = 1*8(ALPHA / DIGIT)
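
As a rough illustration, the three RFC 3066 rules above translate directly into a regular expression. This sketch checks syntax only; it does not verify that subtags are valid ISO codes or IANA registrations, and the name `is_rfc3066_wellformed` is hypothetical.

```python
import re

# Primary-subtag = 1*8ALPHA, followed by zero or more "-" Subtag,
# where Subtag = 1*8(ALPHA / DIGIT).
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_rfc3066_wellformed(tag: str) -> bool:
    """Return True if the whole string matches the RFC 3066 ABNF shape."""
    return RFC3066_TAG.fullmatch(tag) is not None


print(is_rfc3066_wellformed("en-US"))      # True
print(is_rfc3066_wellformed("i-klingon"))  # True
print(is_rfc3066_wellformed("de--DE"))     # False (empty subtag)
```

The grammar places no structure on the subtags beyond their length, which is why script, region, and variant cannot be told apart from syntax alone under RFC 3066.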

Language tag syntax for Tags for Identifying Languages 'draft-phillips-langtags-02' (in ABNF) is:

= lang ["-" script] ["-" region] *("-" variant)  [extensions]
=/ "x-" ALPHA * alphanumdash ; private use tag
=/ grandfathered-registrations
lang    = 2*3 ALPHA ; shortest ISO 639 tag
        =/ registered-lang
script  = 4 ALPHA ; ISO 15924 tag
region  = 2*3 ALPHA ; shortest ISO 3166 tag
variant =  5*16 alphanum
registered-lang = 5*16 alphanum
extensions      = "-x" 1* ("-" key "." value)
key   = ALPHA *alphanum
value = 1* utf8uri
grandfathered-registrations = ALPHA * (alphanumdash)
alphanum     = (ALPHA / DIGIT)
alphanumdash = (ALPHA / DIGIT / "-")
utf8uri      = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))

Examples

Informative Appendix B of the [updated] Tags for Identifying Languages draft presents several examples of Language Codes.

   Simple language code:
      de (German)
      fr (French)
      ja (Japanese)

   Language code plus Script code :
      zh-Hant (Traditional Chinese)
      en-Latn (English written in Latin script)
      sr-Cyrl (Serbian written with Cyrillic script)

   Language-Script-Region:
      zh-Hans-CN (Simplified Chinese for the PRC)
      sr-Latn-CS1 (Serbian, Latin script, Serbia and Montenegro)

   Language-Script-Region-Variant:
      en-Latn-US-boont (Boontling dialect of English)

   Language-Region:
      de-DE (German for Germany)
      zh-SG (Chinese for Singapore)
      cs-CS (Czech for Czechoslovakia)
      sr-CS1 (Serbian for Serbia and Montenegro, IANA registered variant)

   Other Mixtures:
      zh-CN (Chinese for the PRC)
      en-boont (Boontling dialect of English)

   Extension mechanism:
      de-CH-x-collation.phonebook
      az-Arab-x-SIL.AZE-dialect.derbend
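
The draft grammar and the examples above can be tied together with a small parser sketch. It covers only the first alternative of the grammar (lang, script, region, variants, extensions) as printed in this article; private-use "x-" tags and grandfathered registrations are deliberately left out because the latter need a registry lookup. The names `ParsedTag` and `parse_tag` are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class ParsedTag(NamedTuple):
    lang: str
    script: Optional[str]
    region: Optional[str]
    variants: List[str]
    extensions: Dict[str, str]


_TAG = re.compile(r"""
    (?P<lang>[A-Za-z]{2,3}|[A-Za-z0-9]{5,16})
    (?:-(?P<script>[A-Za-z]{4}))?
    (?:-(?P<region>[A-Za-z]{2,3}))?
    (?P<variants>(?:-[A-Za-z0-9]{5,16})*)
    (?:-x(?P<extensions>
        (?:-[A-Za-z][A-Za-z0-9]*\.(?:[A-Za-z0-9]|%[0-9A-Fa-f]{2})+)+
    ))?
""", re.VERBOSE)


def parse_tag(tag: str) -> Optional[ParsedTag]:
    """Split a tag into its components, or return None if it does not match."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    variants = [v for v in m.group("variants").split("-") if v]
    extensions: Dict[str, str] = {}
    if m.group("extensions"):
        # Drop the leading "-" and split into "key.value" pairs.
        for pair in m.group("extensions")[1:].split("-"):
            key, _, value = pair.partition(".")
            extensions[key] = value
    return ParsedTag(m.group("lang"), m.group("script"),
                     m.group("region"), variants, extensions)


print(parse_tag("az-Arab-x-SIL.AZE-dialect.derbend").extensions)
# {'SIL': 'AZE', 'dialect': 'derbend'}
```

One observation from running the sketch: because the region rule as printed is 2*3 ALPHA, tags such as sr-CS1 from the examples are rejected here, so the surrogate-style region values presumably rely on a rule not captured by this sketch.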

The XML Specification and 'Tags for Languages'

The draft Tags for Languages will likely impact a large number of XML-related specifications if it (or a similar document) advances to RFC status and comes to be regarded as the natural successor to RFC 3066.

Extensible Markup Language (XML) 1.1 (W3C Proposed Recommendation 05-November-2003) provides a reference for IETF RFC 3066 Tags for the Identification of Languages in its Appendix A (A.1 Normative References). RFC 3066 is referenced by XML 1.1 in two locations:

Section 1.1 'Origin and Goals': "This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.1 and construct computer programs to process it..."

Section 2.12 'Language Identification': "In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang MAY be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, MUST be declared if it is used. The values of the attribute are language identifiers as defined by IETF RFC 3066, Tags for the Identification of Languages, or its successor; in addition, the empty string MAY be specified..."

Similar references are made in Extensible Markup Language (XML) 1.0 (Second Edition) [W3C Recommendation 6-October-2000] where "Tags for the Identification of Languages, or its successor on the IETF Standards Track..." had the earlier designation RFC 1766. See also Section 2.12 'Language Identification' in the early Extensible Markup Language (XML) 1.0 [REC-xml-19980210, W3C Recommendation 10-February-1998] where grammar productions 33-38 (LanguageID, Langcode, ISO639Code, IanaCode, UserCode, Subcode) were included for Language Identification based upon IETF RFC 1766, ISO 639, and IANA registered language identifiers.
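
To make the Section 2.12 text quoted above concrete, here is a minimal sketch of reading xml:lang with Python's standard ElementTree module. It assumes the usual reading that a value applies to an element's content unless a descendant overrides it; the sample document and the helper name `effective_languages` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Yield (element, language) pairs, inheriting xml:lang from ancestors."""
    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        yield elem, lang
        for child in elem:
            yield from walk(child, lang)
    yield from walk(root, None)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    '<p>Hello</p>'
    '<p xml:lang="de-CH">Guten Tag</p>'
    '<p xml:lang="">42</p>'
    '</doc>'
)
print([(elem.tag, lang) for elem, lang in effective_languages(doc)])
# [('doc', 'en'), ('p', 'en'), ('p', 'de-CH'), ('p', '')]
```

The last paragraph uses the empty string that the specification explicitly allows. Any change to the tag syntax in a successor to RFC 3066 would affect how the values returned here are parsed, not how they are located.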

Principal references:

Tags for Languages. IETF Internet Draft. Reference: 'draft-langtags-phillips-davis-01.txt'. Released on November 18, 2003.
Tags for Languages. IETF I-D 'draft-langtags-phillips-davis-00'.
Tags for the Identification of Languages. By Harald Tveit Alvestrand (Cisco Systems). IETF Network Working Group, Request for Comments: 3066. Best Current Practice, 47. January 2001. Obsoletes RFC 1766.
Comments on 'Tags for Languages' Draft: send email to the ietf-languages@alvestrand.no mailing list.
Mailing list archives for IETF Language Tag Discussions, "... a list for the discussion of matters related to RFC 1766/RFC 3066 language tags, including but not limited to the registration of new tags..."
"Web Services Internationalization Requirements" Editors' copy. By Addison P. Phillips, Martin Dürst, Andrea Vine, Michael McKenna, Tex Texin, Takao Suzuki, and Debasish Banerjee. W3C Internationalization Web Services Task Force. The November 8, 2003 version of the W3C requirements document references Tags for Languages, as one of the requirements for internationalizing Web services is "a standardized extension to the proposed update to RFC 3066 that describes international preferences [See draft-langtags-phillips-davis-00]. Some of the items that such extensions would describe include: Locales, Timezones, and Collation Preferences..."
"RFC 3066 Language Code Assignments." By Michael Everson (Everson Typography). "As language-tag reviewer for RFC 3066, I am maintaining the following table to help users access the codes and information on them. Clicking on the name of the code itself will open the registration document from the IANA website. You can also view the IANA languages directory..."
Directory of language tag applications
IANA Language Tags. Updated 2003-10-09 or later. "In Tags for the Identification of Languages [BCP 47, RFC 3066] there is a provision for listing unique "tags" or names for languages and variants of languages. This document summaries the list of assigned language tags..."
ISO 15924: Code for the Representation of Names of Scripts
"Internet Application Protocol Collation Registry." IETF Network Working Group. By Chris Newman. Reference: 'draft-newman-i18n-comparator-01.txt'. Proposes http://www.iana.org/assignments/collation/collation-name.xml and http://www.iana.org/assignments/collation/summary.txt
Language Code Issues. By Mark Davis. 2003.04.10 or later.
IANA web site
"Markup and Multilingualism" - General references.
"Language Identifiers in the Markup Context" - Main reference page.

