The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: February 28, 2005.

IESG Announces Proposed IETF Working Group for Language Tag Registry Update.


Update 2005-03-08: On March 08, 2005 the IESG announced the formation of the Language Tag Registry Update (LTRU) Working Group. The Working Group Co-Chairs are Randy Presuhn and Martin Dürst. Addison Phillips and Mark Davis will serve initially as editors for the update to RFC 3066.

[February 28, 2005] The Internet Engineering Steering Group (IESG) has announced the submission of a proposal for a new IETF Working Group for 'Language Tag Registry Update' in the IETF Applications Area. The Steering Group requests comment on this proposal through March 2, 2005; it is expected that the creation of the Working Group will be discussed at the IESG teleconference on March 3, 2005.

The proposed Working Group would continue technical work on matters related to RFC 1766/RFC 3066 language tags, currently under discussion in the 'ietf-languages' list. RFC 3066, published in 2001, "describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags."

RFC 3066 language tags are used in a wide range of computing applications, and particularly in markup (meta-)languages (XML, HTML), to provide language attributes. Computing machines need to know what language a text is "in" so as to perform intelligent processing on encoded text: for spell-checking, indexing, searching, multilingual-context word wrapping, computer-synthesized speech, hyphenation, transliteration, sorting/collation, grammar checking, thesaurus building, machine translation, etc. The computer needs to know about both language and script (writing system) to do the right thing in a multilingual setting.

Several individual Internet Drafts have been prepared as successors to RFC 3066 (collectively known as RFC 3066bis), including the two-part version of February 14, 2005, composed of Tags for Identifying Languages and Matching Language Identifiers, both edited by Addison P. Phillips and Mark Davis. Review by various parties in the IETF context has pointed out a number of remaining complications stemming from dependencies upon other standards bodies and maintenance agencies (e.g., scripts, countries/regions, dates). These would be addressed within the proposed IETF Working Group.

The draft IETF proposal for the WG notes several issues that have presented challenges to the design of a successor to RFC 3066: "(1) [in]stability and accessibility of the underlying ISO standards; (2) difficulty with registrations and their acceptance; (3) lack of clear guidance on how to identify script and region where necessary; (4) lack of parseability and the ability to verify well-formedness; (5) lack of specified algorithms, apart from pure prefix matching, for operations on language tags."
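
The "pure prefix matching" mentioned in item (5) is the rule RFC 3066 gives for language ranges: a range matches a tag if the two are equal, or if the range is a prefix of the tag and the next character in the tag is a hyphen; the special range "*" matches any tag. A minimal Python sketch of that rule follows. The function name `prefix_match` is hypothetical and chosen only for illustration.

```python
def prefix_match(language_range: str, language_tag: str) -> bool:
    """Return True if language_range matches language_tag by prefix matching.

    Comparison is case-insensitive. The range "*" matches every tag.
    """
    rng = language_range.lower()
    tag = language_tag.lower()
    if rng == "*":
        return True
    # Exact match, or the range is followed by "-" in the tag.
    return tag == rng or tag.startswith(rng + "-")


print(prefix_match("en", "en-US"))       # range "en" covers "en-US"
print(prefix_match("en", "eng"))         # not a subtag boundary, so no match
print(prefix_match("zh-Hant", "zh-TW"))  # prefix matching cannot relate script to region
```

The last call illustrates why the proposal asks for more than prefix matching: a range that names a script and a tag that names a region share only the primary language subtag, so the rule has no way to relate them.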

One of the proposed Language Tag Registry Update WG document deliverables would correspond directly to RFC 3066, describing "the structure of the IANA registry and how the registered tags will relate to the generative mechanisms," with solutions to the problems mentioned above. The document should "describe how the meaning of language tags remains stable, even if underlying references should change, and how the structure is to remain stable in the future. For accessibility, it is to provide a mechanism for easily determining whether a particular subtag is valid as of a given date, without onerous reconstruction of the state of the underlying standard as of that time."

This first document would also describe "how generative mechanisms could use ISO 15924 (Codes for the Representation of Names of Scripts) and United Nations M.49 (Standard Country or Area Codes for Statistical Use) codes without explicit registration of all combinations, with guidance for how scripts should be incorporated into language tags. It is expected to provide mechanisms to support the evolution of the underlying ISO standards, in particular ISO 639-3, mechanisms to support variant registration and formal extensions; it would specify a mechanism for easily identifying the role of each subtag in the language tag, so that, for example, whenever a script code or country code is present in the tag it can be extracted, even without access to a current version of the registry."
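
To illustrate how the role of each subtag might be identified without consulting the registry, the sketch below classifies subtags by length and character class alone. It assumes the subtag shapes used in the syntax later published as RFC 4646 (four letters for a script, two letters or three digits for a region, five to eight alphanumerics or a digit followed by three alphanumerics for a variant); the February 2005 draft may differ in detail. The function name `subtag_roles` is hypothetical, and the sketch deliberately ignores extended language subtags, extensions, private use, and grandfathered tags.

```python
import re


def subtag_roles(tag: str) -> dict:
    """Assign a role to each subtag of a language tag using shape alone."""
    subtags = tag.split("-")
    roles = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
    }
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A single-character subtag starts an extension or private-use
            # sequence, which this sketch does not interpret.
            break
        seen_later = roles["region"] is not None or bool(roles["variants"])
        if (re.fullmatch(r"[A-Za-z]{4}", sub)
                and roles["script"] is None and not seen_later):
            roles["script"] = sub.title()      # e.g. "Latn"
        elif (re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub)
                and roles["region"] is None and not roles["variants"]):
            roles["region"] = sub.upper()      # e.g. "CS" or "419"
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            roles["variants"].append(sub.lower())
    return roles


print(subtag_roles("sr-Latn-CS"))
print(subtag_roles("es-419"))
print(subtag_roles("de-DE-1996"))
```

Because the classification depends only on the shape and position of each subtag, the script or region can be extracted even when the subtag's value is not known to the program, which is the property the proposal describes.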

A second major document to be produced within the Language Tag Registry Update WG "will describe matching algorithms for use with language tags. Language tags are used in a broad variety of contexts and it is not expected that any single matching algorithm will fit all needs. Developing a small set of common matching and sorting algorithms does seem likely to contribute to interoperability, however, as it seems likely that using protocols could reference these well-known algorithms in their specifications."

Recent IETF Individual Internet Drafts Revising RFC 3066

  • "Tags for Identifying Languages." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). Also available in HTML format with hyperlinks. IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-10'. February 14, 2005, expires August 15, 2005. 45 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user defined extensions for private interchange. This document obsoletes RFC 3066 (which replaced RFC 1766)...

    Information about a user's language preferences commonly needs to be identified so that appropriate processing can be applied. For example, the user's language preferences in a browser can be used to select web pages appropriately. A choice of language preference can also be used to select among tools (such as dictionaries) to assist in the processing or understanding of content in different languages.

    One means of indicating the language used is by labeling the information content with a language identifier. These identifiers can also be used to specify user preferences when selecting information content, or for labeling additional attributes of content and associated resources.

    These identifiers can also be used to indicate additional attributes of content that are closely related to the language. In particular, it is often necessary to indicate specific information about the dialect, writing system, or orthography used in a document or resource, as these attributes may be important for the user to obtain information in a form that they can understand, or important in selecting appropriate processing resources for the given content..."

  • "Matching Language Identifiers." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langmatching-00'. February 14, 2005, expires August 15, 2005. 15 pages. "This document describes different mechanisms for comparing and matching the language identifiers defined by RFC3066bis. Possible algorithms for language negotiation and content selection are described. Portions of this document obsolete RFC 3066...

    Given a set of language identifiers, such as those defined in RFC3066bis, various mechanisms can be envisioned for performing language negotiation and tag matching. The suitability of a particular mechanism to a particular application depends on the needs of that application.

    This document defines language ranges and syntax for specifying user preferences in a request for language content. It also specifies a default algorithm for matching language ranges to content (language tags), as well as alternate mechanisms suitable for certain applications..."
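
As one illustration of the kind of language negotiation such a document might cover, the sketch below selects content for an ordered list of user preferences by progressively truncating each preferred range from the right until it equals an available tag. This fallback strategy is an assumption made for illustration and is not necessarily the draft's default algorithm; the function name `lookup` and the sample tags are hypothetical.

```python
def lookup(preferences, available, default=None):
    """Pick the best available tag for an ordered list of preferred ranges.

    Each range is tried in priority order; if it is not available, its
    rightmost subtag is dropped and the shorter range is tried again.
    """
    by_lower = {tag.lower(): tag for tag in available}
    for rng in preferences:
        subtags = rng.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()  # fall back to a less specific range
    return default


available = ["en", "fr", "zh-Hant"]
print(lookup(["fr-CA", "en"], available))        # falls back from fr-CA to fr
print(lookup(["zh-Hant-TW"], available))         # falls back to zh-Hant
print(lookup(["ja"], available, default="en"))   # nothing matches, default used
```

A different application, such as filtering a catalog for every item in a given language, would instead want all tags matching a range rather than a single best result, which is why no single algorithm is expected to fit all needs.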

Why Revise RFC 3066?

"Reasons for Enhancing RFC 3066." Addison P. Phillips (ed). From Inter-Locale. Document for Public Review. "RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences. This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 therefore have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement. The proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: (1) Stability of the underlying ISO standards; (2) Accessibility of the underlying ISO standards for implementers; (3) Ambiguity of the tags defined by these ISO standards; (4) Difficulty with registrations and their acceptance; (5) Identification of script where necessary; (6) Extensibility. The stability, accessibility, and ambiguity issues are crucial. Currently, because of changes in underlying ISO standards, a valid RFC 3066 language tag may become invalid (or have its meaning change) at a later date. With much of the world's computing infrastructure dependent on language tags, this is simply unacceptable: it invalidates content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever... The authors of this specification have worked for the past year with a wide range of experts in the language tagging community to build consensus on a design for language tags that meets the needs and requirements of the user community. 
Language tags form a basic building block for natural language support in computer systems and content. The revision proposed in this specification addresses the needs of this community of users with a minimal impact on existing content and implementations, while providing a stable basis for future development, expansion, and improvement..."

Related Publications from W3C Internationalization (I18N) Activity

Two documents produced within the W3C Internationalization (I18N) Activity relate to the use of language identifiers in markup, as referenced in the announcements 'Language Tags in HTML and XML' and 'Language of XHTML and HTML Content'.

  • "Authoring Techniques for XHTML & HTML Internationalization: Specifying the Language of Content 1.0." Edited by Richard Ishida (W3C). W3C Working Draft. 24-February-2005. "Specifying the language of content is useful for a wide number of applications, from linguistically sensitive searching to applying language-specific display properties. In some cases the potential applications for language information are still waiting for implementations to catch up, whereas in others, such as detection of language by voice browsers, it is a necessity today. Marking up language information is something that can and should be done today. Without it, it is not possible to take advantage of any of these applications...

    This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content. It is produced by the Internationalization GEO (Guidelines, Education & Outreach) Working Group of the W3C Internationalization Activity...

    This document provides techniques for developing pages using HTML 4.01, XHTML 1.0 and XHTML 1.1 with CSS1, CSS2 and some parts of CSS3... The base versions considered for this version of the document include: Internet Explorer 6 (Windows); Firefox 1.0; Mozilla 1.4; Opera 7.0; Netscape Navigator 7.0; Safari 1.03; Internet Explorer 5.2 (Mac)..."

  • "Language Tags in HTML and XML." By Martin Dürst and Richard Ishida. W3C Updated I18N Article. February 24, 2005. "Language tags are used to indicate the language of text in HTML and XML documents, and are also used in HTTP headers, SMIL and SVG switch statements, CSS pseudo-elements, etc. This article describes how to choose values for language tags. It augments an existing article with information that previously existed in a tutorial. For HTML 4, language tags are specified with the lang attribute. For XML, language tags are given in the xml:lang attribute. In both cases, language information is inherited along the document hierarchy, i.e., it has to be given only once if the whole document is in one language, and language information nests, i.e., inner attributes overwrite outer attributes...

    Although RFC3066 language tags work well much of the time, there are still some issues: (1) Many more codes are needed than those provided by ISO to cover the approximately 6,000 languages of the world; (2) They don't cover the needs to express general regions; for example, there is still no tag for the generalized Latin-American Spanish that many organizations use to create Spanish content; (3) There is some lack of clarity between the use of language tag values for designating language vs. locale; (4) There is a need, sometimes, to distinguish the script used, in addition to the language. People are currently working on solutions to these issues, including people from ISO TC37, SIL, and W3C..."
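
The inheritance behavior described for xml:lang (a value set on an outer element applies to its descendants unless an inner element overrides it) can be sketched with Python's standard `xml.etree.ElementTree` module. The helper name `effective_languages` and the sample document are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (tag, language) for an element and all of its descendants."""
    # An element's own xml:lang overrides the inherited value.
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fr">Bonjour <q>monde</q></p>'
    "</doc>"
)
result = list(effective_languages(doc))
print(result)
```

Here the first paragraph inherits "en" from the root, while the second paragraph and the quotation nested inside it are both reported as "fr".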

Robin Cover, Editor: