Cover Pages: Language Identifiers in the Markup Context


Last modified: December 11, 2009

Language Identifiers in the Markup Context

Introduction
Language Code Listings
Use of Standard Code Lists
Language Tags and Operating Systems
General References
See also: "Markup and Multilingualism"

Introduction

[August 29, 2001] Since machines first began processing digitized text, computer users have understood that the machine needed to know what language a text was "in" so as to perform intelligent processing on the text: for spell-checking, indexing, searching, multilingual-context word wrapping, computer-synthesized speech, hyphenation, transliteration, sorting/collation, grammar checking, thesaurus building, machine translation, etc. The computer needs to know about both language and script (writing system) to do the right thing in a multilingual setting. Thus, the use of language codes to assist in machine processing of text is documented in a wide range of specifications, including markup metalanguages (SGML, XML), markup language applications, and software operating systems. Similarly, descriptive cataloging at the subject/metadata level needs to assign labels for linguistic properties of data/text in order to help users restrict their research to appropriate content. In support of interoperable computing solutions and information longevity, it is desirable to use standardized language codes inserted directly into marked-up documents.

As the mass of networked digital information grows ever larger and becomes easily accessible, demand increases for a taxonomy of human languages adequate to support language data classification, categorization, and linguistic annotation. It is now widely recognized that the ISO standards providing "codes for the representation of names of languages" (ISO 639, ISO/DIS 639-1, ISO 639-2) are inadequate to meet the application requirements being levied by users in a growing number of domains. Librarians and archivists cataloging written and aural language materials in minority languages may find that the 136 codes of ISO 639:1988, or even the 400+ codes of ISO 639-2:1998, are too few to support metadata description. Linguists applying language codes at a low level within natural language texts may discover that the ISO codes do not sufficiently distinguish regional, social, or dialectal variation. Data providers in general fields may find that the code identifiers used in the two largest projects -- with classification for 7,000 or 70,000 languages/dialects -- are too heavy for their purposes. Granularity, genetic reconstruction, and language groupings are but a few of the challenges facing design teams in an endeavor to create a (sic!) theory-neutral language code vocabulary.

Despite these problems inherent to any classification endeavor, petitions are now being heard in many quarters for collaborative effort toward the creation of better language identification formalisms that account for the richness of human language -- increasing the number of language codes and their descriptiveness along several language-property axes. A new work item approved by ISO earlier in 2001, for example, addresses the need for an International Standard with mechanisms for encoding language variation through time, geography, dialectal variation, writing system, and so forth. An initial proposal calls for codes supporting representation of the language along at least five axes: "geog (geographical specification), script (writing system), temp (temporal specification), socli (sociolinguistic specification), and style (stylistic specification)."

This document supplies a collection of references to publications and projects relating to language identification. The goal is multipurpose: (1) to save time for readers who wish to know more about language identification in the markup context; (2) to raise awareness of the importance of language identification; (3) to urge support for standards efforts which will be required to continue the process of requirements gathering and database design reflecting a rigorous intellectual approach to the problems.

Please send corrections/additions via email. -- Robin Cover

Language Code Listings

ANSI/NISO Codes for the Representation of Languages for Information

Codes for the Representation of Languages for Information Interchange. 'ANSI/NISO Z39.53-2001.' Revision of ANSI/NISO Z39.53-1994. An American National Standard Developed by the National Information Standards Organization. Approved August 31, 2001 by the American National Standards Institute. Published by the National Information Standards Organization: NISO Press, Bethesda, Maryland, U.S.A. Maintenance Agency: US Library of Congress. ISSN: 1041-5653. 24 pages. "A standardized 3-character code to indicate language in the exchange of information is defined. Codes are given for languages, contemporary and historical." Source URL as of 2003-09; see also the reference URL.

[March 13, 2001] Codes for the Representation of Languages for Information Interchange. ANSI/NISO Z39.53-200X. ISSN: 1041-5653. Revision of ANSI/NISO Z39.53-1994. 24 pages. A Draft American National Standard Developed by the National Information Standards Organization. Status: For Ballot February 9, 2001 - March 23, 2001. [see preceding; broken link removed]

According to the specification, "a standardized 3-character code to indicate language in the exchange of information is defined. Codes are given for languages, contemporary and historical. The purpose of this standard is to provide libraries, information services, and publishers a standardized code to indicate language in the exchange of information. This standard for language codes is not a prescriptive device for the definition of language and dialects but rather a list reflecting the need to distinguish recorded information by language." From the Foreword: "This standard was originally prepared by Standards Committee C, Language Codes, which was organized in 1979. Charged with 'providing a standard code for indicating languages for information interchange purposes,' the committee produced a standard based on the list of MARC language codes developed by the Library of Congress in cooperation with the National Agricultural Library and the National Library of Medicine. This code list is now published as the MARC Code List for Languages. Practical application of the MARC language codes has shown that in order to serve as an appropriate retrieval device for information, a standard list of language codes must reflect the linguistic content of the universal collection to which it is applied, with language codes assigned as needed to distinguish information in a given language or group of languages. The MARC language codes constitute such a list. 
The committee's decision to base the standard on the existing MARC list took into account these contributing factors: (a) several years' successful application of the MARC language codes resulting in many millions of bibliographic records containing the accepted MARC codes, (b) the mnemonic relationship of the MARC codes to the English language names of the languages with English being the operational language of most American libraries, information services, and publishers, and (c) the flexibility inherent in a three-character code. The MARC list may be consulted for references from alternative forms of language names, as well as for the assignments to collective codes of languages for which individual codes have not been established. This revised edition reflects a thorough review of the document and includes changes which are a result of requests and demonstrated need from users and implementors. In addition, it includes numerous changes necessary for compatibility with bibliographic language codes in ISO 639-2 (Codes for the representation of names of languages: Alpha-3 code). The MARC code list is kept consistent with both ANSI/NISO Z39.53 and ISO 639-2/B." See the main description, the comment form, and a cover memorandum. Contact: NISO, 4733 Bethesda Ave, Suite 300, Bethesda, MD 20814; Fax: 301-654-1721; Email: [email protected].

Description: "The language codes are designed to be used: (1) To designate the languages in which documents are or have been written or recorded; (2) To designate the languages in which document handling records (order records, bibliographic records, and the like) have been created. Language codes are not designed to be used: (1) To designate machine programming languages (FORTRAN, BASIC, and the like); (2) To distinguish languages from dialects. The dialect of a language is usually represented by the same language code as that used for the language... Each code comprises three roman alphabet characters. Codes generally were created using three characters usually based on an English form of the language name or, in some cases, a vernacular form of the corresponding language name. Future development of language codes will be based, whenever possible, on the vernacular form of the language, unless another language code is requested by the country or countries using the language. The codes are varied where necessary to resolve conflicts... Language codes are assigned either to individual languages or to related groups of languages. The level of specificity of the language code assigned is determined in each case to be the level necessary to maintain the utility of the standard based on the volume of documents or document handling records that have been or are expected to be written, recorded, or created. Levels of specificity represented by the language codes include: (1) Language codes for individual languages; (2) Collective language codes for linguistically or otherwise related groups of languages; (3) Collective language codes for linguistically or otherwise related groups of languages having individual language codes for some but not all languages so related. This standard does not indicate which level of specificity is represented by each code. 
The word 'languages' or 'other' as part of a descriptor may be taken to indicate that a language code is a collective language code. A collective language code is not intended to be used when an individual language code or another more specific collective language code is available." [cache]
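The two mechanical rules in the description above (a three-letter code shape, and the 'languages'/'other' hint for collective codes) can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample descriptors are illustrative rather than quoted from the standard, and a well-formed code is not necessarily an assigned one.

```python
import re

# Shape given in the description: three roman alphabet characters.
CODE_SHAPE = re.compile(r"[A-Za-z]{3}")


def is_well_formed(code: str) -> bool:
    """True if `code` has the three-letter shape; says nothing about assignment."""
    return CODE_SHAPE.fullmatch(code) is not None


def looks_collective(descriptor: str) -> bool:
    """Heuristic from the description: 'languages' or 'other' appears in the descriptor."""
    words = re.findall(r"[a-z]+", descriptor.lower())
    return "languages" in words or "other" in words


# Illustrative entries only (assumption); check descriptors against the published list.
SAMPLE = {
    "eng": "English",
    "cel": "Celtic (Other)",
    "sio": "Siouan languages",
}

for code, descriptor in SAMPLE.items():
    kind = "collective" if looks_collective(descriptor) else "individual"
    print(code, is_well_formed(code), kind)
```

Because the standard itself "does not indicate which level of specificity is represented by each code," the descriptor test is the only signal available from the list, which is why it is treated here as a heuristic and not as a lookup.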

[1994] National Information Standards Organization. Codes for the Representation of Languages for Information Interchange (ANSI/NISO Z39.53-1994). Bethesda, MD: NISO Press [for NISO], 1994. ISBN: 1-880124-10-6. ISSN: 1041-5653. Overview: "The National Information Standards Organization (NISO) has published a revised standard for language codes. Codes for the Representation of Languages for Information Interchange (ANSI/NISO Z39.53-1994) is used by libraries, information services, and publishers as the standard for designating languages in which documents or document handling records (such as order records or bibliographic records) have been created. The revised standard reflects a thorough review of the 1987 edition and includes many changes requested by users. Codes have been added for 28 languages or language groups previously not represented. The list codifies names for 399 languages. Numerous minor changes also have been made to reflect current accepted usage in language names. The USMARC Code List for Languages is kept consistent with ANSI/NISO Z39.53 and will be revised to incorporate the changes in this new edition." [from a NISO-L news announcement; see the complete text for details.] The standard was approved on September 21, 1994, by the American National Standards Institute. It was developed for NISO by an ad hoc working group composed of John Byrum (Chair), Rebecca Guenther, Sally H. McCallum, and Millicent Wewerka. It is a revision of ANSI Z39.53-1987. The 399 language codes are for contemporary and historical languages. The codes are based (largely) upon an existing MARC list of language names, where the MARC language codes have been used in the cataloging of millions of bibliographic works in a library setting. See unofficially: NISO 3-character language codes (Z39.53-1994), [mirror copy]. 
Also, for several proposed additions and deletions to Z39.53-1994, approved as of January 1997: see the update to USMARC Code List for Languages from November 15, 1996: "Any changes listed below [in this MARC code list] that were not included in Z39.53 will be incorporated at the next revision of that standard"; [mirror copy].

Ethnologue

[February 13, 2002] Ethnologue resources for language codes, announced by Peter Constable:

http://www.ethnologue.com/iso639/ -- entry point and intro (the following pages can be reached via links from this page)
http://www.ethnologue.com/iso639/codes.asp -- a table of current ISO 639 codes, with links for each code to a report showing our proposed mapping to Ethnologue entries
http://www.ethnologue.com/iso639/analysis.asp -- our analysis of ISO 639 codes
http://www.ethnologue.com/iso639/An_analysis_of_ISO_639.pdf -- a paper describing the principles by which we derived our proposed mapping and some issues that arise from the analysis (this is the paper I presented at IUC 20)
http://www.ethnologue.com/codes/ -- information on the codes used in the Ethnologue, with links to downloadable files containing core language data from the Ethnologue

Ethnologue: Languages of the World, Fourteenth Edition (by Barbara and Joseph Grimes) is available in print, CD-ROM, and online web formats. Published by SIL International, this work represents anthropological and linguistic survey work conducted over many years, resulting in a collection of some 6,809 language descriptions listed by country, 41,791 alternate names and dialect names, and 109 language family trees, together with 345 overviews of language situations. The work includes available alternate names, dialects, number of speakers, multilingualism, and other demographic and sociolinguistic information. The relevance of this language code index and database is described in the paragraphs below. See the Ethnologue language code index and description of the print version.

[September 07, 2000] "Language identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale." Paper presented by Peter Constable and Gary Simons (SIL International) at the Seventeenth International Unicode Conference (IUC17), September 07, 2000, San Jose, CA. "Information technologies, particularly the internet, are rapidly becoming more global in focus. At the same time, and partly as a result, economic development is quickly expanding in many previously lesser-developed regions of the world. One of the implications of this is that IT systems are being confronted with the challenges of the world's ethno-linguistic diversity. Considerable and productive effort is being made to create adequate I18N infrastructures for issues such as text encoding and processing in IT systems. Yet at the same time, infrastructures for dealing with issues of language and locale identification are lagging behind user needs. The connection between how text is encoded and how it should be processed cannot be properly closed until the language identification problem is solved, since so many aspects of text processing (like collating and spell-checking) are language specific. At present we are confronted with an issue of scale. The leading standard for addressing language identification, ISO 639-2, offers codes to identify approximately 450 languages. In fact, the number of languages spoken in the world today exceeds 6000, as is documented in SIL's online catalogue of the world's languages. The problem is that the world's linguistic diversity is at the same time very complex but well understood by relatively few. In this paper, we will explore the world's ethno-linguistic diversity, its challenges for IT, and some directions in which we can move forward toward solutions. 
In particular, we will (1) give an overview of the world's ethno-linguistic diversity; (2) discuss some of the inherent difficulties in devising systems of language and locale identification; (3) examine some existing IT practices and their successes and limitations; and (4) present work that SIL is doing in relation to language identification that can provide at least part of a needed solution for global IT systems."

[September 28, 2000] "Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale." By Peter Constable and Gary Simons. In SIL Electronic Working Papers. Reference: SILEWP 2000-001. September 2000. 22 pages. Keywords: ISO 639, RFC 1766, internationalization, I18N, linguistic diversity, web development, XML, language identification, information technology (IT). [A revised version of a paper that was presented at the 17th International Unicode Conference in San José, California in September, 2000, and which appears in the conference proceedings.] "Many processes used within information technology need to be customized to work for specific languages. For this purpose, systems of tags are needed to identify the language in which information is expressed. Various systems exist and are commonly used, but all of them cover only a minor portion of languages used in the world today, and technologies are being applied to an increasingly diverse range of languages that go well beyond those already covered by these systems. Furthermore, there are several other problems that limit these systems in their ability to cope with these expanding needs. This paper examines five specific problem areas in existing tagging systems for language identification and proposes a particular solution that covers all the world's languages while addressing all five problems." [...] The information technology (IT) industry has been driven in recent years to address problems of multilingualism and internationalization. This has been driven to a significant extent by the growth of the Internet. Rapidly increasing economic development throughout the world, together with the growth of the 'Net, has actually resulted in a significant increase in the number of languages that technologies need to support. 
In many parts of the world, speakers of previously 'unknown' languages (that is, unknown to speakers of 'major' languages) are beginning to make their mark on the World Wide Web, and are using their own languages to do so. Even apart from the Internet, communities of speakers of lesser-known languages are using technology to pursue linguistic development of their communities through literacy, literature development and other means. In addition, researchers such as linguists and anthropologists, development and relief organizations, and governments are pursuing interests involving thousands of different linguistic and ethnic communities around the world. In this work, they are seeking to make use of current information technologies, such as Unicode and XML. . . [Problem of scale:] The need for systems to cover thousands of languages is real, not merely hypothetical. For instance, SIL has been involved in projects in some 1,600 different languages, of which about 1,100 are current, and new projects are begun regularly. Thus, just within SIL, we have an immediate need for over 1,600 identifiers that conform to RFC 1766 for use within XML documents. We are aware of several other agencies that have similar, vastly multilingual needs, such as the Linguistics Data Consortium, the Linguist List, the Endangered Language Fund, UNESCO, various departments of the U.S. and other governments, and others. When we add the work of other institutions, individual linguists and the language communities themselves, the existing needs for language identifiers are considerably greater, and are only continuing to grow. As stated earlier, every language in the world represents a real need for a unique language identifier. When confronted with needs for thousands of language identifiers, we find that some existing systems do not scale well. There is the obvious problem of devising several thousand new tags. 
There are other problems with scaling, however, due either to the mechanism that a system uses for tags, or to the procedures for extending the coverage of a system. We will consider each of these in turn..." Also in PDF format. [cache]

[August 27, 2001] "Mapping Between ISO 639 and the SIL Ethnologue. Principles Used and Lessons Learned." By Peter Constable and Gary Simons (SIL International). 2001-08-09. 17 pages. "There is a growing consensus that ISO standards for language identification are not meeting current and future industry needs, and that new work should be done to enhance these standards. Various extensions have been considered, including the following: (1) Provide more comprehensive coverage for the world's languages, including the thousands of lesser-known languages that have been attested. (2) Provide more comprehensive coverage for language collections, specifically collections based on genetic language relationships. (3) Provide systems for extending language identifiers to create identifiers for paralinguistic categories, such as writing system, or identifiers for language varieties based on factors such as style, geographic region, or time period... We have endeavoured to provide a definitive statement of how the ISO 639-1 and ISO 639-2 codes map to and from the SIL Ethnologue. We consider it acceptable to use the Ethnologue for this purpose. The Ethnologue is not a perfect representation of all the world's languages. Indeed, such a goal is impossible in principle. The Ethnologue is, nevertheless, among the most complete and generally reliable compilations of information on the world's languages available today. The Ethnologue has identified languages with some form of operational definition for language in mind, one based on a primary criterion of mutual non-intelligibility, and this definition has been applied with at least some level of consistency across languages. 
In spite of its limitations, the Ethnologue has become a de facto standard among many users because of its completeness of coverage, because the complete inventory of languages and the wealth of supporting information is readily accessible on the Web, and because it has been deemed by these users to warrant a sufficient level of their confidence... The Ethnologue's inventory and identifiers have been used in a number of research efforts and publications conducted by various agencies. They have also been adopted as the basis for language identification by the Linguist List, the Open Language Archive Community [OLAC], and the Rosetta Project... The Ethnologue assigns a unique three-letter code for each language within its scope. Three features that make it particularly useful are that it is a single source providing comprehensive coverage of all modern, natural languages; that each of its identifiers represents the same type of category (namely, a language, as understood in terms of the operational definition it assumes); and that the denotation of each identifier is well documented and readily accessible on a public Web site. By presenting a thorough and detailed mapping of ISO code elements to languages enumerated in the Ethnologue, we can effectively provide an explicit statement as to what type of category each of the ISO code elements represents and what they denote... We have presented our proposed mappings in HTML pages that are available online, along with an analysis ["Analysis of ISO 639-2 to Ethnologue Mappings"]. We acknowledge, though, that definitive mappings can only be specified by the owners of the ISO 639-x standards since they are the ones who determine what normative definitions apply to the standards... In this paper, we outline the principles by which we determined how to map ISO 639-x code elements to languages listed in the Ethnologue. 
In the course of our work, it was necessary to make judgments regarding what the ISO code elements denote, and in so doing we were able to compile in specific detail a number of issues that need to be considered in relation to the ISO standards as they exist at present." See similarly "An Analysis of ISO 639: Preparing the Way for Advancements in Language Identification Standards," presented at the Twentieth International Unicode Conference (IUC20) (January 28-31, 2002, Washington DC, USA). [source]

IETF RFCs (RFC 5646, 5645, 4646, 4647, 3066, 1766)

IETF Working Group and Discussion List:

IETF Language Tag Registry Update (ltru) WG Charter

Discussion list archive for the IETF Language Tag Registry Update WG
ietf-languages list. "A list for the discussion of matters related to RFC 1766/RFC 3066 language tags, including but not limited to the registration of new tags."
Inter-Locale web site on internationalization and localization. Unofficial, maintained by Addison Phillips.

[September 09, 2009] Tags for Identifying Languages. Edited by Addison Phillips (Lab126) and Mark Davis (Google). IETF RFC 5646, BCP 47. Precursors of this document include RFC 4646, RFC 4647, RFC 3066, and RFC 1766. Source text, HTML. Credits to Stephane Bortzmeyer, Karen Broome, Peter Constable, John Cowan, Martin Duerst, Frank Ellerman, Doug Ewell, Deborah Garside, Marion Gunn, Alfred Hoenes, Kent Karlsson, Chris Newman, Randy Presuhn, Stephen Silver, Shawn Steele, and many, many others... "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user-defined extensions for private interchange..." See also Update to the Language Subtag Registry (RFC 5645). Comment: see the blog article "New Language Tag Specification, RFC 5646, Published" by Richard Ishida (W3C).
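As a rough illustration of the tag structure that RFC 5646 describes, the following Python sketch splits a tag into language, script, region, and variant subtags by their shapes. It is a deliberately simplified subset and not a conforming implementation: extended language subtags, extension and private-use sequences, grandfathered tags, and any check against the subtag registry are left out, and the names `LangTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


# Simplified shape: language[-script][-region](-variant)*
# Not handled: extlang, extensions, private use, grandfathered tags.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"                 # 2 or 3 letters
    r"(?:-(?P<script>[A-Za-z]{4}))?"               # 4 letters
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"      # 2 letters or 3 digits
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
)


def parse_tag(tag: str) -> Optional[LangTag]:
    """Split a tag by subtag shape; return None if it is outside the subset."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    variants = tuple(v.lower() for v in m.group("variants").split("-") if v)
    # Tags are case-insensitive; the casing below follows the usual convention.
    return LangTag(
        language=m.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
        variants=variants,
    )


print(parse_tag("zh-hant-tw"))
print(parse_tag("de-CH-1901"))
print(parse_tag("en_US"))  # underscore is not a valid separator, so None
```

A shape-based split like this says nothing about whether a subtag is actually registered; a real validator would also consult the Language Subtag Registry.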

[July 07, 2006] Matching of Language Tags. Edited by Addison Phillips (Yahoo! Inc) and Mark Davis (Google). Produced by members of the Language Tag Registry Update (LTRU) Working Group, in the IETF Applications Area. See the (unofficial) announcement of the IESG's approval for publication. In November 2005, the IESG approved the Tags for Identifying Languages document as a BCP and Initial Language Subtag Registry as an Informational RFC. Martin Duerst (Aoyama Gakuin University), co-chair of IETF's Language Tag Registry Update (LTRU) Working Group, announced that the IETF has approved version 15 of the "Matching of Language Tags" draft for publication. This document, together with version 14 of the companion "Tags for Identifying Languages" (now in RFC Ed Queue), will be published as an RFC and replace RFC 3066 ("Tags for the Identification of Languages"), which replaced RFC 1766. Currently, RFC 3066 or its successor is referenced normatively by XML 1.1 and other markup standards for constructing language identification tags. Knowledge about the particular language used by some piece of information content might be useful or even required by some types of processing; for example, spell-checking, computer-synthesized speech, Braille transcription, or high-quality print renderings. One means of indicating the language used is by labeling the information content with an identifier or 'tag'. The IETF document 'Tags for Identifying Languages' describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user defined extensions for private interchange. 
The document 'Matching of Language Tags' defines a syntax (called a language range) for specifying items in the user's list of language preferences (called a language priority list), as well as several schemes for selecting or filtering sets of language tags by comparing the language tags to the user's preferences. Applications, protocols, or specifications will have varying needs and requirements that affect the choice of a suitable matching scheme. It describes: how to indicate a user's preferences using language ranges; three schemes for matching these ranges to a set of language tags; and the various practical considerations that apply to implementing and using these schemes..."
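Two of the matching schemes can be sketched in Python to make the idea of ranges and priority lists concrete. This is a minimal sketch, assuming basic language ranges only: the third scheme (extended filtering) is omitted, wildcards inside a range are not handled, and the function names and the ordering of the filtered result are choices made for this illustration.

```python
from typing import Iterable, List, Optional


def basic_filter(ranges: Iterable[str], tags: Iterable[str]) -> List[str]:
    """Basic filtering: keep every tag that equals a range or extends it at a '-'.

    Comparison is case-insensitive; the range '*' matches every tag.
    Results are returned in the order the tags were supplied (a simplification).
    """
    wanted = [r.lower() for r in ranges]
    matched = []
    for tag in tags:
        t = tag.lower()
        if any(r == "*" or t == r or t.startswith(r + "-") for r in wanted):
            matched.append(tag)
    return matched


def lookup(ranges: Iterable[str], tags: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Lookup: return the single best tag by truncating each range from the end."""
    available = {t.lower(): t for t in tags}
    for r in ranges:
        if r == "*":
            continue  # the bare wildcard carries no preference, so skip it
        subtags = r.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            # A single-character subtag (such as 'x') is removed together
            # with the subtag that followed it.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


print(basic_filter(["de-de"], ["de-DE-1996", "de-Deva", "de", "de-DE"]))
print(lookup(["fr-CA", "en"], ["en", "fr"]))
print(lookup(["ja"], ["en", "fr"], default="en"))
```

Filtering returns every acceptable tag and suits tasks such as selecting all documents a reader can use; lookup returns exactly one result and suits tasks such as choosing a single resource bundle, which is why it needs a default.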

"Matching Language Identifiers." Edited by Addison Phillips (Quest Software) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-ietf-ltru-matching-00'. May 13, 2005, expires November 14, 2005. 20 pages. "This document describes different mechanisms for comparing and matching the tags for the identification of languages defined by RFC 3066bis."

"Tags for Identifying Languages." Edited by Addison P. Phillips (Quest Software) and Mark Davis (IBM). IETF Network Working Group. Internet Draft, reference 'draft-ietf-ltru-registry-00'. March 10, 2005, expires September 11, 2005. 44 pages.

[February 28, 2005] IESG Announces Proposed IETF Working Group for Language Tag Registry Update. The Internet Engineering Steering Group (IESG) has announced the submission of a proposal for a new IETF Working Group for 'Language Tag Registry Update' in the IETF Applications Area. The Steering Group requests comment on this proposal through March 2, 2005; it is expected that the creation of the Working Group will be discussed at the IESG teleconference on March 3, 2005. The proposed Working Group would continue technical work on matters related to RFC 1766/RFC 3066 language tags, currently under discussion in the 'ietf-languages' list. RFC 3066, published in 2001, "describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags." RFC 3066 language tags are used in a wide range of computing applications, and particularly in (meta-) markup languages (XML, HTML), to provide language attributes. Computing machines need to know what language a text is "in" so as to perform intelligent processing on encoded text: for spell-checking, indexing, searching, multilingual-context word wrapping, computer-synthesized speech, hyphenation, transliteration, sorting/collation, grammar checking, thesaurus building, machine translation, etc. The computer needs to know about both language and script (writing system) to do the right thing in a multilingual setting. Several individual Internet Drafts have been prepared as a successor to RFC 3066, including the February 14, 2005 two-part version composed of Tags for Identifying Languages and Matching Language Identifiers, edited by Addison P. Phillips and Mark Davis. Review by various parties in the IETF context has pointed out a number of remaining complications stemming from dependencies upon other standards bodies and maintenance agencies (scripts, countries). 
These would be addressed within the proposed IETF Working Group.
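In XML, the language attribute mentioned above is xml:lang, and its value applies to an element's content and descendants unless overridden. The following Python sketch, using the standard library's ElementTree, pairs each run of text with its effective language. The function name and the sample document are hypothetical, and the sketch does not validate the attribute values as language tags.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def texts_with_language(xml_text: str):
    """Yield (effective_language, text) pairs, inheriting xml:lang from ancestors."""
    root = ET.fromstring(xml_text)

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        if elem.text and elem.text.strip():
            yield lang, elem.text.strip()
        for child in elem:
            yield from walk(child, lang)
            # Tail text sits in the parent, so it takes the parent's language.
            if child.tail and child.tail.strip():
                yield lang, child.tail.strip()

    yield from walk(root, None)


SAMPLE = """<doc xml:lang="en">
  <p>Hello <q xml:lang="fr">bonjour</q> again</p>
  <p xml:lang="de">Guten Tag</p>
</doc>"""

for lang, text in texts_with_language(SAMPLE):
    print(lang, text)
```

A spell-checker or speech synthesizer would use pairs like these to switch dictionaries or voices at language boundaries within a single document.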

[February 14, 2005] "Tags for Identifying Languages." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). Also available in HTML format with hyperlinks. IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-10'. February 14, 2005, expires August 15, 2005. 45 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and the creation of user defined extensions for private interchange. This document obsoletes RFC 3066 (which replaced RFC 1766)."

[February 14, 2005] "Matching Language Identifiers." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langmatching-00'. February 14, 2005, expires August 15, 2005. 15 pages. "This document describes different mechanisms for comparing and matching the language identifiers defined by RFC3066bis. Possible algorithms for language negotiation and content selection are described. Portions of this document obsolete RFC 3066."
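
For orientation, the simplest matching construct in this family is the prefix rule already defined in RFC 3066: a language-range matches a tag if the two are equal, or if the range is a prefix of the tag and the next character in the tag is "-". The following Python sketch illustrates that rule only; the function names `range_matches` and `select` are hypothetical, and the algorithms in the draft itself may differ.

```python
def range_matches(language_range: str, tag: str) -> bool:
    """Apply the RFC 3066 prefix rule, ignoring case.

    "*" is treated as a wildcard that matches any tag.
    """
    if language_range == "*":
        return True
    r = language_range.lower()
    t = tag.lower()
    # Equal, or a prefix that ends exactly at a subtag boundary.
    return t == r or t.startswith(r + "-")


def select(preferred_ranges, available_tags):
    """Return the first available tag matched by the earliest preferred range.

    Hypothetical helper for content selection; returns None if nothing matches.
    """
    for language_range in preferred_ranges:
        for tag in available_tags:
            if range_matches(language_range, tag):
                return tag
    return None


if __name__ == "__main__":
    print(range_matches("en", "en-US"))   # range is a prefix at a boundary
    print(range_matches("en", "eng"))     # prefix, but not at a boundary
    print(select(["fr", "en"], ["de", "en-GB"]))
```

Note that the rule is asymmetric: the range "en" matches the tag "en-US", but the range "en-US" does not match the tag "en".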

[December 08, 2004] "IESG Announcement: Last Call for 'Tags for Identifying Languages' to BCP." - "The IESG has been considering 'Tags for Identifying Languages' [draft-phillips-langtags-08.txt] as a BCP. There have been considerable changes to the document since the initial last call, and the IESG would like the community to consider the changes. In addition, the authors have prepared text describing why this mechanism is needed as a replacement for the existing procedure... The IESG plans to make a decision in the next few weeks, and solicits final comments on this action." Reasons for Enhancing RFC 3066: "RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences. This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 therefore have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement. This specification, the proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: (1) Stability of the underlying ISO standards; (2) Accessibility of the underlying ISO standards for implementers; (3) Ambiguity of the tags defined by these ISO standards; (4) Difficulty with registrations and their acceptance; (5) Identification of script where necessary; (6) Extensibility. The stability, accessibility, and ambiguity issues are crucial..."

[November 15, 2004] "Tags for Identifying Languages." By Addison P. Phillips (editor; Director, Globalization Architecture, webMethods) and Mark Davis (IBM). Also available in HTML format with hyperlinks. IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-08'. November 9, 2004, expires May 10, 2005. 46 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)." Editor's note: "You should note that we think that this will be very near to the final version of this document. As such we have created an external document describing in very broad terms the design and design decisions made in hopes of better documenting the whys-and-wherefores for potential implementers. This document is available for public comment..." See the announcement for Draft-08. IETF ephemeral source: http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt.

[November 15, 2004] "Reasons for Enhancing RFC 3066." Addison P. Phillips (ed). Inter-Locale. Document for Public Review. "RFC 3066 and its predecessor, RFC 1766, define language tags for use on the Internet. Language tags are necessary for many applications, ranging from cataloging content to computer processing of text. The RFC 3066 standard for language tags has been widely adopted in various protocols and text formats, including HTML, XML, and CLDR, as the best means of identifying languages and language preferences. This specification proposes enhancements to RFC 3066. Because revisions to RFC 3066 therefore have such broad implications, it is important to understand the reasons for modifying the structure of language tags and the design implications of the proposed replacement. The proposed successor to RFC 3066, addresses a number of issues that implementers of language tags have faced in recent years: (1) Stability of the underlying ISO standards; (2) Accessibility of the underlying ISO standards for implementers; (3) Ambiguity of the tags defined by these ISO standards; (4) Difficulty with registrations and their acceptance; (5) Identification of script where necessary; (6) Extensibility. The stability, accessibility, and ambiguity issues are crucial. Currently, because of changes in underlying ISO standards, a valid RFC 3066 language tag may become invalid (or have its meaning change) at a later date. With much of the world's computing infrastructure dependent on language tags, this is simply unacceptable: it invalidates content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever... The authors of this specification have worked for the past year with a wide range of experts in the language tagging community to build consensus on a design for language tags that meets the needs and requirements of the user community. 
Language tags form a basic building block for natural language support in computer systems and content. The revision proposed in this specification addresses the needs of this community of users with a minimal impact on existing content and implementations, while providing a stable basis for future development, expansion, and improvement..."

[October 17, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-05'. October 7, 2004, expires April 7, 2005. 46 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)." See Inter-Locale Home Page and the HTML format.

[September 10, 2004] "Tags for Identifying Languages." Reference: 'draft-phillips-langtags-06'. "Version -06 has one substantive modification: the ABNF for variant subtags was modified to make four-digit year subtags (such as '1996' and '1901') legal. This change was implemented so that variant subtags that start with a digit can be four characters in length." Also in HTML format.

[August 16, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-05'. August 9, 2004, expires February 7, 2005. 47 pages. 18 references. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)..." See also the HTML version with links. Editor's notes from the announcement: "This document's changes section details the specific alterations in this version of the document. There are not that many substantive changes in this version. The majority of the changes are related to specific comments we received during the last two rounds of review. Also substantial work on the prototype registry between Doug Ewell and the authors (Mark and I) has resulted in a few tweaks to the examples and some rewriting in sections 3.1 and 3.2 (whose order has been swapped). Please review the changes section for specifics. We feel that this draft addresses all of the comments on this list from prior drafts (within the goals we set — which we enumerate now in the changes section). Absent the question of whether there should be a subtag registry at all, we feel that this document is very near its final form. Of course we welcome comments from the community, including vigorous debate where it is necessary, but sincerely hope that we can move forward with this draft with a new Last Call very soon..."

[June 30, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-04'. June 24, 2004, expires December 23, 2004. 42 pages. "This document describes the structure, content, construction, and semantics of language tags for use in cases where it is desirable to indicate the language used in an information object. It also describes how to register values for use in language tags and a construct for matching such language tags, including user defined extensions for private interchange. This document replaces RFC 3066 (which replaced RFC 1766)... The language tag is composed of one or more parts: A primary language subtag and a (possibly empty) series of subsequent subtags. Subtags are distinguished by their length, position in the subtag sequence, and content, so that each type of subtag can be recognized solely by these features. This makes it possible to construct a parser that can extract and assign some semantic information to the subtags, even if specific subtag values are not recognized. Thus a parser need not have an up-to-date copy of the registered subtag values to perform most searching and matching operations..." Note: Mark Davis said in v04 "we provide for way for programs to really validate IDs by providing a complete list of all valid subtags... The most substantive issue I'd like to get feedback on is that we still allow in this draft subtags of up to 15 long (for readability), whereas RFC 3066 has a maximum of 8. The question is whether that would cause enough of a problem for older parsers that we should pull back to a maximum of 8..."
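
The idea that a subtag's type can be recognized from its length, position, and content alone can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the draft's grammar: the shape rules below (4 letters for script, 2 letters or 3 digits for region, 5 to 8 alphanumerics or a digit-initial 4-character string for variant, single characters as singletons) are modeled on the form later published as RFC 4646, extended-language subtags are ignored, and the function name `classify_subtags` is hypothetical.

```python
def classify_subtags(tag: str):
    """Label each subtag by shape and position, without consulting a registry.

    Returns a list of (kind, subtag) pairs. Assumes ASCII input.
    """
    labeled = []
    slot = "language"   # earliest positional slot still open
    tail = None         # "extension" or "privateuse" once a singleton is seen
    for s in tag.split("-"):
        if tail == "privateuse":
            # Everything after "x" is private use.
            labeled.append(("privateuse", s))
        elif len(s) == 1:
            tail = "privateuse" if s.lower() == "x" else "extension"
            labeled.append(("singleton", s))
        elif tail == "extension":
            labeled.append(("extension", s))
        elif slot == "language":
            labeled.append(("language", s))
            slot = "script"
        elif slot == "script" and len(s) == 4 and s.isalpha():
            labeled.append(("script", s))
            slot = "region"
        elif slot in ("script", "region") and (
            (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit())
        ):
            labeled.append(("region", s))
            slot = "variant"
        elif s.isalnum() and (5 <= len(s) <= 8 or (len(s) == 4 and s[0].isdigit())):
            labeled.append(("variant", s))
            slot = "variant"
        else:
            labeled.append(("unknown", s))
    return labeled


if __name__ == "__main__":
    for example in ("en-Latn-US-x-gabble", "de-1996", "x-gabble"):
        print(example, classify_subtags(example))
```

Because no registry lookup is involved, the sketch can say that "Latn" sits in the script position without knowing whether "Latn" is a registered script code; validation against registered values would be a separate step.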

[June 21, 2004] "Supplementary Codes for RFC 3066bis." By Doug Ewell [WWW]. Announcement posted 2004-06-21. The web page "discusses the use of 'deprecated' ISO 639 codes, 'formerly used' ISO 3166 codes, and United Nations M.49 numeric geographical codes in RFC 3066bis. RFC 3066bis provides a great deal of flexibility, and along with it, some potential for confusion. This page describes the two different sets of region codes, explains the rules on deprecated ISO codes, and shows why the freely available, official code lists aren't enough by themselves to answer all questions. I hope this page will solidify the issues in my own head, explain them for anyone who is still puzzled, and eventually turn into a useful reference for language tag users once the new RFC is approved..."

[June 02, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc.) and Mark Davis (IBM). IETF Network Working Group. Internet Draft. Reference: 'draft-phillips-langtags-03'. June 02, 2004, expires December 1, 2004. 35 pages. Also in PDF format. IETF Source: http://www.ietf.org/internet-drafts/draft-phillips-langtags-03.txt. See the news story: "Tags for Identifying Languages: IESG Issues Last Call Review for IETF BCP."

[April 09, 2004] "Tags for Identifying Languages." By Addison Phillips (webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-02'. April 8, 2004, expires October 7, 2004. 31 pages. "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange." AP note: "This version contains a few changes based on discussion on this list['[email protected]'], notably it more closely defines the rules for using UN M49 identifiers to resolve ambiguity. It also contains semi-substantial wordsmithing in section 2 which is not substantive, but which does make the rules (we think) clearer and easier to understand..." See also Inter-Locale Home (internationalization content and demos written by Addison Phillips). [PDF]

[February 14, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-01'. February 10, 2004, expires August 11, 2004. [date issues, need to clarify]. See the note explaining what's new: "(1) We removed the key.value structure from extensions. These are now 2 to 32 character alphanum subtags with no defined structure. (2) We added the concept of 'extended language' subtags, to handle the comments by Peter Constable about language relationships stemming from future adoption of ISO639-3. These are also explicitly reserved for future use. (3) We reserved single character subtags explicitly -- these were implicitly reserved by the syntax previously. (4) We revised the ABNF. Note that we have now unified all private use subtags with the same rules. That is the rules are the same for x-gabble and en-Latn-US-x-gabble. (5) We added support for UN country ID numbers (as suggested by John Cowan and others). These were made the 'ambiguity resolution mechanism' of choice for country IDs..." [Addison P. Phillips]

[January 05, 2004] "Tags for Identifying Languages." By Addison Phillips (Editor, webMethods, Inc) and Mark Davis (IBM). IETF Network Working Group, Internet Draft. Reference: 'draft-phillips-langtags-02'. December 17, 2003; expires June 16, 2004. 29 pages. [See also: draft-phillips-langtags-00 and the announcement.] "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange..." The ABNF formally specifies the syntax in which "the language tag is composed of one or more parts: A primary language subtag and a (possibly empty) series of subsequent subtags. The sequence of subtags has a specific structure that depends on the length of the subtag to distinguish each tag type." This Internet draft is based upon the earlier RFC 3066: "The main goals were to maintain backward compatibility (so that all previous codes would remain valid); reduce the need for large numbers of registrations; to provide a more formal structure to allow parsing into subtags even where software does not have the latest registrations; to provide stability in the face of potential instability in ISO 639, 3166, and 15924 codes (demonstrated instability in the case of ISO 3166); and to allow for external extension mechanisms. [The specification;] (1) Allows ISO15924 script code subtags and allows them to be used generatively. (2) Adds the concept of a variant subtag and allows variants to be used generatively. (3) Adds an extension mechanism which does not require registration to use. (4) Defines the private use tags in ISO639, ISO15924, and ISO3166 as the mechanism for creating private use language, script, and region subtags respectively. (5) Defines a syntax for private use variant subtags which can be used without registration. 
(6) Defines a process for handling reuse of values by ISO639, ISO15924, and ISO3166 in the event that they register a previously used value for a new purpose. (7) Changes the IANA language tag registry to a language subtag registry..." Note on the ISO 3166 "demonstrated instability": see the entry "Stability of ISO 3166 and other infrastructure standards" under Unicode Technical Committee Public Positions and UTC Resolution 96-M5 (August 26, 2003): "The recent decision by the maintenance agency for ISO 3166 to re-assign 'cs' (formerly Czechoslovakia) to Serbia and Montenegro can cause severe problems. Country codes are a fundamental component of modern computing infrastructure: major operating systems, postal services, business applications, identification and security systems, to name a few. Their stability must be guaranteed. Data that is identified by these codes has a shelf life of decades, not five years. [Recommended corrective actions to take include:] (1) Rescind the re-assignment of the code 'cs' to Serbia and Montenegro at the earliest opportunity available, to minimize the impact; (2) Change the policy to allow the re-use of codes only after a long period of time, such as 100 years..." Davis wrote (2003-08-05): "The major computer systems and standards around the world, including most operating systems, use the two letter country codes. These codes must be stable and unique or data corruption will occur. Simply because a country ceases to exist does not mean that data for that country ceases to exist, nor that new data referring to that previous country cannot be created..." [Note: This document 'Tags for Identifying Languages' updates references given in the following news item 'IETF Draft on Language Tags Defines Mechanism for Private Use Extension'.]

[November 14, 2003] IETF Draft on Language Tags Defines Mechanism for Private Use Extension. An initial public draft of Tags for Languages presented to the IETF Network Working Group builds upon the current IETF RFC 3066 Tags for the Identification of Languages and defines additional mechanisms for private use extension. The Internet Draft also clarifies how private use, registered values, and matching interact. Identifiers known as language tags are authorized for use in XML and many related computing technologies that need to support language-sensitive and locale-based processing. Current practice regarding the creation, registration, and use of language tags is in a considerable state of confusion and "mess," in the experience of localization experts and software engineers. The goal of the new draft is to work toward a new IETF RFC that replaces RFC 3066. The proposed syntax for construction of a language tag provides for designation of language, script, region, variant, and arbitrary extension (using name/value pairs). Under the new proposal, "all 4-letter subtags are interpreted as ISO 15924 alpha-4 script codes from ISO 15924, or subsequently assigned by the ISO 15924 maintenance agency or governing standardization bodies, denoting the script or writing system used in conjunction with this language. All 2-letter and 3-letter subtags are interpreted as ISO 3166 alpha-2 (or alpha-3) country codes from ISO 3166, or subsequently assigned by the ISO 3166 maintenance agency or governing standardization bodies, denoting the area to which this language variant relates. Region tags must occur after any script tags and before any variants or extensions." A further goal of the new RFC is to provide for stable language tags even in the face of ISO instability. 
"To maintain backwards compatibility, there are two provisions to account for instabilities in ISO 639, 3166, and 15924 codes: (1) Ambiguity - in the event that one of these ISO standards reassigns a code that was previously assigned to a different value, the new use of the code will not be permitted and the IANA registry, as soon as practical, will register a surrogate value for the new code, based on the year that the new code assignment was made. (2) Stability - all other ISO codes are valid, even if they have been deprecated; where a new equivalent code has been defined, implementations should treat these tags as identical."
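
The stability provision, under which a deprecated code and its replacement are treated as identical, can be sketched in Python as a normalization step applied before comparison. The mapping below is a small illustrative sample of ISO 639 codes that were withdrawn in favor of new ones ('iw' to 'he', 'in' to 'id', 'ji' to 'yi'); it is not a complete table, and the names `EQUIVALENT`, `normalize`, and `same_language_tag` are hypothetical.

```python
# Illustrative sample only: deprecated ISO 639 two-letter codes and the
# codes that replaced them. A real implementation would need a full table.
EQUIVALENT = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def normalize(tag: str) -> str:
    """Lowercase the tag and map a deprecated primary subtag to its replacement."""
    parts = tag.lower().split("-")
    parts[0] = EQUIVALENT.get(parts[0], parts[0])
    return "-".join(parts)


def same_language_tag(a: str, b: str) -> bool:
    """Compare two tags, treating deprecated and replacement codes as identical."""
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    print(same_language_tag("iw-IL", "he-IL"))
    print(same_language_tag("en", "fr"))
```

The sketch covers only the second provision. The first (registering a surrogate value when an ISO code is reassigned) depends on registry data and is not modeled here.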

IETF RFC 3066 Tags for the Identification of Languages. IETF Network Working Group. Request for Comments [RFC]: 3066. January 2001. Obsoletes RFC 1766. Category: Best Current Practice. Harald Tveit Alvestrand (Cisco Systems, Weidemanns vei 27, 7043 Trondheim, Norway. Phone: +47 73 50 33 52; Email: [email protected]). "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags... Meaning of the language tag: The language tag always defines a language as spoken (or written, signed or otherwise signaled) by human beings for communication of information to other human beings. Computer languages such as programming languages are explicitly excluded. There is no guaranteed relationship between languages whose tags begin with the same series of subtags; specifically, they are NOT guaranteed to be mutually intelligible, although it will sometimes be the case that they are. The relationship between the tag and the information it relates to is defined by the standard describing the context in which it appears. [For example:] In markup languages, such as HTML and XML, language information can be added to each part of the document identified by the markup structure (including the whole document itself). For example, one could write <span lang="FR">C'est la vie.</span> inside a Norwegian document; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the marked section meant. If the user were listening to that document through a speech synthesis interface, this formation could be used to signal the synthesizer to appropriately apply French text-to-speech pronunciation rules to that span of text, instead of misapplying the Norwegian rules..." [source]
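
The markup example in the RFC, a French span inside a Norwegian document, implies that language information is scoped by the markup structure. A minimal Python sketch of that scoping follows, assuming the XML attribute xml:lang (rather than the HTML lang attribute shown in the RFC) and assuming that an element without its own value inherits the value of its parent. The function name `languages_by_element` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined "xml" prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_by_element(xml_text: str):
    """Return (element name, effective language) pairs in document order."""
    found = []

    def walk(elem, inherited):
        # An element's own xml:lang overrides the inherited value.
        lang = elem.get(XML_LANG, inherited)
        found.append((elem.tag, lang))
        for child in elem:
            walk(child, lang)

    walk(ET.fromstring(xml_text), None)
    return found


if __name__ == "__main__":
    sample = (
        "<doc xml:lang='no'><p>Tekst "
        "<span xml:lang='fr'>C'est la vie.</span></p></doc>"
    )
    for name, lang in languages_by_element(sample):
        print(name, lang)
```

A speech synthesizer or dictionary lookup of the kind the RFC describes would consult the effective language of the span ("fr") rather than that of the enclosing document ("no").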

RFC 3066 language tag sources.

"The following rules apply to the primary subtag:

All 2-letter subtags are interpreted according to assignments found in ISO standard 639, 'Code for the representation of names of languages' [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. (Note: A revision is underway, and is expected to be released as ISO 639-1:2000)

All 3-letter subtags are interpreted according to assignments found in ISO 639 part 2, 'Codes for the representation of names of languages -- Part 2: Alpha-3 code [ISO 639-2]', or assignments subsequently made by the ISO 639 part 2 maintenance agency or governing standardization bodies.

The value "i" is reserved for IANA-defined registrations

The value "x" is reserved for private use. Subtags of "x" shall not be registered by the IANA.

Other values shall not be assigned except by revision of this standard.

The reason for reserving all other tags is to be open towards new revisions of ISO 639; the use of "i" and "x" is the minimum we can do here to be able to extend the mechanism to meet our immediate requirements." [cache]
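
The primary-subtag rules quoted above amount to a small decision table, sketched here in Python. The sketch checks only the shape of the primary subtag and does not verify that a code is actually assigned in ISO 639; the function name `primary_subtag_source` and the returned labels are hypothetical.

```python
def primary_subtag_source(tag: str) -> str:
    """Say which rule governs the primary subtag of an RFC 3066 style tag."""
    primary = tag.split("-")[0].lower()
    if primary == "i":
        return "IANA registration"
    if primary == "x":
        return "private use"
    if primary.isalpha() and len(primary) == 2:
        return "ISO 639-1"
    if primary.isalpha() and len(primary) == 3:
        return "ISO 639-2"
    # All other values are reserved for future revisions of the standard.
    return "unassigned"


if __name__ == "__main__":
    for example in ("fr", "haw", "i-klingon", "x-gabble", "abcd"):
        print(example, "->", primary_subtag_source(example))
```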

IETF RFC 1766 Tags for the Identification of Languages. IETF Network Working Group. Request for Comments: 1766. March 1995. Category: Standards Track. By Harald Tveit Alvestrand (UNINETT). "This document describes a language tag for use in cases where it is desired to indicate the language used in an information object. It also defines a Content-language: header, for use in the case where one desires to indicate the language of something that has RFC-822-like headers, like MIME body parts or Web documents, and a new parameter to the Multipart/Alternative type, to aid in the usage of the Content-Language: header." On 'meaning of the language tag': "It would be possible to define (for instance) an SGML DTD that defines a <LANG xx> tag for indicating that following or contained text is written in this language, such that one could write "<LANG FR>C'est la vie</LANG>"; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the quote meant... ... In the primary language tag, all 2-letter tags are interpreted according to ISO standard 639, 'Code for the representation of names of languages' [ISO 639]." RFC 3066, which supersedes and obsoletes RFC 1766, allows 3-letter language tags from ISO 639-2:1998; see preceding. [cache] Contact: [email protected].

"RFC 3066 Language code assignments." By Michael Everson (Dublin). 2001-08-07 or later. "As language-tag reviewer for RFC 3066, I am maintaining the following table to help users access the codes and information on them. Clicking on the name of the code itself will open the registration document from the IANA website. You can also view the IANA languages directory..."

ISO 3166-1: The Code List. RFC 3066 describes the construction of language tags with ISO 3166 country codes. See here English or the French language version of the country names and Alpha-2 (i.e., two-letter) code elements of ISO 3166-1. See also the ISO 3166 Maintenance Agency (ISO 3166/MA) Home Page.

Two Alternative Proposals for Language Taging in ACAP. IETF Internet Draft. Reference: 'draft-ietf-acap-langtag-00.txt'. June 1997. By Martin J. Dürst (Multimedia-Laboratory, Department of Computer Science University of Zurich). Abstract: "For various computing applications, it is helpful to know the language of the text being processed. This can be the case even if otherwise only pure character sequences (so-called plain text) are handled. From several sides, the need for such a scheme for ACAP has been claimed. One specific scheme, called MLSF, has also been proposed, see 'draft-ietf-acap-mlsf-01.txt' for details. This document proposes two alternatives to MLSF. One alternative is using text/enriched-like markup. The second alternative is using a special tag-introduction character. Advantages and disadvantages of the various proposals are discussed. Some general comments about the topic of language tagging are given in the introduction... Option 1: A Text/Enriched-like Notation for Language Tags (TELT)... "specifies a text/enriched-like notation for language tags, leading to a format simmilar to text/enriched. It can be used with any character encoding that contains the necessary subset of the US-ASCII character repertoire. Language tags are of the form '<LANG=xxxxx>' where xxxxx is a language tag as defined in [RFC1766], with all letters written in upper case. No whitespace of any kind is allowed between '<' and '>'. Language alternatives are started by '<ALTLANG>'. Again, no whitespace is allowed between '<' and '>'. The use of the character sequences '<LANG=' and '<ALTLANG<' is not allowed in the text itself. Code to convert from this notation to MLSF and back and to test for false positives in plain text search is given in an appendix... Option 2: Language Tags using a Start Tag Character (STLT)... as a method of language taging is only useable with character encodings that can represent the BMP of the Universal Character Set [ISO10646]. 
For the purpose of illustration, the character PILCROW SIGN (paragraph sign, U+00B6) is used as the tag start character..." Note: The UTR #7 report was reported to be "the result of an intense email discussion regarding language tagging and related issues, occasioned by the review of draft-ietf-acap-mlsf-01.txt and of draft-ietf-acap-langtag-00.txt, which proposed different mechanisms for language tagging in plain text..."
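
The TELT notation from Option 1 is simple enough to sketch a reader for it in Python. The sketch rests on assumptions the abstract does not state: that a '<LANG=xxxxx>' tag applies to the text that follows it until the next such tag, and that '<ALTLANG>' can be left unhandled. The names `LANG_TAG` and `split_by_language` are hypothetical, and this is not the conversion code from the draft's appendix.

```python
import re

# '<LANG=xxxxx>' with an upper-case RFC 1766 style tag and no whitespace.
LANG_TAG = re.compile(r"<LANG=([A-Z0-9-]+)>")


def split_by_language(text: str, default=None):
    """Split TELT-style text into (language, segment) pairs.

    Assumption: each tag governs the text up to the next tag.
    Text before the first tag is labeled with `default`.
    """
    segments = []
    current = default
    pos = 0
    for match in LANG_TAG.finditer(text):
        if match.start() > pos:
            segments.append((current, text[pos:match.start()]))
        current = match.group(1)
        pos = match.end()
    if pos < len(text):
        segments.append((current, text[pos:]))
    return segments


if __name__ == "__main__":
    print(split_by_language("<LANG=EN>Hello <LANG=FR>C'est la vie"))
```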

Multi-Lingual String Format (MLSF). IETF Internet Draft. Reference: 'draft-ietf-acap-mlsf-01.txt'. June 1997. Author: Chris Newman (Innosoft International, Inc.). Abstract: "The IAB charset workshop concluded that for human readable text there should always be a way to specify the natural language. Many protocols are designed with an attribute-value model (including RFC 822, HTTP, LDAP, SNMP, DHCP, and ACAP) which stores many small human readable text strings. The primary function of an attribute-value model is to simplify both extensibility and searchability. A solution is needed to provide language tags in these small human readable text strings, which does not interfere with these primary functions. This specification defines MLSF (Multi-Lingual String Format) which applies another layer of encoding on top of UTF-8 to permit the addition of language tags anywhere within a text string. In addition, it defines an alternate form which can be used to include alternative representations of the same text in different character sets. MLSF has the property that UTF-8 is a proper subset of MLSF. This preserves the searchability requirement of the attribute-value model. Appendix F of this document includes a brief discussion of the background behind MLSF and why some other potential solutions were rejected for this purpose..." [cache]

ISO 639

Overview. The ISO 639 standard provides an official list of the "names of languages" and related language information. ISO 639:1988 presented a set of 136 two-character language codes, while the current revision effort toward ISO 639-1 focuses upon additional two-letter language identifiers. ISO/FDIS 639-1:2001 (Final Draft International Standard) has been completed, and includes about 190 language identifiers; see the note of July 23, 2001 from the TC convener and the provisional listing from the WG web site. ISO 639-2 includes three-letter language codes. From the introduction to ISO 639-2: "ISO 639 provides two sets of language codes, one as a two-letter code set (639-1) and another as a three-letter code set for the representation of names of languages. ISO 639-1 was devised primarily for use in terminology, lexicography and linguistics. ISO 639-2 represents all languages contained in ISO 639-1 and in addition any other language as well as language groups as they may be coded for special purposes when more specificity in coding is needed. The languages listed in ISO 639-1 are a subset of the languages listed in ISO 639-2; every language code in the two-letter code set has a corresponding language code in the alpha-3 list, but not necessarily vice versa. Both code lists are to be considered as open lists. The codes were devised for use in terminology, lexicography, information and documentation (i.e., for libraries, information services, and publishers) and linguistics." ISO 639-2:1998 provides identifiers for about 450 languages.

Update 2004-04: The ISO 639 family of standards is being extended by work in several working groups. See the summary from Håvard Hjulstad as of November 2003, referencing the following:

639-3 Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages
639-4 Codes for the representation of names of languages -- Part 4: Implementation guidelines and general principles for language coding
639-5 Codes for the representation of names of languages -- Part 5: Alpha-3 code for language families and groups
639-6 Codes for the representation of names of languages -- Part 6: Alpha-? code (Possible NWIP)

[September 20, 2004] Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages [Codes pour la représentation de noms de langues — Partie 3: Code alpha-3 pour un traitement exhaustif des langues]. Prepared by Technical Committee ISO/TC 37, Terminology and other language resources, Subcommittee SC 2, Terminography and lexicography. Draft working document: "an ISO International Standard; it is distributed for review and comment; it is subject to change without notice and may not be referred to as an International Standard." References: ISO TC 37/SC 2 xxx. Date: 2004-09-20. ISO/DIS 639-3.5. ISO TC 37/SC 2/WG 1. From the Scope statement: "This part of ISO 639 provides a code consisting of language code elements comprising three-letter language identifiers for the representation of languages. The language identifiers according to this part of ISO 639 were devised for use in a wide range of applications, especially in computer systems, where there is potential need to support a large number of the languages that are known to have ever existed. Whereas ISO 639-1 and ISO 639-2 are intended to focus on the major languages of the world that are most frequently represented in the total body of the world's literature, this part of ISO 639 attempts to provide as complete an enumeration of languages as possible, including living, extinct, ancient and constructed languages, whether major or minor. As a result, this part of ISO 639 lists a very large number of lesser-known languages. Languages designed exclusively for machine use, such as computer-programming languages, and reconstructed languages are not included in this code. Knowledge of the world's languages at any given time is never complete or perfect. 
Additional language identifiers may be created for this list when it becomes apparent that there is a linguistic variety that is deemed to be distinct from other languages in accordance with the definitions in clause 3 and their elaboration in clause 4. In addition, the denotation of existing identifiers may be revised or identifiers may become deprecated when it becomes apparent that they do not accurately reflect actual language distinctions. In all such changes, careful consideration is given to ensure existing implementations are not adversely affected..." Note also from the Introduction: "The three-letter codes in ISO 639-2 and ISO 639-3 are complementary and compatible. The two codes have been devised for different purposes. The set of individual languages listed in ISO 639-2 is a subset of those listed in ISO 639-3. The codes differ in that ISO 639-2 includes code elements representing some individual languages and also collections of languages, while ISO 639-3 includes code elements for all known individual languages but not for collections of languages. Overall, the set of individual languages listed in ISO 639-3 is much larger than the set of individual languages listed in ISO 639-2..." Source reference: posting of 14-December-2004 by Peter Constable to the IETF-languages mailing list [[email protected]] and to '[email protected]'. "Yesterday, I wrote mentioning the DIS ballot for ISO 639-3. Someone asked me offline whether the draft could be obtained somewhere publicly. Unfortunately, ISO TC 37/SC 2 doesn't have a public document repository. However, I did post a draft online on SIL's site: Look for the link to the document at the bottom of the page to ISO_DIS_639-3.5 [Draft 5 of ISO 639-3]. This copy contains the complete draft code tables (one sorted by ID and one by name), but not the French translation. I mentioned this link back in August; that was prior to the TC 37/SC 2/WG 1 meeting, and some changes made to the draft after the meeting..." 
See the posting and the follow-up comment from Håvard Hjulstad: "The code table of the FINAL 639-3 will be made freely available..."

Update 2002-02-27: "Codes for the representation of names of languages -- Part 1: Alpha-2 code. [Codes pour la représentation des noms de langue -- Partie 1:Code alpha-2.]." From ISO/TC 37/SC 2 (Secretariat: SCC). International Standard ISO/FDIS 639-1. Reference: ISO/FDIS 639-1:2002(E/F). Final Draft. 48 pages. Voting begins on 2002-02-28. Voting terminates on 2002-04-28.

ISO 639:1988. ISO 639:1988 was published as the successor to and technical revision of ISO/R 639:1967 Symbols for languages, countries and authorities, withdrawn by ISO TC 37 on 1988-03-01. ISO 639, now sometimes referenced proleptically as 639-1 to distinguish it from ISO 639-2, was produced by the "Terminology (principles and co-ordination)" Technical Committee 37 of the International Organization for Standardization (ISO). ISO/TC 37 began to operate in 1952, and was chartered for "standardization of methods for creating, compiling and co-ordinating terminologies. The objective of ISO/TC 37 was to prepare standards specifying principles and methods for terminology work and terminography within the framework of standardization and related activities. Its technical work results in International Standards and Technical Reports covering terminological principles and methods as well as various aspects of computer-assisted terminography." More specifically, ISO 639:1988 was produced within Subcommittee 2, "ISO/TC 37/SC 2 Layout of vocabularies," which was tasked "to prepare International Standards concerning terminology work, preparation and layout of terminology standards, coding and codes in the field of terminology, translation-oriented terminography, and terminology management." The ISO 639 standard Code for the representation of names of languages as published in 1988 was said to be "devised primarily for use in terminology, lexicography and linguistics, but they may be used for any application requiring the expression of languages in coded form."

Bibliographic reference: ISO 639:1988 (E/F). Code for the Representation of Names of Languages. First edition, 1988-04-01. Reference number: ISO 639:1988 (E/F). Geneva: International Organization for Standardization, 1988. iii + 17 pages. ISO 639:1988 Code for the representation of names of Languages is under revision to become ISO 639-1. [Current 2001-08] ISO/DIS 639-1 Code for the Representation of Names of Languages - Part 1: Alpha-2 Code is thus a revision of ISO 639:1988.

ISO 639-1. The revision of ISO 639:1988 is ISO 639-1 Code for the representation of names of languages - Part 1: Alpha 2 code / Code pour la représentation des noms de langue - Partie 1: Code alpha-2. ISO/FDIS 639-1:2001 was sent to the ISO Central Secretariat in late June, 2001. ISO 639-1 "consists of language code elements comprising two-letter language identifiers and the respective names of languages represented by these identifiers. The language identifiers according to this standard were devised originally for use in terminology, lexicography and linguistics, but may be adopted for any application requiring the expression of language in two-letter coded form, especially in computerized systems. The alpha-2 code was devised for practical use for most of the major languages of the world that are not only most frequently represented in the total body of the world's literature, but which also comprise a considerable volume of specialized languages and terminologies. Additional language identifiers are created when it becomes apparent that a significant body of documentation written in specialized languages and terminologies exists in a language, for which an alpha-2 identifier is needed, but does not exist yet... Languages designed exclusively for machine use, such as computer programming languages, are not included in this code." The ISO 639-1 Project Leader is Mr. Håvard Hjulstad (RTT - Rådet for teknisk terminologi, Norway). Note 2002-02-27: Balloting on the revised specification ends on 2002-04-28 [HH].

ISO 639-1 Registration Authority. Infoterm [International Information Centre for Terminology] "has been designated the ISO 639-1/RA (Registration Authority) for the purpose of maintaining a register of 2-letter coded names of languages comprised in the International Standard ISO 639-1, Code for the representation of names of languages - Part 1: Alpha 2 code / Code pour la représentation des noms de langues - Partie 1: Code alpha-2. The ISO 639-1/RA receives and reviews applications for the registration of new and for the change of existing language identifiers. See the Criteria for requesting new language codes. The development of the list is carried out by the ISO 639 Joint Advisory Committee (ISO 639/RAs-JAC) in cooperation with the Library of Congress, which functions as the Registration Authority for ISO 639-2, Code for the representation of names of languages - Part 2: Alpha-3 code / Codes pour la représentation des noms de langue - Partie 2: Code alpha-3 (ISO 639-2/RA). A list of information associated with registered language identifiers and updates of registered language codes is maintained by Håvard Hjulstad, Convener of ISO/TC 37/SC 2/WG 1, 'Coding systems'."

ISO 639-1 Registration Authority Contact: International Information Centre for Terminology (Infoterm), Heinerstr. 38, P.O. Box 130, A-1021 Vienna, Austria. Phone: +43-1-74040-442 or 441; Telefax: +43-1-74040-444; Email: [email protected]; WWW: http://www.infoterm.org. The SC 2 "Layout of vocabularies" - ISO/TC 37/SC 2 Secretariat may be reached through Ms. Helen Hutcheson [Secretary], Terminology and Standardization, Directorate/Translation Bureau, Public Works and Government Services, Canada; Phone: +1-819-994-5934; Telefax: +1-819-953-9691.

ISO 639 online references:

ISO 639 Codes. Sorted by language name, language code, and language family.
Technical contents of ISO 639:1988 (E/F). "Code for the representation of names of languages." Provided in HTML format by Michael Everson 1999-10-08 or later. Includes additions from ISO 639/RA Newsletter No. 1/1989, and from a decision of the Advisory Committee of ISO/TC37 on 1997-08-27 in Copenhagen. Codes also in French and Gaeilge.
Technical contents of ISO 639:1988 (E/F). Prepared by Keld Simonsen.
Early FDIS listing for ISO 639-1

ISO 639-2:1998 International Standard ISO 639-2 was prepared jointly by Technical Committees ISO/TC 37, Terminology (principles and coordination), Subcommittee SC 2, Layout of vocabularies and ISO/TC 46, Information and documentation, Subcommittee SC 4, Computer applications in information and documentation. Technical Committee 46/Subcommittee 4 (TC46/SC4) is the International Organization for Standardization (ISO) Subcommittee "responsible for technical standards used to facilitate interoperability of information services such as libraries, information centers, indexing and abstracting services, archives, and publishers. These technical standards include standards for information retrieval and interlibrary loan, applications of SGML, data elements directories, data formats, character sets, codes and user commands."

ISO 639-2:1998 has about 400 language codes, depending upon how one counts the "collective" codes that partially duplicate some individual codes. The most noticeable feature is the bibliographic and terminological variants, as explained below; the variants evidently represent the differing needs of the terminologists (TC 37, ISO 639-2/T) and bibliographers (TC 46, ISO 639-2/B).

On [re-]interpretation of B/T variants as "synonyms": Note from Rebecca S. Guenther (Chair, ISO 639 Joint Advisory Committee) on November 09, 2001 relative to the work done in the "10+ years of development of a 3-character code in having alternative codes... The twenty-one (21) alternative codes were necessary to satisfy both constituencies and to deal with the issue of millions of existing records using already established codes. It is always difficult to satisfy everyone in the development of ISO 639, but we are making a valiant attempt. More to the point is that the existing ISO 639-2 list (and ISO 639-1) has been developed for use with written languages, and accommodating variations in spoken languages is a matter for further discussion because of the now broader use of the list... A recent message (I think from John Clews) made a comparison between various codes and listed ISO 639-2/B and ISO 639-2/T codes separately. In discussions at the ISO 639 registration authorities meeting in conjunction with TC37 in August [2001], we all agreed that the few cases where there are alternative codes (only 21 out of 450+) should be considered synonyms, rather than different code sets. Thus the distinction between the bibliographic and terminologic is essentially unimportant. We have since updated the code lists on our Web site to discontinue the use of separate columns for the /B and /T, but rather to list them as synonyms."

From the 'Normative Text' of ISO 639-2:1998:

ISO 639-2 "provides two sets of three-letter alphabetic codes for the representation of names of languages, one for terminology applications and the other for bibliographic applications. The code sets are the same except for twenty-three languages that have variant language codes because of the criteria used for formulating them. The language codes were devised originally for use by libraries, information services, and publishers to indicate language in the exchange of information, especially in computerized systems. These codes have been widely used in the library community and may be adopted for any application requiring the expression of language in coded form by terminologists and lexicographers. The alpha-2 code set was devised for practical use for most of the major languages of the world that are most frequently represented in the total body of the world's literature. Additional language codes are created when it becomes apparent that a significant body of literature in a particular language exists. Languages designed exclusively for machine use, such as computer programming languages, are not included in this code."

Form of the language codes: "The language codes consist of three Latin-alphabet characters in lowercase. No diacritical marks or modified characters are used. Implementors should be aware that these codes are not intended to be an abbreviation for the language, but to serve as a device to identify a given language or group of languages. The language codes are derived from the language name. Two code sets are provided, one for bibliographic applications (ISO 639-2/B), and one for terminology applications (ISO 639-2/T). Criteria for selecting the form of a language code for code set B were: (1) preference of the countries using the language; (2) established usage of codes in national and international bibliographic databases, and; (3) the vernacular or English form of the language. Code set T was based on: (1) the vernacular form of the language, or (2) preference of the countries using the language. There are twenty-three language names that have variant codes assigned depending on the code set chosen..."
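The form rule and the B/T variants described above can be sketched in a few lines of Python. This is only an illustration, not part of ISO 639-2: the function names are hypothetical, and the table holds a small, well-known subset of the variant pairs rather than the full list maintained by the Registration Authority.

```python
import re

# Illustrative subset only: a few ISO 639-2 pairs where the bibliographic (B)
# code differs from the terminology (T) code. The authoritative list is the
# one published by the ISO 639-2 Registration Authority.
B_TO_T = {
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
    "chi": "zho",  # Chinese
    "gre": "ell",  # Greek, Modern
    "wel": "cym",  # Welsh
}

# Three lowercase Latin letters, no diacritics or modified characters.
CODE_FORM = re.compile(r"[a-z]{3}")


def is_well_formed(code: str) -> bool:
    """Return True if the string has the shape of an ISO 639-2 code."""
    return CODE_FORM.fullmatch(code) is not None


def to_terminology(code: str) -> str:
    """Map a B code to its T synonym; all other codes pass through unchanged."""
    if not is_well_formed(code):
        raise ValueError(f"not a three-letter lowercase code: {code!r}")
    return B_TO_T.get(code, code)


def same_language(a: str, b: str) -> bool:
    """Treat B and T variants as synonyms when comparing two codes."""
    return to_terminology(a) == to_terminology(b)
```

Note that `is_well_formed` checks only the shape of a code, not whether it is actually assigned; treating the variants as synonyms follows the Joint Advisory Committee position quoted earlier.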

The US Library of Congress "has been designated the ISO 639-2 Registration Authority for the purpose of processing requests for alpha-3 language codes comprising the International Standard, Codes for the representation of names of languages -- Part 2: alpha-3 code. The ISO 639-2/RA receives and reviews applications for requesting new language codes and for the change of existing ones according to criteria indicated in the standard. It maintains an accurate list of information associated with registered language codes, processes updates of registered language codes, and distributes them on a regular basis to subscribers and other parties."

The LOC web site provides ISO 639 code lists. Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code. With introductions and normative text. The normative text provides the ISO 639-2 Language Codes (with corresponding ISO 639-1 code) arranged alphabetically by:

Other LOC ISO 639/RA resources:

Development of the ISO 639-2 Standard
ISO 639-2 Annex A, "Procedures for the Registration Authority and Registration Authorities Advisory Committee ISO 639" [normative]
ISO 639-2/Registration Authority Change Notice. "ISO 639: Codes for the Representation of Names of Languages. Additions/Changes to ISO 639 Codes as published in: ISO 639-1: 1988 (Alpha-2 code) and ISO 639-2: 1998 (Alpha-3 code). All changes are displayed in color and italics; deprecated codes are indicated by a hyphen preceding the code. This document is continually updated."
ISO 639-2 Registration Authority Home
ISO 639-2 Registration Authority Home Page
ISO 639 Request Form: Request for new language code
ISO 639 change request - "request an addition or change to an ISO 639 language name..."
Criteria for requesting new language codes for ISO 639-2
Criteria for requesting new language codes for ISO 639-1 [ISO 639 Joint Advisory Committee rule]

ISO 639 Joint Advisory Committee. The ISO 639 Joint Advisory Committee (ISO 639/JAC) has been established [from ISO TC37/SC2 and ISO TC46/SC4] to advise both the ISO 639-1/RA Registration Authority and the ISO 639-2/RA Registration Authority on the two parts of the International Standard on language codes, Codes for the representation of names of languages -- Part 1: alpha-2 code and Codes for the representation of names of languages -- Part 2: alpha-3 code. The ISO 639/JAC guides the application of the coding rules as laid down in both standards. If you have questions concerning ISO 639 please contact us at: Library of Congress; Network Development and MARC Standards Office; Washington, DC 20540-4402; Email: [email protected]; Phone: +1 202 707 6237; FAX: +1 202 707 0115.

ISO 639 Joint Advisory Committee: Rules of procedure for conducting business. Document ISO 639/JAC N2R. 10-March-2000. "The following documents rules of procedure for the conduct of meetings and email business by the ISO 639 Joint Advisory Committee. It repeats some information that is in ISO 639-2:1998 in the normative Annex A and elaborates where necessary for clarification of procedures. In particular it details how business is run in the absence of regular meetings... ISO 639/JAC is composed of: (1) one representative of the International Information Centre for Terminology (Infoterm; representing ISO 639-1/RA); (2) one representative of the Library of Congress (LC; representing ISO 639-2/RA); (3) three representatives of ISO/TC 37 (nominated by ISO/TC 37); (4) three representatives of ISO/TC 46 (nominated by ISO/TC46). Both ISO/TCs may nominate substitute representatives for a meeting..."

ISO 639 Joint Advisory Committee: Working principles for ISO 639 maintenance. Document ISO 639/JAC N3R. 8-March-2000. "The following documents working principles for the maintenance of language codes by the ISO 639 Joint Advisory Committee both in ISO 639-1 (Alpha-2 code) and ISO 639-2 (Alpha-3 code). It repeats some information that is in ISO 639-2:1998 in section 4 (Language codes) and the normative Annex A. In addition, it gives further details as to how language code changes that are submitted are considered and how the two parts of ISO 639 are related."

Historical [earlier than 2000-04-28; some links may be broken]:

ISO 639/2 (1992 Draft) codes for names of languages
ISO 639:1988 Names for languages provisionally supplied here. Use this with care in light of details in the bibliographic entry.
A different compilation from ISO 639:1988 language codes
Some problems with ISO 639. See section 4.4 in Electronic Commerce and Cultural and Linguistic Adaptability: Practical Examples and Horizontal Issues From the Canadian National Body. ISO/IEC JTC 1 N 5626, 1998-12-07. "Serious reflection and more systematic thinking is required here with respect to 'tags' especially if one wishes to use SGML / HTML / XML generally and in electronic commerce specifically as well as ensuring interoperability not only with the use of other syntaxes but among various consumer markets, industry sectors, etc. Even more important is the need to develop a systematic and unambiguous interworking in an IT-enabled manner among language code (ISO 639), currency and fund codes (ISO 4217) and country codes (ISO 3166-1)." Jake Knoppers. [local archive copy]
The MARC 3-character language codes, from about 1991. Alternately: a MARC list "revised 7/19/93", [mirror copy]. See now the update to USMARC Code List for Languages from November 15, 1996; [mirror copy].
Compare language codes of ANSI/NISO Z39.53-1994
Technical contents of ISO 639:1988 (E/F) with corrections 1989, 1997; [local archive copy]
"Technical contents of ISO 639-2:1996 - Part 2: Alpha-3 code." [Source? DIS?] [local archive copy]
RFC 1766 Language code assignments - By Michael Everson. "As language-tag reviewer for RFC 1766, I have made and intend to maintain the following table to help users access the codes and information on them. Clicking on the name of the code itself will open the registration document from the IANA website..."
[April 27, 2000] Q: "Has anybody noticed that XML 1.0 requires 2-letter and forbids three-letter language codes?" Martin J. Duerst replied "The W3C I18N WG/IG, together with the W3C XML Core WG, are aware of the situation and are working on the relevant errata for this. Our current plans are to make some fixes (to make clear that XML parsers should not [be required to] enforce Production 35, in order to be forward-compatible) as soon as time permits, and the rest (updating from RFC 1766 to its successor) as soon as that is out..." [April 27, 2000, XML-DEV]
[April 24, 1998] xml:lang resources for parser writers. Contributed by Murray Altheim (Sun Microsystems, SunSoft). Currently there are XML versions of files providing the ISO 639 (language), ISO 3166 (country), and IANA charset values required for support of the xml:lang attribute values in XML 1.0. Please direct comments and corrections to Murray Altheim. These 'xml:lang resources' will be made available later on Sunsite.
Language Codes from Ethnologue: Languages of the World, 13th Edition. View with an ISO Latin-1 character set font. [local archive copy]

Linguasphere Project

Developed and maintained for many years by David Dalby, Linguasphere "provides a very detailed listing of all the world's languages and many of its dialects, using a new system of classification." Dalby's classification puts the emphasis on verifiable immediate relationships rather than on distant and often hypothetical ones. It features a stable framework of 100 referential zones, each identified by two leading digits, successively refined into six layers of relationship, coded alphabetically, broadly reflecting proportions of shared basic vocabulary. Language/dialect identifiers are constructed as alpha-numeric codes, offering a coded classification of language groups ('sets, chains, nets') and idioms ('outer languages, inner languages, and dialects'). Linguasphere is comparable in some respects to SIL's Ethnologue project; both provide a system of language identifiers with codes for almost all the living languages of the world. [adapted from a book review by Philip Baker]

Web site description:

"The Linguasphere Register of the World's Languages and Speech Communities is the first attempt at a comprehensive and transnational classification of the modern languages and dialects of the world -- and of the communities of humankind. Compiled over several decades by David Dalby (Linguasphere Observatory, London School of Oriental and African Studies and University of Wales, Cardiff), the Register classifies all known languages and dialects on the basis of their closest linguistic relationships, and includes a theoretical and practical discussion and presentation of the linguasphere. A complete index of linguistic and ethnolinguistic names has been prepared by Michael Mann (School of Oriental and African Studies)."

Including over 21,000 'inner languages' and dialects, and a classified index of over 70,000 linguistic and ethnic names, the Register was compiled for the Observatoire Linguistique, an independent non-profit research network created in Quebec in 1983, established in France and coordinated from Wales.

The Linguasphere is composed of two interlocking and evolving strata of human conventions: (1) the total lexical repertoire of humankind, made up of the overlapping and shifting repertoires of all spoken and recorded languages, (2) the global distribution of the overlapping and shifting phonological and grammatical patterns which serve to structure those repertoires. The Register provides a referential framework for monitoring the future welfare of individual communities and their languages, across all national frontiers. Work in the SOAS Departments of Geography and Africa has created an interface between the classificational system of the Register and the language map of Africa, the most complex of all continents. This GIS work, funded by Leverhulme, has been the first step towards creating a global Linguasphere Mapbase, to be compiled and consultable online as a cartographic extension of the Register.

The work of the Observatoire is presented through complementary websites. The main site is www.linguasphere.org, a forum for information on the current development of the world's languages, available in English and partially in French, with further options planned in Chinese and other languages. The Linguasphere Press has a parallel website www.linguasphere.net through which both the Linguasphere Register and the licensed version of Linguasphere Online may now be obtained.

E-MELD Language Codes Workgroup

For general information, see the main topic page: "Electronic Metastructure for Endangered Languages Data (EMELD)."

EMELD Language Lookup Pages:

Ancient and extinct languages
Constructed languages
All languages [look up a language, family, or language code; searches both Ethnologue and LINGUIST databases]

An E-MELD Language Codes mailing list was set up in July 2001 to support the Electronic Metastructure for Endangered Languages Data Project: '[email protected]'. The E-MELD Project Family of Lists includes the E-MELD Superlist, E-MELD Sublist on Language Codes, and E-MELD Sublist on Language Markup. See the Linguist web site archives.

See the announcement of November 05, 2001 for an initial version of a language database facility which allows one to look up a language, family, or language code; to show all ancient languages in the LINGUIST Database; and to show all constructed languages in the LINGUIST Database.

E-MELD: "To combat the decrease in the number and diversity of languages and to capitalize on a growing store of digitized linguistic data, a team of National Science Foundation (NSF)-funded researchers led by Anthony Aristar at Wayne State University is developing an endangered languages database and a central information server that will allow users to access the material remotely by computer. A $2 million NSF grant to Aristar and his colleagues at Eastern Michigan University, the University of Pennsylvania and the University of Arizona will be used to create this public digital archive. The goals of the Electronic Metastructure for Endangered Languages Data (E-MELD) project are to collect data on endangered languages and to devise a Web-based protocol so that new and existing data will be accessible to researchers and native speakers everywhere..." [From the NSF report of Peter West, cited below. Similarly: "[Wayne State] University receives $2 million grant for endangered languages."]

Input to the E-MELD Language Codes List is recorded in Gene Gragg's "Report on Language Codes Workgroup Recommendations" from the Linguist List workshop. See details in the summary.

References:

"E-MELD: Electronic Metastructure for Endangered Languages Data." By [Anthony Aristar]. Paper prepared as reading for a Linguist List 'Language Digitization Workshop'. [cache]
"E-MELD: Electronic Metastructure for Endangered Languages Data." By Anthony Aristar and Helen Aristar Dry. Paper presented at the workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA. [cache]
"New Database to Save Endangered Languages." E-MELD Project. From NSF Tipsheet. News Media Tip - July 9, 2001. Edited by Peter West.
E-MELD Mailing list for Language Codes Group
See also: E-MELD Mailing list for Language Markup Group
Contact: Anthony Manuel Rodrigues Aristar

Linguist List Genetic Classification Coding Scheme

The Linguist List community "provides a forum where academic linguists can discuss linguistic issues and exchange linguistic information." From the Linguist List, Language Classification Working Group: A Proposed LINGUIST Coding Scheme for Genetic Classification. "One proposed LINGUIST coding scheme for language classification works as follows. Each language family is assigned a 2-letter code. Then, subgroups which attach directly to the highest node are assigned a letter of the alphabet, in alphabetical order, which is appended to the family code. Any subgroups which belong to a lower subgroup are assigned codes in the same way, until all subgroups have been assigned codes. For example, the code for Altaic is AT. The Mongolian node is the first in alphabetical order, and is assigned the letter A. Its code is thus ATA. Tungus is the next in order, and is assigned the letter B, and its code is thus ATB. Turkic is next, and is assigned the code ATC. The subgroups under each of these nodes are assigned codes in the same way. Mongolian has two subgroups, Eastern and Western. Eastern is assigned the code ATAA, Western the code ATAB. Tungus has two groups, Northern and Southern. Northern is assigned ATBA, and Southern ATBB. Each subgroup is assigned a code in the same way. To see the result for some actual data, select one of the families in our database from the form below [see the online document]... Each language is now categorized by two codes: its own unique code (in this case its Ethnologue code) and the code of its immediate subgroup."
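The assignment procedure quoted above is a simple recursive walk over the classification tree. The following Python sketch reproduces the Altaic example from the quotation; the function name, the nested-dictionary representation of the tree, and the path-style keys are assumptions made for illustration and do not reflect how the LINGUIST database is actually implemented.

```python
from string import ascii_uppercase


def assign_codes(path, subtree, code, out=None):
    """Assign `code` to the node at `path`, then give each subgroup, taken in
    alphabetical order, the parent's code plus the next letter A, B, C, ..."""
    if out is None:
        out = {}
    out[path] = code
    if len(subtree) > len(ascii_uppercase):
        # The quoted scheme appends one letter per level, so this sketch
        # cannot code more than 26 subgroups under a single node.
        raise ValueError(f"too many subgroups under {path}")
    for letter, child in zip(ascii_uppercase, sorted(subtree)):
        assign_codes(f"{path}/{child}", subtree[child], code + letter, out)
    return out


# The Altaic family as described in the quoted proposal (family code AT).
ALTAIC = {
    "Mongolian": {"Eastern": {}, "Western": {}},
    "Tungus": {"Northern": {}, "Southern": {}},
    "Turkic": {},
}

codes = assign_codes("Altaic", ALTAIC, "AT")
for path, code in sorted(codes.items(), key=lambda item: item[1]):
    print(code, path)
```

Because each code extends its parent's code, a prefix test such as `code.startswith("ATB")` selects a subgroup together with everything classified beneath it.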

Examples [snapshot 2001-08]:

MARC Code List for Languages

MARC Code List for Languages. Web Version of the 2000 Edition. Description 2001-08-27: "This document contains a list of languages and their associated three-character alphabetic codes. The purpose of this list is to allow the designation of the language or languages in MARC records. The list contains 437 discrete codes, of which 54 are used for groups of languages... The list includes all valid codes and code assignments as of February 2000. This revised edition includes numerous changes necessary for compatibility with the newly approved ISO 639-2 in 1998 (Codes for the Representation of Names of Languages Part 2: Alpha-3 Code). There are 26 code changes, 1 deletion, and 35 additions in this revision. The language codes are three-character lowercase alphabetic strings usually based on the first three letters of the English form or, in some cases, vernacular of the corresponding language name. The codes are varied where necessary to resolve conflicts. In the case of modern and older forms of some languages, the initial letters of each part of the language name are used to form the code, e.g., gmh for German, Middle High, and goh for German, Old High. When the name of a language is changed in the list, the original code is generally retained. The list includes individual codes for most of the major languages of the modern and ancient world, e.g., Arabic, Chinese, English, Hindi, Latin, Tagalog, etc. These are the languages that are most frequently represented in the total body of the world's literature. Additional codes for individual languages are created from time to time when it becomes apparent that a significant body of literature in a particular language already exists, or when it is determined that the amount of material in a language is growing. Usually only one code is provided for a given language, even if that language can be written in more than one set of characters. 
In a few cases however, separate codes are provided for the same spoken language written in different characters... In addition to codes for individual languages, the list also contains a number of codes for language groups. While some individual languages are given their own unique code, although linguistically they are part of a language group, many individual languages are assigned a group code, because it is not considered practical to establish a separate code for each." See also notes on the Code sequence, Changes in 2000 edition, Changes since 2000 edition, and ASCII version.
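The naming convention described in the MARC introduction can be illustrated with a small Python heuristic. It is a sketch only: the function name is hypothetical, and because the published list varies codes "where necessary to resolve conflicts," the result is at best a candidate; the MARC list itself remains the authority.

```python
def candidate_code(name: str) -> str:
    """Guess a MARC-style code from an English language name.

    Names with three or more parts (e.g. "German, Middle High") use the
    initial letter of each of the first three parts; other names use the
    first three letters of the first word.
    """
    words = name.replace(",", " ").split()
    if not words:
        raise ValueError("empty language name")
    if len(words) >= 3:
        letters = "".join(word[0] for word in words[:3])
    else:
        letters = words[0][:3]
    return letters.lower()


for language in ("English", "Latin", "Hindi",
                 "German, Middle High", "German, Old High"):
    print(f"{language}: {candidate_code(language)}")
```

For the two inverted-form names the heuristic yields gmh and goh, the codes cited in the description above.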

Published Subjects for Geography and Languages TC

[February 02, 2002] OASIS Technical Committee to Define Published Subjects for Geography and Languages. A proposal has been submitted to OASIS for the creation of a new technical committee on 'Published Subjects for Geography and Languages'. The TC will define sets of published subjects "for language, country, and region subjects, in accordance with the guidelines for published subjects to be laid down by the OASIS Published Subjects TC. Languages, countries, and regions are subjects that occur frequently across a wide range of topic maps. In order to promote maximum reusability, interchangeability and mergeability, standardised sets of published subjects are required to cover these domains. Two such PSI sets (for country and language) were published as part of the XML Topic Map 1.0 Specification; the task of this TC will be to update and extend those PSI sets using existing code sets defined by recognised standards bodies such as the ISO and the UN." Published subjects will be created for languages according to ISO 639 and USMARC codes; published subjects for countries and regions will be based upon ISO 3166; PSI sets for countries, regions, and geographic areas will also be created for USMARC codes; another set of published subjects for regions will be based upon the UNSD Standard Country or Area Codes. Published subjects are a form of controlled vocabulary allowing "unambiguous indication of the identity of a subject"; they are defined in the ISO 13250 Topic Maps standard and further refined in the XML Topic Maps (XTM) 1.0 Specification. [Full context]

Use of Standard Code Lists

SGML (Standard Generalized Markup Language)

The SGML standard (ISO 8879:SGML, page 36, section 10.2.2.3) references ISO 639 language codes in connection with the specification for "public text language" used in a "text identifier." A text identifier (Section 10.2.2, production 84) is most commonly encountered within a "formal public identifier" (production 79). An FPI might occur in a document type declaration, external entity specification, notation identifier, or link type declaration. The ISO standard [10.2.2.3, production 88 for public text language] says that the "public text language must be a name [viz., an SGML 'name' per production 55], entered with upper-case letters. The name should be the two-character language code from ISO 639 that defines the principal natural language used in the public text. Notes: (1) The natural language will affect the usability of some public text classes more than others; (2) The portions of text most likely to be influenced by a natural language include the data, defined names, and comments; (3) A system can use the public text language to facilitate automatic language translation."

Since most SGML applications anticipate the use of language tagging at the level of the element (frequently not at the level of the entity, where an FPI would be used), ISO 639 language codes are often used in SGML DTDs within attribute definition lists. Elements which require language tagging are given a special attribute such as lang or language, declared as NMTOKEN, IDREF, or CDATA, with the requirement that the attribute value be drawn from the list of ISO 639 language codes. Using IDREF or an enumerated list [name token group] allows the SGML parser to validate the nomination of an authorized language code in the attribute value.

XML (Extensible Markup Language)

The W3C XML 1.0 Recommendation (second edition) references language codes in two places: (1) Section 1.1 reads: "This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it." (2) More specific mention occurs in Section 2.12 'Language Identification': "In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by IETF RFC 3066, 'Tags for the Identification of Languages', or its successor on the IETF Standards Track. [Note: IETF RFC 3066 tags are constructed from two-letter language codes as defined by ISO 639 [International Organization for Standardization. ISO 639:1988 (E). Code for the representation of names of languages. Geneva: International Organization for Standardization, 1988.], from two-letter country codes as defined by ISO 3166, or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES - Internet Assigned Numbers Authority, Registry of Language Tags, ed. Keld Simonsen et al.; see http://www.isi.edu/in-notes/iana/assignments/languages/ ]... The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content."
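The inheritance rule quoted above (an xml:lang value applies to an element and everything inside it until another xml:lang overrides it) can be illustrated with a short sketch. This is only one way to compute the effective language per element, using Python's standard xml.etree.ElementTree; the function name effective_languages and the sample document are hypothetical and are not taken from the XML specification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element name, effective language) pairs in document order."""
    # An xml:lang on this element overrides the inherited value;
    # otherwise the value of the enclosing element carries over.
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample: English document with one German quotation.
doc = ET.fromstring(
    '<doc xml:lang="en">'
    '<p>He answered <q xml:lang="de">bei mir</q> and left.</p>'
    '<note>No declaration on this element.</note>'
    '</doc>'
)

for name, lang in effective_languages(doc):
    print(name, lang)
```

Note that the sketch reports one language per element only; it does not distinguish element content from attribute values, which is exactly the granularity questioned in the evaluation further on.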

Note the XML 1.0 second edition amendment characterized as substantive: E11 in "XML 1.0 Second Edition Specification Errata" records that the next-to-last paragraph in Section 1.1 was amended to read "RFC 3066" in place of "RFC 1766". Similarly, other references throughout were changed to read [at the level of surface text] "RFC 3066" -- since RFC 3066 updates and obsoletes RFC 1766. The same errata listing says to "Remove the last sentence of the Note [in Section 2.12, which read]: 'It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].'" However, no change was made in the second edition of XML 1.0 to explicitly allow for three-letter codes as values for xml:lang, even though RFC 3066 allows the composition of a language tag using the 3-letter codes from ISO 639 part 2, "Codes for the representation of names of languages -- Part 2: Alpha-3 code." It appears that the intent was to allow the 3-letter codes.

For historical purposes, note the E73 [Substantive] listing from the XML 1.0 Specification Errata. E73 "Obsoletes E31, E60, and part of E38."

"Section 2.12: Change the last sentence of the first paragraph to: <corr>The values of the attribute are language identifiers as defined by [IETF RFC 1766], "Tags for the Identification of Languages" or its successor on the IETF Standards Track.</corr> Replace productions [33] to [38] and all the following text, down to but excluding the sentence "For example" just before the examples, with the following: <corr>Note: RFC 1766 tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166] or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES]. It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639].</corr> Rationale: The XML processor does not deal with the value of xml:lang, it just passes it on to the application. Checking its correctness at this level has no benefit and hurts with updates to RFC1766 (forthcoming). The spec must still impose the semantics of xml:lang by pointing to RFC 1766."

Evaluation of xml:lang:

The implied inheritance of a language property (viz., the xml:lang value) by subelements in the instance hierarchy may be considered a very useful feature. However, the prescribed semantic "is considered to apply to all attributes and content of the element where it is specified" may be regarded (arguably) as suboptimal for tagging multilingual text, or even for annotating a text in a single language "foreign" to the markup specialist. In some settings, xml:lang may simply be unusable if the semantic prescription of the XML 1.0 specification is to be honored. Details follow.

Section 2.12 describes the use and meaning of xml:lang as follows:

... to specify the language used in the contents and attribute values of any element...

...The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.

DTD authors will naturally want to design markup constructs (e.g., element type names, attribute names, attribute value name-tokens in an enumerated attribute type) for their users in terms of the users' native language. That is: users want markup labels (XML "names") to be in their first language. Even more critically: if users are required to supply a short phrase-level descriptor as CDATA content for an attribute, they naturally want to think and write in their own language. The XML specification seems not to allow this in cases where the element content is declared to be in some other language. The phrase "all attributes and content" seems to require that a global language assertion would be made by the use of xml:lang in any element.

Example #1: The TEI (P4) DTD defines a <q> element for quoted speech; this element has two CDATA attributes ('who' and 'type') as well as an enumerated-type attribute 'direct' with attribute type and default value (y | n | unspecified) "unspecified". Using the TEI P4 'lang' attribute (a global IDREF attribute indicating the language, writing system, and character set associated with a given element), the following <q>...</q> encoding would be sensible for an English-speaking student wishing to mark up a German quoted phrase: <q lang="de" who="Hans" type="spoken" direct="unspecified">bei mir</q>. The following would not: <q xml:lang="de" who="Hans" type="spoken" direct="unspecified">bei mir</q>. The prescribed meaning of xml:lang seems to require a declaration that the terms "spoken" and "unspecified" (at least) are in German, as well as "bei mir." This is not a boundary case, as the TEI DTD has dozens or maybe hundreds of CDATA attributes which invite substrings "in" the native language of the encoder, which would conflict with the semantic for xml:lang in a bilingual or multilingual encoding environment. It is unclear how the TEI editors could entertain a proposal to substitute the xml:lang attribute of XML 1.0 for the TEI P4 lang attribute in the P4 XML DTD, given the scope specification for xml:lang.

Example #2: Suppose the DTD is "in English" for the benefit of the English speaking users; assume declarations like <!ELEMENT quotation (#PCDATA) > <!ATTLIST quotation speaker CDATA #REQUIRED type (direct | indirect) "direct" context CDATA #IMPLIED xml:lang NMTOKEN #REQUIRED >. How then might one use this DTD to mark up an English language document containing occasional quotations in Spanish uttered by José, as in <quotation speaker="José" xml:lang="ES" type="indirect" context="whispered to his girlfriend seated at the bar">probablemente</quotation> -- if the markup is to embody an assertion that the text portions "quotation," "speaker," "ES," "type," "indirect," "whispered to his girlfriend seated at the bar," and "probablemente" are all in Spanish.

Data directly affected would seem to include all CDATA [StringType] 'content' in attribute values, all name tokens [Enumeration type] used in attribute values, and all character data [#PCDATA] between the start-tag and end-tag. Depending upon the XML 1.0 meaning of "contents" and "all attributes," the declaration would also apply to all (other) attribute names, to the language of the element type name, entity names, ID/IDREF name, etc.

[Please send email if you think I have misunderstood the meaning of the phrase "all attributes" and "contents" or if you think xml:lang is useful for multilingual documents.]

XHTML (Extensible HyperText Markup Language)

Authoring Techniques for XHTML & HTML Internationalization: Specifying the Language of Content 1.0. Edited by Richard Ishida (W3C). W3C Working Draft. 15-October-2004. Produced by the Guidelines, Education & Outreach Task Force (GEO) of the W3C Internationalization Working Group (I18N WG). Latest version URL: http://www.w3.org/TR/i18n-html-tech-lang/. "Specifying the language of content is useful for a wide number of applications, from linguistically sensitive searching to applying language-specific display properties. In some cases the full application is still awaiting full development, whereas in others, such as detection of language by voice browsers, it is a necessity today. Marking up language meta information is something that can and should be done today. Without it, none of these applications can be taken advantage of. This document is one of a series of documents providing HTML authors with techniques for developing internationalized HTML using XHTML 1.0 or HTML 4.01, supported by CSS1, CSS2 and some aspects of CSS3. It focuses specifically on advice about specifying the language of content..."

"Text processing language: When specifying the text processing language you are declaring the language in which a particular range of text is actually written, so that user agents or applications that manipulate the text, such as voice browsers, spell checkers, or style processors can effectively handle the text in question. So we are, by necessity, talking about associating a single language with a specific range of text. The text processing language is usually best declared using attributes on elements. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, eg. a French word in an English paragraph. The default text processing language is not necessarily the same as metadata about the primary language of a document..."

Tutorial: Using Language Information in XHTML, HTML and CSS (DRAFT). "This tutorial provides advice in the following areas: (1) guidelines for declaring the language of documents and text; (2) how to specify language attribute values; (3) applicability of the language tag to apply language-specific CSS styling; (4) a brief introduction to the concept of server-based language negotiation. The tutorial additionally attempts to provide explanations of the basic concepts needed to understand the advice given..."

Other references:

FAQ: Using HTTP and meta for language information

HTML

See "Language Tagging in HTML and XML." By Martin J. Dürst, W3C i18n Coordinator. Updated 2001/08/30 or later. "Language codes as defined in RFC 3066 can be (and should be) used to indicate the language of text in HTML and XML documents. For HTML 4, language codes are specified with the lang attribute. For XML, language codes are given in the xml:lang attribute. In both cases, language information is inherited along the document hierarchy, i.e., it has to be given only once if the whole document is in one language, and language information nests, i.e., inner attributes overwrite outer attributes... Language codes starting with i- are defined in the IANA registry of language codes. Language codes starting with x- denote experimental codes without guarantee for uniqueness... Many other W3C and Web-related specifications use language codes [for example], (1) XHTML 1.0, reformulating HTML in terms of XML, which advises to use both the HTML lang attribute and the XML xml:lang attribute, with the later taking precedence in case there should be any differences. (2) HTTP uses language codes in the Accept-Language and Content-Language headers. (3) SMIL and SVG can use language codes in the <switch> statement. (4) CSS and XSL use language codes for detailed style control..."

On HTTP, see Hypertext Transfer Protocol -- HTTP/1.1 = IETF RFC 2616. From Section 3.10 'Language Tags': "A language tag identifies a natural language spoken, written, or otherwise conveyed by human beings for communication of information to other human beings. Computer languages are explicitly excluded. HTTP uses language tags within the Accept-Language [14.4] and Content-Language [14.12] fields. The syntax and registry of HTTP language tags is the same as that defined by RFC 1766. In summary, a language tag is composed of 1 or more parts: A primary language tag and a possibly empty series of subtags: language-tag = primary-tag *( "-" subtag ) // primary-tag = 1*8ALPHA // subtag = 1*8ALPHA. White space is not allowed within the tag and all tags are case-insensitive. The name space of language tags is administered by the IANA. Example tags include: en, en-US, en-cockney, i-cherokee, x-pig-latin where any two-letter primary-tag is an ISO-639 language abbreviation and any two-letter initial subtag is an ISO-3166 country code. (The last three tags above are not registered tags; all but the last are examples of tags which could be registered in future.)"
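The grammar summarized in the RFC 2616 excerpt (a primary tag followed by zero or more hyphen-separated subtags, each of one to eight letters, compared without regard to case) translates directly into a small parser. The following is a minimal sketch of that quoted grammar only; the function name parse_language_tag is hypothetical, and the check is purely syntactic: it does not consult the IANA registry or the ISO code lists.

```python
import re

# language-tag = primary-tag *( "-" subtag ), each part 1*8ALPHA,
# following the summary quoted from RFC 2616, Section 3.10.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def parse_language_tag(tag):
    """Return (primary, [subtags]) in lower case, or None if malformed."""
    if not LANGUAGE_TAG.fullmatch(tag):
        return None
    # Tags are case-insensitive, so normalize before splitting.
    primary, *subtags = tag.lower().split("-")
    return primary, subtags


for example in ["en", "en-US", "en-cockney", "i-cherokee", "x-pig-latin", "en US"]:
    print(example, parse_language_tag(example))
```

Lower-casing is just one normalization choice made here for comparison purposes; the excerpt only requires that tags be treated as case-insensitive.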

For the W3C HTML 4.01 Specification [W3C Recommendation 24-December-1999], 'International considerations for text' are presented in Section 8, "Language information and text direction." Excerpts: "Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include: (1) Assisting search engines; (2) Assisting speech synthesizers; (3) Helping a user agent select glyph variants for high quality typography; (4) Helping a user agent choose a set of quotation marks; (5) Helping a user agent make decisions about hyphenation, ligatures, and spacing; (6) Assisting spell checkers and grammar checkers. The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved. The intent of the lang attribute is to allow user agents to render content more meaningfully based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang. The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes. RFC1766 defines and explains the language codes that must be used in HTML documents... An element inherits language code information according to the following order of precedence (highest to lowest): #1: The lang attribute set for the element itself; #2: The closest parent element that has the lang attribute set -- i.e., the lang attribute is inherited; #3: The HTTP 'Content-Language' header, which may be configured in a server. 
Table cells may inherit lang values not from its parent but from the first cell in a span; please consult the section on alignment inheritance for details... In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of 'en-US' is set for the HTML element, a user agent should prefer style information that matches 'en-US' first, then the more general value 'en'."
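The matching preference described at the end of the HTML 4.01 excerpt (favor an exact match, then accept a match on the primary code) can be sketched as follows. The function name pick_style and the idea of a flat list of style-sheet language keys are assumptions made for illustration; the specification does not prescribe any particular data structure or API.

```python
def pick_style(lang, available):
    """Return the key in `available` that best matches `lang`, or None.

    Preference order follows the HTML 4.01 wording: an exact match
    first, then a match on the primary code alone.
    """
    wanted = lang.lower()
    # Language codes are compared case-insensitively; keep original keys.
    by_key = {key.lower(): key for key in available}
    if wanted in by_key:
        return by_key[wanted]
    primary = wanted.split("-")[0]
    return by_key.get(primary)


print(pick_style("en-US", ["en", "en-US", "fr"]))
print(pick_style("en-US", ["en", "fr"]))
print(pick_style("de", ["en", "fr"]))
```

A fuller treatment of the "hierarchy of tokens" idea might drop trailing subtags one at a time; the sketch stops at the two steps the excerpt actually names.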

XHTML 1.0: The Extensible HyperText Markup Language addresses HTML and XHTML compatibility in Section C.7, 'The lang and xml:lang Attributes': "Use both the lang and xml:lang attributes when specifying the language of an element. The value of the xml:lang attribute takes precedence..." HTML used the attribute lang with language tag values constructed according to RFC 1766.

On language support in the Scalable Vector Graphics (SVG) 1.0 Specification [W3C Recommendation], see Section 5.8.5 The systemLanguage attribute and Section 5.8.2 The 'switch' element as part of conditional processing. "SVG contains a 'switch' element along with attributes requiredFeatures, requiredExtensions, and systemLanguage to provide an ability to specify alternate viewing depending on the capabilities of a given user agent or the user's language... The attribute value for the systemLanguage attribute is a comma-separated list of language names as defined in RFC3066..."

TEI (Text Encoding Initiative Guidelines)

The TEI DTD provides a global lang attribute, applicable to all elements in the DTD, which names the language in which the content of an element is written. For example: <p lang="EN">...</p>. The TEI global attribute lang (language) "indicates the language of the element content, usually using a two- or three-letter code from ISO 639. Its datatype is IDREF. The value must be the identifier specified for a writing system declaration declared in the TEI header, as described in section 5. The default is %INHERITED;: If no value is specified for lang, the lang value for the immediately enclosing element is inherited; for this reason, a value should always be specified on the outermost element." A typical declaration is thus <!ATTLIST foo lang IDREF %INHERITED; >.

The language element in the TEI Writing System Declaration has an attribute iso639 providing additional functionality. Check out the TEI WSD in the TEI Guidelines [P4] Chapter 25, "Writing System Declaration": "The writing system declaration or WSD is an auxiliary document which provides information on the methods used to transcribe portions of text in a particular language and script. We use the term writing system to mean a given method of representing a particular language, in a particular script or alphabet; the WSD specifies one method of representing a given writing system in electronic form. A single WSD thus links three distinct objects: (1) the language in question; (2) the writing system (script, alphabet, syllabary) used to write the language; (3) the coded character set, entity names, or transliteration scheme used to represent the graphic characters of the writing system. Different natural languages thus have different writing system declarations, even if they use the same script. Different methods used to write the same language (e.g., Cyrillic or Latin encoding of Serbo-Croatian), and different methods of representing the same script in electronic form (e.g., different coded character sets such as ASCII or EBCDIC, or different transliteration schemes) similarly must use different writing system declarations..." See WSD examples (TEI P3).

The TEI DTD also defines elements <langUsage> and <language>, as described in P4 Section 5.4.2, 'Language Usage'. "The <langUsage> element is used within the <profileDesc> element to describe the languages, sublanguages, registers, dialects, etc. represented within a text. It contains one or more <language> elements, each of which takes attributes specifying the writing system used and the quantity of that language present in the text. Following the <language> elements, prose description may also be added to specify further relevant information..."

The TEI design for lang="" is not entirely happy, however, as reflected in a note from Lou Burnard, Summer 2001: "Every now and then this returns to bite us... In the TEI, the global LANG attribute is supposed to identify both natural language and writing system, using a single code. So 'Hebrew written in Greek characters' gets a unique code, rather than a code for Greek and a code for Hebrew writing system (though the WSD does allow us to say what the ISO639 language is). I believe the rationale for this has something to do with the fact that our LANG attribute in fact identifies a transliteration scheme, which is precisely the union of a particular language and a writing system. But everyone -- not unreasonably -- assumes it means 'language' and so wonders why we think Hebrew stops being Hebrew when you write it in a different writing system. Every now and then people asked why we persist in this folly when there are distinct ISO standards for natural language (ISO 639) and for writing system (ISO 15924). EAD, LOC, and CES all distinguish between them accordingly..." [For details contact Lou Burnard, European TEI Editor]

Note that 'ISO 15924' was sent out for ballot as DIS [Draft International Standard] apparently in October 2000. The closing date for ballots was 2001-01-10. Code for the representation of names of scripts. 'DIS' ISO 15924:2000 (E/F). 2000-05-18. "... provides a code for the presentation of names of scripts. The codes were devised for use in terminology, lexicography, bibliography, and linguistics, but they may be used for any application requiring the expression of scripts in coded form. This standard also includes guidance on the use of script codes in some of these applications... The alphabetic script codes are created from the original script name in the language commonly used for it, transliterated or transcribed into Latin letters. If a country, where the script concerned has the status of a national script, requests a certain script code, preference is given to this code whenever possible. The four-letter codes shall be written with an initial capital Latin letter and final small Latin letters (taken from the range Aaaa to Zzzz). This serves to help differentiate script codes from language codes and country codes: so, for example, Mong mon MON or Mong mn MN would refer to a book in the Mongolian script, in the Mongolian language, originating in Mongolia... The numeric script codes have been assigned to provide some measure of mnemonicity to the codes used..." Note the HTML-style syntax, provided as an example of usage in markup: <META HTTP-EQUIV="Content-Language" CONTENT="ga, ru"> <META NAME="Content-Script" CONTENT="Latg, Cyrl">. See also the ISO 15924 document list. Details: contact Michael Everson, Editor, DIS 15924. [cache]
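The draft's point that letter case helps to tell script codes apart from language and country codes (as in "Mong mon MON") can be shown with a tiny classifier. This is a sketch by shape only: the four-letter, initial-capital form for scripts comes from the excerpt above, while treating lower-case tokens as language codes and upper-case tokens as country codes is an assumption based on the quoted example. The function name classify_code is hypothetical, and no code list is consulted.

```python
import re

SCRIPT = re.compile(r"[A-Z][a-z]{3}")   # e.g. Mong, Cyrl (form given in the draft)
LANGUAGE = re.compile(r"[a-z]{2,3}")    # e.g. mn, mon (assumed lower-case convention)
COUNTRY = re.compile(r"[A-Z]{2,3}")     # e.g. MN, MON (assumed upper-case convention)


def classify_code(token):
    """Guess the kind of code from its letter case and length alone."""
    if SCRIPT.fullmatch(token):
        return "script"
    if LANGUAGE.fullmatch(token):
        return "language"
    if COUNTRY.fullmatch(token):
        return "country"
    return None


for token in "Mong mon MON".split():
    print(token, classify_code(token))
```

A token such as "latn" in all lower case matches none of the three shapes, which shows the limit of relying on case: markup vocabularies that fold case need a dedicated attribute to carry the distinction.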

Encoded Archival Description (EAD)

The DTD of the Encoded Archival Description (EAD) supplies a standard for encoding archival finding aids using SGML and XML. "The <eadheader> comprises a set of metadata about the finding aid that serves to identify unambiguously each particular EAD instance by providing a unique identification code for the document; by stating bibliographic information such as the author, title, and publisher of the finding aid; and by tracking significant revisions to the EAD file... Language encoding for EAD instances utilizes ISO 639-2 Codes for the Representation of Names of Languages, and the LANGENCODING attribute always should be 'ISO 639-2'." Spring 2001, EAD new DTD item adds script name: Scripts/symbol systems vs. languages (2001 EAD DTD Revision Suggestions, 46). "According to the 2nd ed. of ISAD(G) [General International Standard Archival Description], the data element 3.4.3 Language/scripts of material can contain 'the language(s) and/or script(s) of the materials. Optionally, also include the appropriate ISO codes for language(s) (ISO 639-1 and ISO 639-2) or script(s) (ISO 15924). EAD has no specific element to tag scripts, symbol systems, abbreviations employed...' [We should] create a new element to tag the scripts, symbol systems or abbreviations employed. This element could contain a new attribute to include the appropriate code(s) for scripts (ISO 15924)." Apparently resolved: "... One of the proposals that went through with little or no dissent [EAD Working Group meeting, Spring 2001] was the proposal to set up a separate attribute for 'script codes' (which is analogous to what we are calling writing system). There is a place to say something about the language of the materials in one tag, and then the ability to tie in language codes (ISO 639) and also the 'script codes' which are also defined by an ISO standard (ISO 15924). So it then becomes possible to say 'Latinized Hebrew' or what have you. I think in EAD, this would look like: <language langcode="heb" scriptcode="latn">Latinized Hebrew</language>..." [Based on a note from Merrilee Proffitt.]

Corpus Encoding Standard

The Corpus Encoding Standard (CES) [formal model in XML as well as SGML] has facility for encoding linguistic annotation. The 'cesAna DTD' has five global attributes, three of which pertain to language and writing system; lang and type store information about the (natural) language of an element's content; the type attribute supplies a two-letter code or three-letter code from ISO 639; wsd provides a means of indicating that the element's content is encoded in a specified character set. See also Section 3.5.2 on the 'langUsage' element and Section 3.5.3 on the CES 'wsdUsage' element for writing system declaration which identifies a character set.

Common Locale Data Repository (CLDR)

"The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format. CLDR Version 1.2, with the Locale Data Markup Language specification (LDML 1.2), provides key building blocks for software to support the world's languages. This new release contains data for 232 locales, covering 72 languages and 108 territories. There are also 63 draft locales in the process of being developed, covering an additional 27 languages and 28 territories." A CLDR locale id contains a language_code, as defined in the Locale Data Markup Language (LDML), where the language_code is drawn from RFC 3066 [or its successor] or from the set of 2-character ISO 639 language codes.

Language Tagging in Unicode

Unicode 3.0 provided for language codes, but the use of tag characters is [now] deprecated, except in very limited special cases. See the references and discussion below.

Unicode Version 3.0 Section 5.11 "Language Tagging in Plain Text" said: "For interchange purposes, it is becoming common to use tagged information, which is embedded in the text. Unicode Technical Report #7, 'Plane 14 Characters for Language Tags,' which is found on the CD-ROM or in its up-to-date version on the Unicode Web site, provides a proposed mechanism for representing language tags. Like most tagging mechanisms, these language tags are stateful: a start tag establishes an attribute for the text, and an end tag concludes it..." This paragraph has been deleted in version 3.1.

UTR #7 likewise has been superseded by the publication of Unicode Version 3.1. See for background: Unicode Technical Report #7. Plane 14 Characters for Language Tags. By Ken Whistler and Glenn Adams. Version 4.0. 2001-03-23. Text in 7-3.1. The Plane 14 Technical Report represented the consensus of a meeting of the UTC Working Group on Tagging and Annotation and of IETF representatives which took place on June 24, 1997. Rationale was offered (approximately) as follows: "... The difficulty of using general in-band text markup for simple protocols derives from the fact that some characters are used both for textual content and for the text markup; this makes it more difficult to write simple, fast algorithms to find only the textual content and ignore the tags, or vice versa. (Think of this as the algorithmic equivalent of the difficulty the human reader has attempting to read just the content of raw HTML source text without a browser interpreting all the markup tags.) The Plane 14 technical report addresses the recurrent and persistent call for a lighter-weight mechanism for text tagging than typical text markup mechanisms in Unicode. It proposes a special set of characters used only for tagging. These tag characters can be embedded into plain text and can be identified and/or ignored with trivial algorithms, since there is no overloading of usage for these tag characters--they can only express tag values and never textual content itself. Tag characters are not intended for general annotation of text..."

Unicode 3.1. See now the new Section 13.7 on "Tag Characters" in Unicode Standard Annex #27. Unicode 3.1, dated 2001-05-16. Tag Characters: U+E0000-U+E007F, sub 'Block Descriptions'. "The characters in this block provide a mechanism for language tagging in Unicode plain text. However, the use of these characters is strongly discouraged. The characters in this block are reserved for use with special protocols. They are not to be used in the absence of such protocols, or with any protocols that provide alternate means for language tagging, such as HTML or XML. The requirement for language information embedded in plain text data is often overstated. See Section 5.11, Language Information in Plain Text in The Unicode Standard, Version 3.0. This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based string tags using characters which can be strictly separated from ordinary text content characters in Unicode. These tag characters can be embedded by protocols into plain text. They can be identified and/or ignored by implementations with trivial algorithms because there is no overloading of usage for these tag characters -- they can only express tag values and never textual content itself. In addition to these 95 characters, one language tag identification character and one cancel tag character are also encoded. The language tag identification character identifies a tag string as a language tag; the language tag itself makes use of RFC 3066 (or its successors) language tag strings spelled out using the tag characters from this block... Tags of the same type cannot be nested in any way. For example, if a new embedded language tag occurs following text which was already language tagged, the tagged value for subsequent text simply changes to that specified in the new tag... Tags of different types can have interdigitating scope, but not hierarchical scope. 
In effect, tags of different types completely ignore each other, so that the use of language tags can be completely asynchronous with the use of future tag types... Avoiding Language Tags: Because of the extra implementation burden, language tags should be avoided in plain text unless language information is required and it is known that the receivers of the text will properly recognize and maintain the tags. However, where language tags must be used, implementers should consider the following implementation issues involved in supporting language information with tags and decide how to handle tags where they are not fully supported. This discussion applies to any mechanism for providing language tags in a plain text environment. Language tags should also be avoided wherever higher-level protocols, such as a rich-text format, HTML or MIME, provide language attributes. This practice prevents cases where the higher-level protocol and the language tags disagree..." See also the announcement for version 3.1 and the document "XML and Unicode."
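The mechanism quoted above can be illustrated with a minimal Python sketch. It assumes the usual layout of the Tags block (U+E0001 as the language tag identification character, U+E007F as the cancel tag, and the 95 tag characters at U+E0020 through U+E007E mirroring printable ASCII). The function names are hypothetical, chosen only for illustration. The second function shows the kind of "trivial algorithm" that can ignore tag characters, since they never carry textual content.

```python
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG (identifies a language tag string)
CANCEL_TAG = 0xE007F     # U+E007F CANCEL TAG
TAG_BASE = 0xE0000       # a tag character is TAG_BASE + the ASCII code point


def encode_language_tag(tag: str) -> str:
    """Spell out an ASCII language tag (e.g. 'fr-CA') with Plane 14 tag characters."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in tag):
        raise ValueError("language tag must be printable ASCII")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(ch)) for ch in tag)


def strip_tag_characters(text: str) -> str:
    """Drop every code point in U+E0000..U+E007F, keeping only textual content."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)
```

Because the tag characters occupy their own code range, filtering needs no parsing state at all; that is the property the technical report emphasizes, and it is also why the standard can recommend simply ignoring them where they are not supported.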

UTR #20. Unicode language tag characters are also discussed in "Unicode in XML and other Markup Languages." Unicode Technical Report #20. W3C Note 15 December 2000. Revision #5. By Martin Dürst and Asmus Freytag. [The document "contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML."] See Section 3.8 on 'Language Tag Characters': "A proposed series of characters from U+E0000 .. U+E007F for expressing language tags, based on existing standards for language tags using the rules in [UTR7]. Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were specifically included for the benefit of Internet protocols such as ACAP, which require a standard mechanism for marking language in UTF-8 strings and to avoid the use of other schemes that relied on specific details of the encoding form used. Problems when used in markup: These characters duplicate information that can be expressed in markup. Problems with other uses: Their special code range allows them to be easily filtered, but applications that don't expect them will treat them as garbage characters. Replacement markup: Replace with equivalent language markup [e.g., <xhtml:lang>]. What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove and/or replace them by equivalent markup..."
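As a rough illustration of the "replace them by equivalent markup" advice, the Python sketch below splits a string into runs governed by Plane 14 language tags and wraps each tagged run in an element carrying xml:lang. The choice of a span element and the function names are assumptions made for this example, not something UTR #20 prescribes. The sketch follows the non-nesting rule quoted earlier: a new language tag simply replaces the current value.

```python
from xml.sax.saxutils import escape, quoteattr

LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F     # U+E007F CANCEL TAG


def split_language_runs(text):
    """Return [(language or None, run of ordinary text), ...] in document order."""
    runs, lang, buf = [], None, []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp in (LANGUAGE_TAG, CANCEL_TAG):
            if buf:                      # close the run governed by the old tag
                runs.append((lang, "".join(buf)))
                buf = []
            i += 1
            if cp == CANCEL_TAG:
                lang = None
                continue
            start = i                    # collect the spelled-out tag value
            while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
                i += 1
            lang = "".join(chr(ord(c) - 0xE0000) for c in text[start:i]) or None
        elif 0xE0000 <= cp <= 0xE007F:
            i += 1                       # stray tag character: ignore it
        else:
            buf.append(text[i])
            i += 1
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


def to_markup(text):
    """Render tagged runs as <span xml:lang="..."> elements (hypothetical target markup)."""
    parts = []
    for lang, run in split_language_runs(text):
        body = escape(run)
        parts.append(body if lang is None
                     else f"<span xml:lang={quoteattr(lang)}>{body}</span>")
    return "".join(parts)
```

An editor that receives tagged plain text could apply a conversion of this kind once on import and then discard the tag characters, so that the markup is the only place where language information lives.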

Related proposal. Adding Embedded Language Identifiers to Plain Unicode Text. By Daniel Wood, Mark W. Davis, and Mark Leisher (Computing Research Lab, New Mexico State University). September 22, 1995. "...This approach basically consists of two parts: [A.] Allocation of sixteen codepoints from the Private Use area of the Unicode Standard character set and identification of their properties (in the Unicode Character Properties sense). [B.] A technique for constructing a language identifier when some contiguous subset of these sixteen codepoints is encountered in a Unicode text stream..." Of this proposal, Mark Leisher wrote: "And speaking of 'A Bad Idea(tm),' I proposed a different sort of bad idea a while back with a language tag idea that is technically more elegant than the Plane 14 tags, in my humble opinion. But it has its own problems as well..." [cache]

Language Tags and Operating Systems

This section has little to do with "markup" in the traditional sense. I include it as a reminder that a lot is at stake as designs are crafted for language identifiers: operating systems and applications software need to be customizable and language-property-aware, based upon the rapidly-changing world of language-smart data begging to be respected for linguistic properties they declare to be relevant.

Note: This section is taken substantially (wholesale) from Peter Constable of SIL, with permission; I have not checked it in any respect. See Peter's article which explains language identifiers in relation to locales.

Universal Locales for Linux. The Universal Locales for Linux project provides 140 or more Unicode-based locales for Linux. The site contains a list of locale identifiers that illustrates the use of ISO 639-1 two-letter codes in Linux and Unix locale identifiers.
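To make the role of the language code in such identifiers concrete, here is a small Python sketch that splits a locale name of the conventional form language[_TERRITORY][.codeset][@modifier] into its parts. The regular expression is a simplification assumed for this example (real systems accept more variation), and the names `LocaleParts` and `parse_locale` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str              # ISO 639 code, e.g. "en"
    territory: Optional[str]   # ISO 3166 code, e.g. "US"
    codeset: Optional[str]     # e.g. "UTF-8"
    modifier: Optional[str]    # e.g. "latin"


# Simplified pattern: language[_TERRITORY][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_locale(name: str) -> LocaleParts:
    """Split a Unix-style locale identifier into its components."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized locale identifier: {name!r}")
    return LocaleParts(**match.groupdict())
```

For instance, `parse_locale("en_US.UTF-8").language` yields the two-letter language code that the locale list on the site is meant to illustrate.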

About MS Win32 LANGIDs. A description of MS Win32 LANGIDs, taken from Developing International Software for Windows 95 and Windows NT, by Nadine Kano.

List of Locale Ids and Language Groups. LANGIDs from MS Global Software Development site. See also the list of LANGIDs in the Platform SDK documentation. They are language identifiers composed of a primary language identifier and a sublanguage identifier.
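The composition of a LANGID from a primary language identifier and a sublanguage identifier can be sketched in Python as follows. The bit layout (primary language in the low 10 bits, sublanguage in the 6 bits above it) mirrors the Win32 MAKELANGID, PRIMARYLANGID, and SUBLANGID macros. The Python function names are hypothetical, and the sample constants are included only as illustrative values.

```python
LANG_ENGLISH = 0x09          # primary language identifier for English
SUBLANG_ENGLISH_US = 0x01    # sublanguage identifier: English (United States)
SUBLANG_ENGLISH_UK = 0x02    # sublanguage identifier: English (United Kingdom)


def make_langid(primary: int, sublanguage: int) -> int:
    """Combine a primary language and a sublanguage into a 16-bit LANGID."""
    if not 0 <= primary <= 0x3FF:
        raise ValueError("primary language identifier must fit in 10 bits")
    if not 0 <= sublanguage <= 0x3F:
        raise ValueError("sublanguage identifier must fit in 6 bits")
    return (sublanguage << 10) | primary


def primary_langid(langid: int) -> int:
    """Extract the primary language identifier (low 10 bits)."""
    return langid & 0x3FF


def sub_langid(langid: int) -> int:
    """Extract the sublanguage identifier (bits 10 through 15)."""
    return (langid >> 10) & 0x3F
```

With these values, `make_langid(LANG_ENGLISH, SUBLANG_ENGLISH_US)` gives 0x0409, the identifier commonly seen for US English.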

Mac OS language constants (Carbon). Lists of constants used for language identification in the Mac OS Carbon interfaces. (This page points to other lists of constants as well.)

Mac OS language and locale identifiers (Cocoa). Describes the mechanisms used for language and locale identification in the Mac OS Cocoa interfaces, taken from Inside Mac OS X: System Overview.

Locales API Preliminary Documentation (Mac OS 8.6 and later). A draft document describing proposed mechanisms for language identification for use in new Mac OS technologies (Cocoa). This does not conform precisely to what is used in Mac OS X, but presents some interesting discussion of language identification issues.

Creating a Locale. Language identifiers in Java. Section of Java tutorial describing the use of language identifiers in creation of locale identifiers in Java. "To create a Locale object, you typically specify the language code and the country code; The first argument is the language code, a pair of lowercase letters that conform to ISO-639."

General References

[December 11, 2009] "Choosing a Language Tag." By Richard Ishida (W3C). From the W3C Internationalization web site. As reported by W3C on 11-December-2009: "New Internationalization Article: Choosing a Language Tag" — "The Internationalization Core Working Group publishes information to help people understand and use international aspects of W3C technologies. Recently the group published Choosing a Language Tag. The appearance of RFC 5646 earlier this year added a new 'extended language' subtag to BCP 47 and around 7,000 new entries in the IANA Language Subtag Registry. This article asks, which language tag is right for me, and how do I choose the language and other subtags I need? The answer outlines the necessary decisions in a step-by-step fashion..." All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries. The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document to refer to the subtag that represents a language, to more clearly make the distinction from 'language tag', which refers to the whole thing. To find a primary-language subtag, search the page for the name of that language... (1) Decision 1: The primary language subtag. You always start by choosing a primary language subtag, and often this is all you'll need for your language tag. Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used. (2) Decision 2: Extended language subtags. The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). 
Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry). Currently only seven primary language subtags can be used with extended language subtags. Six of those have a Scope field set to macrolanguage in the registry (ar, kok, ms, sw, uz, and zh), and the other is sgn... (3) Decision 3: Script subtags. Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual to the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn). Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags. (4) Decision 4: Region subtags. Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags. Like script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out. For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world. There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. (5) Decision 5: Variant subtags. Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used. 
Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags. (6) Decision 6: Private Use subtags. Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them..."
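The ordering rules in the six decisions above can be sketched as a small Python helper that assembles a tag from optional subtags. This is only a shape check under simplifying assumptions: it does not consult the IANA Language Subtag Registry, so it cannot tell whether a subtag is actually registered or whether an extended language subtag matches its required prefix. The function name is hypothetical; the "x" singleton introducing private-use subtags and the casing conventions follow BCP 47.

```python
import re

# Shapes of each subtag type (simplified; registry membership is not checked).
_SHAPES = {
    "language": re.compile(r"[A-Za-z]{2,3}"),
    "extlang": re.compile(r"[A-Za-z]{3}"),
    "script": re.compile(r"[A-Za-z]{4}"),
    "region": re.compile(r"[A-Za-z]{2}|[0-9]{3}"),
    "variant": re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}"),
    "private": re.compile(r"[A-Za-z0-9]{1,8}"),
}


def _check(kind, value):
    """Raise ValueError unless value has the right shape for this subtag type."""
    if not _SHAPES[kind].fullmatch(value):
        raise ValueError(f"{value!r} does not look like a {kind} subtag")
    return value


def build_language_tag(language, extlang=None, script=None, region=None,
                       variants=(), private_use=()):
    """Join subtags in the required order: language, extlang, script, region,
    variants, then private-use subtags after an 'x' singleton."""
    parts = [_check("language", language).lower()]
    if extlang:
        parts.append(_check("extlang", extlang).lower())
    if script:
        parts.append(_check("script", script).title())   # e.g. Latn
    if region:
        parts.append(_check("region", region).upper())   # e.g. GB
    parts.extend(_check("variant", v).lower() for v in variants)
    if private_use:
        parts.append("x")
        parts.extend(_check("private", p).lower() for p in private_use)
    return "-".join(parts)
```

Keeping every argument except the language optional reflects the article's golden rule: the shortest tag that makes the needed distinction is the one to use, so `build_language_tag("ru", script="latn")` produces `ru-Latn` and nothing more.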
[December 12, 2004] "Using Language Identifiers (RFC 3066)." By Tex Texin and John Cowan. "Language identifiers as specified by RFC 3066, can have the form language, language-country, language-country-variant and some other specialized forms. The guidelines for choosing between language and language-country are ambiguous. To clarify which form should be used, John Cowan and I have posted this list for review. This is currently a draft document. It will be continually revised as we get feedback from linguists and internationalization experts... The topic is being discussed on the W3C www-international mail list and the IETF ietf-lang mail list... There are a number of suggestions for deciding whether to use a one-level (language only) or two-level (language-region) tag. They require some discussion and will be added here shortly. Language codes are from ISO 639. Country codes are from ISO 3166..." [snapshot 2004-12-14]
[January 05, 2004] [Apropos of Localization:] "GDP by Language." By Mark Davis (President, The Unicode Consortium; IBM Corporation). Unicode Technical Note #13, Version 1 (first public version). 2003-01-22. Latest Version URL: http://www.unicode.org/notes/tn13. ['While English is a major language, it only accounts for around 30% of the world Gross Domestic Product (GDP), and is likely to account for less in the future.'] "Many people in the software industry don't realize how important it is to localize products for different languages around the world. While English is a major language, it only accounts for around 30% of the world Gross Domestic Product (GDP), and is likely to account for less in the future. Neglecting other languages means ignoring quite significant potential markets. This short article provides one picture of the economic significance of different languages, with a breakdown of the percentages of world GDP by language. Not only does it show the current breakdown, but it also provides data for the years 1975 to 2002 to show modern trends. The most notable feature is the steady rise of Chinese and slow relative decline of Japanese and most European languages. Korean and Indic languages also show growth over that period, though slower than Chinese." Figure 1 portrays GDP by Language for 1975-2002. Figure 2 shows projected GDP by Language, 2003-2010. A paper from Goldman Sachs ('Dreaming With BRICs: The Path to 2050') projects that "the combined GDP of the BRIC countries (Brazil, Russia, India, and China) will exceed that of the current G6 (United States, Japan, Germany, France, United Kingdom, and Italy) before the year 2050. The chart [Figure 2] uses that data to extrapolate what the GDP by Language breakdown would be over the coming years. Chinese would have increasing weight; Russian, Portuguese, and Indic would all increase as well, but most significantly after 2010..." 
Note: This Unicode Technical Note is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium.
[September 20, 2003] Standards Organizations Express Concern About Royalty Fees for ISO Codes. W3C, the Unicode Technical Committee, and INCITS (International Committee for Information Technology Standards) have recently published statements of concern about ISO's interpretation of law and policy on the collection of royalty payments for the use of ISO codes. The data elements in question involve several ISO standards that are often referenced in Internet infrastructure specifications and protocols, and code lists that are widely implemented in language-sensitive text processing software. The lists include ISO 639 'Codes for the representation of names of languages', ISO 3166 'Codes for the representation of names of countries and their subdivisions', and ISO 4217 'Codes for the representation of currencies and funds'. ISO has clarified that "generally, software developers or commercial resellers requesting permission to embed the data elements contained in an ISO Code in their products for resale will be asked to purchase the Code in electronic format and pay either an annual fee or a one-time fee and any applicable maintenance fees required." The letters from W3C, UTC, and INCITS have appealed to ISO and ANSI for reversal of their interpretation and policy.
[July 03, 2002] ISO New Work Item Proposal. From BSI UK. ISO Reference: ISO/TC 37/SC 2/WG 1 N 95. BSI has submitted this proposal (to be processed as a New Work Item Proposal) that would base language identification efforts on the foundations of the Linguasphere Registry, per BS258. Extract: "[NWI] Title: Language resources: register of codes and identification tags of the world's language and speech communities. Scope: Standardization of basic principles for and the maintenance of a register that provides transparent, accurate and unambiguous codes and tags for the classification and identification of all the world's languages and speech communities. Aim and Necessity: To develop and publish Part 1, the standardised specification for the maintenance of the register of codes and tags and Part 2, an electronic and hard copy version of the standardised register conforming to Part 1, other related standards of ISO/TC 37 and others. The continuing development and improvement of Part 2 is to be assured by the formation of an ISO Maintenance Agency based in Wales which is ideal for the purpose. Language distinguishes humans as a life form. The level of facility and functionality to be provided by this standard is of fundamental importance for human and machine purposes to ensure success in achieving, maximising and sustaining a fully multilingual, multicultural and harmonious form of human globalization. Feasibility: The production of the standards is well under way within BSI and the Linguasphere Observatory Partnership. The Maintenance Agency will be formed by a Partnership of the same parties where BSI will provide the Secretariat for the ISO steering committee. The Linguasphere Observatory Partnership will be the operational unit for the Maintenance Agency from their location in Wales. The process of obtaining seed funding support from Pembrokeshire County Council and the Welsh Development Agency is under way using the attached Business Plan..." 
See the self-extracting executable (Win) or ZIP archive.
[May 27, 2002] "SIL Three-letter Codes for Identifying Languages: Migrating from in-house standard to community standard." By Gary F. Simons (SIL International). ISO Reference: ISO/TC 37/SC 2/WG 1 N 94. [A paper presented at the International Workshop on Resources and Tools in Field Linguistics, LREC 2002 (26-27 May 2002, Las Palmas, Canary Islands).] "A foundational aspect of documenting an endangered language and preserving that documentation for long-term access is identifying the language itself. The web version of the Ethnologue has become the de facto standard for identifying the more than 6,800 languages spoken in the world today. The system of three-letter codes that uniquely identify each language has been used within SIL for nearly three decades as an in-house standard, but now there is increasing demand for these codes to be used by other organizations and projects. This paper describes four changes that SIL International is implementing in order to make its set of language identification codes better meet the needs of the wider community. The changes seek to strike a balance between becoming more open while at the same time becoming more disciplined... [Conclusion:] Language identification is a foundational aspect of documenting an endangered language and preserving that documentation for long-term access. This is because effective retrieval of archived language resources depends on the uniform identification of the languages to which they pertain. The system of three-letter language identification codes used in the Ethnologue is proving to be a useful tool for this purpose, and will be even more useful when it can be managed more as a community standard than as an in-house standard. SIL International is therefore endeavoring to implement the changes described in this paper in hopes of better serving the language resources community." [cache]
[March 08, 2002] "Toward a Model for Language Identification. Defining an Ontology Of Language-Related Categories." By Peter G. Constable (SIL Non-Roman Script Initiative, NRSI). Document reference: ISO/TC37/SC2/WG1 N91. February 27, 2002. 34 pages. Draft of paper for the 21st International Unicode Conference, Dublin, Ireland, May 2002. "To deal with the diverse language identification needs, people are looking to the ISO 639 family of standards, which provide over 400 different language identifiers. For those working with hundreds or thousands of less well-known languages, however, this number falls well short of what is needed. Similarly, these standards do not provide mechanisms that accommodate intralanguage distinctions involving parameters such as script. Some protocols have some ability to overcome the limitations in ISO 639 by making reference to the derivative standard provided in RFC 3066, which allows for the creation of tags that add additional qualifiers to the ISO 639 codes, or for the registration of entirely original identifiers. There are potential concerns with introducing a greatly expanded set of tags under the terms of RFC 3066, however, since it could quickly lead to considerable confusion, for reasons I will describe momentarily... This paper is intended to explore what an adequate model of 'language' identification should look like. In particular, it aims to describe the ontology for which 'language' identifiers are needed; that is, the different kinds of language-related entities in the real world that are relevant for IT purposes, and the relationships between them. In view of this ontology, I will also attempt to derive implications for an adequate system of 'language' identifiers to be used in IT applications... in the view presented here, we are dealing with multiple types of categories, all of which are related to language per se but some of which are also somehow different. 
In other words, not all of the distinctions for which we use 'language' identifiers are between languages. Thus, in making reference to 'language' identification, what is really meant is identification with regard to various types of language-related categories..." [source]
[March 08, 2002] "Future Development of ISO 639." By Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1 'Coding systems'). Document reference: ISO/TC37/SC2/WG1 N89. Date: 2002-03-04. 4 pages. "ISO 639-1 (alpha-2 code) and ISO 639-2 (alpha-3 code) are designed to meet the needs of terminology and library applications. The two parts of the standard and the coordinated effort to develop these two parts represent a vast step toward a universally acceptable set of identifiers for linguistic units. In particular the library community has a genuine need to keep the set of identifiers stable. There are at least a nine-digit number of records using these identifiers. Although there is broad acceptance that the present parts of ISO 639 will be developed further, this development needs to be conservative. For the ICT industry and for language resource and language technology applications there is also a genuine need to expand the current set of language identifiers and language identification mechanisms greatly. There may be a need for identifiers for 15-20 times as many linguistic units as the current tables provide. ISO/TC37 is ready to initiate projects to meet these needs. The projects will be carried out within the framework of ISO/TC37/SC2/WG1. It is, however, recognized that it may be necessary to utilize working procedures and organizational structures that are different from most projects under ISO/TC37 and other ISO committees. It will not be possible to meet the requirements as to timeliness without substantial external funding..." See the news item of 2002-03-08: "ISO Working Group on Coding Systems Outlines New Language Encoding Initiatives." [source .DOC, cache]
[February 27, 2002] "Codes for the representation of names of languages -- Part 1: Alpha-2 code. [Codes pour la représentation des noms de langue -- Partie 1:Code alpha-2.]." From ISO/TC 37/SC 2 (Secretariat: SCC). International Standard ISO/FDIS 639-1. Reference: ISO/FDIS 639-1:2002(E/F). Final Draft. 48 pages. Voting begins on 2002-02-28. Voting terminates on 2002-04-28. "ISO 639 provides two language codes, one as a two-letter code (ISO 639-1) and another as a three-letter code (ISO 639-2) for the representation of names of languages. ISO 639-1 was devised primarily for use in terminology, lexicography and linguistics. ISO 639-2 represents all languages contained in ISO 639-1 and in addition any other language, as well as language groups, as they may be coded for special purposes when more specificity in coding is needed. The languages listed in ISO 639-1 are a subset of the languages listed in ISO 639-2; every language code element in the two-letter code has a corresponding language code element in the three-letter code, but not necessarily vice versa. Both language codes are to be considered as open lists. The codes were devised for use in terminology, lexicography, information and documentation (i.e., for libraries, information services, and publishers) and linguistics. ISO 639-1 also includes guidelines for the creation of language code elements and their use in some applications... The alpha-2 code was devised for practical use for most of the major languages of the world that are not only most frequently represented in the total body of the world's literature, but which also comprise a considerable volume of specialized languages and terminologies. Additional language identifiers are created when it becomes apparent that a significant body of documentation written in specialized languages and terminologies exists. Languages designed exclusively for machine use, such as computer-programming languages, are not included in this code..." 
Background may be found at an ISO 639 web site maintained by Håvard Hjulstad. [cache]
[February 13, 2002] Analysis of ISO 639 and mappings to SIL Ethnologue. Posted by Peter Constable (SIL). In connection with ISO/TC 37/SC 2/WG 1 activity, the author has added some new pages to the Ethnologue web site that present an analysis of the existing ISO 639 language codes, together with a proposed mapping of those codes to entries in the SIL Ethnologue; relevant URLs are given.
[February 13, 2002] The EMELD Language Lookup Pages has listings for ancient and extinct languages, constructed languages, and combined language codes lookup (look up a language, family, or language code; searches both Ethnologue and LINGUIST databases). See the entry above. See also the main topic page: "Electronic Metadata for Endangered Languages Data (EMELD)."
[February 12, 2002] "Improved Language Coding. Efforts and Issues." Edited by Sue Ellen Wright. Slides from the panel discussion on language codes. 20th International Unicode Conference, Washington, D.C., USA. January 2002. See also as ISO/TC37/SC2/WG1 N90: "Improved Language Coding: Efforts and Issues" (Unicode Conference, January 2002), with sources: [PowerPoint (.ppt) and PDF]
[February 04, 2002] Language Identification Issues 2002-02-04. Contributed by Peter Constable. Reports on meetings and activities involving Linguasphere, Ethnologue, W3C Internationalization Working Group, etc.
[February 02, 2002] "XTM Uses Scope For Languages" [XML Topic Maps]. By Steven R. Newcomb.
"Miscellaneous Tagging - Language Settings." Chapter 3 in XML Internationalization and Localization, by Yves Savourel. Sams Publishing. Published June 26, 2001. The author discusses the use of xml:lang and related matters. See the book summary.
[November 05, 2001] "LINGUIST List Language Database." Posting from Anthony Aristar (Department of English, Wayne State University). The LINGUIST Public Lookup Pages allow one to look up a language, family, or language code; to show all ancient languages in the LINGUIST Database; and to show all constructed languages in the LINGUIST Database. "We have now finished putting together the initial version of the language database facility we talked about in Santa Barbara. This system includes all of Ethnologue (which SIL has generously allowed us to use) as well as a supplementary database of ancient and constructed languages, which we ourselves have put together, and which includes brief descriptions and unique codes. The intent is to allow us to precisely categorize by language any data we encounter. The language search facility based on this database allows four kinds of searches: (1) Search by language name; this searches a database of around 48,000 alternate names, and does a fuzzy match on your input; (2) Search by Ethnologue or LINGUIST code; (3) Search by family or subgroup name; this will return a list of languages if the node dominates language names, and a clickable tree if it doesn't; (4) Generate a tree of any of the language families in the database. These last two [kinds of searches] will only work properly if you have Java enabled on your machine. We've also set up pages that will give quick listings of all the ancient and constructed languages in the database." Also for extinct languages. Some caveats apply [minor changes to Ethnologue system; system can be slow as of 2001-11-05].
[November 05, 2001] "XML Internationalization FAQ." From Opentag.com. Updated October 29, 2001 or later. "You will find here answers to some of the most frequently asked questions about XML internationalization and localization, including XSL, CSS, and other XML-related technologies..." The document contains some thirty-seven (37) questions and answers on matters of Character Representation, Encoding, Language Identification, Presentation and Rendering, and Localization. Sample questions on Language Identification: (1) What is the xml:lang attribute? (2) Do I need to declare the xml:lang attribute? (3) What are the values for the xml:lang attribute? (4) In XHTML should I use lang or xml:lang? (5) What about multilingual documents? (6) Can I use Unicode Language Tags in XML? (7) How do I use the lang() function in XPath? (8) How do I use the lang() selector in CSS? See the Q/A snapshot from 2001-10-29.
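Several of the sample questions turn on how xml:lang is inherited and how XPath's lang() function matches it. The following Python sketch uses only the standard library's ElementTree (which does not itself implement lang()) to show both ideas under simplifying assumptions. The helper names are hypothetical, and the matching function is an approximation of the lang() rule: a case-insensitive match on the whole value or on the part before a hyphen.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_with_language(elem, inherited=None):
    """Yield (element, language in scope), inheriting xml:lang from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    yield elem, lang
    for child in elem:
        yield from iter_with_language(child, lang)


def lang_matches(lang, wanted):
    """Approximate XPath lang(): equal, or a sublanguage such as 'fr-CA' for 'fr'."""
    if not lang:
        return False
    lang, wanted = lang.lower(), wanted.lower()
    return lang == wanted or lang.startswith(wanted + "-")


doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="fr-CA">Bonjour</p></doc>'
)
french = [e.text for e, lang in iter_with_language(doc) if lang_matches(lang, "fr")]
```

Here the first paragraph inherits "en" from the document element, while the second overrides it, so only the second is selected when filtering for French.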
[November 03, 2001] "Summary meeting report: ISO/TC37/SC2/WG1 (Coding systems), 2001-08-14." By John Clews (Keytempo Information Management). 21-August-2001, updated 2-November-2001. "ISO/TC37/SC2/WG1 (Coding systems), met on 2001-08-14, with participants from Norway, Canada, Austria, USA, France, Spain, Japan, and UK. ISO 639-1:2001 is agreed and awaits publication. It is intended that this replaces ISO 639. The issue of freezing the 2-letter repertoire remains open. It is planned that the tables are available via the Internet... Chris Cox (UK) presented a paper by David Dalby which supported a UK proposal for a New Work Item on alphanumeric language codes in extending the detail of language code provision, which would give additional information about languages and codes, and their relationship, that a single 3-letter code alone did not give. Instead of accepting the UK proposal as it stood, ISO/TC37/SC2/WG1 agreed to recommend that ISO/TC37/SC2 itself (rather than the UK) should progress a New Work Item, and set up a Task Force of ISO/TC37/SC2/WG1, which would begin by identifying the User requirements, and also examine possible methodology(ies). Task Force members will be Gerhard Budin (Austria, Chair), Håvard Hjulstad (Norway), Jennifer De Camp (USA), and John Clews (UK). Further progress is likely at various points before August 2002, when the next ISO/TC37 meetings will take place (2002-08-19 through 2002-08-23)."
[September 24, 2001] "Report of the Ninth Meeting of ISO/TC37/SC2/WG1 [Toronto, 2001-08-14]." Document N88. Prepared by Håvard Hjulstad, Convener of ISO/TC37/SC2/WG1. '(1) Jennifer DeCamp presented user requirements for language coding. Her PowerPoint presentation (which she did not display due to lack of equipment) is available as document ISO/TC37/SC2/WG1 N 78. (2) Jake Knoppers presented input from ISO/IEC JTC 1 / SC 32 and submitted a number of documents (available as documents ISO/TC37/SC2/WG1 N 79 through 87). He also had a formal liaison request, which will be addressed at the SC 2 plenary. (3) Gary Simons gave a brief presentation of SIL, and Peter Constable presented a "Mapping between ISO 639 and the SIL Ethnologue", which is available as document ISO/TC37/SC2/WG1 N 76. Håvard Hjulstad (convener) presented two outlines for possible work items: document ISO/TC37/SC2/WG1 N 72 ("Additional language coding") and document ISO/TC37/SC2/WG1 N 71 ("Language group coding"). Chris Cox presented a draft for a work item, "Development and application of ISO 639 in the identification, classification and alphanumeric coding of the world's languages". The document is available as ISO/TC37/SC2/WG1 N 77...' [source, files .DOC]
[September 11, 2001] Meeting report: ISO/TC37/SC2/WG1 (Coding systems), Toronto, 2001-08-14. By John Clews. Posting to the E-MELD-CODES mailing list. September 11, 2001. "ISO/TC37/SC2/WG1 (Coding systems) met on 2001-08-14, with participants from Norway, Canada, Austria, USA, France, Spain, Japan, and UK. ISO 639-1:2001 is agreed and awaits publication. It is intended that this replaces ISO 639. The issue of freezing the 2-letter repertoire remains open. It is planned that the tables are available via the Internet... The meeting mainly comprised an Open discussion about language coding and the need and feasibility to standardize and extend existing language coding... ISO/TC37/SC2/WG1 agreed to recommend that ISO/TC37/SC2 should progress a New Work Item, and set up a Task Force of ISO/TC37/SC2/WG1, which would begin by identifying the User requirements, and also examine possible methodology(ies). Task Force members will be Gerhard Budin (Austria, Chair), Håvard Hjulstad (Norway), Jennifer De Camp (USA), and John Clews (United Kingdom). It is planned that October 15, 2001 will be an initial reporting date to ISO/TC37/SC2/WG1." [cache]
[August 28, 2001] "Issues and Proposals for Language Tags." By Dr. Jennifer DeCamp (Member of the U.S. delegation to the International Standards Organization Technical Committee 37 on Terminology; MITRE Corporation, a Federally Funded Research and Development Center). Paper to be presented at Nineteenth International Unicode Conference (IUC19) [September 10 - 14, 2001 in San Jose, California]. "Language tags are used to designate the language in word processing and in web pages. The tags facilitate the use of tools such as search engines, automatic hyphenation, spell checking, grammar checking, dictionaries, and machine translation. However, the current International Standards Organization (ISO) standards for language tags cover relatively few languages. In addition, there are separate three-letter codes for the library community and the rest of the world (see ISO-639). There is also an ISO two-letter standard. More comprehensive tag sets exist, such as by SIL's Ethnologue; however, there are additional issues with such sets, such as consistent level of detail. This panel presents the issues with language tags and the proposals being considered by ISO, including use of the Ethnologue tag set, use of a new four-letter code, use of a language tag set registry, and revision of the three-letter codes. It also solicits input from the Unicode community on the best approach for obtaining a comprehensive tag set. Panel members will include representatives from ISO, the Ethnologue, the Unicode Consortium, the U.S. Government, and industry."
[August 27, 2001] "Working with Language Identifiers. Current and Developing Standards for Distinguishing Languages in a Multilingual Environment." By Peter Constable (Non-Roman Script Initiative, SIL International). In MultiLingual Computing and Technology Volume 12 Issue 6 [#42] (June 2001), pages 63-69. ISSN: 1523-0309. The author provides an overview of standards and systems for language identification, including the notion of locales. Language identification mechanisms are surveyed for Win32 platforms, Apple/Mac (Carbon, Cocoa), and for the Microsoft .NET framework. IETF (RFC 1766/3066) and ISO (639-1, 639-2) language tag inventories are described, together with reference to the SIL (Ethnologue) codes. The author believes that ISO committees (TC 37, TC 46), IETF, and UTC are ready to cooperate in the design of solutions which embrace additional language codes for a wider range of computing applications.
[August 09, 2001] "IT-enablement and Language Codes." From Dr Jake Knoppers. Document: N87. "I have had a chance to scan through TC37/SC2/WG1 documentation re: language coding systems, lack of consistency in names of languages among 639-1 and 639-2, language group codings, linkage of language codes and territorial mappings including 'jurisdictions', etc. From an Open-EDI, e-commerce, e-business, etc. there are similar issues albeit from a different perspective... I want to introduce some documents approaching these issues from a ISO/IEC JTC1 'Information Technology' perspective and within this those of electronic data interchange (Open-edi), metadata, e-commerce, e-business, e-administration, etc. These have resulted in the launching of two new standardization activities ISO/IEC 18022 and ISO/IEC 18038 these will need to interwork closely with ISO 639-2. I am the Project Editor for both. Attached are a series of documents which may be of use to TC37/SC2 work in this area. Recommendations: (1) Use ISO 639-2/T as the core set identifiers and pivot codes especially in support of Open-edi and other computer-to-computer IT-interface requirements. (2) Integrate ISO 639-1 into 639-2 and make it an 'partially equivalent sub-set' freezing its development. (3) Declare current ISO 639-2/B to be an alternative equivalent to the 639-2/T 'pivot code set'..." [source]
[August 08, 2001] "Development and Application of ISO 639 in the identification, classification and alphanumeric coding of the world's languages." Document: N77. From: BSI. "There is an established need for a standardised system of codes for the tagging and identification of the world's languages. Variation still exists, however, in the form of language codes used by different organisations and in different countries. The ISO 639 codes provide the base for standardisation in this field, although they at present cover only a small proportion of the world's languages. These ISO language codes also exist in 3 different versions, the ISO 639-1 two-letter code, and the ISO 639-2/T and 639-2/B three-letter codes (as designed for terminological and bibliographical use, respectively). A fully classified inventory of the world's languages and speech communities was published in 1999/2000, including a coded index of over 71,000 names (Linguasphere Register of the World's Languages and Speech Communities). The following proposal outlines how the 3 versions of the ISO 639 codes may be unified as a single standard, and how the formal linking of this standard with the Linguasphere zones of reference would create an alphanumeric Global Identification Code (GIC) with increased informational content and inbuilt protection from error..." [source]
[August 03, 2001] "JTC 1/SC 32 WG 1 Resolutions from the Piscataway Meeting." Piscataway, NJ, USA, 2001-08-03. Document: N82. "Editing and ballot of 15944-1: SC32/WG1 thanks Mr. Paul Levine for his work in editing the base document of 30.02.01.00.00 according to the instructions given at the FCD 15944-1 editing meeting. All comments were successfully resolved. SC32/WG1 instructs its secretariat to forward the edited 15944-1 to the SC32 Secretariat for the FDIS ballot immediately after the Piscataway meeting. SC32/WG1 instructs its secretariat to send a request to JTC1 to make the standard 15944 publicly available..." [source]
[August 03, 2001] "Possible Request for Reservation of ISO 639-2 Codes for Languages 'ISO English' and 'ISO French' (in support of IT-enabled Open-edi including e-commerce, e-business, etc.)." Document: N80. ISO/IEC JTC1 SC32/WG1 N0179R. From Dr. Jake V. Th. Knoppers, Project Editor. Presented for review and decision at the SC32/WG1 Piscataway (USA) Meeting, 30 July - 3 August, 2001. Source: New ISO/IEC 18038 and new ISO/IEC 18022. Several 3-alpha codes currently exist for representation of the English and French languages in ISO 639-2. In addition, ISO 639-2 in Clause 4 - Language codes and its sub-clauses recognizes special situations (4.12) and local codes (4.1.4), and makes provision for "Registration of new language codes" (4.2). This localization issue can be systematically resolved by use of ISO 639-2 in conjunction with the ISO 3166-1 country code. For example, the use of the English (eng) and French (fra) languages in Canada (124) can be represented as '124:eng' and '124:fra', the use of French in France as '250:fra', in Senegal as '686:fra', in Belgium as '056:fra', etc. However, international entities as a "category of jurisdictions" such as the ISO, IEC, ITU, the UN and its international bodies have their own particular use of common natural languages, terminology, and vocabularies, which can and does differ from ordinary day-to-day use of English, of French, etc. The use of natural languages in international contexts, especially those recognized as international languages by the ISO (and UN), would benefit from having their own designated codes in ISO 639-2, i.e. for ISO English, ISO French, etc. [source]
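The 'country:language' pairing described in this entry can be sketched as follows. This is only an illustration, assuming the colon-separated form shown in the examples above; the function names and the small lookup table are hypothetical and cover only the countries the entry names.

```python
from typing import Tuple

# Hypothetical lookup table, limited to the ISO 3166-1 numeric codes
# quoted in the entry above.
ISO_3166_1_NUMERIC = {
    "Canada": "124",
    "France": "250",
    "Senegal": "686",
    "Belgium": "056",
}


def qualify(country: str, lang: str) -> str:
    """Join a numeric country code and a 3-letter language code as 'NNN:lll'."""
    return f"{ISO_3166_1_NUMERIC[country]}:{lang}"


def split_qualified(value: str) -> Tuple[str, str]:
    """Split 'NNN:lll' back into (country code, language code), checking shape only."""
    country, _, lang = value.partition(":")
    if not (len(country) == 3 and country.isdigit()):
        raise ValueError(f"expected a 3-digit country code in {value!r}")
    if not (len(lang) == 3 and lang.isalpha()):
        raise ValueError(f"expected a 3-letter language code in {value!r}")
    return country, lang
```

Keeping the country code as a string preserves leading zeros, as in '056:fra' for Belgium.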
[August 03, 2001] "Need for a standard 'default' convention for referencing ISO 639-2: 'Codes for the representation of names of languages' in Open-edi business transactions and e-commerce, e-business, etc." Document N79. ISO/IEC JTC1 SC32/WG1 N178R. Status: Liaison request to ISO TC37/SC2. Action ID: For review, decision and reply by TC37/SC2 to JTC1/SC32/WG1. "This is a liaison request from ISO/IEC JTC1/SC32/WG1 to ISO TC37/SC2 for advice as to which of the two 'standard' code sets in ISO 639-2, i.e., '639-2/B' or '639-2/T', is to be used as the default standard for use in Open-edi business transactions, (e.g., e-commerce, e-business, e-banking, etc.), for the identification and referencing of languages. Alternatively, ISO TC37/SC2 could reply that this is for SC32/WG1 to decide (as per Clause 4.1 in ISO 639-2). In its standards development work supporting the needs of electronic commerce administration, etc., ISO/IEC JTC1/SC32/WG1 wants to make ISO 639-2:1998 (E/F) a normative reference in support of its Open-edi standards work..." [source]
[August 2001] "Towards Common Language Codes." By Jennifer DeCamp. Document N78. Derived PDF, based upon the original Powerpoint slides. "...Will This Work for the Short Term: (1) Compile a list of languages now required that are not in ISO 639; (2) Attempt to obtain 50 documents in each language; (3) For languages where such documents can be easily obtained, submit formal ISO request; (4) Use language code from Linguasphere or SIL, if that code has not already been used in ISO 639. Otherwise provide new language code not in Linguasphere or SIL and not in ISO..."
[August 2001] "Draft technical report: Language codes part 3: Guide to the alpha-numeric coding of the world's languages." Document: N74R. ISO/TC37/SC2/WG1 N74 (R). From John Clews (UK). "This is a document that could (a) serve towards being either a technical report which documents different language codes now in widespread use, and (b) -- if deemed appropriate by ISO/TC37/SC2 -- could also be developed further as a list of single language codes which could be used to extend the current repertoire of language codes in the existing parts of ISO 639. Direction for further development will be taken from ISO/TC37/SC2. It starts from the premise that individual codes from various existing coding systems are already being used together, and outlines problems to avoid in doing this..." [source]
[July 2001] Draft Technical Report: Language Codes Part 3. By John Clews. Proposal. Document Reference: ISO/TC37/SC2/WG1 N74. Date: 2001-07. 12 pages. "This draft technical report has been prepared taking into account the aims and needs expressed in the document ISO/TC37/SC2/WG1 N69: 'Coding systems', prepared on 2001-01-31 by Håvard Hjulstad (convener of ISO/TC37/SC2/WG1) in Norway... Language codes part 3 lists language codes used in ISO 639-1 and ISO 639-2, and also provides information on additional language codes used in other coding systems. This is provided in a detailed table. It plans to provide information on which language codes from other coding systems are safe to use in addition to codes from ISO 639-1 and ISO 639-2, and guidelines on avoiding problems. There is the potential to develop a further full standard (a notional ISO 639-3) which would provide a much-extended list of language codes, in comparison to that currently available, to meet user needs. However, the initial aim is to provide documentation, and that is the principal aim of this draft technical report... The table supplies for each entry a reference identifier, Users, Area, Associated Country, Language Name, and mapping to other language code lists (as applicable), including (1) I-2 [2-letter codes from ISO 639 and ISO 639-1, and new codes applied by the ISO 639 Maintenance Agency]; (2) I-3T [3-letter codes from ISO 639-2, and new codes applied by the ISO 639-2 Maintenance Agency]; (3) SIL [3-letter codes from the Ethnologue, published by the Summer Institute of Linguistics, SIL]; (4) OT [3-letter OpenType language tags, developed by Adobe and Microsoft, widely used in the IT industry]; (5) I-3B [3-letter bibliographic codes from ISO 639-2, and national variants of these codes used in libraries]; (6) Linguascale [a classification system providing a way of referring to related languages, documented in the Linguasphere Register]..." This document is also available in HTML format. [source, .DOC]
[July 23, 2001] ISO/TC37/SC2/WG1, 'Coding systems'. Document Reference: ISO/TC37/SC2/WG1, N 73. Subject: Convener's report. Prepared by: Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1). Date: 2001-07-23. ISO/FDIS 639-1: "Following the WG meeting in London 2000-08-16 the text for ISO/FDIS 639-1 was finalized. However, the submission of the document was somewhat delayed because some items needed finalization by the Joint Advisory Committee (ISO 639 RA-JAC). The FDIS document was sent by the convener to ISO/TC37/SC2 Secretariat for submission to ISO Central Secretariat toward the end of June 2001..." [cache]
[July 18, 2001] "Language group coding - A pre-Working Draft." ISO/TC37/SC2/WG1, 'Coding systems'. Document Reference: ISO/TC37/SC2/WG1 N 71. Prepared by: Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1). Date: 2001-07-18. 22 pages. Background: "The present language coding standards (ISO 639-1 and ISO 639-2) have no general mechanism to identify groups of languages. The 'group' identifiers and 'other' identifiers in ISO 639-2 do really not address this issue. The 'other' identifiers are also unstable. (One example: sla = 'Slavic (Other)' does not include Russian and Polish etc that have their own identifiers, and the extension of 'sla' is reduced every time a new Slavic language gets its separate identifier.) There is an obvious need to classify (group) languages by different criteria, in particular geographical and linguistic criteria. Geographical criteria (e.g., 'indigenous languages of the Nordic region' makes sense in some contexts, although it includes languages from three separate language families) should more suitably be dealt with in the context of ISO 3166, which unfortunately so far does not include identifiers on a 'higher' level than country... Linguistic criteria may also be of different kinds. The proposal in this document covers probably the one that is most obvious and most urgently needed: a hierarchical classification of languages based on the principles of diachronic linguistics. Other criteria include writing system, functions of sociolinguistics, and other language typologies... The main part of the document may contain lists of 'language groups' with hierarchical information. Each language group would be assigned an alpha-5 identifier plus English and French names (as 'indigenous names' are irrelevant in most cases). The list could be presented in four tables: by English names, by French names, by alpha-5 identifiers, and hierarchically. In the tables below French names have not yet been included. 
A separate table in an annex should include the hierarchical table from the main part together with all items in ISO 639-1 and ISO 639-2..." [cache]
[July 18, 2001] "Additional language coding - A pre-Working Draft." ISO/TC37/SC2/WG1, 'Coding systems'. Document Reference: ISO/TC37/SC2/WG1 N 72. Prepared by: Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1). Date: 2001-07-18. Background: "ISO 639-1 and ISO 639-2 include one mechanism to identify 'language variety' by combining language identifiers with identifiers from ISO 3166 (all parts). However, this mechanism is highly inadequate. The standards do not specify clearly how the identifiers should be combined. The following examples have been seen: [1] en term /US/; [2] en term US; [3] enUS term; [4] en US term; [5] en-US term. Language variation exists on many more levels than geography. This includes temporal variation, sociolinguistic variation, and stylistic variation... Some details for a New Item Proposal: At least the following attributes may be defined (with random designations here): geog (geographical specification), script (writing system), temp (temporal specification), socli (sociolinguistic specification), and style (stylistic specification)... The following uses an 'SGML-based' notation. The actual notation in the final document needs to be aligned with relevant SGML and XML applications..." [cache]
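The inconsistency this entry describes can be illustrated with a small normalizer that maps the observed spellings onto a single hyphenated form. This is a sketch under stated assumptions: the helper name `normalize` is hypothetical, the target form 'en-US' is chosen only because it is one of the five notations listed, and only 2-letter language and country identifiers are handled.

```python
import re

# Two letters, any run of the separators seen in the listed notations
# (space, slash, hyphen, underscore), two letters, optional trailing slash.
_PATTERN = re.compile(r"^([A-Za-z]{2})[\s/_-]*([A-Za-z]{2})/?$")


def normalize(identifier: str) -> str:
    """Return a 'll-CC' form for inputs such as 'enUS', 'en US', or 'en /US/'."""
    match = _PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognized language/country notation: {identifier!r}")
    language, country = match.groups()
    return f"{language.lower()}-{country.upper()}"
```

The sketch covers only the geographical combination; the temporal, sociolinguistic, and stylistic attributes proposed in the document would need a richer structure than a two-part string.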
[June 24, 2001] Report on Language Codes Workgroup Recommendations. Report from the "Language Codes Working Group" formed at the Santa Barbara Language Digitization Workshop. By Gene Gragg (Moderator). "The mandate of this workgroup was probably the simplest and most concrete of all the workgroup mandates. It was to formulate recommendations on: (1) Individual language tags, (2) Tags for language groups and families. In a way the group may almost be viewed as a subcommittee of the metadata workgroup, to the extent that it was charged with providing a recommendation for a controlled vocabulary for the value of the attribute lang in the various places this attribute appears, e.g., in the OLAC language metadata protocol. Convincing arguments for the inadequacy of present standards, specifically ISO 639, and the need for a comprehensive and officially accepted set of tags, we felt, were given in the electronic preprints submitted to members of the workgroup in anticipation of the workshop... General Recommendation: Universal Language Code Consortium (ULCC). In the absence of a previously existing or better designation, we propose to refer to the set of language tags as the 'universal language code' (ULC). We propose that an international consortium of linguistics-related groups and individuals be formed as a body which would be responsible for sanctioning such an inventory of codes, and to which proposals for additions and corrections would be submitted..." It was proposed that the group would contact ISO about a four-letter language-code scheme. Doug Whalen (Haskins Laboratories at Yale) also wrote in a general report: "The Language Codes working group decided to form a consortium to recommend language names and codes. The consortium will work with SIL on improving coverage, and will also work with the Unicode and ISO groups to make this work. 
ISO apparently wants the full coverage of languages to use a four-letter code, since they have already used their three-letter code for the woefully inadequate 200 language list. My own recommendation was that we could accommodate that by having the first character indicate whether the language is Ancient (Axxx), Basic (Bxxx), or Constructed (Cxxx). This would allow us to deal with Hittite and Klingon in a principled way, and would make the SIL codes essentially remain at three letters. In addition, the consortium will work on a way of implementing alternate trees for language families, since these are not agreed upon in general..." See "The Digitization of Language Data: The Need for Standards." Linguist List Workshop, June 21-24, 2001. Santa Barbara, California. Also posted to the E-MELD Language Codes List. [cache, Gragg report, file collection]
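The prefix idea in Whalen's recommendation (Axxx, Bxxx, Cxxx) can be sketched as below. This is a hypothetical illustration, not a scheme that was adopted: the function names are invented here, and the three-letter codes used with them are placeholders rather than real SIL or ISO assignments.

```python
from typing import Tuple

# Category letters as named in the recommendation quoted above.
CATEGORIES = {"A": "Ancient", "B": "Basic", "C": "Constructed"}
_LETTER_FOR = {name: letter for letter, name in CATEGORIES.items()}


def to_four_letter(category: str, three_letter: str) -> str:
    """Prefix a three-letter code with its category letter, e.g. 'A' + 'xyz'."""
    if len(three_letter) != 3 or not three_letter.isalpha():
        raise ValueError(f"expected a three-letter code, got {three_letter!r}")
    return _LETTER_FOR[category] + three_letter.lower()


def parse_four_letter(code: str) -> Tuple[str, str]:
    """Split a four-letter code into (category name, underlying three-letter code)."""
    if len(code) != 4 or code[0] not in CATEGORIES or not code[1:].isalpha():
        raise ValueError(f"not a valid four-letter code: {code!r}")
    return CATEGORIES[code[0]], code[1:]
```

Because the last three characters are carried through unchanged, an existing three-letter inventory stays recoverable by dropping the prefix, which matches the stated goal that the SIL codes "essentially remain at three letters".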
[June 24, 2001] "More on Language Codes. What's wrong with ISO-639?" Prepared for the Language Digitization Workshop (see above). "... The International Standards Organization has formulated a standard (ISO-639) which assigns one of 464 three-letter codes to languages. Since the ISO is the internationally-accepted body charged with setting standards, there are good reasons to follow its recommendations. There are problems with this, however. For administrative and historical reasons, linguists have had little input to the code-set, and it does not therefore describe the linguistic universe as we may see it. With such a small number of codes, the standard can obviously cover only a minority of the world's languages. To make up for this deficiency, codes have been assigned to cover the residue of language families whose members have not all been assigned individual codes (e.g., AFA 'Afro-Asiatic (Other)'). To deal with other unincluded languages, geographical groupings (e.g., CAI 'Central American Indian (Other)') have been assigned codes too..." [cache]
[May 11, 2001] "Language codes - information from ISO/TC37/SC2." By John Clews. Document Reference: SC22/WG20 N842. Update since the WG20 meeting, November 2000. "Since the meeting of ISO/TC37 (Terminology) in London in August 2000, and since ISO/IEC JTC1/SC22/WG20's last meeting, Gerhard Budin (Austria) had taken over as Chair of ISO/TC37/SC2 from Aat Vervoorn (Netherlands)... Future plans by ISO/TC37/SC2/WG1: ISO/TC37/SC2/WG1 N69 'Coding systems' (2001-01-31) by Haavard Hjulstad (convener of ISO/TC37/SC2/WG1) describes this NWI, which has now been approved by ISO CS. Currently, three (closely interlinked) projects are planned. (1) Development and maintenance of a database of language coding, (extracts of) which should be freely available on the web. (2) Adding to this those languages that are currently not included in ISO 639-1 or ISO 639-2, without assigning standardized identifiers. (3) Development of an International Standard for coding mechanisms for language variation, including variation through time, geographically determined dialectal variation, writing system, etc. Comment on 1 and 2: the UK is concerned that insufficient information is proposed. Currently, ISO 639-1 contains 180 codes, and ISO 639-2 contains 438 entries. As at 2001-01-31, the database currently contains 493 entries. This compares with SIL (7,000 codes) and the Linguasphere Register (around 70,000 codes). Subsetting information from either or both of these sources would be a better basis. 
Comment on 3: this aims to regulate the possible language combinations where ISO 639 codes can be combined with codes from other sources, e.g., from ISO 3166: Codes for representation of names of countries, and from the draft standard ISO 15924: Codes for representation of names of scripts, and potentially other standards too, to provide codes such as 'en US' = 'English in the USA', 'en CA' = 'English in Canada', 'en US-CA' = 'English in the state of California'; or 'ku Cyrl' = 'Kurdish in Cyrillic script', and 'ku RU Cyrl' = 'Kurdish in Russia in Cyrillic script'. The paper also suggests that standardized mechanisms should be developed to specify, e.g., 'English in North America' or 'English in southern California', and possibly to identify dialects, and a mechanism to specify linking of the ISO 639-2 code 'sgn' = 'Sign languages' with other elements in order to specify specific sign languages. Also the possibility of adding codes for groups of languages would be investigated: currently this is a partial but not systematic part of ISO 639-2..." See: "Future plans by ISO/TC37/SC2/WG1." 2001-05-16. [cache]
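The combined codes quoted in this entry ('en US', 'en US-CA', 'ku RU Cyrl') can be pulled apart by the shape of each element. The sketch below assumes the space-separated notation used in those examples and classifies parts purely by letter case and length; the function name and the dictionary keys are hypothetical, and no code is checked against the actual ISO 639, ISO 3166, or ISO 15924 lists.

```python
import re
from typing import Dict


def parse_combined(code: str) -> Dict[str, str]:
    """Classify the space-separated parts of a code such as 'ku RU Cyrl'."""
    parts = code.split()
    if not parts or not re.fullmatch(r"[a-z]{2,3}", parts[0]):
        raise ValueError(f"missing or malformed language code in {code!r}")
    result = {"language": parts[0]}
    for part in parts[1:]:
        if re.fullmatch(r"[A-Z]{2}", part):
            # Two capitals: shaped like an ISO 3166-1 alpha-2 country code.
            result["country"] = part
        elif re.fullmatch(r"[A-Z]{2}-[A-Z0-9]{1,3}", part):
            # Country plus subdivision, shaped like an ISO 3166-2 code.
            result["country"] = part[:2]
            result["subdivision"] = part
        elif re.fullmatch(r"[A-Z][a-z]{3}", part):
            # Capital plus three lowercase letters: shaped like an ISO 15924 script code.
            result["script"] = part
        else:
            raise ValueError(f"unrecognized element {part!r} in {code!r}")
    return result
```

Shape-based classification works for these examples only because the three code lists happen to differ in length and case; the entry's point is that a standardized combination mechanism was still missing.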
[April 19, 2001] "Implementing the ISO 639-2 code for Sign Languages." By Michael Everson. "Linguists have long recognized that Sign Languages are true languages, and the world's Sign Languages, used by Deaf and hearing people, have been provided with an identifying code in ISO 639-2, the International Standard which specifies 3-letter codes to identify the names of languages. The code is a single 3-letter code, sgn. As necessary, other codes may be appended to that code (according to clause 4.4 of ISO 639-2) to specify different Sign Languages. Such extensions cannot be added to ISO 639-2 itself, as it does not register extended codes. However, extended codes may be registered with IETF according to RFC 3066 when warranted. The list of registered extended codes is available from IANA and an index to it is available. Most of the Sign Languages in the tables below can be identified by the country in which they are used, by appending the 2-letter country code from ISO 3166-1. A number of them additionally require one of the regional extensions specified in ISO 3166-2 (where more than one Sign Language occurs in a country). A few of the extensions are language codes taken from ISO 639-2; these are used where geographical delimitation is not feasible..."
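The extension mechanism described in this entry can be sketched as a small tag builder. This is a minimal illustration, assuming the hyphen-separated subtag syntax of RFC 3066 (subtags of one to eight ASCII letters or digits); the function name is hypothetical, and the sketch does not check whether a given extended tag has actually been registered with IANA.

```python
import re

# RFC 3066 subtags after the primary one: 1 to 8 ASCII letters or digits.
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def sign_language_tag(*extensions: str) -> str:
    """Append extension subtags (e.g. a country code) to the 'sgn' code."""
    subtags = ["sgn"]
    for extension in extensions:
        if not _SUBTAG.fullmatch(extension):
            raise ValueError(f"invalid subtag: {extension!r}")
        subtags.append(extension)
    return "-".join(subtags)
```

For example, `sign_language_tag("US")` yields `sgn-US`, and passing a second argument adds a regional or language extension in the same way.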
[February 15, 2001] ISO/TC37/SC2/WG1, 'Coding systems'. Subject: Report of the Eighth Meeting of ISO/TC37/SC2/WG1 [London, 2000-08-16]. Reference: ISO/TC37/SC2/WG1N70. Prepared by: Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1). Date: 2001-02-15. 2 pages. "Document ISO/DIS 639-1. The report of voting was studied (document N64R1). Some issues were discussed. The discussion has resulted in a new version of the report of voting (document N68). In particular, the following issues were decided: (1) The terminology (use of 'code', 'language identifier', etc.) in the DIS will be retained. It is noted that there is a difference in the terminology in ISO 639-2, but the Working Group found ISO/DIS 639-1 to be in order in this respect. (2) The document will be circulated for FDIS ballot following the updates at the meeting, and the inclusion of language identifiers that are under discussion in the JAC. (3) Variant names will be included, but this should be done carefully. Most languages will have one name only in each column. (4) A separate table in the alphabetical order by indigenous name will be added. (5) An annex will be added listing the differences between ISO 639:1988 and ISO 639-1..." [cache]
[February 09, 2001] ISO 639 Database. Prepared by Håvard Hjulstad (Rådet for teknisk terminologi). Updated 2001-02-09 or later. The database was described in SC22/WG20 N835: "Language coding database. The convener of TC37/SC2/WG1 has already developed an internal database (using MS Access 97) which includes all items that are given standardized identifiers in ISO 639-1 and ISO 639-2, items that have been proposed for inclusion in the standard, and some (more or less random) additional items..." The document references various resources, including an ISO 639 database = zip file of an MS Access database. Document being finalized for ISO/FDIS 639-1. The English text is final, but not proofed. The French text needs updating. This document does not contain the tables. The following documents are all generated on the basis of the 639 database, and will (following any updates to the database on the basis of decisions in the Joint Advisory Committee) be used to make the tables in the final document: Table 1: alphabetical by English name. Table 2: alphabetical by French name; Table 3: alphabetical by indigenous name; Table 4: alphabetical by language identifier; Table B1: changes from ISO 639:1988 to ISO 639-1:2001. Cache examples: ISO/FDIS 639-1:2001 (Table 1); the database; tables
[January 31, 2001] Language coding. Reference: SC22/WG20 N835. Document ISO/TC37/SC2/WG1 N 69, ISO/TC37/SC2/WG1, "Coding systems." Prepared by Håvard Hjulstad (convener of ISO/TC37/SC2/WG1). Date: 2001-01-31. "This document is a response to the decisions made during the TC37/SC2 meetings in London, August 2000. The convener of SC2/WG1 was asked to look into the feasibility to standardize language coding beyond the current coding in ISO 639-1 and ISO 639-2, and to make concrete proposals for work items to be carried out within TC37/SC2/WG1... This document describes three (closely interlinked) projects. The third project only aim directly for International Standard. [We may need] two new parts of ISO 639: (1) ISO 639-3: A linguistically based hierarchical structure of the language entries that are included in ISO 639-1 and ISO 639-2 with, e.g., four-letter identifiers for nodes in the structure (i.e., language families and groups), and (2) ISO 639-4: An International Standard specifying mechanisms for the identification of variants of languages, including geographical variants, temporal variants, and variants relating to writing system..." See alt URL, [cache; DK cache]
[December 15, 2000] "Language Identification in Metadata Descriptions of Language Archive Holdings." By Gary F. Simons (SIL International). Paper presented at the workshop on Web-Based Language Documentation and Description, 12-15 December 2000, Philadelphia, USA. "Uniform identification of languages is a foundational requirement within the metadata of language archives. This paper discusses the problems that a system of language identification must solve, and then proposes that the system of three-letter identification codes used in the Ethnologue offers a complete and open solution to those problems. The paper goes on to describe what SIL International is contributing to the infrastructure for open language archiving so that this system of identifiers can serve the language archives community as the standard for language identification in metadata..." [cache]
[November 1, 2000] Language codes. From Document ISO/IEC JTC1 SC22/WG20 N786. November 1, 2000. Meeting #19 - Internationalization [October 30 - November 1, 2000], Agenda. "John explains a chaotic situation in ISO. 2 letter codes in ISO 639. Part 1 is being updated, presently in FDIS state. TC37 is responsible for part 1. TC46 is responsible for 639-2 and 639-3, for 2 letter codes and 3 letter codes respectively. TC46 has no secretariat. IETF has RFC 1766 for 2 letter codes. There are not enough languages covered by ISO 639 codes. (US MARC is equivalent to 639-2). SIL has many more codes in the Ethnologue - about 7000 language codes. N777 describes the plan of IETF for a new RFC. HTML 4.0 in W3C also considers additional language tags. Private names are also in use. The new internet draft will allow for standardized tags for Ethnologue name with private sub-tags. Cross mapping will be complex or impossible. WG20 should NOT get involved in the definition of these codes, we should reference the RFC when it is approved. US supports the SIL proposal..." [cache]
[October 04, 2000] "Status of the Work on the New ISO/IEC 18022 Identification Mapping, and IT-enablement of Standards for Widely Used Coded Value Domains." Document: N85. From Dr Jake Knoppers for ISO/IEC JTC 1/SC 32. See 'IT-enablement and Language Codes' (communiqué of August 9, 2001). [source]
[October 02, 2000] "Approach to development of the new ISO/IEC 18038 Identification and mapping of various categories of jurisdictional domains." Document N86. From Dr Jake Knoppers for ISO/IEC JTC 1/SC 32. See 'IT-enablement and Language Codes' (communiqué of August 9, 2001). [cache]
[September 20, 2000] "Language codes: Report to ISO/IEC JTC1/SC22/WG20." By John Clews. Reference: SC22/WG20 N780. September 20, 2000. Overview: "There is a certain amount of incompatibility in relation to standards for language coding. I would recommend that JTC1/SC22/WG20 members look at Peter Constable's recent well-argued paper at the International Unicode Conference for further clarification of the issues. Hopefully information on accessing this paper will be passed to the JTC1/SC22/WG20 convener shortly, for distribution. In addition, the actual ISO standards process seems not to be able to deliver the amount of codes that many IT vendors will require in a globalised market. The report below also looks at some areas of incompatibility that might impact on JTC1/SC22/WG20 standards..." [cache]
[July 24, 2000] Current Status of ISO 639-1 Tables. ISO/TC37/SC2/WG1. Document Reference: ISO/TC37/SC2/WG1 N 65. Date: 2000-07-24. "The following three tables are: (1) The finalized items in ISO 639-1 in alphabetical order by language identifier, i.e., the current version of table 3 in the DIS document. (2) The items that are still in 'Annex C' in alphabetical order by the English name, i.e., the current version of table C.1 in the DIS document. Note: This annex will not be included in the next version of the document. (3) All changes from ISO 639:1988 to the current version of 639-1, in alphabetical order by the (current) language identifiers. We may decide to include this information in a new annex to 639-1..." [cache]
[July 24, 2000] Comments on ISO/DIS 639-1. Reference: ISO/TC37/SC2/WG1 N 064. 2000-07-24. Comments by the Convener [Håvard Hjulstad]: "A number of submitted comments relate to the inclusion of specific languages in the tables, and to the language identifiers that have been assigned. These issues are dealt with by the Joint Advisory Committee (JAC) to ISO 639-1 and ISO 639-2. No such issues can be finalized by this commenting process; they will all be deferred to the JAC for further study. Commenters may be contacted by the JAC and requested to submit further information. Some submitted comments relate to the harmonization of the names of individual languages in ISO 639-1 and ISO 639-2. This issue is undergoing a special study by the JAC. All such comments will be submitted to the JAC. Some submitted comments relate to procedural matters that need to be harmonized with ISO 639-2 and the JAC. Such comments will be submitted to the JAC and to the two Registration Authorities for the two parts of ISO 639..." [cache]
[July 05, 2000] "ISO 639-1 and 639-2: Items with different English or French names." Reference: ISO/TC37/SC2/WG1 N66. By Håvard Hjulstad. 2000-07-05. "Please respond within 2000-08-31 (and preferably before the ISO/TC37 meeting 2000-08-14). The items below have different English or French names in my database (which I hope is up to date with the current documents). We probably want to co-ordinate the names, even though the English and French names are not object for standardization in either of the parts of ISO 639..." [cache]
[June 02, 2000] "Status report on progress of development in the new standard 'Identification, mapping and IT-enablement of standards for widely used coded value domains'." Document: N84. From Dr Jake Knoppers for ISO/IEC JTC 1/SC 32. See 'IT-enablement and Language Codes' (communiqué of August 9, 2001). [source]
[May 22, 2000] "ISO 639: Language codes. Report to ISO/IEC JTC1/SC22/WG20." By John Clews. Reference: SC22/WG20 N759. 2000-05-22. Updated, 22 May 2000, from report to CEN/TC304 in (TC304.2292). 8 pages. "I represented CEN/TC304 at the ISO 639 Joint Advisory Committee in Washington in February 2000, which brings together experts from ISO/TC37/SC2 and ISO/TC46/SC4, who have developed the two parts of ISO 639 (Language codes). This is a report of that meeting, and on related issues that have surfaced subsequently. It also includes comments by Martin Duerst (W3C) and Erkki Kolehmainen (CEN/TC304) on some related issues..." [cache]
[February 22, 2000] Update and Request for ISO 639 Language Candidates. [Potential future candidates for new ISO 639 codes; larger languages.] Posting by John Clews. "ISO 639 tends to provide codes only for the larger languages, although it still needs to provide codes for several larger languages (by number of speakers). As a rough guide, I am aiming to ensure that codes will be provided in due course for most distinct languages where there are a million or more speakers... I plan to send the Linguist List a report of the meeting of the ISO 639 Joint Advisory Committee later. Codes for a few additional languages were added at this meeting: the main discussions were on clarifying some procedural issues, that should allow for much more rapid addition of codes in the future. Meanwhile, I would be glad if any of you could comment on the list below: the Foundation for Endangered Languages plans to submit an application for new codes for some of the larger languages below to be added to ISO 639-2, based on the following list. There's no suggestion that these languages are endangered: just that it would be useful to provide ISO 639 3-letter codes for at least some of them, and some more information on these languages would be helpful to present to the ISO 639 Joint Advisory Committee... This list runs broadly from East through West, from China through Europe. The addition of further major languages of the Americas is not proposed here, as ISO 639 covers most larger languages of the Americas fairly well already..." [cache]
[February 22, 2000] Liaison statement from JTC1/SC22/WG20 to ISO/TC37 on ISO 639, parts 1 and 2. From Rebecca S. Guenther (Chair of ISO 639/Joint Advisory Committee). Document Reference: SC22/WG20 N736. Request from WG20 [9-February-2000, Keld Simonsen]: "WG20 considers that stability of language codes is important. WG20 notes some reassignment of codes in recent years (e.g. 'he' to 'iw' for Hebrew) in ISO 639-1. WG20 asks for confirmation that no more code reassignments occur, in order to avoid conflicts in implementations." [Response:] "The Committee very strongly agrees about stability of codes. One principle agreed upon was that codes should not be changed. For those that have been changed in the past, the former codes shall not be reassigned. In addition, their former use will be documented and identified as discontinued..." [cache]
[January 2000] "Making standards work in electronic commerce and among jurisdictions: IT-enablement of data element-based standards." Document: N81. From Dr Jake Knoppers for ISO/IEC JTC 1 / SC 32. See 'IT-enablement and Language Codes' (communiqué of August 9, 2001). [source]
[August 1999] "ISO 639-1 and ISO 639-2: International Standards for Language Codes. ISO 15924: International Standard for names of scripts." By John D. Byrum (US Library of Congress). Paper presented at 65th IFLA Council and General Conference, Bangkok, Thailand, August 20 - August 28, 1999. "The author describes two international standards for the representation of the names of languages. The first (ISO-639-[1]) published in 1988 provides two-letter codes for 136 languages and was produced primarily to meet the terminological needs. The second (ISO 639-2) appeared in late 1998 and includes three-letter codes for 460 languages. This list addresses terminological needs but also for bibliographic applications. For this reason, 639-2 is covered in detail. Its features are explained, and principles and policies used for development of this code list are presented. Additionally, the author describes the governance mechanism established to maintain ISO 639-[1] and ISO 639-2. Also presented is a brief summary regarding a project in progress to provide codes for names of scripts and when completed to result in publication of ISO 15924. The paper concludes that 'the emergence of an international standard for language codes and of the developing international standard for script codes is a major contribution to Universal Bibliographic Control as these code lists enable of important information regarding the nature of publications represented by records to be communicated and shared unambiguously, efficiently, and internationally'." [cache, text only]
[August 05, 1998] "Horizontal issues and encodable value domains in electronic commerce: Non-technical summary and real world examples to supplement BT-EC report." Document: N83. From: Canadian National Body to ISO/IEC JTC 1 / SC 32. See 'IT-enablement and Language Codes' (communiqué of August 9, 2001). [source]


Document URI: http://xml.coverpages.org/languageIdentifiers.html
Robin Cover, Editor
