Several user communities are now floating proposals to increase the number of standardized language codes beyond the 200-400 range found in current ISO standards. A new work item approved by ISO earlier in 2001, for example, addresses the need for an International Standard with mechanisms for encoding language variation in terms of time, geography, dialect, writing system, and so forth. An initial proposal calls for codes that represent a language along at least five axes: "geog (geographical specification), script (writing system), temp (temporal specification), socli (sociolinguistic specification), and style (stylistic specification)." Other draft proposals call for the adoption of schemes that identify 7,000 or even 70,000 languages and dialects.
As the mass of networked digital information grows ever larger and more easily accessible, demand increases for a taxonomy of human languages adequate to support language data classification, categorization, and linguistic annotation. It is now widely recognized that the ISO standards providing "codes for the representation of names of languages" (ISO 639, ISO/FDIS 639-1, ISO 639-2) are inadequate to meet the requirements of users in new application domains. The need for better language description facilities is felt as urgent among digital librarians and archivists seeking to classify and linguistically annotate materials representing minority languages; others worry about the emergence of de facto standards that conflict with the work of registered standards bodies.
Language identification is of critical importance to markup because a wide range of specifications, including markup metalanguages (SGML, XML) and most markup language applications, document the use of language codes to assist in machine processing of text.
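
As a small illustration of language codes assisting machine processing of marked-up text, the following sketch groups paragraph text by the language declared in the XML `xml:lang` attribute, with child elements inheriting the value from their ancestors. It is only a sketch under stated assumptions: the sample document, the element names (`doc`, `div`, `p`), and the helper `texts_by_language` are hypothetical and are not taken from any of the standards discussed here.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace; ElementTree exposes it
# under this expanded attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, using two-letter codes for illustration only.
SAMPLE = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="fr">Bonjour</p>
  <div xml:lang="de"><p>Guten Tag</p></div>
</doc>"""


def texts_by_language(xml_text):
    """Group the text of each <p> element by its effective language code."""
    groups = {}

    def walk(element, inherited):
        # An element without its own xml:lang inherits the ancestor's value.
        lang = element.get(XML_LANG, inherited)
        if element.tag == "p" and element.text:
            groups.setdefault(lang, []).append(element.text.strip())
        for child in element:
            walk(child, lang)

    walk(ET.fromstring(xml_text), None)
    return groups


if __name__ == "__main__":
    print(texts_by_language(SAMPLE))
```

A processor built along these lines can route each group to language-specific tools (spell checking, hyphenation, indexing), but it can only distinguish the languages for which the code list in use supplies an identifier.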
To raise interest in this topic and awareness of its importance for markup language design, I have prepared a reference document, "Language Identifiers in the Markup Context," with summaries of the major standards and emerging initiatives.
The document describes, with references, the standards that authorize the use of language codes, as well as the standardized language identifier listings themselves. In overview:
- Introduction
- Language Code Listings
  - ANSI/NISO Codes for the Representation of Languages for Information Interchange
  - Ethnologue
  - IETF RFCs
  - ISO 639
  - Linguasphere Project
  - E-MELD Language Codes Workgroup
  - Linguist List Genetic Classification Coding Scheme
  - MARC Code List for Languages
- Use of Standard Code Lists
  - SGML (Standard Generalized Markup Language)
  - XML (Extensible Markup Language)
  - HTML (Hypertext Markup Language)
  - TEI (Text Encoding Initiative Guidelines)
  - EAD (Encoded Archival Description)
  - Corpus Encoding Standard
- Language Tagging in Unicode
- Language Tags and Operating Systems
- General References
Principal references:
- "Language Identifiers in the Markup Context" - Main reference page.