A document prepared by Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1 'Coding systems') outlines a number of important language encoding initiatives to be undertaken within the framework of ISO/TC37/SC2/WG1. The language identification codes of ISO 639-1 (alpha-2 code) and ISO 639-2 (alpha-3 code) were designed to meet the needs of terminology and library applications, but are judged inadequate as a basis for language-based text processing within Information and Communication Technology (ICT) industries. XML 1.0 Second Edition normatively references RFC 3066 ("Tags for the Identification of Languages"), which relies upon the ISO 639 language codes. The Convener notes a recognized need to "expand the current set of language identifiers and language identification mechanisms greatly; there may be a need for identifiers for 15-20 times as many linguistic units as the current [code] tables provide." The document identifies eleven candidate projects:
(1) a model for language identification [definitions for 'language', 'individual language', 'language variant', 'dialect'];
(2) language identification structure [geographical variation; variation as to script, writing system, and orthography; temporal variation; stylistic variation];
(3) linguistic unit description format;
(4) description of linguistic units and default values [script, orthography, geographical area];
(5) resolution of problems in current code tables;
(6) further development of ISO 639-1 and ISO 639-2;
(7) hierarchical language identifiers [language group identifiers];
(8) additional individual language identifiers [5000-7000 needed];
(9) geographical coordinate information;
(10) topic mapping project;
(11) mapping with other language identification code sets [e.g., Ethnologue and Linguasphere Register].
Bibliographic information: "Future Development of ISO 639." By Håvard Hjulstad (Convener of ISO/TC37/SC2/WG1 'Coding systems'). Document reference: ISO/TC37/SC2/WG1 N89. Date: 2002-03-04. 4 pages.
The XML connection: See the XML 1.0 Second Edition specification, Section 2.12 (as emended in the 'E29 Substantive erratum'; see "Errata as of 2002-02-20" in "XML 1.0 Second Edition Specification Errata"). It reads, with respect to the reserved xml:lang attribute: "The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor." The RFC itself cites ISO 639 as the principal authority for the rules governing the 'Primary-subtag' in the language tag syntax: "All 2-letter subtags are interpreted according to assignments found in ISO standard 639, 'Code for the representation of names of languages' [ISO 639], or assignments subsequently made by the ISO 639 part 1 maintenance agency or governing standardization bodies. (Note: A revision is underway, and is expected to be released as ISO 639-1:2000). All 3-letter subtags are interpreted according to assignments found in ISO 639 part 2, 'Codes for the representation of names of languages -- Part 2: Alpha-3 code [ISO 639-2]', or assignments subsequently made by the ISO 639 part 2 maintenance agency or governing standardization bodies..." See also E11 for the "RFC 1766 / RFC 3066" update. Language-sensitive processing of SGML-encoded text [ISO 8879] also references ISO 639.
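The primary-subtag rules quoted above can be illustrated with a small sketch. It checks only the shape of an xml:lang value and reports which authority governs its primary subtag. It does not look the code up in any ISO 639 table. The function name classify_primary_subtag and the returned labels are hypothetical. The treatment of "i" (IANA registrations) and "x" (private use), and the 1-8 character subtag grammar, are not quoted in the passage above and are assumed here from RFC 3066 itself.

```python
import re

# Shape of an RFC 3066 language tag (assumed from the RFC's grammar):
# a primary subtag of 1-8 letters, then zero or more subtags of
# 1-8 letters or digits, separated by hyphens.
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def classify_primary_subtag(tag: str) -> str:
    """Return a label naming the authority for the tag's primary subtag.

    Hypothetical helper: validates syntax only, not code assignments.
    """
    if not _TAG_RE.fullmatch(tag):
        raise ValueError(f"not a well-formed RFC 3066 language tag: {tag!r}")

    # Tags are case-insensitive, so normalize before inspecting.
    primary = tag.split("-", 1)[0].lower()

    if len(primary) == 2:
        return "ISO 639-1"          # 2-letter subtags: ISO 639 part 1
    if len(primary) == 3:
        return "ISO 639-2"          # 3-letter subtags: ISO 639 part 2
    if primary == "i":
        return "IANA registration"  # assumed from RFC 3066, not quoted above
    if primary == "x":
        return "private use"        # assumed from RFC 3066, not quoted above
    return "unassigned"


if __name__ == "__main__":
    for value in ("en-US", "nob", "i-klingon", "x-foo"):
        print(value, "->", classify_primary_subtag(value))
```

Because the check is purely syntactic, a well-formed but unassigned two-letter value would still be labeled "ISO 639-1". A real validator would also need the current code tables, which is where the expansion proposed for ISO 639 would have its effect.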
Principal references:
- "Future Development of ISO 639." Word/.DOC source
- "Tags for the Identification of Languages." IETF Network Working Group, Request for Comments: 3066.
- "Language Identifiers in the Markup Context" - Main reference page.