Report on Language Codes Workgroup Recommendations

The mandate of this workgroup was probably the simplest and most concrete of all the workgroup mandates. It was to formulate recommendations on:

Individual language tags
Tags for language groups and families

In a way the group may almost be viewed as a subcommittee of the metadata workgroup, to the extent that it was charged with providing a recommendation for a controlled vocabulary for the value of the attribute “lang” in the various places this attribute appears, e.g., in the OLAC language metadata protocol.

We felt that convincing arguments for the inadequacy of present standards, specifically ISO 639, and for the need for a comprehensive and officially accepted set of tags were given in the electronic “preprints” submitted to members of the workgroup in anticipation of the workshop, notably in the article by Peter Constable and Gary Simons, “Language Identification and IT: Addressing Problems of Linguistic Diversity on a Global Scale”. We therefore did not address this most general issue, but rather turned our attention to the question of implementation.

As a result of our discussions we came up with one overarching, and rather sweeping, recommendation, plus a fairly specific recommendation for individual language tags, and a series of more tentative suggestions for tags for language groups and families.

General Recommendation: Universal Language Code Consortium (ULCC)

In the absence of a previously existing or better designation, we propose to refer to the set of language tags as the “universal language code” (ULC). We propose that an international consortium of linguistics-related groups and individuals be formed as a body which would be responsible for sanctioning such an inventory of codes, and to which proposals for additions and corrections would be submitted. We recommend:

· That this consortium be as international as possible. To begin with, we presuppose that LL and SIL, both represented at this workshop, would want to join. In addition, we propose that representation be invited from the major national and international linguistics societies. (The question of membership by representatives of for-profit corporations was raised, but not discussed in any detail.)

· That the experience and practice of existing standards-related consortia be consulted for examples to be followed and pitfalls to be avoided.

Individual Language Tags

The full form of an individual language tag may eventually have the structure “prefix ABC suffix”. In this structure:

· “ABC” is the core language tag. We agree with the principle, expressed also by Constable & Simons, that “language” be operationally defined by lack of mutual intelligibility with any other tagged speech variety. Although the determination of mutual intelligibility is not always a trivial task “on the ground”, for purposes of linguistic research no other basis for classification makes sense. Care must be taken in tag documentation to indicate how this determination was made. (Other criteria, such as nationality, script, ethnicity, etc., could of course be used to define other sets of language tags.)

Ethnologue: Languages of the World (ed. Barbara Grimes, 2000; Dallas: SIL International), now in its 14th edition, represents the most complete and consistent application of the mutual intelligibility principle to the world’s living languages. SIL International has graciously offered to make the Ethnologue codes available to the proposed LL metadata server. These language codes will be supplemented by codes, to be created by LL, for all the attested extinct languages. As additions, mergers (for cases of over-differentiation), and splits (for cases of under-differentiation) within the tag set are proposed by workers in the field, many of whom will of course be SIL-affiliated, we anticipate that LL (ULCC) and SIL will broadly agree about whether to incorporate them into their respective metadata vocabularies.

For purposes of consistency, the ISO three- (and eventually four-)letter codes will be kept where they represent languages in the sense of the metadata vocabulary. We also propose to observe the standard namespace extension mechanism proposed by ISO, whereby non-ISO codes can be given the extension “x-[NAME]:” – thus “x-sil:” for Ethnologue codes, “x-ll:” for the extinct language codes added by LL, and eventually “x-ulc:”. For persistence of reference, once a code is assigned it should not be reused with a new reference, even if it is dropped (through a merger or split) from the officially supported list.

It is obvious that the use of three (or even four) letters to represent more than six thousand objects makes it impossible for the language codes to be completely mnemonic, and many codes will not be mnemonic at all. It should be emphasized, however, that these codes are designed principally for metadata and searching, and will be hidden within markup tags. In the displayed text itself, researchers will use whatever language designation seems most appropriate given the scope and intended audience of the document.

· “prefix” could contain information about family/group membership – no specific further recommendation was made about this part, which would be left to further discussion.

· “suffix” would contain information about variety and dialect. It could be formally realized in any one of a number of ways: for example, as “ABC/suffix”, as the value of a “refine” attribute (“refine=’suffix’”), or perhaps by some other means. A precise recommendation remains to be made. As opposed to the core language tags, these variety/dialect designations would not be directly sanctioned by the consortium. However, a mechanism for registering variety/dialect codes proposed by individual investigators could be arranged in, for example, the OLAC metadata server, in such a way that use of the same code for distinct varieties or dialects of a single language could be avoided.
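
As an illustration only, the tag conventions discussed in this section could be sketched in Python as follows. The sketch assumes one possible concrete syntax, “x-NAME:ABC/suffix”, in which a code without a namespace is treated as an ISO code. The names parse_tag and TagRegistry, the choice of the “/suffix” form, and the placeholder code “qaa” are hypothetical and are not part of any recommendation. The sketch shows two of the constraints described above: a core code, once assigned, is never reused, even after it has been dropped; and the same variety code cannot be registered twice under a single language.

```python
import re

# Assumed syntax (not a sanctioned format): an optional "x-NAME:" namespace,
# a three- or four-letter core code, and an optional "/suffix" for a variety.
TAG_RE = re.compile(
    r"^(?:x-(?P<ns>[a-z]+):)?(?P<code>[a-z]{3,4})(?:/(?P<variety>[a-z0-9]+))?$",
    re.IGNORECASE,
)


def parse_tag(tag):
    """Split a tag into (namespace, core code, variety or None)."""
    m = TAG_RE.match(tag.strip())
    if m is None:
        raise ValueError(f"not a recognizable language tag: {tag!r}")
    ns = (m.group("ns") or "iso").lower()  # assumption: no namespace means ISO
    variety = m.group("variety")
    return ns, m.group("code").lower(), variety.lower() if variety else None


class TagRegistry:
    """Hypothetical registry enforcing persistence of reference."""

    def __init__(self):
        self._codes = {}      # (ns, code) -> [referent, still supported?]
        self._varieties = {}  # (ns, code, variety) -> description

    def assign(self, tag, referent):
        ns, code, variety = parse_tag(tag)
        if variety is not None:
            raise ValueError("assign() expects a core tag without a suffix")
        if (ns, code) in self._codes:
            # Covers retired codes too: a dropped code keeps its old reference.
            raise ValueError(f"{tag!r} was already assigned and cannot be reused")
        self._codes[(ns, code)] = [referent, True]

    def retire(self, tag):
        """Drop a code from the supported list without forgetting it."""
        ns, code, _ = parse_tag(tag)
        self._codes[(ns, code)][1] = False

    def lookup(self, tag):
        """Return (referent, still supported?) for the core code of a tag."""
        ns, code, _ = parse_tag(tag)
        referent, active = self._codes[(ns, code)]
        return referent, active

    def register_variety(self, tag, description):
        ns, code, variety = parse_tag(tag)
        if variety is None:
            raise ValueError("register_variety() expects a tag with a suffix")
        if (ns, code) not in self._codes:
            raise KeyError(f"unknown core code in {tag!r}")
        key = (ns, code, variety)
        if key in self._varieties:
            raise ValueError(f"variety {tag!r} is already registered")
        self._varieties[key] = description


# Usage with placeholder values only.
registry = TagRegistry()
registry.assign("x-sil:qaa", "Placeholder language")
registry.register_variety("x-sil:qaa/north", "Placeholder northern variety")
registry.retire("x-sil:qaa")
print(registry.lookup("x-sil:QAA"))
```

Keeping retired codes in the same table as active ones is one simple way to guarantee that a dropped code can still be resolved and can never be handed out again; other designs are of course possible.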

Language Groups

A great deal more work needs to be done in the area of tags for language groups. Although there is frequently general agreement, for a given language, about a general family grouping (e.g., Romance, Semitic), and often a higher-ranked “super-family” (e.g., Indo-European, Afroasiatic), more detailed family-tree structure beyond that rapidly involves conflicting historical scenarios and views about the nature of language change. As interim measures we recommend the following:

· One or more working papers on historical groupings should be drawn up for circulation and comment.

· For the time being, a default family tree can be displayed in conjunction with the metadata server, perhaps based on that used by Ethnologue, which is generally reasonable and well-informed. Every display should, however, carry a header or footer caveat indicating that other views, differing more or less significantly, may be possible, perhaps with links to one or more alternate tree displays where these have been proposed.
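
The interim display described above might be sketched in Python as follows. The tree contains only the groupings named as examples in this section; the function names, the wording of the caveat, and the alternate-tree title are hypothetical.

```python
CAVEAT = (
    "Note: this is a default classification; other views, differing "
    "more or less significantly, may be possible."
)

# Only the groupings mentioned in this report; a real tree would be far larger.
DEFAULT_TREE = {
    "Indo-European": {"Romance": {}},
    "Afroasiatic": {"Semitic": {}},
}


def tree_lines(tree, depth=0):
    """Flatten a nested dict into indented lines, one per group."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(tree_lines(children, depth + 1))
    return lines


def render_display(tree, alternates=()):
    """Render the tree with the caveat as a header on every display."""
    parts = [CAVEAT, ""]
    parts.extend(tree_lines(tree))
    if alternates:
        # Stand-in for links to alternate tree displays, where proposed.
        parts.append("")
        parts.append("Alternate trees: " + ", ".join(alternates))
    return "\n".join(parts)


print(render_display(DEFAULT_TREE, alternates=["(hypothetical) alternate tree A"]))
```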

Respectfully submitted,

Language Codes Working Group

Gene Gragg, Moderator