More on Language Codes

What's wrong with ISO-639?

The International Organization for Standardization (ISO) has formulated a standard (ISO-639) which assigns one of 464 three-letter codes to languages. Since the ISO is the internationally accepted body charged with setting standards, there are good reasons to follow its recommendations. There are problems with this, however. For administrative and historical reasons, linguists have had little input into the code-set, and it therefore does not describe the linguistic universe as we may see it. With such a small number of codes, the standard can obviously cover only a minority of the world's languages. To make up for this deficiency, codes have been assigned to cover the residue of language families whose members have not all been assigned individual codes (e.g. AFA "Afro-Asiatic (Other)"). To deal with other unincluded languages, geographical groupings (e.g. CAI "Central American Indian (Other)") have been assigned codes too. Nor is the standard internally consistent: it has two versions, ISO 639-2/B for bibliographic applications and ISO 639-2/T for terminology applications, and they differ on 5% of the codes.
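
To make the practical cost of the B/T split concrete, here is a minimal Python sketch of how software exchanging ISO 639-2 codes might have to normalize them before comparing records. The table holds only a handful of well-known B/T pairs, chosen for illustration; it is not the full list of differing codes. The function names (to_terminology, same_language) are hypothetical and come from no standard or library.

```python
# A small illustrative sample of ISO 639-2 languages whose bibliographic (B)
# and terminology (T) codes differ. This is NOT the complete list.
B_TO_T = {
    "ger": "deu",  # German
    "fre": "fra",  # French
    "chi": "zho",  # Chinese
    "dut": "nld",  # Dutch
    "gre": "ell",  # Greek, Modern
    "wel": "cym",  # Welsh
}


def to_terminology(code: str) -> str:
    """Map a three-letter code to its T form (hypothetical helper).

    Codes not in the sample table, including collective codes such as
    "afa", are returned unchanged apart from case normalization.
    """
    code = code.strip().lower()
    return B_TO_T.get(code, code)


def same_language(a: str, b: str) -> bool:
    """Compare two codes after normalizing both to the T form."""
    return to_terminology(a) == to_terminology(b)
```

Under these assumptions, same_language("ger", "deu") is true even though the two strings differ, whereas a naive string comparison would treat them as two different languages. Every application that mixes bibliographic and terminological data needs a step of this kind.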

A competing - or perhaps complementary - standard is that of Ethnologue. From a linguistic point of view, this is a much more complete and coherent set of language codes than the one ISO has produced: its three-letter codes cover 6,703 languages, and it is based on internally consistent standards.

One of the most important questions we must address, then, is this one:

Questions about coding language groups

Ethnologue includes a complete statement of subgrouping information for each of the languages in its code-set, but there is no standard coding system which can be used for information interchange, or for relating different views of language relationships. We need to consider what a good coding scheme for linguists would be in this regard. As an example of what one possible coding scheme for a genetic classification of languages might be, we suggest you look at one which has been proposed for the LINGUIST database. In this area, you can help our task most by coming to the workshop prepared to answer these questions:

Genetic information, however, is just one type of information which a linguistic classification might care to code:

Finally, perhaps the most important issue we need to consider is this:

It is perhaps worth stating here that we are facing an interesting time in linguistics. Linguistic data is going to appear on the web in ever-larger quantities. We cannot simply continue to do things as we always have. If we do not institute digital standards, others will do so for us, and we can safely say that we will not generally like the result.