What's wrong with ISO-639?

The International Organization for Standardization (ISO) has formulated a standard (ISO-639) which assigns one of 464 three-letter codes to languages. Since ISO is the internationally accepted body charged with setting standards, there are good reasons to follow its recommendations. There are problems with doing so, however. For administrative and historical reasons, linguists have had little input into the code-set, and it therefore does not describe the linguistic universe as we may see it. With such a small number of codes, the standard can obviously cover only a minority of the world's languages. To make up for this deficiency, codes have been assigned to cover the residue of language families whose members have not all been assigned individual codes (e.g. AFA "Afro-Asiatic (Other)"). To deal with other unincluded languages, geographical groupings (e.g. CAI "Central American Indian (Other)") have been assigned codes too. Nor is the standard internally consistent: it has two versions, ISO 639-2/B for bibliographic applications and ISO 639-2/T for terminology applications, and they differ on 5% of the codes.
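
To make the B/T inconsistency concrete, here is a minimal sketch of the kind of lookup that software must carry just to recognize that two ISO 639-2 codes name the same language. The six pairs shown are a small sample of the codes that differ between the two versions, not the full list, and the function names are hypothetical.

```python
# A small sample of ISO 639-2 codes whose bibliographic (B) and
# terminology (T) forms differ. This is illustrative, not the full list.
B_TO_T = {
    "ger": "deu",  # German
    "fre": "fra",  # French
    "dut": "nld",  # Dutch
    "cze": "ces",  # Czech
    "wel": "cym",  # Welsh
    "gre": "ell",  # Modern Greek
}
T_TO_B = {t: b for b, t in B_TO_T.items()}


def to_terminology(code):
    """Return the T form of a code; codes without a distinct T form pass through."""
    code = code.lower()
    return B_TO_T.get(code, code)


def to_bibliographic(code):
    """Return the B form of a code; codes without a distinct B form pass through."""
    code = code.lower()
    return T_TO_B.get(code, code)


def same_language(a, b):
    """Compare two codes after normalizing both to their T form."""
    return to_terminology(a) == to_terminology(b)
```

Any system that exchanges data between a library catalogue and a terminology database needs a normalization step of this kind before it can compare codes at all.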

A competing - or perhaps complementary - standard is that of Ethnologue. From a linguistic point of view, this is a much more complete and coherent set of language codes than ISO has produced, for it has three-letter codes which cover 6,703 languages, and is based on internally consistent standards.

One of the most important questions we must address, then, is this one:

Should we as linguists choose the Ethnologue coding-scheme or that of ISO-639? The value of conforming to an international standard is always considerable, and in that regard ISO-639 is clearly preferable. But it would require major reworking to be useful to linguists.

Questions about coding language groups

Ethnologue includes a complete statement of subgrouping information for each of the languages in its code-set, but there is no standard coding system which can be used for information interchange, or for relating different views of language relationships. We need to consider what a good coding scheme for linguists would be in this regard. As an example of one possible coding scheme for a genetic classification of languages, we suggest you look at the one which has been proposed for the LINGUIST database. In this area, you can help our task most by coming to the workshop prepared to answer these questions:

Should a classification system be one which, to some degree, "knows" where in a family tree each language belongs? We should certainly not have to give an entire tree every time we mention a language; yet we do want to be able to extract all material which belongs to a particular subgroup. How would you implement such a system?
How can we best handle variant subgrouping? Whatever coding system we use, it should be able to represent variant trees in a family, rather than imposing one view on the entire community.
How should we handle subgroups as opposed to languages? By the standard historical method, any subgroup is simply a set of languages, all of which are derived from an earlier proto-language. Thus the node which defines a subgroup can also be seen as simply an earlier, extinct language. Vulgar Latin is in this sense equivalent to Proto-Romance. Should we therefore treat nodes in a family tree simply as languages?
How do we build a system of coding which can handle changing views of groupings? What happens when we add a subgroup, delete one, join families into macro-groupings?
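
As one possible starting point for discussion of the questions above, here is a minimal sketch, not a proposal. It assumes that languages and subgroups share a single identifier space, that each view of a family is stored separately as child-to-parent links, and that a language therefore carries only its own identifier while the tree is looked up on demand. The class name, method names, and the plain-word identifiers are all hypothetical; a real scheme would use agreed codes.

```python
from collections import defaultdict


class Classification:
    """One named view of a family tree, stored as child -> parent links."""

    def __init__(self, name):
        self.name = name
        self.parent = {}                  # node -> its immediate parent
        self.children = defaultdict(set)  # node -> its immediate children

    def add(self, child, parent):
        """Place child under parent, moving it if it was placed elsewhere."""
        old = self.parent.get(child)
        if old is not None:
            self.children[old].discard(child)
        self.parent[child] = parent
        self.children[parent].add(child)

    def lineage(self, node):
        """Return the ancestors of node, nearest first."""
        chain, seen = [], {node}
        while node in self.parent:
            node = self.parent[node]
            if node in seen:
                raise ValueError("cycle in classification " + self.name)
            seen.add(node)
            chain.append(node)
        return chain

    def descendants(self, node):
        """Return every node below the given subgroup."""
        found, stack = set(), [node]
        while stack:
            for child in self.children.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


# Two variant views of the same languages (illustrative groupings only).
view_a = Classification("view-a")
view_b = Classification("view-b")
shared = [("spanish", "romance"), ("french", "romance"), ("romance", "italic")]
for child, parent in shared:
    view_a.add(child, parent)
    view_b.add(child, parent)
view_a.add("italic", "indo-european")
view_b.add("italic", "italo-celtic")
view_b.add("italo-celtic", "indo-european")
```

With this layout, `view_a.descendants("romance")` extracts everything in a subgroup, and `view_b.lineage("spanish")` gives a different path from `view_a.lineage("spanish")` without either view being imposed on the other. Adding, deleting, or merging groups changes only the links in one view; the identifiers attached to the data stay the same.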

Genetic information, however, is just one type of information which a linguistic classification might need to code:

Should we also attempt to classify languages by areal groupings, or by typological similarities?
If so, how would we formalize such a classification so that a computer can make consistent use of it?

Finally, perhaps the most important issue we need to consider is this:

For any system to be effective, it must be generally accepted. Linguistics needs some kind of mechanism for giving a community assent to a computational standard, and a generally-accepted method for considering whether changes should be incorporated into the standard coding system. How could we go about instituting such a mechanism?

It is perhaps worth stating here that we are faced with an interesting time in linguistics. Linguistic data is going to appear on the web in ever-larger quantities. We cannot simply continue to do things as we always have. If we do not institute digital standards, others will do so for us, and we can safely say that we will not generally like the result.
