More on Language Codes
What's wrong with ISO-639?
The International Organization for Standardization (ISO) has
formulated a standard (ISO-639) which assigns one of 464 three-letter codes to
languages. Since ISO is the internationally accepted body charged with setting
standards, there are good reasons to follow its recommendations. There are problems
with doing so, however. For administrative and historical reasons, linguists have
had little input into the code-set, and as a result it does not describe the linguistic
universe as we see it. With such a small number of codes, the standard can
obviously cover only a minority of the world's languages. To make up for this
deficiency, codes have been assigned to cover the residue of language families
whose members have not all been assigned individual codes (e.g. AFA "Afro-Asiatic
(Other)"). To deal with other languages not individually covered, geographical groupings (e.g.
CAI "Central American Indian (Other)") have been assigned codes too. Nor is the
standard internally consistent: it has two versions, ISO
639-2/B for bibliographic applications, and ISO 639-2/T
for terminology applications, and they differ on 5% of the codes.
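To make the divergence between the two versions concrete, here is a minimal sketch in Python. The bibliographic/terminology pairs shown are a handful of well-known ones (the full code tables list both forms); the lookup helper and variable names are our own illustration, not part of any ISO specification.

    # A few ISO 639-2 codes on which the bibliographic (B) and terminology (T)
    # versions of the standard disagree, plus two of the "residue" codes
    # mentioned above. The helper function is purely illustrative.
    BIBLIOGRAPHIC_TO_TERMINOLOGY = {
        "fre": "fra",  # French
        "ger": "deu",  # German
        "dut": "nld",  # Dutch
        "cze": "ces",  # Czech
        "chi": "zho",  # Chinese
    }

    RESIDUE_CODES = {
        "afa": "Afro-Asiatic (Other)",
        "cai": "Central American Indian (Other)",
    }

    def terminology_code(code: str) -> str:
        """Map a bibliographic code to its terminology equivalent.

        For the roughly 95% of codes on which the two versions agree,
        the code is returned unchanged.
        """
        return BIBLIOGRAPHIC_TO_TERMINOLOGY.get(code, code)

    print(terminology_code("ger"))  # deu
    print(terminology_code("eng"))  # eng: B and T agree here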
A competing, or perhaps complementary, standard is that of the Ethnologue. From a linguistic point
of view, this is a much more complete and coherent set of language codes than
ISO has produced: its three-letter codes cover 6,703 languages,
and it is based on internally consistent principles.
One of the most important questions we must address, then, is this:
- Should we as linguists choose the Ethnologue coding-scheme or that of ISO-639?
The value of conforming to an international standard is always considerable,
and in that regard ISO-639 is clearly preferable. But it would require major
reworking to be useful to linguists.
Questions about coding language groups
Ethnologue includes a complete statement of subgrouping information for each
of the languages in its code-set, but there is no standard coding system which
can be used for information interchange, or for relating different views of
language relationships. We need to consider what a good coding scheme for linguists
would be in this regard. As an example of one possible coding scheme for
a genetic classification of languages, we suggest you look at the one
proposed for the LINGUIST database. In this area, you can
help our task most by coming to the workshop prepared to answer these questions:
- Should a classification system create an environment which, to some degree,
  "knows" where in a family tree a language belongs? We should certainly not
  have to give an entire tree every time we mention a language; yet we do want
  to be able to extract all material which belongs to a particular subgroup.
  How would you implement such a system? (One possible approach is sketched
  after this list.)
- How can we best handle variant subgrouping? Whatever coding system we use,
it should be able to represent variant trees in a family, rather than imposing
one view on the entire community.
- How should we handle subgroups as opposed to languages? By the standard
historical method, any subgroup is simply a set of languages, all of which
are derived from an earlier proto-language. Thus the node which defines a
subgroup can also be seen as simply an earlier, extinct language. Vulgar Latin
is in this sense equivalent to Proto-Romance. Should we therefore treat nodes
in a family tree simply as languages?
- How do we build a coding system which can handle changing views of groupings?
  What happens when we add a subgroup, delete one, or join families into macro-groupings?
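To make these questions more concrete, here is a minimal sketch in Python of one way such a system might work. It is not the coding scheme proposed for the LINGUIST database, only an illustration of the general idea: each language code is associated with the path to its place in a family tree, so that extracting everything in a subgroup reduces to prefix matching, and a variant subgrouping is simply an alternative set of paths rather than a single analysis imposed on the whole community. The view labels and groupings below are illustrative; the language codes are ISO 639 terminology codes.

    # A sketch (not the LINGUIST proposal itself) of a classification code in
    # which each language carries the path to its position in a family tree.
    # Subgroup membership then reduces to prefix matching, and competing
    # subgroupings are stored as alternative "views" rather than as a single
    # tree imposed on everyone. All groupings here are illustrative.
    from typing import Dict, List, Tuple

    Path = Tuple[str, ...]

    VIEWS: Dict[str, Dict[str, Path]] = {
        "view-A": {
            "fra": ("Indo-European", "Italic", "Romance", "Western Romance"),
            "spa": ("Indo-European", "Italic", "Romance", "Western Romance"),
            "ron": ("Indo-European", "Italic", "Romance", "Eastern Romance"),
            "deu": ("Indo-European", "Germanic", "West Germanic"),
        },
        # A variant analysis of Romance; the query code below is unaffected.
        "view-B": {
            "fra": ("Indo-European", "Italic", "Romance", "Gallo-Romance"),
            "spa": ("Indo-European", "Italic", "Romance", "Ibero-Romance"),
            "ron": ("Indo-European", "Italic", "Romance"),
            "deu": ("Indo-European", "Germanic", "West Germanic"),
        },
    }

    def languages_in_subgroup(view: str, subgroup: Path) -> List[str]:
        """Return every language in a view whose path begins with `subgroup`.

        We never have to spell out a whole tree when citing a language, yet
        all material belonging to a particular subgroup can still be pulled out.
        """
        return [lang for lang, path in VIEWS[view].items()
                if path[:len(subgroup)] == subgroup]

    print(languages_in_subgroup("view-A", ("Indo-European", "Italic", "Romance")))
    # ['fra', 'spa', 'ron']
    print(languages_in_subgroup("view-B", ("Indo-European", "Italic", "Romance", "Gallo-Romance")))
    # ['fra']

On this approach a subgroup node could itself be entered as a language whose path is simply one element shorter, which is one way of cashing out the Vulgar Latin/Proto-Romance equivalence raised above.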
Genetic information, however, is just one type of information which a linguistic
classification might care to code:
- Should we also attempt to classify languages by areal groupings, or by typological
similarities?
- If so, how would we formalize such a classification so that a computer can
make consistent use of it?
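As one illustration of what such a formalization might look like, the Python sketch below tags each language with a small set of named areal and typological features whose values are drawn from controlled vocabularies, so that a machine can query them consistently. The feature names, values, and area labels are chosen here purely for illustration; they are not a proposed standard.

    # A sketch of areal and typological classification as explicit feature-value
    # pairs drawn from controlled vocabularies. Feature names, values, and area
    # labels are illustrative only; the codes are ISO 639-2 terminology codes.
    from typing import Dict, List

    LANGUAGES: Dict[str, Dict[str, str]] = {
        "jpn": {"area": "East Asia", "basic_word_order": "SOV", "tone": "pitch-accent"},
        "zho": {"area": "East Asia", "basic_word_order": "SVO", "tone": "tonal"},
        "eng": {"area": "Europe", "basic_word_order": "SVO", "tone": "non-tonal"},
        "tur": {"area": "West Asia", "basic_word_order": "SOV", "tone": "non-tonal"},
    }

    def select(**criteria: str) -> List[str]:
        """Return the languages whose records match every feature=value pair given."""
        return [code for code, record in LANGUAGES.items()
                if all(record.get(feature) == value
                       for feature, value in criteria.items())]

    print(select(basic_word_order="SOV"))          # ['jpn', 'tur']
    print(select(area="East Asia", tone="tonal"))  # ['zho']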
Finally, perhaps the most important issue we need to consider is this:
- For any system to be effective, it must be generally accepted. Linguistics
needs some kind of mechanism for securing community assent to a computational
standard, and a generally accepted procedure for deciding whether changes
should be incorporated into the standard coding system. How could we go about
instituting such a mechanism?
It is perhaps worth stating here that this is an interesting time for
linguistics. Linguistic data is going to appear on the web in ever-larger quantities.
We cannot simply continue to do things as we always have. If we do not institute
digital standards, others will do so for us, and we can safely say that we will
not generally like the result.