Language identification in metadata descriptions of language archive holdings

[Cache from http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/simons.htm; please use this canonical URL/source if possible.]


Gary F. Simons
SIL International
gary_simons@sil.org

Paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.


Abstract. Uniform identification of languages is a foundational requirement within the metadata of language archives. This paper discusses the problems that a system of language identification must solve, and then proposes that the system of three-letter identification codes used in the Ethnologue offers a complete and open solution to those problems. The paper goes on to describe what SIL International is contributing to the infrastructure for open language archiving so that this system of identifiers can serve the language archives community as the standard for language identification in metadata.

Contents

  1. Language identification as a foundational metadata requirement
  2. The problem of language identification
    2.1 The nature of the problem
    2.2 A complete and open solution
    2.3 SIL codes and RFC 1766
  3. Using SIL language codes as a metadata standard
    3.1 Requirements
    3.2 Design of the data files
    3.3 Using the data files
  4. Conclusion
References


1. Language identification as a foundational metadata requirement

The feature that distinguishes a language archiving community from archiving in general is its focus on human language as a subject of enquiry. The document that lists the requirements for a language archiving infrastructure (Simons and Bird 2000) begins by listing three requirements that the users of any archiving community would have for searching the holdings of the community. The fourth requirement homes in on what makes the language archiving community unique, namely, its interest in ensuring that languages are uniquely and uniformly identified in all metadata descriptions so that holdings in all archives about the same language are tagged in the same way. These first four requirements can be summarized as follows:

  1. A single site on the web offers a combined catalog of all holdings of the language archives community.
  2. This catalog contains a full metadata description of each holding so that a user can tell what it contains without downloading it.
  3. Users can query on particular metadata elements (either individually or in combination) to perform focused searching of the holdings of the entire language archives community.
  4. The metadata tagging to identify language is done consistently so that a single search for a particular language will retrieve all relevant resources from all participating archives.

There are two primary metadata elements in which language identification is relevant. The first could be called text language; it identifies the language in which the text of a resource is written. The second could be called subject language; it identifies the language which a resource describes. A linguist studying a particular language would want to be able to keep publications in the language distinct from publications about the language, but would also want both to use the same way of identifying the language so that a single catalog query on both metadata elements could retrieve all works related to the language. The focus of this paper is on developing and supporting a standard means of identifying languages for the language archives community. The details of how these two metadata elements would be marked up in a metadata standard for the language archives community are treated in another paper (Bird and Simons 2000).
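The distinction between the two elements can be illustrated with a minimal sketch. The record structure, the field names, and the three catalog entries below are hypothetical and serve only to show how a single query over both elements would behave when both use the same identifiers; the actual markup is left to the metadata standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    """Hypothetical metadata record; the field names are assumptions."""
    title: str
    text_language: str     # language in which the resource is written
    subject_language: str  # language which the resource describes


def related_to(holdings, code):
    """Return every holding written in, or written about, the given language."""
    return [h for h in holdings if code in (h.text_language, h.subject_language)]


# Invented entries, for illustration only.
catalog = [
    Holding("Primer written in Cherokee", "CER", "CER"),
    Holding("Grammar of Cherokee written in English", "ENG", "CER"),
    Holding("Grammar of Ghotuo written in English", "ENG", "AAA"),
]

cherokee_works = related_to(catalog, "CER")
```

Because both elements draw on one set of identifiers, the single call retrieves the work in Cherokee and the work about Cherokee, while the two remain distinguishable by field.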

Section 2 of this paper discusses the problems that a system of language identification must solve, and then proposes that the system of three-letter identification codes used in the Ethnologue offers a complete and open solution to those problems. Section 3 goes on to describe what SIL International is contributing to the infrastructure for open language archiving so that this system of identifiers can serve the language archives community as the standard for language identification in metadata.

2. The problem of language identification

Given that language archives must identify languages in metadata descriptions, how should they identify them? The language archives community needs a standard for achieving this purpose.

2.1 The nature of the problem

We are used to seeing languages identified by their name in library catalogs. However, when we consider the full scope of the more than 6,000 languages spoken in the world today, we quickly see that a system based on names could never serve to solve the problem of unique and uniform identification. These are some of the facts that stand in the way:

The sum of these facts taken together suggests that a standard based on names will not work. Rather, what is needed is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier.

The information technology community has a standard for language identification, namely, ISO 639 (ISO 1998). Part 1 of this standard lists two-letter codes for identifying about 140 of the world's major languages; part 2 of the standard lists three-letter codes for identifying about 400 languages. ISO 639 in turn forms the core of another standard, RFC 1766 (Alvestrand 1995), which is the standard used for language identification in the xml:lang attribute of XML (W3C 1998) and in the language element of the Dublin Core (DCMI 1999). RFC 1766 provides a mechanism for users to register new language identification codes for languages not covered by ISO 639, but very few additional languages have been registered.

2.2 A complete and open solution

Unfortunately, the existing standard falls far short of meeting the needs of the language archives community since it fails to account for more than 90% of the world's languages. This standard also falls short by not providing adequate documentation of what languages the codes refer to. There are, however, two systems of language identifiers that do provide identification codes for all the living languages of the world; these are the Linguasphere Register (Dalby 1999) and SIL's Ethnologue (Grimes 2000). Both of these are available as published reference books, but only the Ethnologue also makes its complete system of identifiers openly available on the Web. Indeed, with over half a million page hits per month, the Ethnologue has become the leading source of information on the Web concerning language identification. As the only complete language identification scheme openly available on the Web, the SIL language codes are the only viable candidate for use as a language identification standard by the language archives community.

The main content of the Web edition of the Ethnologue (www.sil.org/ethnologue/) is a set of pages that list all the languages spoken in a particular country. The listing for a single language contains the following information in a format like this:

LANGUAGE NAME (ALTERNATE NAMES). [Unique three-letter identification code] Notes on population. Notes on location. Linguistic classification. Dialects: DIALECT NAME (ALTERNATE NAMES FOR DIALECT). Notes on language use. Notes on available literature.

For instance, the following is the listing for Cherokee, which has CER as its unique identifier:

CHEROKEE (TSALAGI, TSLAGI) [CER] 22,500 speakers, including 14,000 speakers out of 70,000 population on Oklahoma rolls (1986 Durbin Feeling, Cherokee Nation, OK); 8,500 in North Carolina. 11,905 speakers including 130 monolinguals; 308,132 ethnic Cherokee (1990 USA Census Bureau). Eastern and northeastern Oklahoma and Cherokee Reservation, Great Smokey Mts., western North Carolina. Iroquoian, Southern Iroquoian, Cherokee. Dialects: ELATI (LOWER CHEROKEE, EASTERN CHEROKEE), KITUHWA (MIDDLE CHEROKEE), OTALI (UPPER CHEROKEE, WESTERN CHEROKEE, OVERHILL CHEROKEE), OVERHILL-MIDDLE CHEROKEE. The Elati dialect is extinct. Language use is vigorous in some Oklahoma communities, elsewhere some younger ones prefer English. Now being taught in schools, churches, and other classes (1986 Cherokee Advocate). 15% to 20% can read Cherokee, 5% can write it (1986 Cherokee Heritage Center). Dictionary. Grammar. NT 1850-1951. Bible portions 1829-1953.

Note that the language and dialect names are specific to the country under which the language is listed. The same language spoken in a different country could list different names and dialects. The fact that two entries refer to the same language is indicated by the equivalence of their three-letter codes. The Web site offers a CGI page for looking up a three-letter code to find out what language it represents and what countries it is spoken in. For instance, the URL to query the CER code in the current Web site for the 13th edition of the Ethnologue would be:

http://www.sil.org/ethnologue/lookup?cer

In the forthcoming Web site for the 14th edition, the query will be:

http://www.ethnologue.com/show_language.asp?code=cer

2.3 SIL codes and RFC 1766

Constable and Simons (2000) go into greater depth on the problems involved in language identification and conclude by proposing a scheme for incorporating the use of SIL language codes into RFC 1766 so that this standard, already in widespread use by the information technology community, could be extended to handle all the languages of the world. The proposal is that the standard be extended to allow for language identification codes from multiple namespaces. A code qualified by a namespace would have three parts: n to invoke a namespace, the code for the authorized naming authority, and the language code maintained by that authority. Thus, the SIL code for Cherokee would be: n-sil-cer.

Namespaces are needed so that alternate coding systems based on different operational definitions of language would be possible. The identification of languages by SIL is based on a primary criterion of mutual non-intelligibility. That is, two communities are thought to speak different languages if neither can inherently understand the speech of the other. This way of defining language is widely used among field linguists, but some people have expressed dissatisfaction with the language listings given in the Ethnologue because they would like to give preference to other criteria, like ethnolinguistic identity or shared literary tradition or government policy. Other ways of defining language may be more useful for other purposes, so the idea of alternate namespaces (with their corresponding naming authorities) seems a useful one.

It is important to note that the SIL language codes offer complete coverage only for living (and recently extinct) languages. The language archives community will no doubt want to extend its collections to include ancient languages as well. The existing ISO 639 standard does have a few dozen codes for ancient languages, but the inventory falls far short of what would be needed by the scholarly community. Thus it is important that an institution step forward which would be able to function as the naming authority to devise and maintain language identification codes for ancient languages.

The namespace proposal made in Constable and Simons (2000) was forwarded to the officials involved in the revision of RFC 1766 and they decided not to act on it at this time. Subsequently, the Unicode Technical Committee has underscored the critical need for a standard way of identifying all the world's languages by unanimously passing a resolution endorsing the namespace proposal and asking RFC 1766 to incorporate it. At this date it is not yet clear what the agreed-upon mechanism will ultimately be for referring to identifiers for all the world's languages.

In the meantime, the language archives community can use codes of the form x-sil-cer. This form employs RFC 1766's prefix (x-) for user-defined language codes. It is thus a valid language identifier within the framework of RFC 1766, and members of the language archives community could agree among themselves that the x-sil- prefix means that the remainder of the code is a three-letter language identifier from the Ethnologue.
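As an illustration of this interim convention, the following sketch converts between a three-letter SIL code and a tag of the form x-sil-cer. The function names are hypothetical, and the choice to emit lowercase tags and return uppercase codes is an assumption made here for consistency with the examples in this paper; RFC 1766 itself treats tags as case insensitive.

```python
SIL_PREFIX = "x-sil-"  # the prefix proposed above for Ethnologue codes


def _is_sil_code(code):
    """True if the string looks like a three-letter identifier."""
    return len(code) == 3 and code.isascii() and code.isalpha()


def to_tag(sil_code):
    """Build an RFC 1766 user-defined tag from a three-letter SIL code."""
    code = sil_code.strip().lower()
    if not _is_sil_code(code):
        raise ValueError("not a three-letter code: %r" % sil_code)
    return SIL_PREFIX + code


def from_tag(tag):
    """Return the SIL code carried by an x-sil- tag, or None for any other tag."""
    lowered = tag.strip().lower()
    if not lowered.startswith(SIL_PREFIX):
        return None
    code = lowered[len(SIL_PREFIX):]
    return code.upper() if _is_sil_code(code) else None
```

A tag that does not carry the x-sil- prefix, such as an ordinary ISO 639 tag, is simply reported as not being an SIL code. The check covers only the shape of the code; whether the code actually exists must still be verified against the data files described in section 3.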

3. Using SIL language codes as a metadata standard

In order to facilitate the use of its three-letter language identifiers throughout the language archives community, SIL International is making a public release of data files that enumerate the complete set of identifiers. These can be loaded into local databases by developers who are writing software that archivists would use to build metadata descriptions of archived items or that linguists would use to form search queries against archived holdings. The following subsections list the requirements that motivated the design of the files, describe that design, and then explain how they can be used.

3.1 Requirements

The archivist building metadata descriptions and the linguist searching archive catalogs both face the same problem--they need to find the right identifier for a particular language. SIL's approach to supporting this task is designed to meet the following three requirements:

  1. Given the name or the alternate name of a language or of a dialect of the language, it must be possible to find the language identifier for that language.
  2. Given a particular language identifier, it must be possible to look up the full description of the corresponding language in order to verify that the selected identifier refers to the intended language.
  3. Given a country of interest, it must be possible to narrow the search for a language identifier to the languages of that country.

Once the full set of language identifiers begins to be used as a standard, another problem arises for the long term. As languages and dialects change over time and as better information is discovered about lesser-known languages, changes must be made to the set of identifiers. New identifiers may be added, while others may be taken out of use. In the process, existing identifiers may be affected in that the range of speech varieties they refer to could narrow or widen. Thus SIL's approach to managing language identifiers is designed to meet the following additional requirements:

  1. Given a particular language identifier, it must be possible to learn if (and how) its meaning may have changed over time.
  2. Given a particular date, it must be possible to learn all the changes that have occurred in the language identifiers since that date.

3.2 Design of the data files

There are three files that make up the package of data tables that SIL International proposes to release in support of its standard for language identifiers. They are tab-delimited files in which each line represents one row of a data table. LanguageCodes.tab is the complete list of three-letter language identifiers listed by country and with all of their known names (including language names, dialect names, and alternate names). CountryCodes.tab is the list of two-letter country codes that are used in the main language code table. ChangeHistory.tab records the history of changes to the language identifiers. The following declarations provide the formal definitions for SQL data tables into which the tab-delimited files can be loaded:

CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,        -- Three-letter code
   CountryID   char(2) NOT NULL,        -- Country for this name 
   NameType    char(2) NOT NULL,        -- L(anguage), LA(lternate),
                                        -- D(ialect), DA(lternate)
   Name        varchar(75) NOT NULL )   -- Language name
 
CREATE TABLE CountryCodes (
   CountryID   char(2) NOT NULL,        -- Two-letter code
   Name        varchar(75) NOT NULL )   -- Country name

CREATE TABLE ChangeHistory (
   LangID      char(3) NOT NULL,        -- The ID that is different
   Date        smalldatetime NOT NULL,  -- The date of the change
   Action      char(1) NOT NULL,        -- C(reated), M(odified), R(etired)
   Description varchar(2000) NOT NULL ) -- Description of change
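A developer might load one of these files into a local database along the following lines. This is only a sketch under stated assumptions: it uses SQLite through Python's standard library as a stand-in for whatever database the developer prefers, it assumes that the file has no header row and that its columns appear in the order declared above, and the function name is hypothetical. A small in-memory sample takes the place of the real LanguageCodes.tab file.

```python
import csv
import io
import sqlite3

# Same columns as the LanguageCodes declaration above.
SCHEMA = """
CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,
   CountryID   char(2) NOT NULL,
   NameType    char(2) NOT NULL,
   Name        varchar(75) NOT NULL
);
"""


def load_language_codes(conn, tab_file):
    """Insert each tab-delimited line of the open file as one table row."""
    rows = csv.reader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany("INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Stand-in for open("LanguageCodes.tab", newline="", encoding=...).
sample = io.StringIO("AAA\tNG\tL\tGhotuo\nAAA\tNG\tLA\tOtuo\nAAC\tPG\tL\tAri\n")
load_language_codes(conn, sample)

row_count = conn.execute("SELECT COUNT(*) FROM LanguageCodes").fetchone()[0]
```

The other two files could be loaded in the same way, with the number of placeholders in the INSERT statement adjusted to the number of columns in each table.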

3.3 Using the data files

LanguageCodes.tab documents 37,420 distinct names used with 7,148 distinct language identifiers. The table contains 46,416 records since some of the names are used in more than one country or with more than one language or dialect. The following shows the entries for the first three language identifiers:

LangID CountryID NameType Name                                                                        
------ --------- -------- ------------- 
AAA    NG        L        Ghotuo
AAA    NG        LA       Otuo
AAA    NG        LA       Otwa
AAB    NG        LA       Alumu
AAB    NG        D        Arum
AAB    NG        LA       Arum-cesu
AAB    NG        LA       Arum-chessu
AAB    NG        L        Arum-tesu
AAB    NG        D        Tesu
AAC    PG        L        Ari

We see that AAA denotes a language spoken in Nigeria which has a primary name of Ghotuo and two alternate names. AAB, also spoken in Nigeria, has three alternate names and two dialect names in addition to its primary name. The third language, AAC, is spoken in Papua New Guinea and has just one name.

The LanguageCodes.tab table would be used to implement a search by name. The name specified by the user would be compared to all the values in the Name column to find potential language identifiers that could match it. To allow the user to verify that a proposed identifier is indeed the right one, the software would offer the following link to the Ethnologue Web site (where XXX is the proposed three-letter identifier) which generates a report giving detailed information about the selected language:

http://www.ethnologue.com/show_language.asp?code=XXX
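The search by name and the verification link could be sketched as follows. The example uses SQLite through Python's standard library and a subset of the sample rows listed above; the function names are hypothetical, and the case-insensitive exact match is an assumption about how a name search might behave (real software might also want prefix or approximate matching).

```python
import sqlite3

# A subset of the sample rows of LanguageCodes.tab listed above.
SAMPLE_ROWS = [
    ("AAA", "NG", "L", "Ghotuo"),
    ("AAA", "NG", "LA", "Otuo"),
    ("AAB", "NG", "LA", "Alumu"),
    ("AAB", "NG", "D", "Arum"),
    ("AAB", "NG", "L", "Arum-tesu"),
    ("AAC", "PG", "L", "Ari"),
]


def open_sample_db():
    """Create an in-memory LanguageCodes table holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE LanguageCodes ("
        " LangID char(3) NOT NULL, CountryID char(2) NOT NULL,"
        " NameType char(2) NOT NULL, Name varchar(75) NOT NULL)"
    )
    conn.executemany("INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def find_identifiers(conn, name):
    """Return the identifiers whose language, dialect, or alternate name matches.

    SQLite's lower() folds ASCII letters only, so names with other
    characters are matched case-sensitively in this sketch.
    """
    cur = conn.execute(
        "SELECT DISTINCT LangID FROM LanguageCodes"
        " WHERE lower(Name) = lower(?) ORDER BY LangID",
        (name,),
    )
    return [row[0] for row in cur]


def verification_url(lang_id):
    """Build the Ethnologue report link for a proposed identifier."""
    return "http://www.ethnologue.com/show_language.asp?code=" + lang_id.lower()


conn = open_sample_db()
candidates = find_identifiers(conn, "arum")
links = [verification_url(code) for code in candidates]
```

Searching on the dialect name Arum leads to the identifier AAB, and the generated link lets the user confirm that this is the intended language.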

CountryCodes.tab lists the two-letter identifier and name for 266 countries of the world. The codes are from the international standard named ISO 3166-1 (ISO 1997). The following shows the entries for the first five codes in the list:

CountryID Name                                               
--------- --------------------- 
AD        Andorra
AE        United Arab Emirates
AF        Afghanistan
AG        Antigua and Barbuda
AI        Anguilla

The CountryCodes.tab table would be used to narrow the search for an identifier to a particular country. The user would choose a country from the country list in order to select the appropriate country code which would be used in a SQL query to restrict the language identifier list to just entries for that country. For instance, if the user were interested only in Afghanistan, the following SQL query would return just the table rows for that country:

SELECT * FROM LanguageCodes WHERE CountryID='AF'

Alternatively, the following link to the Ethnologue Web site could be used to generate a report listing all the languages for Afghanistan:

http://www.ethnologue.com/show_country.asp?code=AF
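When the user picks a country by name rather than by code, the two tables can be joined so that the code never needs to be typed. The following sketch assumes SQLite through Python's standard library, uses sample rows drawn from the listings above, and restricts the result to primary language names; that restriction and the function name are choices made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LanguageCodes (
   LangID char(3) NOT NULL, CountryID char(2) NOT NULL,
   NameType char(2) NOT NULL, Name varchar(75) NOT NULL);
CREATE TABLE CountryCodes (
   CountryID char(2) NOT NULL, Name varchar(75) NOT NULL);
""")
conn.executemany(
    "INSERT INTO LanguageCodes VALUES (?, ?, ?, ?)",
    [
        ("AAA", "NG", "L", "Ghotuo"),
        ("AAA", "NG", "LA", "Otuo"),
        ("AAB", "NG", "L", "Arum-tesu"),
        ("AAB", "NG", "D", "Tesu"),
        ("AAC", "PG", "L", "Ari"),
    ],
)
conn.executemany(
    "INSERT INTO CountryCodes VALUES (?, ?)",
    [("NG", "Nigeria"), ("PG", "Papua New Guinea")],
)


def languages_in_country(conn, country_name):
    """Return (LangID, primary name) pairs for the named country."""
    cur = conn.execute(
        """
        SELECT lc.LangID, lc.Name
        FROM LanguageCodes AS lc
        JOIN CountryCodes AS cc ON cc.CountryID = lc.CountryID
        WHERE cc.Name = ? AND lc.NameType = 'L'
        ORDER BY lc.LangID
        """,
        (country_name,),
    )
    return cur.fetchall()


nigerian_languages = languages_in_country(conn, "Nigeria")
```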

The ChangeHistory.tab table would be used by archivists who want to keep the language tagging of their holdings up-to-date with the latest version of the standard. For instance, if an archive were up-to-date with its language tagging through the end of 1999, the following SQL query would be used to find out what changes have occurred in the set of language identifiers since that time:

SELECT * FROM ChangeHistory WHERE Date >= '2000-01-01'

To discover what metadata descriptions might need a change in their language tagging due to changes to the standard language identifiers, the archivist could do a relational join on LangID between the metadata records of the archive and this change history table.
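Such a join might look like the following sketch. Everything specific in it is an assumption: the ArchiveRecords table stands for whatever table holds an archive's own metadata records, the identifiers XXA and XXB and the change history rows are invented placeholders (this paper gives no sample of ChangeHistory.tab), and SQLite is used through Python's standard library with dates stored as ISO 8601 text so that string comparison follows date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ChangeHistory (
   LangID char(3) NOT NULL,
   "Date" text NOT NULL,            -- ISO 8601 text in this sketch
   Action char(1) NOT NULL,
   Description varchar(2000) NOT NULL);
CREATE TABLE ArchiveRecords (       -- hypothetical archive metadata table
   RecordID text NOT NULL,
   LangID char(3) NOT NULL);
""")
# Invented rows, for illustration only.
conn.executemany(
    "INSERT INTO ChangeHistory VALUES (?, ?, ?, ?)",
    [
        ("XXA", "2000-03-15", "R", "placeholder: identifier retired"),
        ("XXB", "1999-06-01", "M", "placeholder: scope narrowed"),
    ],
)
conn.executemany(
    "INSERT INTO ArchiveRecords VALUES (?, ?)",
    [("rec-1", "XXA"), ("rec-2", "XXB"), ("rec-3", "AAA")],
)


def records_needing_review(conn, since):
    """Return archive records tagged with an identifier changed on or after `since`."""
    cur = conn.execute(
        """
        SELECT ar.RecordID, ch.LangID, ch.Action
        FROM ArchiveRecords AS ar
        JOIN ChangeHistory AS ch ON ch.LangID = ar.LangID
        WHERE ch."Date" >= ?
        ORDER BY ar.RecordID, ch."Date"
        """,
        (since,),
    )
    return cur.fetchall()


to_review = records_needing_review(conn, "2000-01-01")
```

Only the record tagged with the identifier that changed after the cutoff date is returned; the record whose identifier changed in 1999 and the record whose identifier has no change history are left out.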

4. Conclusion

Uniform identification of languages is a foundational requirement within the metadata of language archives. The three-letter identifiers used within the Ethnologue appear to be the only system of language identifiers that covers all the living languages of the world and that is openly available. As one of its contributions to the infrastructure for language archiving, SIL International is committed to supporting that system of identifiers in such a way that it can serve the language archives community as the standard for language identification in metadata.

Some questions remain to be answered before all the details of a standard are in place. Three unsettled issues that have been touched on, but which lie outside the scope of this paper to solve, are:

  1. the exact markup scheme for language-related metadata elements,
  2. the exact syntax for referring to SIL language codes within the framework of RFC 1766, and
  3. the need for a standard that provides identifiers for ancient languages.

Further, there are two substantive issues that have not been touched on, but which would ideally be part of the complete solution. The first of these is the identification of dialects. In short, does the language archives community need a standard for the uniform identification of particular dialects? The second issue is linguistic classification. The Ethnologue database and Web site could also serve as a standard source for the names of linguistic subgroups to use in metadata, and offer a service for resolving the names of linguistic subgroups into lists of member language identifiers to use when searching archives. But even in the face of these open issues, the existing set of SIL language codes is still in a position to offer the language archives community a firm solution to its foundational requirement for language identification in metadata.

References

Alvestrand, Harald, ed. 1995. RFC 1766: Tags for the identification of languages. <http://www.ietf.org/rfc/rfc1766.txt?number=1766>

Bird, Steven and Gary Simons. 2000. White paper on establishing an infrastructure for open language archiving. Working paper for the workshop on Web-based Language Documentation and Description, 12-15 December 2000, Philadelphia, PA. <http://www.ldc.upenn.edu/exploration/expl2000/whitepaper.html>

Constable, Peter and Gary F. Simons. 2000. Language identification and IT: Addressing problems of linguistic diversity on a global scale. SIL Electronic Working Papers 2000-001. <http://www.sil.org/silewp/2000/001/>

Dalby, David. 1999. Linguasphere register of the world's languages and speech communities. Linguasphere Press. <http://www.linguasphere.org/>

DCMI [Dublin Core Metadata Initiative]. 1999. Dublin core metadata element set, version 1.1: Reference description. <http://purl.org/dc/documents/rec-dces-19990702.htm>

Grimes, Barbara F., ed. 2000. Ethnologue: languages of the world, 14th edition. Dallas, TX: SIL International. Web edition of 13th edition at <http://www.sil.org/ethnologue/>. Web edition of 14th edition forthcoming at <http://www.ethnologue.com/>.

ISO [International Organization for Standardization]. 1997. ISO 3166-1: 1997 (E/F), Codes for the representation of names of countries and their subdivisions--Part 1: Country codes. Geneva: International Organization for Standardization. <http://www.din.de/gremien/nas/nabd/iso3166ma/>.

ISO. 1998. ISO 639-2:1998(E/F), Codes for the representation of names of languages--part 2: alpha-3 code. Geneva: International Organization for Standardization. <http://lcweb.loc.gov/standards/iso639-2/langhome.html>.

Simons, Gary and Steven Bird. 2000. Requirements on the infrastructure for digital language documentation and description. Working paper for the workshop on Web-based Language Documentation and Description, 12-15 December 2000, Philadelphia, PA. <http://www.ldc.upenn.edu/exploration/expl2000/requirements.html>

W3C [World Wide Web Consortium]. 1998. Extensible markup language (XML) 1.0. <http://www.w3.org/TR/1998/REC-xml-19980210>