Last modified: February 12, 2002
Electronic Metastructure for Endangered Languages Data (EMELD)

[February 12, 2002] The EMELD project is sponsored by the National Science Foundation (NSF) and is based at the Linguist List at Wayne State and Eastern Michigan Universities; subcontracts are held by the Linguistic Data Consortium at the University of Pennsylvania, the Endangered Languages Fund, and the University of Arizona. The project has working groups for Language Codes, Markup, and Metadata. The primary goal of EMELD is to establish and promote consensus about an "archive infrastructure" appropriate for "members of the scientific community who are faced with two urgent situations: (1) the number of languages in the world is rapidly diminishing while (2) the number of initiatives to digitize language data is rapidly multiplying... EMELD addresses the need for collaboration among archivists, field linguists, and language engineers. [Apart from such collaboration,] (1) a common standard for the digitization of linguistic data may never be agreed upon; and the resulting variation in archiving practices and language representation would seriously inhibit data access, searching, and cross-linguistic comparison; (2) standards may be set without guidance from descriptive linguists, the people who best know the range of structural possibilities in human language..." [adapted from the home page]

As of December 2001, the EMELD Project had "instantiated a full database of language codes on its site, and has defined a set of approximately 200 ancient languages, as well as some 20 constructed languages, all of which have been assigned codes and brief descriptions..."

A paper presented at the IRCS Workshop on Linguistic Databases (December 2001) describes "the beginning of an effort within the Linguist List's Electronic Metastructure for Endangered Languages Data (E-MELD) project to develop markup recommendations for representing the morphosyntactic structures of the world's endangered languages. Rather than proposing specific markup recommendations as in the Text Encoding Initiative (TEI), we propose to construct an environment for comparing data sets using possibly different markup schemes. The central feature of our proposed environment is an ontology of morphosyntactic terms with multiple inheritance and a variety of relations holding among the terms. We are developing our ontology using the Protégé editor, and are extending an existing upper-level ontology known as SUMO... The paper describes the first stage in reaching the second of these goals. Our decision to begin work on the analysis of morphosyntactic terms was based on the recommendations of a markup work group that the Linguist List organized at the Language Digitization Workshop in Santa Barbara, June 21-24, 2001. That group divided the task of developing markup recommendations into several problem areas, and identified morphosyntactic markup as the first problem to be tackled... The architecture for the envisioned system is given [in Figure 4]. The three major components of the E-MELD system are (1) the graphical user interface (GUI), (2) the knowledge base (containing the ontology and query engine), and (3) the database of endangered languages marked up in XML format. The end user will be able to access the E-MELD system via the World Wide Web as the knowledge base and language data will reside together at a remote site. The user may pose queries to the knowledge base in standard search engine format (similar to that of Yahoo or Google). 
For example, the query 'ergative P2' will return a list of languages and/or actual language data from P2 languages containing ergative constructions. The only requirement that is required is that the documents containing the individual language data be in XML format. The query engine will have access to XML metadata and all language data in each file. Once the envisioned system is implemented only minimal maintenance will be required to add additional language data. Adding new data sets merely requires the ontology manager to interpret the researcher's tagset and to incorporate it into the existing ontology..." [see below: William Lewis, Scott Farrar, D. Terence Langendoen]
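
The query flow described in the paper (researcher tagsets interpreted against a shared ontology, then a keyword query such as 'ergative P2' run over XML-encoded language data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the E-MELD system: the ontology fragment, the term names, the placement of "P2" in the hierarchy, the XML element and attribute names, the tagsets, and the languages LangA, LangB, and LangC are all hypothetical. The sketch does not use Protégé or SUMO; it only shows how multiple inheritance lets a query for a general term match data tagged with a more specific one.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology fragment: each term lists its parent terms.
# A term may have more than one parent (multiple inheritance).
# The placement of "P2" is illustrative only.
ONTOLOGY = {
    "MorphosyntacticTerm": [],
    "Case": ["MorphosyntacticTerm"],
    "Clitic": ["MorphosyntacticTerm"],
    "Ergative": ["Case"],
    "Absolutive": ["Case"],
    "ErgativeClitic": ["Ergative", "Clitic"],
    "P2": ["MorphosyntacticTerm"],
}

# Hypothetical per-dataset tagsets, as an ontology manager might
# interpret them: researcher tag -> ontology term.
TAGSETS = {
    "dataset_a": {"ERG": "Ergative", "2P": "P2"},
    "dataset_b": {"erg": "ErgativeClitic", "p2": "P2"},
    "dataset_c": {"ABS": "Absolutive"},
}

# Hypothetical XML language data; the only assumption the paper states
# is that the data is in XML, so this layout is invented.
DOCUMENTS = {
    "dataset_a": '<data language="LangA"><m tag="ERG"/><m tag="2P"/></data>',
    "dataset_b": '<data language="LangB"><m tag="erg"/><m tag="p2"/></data>',
    "dataset_c": '<data language="LangC"><m tag="ABS"/></data>',
}

# Hypothetical mapping from user query keywords to ontology terms.
QUERY_TERMS = {"ergative": "Ergative", "absolutive": "Absolutive", "p2": "P2"}


def ancestors(term):
    """Return the term together with every term it inherits from."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ONTOLOGY.get(current, []))
    return seen


def terms_in_document(dataset_id):
    """Return (language name, ontology terms attested in the document)."""
    root = ET.fromstring(DOCUMENTS[dataset_id])
    tagset = TAGSETS[dataset_id]
    found = set()
    for element in root.iter():
        tag = element.get("tag")
        if tag in tagset:
            # A specific term also counts as evidence for its ancestors.
            found |= ancestors(tagset[tag])
    return root.get("language"), found


def query(text):
    """Return languages whose data attests every keyword in the query."""
    wanted = {QUERY_TERMS[word.lower()] for word in text.split()}
    matches = []
    for dataset_id in DOCUMENTS:
        language, found = terms_in_document(dataset_id)
        if wanted <= found:
            matches.append(language)
    return sorted(matches)


if __name__ == "__main__":
    print(query("ergative P2"))
```

In this sketch LangB is tagged only with the more specific hypothetical term "ErgativeClitic", yet it still matches a query for "ergative" because the lookup walks up the inheritance links. Adding a new data set amounts to adding one tagset mapping and one document, which mirrors the maintenance step the paper describes.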


