Cover Pages: Electronic Metastructure for Endangered Languages Data (EMELD)

[February 12, 2002] The EMELD project is sponsored by the National Science Foundation (NSF) and is based at the Linguist List at Wayne State and Eastern Michigan Universities; subcontracts are held by the Linguistic Data Consortium at the University of Pennsylvania, the Endangered Languages Fund, and the University of Arizona. The project has working groups for Language Codes, Markup, and Metadata. The primary goal of EMELD is to establish and promote consensus about an "archive infrastructure" appropriate for "members of the scientific community who are faced with two urgent situations: (1) the number of languages in the world is rapidly diminishing while (2) the number of initiatives to digitize language data is rapidly multiplying... EMELD addresses the need for collaboration among archivists, field linguists, and language engineers. [Apart from such collaboration,] (1) a common standard for the digitization of linguistic data may never be agreed upon; and the resulting variation in archiving practices and language representation would seriously inhibit data access, searching, and cross-linguistic comparison; (2) standards may be set without guidance from descriptive linguists, the people who best know the range of structural possibilities in human language..." [adapted from the home page]

As of December 2001, the EMELD Project had "instantiated a full database of language codes on its site, and has defined a set of approximately 200 ancient languages, as well as some 20 constructed languages, all of which have been assigned codes and brief descriptions..."

A paper presented at the IRCS Workshop on Linguistic Databases (December 2001) describes "the beginning of an effort within the Linguist List's Electronic Metastructure for Endangered Languages Data (E-MELD) project to develop markup recommendations for representing the morphosyntactic structures of the world's endangered languages. Rather than proposing specific markup recommendations as in the Text Encoding Initiative (TEI), we propose to construct an environment for comparing data sets using possibly different markup schemes. The central feature of our proposed environment is an ontology of morphosyntactic terms with multiple inheritance and a variety of relations holding among the terms. We are developing our ontology using the Protégé editor, and are extending an existing upper-level ontology known as SUMO... The paper describes the first stage in reaching the second of these goals. Our decision to begin work on the analysis of morphosyntactic terms was based on the recommendations of a markup work group that the Linguist List organized at the Language Digitization Workshop in Santa Barbara, June 21-24, 2001. That group divided the task of developing markup recommendations into several problem areas, and identified morphosyntactic markup as the first problem to be tackled... The architecture for the envisioned system is given [in Figure 4]. The three major components of the E-MELD system are (1) the graphical user interface (GUI), (2) the knowledge base (containing the ontology and query engine), and (3) the database of endangered languages marked up in XML format. The end user will be able to access the E-MELD system via the World Wide Web as the knowledge base and language data will reside together at a remote site. The user may pose queries to the knowledge base in standard search engine format (similar to that of Yahoo or Google). 
For example, the query 'ergative P2' will return a list of languages and/or actual language data from P2 languages containing ergative constructions. The only requirement that is required is that the documents containing the individual language data be in XML format. The query engine will have access to XML metadata and all language data in each file. Once the envisioned system is implemented only minimal maintenance will be required to add additional language data. Adding new data sets merely requires the ontology manager to interpret the researcher's tagset and to incorporate it into the existing ontology..." [see below: William Lewis, Scott Farrar, D. Terence Langendoen]
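The idea in the quoted passage (an ontology with multiple inheritance, per-project tagsets interpreted against it, and concept queries over XML data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the E-MELD project: the concept names, the `ONTOLOGY` and `TAGSETS` tables, the `<text>`/`<m>` XML layout, the placeholder language names, and the functions `ancestors`, `concepts_in`, and `query`. It handles only the concept part of a query such as 'ergative' and does not attempt to model the 'P2' restriction.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology fragment: each concept maps to its parent concepts.
# "ErgativeCase" has two parents, which illustrates multiple inheritance.
ONTOLOGY = {
    "MorphosyntacticFeature": [],
    "Case": ["MorphosyntacticFeature"],
    "AlignmentMarker": ["MorphosyntacticFeature"],
    "ErgativeCase": ["Case", "AlignmentMarker"],
    "AbsolutiveCase": ["Case", "AlignmentMarker"],
}

# Hypothetical tagsets: each project's own labels are interpreted once
# and mapped onto ontology concepts.
TAGSETS = {
    "projectA": {"ERG": "ErgativeCase", "ABS": "AbsolutiveCase"},
    "projectB": {"erg": "ErgativeCase"},
}

# Placeholder documents in an invented XML layout.
DOCS = [
    '<text language="Language X" tagset="projectA">'
    '<m gloss="ERG">m1</m><m gloss="ABS">m2</m></text>',
    '<text language="Language Y" tagset="projectB"><m gloss="erg">m3</m></text>',
    '<text language="Language Z" tagset="projectA"><m gloss="ABS">m4</m></text>',
]


def ancestors(concept):
    """Return the concept together with every concept it inherits from."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(ONTOLOGY[current])
    return seen


def concepts_in(xml_text):
    """Return (language, set of ontology concepts attested in the document)."""
    root = ET.fromstring(xml_text)
    tagset = TAGSETS[root.get("tagset")]
    found = set()
    for morpheme in root.iter("m"):
        concept = tagset.get(morpheme.get("gloss"))
        if concept is not None:
            found |= ancestors(concept)
    return root.get("language"), found


def query(concept, documents):
    """List the languages whose data contains the concept or a subtype of it."""
    results = []
    for doc in documents:
        language, found = concepts_in(doc)
        if concept in found:
            results.append(language)
    return sorted(results)


print(query("ErgativeCase", DOCS))  # ['Language X', 'Language Y']
print(query("Case", DOCS))          # ['Language X', 'Language Y', 'Language Z']
```

In this sketch, adding a new data set amounts to adding one entry to `TAGSETS`; the documents themselves stay unchanged, which mirrors the maintenance claim in the quoted passage.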

References:

Main EMELD website
Auxiliary EMELD website
EMELD Proposal
[Initial] "Workshop on The Digitization of Language Data: The Need for Standards." Santa Barbara, California, June 21-24, 2001
"The E-MELD Project." By Anthony Aristar and Helen Aristar-Dry. Pages 11-16 in Proceedings of the IRCS Workshop on Linguistic Databases (11-13 December 2001, University of Pennsylvania, Philadelphia, USA; Organized by Steven Bird, Peter Buneman and Mark Liberman; Funded by the National Science Foundation).
EMELD Language Lookup Pages:
- Ancient and extinct languages
- Constructed languages
- All languages [look up a language, family, or language code; searches both Ethnologue and LINGUIST databases]
"Building a Knowledge Base of Morphosyntactic Terminology." By William Lewis (Department of Linguistics, University of Arizona), Scott Farrar, and D. Terence Langendoen. Pages 150-156 in Proceedings of the IRCS Workshop on Linguistic Databases (11-13 December 2001, University of Pennsylvania, Philadelphia, USA; Organized by Steven Bird, Peter Buneman and Mark Liberman; Funded by the National Science Foundation).
[February 02, 2002] "Morpho-Syntax Ontology. Concept Hierarchy for E-MELD Ontology Project." [Reported] By Scott Farrar (University of Arizona, Tucson, AZ, USA). "The ontology for morpho-syntax terms is now ready for a first review. Keep in mind this is a first attempt and that we're constantly revising and open to critique by the group... The following represents an effort by E-MELD to create an ontology of linguistic concepts. Thus far, only concepts specific to the domain of morpho-syntax has been addressed..."
See also: "Language Identifiers in the Markup Context" - Main reference page.

