Hand-to-Hand Wrestling with Small Linguistic Corpora

Arienne M. Dwyer

Universität Mainz

Keywords: linguistic corpora, encoding, databases


This is a description of a small corpus linguistics project from start to (well, nearly) finish, undertaken entirely by one not particularly computationally adept linguist. I focus not just on the end results, but also on the various options and obstacles encountered at each stage of the project. As such, it illustrates in a nutshell the strengths and weaknesses of some of the current data- and text-management methodologies and tools for non-specialists. It may also provide a useful model for other linguists and linguistic anthropologists facing similar issues.

With such an illustrative example I hope to (1) show that a properly constrained corpus linguistics project does not necessarily have to have a huge team of computer scientists to be realized. I also hope to initiate a discussion which (2) looks at ways of bridging the gap between information specialists and information gatherers/analyzers, and which also (3) looks at ways of making linguistic computing applications and methodologies more accessible to non-(computing) specialists.


Much of academic work has become specialized to the point where one person collects the data (a field worker or a text historian), another processes it (an information/data specialist), and still another consumes it (other academics). In general, field workers or textual scholars know little about computing, while data specialists know little about the significance of the data. The two are, however, interdependent: the data specialist depends on the scholar to do all the analytical markup, and the linguist/text scholar depends on the data specialist to make the data consistent, accessible, and manipulable.

As a data gatherer and analyzer, I want my materials in a variety of formats, to be accessible to people in a variety of fields (e.g. linguistic anthropology, folklore, history), as well as to the speakers of the language. Secure long-term storage of the data is also crucial. This paper, then, follows the trail of one linguist bumbling through the thickets of text encoding and relational database management.

An Example of a Small Linguistic Corpus: The Salar Project

Salar is an unwritten Turkic language spoken primarily in northern Tibet. The corpus of ca. 40,000 words consists of transcriptions of field recordings of a variety of spoken language forms: conversations, stories, speeches, and ballads.

The aims of the project were: (1) to archive such examples of language use as separate texts; (2) to put the data in a queryable format (e.g., a database) in order to extract linguistic information; (3) to design a dictionary with an electronic interface, based on this queryable format; and (4) to write a usage-based grammar of the language, based on examples pulled out of the data.

Linguistic analysis, here broadly conceived, includes not only phonological, morphological, and grammatical analysis, but also sociolinguistic analysis. It aims above all to examine the usage of language on the basis of spoken discourse (rather than on an idealized or standardized language), in order to reveal the patterns or schemata by which speakers construct language[1].

Linguistic markup requirements include phonological features (phonemes in the International Phonetic Alphabet (IPA); secondary articulations such as preaspiration, devoicing, and epenthesis), prosodic features (stress, intonation contours, length of pauses), morphosyntactic features (derivational suffixation, cliticization, government, etc.), and etymological features. Metalinguistic markup requirements include a reference number, title, speaker(s) (with cross-references to speaker age, gender, and education level), recording date, locale, and any mid-text switch of speakers.
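The metalinguistic requirements just listed can be pictured as a small record structure. What follows is only a minimal sketch in Python, not the project's actual data model: the class names, field names, and sample values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Speaker:
    """One speaker, with the cross-referenced attributes named in the text."""
    speaker_id: str
    age: Optional[int] = None
    gender: Optional[str] = None
    education: Optional[str] = None


@dataclass
class SpeakerTurn:
    """Marks the utterance index at which a given speaker takes over."""
    start_utterance: int
    speaker_id: str


@dataclass
class TextRecord:
    """Metalinguistic information for one recorded text (hypothetical layout)."""
    ref_no: str
    title: str
    recording_date: date
    locale: str
    speakers: List[Speaker] = field(default_factory=list)
    turns: List[SpeakerTurn] = field(default_factory=list)

    def speaker_at(self, utterance: int) -> Optional[Speaker]:
        """Return the speaker active at the given utterance index, if any."""
        current_id = None
        for turn in sorted(self.turns, key=lambda t: t.start_utterance):
            if turn.start_utterance <= utterance:
                current_id = turn.speaker_id
        by_id = {s.speaker_id: s for s in self.speakers}
        return by_id.get(current_id)


# Placeholder values only; none of this is real project data.
record = TextRecord(
    ref_no="T001",
    title="placeholder title",
    recording_date=date(2000, 1, 1),
    locale="placeholder locale",
    speakers=[Speaker("S1", age=60, gender="f"), Speaker("S2", age=35, gender="m")],
    turns=[SpeakerTurn(0, "S1"), SpeakerTurn(4, "S2")],
)
```

Recording speaker switches as turn boundaries, rather than repeating the speaker on every utterance, keeps a mid-text change of speaker explicit and easy to query.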


Linguistic corpora from unwritten languages present a different set of problems from a typical legacy text analysis project. There is much more information packed into an audio recording of speech than there is in a written text. Recordings can be transcribed and re-transcribed into text in a variety of ways based on the interests of the transcriber. But once the transcriptions are encoded as primary data in a corpus, they take on an almost unjustified permanence.

Some Central Asian epics, including those of the Salars, feature a nine-headed monster called a Mangus. When wrestling with a linguistic corpus, the core issues of data representation, text encoding, and text searching often raise their fearsome heads before the unwary field linguist. Each of these is discussed in turn with examples drawn from the Salar project:

Data representation

Consists of self-designed International Phonetic Alphabet fonts, mapped according to Unicode mapping tables (The Unicode Standard, v.1, v.2, The Unicode Consortium).
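To make the idea of a font-to-Unicode mapping concrete, here is a minimal Python sketch. The legacy byte values on the left are invented for illustration and do not describe the project's actual fonts; the Unicode code points on the right are genuine IPA-related characters.

```python
import unicodedata

# HYPOTHETICAL legacy-font byte values mapped to real Unicode characters.
LEGACY_TO_UNICODE = {
    0x80: "\u0259",  # LATIN SMALL LETTER SCHWA
    0x81: "\u0283",  # LATIN SMALL LETTER ESH
    0x82: "\u014B",  # LATIN SMALL LETTER ENG
    0x83: "\u02B0",  # MODIFIER LETTER SMALL H (aspiration)
    0x84: "\u0325",  # COMBINING RING BELOW (voicelessness)
}


def legacy_to_unicode(data: bytes) -> str:
    """Convert text typed in the (hypothetical) legacy font to Unicode."""
    out = []
    for b in data:
        if b in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[b])
        elif b < 0x80:
            out.append(chr(b))  # plain ASCII passes through unchanged
        else:
            # Fail loudly rather than silently corrupting a transcription.
            raise ValueError(f"unmapped legacy code 0x{b:02X}")
    return "".join(out)


def describe(text: str) -> list:
    """List the official Unicode name of each character, for proofreading."""
    return [unicodedata.name(ch) for ch in text]
```

Calling `describe(legacy_to_unicode(b"\x81\x80"))` lists the character names, which is a convenient way to check a converted transcription against the mapping table by eye.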

Text encoding

Based on TEI-conformant SGML (The Text Encoding Initiative, P3). I weigh its advantages and disadvantages, including, for example, the large amounts of metatextual information required in the header of such a spoken text (e.g. speaker, locale, and recording information).

Text searching

Two approaches are demonstrated and assessed here: (1) text is chunked into a sample Foxprow relational database with a dictionary-like queryable front end[2]; (2) text files marked up in SGML are queried via the MULTEXT Project's query language interpreter MtSgmlQL (CNRS et al. 1996, cf. www.lpl.univ-aix.fr/projects/multext/mql1.html).
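The first approach, chunking text into a relational database with a dictionary-like lookup, can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a stand-in for the database software named above; the table layout, column names, and the placeholder forms and glosses are assumptions for illustration, not the project's schema or data.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with placeholder rows (not real Salar data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE texts (
            text_id TEXT PRIMARY KEY,
            title   TEXT NOT NULL,
            genre   TEXT NOT NULL
        );
        CREATE TABLE tokens (
            token_id     INTEGER PRIMARY KEY,
            text_id      TEXT NOT NULL REFERENCES texts(text_id),
            utterance_no INTEGER NOT NULL,
            position     INTEGER NOT NULL,
            form         TEXT NOT NULL,
            gloss        TEXT NOT NULL
        );
        CREATE INDEX idx_tokens_form ON tokens(form);
        """
    )
    conn.executemany(
        "INSERT INTO texts VALUES (?, ?, ?)",
        [("T001", "placeholder story", "story"),
         ("T002", "placeholder conversation", "conversation")],
    )
    conn.executemany(
        "INSERT INTO tokens (text_id, utterance_no, position, form, gloss) "
        "VALUES (?, ?, ?, ?, ?)",
        [("T001", 1, 1, "form_a", "gloss_a"),
         ("T001", 1, 2, "form_b", "gloss_b"),
         ("T002", 3, 1, "form_a", "gloss_a")],
    )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, form: str) -> list:
    """Return every attestation of a word form, in dictionary-entry order."""
    cur = conn.execute(
        """
        SELECT t.text_id, t.genre, k.utterance_no, k.position, k.gloss
        FROM tokens AS k
        JOIN texts AS t ON t.text_id = k.text_id
        WHERE k.form = ?
        ORDER BY t.text_id, k.utterance_no, k.position
        """,
        (form,),
    )
    return cur.fetchall()
```

A call such as `lookup(build_demo_db(), "form_a")` returns one row per attestation, each carrying its text reference and genre, which is roughly what a dictionary-like front end would display under a headword.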

I compare the range of queries possible in both formats, as well as their ease of use. In demonstrating the two approaches, I discuss certain (linguistic) problems: the concordance (or linking, in the database) of several transcription versions of the text to each other, and to multilingual free translations of the texts (here, in English and Chinese). With regard to the database approach, I discuss whether the average non-computer specialist can make do with general-purpose off-the-shelf database software such as Foxprow or Access, or whether an investment in CELLAR (Summer Institute of Linguistics, 1996) is preferable.
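The linking problem mentioned here, tying several transcription versions of an utterance to each other and to its free translations, lends itself to a shared utterance key. The following is a minimal sketch under that assumption, again using Python's sqlite3 module; the table names, the version labels ("phonetic", "phonemic"), the language codes, and the bracketed placeholder strings are all hypothetical.

```python
import sqlite3


def build_alignment_db() -> sqlite3.Connection:
    """Create an in-memory database holding one placeholder utterance."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE utterances (
            utt_id  TEXT PRIMARY KEY,
            text_id TEXT NOT NULL,
            seq     INTEGER NOT NULL
        );
        CREATE TABLE transcriptions (
            utt_id  TEXT NOT NULL REFERENCES utterances(utt_id),
            version TEXT NOT NULL,
            content TEXT NOT NULL,
            PRIMARY KEY (utt_id, version)
        );
        CREATE TABLE translations (
            utt_id  TEXT NOT NULL REFERENCES utterances(utt_id),
            lang    TEXT NOT NULL,
            content TEXT NOT NULL,
            PRIMARY KEY (utt_id, lang)
        );
        """
    )
    conn.execute("INSERT INTO utterances VALUES ('T001.1', 'T001', 1)")
    conn.executemany(
        "INSERT INTO transcriptions VALUES (?, ?, ?)",
        [("T001.1", "phonetic", "<narrow transcription>"),
         ("T001.1", "phonemic", "<broad transcription>")],
    )
    conn.executemany(
        "INSERT INTO translations VALUES (?, ?, ?)",
        [("T001.1", "en", "<English free translation>"),
         ("T001.1", "zh", "<Chinese free translation>")],
    )
    conn.commit()
    return conn


def aligned_view(conn: sqlite3.Connection, utt_id: str) -> dict:
    """Collect all transcription versions and translations of one utterance."""
    versions = dict(conn.execute(
        "SELECT version, content FROM transcriptions WHERE utt_id = ?",
        (utt_id,)))
    translations = dict(conn.execute(
        "SELECT lang, content FROM translations WHERE utt_id = ?",
        (utt_id,)))
    return {"utt_id": utt_id,
            "transcriptions": versions,
            "translations": translations}
```

The two lookups are kept separate on purpose: a single three-way join would return every transcription version paired with every translation, multiplying the rows without adding information.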


1. The above example of the Salar corpora suggests that it is necessary to store and present such linguistic (and/or anthropological/folkloristic) data in a variety of formats. The queryable SGML file and/or the SGML-based database is one prototype data-analysis/management system for other linguistic corpora.

2. Departments with doctoral programs in linguistic anthropology and linguistics would do well to (1) offer a specific computing course tailored to the needs of linguists, perhaps in conjunction with the university's humanities computing center or computer science department; and (2) require their (post-)graduate students to undertake a very small sample project (e.g. with a collection of ten utterances) using humanities computing tools.

3. Individual scholars need more training opportunities in humanities computing; we need more, and more varied, specialized courses such as the excellent CETH seminar. (The latter unfortunately lacks a corpus linguistics track.)

4. Individual scholars need to forge connections with data specialists to learn methodologies and tools specific to their project. This will improve the quality of the project itself and contribute to the development of better tools and methods (or at least the refinement of existing ones). (One reason data-gatherers-and-analyzers (scholars) are rarely motivated to stand shoulder-to-shoulder with data specialists and seize the proverbial means of production is of course the current tenure system, which at present usually declines to recognize work in electronic form as legitimate publication.)


