Keywords: linguistic corpora, encoding, databases
With such an illustrative example I hope to (1) show that a properly constrained corpus linguistics project does not necessarily require a large team of computer scientists to be realized. I also hope to initiate a discussion that (2) looks at ways of bridging the gap between information specialists and information gatherers/analyzers, and (3) looks at ways of making linguistic computing applications and methodologies more accessible to non-(computing) specialists.
As a data gatherer and analyzer, I want my materials in a variety of formats, to be accessible to people in a variety of fields (e.g. linguistic anthropology, folklore, history), as well as to the speakers of the language. Secure long-term storage of the data is also crucial. This paper, then, follows the trail of one linguist bumbling through the thickets of text encoding and relational database management.
The aims of the project were: (1) to archive such examples of language use as separate texts; (2) to put the data in a queryable format (e.g., a database) from which linguistic information can be extracted; (3) to design a dictionary with an electronic interface, based on this queryable format; and (4) to write a usage-based grammar of the language, based on examples drawn from the data.
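As a concrete sketch of aims (1) and (2), a minimal relational layout might store each archived text once and key every utterance to it. The database name, tables, and columns below are hypothetical illustrations (shown with Python's built-in sqlite3 module), not the project's actual design:

    import sqlite3

    # Minimal sketch: one row per archived text, one row per utterance.
    # All names here are illustrative, not the Salar project's schema.
    conn = sqlite3.connect("salar_corpus.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS texts (
        text_id        INTEGER PRIMARY KEY,
        title          TEXT,
        speaker        TEXT,
        recording_date TEXT,
        locale         TEXT
    );
    CREATE TABLE IF NOT EXISTS utterances (
        utt_id        INTEGER PRIMARY KEY,
        text_id       INTEGER REFERENCES texts(text_id),
        transcription TEXT,  -- e.g. an IPA transcription
        gloss         TEXT   -- morpheme-by-morpheme gloss
    );
    """)
    conn.commit()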
Linguistic analysis, broadly conceived here, includes not only phonological, morphological, and grammatical analysis, but also sociolinguistic analysis. It aims above all to examine language use on the basis of spoken discourse (rather than of an idealized or standardized language), in order to reveal the patterns or schemata by which speakers construct language[1].
Linguistic markup requirements include phonological features (phonemes in the International Phonetic Alphabet (IPA), and secondary articulations such as preaspiration, devoicing, and epenthesis), prosodic features (stress, intonation contours, length of pauses), morphosyntactic features (derivational suffixation, cliticization, government, etc.), and etymological features. Metalinguistic markup requirements include a reference number, title, speaker(s) (with cross-references to speaker age, gender, and education level), recording date, locale, and any mid-text switch of speakers.
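To give these requirements concrete shape, the sketch below assembles a TEI-flavored record for a single utterance in Python; every element, attribute, and value (speaker code, reference number, lemma, pause duration) is a hypothetical stand-in for whatever encoding scheme a project actually adopts:

    import xml.etree.ElementTree as ET

    # Hypothetical TEI-flavored markup for one utterance; all names and
    # values are illustrative, not the Salar project's actual scheme.
    u = ET.Element("u", {"who": "SP1", "n": "035.12"})  # speaker and reference number
    w = ET.SubElement(u, "w", {"lemma": "gel-", "etym": "Turkic"})  # etymological note
    w.text = "geldi"  # transcription (an IPA form would go here)
    ET.SubElement(u, "pause", {"dur": "0.8s"})  # prosodic feature: pause length

    print(ET.tostring(u, encoding="unicode"))
    # -> <u who="SP1" n="035.12"><w lemma="gel-" etym="Turkic">geldi</w><pause dur="0.8s" /></u>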
Some Central Asian epics, including those of the Salars, feature a nine-headed monster called a Mangus. When wrestling with a linguistic corpus, the core issues of data representation, text encoding, and text searching often raise their fearsome heads before the unwary field linguist. Each of these is discussed in turn with examples drawn from the Salar project:
I compare the range of queries possible in both formats, as well as their ease of use. In demonstrating the two approaches, I discuss certain (linguistic) problems: the concordance (or linking, in the database) of several transcription versions of a text to each other, and to multilingual free translations of the texts (here, in English and Chinese). With regard to the database approach, I discuss whether the average non-computer specialist can make do with general-purpose off-the-shelf database software such as FoxPro or Access, or whether an investment in CELLAR (Summer Institute of Linguistics, 1996) is preferable.
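To suggest how such linking might be realized in any of these packages (shown here again with sqlite3 as a neutral stand-in, and with hypothetical table names), each transcription version and each free translation can point back to the same utterance, so that a single join lines up all parallel renderings:

    import sqlite3

    conn = sqlite3.connect("salar_corpus.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS versions (
        utt_id  INTEGER,  -- links back to utterances(utt_id)
        kind    TEXT,     -- e.g. 'ipa', 'practical-orthography'
        content TEXT
    );
    CREATE TABLE IF NOT EXISTS translations (
        utt_id  INTEGER,
        lang    TEXT,     -- 'en' or 'zh'
        content TEXT
    );
    """)

    # One join retrieves every transcription version beside its English translation.
    rows = conn.execute("""
        SELECT v.utt_id, v.kind, v.content, t.content AS english
        FROM versions v
        JOIN translations t ON t.utt_id = v.utt_id AND t.lang = 'en'
        ORDER BY v.utt_id
    """).fetchall()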
2. Departments with doctoral programs in linguistic anthropology and linguistics would do well to (1) offer a specific computing course tailored to the needs of linguists, perhaps in conjunction with the university's humanities computing center or computer science department; and (2) require their (post-)graduate students to undertake a very small sample project (e.g. with a collection of ten utterances) using humanities computing tools.
3. Individual scholars need more training opportunities in humanities computing; we need more, and more varied, specialized courses such as the excellent CETH seminar. (The latter unfortunately lacks a corpus linguistics track.)
4. Individual scholars need to forge connections with data specialists to learn methodologies and tools specific to their project. This will improve the quality of the project itself and contribute to the development of better tools and methods (or at least the refinement of existing ones). (One reason data gatherers and analyzers (scholars) are rarely motivated to stand shoulder-to-shoulder with data specialists and seize the proverbial means of production is, of course, the current tenure system, which at present usually declines to recognize work in electronic form as legitimate publication.)