[Archive copy mirrored from the URL: http://www.qucis.queensu.ca/achallc97/papers/p043.html; see this canonical version of the document.]

Paper

Encoding and Parallel alignment of linguistic corpora in six Central and Eastern European Languages

Tomaz Erjavec

Institute "Jozef Stefan", Ljubljana (Slovenia)
tomaz.erjavec@ijs.si

Nancy Ide

Vassar College and Laboratoire Parole et Langage/CNRS
ide@cs.vassar.edu or ide@univ-aix.fr

Dan Tufis

Research Institute for Informatics, Bucharest (Romania)
tufis@ns.ici.ro

Keywords: corpora, encoding, SGML

Introduction

The language industries rely increasingly on the availability of large-scale language resources, appropriate software tools, and standards that make them maximally reusable. Such resources and tools exist or are under development for most Western languages, and efforts to develop standards for corpus encoding and linguistic software development are well underway, in particular in the LRE project MULTEXT [1], one of the largest EU projects in the domain of language tools and resources (Ide and Véronis, 1994).

However, there have been no comparable efforts for Central and Eastern European (CEE) languages. No large-scale, systematic attempts at corpus collection currently exist (in particular for multilingual, parallel corpora in these languages); tools specifically adapted to corpora in CEE languages are not widely available; and most standardization efforts have not yet taken into account the specific characteristics of CEE languages.

MULTEXT-EAST [2] is a spin-off of the LRE project MULTEXT which is intended to fill these gaps by developing significant resources for six CEE languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovenian) and adapting existing tools and standards to them. MULTEXT-EAST extends MULTEXT's scope to CEE languages with the following goals:

This paper describes the Multext-East corpus and its development, including encoding concerns, especially as they arose from the need to handle languages for which encoding practices are not yet well established. It will also report on the special encoding concerns that arose in the process of aligning parallel texts in the six languages plus English.

The Multext-East corpus

MULTEXT has developed a Corpus Encoding Standard (CES) [3] optimally suited for use in corpus linguistics and language engineering applications, which can serve as a widely accepted set of encoding standards for European corpus work. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), based on and in broad agreement with the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994; see also Ide and Véronis, 1995a). The standard identifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding conventions for more extensive encoding of linguistic corpora, and for linguistic annotation.

MULTEXT-EAST has applied the CES to texts in six CEE languages, including fiction and newspaper data. The experience of applying the CES to these new languages has led to a major revision and extension of the CES, in particular to handle the required additional character sets. In addition, the lack of substantial pre-existing texts in some electronic format in the Eastern European countries, and the resulting need to develop many corpora based on printed materials only, has made it necessary to consider the kinds of markup that can or should be included and the optimal stages of markup enhancement when corpora are generated in this way.

Corpus composition

MULTEXT-EAST has built an annotated multilingual corpus, composed of material comparable to MULTEXT's, whose primary goal is to provide an example and test-bed for:

The corpus has been prepared in TEI-conformant SGML format and annotated for basic structural features as well as sub-paragraph segmentation, part of speech, and alignment of parallel texts.

The corpus is composed of three major parts:

  1. Multilingual Comparable Corpus

    For each of the six MULTEXT-EAST languages, the comparable corpus will include two subsets of at least 100,000 words each, consisting of

    The data is comparable across the six languages, in terms of the number and size of texts. The entire multilingual comparable corpus has been prepared in CES format, manually or using ad-hoc tools, and has been automatically annotated for tokenization, sentence boundaries, and part of speech annotation using the project tools.

  2. Multilingual Parallel Corpus

    For the six MULTEXT-EAST languages, the parallel corpus includes approximately 100,000 words per language, consisting of translations of Orwell's Nineteen Eighty-Four. The entire multilingual parallel corpus has also been prepared in CES conformant format. For each language, half of the corpus has been marked and validated for alignment and sentence boundaries. Alignment is between the English version and each of the six MULTEXT-EAST languages, thus constituting six pair-wise alignments.

  3. Multilingual Speech Corpus

    MULTEXT-EAST has recorded a small corpus of spoken texts in each of the six languages, similar to the EUROM-1 speech corpus, comprising forty short passages of five thematically connected sentences, each spoken by several native speakers, with phonemic and orthographic transcriptions. MULTEXT-EAST has enhanced this spoken corpus with markup for prosody. The prosody markup consists of two levels: F0 curve modeling and symbolic coding.

    Orwell's Nineteen Eighty-Four

    For the purposes of this short abstract, we will outline our work on Orwell's Nineteen Eighty-Four; the final paper will provide details concerning the entire corpus.

    The novel Nineteen Eighty-Four by George Orwell is the central component of the MULTEXT-East corpus: it is the parallel text, where the English original is sentence aligned with the six languages of the project, and each translation tagged for part-of-speech. Despite the small size of this parallel corpus (7 x 100k words), it nevertheless constitutes a valuable linguistic resource for the MULTEXT-East languages, especially because the project has also delivered lexica which cover the word-forms of Nineteen Eighty-Four. Furthermore, through collaboration with the TELRI project and others, the corpus is likely to be extended to cover additional Central and Eastern European languages in the near future.

    Numerous encoding decisions were made in the process of encoding the novel, some of which were influenced by the ultimate goal of tagging (for part of speech) and aligning the parallel versions. In other cases, decisions were driven by cost considerations: among the versions of Nineteen Eighty-Four that were encoded in the six languages, many had previously existed in electronic form using some other encoding system (e.g., typesetter's tapes), usually providing only rendition information. In one case (Bulgarian) the entire novel was typed in by hand from a previously printed text. For versions that already existed in electronic form, a process of "up-translation" was applied to convert rendition information to descriptive CES encoding; our aim was to perform as much of this conversion as possible by automatic means, in order to reduce costs. In addition, for the purposes of automatic alignment of the parallel texts, it was important that the CES structural markup of Nineteen Eighty-Four be similar across the languages, which also dictated that certain encodings were viable while others were not.
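
As a rough illustration of what automatic up-translation involves, the following Python sketch rewrites rendition codes as descriptive tags. The input notation (`@I{...}`, `@CAPS{...}`) and the mapping table are hypothetical and do not reproduce the format of any typesetter's tape used in the project; only the target elements `<name>` and `<q>` are taken from the markup shown in this paper. Unknown codes are left in place, since rendition information is often ambiguous and may need manual review.

```python
import re

# Hypothetical rendition codes mapped to descriptive tags.
# This table is an assumption for illustration, not the project's mapping.
RENDITION_TO_TAG = {
    "I": ("<name>", "</name>"),                  # italic span -> name
    "CAPS": ("<q type=slogan>", "</q>"),         # centred capitals -> slogan
}

# Matches a non-nested span of the form @CODE{content}.
CODE_SPAN = re.compile(r"@([A-Z]+)\{([^{}]*)\}")


def up_translate(text):
    """Replace each @CODE{...} span with descriptive markup.

    Spans whose code is not in the table are returned unchanged so that
    they can be inspected by hand.
    """
    def repl(match):
        code, content = match.group(1), match.group(2)
        if code not in RENDITION_TO_TAG:
            return match.group(0)
        open_tag, close_tag = RENDITION_TO_TAG[code]
        return open_tag + content + close_tag

    return CODE_SPAN.sub(repl, text)
```

A table-driven design of this kind keeps the language-specific knowledge in data rather than in code, which is one way to keep the resulting structural markup similar across languages.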

    A few of the encoding decisions were as follows:

    The following is an example from the Slovene Nineteen Eighty-Four, with markup down to the paragraph level only:

    <p>
    Ministrstvo resnice &mdash; <name>Minires</name> v
    <name>Novoreku</name><ptr n=1 rend="*" target=N1> &mdash; se je
    osupljivo lo&ccaron;ilo od kateregakoli predmeta. Bilo je velikanska
    piramidasta zgradba iz ble&scaron;&ccaron;e&ccaron;e belega betona, ki
    je v terasah kipela kvi&scaron;ku, tristo metrov visoko v zrak. S
    kraja, kjer je stal <name>Winston</name>, se je ravno &scaron;e dalo
    prebrati tri partijske parole, ki so se v lepih &ccaron;rkah
    odra&zcaron;ale z belega pro&ccaron;elja:
    <q rend="CN CP" type=slogan>VOJNA JE MIR</q>
    <q rend="CN CP" type=slogan>SVOBODA JE SU&Zcaron;ENJSTVO</q>
    <q rend="CN CP" type=slogan>NEVEDNOST JE MO&Ccaron;</q>
    </p>
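
The example above keeps the text in plain ASCII by writing Slovene letters as SGML character entity references. For readers who want to view such text in Unicode, the following Python sketch expands the entities that occur in the example. The table is our own and covers only those names (which appear to follow the ISO public entity sets); it is not the project's conversion tool, and unknown references are deliberately left untouched.

```python
import re

# Entity names occurring in the example above, mapped to Unicode.
# The table is an illustrative subset, not a complete entity set.
ENTITIES = {
    "ccaron": "\u010d", "Ccaron": "\u010c",   # c with caron
    "scaron": "\u0161", "Scaron": "\u0160",   # s with caron
    "zcaron": "\u017e", "Zcaron": "\u017d",   # z with caron
    "mdash": "\u2014",                        # em dash
}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Expand known entity references; leave unknown ones as they are."""
    def repl(match):
        return ENTITIES.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(repl, text)
```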
    

    Aligning Orwell's Nineteen Eighty-Four

    The automatic alignment programs used to align the seven versions of Orwell's Nineteen Eighty-Four require that each of the aligned versions contain the same number of paragraphs. We aligned each of the six versions in the MULTEXT-East languages to the English (six pair-wise alignments), and it was therefore necessary to ensure that the paragraph-level elements in each of these versions matched those in the English version.

    Obviously, a matching of this kind is not immediately possible, since enormous discrepancies in paragraphing arise when a text--especially a work of fiction such as Orwell's novel--is translated into other languages. For example, the opening four paragraphs of the first chapter of the English Nineteen Eighty-Four correspond to a single paragraph in the Romanian version. On the other hand, one paragraph in the English version in places spans several paragraphs in the Bulgarian.
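
A simple way to locate such discrepancies before running an aligner is to compare paragraph counts chapter by chapter. The Python sketch below assumes each version has already been read into a dictionary from a chapter identifier to a list of paragraph strings; that data layout and the function name are hypothetical and chosen only for illustration.

```python
def paragraph_mismatches(english, translation):
    """Report chapters whose paragraph counts differ.

    Both arguments map a chapter identifier to a list of paragraphs.
    Returns {chapter: (count_in_english, count_in_translation)} for
    every chapter where the two counts are not equal.
    """
    mismatches = {}
    for chapter in sorted(set(english) | set(translation)):
        n_en = len(english.get(chapter, []))
        n_tr = len(translation.get(chapter, []))
        if n_en != n_tr:
            mismatches[chapter] = (n_en, n_tr)
    return mismatches
```

A chapter present in only one version is reported with a count of zero on the other side, so missing chapters surface in the same report.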

    In order to accommodate the needs of our aligner, it was necessary to devise a system for inserting and properly identifying "dummy" paragraphs, for cases where additional paragraphs had to be inserted in one of the translations; at the same time, it was necessary to devise a means of temporarily "eliminating" a paragraph in some cases. This requirement placed some additional demands on our markup scheme and, in particular, on our system for providing unique identifiers for each element in the corpus.
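
The abstract does not spell out the scheme that was adopted, so the following Python sketch shows only one possible approach to the insertion case, under stated assumptions. It assumes that corresponding blocks of paragraphs have already been identified by hand, pads the shorter side of each block with empty dummy paragraphs, and gives each dummy an identifier derived from the preceding real paragraph so that the identifiers of real paragraphs never change. The identifier format, the `Para` class and the function names are all hypothetical; temporary elimination of paragraphs is not covered.

```python
from dataclasses import dataclass


@dataclass
class Para:
    id: str
    text: str
    dummy: bool = False


def _pad_side(blocks, widths, prefix):
    """Number real paragraphs sequentially and append dummies per block."""
    paras, real = [], 0
    for texts, width in zip(blocks, widths):
        for text in texts:
            real += 1
            paras.append(Para(f"{prefix}.p{real}", text))
        # Dummies take the id of the last real paragraph plus a suffix.
        for k in range(1, width - len(texts) + 1):
            paras.append(Para(f"{prefix}.p{real}.d{k}", "", dummy=True))
    return paras


def pad_to_match(groups, prefix_a="en", prefix_b="ro"):
    """groups: list of (texts_a, texts_b) blocks judged to correspond.

    Returns two lists of Para with the same length.
    """
    widths = [max(len(a), len(b)) for a, b in groups]
    side_a = _pad_side([a for a, _ in groups], widths, prefix_a)
    side_b = _pad_side([b for _, b in groups], widths, prefix_b)
    return side_a, side_b


def strip_dummies(paras):
    """Recover the original paragraph sequence after alignment."""
    return [p for p in paras if not p.dummy]
```

Because dummies are flagged and carry distinct identifiers, they can be removed after alignment without renumbering anything else.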

    Summary

    The final paper will provide much more detail than is possible in this abstract concerning the corpus and its encoding, as well as the problems and considerations for the part of speech tagging and the alignment of the parallel texts. The experience of developing such a corpus should provide valuable input for others who are developing corpora, especially corpora intended for linguistic analysis and applications.

    References

    [1] Ide, N., and J. Véronis. 1994. "MULTEXT (Multilingual Tools and Corpora)". Proceedings of the 14th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan 1994, 90-96. Project description, documentation, resources and tools available at http://www.lpl.univ-aix.fr/projects/multext/.

    [2] Erjavec, T., Ide, N., Petkevic, V., and Véronis, J. "Multext-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages". Proceedings of the TELRI Workshop, Tihany, Hungary, 1995, to appear. Project description, documentation and resources available at http://nl.ijs.si/ME/.

    [3] Ide, N. and Véronis, J. A Standard for Encoding Linguistic Corpora. Presented at ALLC/ACH96, Bergen, Norway, June 1996. Documentation available at http://www.cs.vassar.edu/CES/.