Encoding and Parallel alignment of linguistic corpora in six Central and Eastern European Languages

Tomaz Erjavec

Institute "Jozef Stefan", Ljubljana (Slovenia)
tomaz.erjavec@ijs.si

Nancy Ide

Vassar College and Laboratoire Parole et Langage/CNRS
ide@cs.vassar.edu or ide@univ-aix.fr

Dan Tufis

Research Institute for Informatics, Bucharest (Romania)
tufis@ns.ici.ro
Keywords: corpora, encoding, SGML

Introduction

The language industries rely increasingly on the availability of large-scale language resources, appropriate software tools, and standards that make them maximally reusable. Such resources and tools exist or are under development for most Western languages, and efforts to develop standards for corpus encoding and linguistic software development are well underway, in particular in the LRE project MULTEXT [1], one of the largest EU projects in the domain of language tools and resources (Ide and Véronis, 1994).

However, there have been no comparable efforts for Central and Eastern European (CEE) languages. No large-scale, systematic attempts at corpus collection currently exist (in particular for multilingual, parallel corpora in these languages); tools specifically adapted to corpora in CEE languages are not widely available; and most standardization efforts have not yet taken into account the specific characteristics of CEE languages.

MULTEXT-EAST [2] is a spin-off of the LRE project MULTEXT which is intended to fill these gaps by developing significant resources for six CEE languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovenian) and adapting existing tools and standards to them. MULTEXT-EAST extends MULTEXT's scope to CEE languages with the following goals:

- test and adaptation of language standards;
- development of an annotated multilingual corpus;
- development of morpho-lexical resources;
- adaptation of the MULTEXT corpus tools.

This paper describes the MULTEXT-EAST corpus and its development, including encoding concerns, especially those that arose from the need to handle languages for which encoding practices are not yet well established. It also reports on the special encoding concerns that arose in the process of aligning parallel texts in the six languages plus English.

The Multext-East corpus

MULTEXT has developed a Corpus Encoding Standard (CES) [3] optimally suited for use in corpus linguistics and language engineering applications, which can serve as a widely accepted set of encoding standards for European corpus work. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), based on and in broad agreement with the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994; see also Ide and Véronis, 1995a). The standard identifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding conventions for more extensive encoding of linguistic corpora, and for linguistic annotation.

MULTEXT-EAST has applied the CES to texts in six CEE languages, including fiction and newspaper data. The experience of applying the CES to these new languages has led to a major revision and extension of the CES, in particular to handle the required additional character sets. In addition, the lack of substantial pre-existing texts in some electronic format in the Eastern European countries, and the resulting need to develop many corpora based on printed materials only, has made it necessary to consider the kinds of markup that can or should be included and the optimal stages of markup enhancement when corpora are generated in this way.

Corpus composition

MULTEXT-EAST has built an annotated multilingual corpus, composed of material comparable to MULTEXT's, whose primary goal is to provide an example and test-bed for:

- the applicability of MULTEXT's multilingual tools (especially engine-based tools, alignment software, and multilingual extraction tools) to CEE language corpora; and
- the applicability to CEE languages of the TEI Guidelines and MULTEXT's TEI-based corpus markup standard, as well as the MULTEXT-EAGLES pan-European lexical specifications and part-of-speech tagset.

The corpus has been prepared in TEI-conformant SGML format and annotated for basic structural features as well as sub-paragraph segmentation, part of speech, and alignment of parallel texts.

The corpus is composed of three major parts:

Multilingual Comparable Corpus
For each of the six MULTEXT-EAST languages, the comparable corpus includes two subsets of at least 100,000 words each, consisting of
- fiction, comprising a single novel or excerpts from several novels;
- newspapers.
The data is comparable across the six languages in terms of the number and size of texts. The entire multilingual comparable corpus has been prepared in CES format, manually or using ad-hoc tools, and has been automatically annotated for tokenization, sentence boundaries, and part of speech using the project tools.
Multilingual Parallel Corpus
For the six MULTEXT-EAST languages, the parallel corpus includes approximately 100,000 words per language, consisting of translations of Orwell's Nineteen Eighty-Four. The entire multilingual parallel corpus has also been prepared in CES-conformant format. For each language, half of the corpus has been marked and validated for alignment and sentence boundaries. Alignment is between the English version and each of the six MULTEXT-EAST languages, thus constituting six pair-wise alignments.
Multilingual Speech Corpus
MULTEXT-EAST has recorded a small corpus of spoken texts in each of the six languages, similar to the EUROM-1 speech corpus, comprising forty short passages of five thematically connected sentences, each spoken by several native speakers, with phonemic and orthographic transcriptions. MULTEXT-EAST has enhanced this spoken corpus with markup for prosody. The prosody markup consists of two levels: F0 curve modeling and symbolic coding.

Orwell's Nineteen Eighty-Four

For the purposes of this short abstract, we will outline our work on Orwell's Nineteen Eighty-Four; the final paper will provide details concerning the entire corpus.
The novel Nineteen Eighty-Four by George Orwell is the central component of the MULTEXT-EAST corpus: it is the parallel text, in which the English original is sentence-aligned with the translations into the six project languages, and each translation is tagged for part of speech. Despite its small size (7 x 100k words), this parallel corpus constitutes a valuable linguistic resource for the MULTEXT-EAST languages, especially because the project has also delivered lexica which cover the word-forms of Nineteen Eighty-Four. Furthermore, through collaboration with the TELRI project and others, the corpus is likely to be extended to cover additional Central and Eastern European languages in the near future.

Numerous encoding decisions were made in the process of encoding the novel, some of which were influenced by the ultimate goal of tagging (for part of speech) and aligning the parallel versions. In other cases, decisions were driven by cost considerations: among the versions of Nineteen Eighty-Four that were encoded in the six languages, many had previously existed in electronic form using some other encoding system (e.g., typesetter's tapes), usually providing only rendition information. In one case (Bulgarian) the entire novel was typed in by hand from a printed text. For versions that already existed in electronic form, a process of "up-translation" was applied to convert rendition information to descriptive CES encoding; our aim was to perform as much of this conversion as possible automatically, in order to reduce costs. In addition, for the purposes of automatic alignment of the parallel texts, it was important that the CES structural markup of Nineteen Eighty-Four be similar across the languages, which also dictated that certain encodings were viable while others were not.
A few of the encoding decisions were as follows:
- Sub-paragraph markup included names, but only the proper noun form of nouns was marked. Inflected forms (common in many of the MULTEXT-EAST languages) were not marked. In this way lexical lookup was not required for marked names, which could be assumed to be proper nouns.
- Rendition information was not systematically retained, except where it could be accomplished automatically, since rendition of the same item often varied widely across the different language editions. For example, words in bold were often in italic or quotation marks in another language, and dialogue was marked in some languages with quotation marks, in others with preceding dashes, etc.
- Although the CES provides a means to store markup for segmentation into sentences and words in a separate SGML document which is linked to the original, it was decided to include sentence boundary markup in the original texts. Because sentence boundaries are automatically determined and marked by the MULTEXT segmentation tool, it was necessary to find a clean way to re-insert the sentence boundary markup into the original SGML document containing markup to the paragraph level.
- The CES also provides for specifying links between two or more aligned documents in a separate SGML document. Because of the current lack of software to manipulate inter-document links of this kind, all the parallel texts and alignment documents are included in a single CES corpus document. This allows the use of SGML IDREFs to refer to aligned elements, which in turn demanded devising a clean system for assigning IDs to elements in the parallel texts.
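The ID scheme itself is not spelled out in this abstract. As a rough illustration only, the following Python sketch assigns position-based IDs to paragraphs and pairs them across two texts, as an IDREF-based alignment would. The ID format (`lang.chapter.paragraph`) and the function names `assign_ids` and `link_pairs` are assumptions made for this example, not the project's actual conventions.

```python
from typing import List, Tuple


def assign_ids(lang: str, chapters: List[List[str]]) -> List[Tuple[str, str]]:
    """Return (id, paragraph) pairs with hypothetical IDs such as 'sl.1.3'.

    The ID starts with a letter and uses only letters, digits and periods,
    so it is usable as an SGML ID value.
    """
    out: List[Tuple[str, str]] = []
    for c, paragraphs in enumerate(chapters, start=1):
        for p, text in enumerate(paragraphs, start=1):
            out.append((f"{lang}.{c}.{p}", text))
    return out


def link_pairs(src_ids: List[str], tgt_ids: List[str]) -> List[Tuple[str, str]]:
    """Pair IDs one-to-one; both texts must have the same number of elements."""
    if len(src_ids) != len(tgt_ids):
        raise ValueError("texts must contain the same number of elements")
    return list(zip(src_ids, tgt_ids))


# Toy data: two chapters in English and in a translation.
en = assign_ids("en", [["First.", "Second."], ["Third."]])
sl = assign_ids("sl", [["Prvi.", "Drugi."], ["Tretji."]])
links = link_pairs([i for i, _ in en], [i for i, _ in sl])
```

Because the IDs are derived from position alone, the same element receives the same ID each time the document is regenerated, which is one simple way to keep alignment references stable.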
The following is an example from the Slovene Nineteen Eighty-Four, with markup down to the paragraph level only:
```
<p>
Ministrstvo resnice &mdash; <name>Minires</name> v
<name>Novoreku</name><ptr n=1 rend="*"
target=N1> &mdash; se je
osupljivo lo&ccaron;ilo od kateregakoli predmeta. Bilo je
velikanska
piramidasta zgradba iz ble&scaron;&ccaron;e&ccaron;e
belega betona, ki
je v terasah kipela kvi&scaron;ku, tristo metrov visoko v zrak. S
kraja, kjer je stal <name>Winston</name>, se je ravno
&scaron;e dalo
prebrati tri partijske parole, ki so se v lepih &ccaron;rkah
odra&zcaron;ale z belega pro&ccaron;elja:
<q rend="CN CP" type=slogan>VOJNA JE MIR</q>
<q rend="CN CP" type=slogan>SVOBODA JE
SU&Zcaron;ENJSTVO</q>
<q rend="CN CP" type=slogan>NEVEDNOST JE
MO&Ccaron;</q>
</p>
```
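The sample represents characters outside ASCII with SGML entity references such as `&ccaron;` and `&scaron;`. A minimal Python sketch for rendering such text as Unicode for display is given below. The table covers only the entity names that occur in the sample, it assumes these names carry their usual ISO entity-set meanings, and `decode_entities` is a hypothetical helper rather than one of the project tools.

```python
import re

# Entity names seen in the sample, mapped to Unicode code points.
# Assumption: the names have their usual ISO entity-set meanings.
ENTITY_TO_CHAR = {
    "ccaron": "\u010d",  # c with caron
    "Ccaron": "\u010c",  # C with caron
    "scaron": "\u0161",  # s with caron
    "Scaron": "\u0160",  # S with caron
    "zcaron": "\u017e",  # z with caron
    "Zcaron": "\u017d",  # Z with caron
    "mdash": "\u2014",   # em dash
}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match: "re.Match[str]") -> str:
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(replace, text)


slogan = decode_entities("SVOBODA JE SU&Zcaron;ENJSTVO")
```

Unknown references are deliberately passed through unchanged, so that an incomplete table never silently drops characters.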

Aligning Orwell's Nineteen Eighty-Four

The automatic alignment programs used to align the seven versions of Orwell's Nineteen Eighty-Four require that each of the aligned versions contain the same number of paragraphs. Each of the six versions in the MULTEXT-EAST languages was aligned to the English (six pair-wise alignments), and it was therefore necessary to ensure that the paragraph-level elements in each of these versions matched those in the English.
Obviously, a matching of this kind is not immediately possible, since there are enormous discrepancies in paragraphing across languages when a text, especially a work of fiction such as the Orwell, is translated. For example, the opening four paragraphs of the first chapter of the English Nineteen Eighty-Four correspond to a single paragraph in the Romanian version. On the other hand, one paragraph in the English version in places spans several paragraphs in the Bulgarian.
In order to accommodate the needs of our aligner, it was necessary to devise a system for inserting and properly identifying "dummy" paragraphs, for cases where additional paragraphs had to be inserted in one of the translations; at the same time it was necessary to devise a means to temporarily "eliminate" a paragraph in some cases. This requirement placed additional demands on our markup scheme and, in particular, on our system for providing unique identifiers for each element in the corpus.
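The mechanics of this step are not described in the abstract. Purely as an illustration, the Python sketch below brings a translation to the same paragraph count as the English by padding with flagged dummy paragraphs or by folding surplus paragraphs together. The `Para` record, the `groups` input (hand-specified counts of corresponding English and translation paragraphs), and the ID format are all assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Para:
    pid: str             # hypothetical identifier, e.g. "ro.1"
    text: str
    dummy: bool = False  # True for an inserted placeholder paragraph
    merged: int = 1      # number of original paragraphs held (0 for a dummy)


def match_paragraphs(lang: str, paras: List[str],
                     groups: List[Tuple[int, int]]) -> List[Para]:
    """Return paragraphs whose count equals the English count.

    Each group is (n_english, n_translation) for one corresponding block.
    """
    result: List[Para] = []
    pos = 0
    for n_en, n_tr in groups:
        if n_en < 1:
            raise ValueError("each group needs at least one English paragraph")
        block = paras[pos:pos + n_tr]
        pos += n_tr
        if n_tr > n_en:
            # Too many paragraphs: fold the surplus into the last one kept.
            head, tail = block[:n_en - 1], block[n_en - 1:]
            items = [(t, False, 1) for t in head]
            items.append((" ".join(tail), False, len(tail)))
        else:
            # Too few (or equal): pad with empty, flagged dummy paragraphs.
            items = [(t, False, 1) for t in block]
            items += [("", True, 0)] * (n_en - n_tr)
        for text, dummy, merged in items:
            result.append(Para(f"{lang}.{len(result) + 1}", text, dummy, merged))
    if pos != len(paras):
        raise ValueError("groups do not cover the paragraphs exactly")
    return result


# Four English paragraphs correspond to one Romanian paragraph.
ro = match_paragraphs("ro", ["P1"], [(4, 1)])
# One English paragraph corresponds to three Bulgarian paragraphs.
bg = match_paragraphs("bg", ["a", "b", "c"], [(1, 3)])
```

Keeping the `dummy` and `merged` flags on each record is what makes the change temporary: after alignment, dummies can be dropped and merged paragraphs split back to the original structure.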

Summary

The final paper will provide much more detail than is possible in this abstract concerning the corpus and its encoding, as well as the problems and considerations for the part of speech tagging and the alignment of the parallel texts. The experience of developing such a corpus should provide valuable input for others who are developing corpora, especially corpora intended for linguistic analysis and applications.

References
[1] Ide, N., and J. Véronis. 1994. "MULTEXT (Multilingual Tools and Corpora)". Proceedings of the 14th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan 1994, 90-96. Project description, documentation, resources and tools available at http://www.lpl.univ-aix.fr/projects/multext/.
[2] Erjavec, T., Ide, N., Petkevic, V., Véronis, J. Multext-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. Proceedings of the TELRI Workshop, Tihanyi, Hungary, 1995, to appear. Project description, documentation and resources available at http://nl.ijs.si/ME/.
[3] Ide, N. and Véronis, J. A Standard for Encoding Linguistic Corpora. Presented at ALLC/ACH96, Bergen, Norway, June 1996. Documentation available at http://www.cs.vassar.edu/CES/.
