[Archive copy mirrored from the URL: http://www.qucis.queensu.ca/achallc97/papers/p043.html; see this canonical version of the document.]
Keywords: corpora, encoding, SGML
However, there have been no comparable efforts for Central and Eastern European (CEE) languages. No large-scale, systematic attempts at corpus collection currently exist (in particular for multilingual, parallel corpora in these languages); tools specifically adapted to corpora in CEE languages are not widely available; and most standardization efforts have not yet taken into account the specific characteristics of CEE languages.
MULTEXT-EAST [2] is a spin-off of the LRE project MULTEXT which is intended to fill these gaps by developing significant resources for six CEE languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovenian) and adapting existing tools and standards to them. MULTEXT-EAST extends MULTEXT's scope to CEE languages with the following goals:
MULTEXT-EAST has applied the CES to texts in six CEE languages, including fiction and newspaper data. The experience of applying the CES to these new languages has led to a major revision and extension of the CES, in particular to handle the required additional character sets. In addition, the lack of substantial pre-existing texts in some electronic format in the Eastern European countries, and the resulting need to develop many corpora based on printed materials only, has made it necessary to consider the kinds of markup that can or should be included and the optimal stages of markup enhancement when corpora are generated in this way.
The corpus has been prepared in TEI-conformant SGML format and annotated for basic structural features as well as sub-paragraph segmentation, part of speech, and alignment of parallel texts.
The corpus is composed of three major parts:
For each of the six MULTEXT-EAST languages, the comparable corpus will include two subsets of at least 100,000 words each, consisting of
For the six MULTEXT-EAST languages, the parallel corpus includes approximately 100,000 words per language, consisting of translations of Orwell's Nineteen Eighty-Four. The entire multilingual parallel corpus has also been prepared in CES-conformant format. For each language, half of the corpus has been marked and validated for alignment and sentence boundaries. Alignment is between the English version and each of the six MULTEXT-EAST languages, yielding six pair-wise alignments.
MULTEXT-EAST has recorded a small corpus of spoken texts in each of the six languages, similar to the EUROM-1 speech corpus, comprising forty short passages of five thematically connected sentences, each spoken by several native speakers, with phonemic and orthographic transcriptions. MULTEXT-EAST has enhanced this spoken corpus with markup for prosody. The prosody markup consists of two levels: F0 curve modeling and symbolic coding.
The novel Nineteen Eighty-Four by George Orwell is the central component of the MULTEXT-EAST corpus: it is the parallel text, in which the English original is sentence-aligned with the six languages of the project and each translation is tagged for part of speech. Despite its small size (7 x 100,000 words), this parallel corpus constitutes a valuable linguistic resource for the MULTEXT-EAST languages, especially because the project has also delivered lexica which cover the word-forms of Nineteen Eighty-Four. Furthermore, through collaboration with the TELRI project and others, the corpus is likely to be extended to cover additional Central and Eastern European languages in the near future.
Numerous encoding decisions were made in the process of encoding the novel, some of which were influenced by the ultimate goal of tagging (for part of speech) and aligning the parallel versions. In other cases, decisions were driven by cost considerations: among the versions of Nineteen Eighty-Four that were encoded in the six languages, many had previously existed in electronic form using some other encoding system (e.g., typesetter's tapes), usually providing only rendition information. In one case (Bulgarian) the entire novel was typed in by hand from a printed text. For versions that already existed in electronic form, a process of "up-translation" was applied to convert rendition information to descriptive CES encoding; our aim was to perform as much of this conversion as possible by automatic means, in order to reduce costs. In addition, for the purposes of automatic alignment of the parallel texts, it was important that the CES structural markup of Nineteen Eighty-Four be similar across the languages, which also dictated that certain encodings were viable while others were not.
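The idea of automatic up-translation can be illustrated with a minimal Python sketch. It is not the project's actual conversion procedure: the rendition codes ("caps-centered", "bold-heading"), the input notation (<hi rend="...">) and the rule table are all assumptions made for illustration. Spans whose rendition has a rule are rewritten as descriptive elements; all others are left untouched and reported, so that they can be resolved by hand.

```python
import re

# Hypothetical rule table: rendition code -> (descriptive element, type value).
# These pairs are illustrative assumptions, not the project's real rules.
RULES = {
    "caps-centered": ("q", "slogan"),
    "bold-heading": ("head", None),
}

# Matches a rendition-only span such as <hi rend="caps-centered">TEXT</hi>.
# Nested <hi> spans are not handled by this simple pattern.
HI_SPAN = re.compile(r'<hi rend="([^"]+)">(.*?)</hi>', re.DOTALL)


def up_translate(text):
    """Rewrite known rendition spans as descriptive elements.

    Returns the converted text and the list of rendition codes that no
    rule covered, in order of appearance.
    """
    unresolved = []

    def convert(match):
        rend, content = match.group(1), match.group(2)
        if rend not in RULES:
            unresolved.append(rend)
            return match.group(0)  # leave the span for manual review
        tag, kind = RULES[rend]
        type_attr = f' type="{kind}"' if kind else ""
        # The original rendition is kept in the rend attribute.
        return f'<{tag} rend="{rend}"{type_attr}>{content}</{tag}>'

    return HI_SPAN.sub(convert, text), unresolved


source = ('<hi rend="caps-centered">WAR IS PEACE</hi> and '
          '<hi rend="italic">Winston</hi>')
converted, unresolved = up_translate(source)
```

In this example the slogan becomes a q element of type slogan, while the italic span has no rule and is returned in unresolved. Rendition is often ambiguous (italics may mark a name, a foreign word or emphasis), which is why a sketch like this can only automate part of the conversion.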
A few of the encoding decisions were as follows:
The following is an example from the Slovene Nineteen Eighty-Four, with markup down to the paragraph level only:
<p> Ministrstvo resnice — <name>Minires</name> v <name>Novoreku</name><ptr n=1 rend="*" target=N1> — se je osupljivo ločilo od kateregakoli predmeta. Bilo je velikanska piramidasta zgradba iz bleščeče belega betona, ki je v terasah kipela kvišku, tristo metrov visoko v zrak. S kraja, kjer je stal <name>Winston</name>, se je ravno še dalo prebrati tri partijske parole, ki so se v lepih črkah odražale z belega pročelja: <q rend="CN CP" type=slogan>VOJNA JE MIR</q> <q rend="CN CP" type=slogan>SVOBODA JE SUŽENJSTVO</q> <q rend="CN CP" type=slogan>NEVEDNOST JE MOČ</q> </p>
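As a rough illustration of how such descriptive markup can be exploited, the following Python sketch collects the name and q elements from a fragment like the one above. It relies on Python's standard html.parser module, which is an HTML parser rather than an SGML parser; it is used here only because it tolerates unquoted attribute values such as type=slogan. A validating SGML parser with the CES DTD would be the appropriate tool in practice. The class name ElementCollector is hypothetical.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect (tag, attributes, text) for a chosen set of elements."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.open = []    # currently open wanted elements
        self.found = []   # completed elements, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open.append((tag, dict(attrs), []))

    def handle_endtag(self, tag):
        if self.open and self.open[-1][0] == tag:
            name, attrs, parts = self.open.pop()
            self.found.append((name, attrs, "".join(parts).strip()))

    def handle_data(self, data):
        # Text belongs to every wanted element that is currently open.
        for _, _, parts in self.open:
            parts.append(data)


sample = ('<p>S kraja, kjer je stal <name>Winston</name>: '
          '<q rend="CN CP" type=slogan>VOJNA JE MIR</q></p>')
collector = ElementCollector({"name", "q"})
collector.feed(sample)
collector.close()
slogans = [text for tag, attrs, text in collector.found
           if tag == "q" and attrs.get("type") == "slogan"]
```

Here collector.found holds one entry for the name element and one for the q element, and slogans contains the single string "VOJNA JE MIR".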
Obviously, a matching of this kind is not immediately possible, since there are enormous discrepancies in paragraphing across languages when a text, especially a work of fiction such as Orwell's novel, is translated. For example, the opening four paragraphs of the first chapter of the English Nineteen Eighty-Four correspond to a single paragraph in the Romanian version. Conversely, a single paragraph in the English version in places spans several paragraphs in the Bulgarian.
In order to accommodate the needs of our aligner, it was necessary to devise a system for inserting and properly identifying "dummy" paragraphs, for cases where additional paragraphs had to be inserted in one of the translations; at the same time, it was necessary to devise a means of temporarily "eliminating" a paragraph in some cases. This requirement placed additional demands on our markup scheme and, in particular, on our system for providing unique identifiers for each element in the corpus.
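A minimal Python sketch of the dummy-paragraph idea follows. It is an illustration under stated assumptions, not the scheme actually used in the project: the identifier format ("en.p1", "ro.p1.d1"), the Para record and the function names are hypothetical, and the paragraph groupings are assumed to be supplied by hand as pairs of counts. Only the insertion of dummies is covered; the temporary elimination of paragraphs is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Para:
    pid: str             # unique identifier (hypothetical scheme)
    text: str
    dummy: bool = False  # True for inserted placeholder paragraphs


def number(prefix, texts):
    """Give each real paragraph a stable identifier such as 'en.p1'."""
    return [Para(f"{prefix}.p{i}", t) for i, t in enumerate(texts, start=1)]


def _padded(group, size):
    """Extend a group with dummies named after its last real paragraph."""
    last = group[-1].pid
    missing = size - len(group)
    return group + [Para(f"{last}.d{k}", "", dummy=True)
                    for k in range(1, missing + 1)]


def pad_to_match(src, tgt, groups):
    """Pad both texts so corresponding groups have equal paragraph counts.

    groups is a list of (n_src, n_tgt) pairs, each count at least 1,
    covering both texts in order.
    """
    out_src, out_tgt, i, j = [], [], 0, 0
    for n_src, n_tgt in groups:
        size = max(n_src, n_tgt)
        out_src += _padded(src[i:i + n_src], size)
        out_tgt += _padded(tgt[j:j + n_tgt], size)
        i, j = i + n_src, j + n_tgt
    return out_src, out_tgt


# Four English paragraphs match one Romanian paragraph, then one matches two.
en = number("en", ["A", "B", "C", "D", "E"])
ro = number("ro", ["ABCD", "E1", "E2"])
en_padded, ro_padded = pad_to_match(en, ro, [(4, 1), (1, 2)])
```

After padding, both lists contain six paragraphs. Because dummies carry their own identifiers and a dummy flag, the identifiers of real paragraphs never change, and filtering out the flagged entries restores the original text.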
[2] Erjavec, T., Ide, N., Petkevic, V., Véronis, J. Multext-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages. Proceedings of the TELRI Workshop, Tihanyi, Hungary, 1995, to appear. Project description, documentation and resources available at http://nl.ijs.si/ME/.
[3] Ide, N. and Véronis, J. A Standard for Encoding Linguistic Corpora. Presented at ALLC/ACH96, Bergen, Norway, June 1996. Documentation available at http://www.cs.vassar.edu/CES/.