ETAP

[Mirrored from: http://strindberg.ling.uu.se/~corpora/etapp/]

This page describes one of the research projects at the Department of Linguistics at Uppsala University, Sweden: "Etablering och annotering av parallellkorpus för igenkänning av översättningsekvivalenter" (in English: "Creating and annotating a parallel corpus for the recognition of translation equivalents"). The project is part of the Stockholm-Uppsala research programme "Translation and Interpreting - A Meeting between Languages and Cultures", financed by the Bank of Sweden Tercentenary Foundation (Riksbankens Jubileumsfond).

Some links to interesting sites:

Overview of the Parallel Corpora Projects in Uppsala.
Home page of the Language Engineering Group at Uppsala University.
Home page of the Department of Linguistics at Uppsala University.

Creating and annotating a parallel corpus for the recognition of translation equivalents

Project No. 12

Project leader:

Anna Sågvall Hein
Department of Linguistics
Uppsala University
Box 513
S 751 20 Uppsala
Sweden
Fax: +46-18181416
E-mail: Anna.Sagvall_Hein@ling.uu.se

Collaborators:

Lars Borin
Erik Tjong Kim Sang
Per Starbäck
Bengt Dahlqvist
Klas Prütz
Students enrolled with the Language Engineering Master's Programme

Languages: Swedish, Dutch, English, Finnish, French, German, Italian and Spanish.

The basic aim of the project is to develop a computerized multilingual corpus for use in bilingual lexicographic work and in methodological studies on the automatic recognition and extraction of translation equivalents from text. The corpus will comprise Swedish source texts representing different styles and domains, together with their translations into several languages. A basic requirement is that the corpus be tagged for word class and aligned, primarily sentence by sentence.

As of October 1996, the project has produced two parallel, aligned subcorpora: the Scania Corpus and the Swedish Statement of Government Policy Corpus.

The Scania corpus, Scania 9606, is a collection of truck maintenance manuals from the Swedish truck manufacturer Scania CV AB in Södertälje. It consists of 80 documents in each of eight languages: Swedish (source), Dutch, English, Finnish, French, German, Italian and Spanish. The total size of the corpus is approximately 1.67 million words (64 MB).

The Swedish Statement of Government Policy Corpus, Regeringsförklaringen 9607, is a collection of Government Statements made in 1988 (Swedish, English, German, and French), 1994 (Swedish), 1995 (Swedish), and 1996 (Swedish, English, German, French, and Spanish). The total size of the corpus is 26,709 running words (371 kB). It is available at http://strindberg.ling.uu.se/~corpora/rf/

The text structure of the documents in the two corpora has been marked up automatically with TEI Lite-conformant SGML, by means of software developed in the project. The sentences (or sentence fragments) in the different language versions have been aligned with each other. Software has also been developed for accessing the parallel corpora. A demo of this software can be found at the URL for the Regeringsförklaringen 9607 corpus.
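
As an illustration of what access to sentence-aligned text can look like, here is a minimal Python sketch. It is not the project's software, and it does not reproduce the project's SGML format: the names AlignedUnit and concordance, the language codes, and the example sentences are all invented for this illustration. Each alignment link is assumed to hold the same sentence (or fragment) in whichever languages it exists in.

```python
from dataclasses import dataclass


@dataclass
class AlignedUnit:
    """One alignment link: the same sentence (or fragment) in several languages.

    Hypothetical structure, not the format used by the project.
    """
    unit_id: str
    versions: dict  # language code -> sentence text


def concordance(units, word, source="sv", target="en"):
    """Return (source, target) sentence pairs whose source side contains `word`."""
    needle = word.lower()
    hits = []
    for unit in units:
        src = unit.versions.get(source)
        tgt = unit.versions.get(target)
        if src is None or tgt is None:
            continue  # this link has no version in one of the two languages
        # Crude tokenisation: split on whitespace, strip common punctuation.
        words = [tok.strip(".,;:!?").lower() for tok in src.split()]
        if needle in words:
            hits.append((src, tgt))
    return hits


# Invented toy data in the style of a maintenance manual.
UNITS = [
    AlignedUnit("s1", {"sv": "Kontrollera oljenivån.", "en": "Check the oil level."}),
    AlignedUnit("s2", {"sv": "Byt filtret.", "en": "Change the filter.",
                       "de": "Wechseln Sie den Filter."}),
    AlignedUnit("s3", {"sv": "Kontrollera filtret.", "de": "Prüfen Sie den Filter."}),
]

for src, tgt in concordance(UNITS, "filtret", target="de"):
    print(src, "|", tgt)
```

Looking up a Swedish word form and reading off the aligned sentences in another language is the basic operation behind using such a corpus to search for translation equivalents.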

A first step in tagging the Swedish part of the Scania corpus has been taken: its word forms have been morphologically analysed, and word forms that are ambiguous with regard to part of speech have been tentatively disambiguated by heuristic means. The result is 178,355 tokens (single words and lexicalised phrases), 19,360 types (word forms), and 9,549 lemmas. A frequency distribution of sentence lengths has also been generated.
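
The difference between tokens, types and a sentence-length frequency distribution can be shown with a short Python sketch. This is only an illustration under simplifying assumptions: the function name corpus_statistics is made up, the input is assumed to be already split into sentences and word forms, types are compared case-insensitively, and lexicalised phrases and lemmas (which require the morphological analysis) are not handled.

```python
from collections import Counter


def corpus_statistics(sentences):
    """Count tokens, types and sentence lengths.

    `sentences` is a list of sentences, each a list of word-form strings.
    Returns (number of tokens, number of types, Counter of sentence lengths).
    """
    # Assumption: word forms differing only in case count as the same type.
    tokens = [tok.lower() for sent in sentences for tok in sent]
    types = set(tokens)
    length_freq = Counter(len(sent) for sent in sentences)
    return len(tokens), len(types), length_freq


# Invented toy input, not taken from the corpus.
SENTENCES = [
    ["Kontrollera", "oljenivån"],
    ["Byt", "filtret"],
    ["Kontrollera", "filtret", "och", "oljenivån"],
]

n_tokens, n_types, length_freq = corpus_statistics(SENTENCES)
print(n_tokens, n_types, dict(length_freq))
```

For the toy input there are 8 tokens but only 5 types, since "kontrollera", "filtret" and "oljenivån" each occur twice.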

The morphological analysis was carried out by means of Sve.Ucp, a morphological analyser developed at the department. Sve.Ucp uses a stem dictionary, and this dictionary was extended to cover the vocabulary of the Scania corpus. As regards the Regeringsförklaringen corpus, the words were analysed once, and the stem dictionary is currently being updated to account for missing words.
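
The general idea of analysis with a stem dictionary can be sketched in Python as follows. This is a toy model and makes no claim about how Sve.Ucp works internally: the tables STEMS and PARADIGMS, the paradigm labels, and the functions analyse and unknown_words are hypothetical. A word form is split into stem and ending; an analysis is accepted when the stem is in the dictionary and the ending belongs to that stem's inflection paradigm.

```python
# Hypothetical stem dictionary: stem -> list of (lemma, word class, paradigm).
STEMS = {
    "bil": [("bil", "noun", "n_ar")],
    "vis": [("visa", "noun", "n_or"), ("visa", "verb", "v_ar")],
}

# Hypothetical inflection paradigms: paradigm -> permitted endings.
PARADIGMS = {
    "n_ar": ["", "en", "ar", "arna"],
    "n_or": ["a", "an", "or", "orna"],
    "v_ar": ["a", "ar", "ade", "at"],
}


def analyse(word):
    """Return all (lemma, word class) analyses of a word form."""
    word = word.lower()
    analyses = []
    # Try every split of the word into a non-empty stem and an ending.
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        for lemma, word_class, paradigm in STEMS.get(stem, []):
            if ending in PARADIGMS[paradigm]:
                analyses.append((lemma, word_class))
    return analyses


def unknown_words(words):
    """Word forms with no analysis: candidates for extending the dictionary."""
    return [w for w in words if not analyse(w)]


print(analyse("bilar"))    # one analysis
print(analyse("visa"))     # two analyses: ambiguous between noun and verb
print(unknown_words(["bilen", "lastbil", "visade"]))
```

The sketch shows both situations mentioned in this description: a form such as "visa" receives alternative analyses that a later step must disambiguate, and a form such as "lastbil" is not covered until the stem dictionary is extended.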

Current methodological work concentrates on the second step in the tagging process, specifically the implementation, exploration and evaluation of different methods for disambiguating the alternative analyses produced by the morphological analyser.

Another methodological issue in focus is the design and implementation of an adequate corpus format for structuring and searching a multilingual, parallel corpus for use in bilingual lexical acquisition.

Detailed counts for the two corpora:

Scania 9606

--------+------------+------------+------------+
Language|     Files  |     Words  |     Bytes  |
--------+------------+------------+------------+
German  |        80  |    186293  |   8004331  |
English |        80  |    220827  |   7886082  |
Spanish |        80  |    250730  |   8090916  |
Finnish |        80  |    148348  |   7833990  |
French  |        80  |    244239  |   8156457  |
Italian |        80  |    228631  |   8127121  |
Dutch   |        80  |    216424  |   8072128  |
Swedish |        80  |    172259  |   7792597  |
--------+------------+------------+------------+
Total   |       640  |   1667751  |  63963622  |
--------+------------+------------+------------+

Regeringsförklaringen 9607

--------+------------+------------+------------+
Language|     Files  |     Words  |     Bytes  |
--------+------------+------------+------------+
German  |         2  |      4259  |     67650  |
English |         2  |      4492  |     63522  |
Spanish |         1  |      2318  |     30930  |
French  |         2  |      5221  |     67769  |
Swedish |         4  |     10419  |    140924  |
--------+------------+------------+------------+
Total   |        11  |     26709  |    370795  |
--------+------------+------------+------------+

Last update: December 12, 1996. corpora@ling.uu.se