Parallel Corpora in Uppsala

[Mirrored from: http://strindberg.ling.uu.se/~corpora/]

This page gives an overview of the parallel corpora projects of the Language Engineering (Språkteknologi) group at the Department of Linguistics of the University of Uppsala in Sweden.

Project Staff: Anna Sågvall-Hein (project leader), Lars Borin, Jon Brewer, Bengt Dahlqvist, Klas Prütz, Per Starbäck, Erik Tjong Kim Sang and Eva Wikholm. External consultants: Ingrid Almqvist and Vesa Autio, both from Scania CV AB, Södertälje.
Address: Institutionen för Lingvistik
Box 513
751 20 Uppsala
Sweden
phone: +46 18 18 11 13
fax: +46 18 18 14 16
Email: corpora@ling.uu.se
WWW: http://www.ling.uu.se/corpora/

Overview

In our department the two main projects using parallel corpora are the Scania Corpus Project and the ETAP project. Parallel corpora will also be used in our part of the Scarrie project. An important part of the Scania Corpus Project is the development of the controlled language ScaniaSwedish.

We have developed, and are still developing, the following language resources:

The Scania corpus: A collection of truck manuals from the Swedish truck manufacturing company Scania. The manuals are available in eight languages: Swedish (source language), Dutch, English, Finnish, French, German, Italian and Spanish. The current size of the corpus is approximately 220,000 words per language.
Swedish newspaper corpus: This corpus will consist of Swedish texts from the local newspaper Upsala Nya Tidning and the national newspaper Svenska Dagbladet. The current size of the corpus is 6.5 million words.
Swedish political texts: This corpus contains texts from the Swedish government, including the declarations issued by the Swedish prime minister when a new cabinet takes office (regeringsförklaringen). These declarations are issued in five languages: Swedish, English, French, German and Spanish (since 1996). The current size of the corpus is 11,000 words.
A Swedish dictionary of general language: This electronic dictionary contains 60,000 lemmas and several thousand phrases.
A multilingual term bank of car maintenance terms: This term bank consists of approximately 4,000 Swedish car maintenance terms with translations into six other languages: Dutch, English, French, German, Italian and Spanish.
The Swedish Immigrant Newspaper corpus: The Swedish Immigrant Newspaper (Invandrartidningen) is issued in nine languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish. At present this corpus consists of only a few texts. We are establishing contact with the newspaper to obtain a larger number of their texts.

We have chosen SGML and TEI Lite for structuring these language resources. The language resources will be used in our translation tool Multra.

Parallel Corpora around the World

If you know of any interesting parallel corpora web sites that are not included in this list, please mail the URLs to corpora@ling.uu.se

ACL SIGLEX Parallel Corpora page: A collection of links to publicly available parallel corpora. The collection is maintained by Ken Litkowski of the ACL Special Interest Group on the Lexicon.
Bilingual Reference Corpora for translators and translation studies: A paper on parallel corpora construction and usage by Carol Peters and Eugenio Picchi of the Istituto di Linguistica Computazionale in Pisa, Italy.
CRATER Multilingual Aligned Annotated Corpus: A multilingual aligned corpus in English, French and Spanish available from the School of Engineering, Computing and Mathematical Sciences, Lancaster University, UK.
English Turkish Aligned Parallel Corpora: Aligned parallel texts in English and Turkish presented by Kemal Oflazer of Bilkent University in Ankara, Turkey.
ECI Multilingual Corpus CD: A CD-ROM containing a collection of mostly monolingual texts in 25 languages. It contains some multilingual nonaligned texts as well.
The ENPC Project: A description of the English-Norwegian Parallel Corpus project at the Norwegian Computing Centre for the Humanities in Oslo, Norway.
FECCS: A project in Contrastive Linguistics at the University of Jyväskylä, Finland, which uses a bilingual Finnish-English corpus.
INTERSECT: Parallel Corpora and Contrastive Linguistics: A project at the University of Brighton, United Kingdom, in which parallel texts in French and English are being constructed and analysed.
Knowledge Acquisition for Japanese-English Machine Translation: A project which uses parallel corpora in Japanese and English to extract knowledge that can be used by a Machine Translation system.
The Lingua Project: An excellent description of the Lingua Parallel Concordancing Project, which aims at managing a multilingual corpus to ease students' and teachers' work in second language learning. Eleven organisations from six countries participate in this project.
Linguistic Data Consortium: The Linguistic Data Consortium at the University of Pennsylvania, USA, supplies a large parallel corpus of United Nations texts in English, French and Spanish.
Michael Barlow's Parallel Corpora Page: An overview of worldwide research on parallel corpora. Michael Barlow also maintains a general Corpus Linguistics page.
Multext: Multext is an EU-sponsored project on Multilingual Text Tools and Corpora for the languages Dutch, English, French, German, Italian, Spanish and Swedish.
Multext East: Multext East is a spin-off of Multext and is concerned with creating parallel corpora for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian.
PEDANT Parallel Texts in Göteborg: A report about the parallel texts project at Göteborg University, Sweden.
Proteus Project: The Proteus Project is a machine translation project of the Computer Science department of New York University and the Autonomous University of Madrid. They use parallel corpora in English and Spanish.
Text-based contrastive studies in English: A project at Lund University, Sweden, which aims at building a parallel corpus of texts in Swedish and English and carrying out cross-linguistic studies on the corpus.
Translearn: A European project (LE1) aimed at the development of a translation support tool. Languages covered: English, French, Greek and Portuguese. The project has probably finished already.

The Controlled Languages Homepage, maintained by Willem-Olaf Huijsen at the University of Utrecht, The Netherlands, is an interesting page for people who study Machine Translation.

Parallel Corpora Tools

MultiLingual Parallel Concordancer by David Woolls: A concordancing program for parallel texts developed by David Woolls and others in the Lingua project. Platform: MS Windows. To be released in January 1997.
ParaConc by Michael Barlow: Michael Barlow's concordance program for parallel texts. Does not support automatic alignment of texts. Platform: Mac.
WordSmith Tools by Mike Scott: Software developed by Mike Scott and published by Oxford University Press. It can perform lexical analysis of texts and alignment of multilingual texts. Platform: MS Windows 3.1 or higher.

The Computing in the Humanities and Social Sciences (CHASS) center in Toronto, Canada, supplies a large list of general corpus processing software.

Last update: December 13, 1996. corpora@ling.uu.se