[Mirrored from: http://strindberg.ling.uu.se/~corpora/]
Parallel Corpora in Uppsala
This page contains an overview of the parallel corpora projects
in the Language Engineering (Språkteknologi) group of the
Department of Linguistics at Uppsala University in Sweden.
Anna Sågvall-Hein (project leader),
Erik Tjong Kim Sang,
Ingrid Almqvist and
Vesa Autio, the latter two from Scania CV AB, Södertälje.
Institutionen för Lingvistik
751 20 Uppsala
phone: +46 18 18 11 13
fax: +46 18 18 14 16
In our department the two main projects using parallel corpora are the
Scania Corpus Project and the
Parallel corpora will also be used in our part of the
An important part of the Scania Corpus Project is the development of
the controlled language
We have developed, and continue to develop, several language resources:
The Scania corpus
A collection of truck manuals from the Swedish truck manufacturing
The manuals are available in eight languages:
Swedish (source language),
The current size of the corpus is approximately 220,000 words per
Swedish newspaper corpus
This corpus will consist of Swedish texts from the local newspaper
Upsala Nya Tidning and the national newspaper
The current size of the corpus is 6.5 million words.
Swedish political texts
This corpus contains texts from the
among them the declarations issued by the Swedish prime minister
when a new cabinet takes office (regeringsförklaringen).
These declarations are issued in five languages:
Swedish, English, French, German and Spanish (since 1996).
The current size of the corpus is 11,000 words.
A Swedish dictionary of general language
This electronic dictionary contains 60,000 lemmas and several
A multilingual term bank of car maintenance terms
This term bank consists of approximately 4000 Swedish car maintenance
terms with translations in 6 other languages:
The Swedish Immigrant Newspaper corpus
The Swedish Immigrant Newspaper
is issued in nine languages:
At present this corpus consists of only a few texts.
We are currently establishing contact with the newspaper to
obtain a larger number of their texts.
We have chosen
SGML and the
for structuring these language resources.
The language resources will be used in our translation tool
Parallel Corpora around the World
If you know of any interesting parallel corpora web sites that have not been
included in this list, please mail the URLs to
- ACL SIGLEX
Parallel Corpora page
A collection of links to publicly available parallel corpora.
The collection is maintained by Ken Litkowski of the ACL Special
Interest Group on the Lexicon.
Reference Corpora for translators and translation studies
A paper on parallel corpora construction and usage by
Carol Peters and Eugenio Picchi
of the Istituto di Linguistica Computazionale in Pisa, Italy.
- CRATER Multilingual Aligned Annotated Corpus
A multilingual aligned corpus in English, French and Spanish available
from the School of Engineering, Computing and Mathematical Sciences,
Lancaster University, UK.
- English Turkish Aligned Parallel Corpora
Aligned parallel texts in English and Turkish presented by Kemal Oflazer
of Bilkent University in Ankara, Turkey.
- ECI Multilingual Corpus CD
A CD-ROM containing a collection of mostly monolingual texts in 25
It contains some multilingual nonaligned texts as well.
- The ENPC Project
A description of the English-Norwegian Parallel Corpus project at the
Norwegian Computing Centre for the Humanities in Oslo, Norway.
A project in Contrastive Linguistics at the University of Jyväskylä,
Finland, which uses a bilingual Finnish-English corpus.
- INTERSECT: Parallel Corpora and Contrastive Linguistics
A project at the University of Brighton, United Kingdom, in which
parallel texts in French and English are being constructed and analysed.
- Knowledge Acquisition for Japanese-English Machine Translation
A project which uses parallel corpora in Japanese and English to
extract knowledge that can be used by a Machine Translation system.
An excellent description of the Lingua Parallel Concordancing Project
which aims at managing a multilingual corpus to ease students' and
teachers' work in second language learning.
Eleven organisations from six different countries participate in this project.
- Linguistic Data Consortium
The Linguistic Data Consortium at the University of Pennsylvania, USA
of United Nations texts in English, French and Spanish.
- Michael Barlow's
Parallel Corpora Page
An overview page on worldwide research in parallel corpora.
Michael Barlow also maintains a general
Multext is an EU-sponsored project on Multilingual Text Tools and Corpora
for the languages Dutch, English, French, German, Italian, Spanish and
- Multext East
Multext East is a spin-off of Multext that creates parallel corpora
for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian.
Parallel Texts in Göteborg
A report about the parallel texts project at
Göteborg University, Sweden.
- Proteus Project
The Proteus Project is a machine translation project of the Computer
Science department of New York University and the
Autonomous University of Madrid.
They use parallel corpora in English and Spanish.
- Text-based contrastive studies in English
A project at Lund University, Sweden, which aims at building a parallel
corpus of texts in Swedish and English and carrying out cross-linguistic
studies on the corpus.
A European project (LE1) aimed at the development of a translation
Languages covered: English, French, Greek and Portuguese.
The project has probably finished already.
maintained by Willem-Olaf Huijsen at the University of Utrecht, The
Netherlands, is an interesting page for people who study Machine
Parallel Corpora Tools
Parallel Concordancer by David Woolls
A concordancing program for parallel texts developed by David Woolls
and others in the
Platform: MS Windows.
To be released in January 1997.
by Michael Barlow
Michael Barlow's concordance program for parallel texts.
Does not support automatic alignment of texts.
Tools by Mike Scott
Software developed by Mike Scott and published by Oxford University
It can perform lexical analysis of texts and alignment of
Platform: MS Windows 3.1 or higher.
The Computing in the Humanities and Social Sciences (CHASS) center in
Toronto, Canada, provides a large list of
corpus processing software.
Last update: December 13, 1996.