[Mirrored from: http://strindberg.ling.uu.se/~corpora/]
Parallel Corpora in Uppsala
This page contains an overview of the parallel corpora projects
in the Language Engineering (Språkteknologi) group of the
Department of Linguistics at Uppsala University, Sweden.
-
Project Staff
-
Anna Sågvall-Hein (project leader),
Lars Borin,
Jon Brewer,
Bengt Dahlqvist,
Klas Prütz,
Per Starbäck,
Erik Tjong Kim Sang and
Eva Wikholm.
External consultants:
Ingrid Almqvist and
Vesa Autio, both from Scania CV AB, Södertälje.
-
Address
-
Institutionen för Lingvistik
Box 513
751 20 Uppsala
Sweden
phone: +46 18 18 11 13
fax: +46 18 18 14 16
Email: corpora@ling.uu.se
WWW: http://www.ling.uu.se/corpora/
Overview
In our department the two main projects using parallel corpora are the
Scania Corpus Project and the
ETAP project.
Parallel corpora will also be used in our part of the
Scarrie
project.
An important part of the Scania Corpus Project is the development of
the controlled language
ScaniaSwedish.
We have developed and are developing different language resources:
-
The Scania corpus
-
A collection of truck manuals from the Swedish truck manufacturing
company Scania.
The manuals are available in eight languages:
Swedish (source language),
Dutch,
English,
Finnish,
French,
German,
Italian and
Spanish.
The current size of the corpus is approximately 220,000 words per
language.
-
Swedish newspaper corpus
-
This corpus will consist of Swedish texts from the local newspaper
Upsala Nya Tidning and the national newspaper
Svenska Dagbladet.
The current size of the corpus is 6.5 million words.
-
Swedish political texts
-
This corpus contains texts from the
Swedish
government,
including the declarations issued by the Swedish prime minister
when a new cabinet takes office (regeringsförklaringen).
These declarations are issued in five languages:
Swedish, English, French, German and Spanish (since 1996).
The current size of the corpus is 11,000 words.
-
A Swedish dictionary of general language
-
This electronic dictionary contains 60,000 lemmas and several
thousand phrases.
-
A multilingual term bank of car maintenance terms
-
This term bank consists of approximately 4,000 Swedish car maintenance
terms with translations into six other languages:
Dutch,
English,
French,
German,
Italian and
Spanish.
-
The Swedish Immigrant Newspaper corpus
-
The Swedish Immigrant Newspaper
(Invandrartidningen)
is issued in nine languages:
Swedish,
Albanian,
Arabic,
English,
Finnish,
Persian,
Polish,
Serbo-Croatian and
Spanish.
At present this corpus consists of only a few texts.
We are currently establishing contact with the newspaper to
obtain a larger number of their texts.
We have chosen
SGML and
TEI Lite
for structuring these language resources.
The language resources will be used in our translation tool
Multra.
Parallel Corpora around the World
If you know of any interesting parallel corpora web sites that are not
included in this list, please mail the URLs to
corpora@ling.uu.se
- ACL SIGLEX
Parallel Corpora page
-
A collection of links to publicly available parallel corpora.
The collection is maintained by Ken Litkowski of the ACL Special
Interest Group on the Lexicon.
- Bilingual
Reference Corpora for translators and translation studies
-
A paper on parallel corpora construction and usage by
Carol Peters and Eugenio Picchi
of the Istituto di Linguistica Computazionale in Pisa, Italy.
- CRATER Multilingual Aligned Annotated Corpus
-
A multilingual aligned corpus in English, French and Spanish available
from the School of Engineering, Computing and Mathematical Sciences,
Lancaster University, UK.
- English Turkish Aligned Parallel Corpora
-
Aligned parallel texts in English and Turkish presented by Kemal Oflazer
of Bilkent University in Ankara, Turkey.
- ECI Multilingual Corpus CD
-
A CD-ROM containing a collection of mostly monolingual texts in 25
languages.
It contains some multilingual nonaligned texts as well.
- The ENPC Project
-
A description of the English-Norwegian Parallel Corpus project at the
Norwegian Computing Centre for the Humanities in Oslo, Norway.
- FECCS
-
A project in Contrastive Linguistics at the University of Jyväskylä,
Finland, which uses a bilingual Finnish-English corpus.
- INTERSECT: Parallel Corpora and Contrastive Linguistics
-
A project at the University of Brighton, United Kingdom in which
parallel texts in French and English are being constructed and analysed.
- Knowledge Acquisition for Japanese-English Machine Translation
-
A project which uses parallel corpora in Japanese and English to
extract knowledge that can be used by a Machine Translation system.
- The
Lingua Project
-
An excellent description of the Lingua Parallel Concordancing Project
which aims at managing a multilingual corpus to ease students' and
teachers' work in second language learning.
11 organisations from 6 different countries participate in this project.
- Linguistic Data Consortium
-
The Linguistic Data Consortium at the University of Pennsylvania, USA
supplies a
large
parallel corpus
of United Nations texts in English, French and Spanish.
- Michael Barlow's
Parallel Corpora Page
-
An overview page about the global research in parallel corpora.
Michael Barlow also maintains a general
Corpus
Linguistics page
- Multext
-
Multext is an EU-sponsored project on Multilingual Text Tools and Corpora
for the languages Dutch, English, French, German, Italian, Spanish and
Swedish.
- Multext East
-
Multext East is a spin-off of Multext that creates parallel corpora
for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian.
- PEDANT
Parallel Texts in Göteborg
-
A report about the parallel texts project at
Göteborg University, Sweden.
- Proteus Project
-
The Proteus Project is a machine translation project of the Computer
Science department of New York University and the
Autonomous University of Madrid.
They use parallel corpora in English and Spanish.
- Text-based contrastive studies in English
-
A project at Lund University, Sweden which aims at building a parallel
corpus of texts in Swedish and English and carrying out cross-linguistic
studies on the corpus.
- Translearn
-
A European project (LE1) aimed at the development of a translation
support tool.
Languages covered: English, French, Greek and Portuguese.
The project has probably finished already.
The Controlled
Languages Homepage
maintained by Willem-Olaf Huijsen at the University of Utrecht, The
Netherlands, is an interesting page for people who study Machine
Translation.
Parallel Corpora Tools
- MultiLingual
Parallel Concordancer by David Woolls
-
A concordancing program for parallel texts developed by David Woolls
and others in the
Lingua project.
Platform: MS Windows.
To be released in January 1997.
-
ParaConc
by Michael Barlow
-
Michael Barlow's concordance program for parallel texts.
Does not support automatic alignment of texts.
Platform: Mac.
- WordSmith
Tools by Mike Scott
-
Software developed by Mike Scott and published by Oxford University
Press.
It can perform lexical analysis of texts and alignment of
multi-lingual texts.
Platform: MS Windows 3.1 or higher.
The Computing in the Humanities and Social Sciences (CHASS) center in
Toronto, Canada supplies a large list of
general
corpus processing software.
Last update: December 13, 1996.
corpora@ling.uu.se