Contents of the ECI/MCI Corpus
Id Type Language Size (K words)
=================================================
alb01 Word list and Texts Albanian 205
(a) Albanian word list: 32K words with syntactic classes, from the
Albanian dictionary of 1984, published in Tirana by the Academy of Sciences.
(b) The novel "Koncert në fund të dimrit" by Ismail Kadare published
in Tirana.
bul01 Technical Bulgarian 5
A number of scientific papers from "Science" journal.
chi01 Newspaper Chinese 2895
The PH text corpus contains 3.75 million Chinese characters. It is a
collection of news from China's official Xinhua (New China) news agency,
covering the period from January 1990 to March 1991.
It is GB coded, with word and phrase boundaries marked.
cze01 newspaper Czech 726
Newspaper Texts (Lidove noviny, Literarni noviny)
cze02 newspaper Czech 4000
Newspaper Texts (Lidove noviny, Literarni noviny)
dut01 newspaper Dutch 600
Articles from the student newspaper Universiteitskrant of the
University of Groningen from the academic years 1990/1991 and
1991/1992.
dut02 mixed Dutch 5203
A large Dutch corpus from INL including transcripts of radio
programs, newspaper and magazine issues and some technical texts.
dut03 mixed Dutch 128
A continuation of dut02.
eng01 novels English 241
Three English novels from the OTA collection:
Thomas Hardy 'Far from the Madding Crowd'
George Eliot 'Silas Marner'
Charles Dickens 'A Christmas Carol'
eng02 novels English 900
The Complete Sherlock Holmes, Sir Arthur Conan Doyle.
est01 mixed Estonian 100
Extracts from general fiction and prose.
fre01 newspaper French 4121
Text from Le Monde newspaper, consisting of articles from September
and October 1989, and January 1990.
gae01 dictionary Gaelic 141
MacBain, Alexander, "An etymological dictionary of the
Gaelic language", Gairm Publications, 1982.
1st edition 1896, revised 1911.
ger01 sentenceList German 20
Lists of German sentences, tagged with some syntactic information.
The sentence test suite of DiTo, a linguistic database
for diagnostics in the syntax components of NLP systems.
ger02 newspaper German 191
German newspaper articles from VDI-Nachrichten 1990-1991
ger03 newspaper German 34291
Frankfurter Rundschau Newspaper text
ger04 newspaper German 7376
Donau Courier newspaper texts
gre01 mixed Greek 2515
Newspapers, periodicals, popular fiction 1976-1990.
ita01 novels Italian 13
Six short stories by G. Verga
ita03 newspaper Italian 303
Corpus of Italian newspapers (La Repubblica, La Stampa,
Il Mattino, Il Corriere)
jap01 dictionary Japanese 203
EDICT Japanese/English dictionary.
jap02 Technical Japanese 148(?)
Japanese version of the ITU CCITT data.
lat01 poetry Latin 75
Vergil, Aeneid, books I-XII
Vergil, Georgicon, books I-III
lit01 Fiction Lithuanian 20
"KOLEKCIONIERIUS" Story
mal01 Technical/Novels Malay 563
A collection of original Malay texts and translations from
English, mainly technical books with some novels.
From Universiti Sains Malaysia and Dewan Bahasa & Pustaka (publishers).
mul01 Financial En/Fr/Ge 566
Financial reports from Union Bank of Switzerland (mostly French-German)
mul02 technical Fr/Ge/It 177
Avalanche bulletins 1986-1991 (ca. 40 per year, about 250 words each)
from the Swiss Federal Institute for Snow and Avalanche Research.
(Very little Italian.)
mul03 legal Fr/Ge/It 227
Text of Swiss Civil Code
mul04 technical En/Fr/Sp 13497
International Telecommunications Union CCITT handbook
mul05 legal En/Fr/Sp 5000
International Labour Organisation "Official Bulletin, B Series":
"Reports of the Committee on Freedom of Association of the Governing
Body of the ILO and related material 1984-1989".
mul06 technical 9 EC langs 219
The announcement text of the EC Esprit program.
mul07 sentencelist En/Fr 12
BABEL project data - French business sentences and English
translations.
mul08 novel En/Serb 386
George Orwell's "1984" in English, Serbian, Croatian and
Slovenian versions.
mul09 technical 5 EC langs 248
ScanWorX User's Guide (Optical Character Reader)
mul10 Mixed English/French 19
HCRC MT Evaluation Corpus: French/English parallel texts
mul11 Financial German/French 615
Financial Reports from CREDIT SUISSE
mul12 Legal Danish/Spanish/English 1199
The machine-readable 'Civil Law Corpus' from the
Copenhagen Business School
mul13 novel Uzbek/English 72
Uzbek novel 'Ärk Freedom' with English interlinear translation
nor01 novels Norwegian 2226
Collection of texts in Bokmaal and Nynorsk, including some novels and
some Ibsen plays.
por01 mixed Portuguese 675
An extract from the Borba/Ramsey corpus of Brazilian Portuguese.
rus01 technical Russian 364
Technical reports (computer related) by Andrei Mikheev
ser01 stories Serbian 700
Short stories and novel extracts
spa01 speech Spanish 1041
Transcribed Spanish speech from
CORPUS ORAL DE REFERENCIA DEL ESPANOL CONTEMPORANEO 1991-1992
spa02 newspaper Spanish 447
One week of the local Spanish newspaper "Sur" from April and September 1991.
spa03 newspaper Spanish 830
"El Diario Vasco" newspaper articles 1991
swe01 mixed Swedish 1718
A fragment of SUC: the Stockholm-Umea Corpus of modern
written Swedish. Text extracts (~2000 words each) from
books and newspapers published after 1990.
tur01 dictionary Turkish 173
PC-KIMMO rule specification and word lists for Turkish morphology
tur02 newspaper Turkish 110
This is news text excerpted from the Anatolia News Agency feed
covering roughly Sept/Oct 1992. Approximately 10% of the total.
Total 98,792 K words
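
The size column can be cross-checked against the stated total with a short script. The following is only a minimal sketch: it assumes the listing is available as plain text in which each entry line starts with an id of three lowercase letters and two digits and ends with a size figure. The name parse_sizes and the regular expression are illustrative assumptions, not part of the corpus distribution.

```python
import re

# Hypothetical pattern (an assumption about the listing layout): an entry line
# starts with an id such as "ger03" and ends with a size in K words,
# optionally followed by "(?)" or the literal text "K words".
ENTRY = re.compile(
    r"^([a-z]{3}\d{2})\s+(.+?)\s+(\d+)(?:\(\?\))?(?:\s+K words)?\s*$"
)


def parse_sizes(text):
    """Return {id: size_in_k_words} for every entry line found in text."""
    sizes = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            sizes[match.group(1)] = int(match.group(3))
    return sizes


# A few lines copied from the listing; description lines are skipped
# because they do not start with an entry id.
sample = """\
bul01 Technical Bulgarian 5
A number of scientific papers from "Science" journal.
jap02 Technical Japanese 148(?)
mul06 technical 9 EC langs 219
"""

sizes = parse_sizes(sample)
print(sizes)
print(sum(sizes.values()))
```

Running parse_sizes over the full listing and comparing the sum with the Total line is one way to spot a mistyped or missing size.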