Contents of the ECI/MCI Corpus





 Id      Type                Language          Size
                                           (K words)
====================================================

 alb01	 Word list and Texts	Albanian	205

	 (a) Albanian word list, 32K words with syntactic classes.
	 The Albanian dictionary of 1984,
	 published in Tirana by the Academy of Sciences.
	 (b) The novel "Koncert në fund të dimrit" by Ismail Kadare, published
	 in Tirana.

 bul01	 Technical		Bulgarian	   5

	 A number of scientific papers from "Science" journal.

 chi01	 Newspaper	  Chinese       2895

	The PH text corpus described here contains 3.75
	million Chinese characters.  It is a collection of news from
	China's official Xinhua (New China) news agency (hereafter Xinhua)
	covering the period from January 1990 to March 1991.
	It is GB coded, with word and phrase boundaries marked.

 cze01   newspaper           Czech              726

	 Newspaper Texts (Lidove noviny, Literarni noviny)

 cze02   newspaper           Czech             4000

	 Newspaper Texts (Lidove noviny, Literarni noviny)

 dut01   newspaper           Dutch              600

	 Articles from the student newspaper Universiteitskrant of the
	 University of Groningen from the academic years 1990/1991 and
         1991/1992.

 dut02   mixed               Dutch             5203

	 A large Dutch corpus from INL including transcripts of radio
	 programs, newspaper and magazine issues and some technical texts.

 dut03   mixed               Dutch              128

	 A continuation of dut02.

 eng01   novels              English            241

	 Three English novels from the OTA collection:
	 Thomas Hardy 		'Far from the Madding Crowd'
	 George Eliot 		'Silas Marner'
	 Charles Dickens 	'A Christmas Carol'

 eng02   novels              English            900

	 The Complete Sherlock Holmes, Sir Arthur Conan Doyle.

 est01   mixed               Estonian           100

	 Extracts from general fiction and prose.

 fre01   newspaper           French            4121
  
	 Text from Le Monde newspaper, consisting of articles from September
	 and October 1989, and January 1990.

 gae01   dictionary          Gaelic             141

	 MacBain, Alexander, "An etymological dictionary of the
	 Gaelic language", Gairm Publications, 1982
	 (1st edition 1896, revised 1911)

 ger01   sentenceList        German              20

	 Lists of German sentences, tagged with some syntactic information.
	 The sentence test suite of DiTo, a linguistic database
	 for diagnostics in the syntax components of NLP systems.

 ger02   newspaper           German             191

	 German newspaper articles from VDI-Nachrichten, 1990-1991

 ger03   newspaper           German           34291

	 Frankfurter Rundschau Newspaper text

 ger04   newspaper           German            7376

	 Donau Courier newspaper texts

 gre01   mixed               Greek             2515

	 Newspapers, periodicals and popular fiction, 1976-1990.

 ita01   novels              Italian             13

	 Six short stories by G. Verga

 ita03   newspaper           Italian            303

	 Corpus of Italian newspapers (La Repubblica, La Stampa,
	 Il Mattino, Il Corriere)

 jap01   dictionary          Japanese           203

	 EDICT Japanese/English dictionary.

 jap02	 Technical		Japanese	  148(?)

	 Japanese version of the ITU CCITT data.

 lat01	 poetry		   Latin	75

	 Vergil, Aeneid, books I-XII
	 Vergil, Georgicon, books I-III

 lit01	 Fiction 		Lithuanian	  20

	 "KOLEKCIONIERIUS" Story

 mal01   Technical/Novels	Malay	563

	 A collection of original Malay texts and translations from
	 English, mainly technical books with some novels.
	 From Universiti Sains Malaysia and Dewan Bahasa & Pustaka (publishers).

 mul01   Financial               En/Fr/Ge	 566   

	 Financial reports from Union Bank of Switzerland (mostly French-German)

 mul02   technical              Fr/Ge/It	 177   

	 Avalanche bulletins 1986-1991 (ca. 40 per year, about 250 words each)
	 from the Swiss Federal Institute for Snow and Avalanche Research.
	 (Very little Italian)

 mul03   legal        Fr/Ge/It	 227   

	 Text of Swiss Civil Code 

 mul04   technical               En/Fr/Sp	 13497

	 International Telecommunications Union CCITT handbook

 mul05   legal               En/Fr/Sp          5000

	 International Labour Organisation "Official Bulletin, B Series":
	 "Reports of the Committee on Freedom of Association of the Governing 
	 Body of the ILO and related material 1984-1989".

 mul06   technical          9 EC langs	 219   

	 The announcement text of the EC Esprit program.

 mul07   sentencelist             En/Fr	  12   

	 BABEL project data - French business sentences and English
	 translations.

 mul08   novel              En/Serb	 386  

	 George Orwell's "1984" in English, Serbian, Croatian and
	 Slovenian versions. 

 mul09   technical            5 EC langs	 248 

	 ScanWorX User's Guide (Optical Character Reader) 

 mul10	 Mixed		English/French		19

	 HCRC MT Evaluation Corpus: French/English parallel texts

 mul11	 Financial	German/French	615

	 Financial Reports from CREDIT SUISSE

 mul12	 Legal		Danish/Spanish/English	1199

	 The machine-readable 'Civil Law Corpus' from the
	 Copenhagen Business School

 mul13	 novel		   Uzbek/English	72

	 Uzbek novel 'rk Freedom' with English interlinear translation

 nor01   novels              Norwegian         2226

	 Collection of texts in Bokmaal and Nynorsk: some novels and some
	 Ibsen plays.

 por01   mixed               Portuguese         675

	 An extract from the Borba/Ramsey corpus of Brazilian Portuguese.

 rus01	 technical	   Russian	364

	 Technical reports (computer related) by Andrei Mikheev

 ser01   stories             Serbian            700

	 Short stories and novel extracts

 spa01   speech              Spanish           1041

	 Transcribed Spanish speech from the
	 CORPUS ORAL DE REFERENCIA DEL ESPAÑOL CONTEMPORÁNEO, 1991-1992

 spa02   newspaper           Spanish            447

	 One week of the local Spanish newspaper "Sur", from April and
	 September 1991.

 spa03   newspaper           Spanish            830

	 "El Diario Vasco" newspaper articles 1991

 swe01   mixed               Swedish           1718

	 A fragment of SUC, the Stockholm-Umeå Corpus of modern
	 written Swedish. Text extracts (~2000 words each) from
	 books and newspapers published after 1990.

 tur01   dictionary          Turkish            173

	 PC-KIMMO rule specification and word lists for Turkish morphology

 tur02   newspaper           Turkish            110

	 This is news text excerpted from the Anatolia News Agency feed
	 covering roughly Sept/Oct 1992. Approximately 10% of the total.

Total                                   98,792 K words