SGML: ECI (European Corpus Initiative)


European Corpus Initiative. PI: Henry S. Thompson. RA: David McKelvie. Funding: (1) EC NERC, (2) EC ELSNET, and (3) HCRC core funding.

The aim was to produce a reasonably large text corpus of the major European languages for the linguistic research community. It is generally agreed that there are not enough corpora in languages other than English. A number of corpus projects are currently under way, and we believed it would be useful to collect a sample of this data and make it available in a standardised form, as a basis for evaluation and shared results. The original call for contributions describes the ECI as follows:

"The European Corpus Initiative was founded to oversee the acquisition and preparation of a large multi-lingual corpus to be made available in digital form for scientific research at cost and without royalties. We believe that widespread easy access to such material would be a great stimulus to scientific research and technology development as regards language and language technology. . . . No amount of abstract argument as to the value of corpus material is as powerful as the experience of actually having access to some in one's laboratory."

In reply to this call, many colleagues spontaneously offered their data and/or supplied the contacts needed to negotiate redistribution rights for the ECI. In many cases we actively sought out data providers, especially for the under-represented languages and for larger parallel collections. Most of the work of collecting materials and permissions and converting them into a consistent format was done at the HCRC in Edinburgh and at ISSCO, University of Geneva, under the overall supervision of Henry S. Thompson and Susan Armstrong, respectively.

The ECI/MCI corpus has now been published on CD-ROM and contains almost 100 million words in 27 (mainly European) languages. It consists of 48 opportunistically collected component corpora marked up in SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with between two and nine sub-corpora.
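
As an illustration of what "source text without markup" means in practice, the following minimal Python sketch removes SGML-style tags from a string. It is not the ECI's own tooling: the function name strip_markup, the tag names, and the sample text are all hypothetical, and the regular expression is a simplification that ignores SGML comments, declarations, and entity references.

```python
import re

# Assumption: tags are simple <name ...> or </name> spans with no ">" inside
# attribute values. Real SGML (comments, declarations, entities) needs a parser.
TAG = re.compile(r"<[^>]+>")


def strip_markup(sgml: str) -> str:
    """Return the text of `sgml` with tags removed and whitespace collapsed."""
    text = TAG.sub("", sgml)
    return " ".join(text.split())


if __name__ == "__main__":
    # Hypothetical sample; the element names are illustrative, not the ECI's.
    sample = "<text><p>Ceci est un <hi>exemple</hi>.</p></text>"
    print(strip_markup(sample))
```

For corpora with richer markup, a proper SGML parser would be a safer choice than a regular expression.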