SGML: MULTEXT

Multext: Multilingual Text Tools and Corpora At HCRC. PI: Henry Thompson. RAs: David McKelvie, Steve Finch. Funding: EC LRE.

The aims are: (i) to create a marked-up, validated multilingual text corpus using a well-defined annotation standard; and (ii) to develop a set of software tools to perform common corpus-based NLP tasks.

These standards and tools will be applicable to the main European languages. The corpus annotation standard will be based on SGML and on the work of the TEI and EAGLES projects. The software tools will follow the UNIX philosophy: small tools that can easily be combined into more complex ones, with each tool designed to be easily parametrized to work with new languages. This work will be undertaken in co-operation with various other international efforts. The main software tools to be developed during the project are: a word/sentence segmenter; a morphology analyser and lexical lookup program; an automatic part-of-speech disambiguator; a multilingual text aligner; and interactive post-editing tools.
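
As a rough illustration of this design (a minimal sketch only; the patterns and behaviour below are assumptions for illustration, not the actual Multext tools), two such filters might be composed like this, with the language-specific behaviour confined to a couple of parameters:

    # Hypothetical sketch of the UNIX-philosophy design: each tool is a small
    # stdin-to-stdout filter, so tools can be chained with pipes, e.g.
    #     cat corpus.txt | python segment_tokenize.py
    # Language-specific behaviour sits in a few parameters (the sentence-end
    # and token patterns below), so the tool can be re-parametrized for a new
    # language without changing its code.
    import re
    import sys

    SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # assumed per-language parameter
    TOKEN = re.compile(r"\w+|[^\w\s]")            # assumed per-language parameter

    def segment(stream):
        """Yield one sentence at a time from the raw input text."""
        for sentence in SENTENCE_END.split(stream.read()):
            sentence = sentence.strip()
            if sentence:
                yield sentence

    def tokenize(sentences):
        """Yield each sentence with its tokens separated by spaces."""
        for sentence in sentences:
            yield " ".join(TOKEN.findall(sentence))

    if __name__ == "__main__":
        for line in tokenize(segment(sys.stdin)):
            print(line)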

The related MLCC project is collecting a multilingual text corpus consisting of financial newspaper articles in six languages, together with parallel texts from the Journal of the European Parliament. This corpus will be used to test the tools developed in the Multext project. The industrial partners in the project will develop applications based on these tools for the following tasks: term extraction, machine translation lexicon extraction, and a machine translation testbench.

The Language Technology Group contributes to this project in several ways. We are co-ordinating the development of the software tools in general, and designing and developing a tool shell program in particular. The latter provides an environment in which individual NL tools can be connected together to perform more complex tasks; it also provides a transparent interface to SGML marked-up corpora. This interface allows the individual tools to remain largely unaware of the larger document structure while maintaining that structure during processing. We are also collecting the MLCC newspaper corpus, marking up the English component of this corpus, and providing text from the locally collected ECI corpus.
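
To make the idea of a structure-transparent interface concrete, the following is a minimal sketch (an assumption for illustration, not the Multext tool shell itself): markup is passed through unchanged, and only the character data between tags is handed to the NL tool, so the tool never has to know about the surrounding document structure.

    # Hypothetical sketch of a structure-transparent interface: SGML tags are
    # passed through untouched, and only the text between tags is handed to the
    # NL tool, so the tool stays unaware of the document structure while the
    # markup survives processing.
    import re

    TAG = re.compile(r"(<[^>]+>)")  # naive SGML tag matcher, adequate for a sketch

    def apply_to_text(markup, nl_tool):
        """Run nl_tool over text content only, leaving the SGML tags intact."""
        pieces = TAG.split(markup)
        return "".join(p if p.startswith("<") else nl_tool(p) for p in pieces)

    def upcase(text):
        """Stand-in for a real NL tool such as a tokenizer or tagger."""
        return text.upper()

    if __name__ == "__main__":
        sample = "<p><s>The <term>corpus</term> is marked up in SGML.</s></p>"
        print(apply_to_text(sample, upcase))
        # -> <p><s>THE <term>CORPUS</term> IS MARKED UP IN SGML.</s></p>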