Cover Pages: Reuters Research and Standards Group Releases 800,000 XML-encoded News Stories.

The Reuters Research and Standards Group (RSG) has announced that it will release a free archive of over 800,000 news stories in XML markup for use in research and development of natural-language-processing, information-retrieval or document-understanding systems. The Reuters Corpus "offers researchers a unique body of static information upon which to research, test and benchmark emerging technologies. These include research into language processing, speech synthesis, voice recognition, indexation, search and information retrieval. The archive includes all English language stories produced by Reuters globally between 20-August-1996 and 19-August-1997. The news data is available on two CD-ROMs and formatted in XML to make it easier to use as a research tool. All the news stories are fully referenced using a total of 775 different category codes for topic, geography and industry sector. As part of the research agreement covering use of the archive, researchers will supply Reuters with a copy of any material published using the data. Working with this feedback from research groups, Reuters hopes to bring out other Corpora including multi-lingual versions and volumes covering other date ranges."
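As a rough illustration of how XML-encoded stories carrying topic, geography and industry codes might be read by a research tool, here is a minimal Python sketch. The element and attribute names (`newsitem`, `codes`, `code`, `itemid`) and the code values are illustrative assumptions, not the published Reuters Corpus schema; consult the corpus documentation for the actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample story. Element names, attribute names and code values
# are assumptions made for illustration only, not the real corpus schema.
SAMPLE = """\
<newsitem itemid="1" date="1996-08-20">
  <headline>Example headline</headline>
  <text><p>First paragraph.</p><p>Second paragraph.</p></text>
  <codes class="topics"><code code="T1"/><code code="T2"/></codes>
  <codes class="countries"><code code="C1"/></codes>
  <codes class="industries"><code code="I1"/></codes>
</newsitem>
"""

def parse_story(xml_text):
    """Extract the id, date, headline, body text and category codes of one story."""
    root = ET.fromstring(xml_text)
    # Group category codes by their (assumed) class: topic, geography, industry.
    codes = {}
    for group in root.findall("codes"):
        codes[group.get("class")] = [c.get("code") for c in group.findall("code")]
    # Join the paragraph elements into a single body string.
    body = "\n".join(p.text or "" for p in root.findall("./text/p"))
    return {
        "id": root.get("itemid"),
        "date": root.get("date"),
        "headline": root.findtext("headline", default=""),
        "body": body,
        "codes": codes,
    }

story = parse_story(SAMPLE)
```

Keeping the codes grouped by class makes it straightforward to filter stories by topic, geography or industry sector when building a categorization or retrieval test set.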

Related language materials are available from UPenn LDC: "The Linguistic Data Consortium has developed a range of (mainly SGML-based) formats for transcripts and other types of annotation that it has published. The LDC has also implemented a general data model for searching annotated text and speech corpora online, via LDC-Online."

The following is currently available from RSG: "Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19. Release date 2000-11-03, Format version 1, correction level 0. This is distributed on two CDs and contains about 810,000 Reuters, English Language News stories. It requires about 25 GB for storage of the uncompressed files... Our next goal is to produce a volume of non-English Language News stories covering the period 20 August 1996 - 19 August 1997. Although this is parallel in time to the currently released information it should not be considered to be a parallel corpus, for translation purposes. Future work will expand both English and non-English material to cover additional (more recent) years."

From the announcement of March 01, 2001:

The growth of the Internet has led to an explosion in the information services available to businesses and consumers. Additionally, improvements in bandwidth have increased the variety of channels and devices used to deliver and access information. Consequently, research into technologies that help businesses and individuals improve the way they access, search and manipulate information, has assumed even greater significance. Availability of the Reuters Corpus assists organisations conducting this research.

Richard Willis, Head of Research and Standards, Reuters Chief Technology Office, commented: "Reuters has always been heavily involved in language and data research and to strengthen our links with the research community around the world, we have made available one of the most complete news archives ever released. The data provided will aid research into many aspects of language processing and information retrieval."

Dr Marc Moens, Head of Edinburgh University's Language Technology Group commented: "Because of its size and the amount of preparation that has gone into it, the Reuters collection provides scope for many new types of research and development work. It allows for the systematic evaluation of progress and comparison of results between different development groups. I am sure this Corpus will soon be seen as a standard in document related work."

Professor Yorick Wilks of Sheffield University said: "We can already see the potential benefits of such a Corpus for stylistic language analysis. The topic codes would also give us the opportunity to analyse the geographic location, industry area or topic that received news coverage from Reuters. Areas such as semantic web applications, categorization research and machine learning of topic routings would also benefit. This will be a very useful resource."

"The Research and Standards Group's function is to assist projects throughout Reuters by providing a centre of expertise, and to investigate opportunities that arise by virtue of emerging technologies and products. RSG also provides a point of contact between Reuters and academia and the wider research communities."

Principal references:

Reuters Web site
Reuters News Stories corpus
Reuters Research and Standards Group
Announcement: "Reuters Releases Free Archive of Over 800,000 News Stories for Use by Universities and Research Organisations."
Reuters-21578 Text Categorization Test Collection - 22 data files corresponding to the SGML DTD.
See also: Linguistic Data Consortium (LDC)

