Reuters Releases Free Archive of Over 800,000 News Stories for Use by Universities and Research Organisations
London, UK. March 01, 2001.
Reuters, the global information, news and technology group, is for the first time making large quantities of archived Reuters news stories available free of charge for use by research communities around the world. The first Reuters Corpus archive includes over 800,000 English language news stories, equivalent to the annual global news output of Reuters.
The Reuters Corpus offers researchers a unique body of static information upon which to research, test and benchmark emerging technologies. These include research into language processing, speech synthesis, voice recognition, indexation, search and information retrieval.
The growth of the Internet has led to an explosion in the information services available to businesses and consumers. Additionally, improvements in bandwidth have increased the variety of channels and devices used to deliver and access information. Consequently, research into technologies that help businesses and individuals improve the way they access, search and manipulate information has assumed even greater significance. Availability of the Reuters Corpus assists organisations conducting this research.
Richard Willis, Head of Research and Standards, Reuters Chief Technology Office, commented: "Reuters has always been heavily involved in language and data research and to strengthen our links with the research community around the world, we have made available one of the most complete news archives ever released. The data provided will aid research into many aspects of language processing and information retrieval."
The archive includes all English language stories produced by Reuters globally between 20 August 1996 and 19 August 1997. The news data is available on two CD-ROMs and formatted in XML to make it easier to use as a research tool. All the news stories are fully referenced using a total of 775 different category codes for topic, geography and industry sector.
Dr Marc Moens, Head of Edinburgh University's Language Technology Group commented: "Because of its size and the amount of preparation that has gone into it, the Reuters collection provides scope for many new types of research and development work. It allows for the systematic evaluation of progress and comparison of results between different development groups. I am sure this Corpus will soon be seen as a standard in document related work."
Professor Yorick Wilks of Sheffield University said: "We can already see the potential benefits of such a Corpus for stylistic language analysis. The topic codes would also give us the opportunity to analyse the geographic location, industry area or topic that received news coverage from Reuters. Areas such as semantic web applications, categorization research and machine learning of topic routings would also benefit. This will be a very useful resource."
As part of the research agreement covering use of the archive, researchers will supply Reuters with a copy of any material published using the data. Working with this feedback from research groups, Reuters hopes to bring out other Corpora including multi-lingual versions and volumes covering other date ranges. Further information on the Corpus is available at www.reuters.com/researchandstandards/corpus/.
The premier position of Reuters (about.reuters.com) as a global information, news and technology group is founded on its reputation for speed, accuracy, integrity and impartiality combined with continuous technological innovation. Reuters' strength is based on its unique ability to offer customers around the world a combination of content, technology and connectivity. Reuters makes extensive use of Internet technologies for the widest distribution of information and news. Around 73 million unique visitors per month access Reuters content on some 1,400 Internet websites. Reuters is the world's largest international text and television news agency with 2,157 journalists, photographers and camera operators in 190 bureaux, serving 151 countries. In 2000 the Group had revenues of £3.59 billion and on 31 December 2000, the Group employed 18,082 staff in 204 cities in 100 countries.
XML (Extensible Markup Language) is a flexible way to create data formats so that the data and the format can be shared on the Internet and intranets. XML can be used by any individual or group that wants to share information in a consistent way. XML is a formal recommendation from the World Wide Web Consortium and is similar to the language of today's Internet pages, the Hypertext Markup Language. XML is extensible because the markup symbols used are self-defining and unlimited in number.
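As a rough illustration of how self-describing markup could carry category codes for topic, geography and industry sector, the Python sketch below parses one invented story record and groups its codes by type. The element names (`story`, `headline`, `codes`, `code`), the `type` attribute, and the code values are hypothetical and chosen only for this example; they are not taken from the Reuters Corpus documentation, and the real files may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical record: tag names, attributes and code values are invented
# for illustration and do not reproduce the actual Reuters Corpus markup.
SAMPLE = """\
<story id="example-1" date="1996-08-20">
  <headline>Example headline</headline>
  <codes type="topic"><code>TOPIC_A</code></codes>
  <codes type="geography"><code>REGION_X</code><code>REGION_Y</code></codes>
  <codes type="industry"><code>SECTOR_1</code></codes>
</story>
"""


def codes_by_type(xml_text):
    """Return a dict mapping each code type to the list of codes found."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for group in root.findall("codes"):
        for code in group.findall("code"):
            grouped[group.get("type")].append(code.text)
    return dict(grouped)


if __name__ == "__main__":
    print(codes_by_type(SAMPLE))
```

Because the tags name their own content, a researcher could filter stories by any one code type without knowing anything else about the record layout.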
Tel: +44(0) 20 7542 6487
Head of Communications
Reuters Chief Technology Office
Prepared by Robin Cover for The XML Cover Pages archive.