SGML: New Release from the LDC



From corpora-request@lists.uib.no Tue Mar 19 13:41:59 1996
Posted-Date: Tue, 19 Mar 1996 12:42:45 EST
Message-Id: <9603191742.AA29555@unagi1k.cis.upenn.edu>
To: corpora@hd.uib.no (Corpora)
Cc: ldc@unagi.cis.upenn.edu
Subject: New Release from the LDC
Date: Tue, 19 Mar 1996 12:42:45 EST
Reply-To: LDC Office <ldc@unagi1k.cis.upenn.edu>
From: LDC Office <ldc@unagi1k.cis.upenn.edu>
Sender: owner-corpora@lists.uib.no


                Announcing a NEW RELEASE from the
                   LINGUISTIC DATA CONSORTIUM

		  SPANISH NEWS TEXT COLLECTION


The Spanish News Corpus consists of journalistic text data from one
newspaper (El Norte, Mexico) and from the Spanish-language services
of three newswire sources: Agence France Presse, Associated Press
Worldstream, and Reuters.  (The Reuters collection comprises two
distinct services: Reuters Spanish Language News Service and Reuters
Latin American Business Report.)

All text data are stored on one CD-ROM, in a standard compressed
form.  The four sets of newswire data (AFP, APWS, and two Reuters
services) are each organized as one data file per day of collection. 
The period covered by these collections runs from December 1993 (for
APWS and Reuters) or May 1994 (AFP) through December 1995.  (The El
Norte data, provided to us by INFOSEL Mexico, are arbitrarily grouped
into files of about 1 megabyte in size when uncompressed; date
information is not available for individual articles, but the general
period of the collection is 1993.)

The approximate amounts of data per source (when uncompressed) are
indicated below, in total megabytes (MB) and millions of words of
text (MW).  REUSL and REULA denote the two Reuters services, and
INFOSEL denotes the El Norte data:

        Source     MB      MW
        ---------------------
        AFP        345     44
        APWS       253     33
        REUSL      333     41
        REULA      233     23
        INFOSEL    209     31
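
As a rough check on the figures above, the totals and the average
number of bytes per word can be worked out as in the following sketch.
It assumes that 1 MB is 10^6 bytes; the announcement does not say which
convention was used, so the per-word figures are only approximate.

```python
# Figures copied from the table: source -> (megabytes, millions of words).
SIZES = {
    "AFP": (345, 44),
    "APWS": (253, 33),
    "REUSL": (333, 41),
    "REULA": (233, 23),
    "INFOSEL": (209, 31),
}

# Totals across all five sources.
total_mb = sum(mb for mb, _ in SIZES.values())
total_mw = sum(mw for _, mw in SIZES.values())

# MB per million words equals bytes per word, under the assumption
# that 1 MB = 10**6 bytes.
bytes_per_word = {name: round(mb / mw, 1) for name, (mb, mw) in SIZES.items()}

print("total MB:", total_mb)
print("total MW:", total_mw)
for name, bpw in bytes_per_word.items():
    print(name, bpw)
```

The uncompressed total comes to 1373 MB and 172 million words, which is
why the data are distributed in compressed form on a single CD-ROM.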

The presentation of text data in these collections is modeled on the
TIPSTER corpus.  Within each data file, SGML tagging is used (1) to
mark article boundaries, (2) to delimit the text portion within each
article, and (3) to label various pieces of information about the
article that are external to the text content (e.g. headlines,
bylines, and so on).
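
The three uses of SGML tagging described above can be illustrated with
a minimal sketch that splits one data file's contents into articles.
The tag names used here (DOC, DOCNO, HEADLINE, TEXT) are assumptions
based on TIPSTER-style markup and the sample data is invented; the
documentation on the CD-ROM is the authority for the actual tag set.

```python
import re

# Invented sample in TIPSTER-like markup. The tag names (DOC, DOCNO,
# HEADLINE, TEXT) are assumptions, not taken from the corpus itself.
SAMPLE = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<HEADLINE> Titular de ejemplo </HEADLINE>
<TEXT>
Primer artículo de ejemplo.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Otro artículo de ejemplo.
</TEXT>
</DOC>
"""

# (1) Article boundaries: everything between <DOC> and </DOC>.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc, tag):
    """Return the stripped content of the first <tag>...</tag>, or None."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, doc, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_articles(data):
    """Split the contents of one data file into one dict per article."""
    articles = []
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        articles.append({
            # (3) Information external to the text content.
            "docno": field(body, "DOCNO"),
            "headline": field(body, "HEADLINE"),
            # (2) The text portion of the article.
            "text": field(body, "TEXT"),
        })
    return articles


articles = parse_articles(SAMPLE)
for article in articles:
    print(article["docno"], "|", article["headline"], "|", article["text"])
```

Regular expressions are enough for flat markup of this kind; a full
SGML parser would be needed only if the files rely on features such as
omitted end tags.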

The copyright holders of this text have requested that it be made
available to LDC members only.  Because of the release date, this
corpus is available to 1995 and 1996 members.  To obtain this corpus,
current LDC members must submit a signed User Agreement Form.

Inquiries about the corpus, requests for it, and questions about
becoming a member should be directed to ldc@unagi.cis.upenn.edu.

Further information about the LDC and its available corpora can be
accessed on the Linguistic Data Consortium WWW Home Page at URL
http://www.cis.upenn.edu/~ldc. Information is also available via ftp
at ftp.cis.upenn.edu under pub/ldc; for ftp access, please use
"anonymous" as your login name, and give your email address when asked
for a password.