New [SGML Encoded] Corpora from the Linguistic Data Consortium

From owner-tei-l@LISTSERV.UIC.EDU Tue Apr 21 14:39:52 1998
Date: Tue, 21 Apr 1998 14:30:15 -0500 (CDT)
From: LDC Office <>
Subject: New Corpora from the Linguistic Data Consortium


                 Announcing NEW RELEASES from the
                   Linguistic Data Consortium

1996 Broadcast News Training Speech Data
1996 Broadcast News Dev. and Eval. Data
1996 Broadcast News Transcripts

The 1996 Broadcast News Speech Corpus contains a total of 104
hours of broadcasts from the ABC, CNN, and CSPAN television
networks and the NPR and PRI radio networks, with corresponding
transcripts. The primary motivation for this collection is to
provide training data for the DARPA "Hub-4" Project on
continuous speech recognition in the broadcast domain. The
speech files are available in a 19-disc training data set, with
one additional disc of development data and one additional disc
of evaluation data. The following programs are represented in
this corpus:

  ABC Nightline
  ABC World Nightly News
  ABC World News Tonight
  CNN Early Edition
  CNN Early Prime News
  CNN Headline News
  CNN Prime Time News
  CNN The World Today
  CSPAN Washington Journal
  NPR All Things Considered
  NPR Marketplace

Transcripts have been made of all recordings in this
publication. They are manually time-aligned to the phrasal
level and annotated to identify boundaries between news
stories, speaker turn boundaries, and the gender of the
speakers. The released version of the transcripts is in SGML
format; accompanying documentation and an SGML DTD file are
included with the transcription release.  The transcripts are
available via ftp.

Because of restrictions imposed by the copyright holders of the
news text, these corpora are available to 1997 and 1998 LDC
members only.  Members who wish to receive these corpora MUST
available on the Linguistic Data Consortium WWW Home Page at URL

If you would like to order a copy of these corpora, please
email your request to <>. If you need
additional information before placing your order, or would like
to inquire about membership in the LDC, please send email or
call (215) 898-0464.

Further information about the LDC and its available corpora can
be accessed on the Linguistic Data Consortium WWW Home Page at

Information is also available via ftp at
under pub/ldc; for ftp access, please use "anonymous" as your
login name, and give your email address when asked for