New LDC Corpus - Mandarin Broadcast News


From       owner-tei-l@LISTSERV.UIC.EDU Wed Sep 30 18:47:58 1998
Date:      Wed, 30 Sep 1998 18:16:56 CDT
From:      LDC Office <ldc@unagi.cis.upenn.edu>
Subject:   New Corpus

Announcing a NEW CORPUS from the LDC

1997 Mandarin Broadcast News Speech and Transcripts

New from the LDC (Linguistic Data Consortium), this collection consists of 30 hours of recorded broadcasts and transcripts that have been drawn from the following sources:

Of these three sources, the first two make up the bulk of the collection and are represented in roughly equal amounts. Only a small sample of KAZN-AM recordings is included, owing to the high proportion of unusable material (commercials, local traffic reports loaded with California place names, etc.).

The transcripts were created by native speakers of Mandarin working at the LDC. They are GB-encoded, with SGML tagging that identifies story boundaries, speaker turn boundaries, and phrasal pauses; the tags carry time stamps that align the text with the speech data. Word segmentation (white space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish Hub-4 collections.
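
As a rough illustration of how such a transcript might be read, the Python sketch below decodes a GB-encoded byte string, collects the opening tags with their time-stamp attributes, and gathers the white-space-segmented words. The tag names (section, turn, time), the attribute names (startTime, endTime, sec), the sample text, and the choice of the "gb2312" codec are all assumptions made for this example; none of them is taken from the corpus, and the DTD shipped with the data is the authority on the real markup.

```python
import re

# Hypothetical sample transcript. Tag names, attribute names, and the text
# are assumptions for illustration, not copied from the corpus or its DTD.
SAMPLE = (
    '<section type="report" startTime="12.50" endTime="20.75">\n'
    '<turn speaker="anchor_1" startTime="12.50" endTime="20.75">\n'
    '今天 天气 很 好\n'
    '<time sec="15.20">\n'
    '我们 继续 报道\n'
    '</turn>\n'
    '</section>\n'
).encode("gb2312")  # assumed codec for "GB-encoded" text

# An opening tag that occupies a whole line, with optional name="value" pairs.
TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_transcript(raw: bytes, encoding: str = "gb2312"):
    """Return (tags, words) from a line-oriented, SGML-tagged transcript.

    tags  -- list of (tag_name, attribute_dict) for each opening tag
    words -- list of segmented words from the text lines
    """
    text = raw.decode(encoding)
    tags, words = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG.fullmatch(line)
        if match:
            tags.append((match.group(1), dict(ATTR.findall(match.group(2)))))
        elif line.startswith("</"):
            continue  # closing tags carry no time stamps in this sketch
        else:
            words.extend(line.split())  # words are separated by white space
    return tags, words


tags, words = parse_transcript(SAMPLE)
for name, attrs in tags:
    print(name, attrs)
print(len(words), "words")
```

A regular expression is enough here only because the sketch assumes one tag per line; transcripts with a different layout would call for a real SGML parser driven by the supplied DTD.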

Because of restrictions imposed by the copyright holders, this corpus is available to 1998 LDC members only. Members who wish to receive this corpus must sign the 1997 Mandarin Broadcast News license. This license can be retrieved from the LDC website at:

   http://www.ldc.upenn.edu/ldc/catalog/nonmem_agree/agreements.html

If you would like to order a copy of this corpus, please email your request to ldc@unagi.cis.upenn.edu. If you need additional information before placing your order, or would like to inquire about membership in the LDC, please send email or call +1 (215) 898-0464.

Further information about the LDC and its available corpora can be accessed on the Linguistic Data Consortium WWW Home Page at URL:

      http://www.ldc.upenn.edu/