[Mirrored from: http://www.ldc.upenn.edu/ldc/about/guide_text.html]

The LDC as publisher and distributor of speech corpora

TEXT CORPORA

Use of SGML

The best formatting mechanism for text is Standard Generalized Markup Language (SGML). It is widely used (more so than SPHERE; the HyperText Markup Language (HTML), the format used throughout the World Wide Web, is in fact one application of SGML), it can be kept quite simple, free software is available to support its use, and it is adaptable to a wide range of languages and uses. SGML includes the notion of a "Document Type Definition" (DTD), which provides a clear and complete specification of the markup used in a given collection of text. The LDC does not require that a fully functional DTD be supplied, or that the SGML tagging of a text collection fully comply with a given set of conventions (e.g. those developed by the Text Encoding Initiative, TEI); what is essential is that the markup be clear, consistent, and correctly applied, so that it can be "parsed" according to a finite set of rules.
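As a rough illustration of markup that can be "parsed" according to a finite set of rules, the sketch below checks that every tag in a text belongs to a known set and that tags are properly nested. It is not an SGML parser and does not read a DTD. The function name check_markup and the tag names DOC, TEXT and P are hypothetical, and the sketch assumes every element is closed explicitly, although SGML itself permits end tags to be omitted when a DTD allows it.

```python
import re

# Matches start tags like <DOC id=1> and end tags like </DOC>.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

def check_markup(text, allowed_tags):
    """Return a list of problems; an empty list means tags are known and balanced.

    Assumption: every element has an explicit end tag (no SGML tag omission).
    """
    problems = []
    stack = []  # names of elements that are currently open
    for match in TAG_RE.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2).upper()
        if name not in allowed_tags:
            problems.append(f"unknown tag: {name}")
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack[-1] != name:
            problems.append(f"unexpected closing tag: {name}")
        else:
            stack.pop()
    # Anything still open at the end was never closed.
    problems.extend(f"unclosed tag: {name}" for name in reversed(stack))
    return problems

# Hypothetical tag set for a simple document collection.
TAGS = {"DOC", "TEXT", "P"}
print(check_markup("<DOC><TEXT><P>Some text.</P></TEXT></DOC>", TAGS))
print(check_markup("<DOC><TEXT><P>Some text.</TEXT></DOC>", TAGS))
```

A check of this kind only confirms consistency of the tagging; validation against a real DTD would need a proper SGML parser.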

Cases where SGML is not needed

Some collections of text data, such as lexicons and language models, may be structured more as database tables than as documents, and in such cases the use of SGML may seem unnecessary or even unsuitable (the CELEX and COMLEX lexicons are good examples). Provided that the tabular formatting of the data is consistent, easily parsable, and fully documented, the addition of SGML markup is not required. (Still, it would certainly be possible to formulate an SGML structure and suitable DTD that would keep the presentation format simple while adding the value of SGML syntax checks for quality control, a self-describing data structure, and so on.)
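One simple way to test whether a tabular file is consistent is to confirm that every row has the documented number of fields. The sketch below does this for delimited text. The function name check_table, the tab delimiter, and the three-column lexicon layout in the example are all assumptions made for illustration; they do not describe the actual CELEX or COMLEX formats.

```python
def check_table(lines, expected_fields, delimiter="\t"):
    """Return (line_number, field_count) for each row with the wrong field count."""
    bad_rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != expected_fields:
            bad_rows.append((number, len(fields)))
    return bad_rows

# Hypothetical lexicon rows: headword, part of speech, pronunciation.
rows = [
    "cat\tnoun\tk ae t\n",
    "run\tverb\tr ah n\n",
    "dog\tnoun\n",  # missing the pronunciation field
]
print(check_table(rows, expected_fields=3))
```

An empty result means every row matched the expected column count; it says nothing about whether the field contents themselves are valid.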

Character set encoding and file formats

As indicated above for transcription data, text corpora should be provided in commonly used character sets. The LDC can provide for character set translation where necessary, but if the original data are in an unusual character set, and you are not able to do the conversion yourself, we may need your help to define the appropriate conversion, carry it out, and verify the results.
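The conversion and verification steps mentioned above might be sketched as follows. The function name convert_encoding, the ISO-8859-1 source, and the UTF-8 target are assumptions chosen for the example, not LDC requirements. The round-trip check only confirms that no information was lost in conversion; it cannot detect a wrong guess about the source character set, so the converted text should still be inspected by someone who reads the language.

```python
def convert_encoding(data: bytes, source_encoding: str,
                     target_encoding: str = "utf-8") -> bytes:
    """Convert bytes between character sets and verify the result round-trips."""
    # Raises UnicodeDecodeError if the bytes are invalid in the source encoding.
    text = data.decode(source_encoding)
    # Raises UnicodeEncodeError if the target cannot represent a character.
    converted = text.encode(target_encoding)
    # Verification: converting back must reproduce the original bytes exactly.
    if converted.decode(target_encoding).encode(source_encoding) != data:
        raise ValueError("round-trip check failed; conversion is not lossless")
    return converted

# Example: the byte 0xE9 is "e with acute accent" in ISO-8859-1.
original = b"caf\xe9"
print(convert_encoding(original, "iso-8859-1"))
```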

Data files must have a simple "plain text" format. The LDC cannot publish or process data collections that are provided in a format that is proprietary or specific to a particular word processing program or other commercial software. Likewise, files in PostScript format cannot be accepted as research data.

Documentation for end users

Documentation should explain the text data format in reasonable detail, and show relevant examples. If the data are in tabular form, the column format and field definitions need to be described in full. For data in less familiar languages, some description of the character encoding will be helpful (e.g. in case your corpus ends up being someone's first exposure to Thai or Korean).