Corpus Encoding Standard - Document CES 1. Part 4. Version 1.2. Last modified 7 April 1996.

Part 4
Encoding Primary Data

4.0. Overview

For the foreseeable future, the greatest portion of texts that will be encoded exist already in electronic form. Such texts are referred to as legacy data. The vast majority of these documents were originally intended to be printed and therefore already contain markup in the form of typesetter codes, word processing formats, etc., primarily related to visual presentation.

The goal of encoding for corpus linguistics is to describe text structure that is linguistically relevant and mark objects relevant to analysis. Thus, for the purposes of corpus work in language engineering applications, a text (prior to linguistic annotation) is a set of linguistic objects, comprising at least

large units of discourse, such as paragraphs, chapters, etc. together with titles, footnotes, etc.;
basic linguistic objects common to linguistic analyses, such as sentences, clauses, phrases, words, morphemes, and phonemes, as well as names, dates, abbreviations, etc.

The text seen as a printed or displayed object, including fonts, layout, etc., and the text seen as a collection of linguistic objects represent two different views of the text. Some of the components of one of these views correspond to components of the other, while others do not. Therefore, the process of preparing a corpus originally existing as legacy data involves

the translation, where relevant, of presentation markup into markup descriptive of linguistic categories (e.g., the translation of items in bold to titles, etc.);
the elimination of presentational markup which does not signify an object of linguistic relevance;
possibly, the addition of tags for elements not marked in any way in the legacy document (e.g., proper names).

This process is potentially very costly, depending on how well presentational categories map directly into distinct linguistic categories, and how much additional markup for elements not marked in the original, or which are not easily distinguishable based on typography, is desired.

Because of the potential cost, data preparation is often accomplished by taking the data through a series of transformations, each of which raises the information level to some extent. The final state models the richest possible information state. The transformation process cannot be completely deterministic, since raising the information level often involves deciding which among several possible candidates a given tag maps to, as well as adding structural information that is not present or fully explicit in the previous state. Therefore, the transformation process is not fully automatic or entirely cost-free. However, it is possible to minimize transformation costs from one information state to the next higher one.

The CES provides a TEI-conformant DTD that can be used in such a process for encoding primary data. It has been designed to enable representing the text at any of various stages of information transformation (i.e., translating existing markup into relevant, increasingly information-rich categories). The representation of the text in the first (minimum required) representation can often be accomplished by automatic means and may be nearly cost-free. Users of the CES can encode their texts to conform to intermediate stages, aiming toward a rich representation of relevant linguistic information, depending on cost considerations, application needs, etc.

4.1. Levels of encoding for primary data

For the encoding of primary data the CES identifies three levels of encoding:

Level 1: This is the minimum encoding level required for CES conformance, requiring markup for gross document structure (major text divisions), down to the level of the paragraph, conformant to the cesDoc DTD.
Level 2: This level requires that paragraph level elements are correctly marked, and (where possible) the function of rendition information at the sub-paragraph level is determined and elements marked accordingly.
Level 3: This is the most restrictive and refined level of markup for primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph level tagging.

The following sections provide precise criteria for conformance to each level.

4.2. Level 1 conformance

4.2.1. Requirements

The document validates against the cesDoc DTD, using an SGML parser such as sgmls.
The header must provide a full description of all encoding formats utilized in the document.
The document must not contain foreign markup.
CES-conformant encoding to the paragraph level must be included. However, note that for Level 1 CES conformance, paragraph-level markup need not be refined. For example, via automatic means all carriage returns may be changed to <p> (paragraph) tags; additional work is needed to identify and mark those situations where the carriage return signals a list, a long quote, etc. This level of refinement is not required. Documents differentiating only <p> tags are still compliant with the cesDoc DTD, which (minimally) requires the following structure:


                     <cesDoc version="3.9">
                       <cesHeader version="2.0"> ... </cesHeader>
                       <text>  
                         <body>
                            <div> [optional]
                               <p>
                               <p>
                               <p>
                                ...
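
The automatic conversion of carriage returns to <p> tags mentioned above might be sketched as follows. This is only an illustration, not part of the CES: the function name `to_level1_body` is hypothetical, every non-empty line is assumed to be one paragraph, and the entity names produced by `escape` (`&amp;`, `&lt;`, `&gt;`) are assumed to be declared for the document. The cesDoc wrapper and header are left out.

```python
from xml.sax.saxutils import escape


def to_level1_body(raw_text):
    """Wrap each non-empty line of raw_text in <p>...</p> inside <text><body>.

    Hypothetical helper: assumes one paragraph per line in the legacy data.
    """
    paragraphs = [line.strip() for line in raw_text.splitlines() if line.strip()]
    lines = ["<text>", "<body>"]
    for para in paragraphs:
        # escape() rewrites &, < and > so they are not read as markup
        lines.append("<p>" + escape(para) + "</p>")
    lines += ["</body>", "</text>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_level1_body("First paragraph.\r\n\r\nSecond & last paragraph.\n"))
```

The output of such a step still has to be combined with a header and validated against the cesDoc DTD before it can count as Level 1.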

4.2.2. Recommendations

There should be no information loss for sub-paragraph elements. Sub-paragraph elements identified in the original by special typography that cannot be represented directly in the SGML encoded version (e.g., a distinction made by font, such as italics, as opposed to one made by capital letters or quote marks, which can be represented directly) should be marked, typically using a <hi> tag.
Markup of sub-paragraph elements is conformant to CES specifications.

4.2.3. Requirements for documents adapted from legacy data

When the document differs from an original either encoded using another encoding scheme, or containing no markup (apart from carriage returns to signal paragraphs, etc.), the CES-encoded text must be accompanied by a copy of the original data or information specifying where the original can be permanently and readily obtained (in the <sourceDesc> element), for the following reasons:
- it ensures that the encoded text can always be checked against the original.
- since the rendering of visual presentation classes into more descriptive markup categories is necessarily an interpretive process, having the original on hand enables the user to examine the original categories and, potentially, modify or improve them as necessary.
- because encoded texts may be gradually enriched by a number of users over time, it becomes increasingly essential to retain a trace of the "archaeology" of the document as well as to ensure that the original is permanently preserved.
All information in the original essential for the recognition of content is retained in the encoded version. This applies particularly to rendition information in a printed original (italics, etc.) that may signal a linguistically relevant element.
Information whose sole function is to allow re-creation of an original printed source (if one exists) should be discarded.
The original character sequence comprising the document should be retained, by employing the following principles:
- None of the original sequence of characters (with the possible exception of rendition text) should be deleted or altered.
- The original data should not be given in attributes, but should always appear as tag content. Note that data such as list numbers, footnote symbols, etc., can be considered rendition text and placed in attributes on the appropriate tag.
- Apart from the original data, no other data should appear as tag content.
- The original order of the data should not be changed.
- Line breaks in the original which do not signal logical divisions (paragraphs, etc.) should be encoded as blanks or, when they break a logically contiguous unit, ignored.
The translation process should be documented in the text and/or corpus header, as appropriate, in the <encodingDesc> element.
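
The treatment of line breaks inside a paragraph described above could be sketched as follows. This is a rough illustration under stated assumptions, not a CES procedure: the function name `join_lines` is hypothetical, and a line ending in a hyphen is assumed (as a heuristic only) to break a logically contiguous unit. No original character is deleted; only the line break itself is turned into a blank or ignored.

```python
def join_lines(paragraph_lines):
    """Join the physical lines of one paragraph into a single string.

    Hypothetical helper: a line break becomes one blank, except after a
    line ending in "-", where the break is ignored (heuristic assumption).
    """
    result = ""
    for line in paragraph_lines:
        piece = line.strip()
        if not piece:
            continue
        if result and not result.endswith("-"):
            result += " "  # line break encoded as a blank
        # otherwise the break splits a contiguous unit and is ignored
        result += piece
    return result


if __name__ == "__main__":
    print(join_lines(["The quick", "brown fox"]))
    print(join_lines(["a well-", "known case"]))
```

Whatever heuristic is used, it should be recorded in the <encodingDesc> element, as required above.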

4.2.4. Recommendations for documents adapted from legacy data

Alignment between the original data and the SGML encoded text should be provided.

4.3. Level 2 conformance

4.3.1. Requirements

The requirements for a Level 1 document are satisfied.
If a sub-paragraph element is marked, every occurrence of that element has been identified and marked in the text.
SGML entities replace all special characters (e.g., —, £, etc.).
Quotation marks are removed and either replaced by appropriate standard SGML entities, or represented in a rend attribute on a <q> or <quote> tag.
The document validates against the cesDoc DTD, using an SGML parser such as sgmls.
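
A minimal sketch of the special-character requirement follows. The table `ENTITY_NAMES` and the function `to_sgml_entities` are hypothetical names introduced for illustration; the three entity names shown come from the standard ISO entity sets, and a real table would need to cover every special character in the corpus. The sketch deliberately raises an error on an unmapped character rather than guessing a name.

```python
# Illustrative subset only; entity names follow the ISO entity sets.
ENTITY_NAMES = {
    "\u2014": "mdash",   # em dash (ISOpub)
    "\u00a3": "pound",   # pound sign (ISOnum)
    "\u00e9": "eacute",  # e with acute accent (ISOlat1)
}


def to_sgml_entities(text):
    """Replace each mapped special character with an SGML entity reference."""
    out = []
    for ch in text:
        name = ENTITY_NAMES.get(ch)
        if name is not None:
            out.append("&" + name + ";")
        elif ord(ch) > 127:
            # Unmapped special character: fail loudly instead of guessing.
            raise ValueError("no entity defined for character U+%04X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_sgml_entities("caf\u00e9 \u2014 \u00a35"))
```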

4.3.2. Recommendations

All paragraph level elements (lists, quotes, etc.) are correctly identified.
Where possible, <hi> tags are resolved to more precise tags (foreign, term, etc.).

4.4. Level 3 conformance

Conformance to this level requires the following:

Requirements for a Level 2 document are satisfied.
All paragraph level elements (lists, quotes, etc.) are correctly identified.
Where possible, <hi> tags are resolved to more precise tags (foreign, term, etc.).
The following sub-paragraph elements have been identified and marked, either with explicit tags such as <abbr>, <num>, etc. or with user-defined morpho-syntactic tags (see section 4.5.13):
- abbreviations
- numbers
- names
- foreign words and phrases
Where s-units and dialogue are tagged, the <p> - <s> - <q> hierarchy described in section 4.5 must be followed.
The encoding for all elements including and below the level of the paragraph has been validated for a 10 percent sample of the text. Note: this does not include morpho-syntactic tagging, if present.
The document validates against the cesDoc DTD, using an SGML parser such as sgmls.
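
The 10 percent sample required above has to be drawn somehow; the text does not say how. One possible sketch, assuming the paragraph as the sampling unit and a simple random draw (both are assumptions, as is the name `pick_validation_sample`), is:

```python
import random


def pick_validation_sample(paragraphs, percent=10, seed=0):
    """Return (index, paragraph) pairs covering about `percent` of the text.

    Hypothetical helper: sampling by paragraph and the fixed seed are
    assumptions made for this illustration.
    """
    n = len(paragraphs)
    if n == 0:
        return []
    # Integer ceiling of n * percent / 100, with at least one paragraph
    k = max(1, (n * percent + 99) // 100)
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    indices = sorted(rng.sample(range(n), k))
    return [(i, paragraphs[i]) for i in indices]


if __name__ == "__main__":
    paras = ["paragraph %d" % i for i in range(25)]
    for index, text in pick_validation_sample(paras):
        print(index, text)
```

Recording the seed and the selected indices alongside the text would let a later user see which portion was checked.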