[Mirrored from: Part 4. Encoding Primary Data]
Corpus Encoding Standard
Document CES 1. Part 4. Version 1.2. Last modified 7 April 1996.
For the foreseeable future, most of the texts that will be encoded
already exist in electronic form. Such texts are referred to as legacy data.
The vast majority of these documents were originally intended to be printed and
therefore already contain markup in the form of typesetter codes, word
processing formats, etc., primarily related to visual presentation.
The goal of encoding for corpus linguistics is to describe text structure that
is linguistically relevant and to mark objects relevant to analysis. Thus, for
the purposes of corpus work in language engineering applications, a text (prior to
linguistic annotation) is a set of linguistic objects, comprising at least
- large units of discourse, such as paragraphs, chapters, etc., together with
titles, footnotes, etc.;
- basic linguistic objects common to linguistic analyses, such as sentences,
clauses, phrases, words, morphemes, and phonemes, as well as names, dates,
abbreviations, etc.
The text seen as a printed or displayed object,
including fonts, layout, etc., and the text seen as a collection of linguistic
objects represent two different views of the text. Some of the
components of one of these views correspond to components of the other, while
others do not. Therefore, the process of preparing a corpus originally existing
as legacy data involves
- the translation, where relevant, of presentation markup into markup
descriptive of linguistic categories (e.g., the translation of items in bold to
titles, etc.);
- the elimination of presentational markup which does not signify an object
of linguistic relevance;
- possibly, the addition of tags for elements not marked in any way in the
legacy document (e.g., proper names).
This process is potentially very costly, depending on how well presentational
categories map directly into distinct linguistic categories, and on how much
additional markup is desired for elements that are not marked in the original
or are not easily distinguishable based on typography.
Because of the potential cost, data preparation is often accomplished by taking
the data through a series of transformations, each of which raises the
information level to some extent. The final state models the richest possible
information state.
The transformation process cannot be completely deterministic,
since raising the information level often involves deciding which among several possible
candidates a given tag maps to, as well as adding structural information that
is not present or fully explicit in the previous state. Therefore, the transformation
process is not fully automatic or entirely cost-free. However, it is possible to
minimize transformation costs from one information state to the next higher
one.
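The preparation steps and staged transformations described above can be illustrated with a minimal Python sketch. It is not part of the CES: the legacy codes (<b>, <i>, <font>), the choice of <title> as the target of bold text, and the rend value "it" are all hypothetical stand-ins for whatever a real legacy format and encoding project would use.

```python
import re

# Hypothetical legacy codes: <b>, <i> and <font ...> stand in for whatever
# typesetter or word-processor markup the legacy data actually contains.

def drop_font_codes(text):
    """Eliminate presentational markup with no linguistic relevance."""
    return re.sub(r"</?font\b[^>]*>", "", text)

def translate_bold_to_title(text):
    """Translate presentational bold into a descriptive tag (assumed: <title>)."""
    return re.sub(r"<b>(.*?)</b>", r"<title>\1</title>", text, flags=re.DOTALL)

def keep_italics_as_hi(text):
    """Keep italics of undetermined function as <hi> (rend value is assumed)."""
    return re.sub(r"<i>(.*?)</i>", r'<hi rend="it">\1</hi>', text, flags=re.DOTALL)

def run_stages(text, stages):
    """Apply each stage in order and keep every intermediate state."""
    states = [text]
    for stage in stages:
        states.append(stage(states[-1]))
    return states

legacy = '<font size="3">He read <b>Moby Dick</b> and said <i>non</i>.</font>'
states = run_stages(
    legacy, [drop_font_codes, translate_bold_to_title, keep_italics_as_hi]
)
print(states[-1])
```

Keeping every intermediate state mirrors the idea that each transformation raises the information level a little. The third operation, adding tags for elements not marked in the legacy document (e.g., proper names), is deliberately left out, since it usually requires interpretation rather than pattern replacement.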
The CES provides a TEI-conformant DTD that can be used in such a
process for encoding primary data. It has been designed so that the text can be
represented at any of various
stages of information transformation (i.e., translating existing markup into
relevant, increasingly information-rich categories).
The first (minimum required) representation of the text can often be produced by automatic means and may be nearly cost-free. Users of the CES can encode their texts to conform to intermediate stages, aiming toward a rich representation of relevant linguistic information,
depending on cost considerations, application needs, etc.
For the encoding of primary data the CES identifies three levels of encoding:
- Level 1
- This is the minimum encoding level required for CES
conformance, requiring markup for gross document structure (major text
divisions), down to the level of the paragraph, conformant to the cesDoc DTD.
- Level 2
- This level requires that
paragraph level elements are correctly marked, and (where possible) the
function of rendition information at the sub-paragraph level is determined and
elements marked accordingly.
- Level 3
- This is the most restrictive and refined level of markup for
primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph level tagging.
The following sections provide precise criteria for conformance to each level.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.
- The header must provide a full description of all encoding formats utilized in the document.
- The document must not contain foreign markup.
- CES-conformant encoding to the paragraph level must be included. However,
note that for Level 1 CES conformance, paragraph-level markup need not be refined.
For example, via automatic means all carriage returns may be changed to
<p> (paragraph) tags; additional work is needed to identify and
mark those situations where the carriage return signals a list, a long quote,
etc. This level of refinement is not required. Documents differentiating only
<p> tags are still compliant with the cesDoc DTD, which (minimally) requires
the following structure:
<cesDoc version="3.9">
<cesHeader version="2.0"> ... </cesHeader>
<text>
<body>
<div> [optional]
<p>
<p>
<p>
...
- There should be no information loss for sub-paragraph elements. Sub-paragraph elements identified in the original by special typography not
directly representable in the SGML encoded version (e.g., distinction by font
such as italics, vs. distinction by capital letters or quote marks, which is
directly representable in the encoded version) should be marked, typically
using a <hi> tag.
- Markup of sub-paragraph elements is conformant to CES specifications.
- When the document differs from an original that was either encoded using another encoding scheme or contained no markup (apart from carriage returns to signal paragraphs, etc.), the CES-encoded text must be
accompanied by a copy of the original data or by information, given in the
<sourceDesc> element, specifying where the original can be permanently and
readily obtained, for the following reasons:
- it ensures that the encoded text can always be checked
against the original.
- since the rendering of visual presentation classes
into more descriptive markup categories is necessarily an interpretive process,
having the original on hand enables the user to examine the original categories
and, potentially, modify or improve them as necessary.
- because encoded
texts may be gradually enriched by a number of users over time, it becomes
increasingly essential to retain a trace of the "archaeology" of the document
as well as to ensure that the original is permanently preserved.
- All information in the original essential for the recognition of content
is retained in the encoded version. This refers particularly to rendition
information in a printed original (italics, etc.) that may signal a
linguistically relevant element.
- Information whose sole function is to allow re-creation of
an original printed source (if one exists) should be discarded.
- The original character sequence comprising the document should be
retained, by employing the following principles:
- None of the original sequence of characters (with the possible
exception of rendition text) should be deleted or altered.
- The original data should not be given in attributes, but should
always appear as tag content. Note that data such as list numbers, footnote symbols, etc., can be considered rendition text and placed in attributes on the appropriate tag.
- Apart from the original data, no other data should appear as tag
content.
- The original order of the data should not be changed.
- Line breaks in the original which do not signal logical divisions
(paragraphs, etc.) should be encoded as blanks or, when they break a logically
contiguous unit, ignored.
- The translation process should be documented in the text and/or corpus
header, as appropriate, in the <encodingDesc> element.
- Alignment between the original data and the SGML encoded text should be
provided.
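The automatic conversion mentioned above for Level 1 (turning carriage returns into <p> tags inside the minimal cesDoc structure) can be sketched as follows. This is only an illustration under stated assumptions: paragraphs are taken to be separated by blank lines, the header is supplied by the caller as a complete <cesHeader> element, and the function name to_level1 is hypothetical. The result would still have to be validated against the cesDoc DTD with an SGML parser.

```python
import html
import re

def to_level1(raw_text, header):
    """Wrap plain text in a minimal cesDoc skeleton using only <p> tags.

    Assumptions: blank lines separate paragraphs, and `header` is a
    complete <cesHeader> element supplied by the caller.
    """
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Line breaks that do not signal a logical division become blanks.
        line = " ".join(p.strip() for p in block.splitlines() if p.strip())
        if line:
            # Escape markup delimiters so the original characters survive
            # as character data rather than being read as markup.
            paragraphs.append("<p>" + html.escape(line, quote=False) + "</p>")
    return "\n".join(
        ['<cesDoc version="3.9">', header, "<text>", "<body>"]
        + paragraphs
        + ["</body>", "</text>", "</cesDoc>"]
    )

# The header content is elided here; a real document needs a full header.
header_placeholder = '<cesHeader version="2.0"> ... </cesHeader>'
sample = "Call me Ishmael.\nSome years ago.\n\nAT&T rose."
doc = to_level1(sample, header_placeholder)
print(doc)
```

The sketch keeps the original order and character sequence of the data and only adds tags around it, in line with the principles listed above. It does not attempt the refinement (lists, long quotes, etc.) that Level 1 does not require.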
- The requirements for a Level 1 document are satisfied.
- If a sub-paragraph element is marked, every occurrence of that element has
been identified and marked in the text.
- SGML entities replace all special characters (e.g., &mdash;, &pound;,
etc.).
- Quotation marks are removed and either replaced by appropriate standard SGML
entities, or represented in a rend attribute on a <q> or
<quote> tag.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.
- All paragraph level elements (lists, quotes, etc.) are correctly
identified.
- Where possible, <hi> tags are resolved to more precise tags
(foreign, term, etc.).
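Two of the Level 2 criteria above, replacing special characters by SGML entities and replacing quotation marks by a <q> tag with a rend attribute, lend themselves to a small Python sketch. The mapping table is deliberately tiny, the rend value "dq" is a hypothetical label for the removed marks, and the function names are illustrative only. Whether a given entity name is declared for the document must be checked against the DTD in use.

```python
import re

# Assumed mapping from characters to entity references; extend as needed.
ENTITIES = {"\u2014": "&mdash;", "\u00a3": "&pound;"}

def quotes_to_q(text, rend="dq"):
    """Replace paired curly double quotation marks by a <q> element.

    The rend value records which marks were removed; "dq" is hypothetical.
    """
    pattern = r"\u201c([^\u201c\u201d]*)\u201d"
    return re.sub(pattern, r'<q rend="%s">\1</q>' % rend, text)

def replace_special_characters(text):
    """Replace mapped characters and report non-ASCII ones left unmapped."""
    out, unmapped = [], set()
    for ch in text:
        if ch in ENTITIES:
            out.append(ENTITIES[ch])
        else:
            if ord(ch) > 127:
                unmapped.add(ch)
            out.append(ch)
    return "".join(out), unmapped

source = "\u201cIt cost \u00a35\u201d \u2014 he said."
encoded, unmapped = replace_special_characters(quotes_to_q(source))
print(encoded)
print(sorted(unmapped))
```

Quotation marks are handled first so that they are not reported as unmapped characters. Returning the set of unmapped characters, rather than silently passing them through, makes it easier to confirm that all special characters have been replaced.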
Conformance to Level 3 demands the following:
- Requirements for a Level 2 document are satisfied.
- All paragraph level elements (lists, quotes, etc.) are correctly
identified.
- Where possible, <hi> tags are resolved to more precise tags
(foreign, term, etc.).
- The following sub-paragraph elements have been identified and
marked (either with explicit tags such as <abbr>,
<num>, etc. or with user-defined morpho-syntactic
tags; see section 4.5.13):
- abbreviations
- numbers
- names
- foreign words and phrases
- Where s-units and dialogue are tagged, the
<p> - <s> - <q> hierarchy described in section 4.5 must be followed.
- The encoding for all elements including and below the level of the paragraph has been validated for a 10 percent sample of the text. Note: this does not include morpho-syntactic tagging, if present.
- The document validates against the cesDoc DTD, using an SGML parser such as
sgmls.
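The 10 percent sample required for Level 3 validation could be drawn as in the following Python sketch. The criteria above state only the size of the sample; the choice of the paragraph as the sampling unit, random selection, and the fixed seed (so that the same sample can be drawn again) are assumptions made for illustration.

```python
import random

def validation_sample(paragraphs, percent=10, seed=0):
    """Return sorted indices of a random sample of the paragraphs.

    The sample size is `percent` of the paragraphs, rounded up, with at
    least one paragraph whenever the text is not empty.
    """
    n = len(paragraphs)
    if n == 0:
        return []
    k = max(1, (n * percent + 99) // 100)  # integer ceiling of n*percent/100
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return sorted(rng.sample(range(n), k))

paragraphs = ["<p>paragraph %d</p>" % i for i in range(50)]
indices = validation_sample(paragraphs)
print(len(indices))
```

The selected paragraphs would then be checked by hand for the encoding of all elements at and below the paragraph level, excluding any morpho-syntactic tagging.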