[Mirrored from: Part 1. General Principles]
Corpus Encoding Standard
Document CES 1. Part 1. Version 1.2. Last modified 1 April 1996
This section gives the sense in which we use the terms in this report.
A text is a piece of human language communication in the broader sense,
that one has reason to consider as a whole.
A representation of a text is a material transcription of a text. It can
be a paper printing, an electronic form, an audio recording, etc.
An interpretation of a text, or collection of texts, is any information
added to a text. There are degrees of interpretation: at one end of the
continuum is interpretation for which there exist widely accepted criteria,
such as the labeling of a part of a text as a title or paragraph. On the other
end is interpretation of a more debatable or subjective nature, such as the
provision of linguistic annotation or the identification of the presence of a
given topic in some part of a text.
A text plus its interpretation itself satisfies the definition of text, and
therefore can be considered as a new text. For example, the editor's
annotations of an original manuscript can be viewed as a part of a new text
comprising the edited version. Similarly, a text plus part of speech annotation
can itself be seen as a new text.
Encoding is any means of making explicit some interpretation of a text
or collection of texts.
Marking up is one possible means of encoding texts by interspersing
sequences of
- markup or tags, which represent the interpretation of
segments of text
- content which consists of segments of the represented text
A markup scheme is a triple consisting of
- a character set;
- a syntax (rules defining what constitutes a well-formed marked-up text);
- a semantics (rules defining what constitutes a valid marked-up text in
some universe of interpretation).
Syntactic rules define:
- legal markup
- legal content
- legal ways of interspersing markup and content.
Syntactically well-formed texts are not necessarily semantically valid. For example, the
sequence <word>New York</word> might be syntactically
well-formed in some markup scheme, but this does not ensure that "New York" is
a word in a given universe of interpretation. Note that in some universes of
interpretation, the interpretation of "New York" as a word would be valid,
while in others it may not. Semantic rules define which syntactically
well-formed texts are valid in different universes of interpretation.
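The distinction can be illustrated with a short Python sketch. It is only an
illustration under stated assumptions: an XML parser stands in for the syntax
check of some markup scheme, and the two universes of interpretation (the
functions orthographic_word and lexical_word, and the LEXICON set) are
hypothetical and not defined by the CES.

```python
import xml.etree.ElementTree as ET

# Hypothetical universes of interpretation (illustrative, not part of the CES).
def orthographic_word(s):
    # A word is a non-empty string with no internal whitespace.
    return bool(s) and not any(ch.isspace() for ch in s)

LEXICON = {"New York", "corpus"}  # assumed lexicon of lexical units

def lexical_word(s):
    # A word is any unit listed in the lexicon, spaces included.
    return s in LEXICON

def is_well_formed(markup):
    # Syntax check: the XML parser stands in for the scheme's syntactic rules.
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

def is_valid(markup, word_rule):
    # Semantic check: every <word> element must satisfy the universe's rule.
    if not is_well_formed(markup):
        return False
    root = ET.fromstring(markup)
    return all(word_rule(el.text or "") for el in root.iter("word"))

sample = "<word>New York</word>"
```

Under these assumptions, is_valid(sample, orthographic_word) is false and
is_valid(sample, lexical_word) is true, although is_well_formed(sample) holds
in both cases.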
A markup metalanguage is a set of rules that formally describe the form
of the syntactic rules of a markup scheme.
Semantic rules are very often not formalized in a markup scheme. In many cases
semantics relies on common knowledge about, for example, what constitutes a
title or a chapter. This leaves room for confusion in the way markup is
applied.
There are two general uses for a markup scheme:
- local processing, including data capture, as well as various
applications such as text editing and formatting; search and retrieval;
linguistic, semantic, metrical, etc. analysis; text collation; etc.
- data interchange between individuals or sites.
The CES is intended for data interchange. A standard for interchange is desirable, so
that only translation between a single markup scheme and a local format is
required to use an externally-acquired text, rather than pair-wise translation
between all possible local formats. A standard for local processing is not
possible or even desirable at this point, given the wide range of application
domains and platforms.
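The saving can be made concrete with a small calculation, sketched here in
Python. We assume, for illustration, that each direction of translation
between two formats requires its own translator; the function names are ours.

```python
def pairwise_translators(n):
    # One translator per ordered pair of distinct local formats.
    return n * (n - 1)

def hub_translators(n):
    # One translator into and one out of the interchange format
    # for each local format.
    return 2 * n

n_formats = 10
pairwise = pairwise_translators(n_formats)
via_interchange = hub_translators(n_formats)
```

For 10 local formats this gives 90 pair-wise translators against 20
translators to and from the interchange format.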
An interchange standard must necessarily be domain-, application-, and
platform-independent, and therefore maximally general. Ideally, it must be as
expressive as any local format in order to enable translation of all local
formats into the interchange format with no loss of information. Therefore,
before an interchange standard can be developed, it is necessary to identify a
set of common text categories and a common text model. The existence of a
standard set of categories and a common text model will contribute to the
convergence of existing local practices and provide a framework for the
development of local processing formats in the future.
We distinguish three levels of text standardization. The successive levels are
increasingly prescriptive in terms of the markup conventions that must be used
to conform to that level of standardization. Each level requires
standardization at the preceding level as a prior condition. In addition, the
three levels of standardization are interdependent; that is, decisions at one
level will affect what can be done at the next level.
Because each of the three levels imposes increasing uniformity of encoding, the
data become more and more reusable as standardization is tightened. At the same
time, the application areas within which the data are reusable typically become
more and more restricted. Thus, there is a trade-off between generality and
reusability as the level of standardization is increased.
Standardization at the metalanguage level regulates the form of the
syntactic rules and the basic mechanisms of markup schemes. It
does not specify the markup itself (tag names, allowable sequences of tags,
etc.).
SGML (ISO 8879:1986; see also Goldfarb, 1990; Bryan, 1988; and van Herwijnen,
1991) is unique in that it is a standard at the metalanguage level only. The
SGML reference concrete syntax defines the forms of tags (including internal
attributes), the base character set, naming rules, reserved words, allowable
features (e.g., omission of end tags), etc. It does not define actual tag names
or rules for their use in a marked-up text.
Using the SGML Document Type Definition (DTD) mechanism, the user can define
tag names and "document models" which specify the relations among tags. This
constitutes the syntactic level (see below).
Standardization at the metalanguage level does not fully achieve the goal of
universal document interchange, since it is possible, for example, to have
entirely different document structures and markup even though the texts are
encoded using the same metalanguage specifications.
A more powerful way to standardize texts is to specify precise tag names and
syntactic rules for using the tags (i.e., the context(s) in which they can
legally appear), as well as constraints on content (for example, by specifying
that a tag can be associated with numeric data only).
Most familiar markup schemes are at this level: they provide precise tag names and
rules for the contexts in which the tags can legally appear.
SGML documents are standardized at this level if they share common DTDs.
Conformance to a syntactic standard can be checked by parsing; that is, there
are formal means to verify that a given text follows the markup syntax rules.
Standardization at the syntactic level does not guarantee that markup has been
consistently applied with the same interpretation. For example, even if a tag
such as <word> appears in a legal syntactic context in a given
interchanged text, it is possible that the sender and receiver do not have the
same understanding of the content marked by that tag. This impairs immediate
reusability of the data, since, for example, even a simple word count or the
content of a lexicon created from the text could vary considerably depending on
the definition of a word.
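A minimal Python sketch shows how far a simple word count can diverge. Both
definitions of a word below are hypothetical examples chosen for
illustration; neither is the CES definition.

```python
import re

text = "New York isn't far."

def count_whitespace_words(s):
    # Definition A: a word is any maximal run of non-whitespace characters.
    return len(s.split())

def count_alpha_runs(s):
    # Definition B: a word is any maximal run of ASCII letters.
    return len(re.findall(r"[A-Za-z]+", s))

count_a = count_whitespace_words(text)
count_b = count_alpha_runs(text)
```

For this sentence the first definition yields 4 words and the second yields
5, because "isn't" splits into two alphabetic runs. A third definition that
treated "New York" as a single word would give yet another count.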
Markup semantics are typically informal, usually relying on the user to apply a
given tag appropriately. For example, a tag such as <title> is likely
to be used to mark those things which humans more or less agree upon to be titles.
This kind of semantics is typically specified in accompanying user manuals; TEI P3 is an extensive example of the
specification of tag semantics at this level.
Standardization at this level requires more precise
definitions and constraints on the content of markup. The CES aims to
standardize at the semantic level for those elements most relevant to language
engineering applications, in particular, linguistic elements. Although it is
not always possible to provide precise formal specifications for markup
semantics, we attempt to identify a definition or set of definitions for
linguistic elements that best serve the needs of language engineering
applications.
We distinguish three broad categories of information which are of direct
relevance for the encoding of corpora for use in language engineering
applications.
Documentation
This includes global information about the text, its content, and its encoding. This type of markup corresponds roughly to the TEI header.
For example:
- bibliographic description of the document;
- documentation of character sets and entities;
- description of encoding conventions;
etc.
Primary data
Within the primary data, we can distinguish two types of information that may be encoded:
- Gross structure
This includes universal text elements down to the level of
paragraph, which is the smallest unit that can be identified
language-independently; for example:
- structural units of text, such as volume, chapter, etc., down to the level
of paragraph; also footnotes, titles, headings, tables, figures, etc.;
- features of typography and layout, for previously printed texts: e.g., list item markers;
- non-textual information (graphics, etc.).
etc.
- Sub-paragraph structures
This includes elements appearing at the sub-paragraph level which are usually
signalled (sometimes ambiguously) by typography in the text and which are
language dependent; for example:
- orthographic sentences, quotations;
- orthographic words;
- abbreviations, names, dates, highlighted words;
etc.
Linguistic annotation
This type of information enriches the text with the results of some linguistic
analyses; most often in language engineering applications, such analysis is at
the sub-paragraph level. For example:
- morphological information;
- syntactic information (e.g., part of speech, parser output);
- alignment of parallel texts;
- prosody markup;
etc.
An obvious criterion for the CES is that it enable marking
those features and properties of texts that are required for language
engineering applications. This means that the set of features must be
extensive enough to serve at least a large percentage of corpus encoding
needs. At the same time, it is desirable that the scheme does not include a
vast array of unnecessary or peripheral elements or encoding options. This
is important for the following reasons:
- a simpler scheme is easier to understand and use;
- for a corpus encoding standard to be effective, it should, where possible,
disallow multiple ways of encoding the same phenomenon and instead allow
only the one which is best suited to the application;
- similarly, it should not allow encoding options which are not
appropriate for this application.
Therefore, the CES has been designed to include a small but adequate set of
elements for corpus-based work. In some instances, this has meant including
only specific TEI elements where more general tags exist; and in other
cases, the reverse is true. In each case, the choices are made on the basis
of what is required for corpus-based language engineering research and
applications.
An encoding scheme should be built around consistent principles to determine
what kind of objects are tags, what kind of objects are attributes, what kind
of object(s) appear as tag content, etc. A well-thought-out system with strong
principles (for example, tags for structural and logical pieces, attributes for
properties, etc.) ensures the intellectual integrity and coherence of the
encoding scheme and provides a basis for those who modify or extend it.
Conversely, a lack of such a principled basis leads to practical problems in
processing an encoded text, for example, for validation, search and retrieval,
etc., since different encoding styles can be mixed within the same document.
Consistency is also essential to facilitate the mapping of the SGML-encoded
text into other formats, for example, database formats.
For more discussion and examples, see the excerpt from
MUL/EAG-CES 3: Corpus Encoding Standard: Background and Principles.
When a text is encoded from a printed or electronic source (typesetter's tapes,
etc.) the ability to recover the source text from the encoded version--that is,
to distinguish what was in the source from the markup and potential additional
information--is often desirable. There are a number of different ways to define
what is to be recovered from a source text (e.g., a facsimile of a particular
printed version of a text, layout, typography, etc.). For many purposes
(comparison and validation between the source and the encoded text, operations
such as word counts, search, concordance generation, linguistic analysis,
etc.), it is sufficient to recover the sequence of characters constituting the
text, independent of any typographic representation.
Recovery is an algorithmic process and should be kept as simple as possible,
since complex algorithms are likely to introduce errors. Therefore, an encoding
scheme should be designed around a set of principles intended to make recovery
possible with simple algorithms. Processes such as tag removal and simple mappings
are more straightforward and less error-prone than, say, algorithms which
require rearranging the sequence of elements, or which are context-dependent,
etc. In order to provide a coherent and explicit set of recovery
principles, various recovery algorithms and related encoding principles need
to be worked out, taking into account such things as the role and nature of
mappings (tags to typography, normalized characters, spellings, etc. with the
original, etc.), the encoding of rendition characters and rendition text,
definitions and separability of the source and annotation (such as linguistic
annotation, notes, etc.), linkage of different views or versions of a text, etc.
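Recovery by tag removal, the simplest case mentioned above, can be sketched
in Python as follows. The sketch assumes an encoding that is well-formed XML
and in which markup neither adds characters to the content nor reorders it;
the sample text and the hi element with its rend attribute are used here for
illustration only.

```python
import xml.etree.ElementTree as ET

def recover_text(encoded):
    # Concatenate the character data in document order, dropping all tags.
    return "".join(ET.fromstring(encoded).itertext())

source = "The Times reported it."
encoded = '<p>The <hi rend="it">Times</hi> reported it.</p>'

recovered = recover_text(encoded)
```

Here recover_text(encoded) returns the source string unchanged. Recovery
stays this simple only as long as normalizations, added notes, and similar
material are kept separable from the source characters.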
Validation is the process by which software checks that the markup in a
document conforms to the structural specifications given in a DTD. SGML
validation software checks that tags have legal names, are properly nested,
appear in the correct order, contain all required tags, etc.; that attributes
appear when and only when they should, have legal values; etc.
The ability to validate is important because it enables trapping errors during data
capture. It also enables ensuring that the encoded text corresponds to the model
given in the DTD, thus providing a possible means by which the adequacy of the model
itself can be verified.
There is a tension between the generality of an encoding scheme and the ability
to validate. Over-generative DTDs allow many tag sequences which, for any given
text, are not valid. In addition, the use of abstract, general tags also
constrains the ability to validate; for example, the use of a general tag such
as <div> to mark hierarchical divisions of a text (corresponding,
for example, to book, chapter, section, etc.) disallows constraints on what can
appear within a given text division, making it impossible to ensure that
tighter structural constraints for a given book are observed (e.g., that
titles do not appear within chapters, or that a paragraph does not appear
outside the chapter level, etc.).
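The effect of general tags on validation can be sketched with a toy checker
in Python. This is not an SGML validator: the two content models below are
hypothetical, they record only which child elements are allowed (not order
or required elements), and the sample documents are assumed to be
well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical content models: element name -> allowed child element names.
GENERIC = {"div": {"div", "title", "p"}, "title": set(), "p": set()}
SPECIFIC = {"book": {"title", "chapter"}, "chapter": {"p"},
            "title": set(), "p": set()}

def violations(elem, model):
    # Collect every parent/child pair that the model does not allow.
    if elem.tag not in model:
        return [f"unknown element <{elem.tag}>"]
    found = []
    for child in elem:
        if child.tag not in model[elem.tag]:
            found.append(f"<{child.tag}> not allowed in <{elem.tag}>")
        found.extend(violations(child, model))
    return found

# The same structure, with a paragraph outside any chapter, encoded twice.
generic_doc = ET.fromstring(
    "<div><title>T</title><p>stray</p><div><p>text</p></div></div>")
specific_doc = ET.fromstring(
    "<book><title>T</title><p>stray</p><chapter><p>text</p></chapter></book>")

generic_errors = violations(generic_doc, GENERIC)
specific_errors = violations(specific_doc, SPECIFIC)
```

The generic model reports no violations, since a div may contain a paragraph
at any depth, while the specific model flags the paragraph that appears
directly under book.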
Data capture involves
- capture of the text itself, either by hand or via OCR, acquisition of word
processor output, typesetter tapes, etc.; we assume that by-hand capture is not very
likely for applications, although it is not excluded.
- addition of markup. Fully automatic markup is rarely possible; markup is
typically achieved either by hand or semi-automatically, via format translators,
annotation programs such as POS taggers, etc.
The kind of markup that is added to a text directly affects the costs of
capture. Some kinds of markup can be very costly, if, for example, no program
can accomplish it automatically or if markup programs leave so many ambiguities
that a large amount of post-editing is required. Capturability is an important
concern when defining minimum requirements for conformance to a standard
because corpora often consist of millions of words of text, making hand marking
and substantial post-editing too costly to be practical. Capturability has
important repercussions for the design of the encoding scheme:
- The scheme should accommodate the various levels of analysis of the text
and provide markup for both very crude element designation (which can be much
less costly to achieve) as well as more precise tagging. For example, markup
indicating that a word or any arbitrary segment appears in italics already
exists in many texts (such as typesetter's tapes), and it is therefore
virtually cost-free to mark it as such; to determine more precisely what the
italics mean can be much more costly, since italics can indicate any number of
things (title, caption, quotation, emphasis, foreign word, term, etc.).
Similarly, it may be cheap and sufficient for many applications to make only a
gross distinction between the main text (to which one may want to restrict
linguistic analysis, e.g.) and auxiliary text (titles, division headers,
captions, tables, footnotes, bibliographic references, etc.).
- The scheme should be refinable, by providing tags at various levels of
specificity together with a taxonomy identifying the hierarchical relations
among them. For example, a word marked in italics could later be further
analyzed and identified as a highlighted word, and later more precisely marked
as a term, and still later further identified as a foreign term, etc.
- Minimum requirements for conformance to the standard must be made in view
of the costs of capture. Minimum requirements cannot include tagging that is
actually or even potentially costly. For example, requiring that italics are
disambiguated to the lowest level of the hierarchy would result in high costs
for data capture since it requires substantial hand intervention. Even
seemingly simple tagging, such as tagging paragraphs, can be costly depending
on the input, if, for example, line breaks are not differentiated from
paragraph breaks (as in electronic mail, etc.).
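The refinability requirement can be pictured as a taxonomy in which each tag
points to the more general tag it refines. The Python sketch below uses the
italics example from the list above; the tag names and the PARENT table are
hypothetical and do not reproduce the CES tagset.

```python
# Hypothetical taxonomy: each tag maps to the more general tag it refines
# (None marks the most general, cheapest level).
PARENT = {
    "italic": None,
    "highlighted": "italic",
    "term": "highlighted",
    "foreign_term": "term",
}

def ancestors(tag):
    # Return the chain of more general tags, nearest first.
    chain = []
    while PARENT[tag] is not None:
        tag = PARENT[tag]
        chain.append(tag)
    return chain

def is_refinement(new, old):
    # True if `new` is a strictly more specific tag than `old`.
    return old in ancestors(new)
```

With such a table, a text first marked only at the "italic" level can be
re-marked later as "term" or "foreign_term" without invalidating the earlier
encoding, since is_refinement confirms that the new tag specializes the old
one.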
The CES must take into account processing considerations and needs, such
as the overhead of use of SGML mechanisms (e.g., entity replacement, use of
optional features), as well as concerns such as the ability to (efficiently)
select texts according to user-specified criteria; also, the need to use
special mechanisms, such as inter-textual pointers, linkage of related texts or
other sub-corpus segments, vs. constraints their use may incur (e.g., the use
of inter-textual pointers may demand that the entire corpus be available at all
times for processing).
A related concern is mappability to internal representation schemes that may be
used for local processing or special applications. Although ideally an encoding
standard would serve both the needs of interchange and local processing (which
will be eventually ensured by the coordinated development of a standard
encoding format and specifications for tool development), for the near term it
is likely that researchers will continue to use local formats. In addition,
existing commercially available SGML software is scarce and typically
expensive, and is therefore not widely available to the research community.
Although mappability to local formats cannot be a driving criterion for
encoding design, where possible it can be taken into account.
Absolute completeness of any markup scheme is impossible to achieve. Therefore,
it is essential that any encoding scheme be extensible.
As mentioned above, there is a tension between validatability and generality.
If the goal of validatability is served, DTDs will be more restrictive, and the
need for extensibility will be even greater. Therefore, it is essential that
systematic means for extension of the scheme are developed, which will ensure
that extensions are made in a controlled and predictable way.
SGML is often criticized for its verbosity, since document size can be
dramatically increased by the addition of SGML tags. This is a particular
concern for annotated corpora, where each word (and possibly each morpheme) can
be marked for part of speech and/or other information, often increasing file
size by a factor of 10 or more. This can cause problems for various kinds of
processing (e.g., retrieval) as well as for interchange, since, given the current
state of the art, it is still often problematic to transfer large files over data networks.
However, the costs and difficulties of handling large files are being reduced every day, so compactness is not necessarily an overriding concern for CES design, but may be taken into account as a secondary criterion.
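The growth in size can be estimated with a short Python sketch. The w element,
its pos attribute, and the part of speech labels are hypothetical choices made
for illustration; real annotation usually carries more information per word,
so the factor obtained here should not be read as typical.

```python
def expansion_factor(plain, encoded):
    # Ratio of encoded size to plain size, measured in UTF-8 bytes.
    return len(encoded.encode("utf-8")) / len(plain.encode("utf-8"))

# Three words, each paired with an illustrative part of speech label.
words = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]

plain = " ".join(w for w, _ in words)
encoded = "".join(f'<w pos="{pos}">{w}</w>' for w, pos in words)

factor = expansion_factor(plain, encoded)
```

Even this minimal one-attribute markup makes the encoded string several times
longer than the plain one; adding identifiers, lemmas, or morpheme-level tags
increases the factor further.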
There are several possible means to reduce the number of characters added to a text when markup is introduced:
- tag minimization, e.g., start and end tag omission, short start and
end-tag, minimization of attribute values, etc.;
- SGML entities used in place of any string, possibly including markup;
- DATATAG feature, which allows a certain character to be interpreted as the end tag of an element;
- non-SGML notations, involving the use of private, less verbose non-
SGML schemes within tags or as attribute values.
For more discussion of these mechanisms and examples, see excerpt from MUL/EAG-CES 3: Background and context
for the development of a Corpus Encoding Standard.
Each of these methods for markup reduction has drawbacks. For example, tag minimization can cause problems for some users without sophisticated software; the use of entities results in considerable processing overhead; some features such as DATATAG are not implemented in all SGML processors; private notations require the use of special software for processing, etc.
The CES makes recommendations concerning minimization based on
- the degree to which and the circumstances under which markup
minimization is important for corpus encoding; and
- an assessment of the advantages and drawbacks of the various minimization
methods for reusability within the corpus research community.
There are two points of view concerning readability. One assumes that the text
will be captured, displayed, or in general dealt with using processing software
which could make the markup either invisible or human-readable; therefore,
readability need not be a concern. However, it can be argued that such software
is not readily available, or that no software will ever answer all the user's
needs. Therefore, there will always be a need for dealing directly with the
encoded text.
Note that readability is related to compactness in two ways, in part dependent
upon the object to be read: i.e., the original text or the text plus
markup. When minimization techniques are used to reduce or eliminate markup
in an encoded text, readability of the original text is likely to be enhanced.
Minimization may even facilitate the readability of the text plus markup in
some cases. On the other hand, when the object to be read includes text plus
markup, in many cases minimization techniques will decrease readability.
In general, readability is a secondary concern among encoding criteria, to be
aimed at only when other concerns are adequately addressed.
The CES is conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (referred to as "TEI-P3" or the "TEI Guidelines") developed by the TEI.
The CES is instantiated using the TEI.2 DTD and the TEI customization mechanisms.
At present, the CES provides three different TEI customizations, each instantiated using the TEI.2 DTD and the appropriate TEI customization files, for use with
different documents:
- documents containing a primary data encoding,
ranging from texts with gross structural markup only to
texts heavily and consistently marked for elements of relevance for language engineering;
- documents containing morphosyntactic annotation of the primary data, which is hyperlinked to that data;
- documents containing links indicating alignment between two documents.
For convenience, we also provide a version of each of these three TEI
instantiations as a stand-alone DTD,
together with a means to browse the element tree as a hypertext document.
Because the TEI Guidelines are intended to cover a wide range of applications, they offer means to encode a vast array of elements. In addition, because they are intended to be maximally flexible, they often provide several ways to encode the same phenomenon. Therefore, via the TEI customization mechanisms, the CES limits the TEI scheme in order to:
- include only the sub-set of the TEI tagset relevant for corpus-based work;
- make choices among encoding options, with an eye toward satisfying the criteria outlined in section 1.5, above.
The TEI scheme is not complete; many areas relevant to language engineering applications are not covered. In addition, there are areas the TEI is not intended to cover, such as precise specifications for many kinds of tag content. Therefore, the CES also uses the TEI customization mechanisms to specify:
- extensions to the TEI Guidelines to serve needs of language engineering.
- precise values for some attributes.
- required/recommended/optional elements to be marked.
- detailed semantics for elements relevant to language engineering.
We constrain or simplify the TEI specifications as appropriate to serve the principles outlined in section 1.5, primarily in terms of element content, which is substantially simplified in the CES.
Depending on the particular needs for encoding corpora, we constrain or extend legal and required attributes and attribute values specified by the TEI.
We adopt the TEI use of element and attribute classes, implemented using SGML parameter entities. However, these element classes are simplified, forming a shallow hierarchy with no overlaps among classes.
We do not rename TEI elements except where confusion may arise; also,
three TEI-specific names are renamed to reflect their use in the CES
(i.e., <TEI.2> becomes <cesDoc>,
<teiCorpus.2> becomes <cesCorpus> and <teiHeader> becomes <cesHeader>).
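The three renamings can be written down as a simple mapping. The Python
sketch below applies it to the tag names in a marked-up string; it is an
illustration only, assuming that tag names are written with exactly the case
shown and that the character "<" occurs only at the start of a tag. The
function name rename_tags is ours.

```python
import re

# The three TEI names renamed in the CES, as listed above.
RENAMES = {
    "TEI.2": "cesDoc",
    "teiCorpus.2": "cesCorpus",
    "teiHeader": "cesHeader",
}

# A start-tag or end-tag opener followed by an element name.
TAG_NAME = re.compile(r"(</?)([A-Za-z][\w.\-]*)")

def rename_tags(markup):
    # Replace renamed element names; other names, attributes, and
    # content are left unchanged.
    return TAG_NAME.sub(
        lambda m: m.group(1) + RENAMES.get(m.group(2), m.group(2)),
        markup)

tei = "<TEI.2><teiHeader></teiHeader><text></text></TEI.2>"
ces = rename_tags(tei)
```

Elements that keep their TEI names, such as text in this sample, pass through
untouched.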