Corpus Encoding Standard
Document CES 1. Part 2. Version 1.1. Last modified 1 April 1996.
Part 2. Recommendations common to all documents
The CES constitutes a TEI-conformant application of SGML
(ISO 8879). CES documents may be parsed using any SGML parser.
All elements in a document are delimited by the use of tags.
There are two forms of tag, a start-tag, marking the beginning of
an element, and an end-tag marking its end.
The CES uses the "reference concrete syntax" of SGML, which specifies that
tags are delimited by the characters "<" and ">" and
contain the name of the element (its gi for generic
identifier). In end tags, the gi is preceded by "/". The gi may consist
of upper and lower case letters and the digits 0-9.
The CES adopts the strategy of the TEI application of SGML by extending the
legal length of names from 8 to 32 characters. Case is not
significant in tag or attribute names. However, we recommend the use of the
following conventions, following the TEI:
- Lower case letters are used for identifiers, unless they are derived from
more than one English word, in which case the first letter of the second word
is capitalized.
- Attributes are indicated within the start-tag, and take the form of an
attribute name, an equal sign and the attribute value, which may be a number, a
string literal or a quoted literal.
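The naming rules above can be illustrated with a small Python sketch. It is not part of the CES; the function names are hypothetical, and two points are assumptions: that a generic identifier begins with a letter, and that every word after the first is capitalized (the text only mentions the second word).

```python
import re

# Assumption: a generic identifier starts with a letter. The text above only
# says that it consists of upper and lower case letters and the digits 0-9.
GI_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")
MAX_NAME_LENGTH = 32  # extended name length adopted from the TEI


def is_valid_gi(name: str) -> bool:
    """Return True if name fits the character and length rules described above."""
    return len(name) <= MAX_NAME_LENGTH and GI_PATTERN.fullmatch(name) is not None


def make_identifier(*words: str) -> str:
    """Join English words: first word in lower case, later words capitalized."""
    if not words:
        raise ValueError("at least one word is required")
    first, rest = words[0].lower(), words[1:]
    return first + "".join(w[:1].upper() + w[1:].lower() for w in rest)


def same_name(a: str, b: str) -> bool:
    """Compare two tag or attribute names; case is not significant."""
    return a.lower() == b.lower()
```

For example, make_identifier("tag", "usage") yields the spelling tagUsage used for the header element mentioned later in this part.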
For the purposes of encoding the complexity and wide range of texts treated by
the TEI, the TEI has significantly extended its metalanguage level
specification beyond what is offered by SGML. For instance, the TEI provides
additional mechanisms for
All of these extensions are adopted in the CES.
SGML permits various kinds of minimization, or abbreviatory
conventions. The TEI interchange format prohibits the use of most minimization
techniques (e.g., short references, omission of generic identifiers in start
and end tags) allowed in ISO 8879. The CES adopts the TEI prohibition
against the use of minimization techniques in general:
- every non-empty element in the distributed form of a corpus must have both
a start-tag and an end-tag.
- all attributes are supplied in the form
"attributeName=value".
- end-tags are routinely omitted on empty elements.
- exceptions to this rule are laid out for pervasive
and commonly marked elements. For such elements, end tags may be omitted and
attribute values may be given without any associated attribute name, to serve
the interests of compactness and readability. Exceptions are explicitly
described in the encoding recommendations for these elements.
A universal character set (UCS) that will cover all languages is under
development by ISO and the Unicode Consortium. The results of the work so far
on this character set have been approved as the Universal Multiple-Octet Coded
Character Set standard, ISO/IEC 10646-1. UCS will likely be the accepted
encoding standard for characters in the future.
UCS encodes each character in four bytes, thus providing a single
character set to encode all the world's languages.
However:
- controversy still exists over some details of the scheme;
- the standard is not complete;
- some languages are not yet covered;
- the standard is not yet supported in practice; only 8-bit character sets are generally supported.
Although there is little doubt that this standard will eventually
become the basis for character representation, its full specification and
implementation are far enough off that, for present purposes, a temporary
solution is needed.
For corpora intended for use in language engineering applications, much
interchange will be accomplished via CD-ROM or ftp. Ftp allows binary
interchange and can be used to safely transmit any 8-bit character set.
Moreover, data interchange is becoming increasingly reliable, due to major international
efforts towards standardization such as the Internet effort. For example,
TCP/IP and many network applications (e.g., ftp, WWW, etc.) are "8-bit clean".
In addition, recent standards have been proposed to guarantee delivery by
automatically packing and unpacking data as required:
Even when these standards are not yet implemented, files can be safely
transferred by using universally available encoding programs such as
'uuencode'.
Therefore, we recommend that all data be distributed using the recommendations
below for character sets. In the case of blind interchange, data should be
encoded using 'uuencode'.
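As an illustration of what uuencoding does to 8-bit data, the following Python sketch wraps the standard library's binascii.b2a_uu and binascii.a2b_uu calls. The function names, the default file mode and the sample file name are hypothetical; in practice the 'uuencode' program itself would normally be used.

```python
import binascii

UU_CHUNK = 45  # binascii.b2a_uu accepts at most 45 bytes per output line


def uuencode_bytes(data: bytes, name: str, mode: str = "644") -> bytes:
    """Wrap 8-bit data in a 7-bit-safe uuencoded envelope."""
    header = f"begin {mode} {name}\n".encode("ascii")
    body = b"".join(
        binascii.b2a_uu(data[i:i + UU_CHUNK]) for i in range(0, len(data), UU_CHUNK)
    )
    # A zero-length line marks the end of the data, followed by "end".
    return header + body + binascii.b2a_uu(b"") + b"end\n"


def uudecode_bytes(encoded: bytes) -> bytes:
    """Recover the original bytes from the output of uuencode_bytes."""
    lines = encoded.splitlines()
    if not lines or not lines[0].startswith(b"begin "):
        raise ValueError("missing 'begin' line")
    out = bytearray()
    for line in lines[1:]:
        if line == b"end":
            return bytes(out)
        out += binascii.a2b_uu(line)
    raise ValueError("missing 'end' line")
```

A text encoded in ISO 8859-1 can be passed through uuencode_bytes before blind interchange and restored byte for byte with uudecode_bytes at the receiving end.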
Our recommendation has the merit of being reasonably compatible with UCS, thus
facilitating future migration to that standard.
The CES recommendations have been adopted by the EAGLES Tool subgroup for its Guidelines for Linguistic Software Development--see especially Part 1-1: Characters.
The CES recommends the use of the ISO 8859-X series for all the
following scripts: Arabic, Cyrillic, Greek, Hebrew, Latin.
The following is a rough list of the languages accommodated in the ISO 8859
series. See also the graphic representation of the code tables.
- ISO-8859-1 - Latin 1
  Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch,
  English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish,
  Italian, Norwegian, Portuguese, Spanish and Swedish.
- ISO-8859-2 - Latin 2
  Latin-written Slavic and Central European languages: Czech, German,
  Hungarian, Polish, Romanian, Croatian, Slovak, Slovene.
- ISO-8859-3 - Latin 3
  Esperanto, Galician, Maltese, and Turkish.
- ISO-8859-4 - Latin 4
  Scandinavia/Baltic (mostly covered by 8859-1 also): Estonian, Latvian, and
  Lithuanian. It is an incomplete predecessor of Latin 6.
- ISO-8859-5 - Cyrillic
  Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian.
- ISO-8859-6 - Arabic
  Non-accented Arabic.
- ISO-8859-7 - Modern Greek
  Greek.
- ISO-8859-8 - Hebrew
  Non-accented Hebrew.
- ISO-8859-9 - Latin 5
  Same as 8859-1 except for Turkish instead of Icelandic.
- ISO-8859-10 - Latin 6
  For Lappish/Nordic/Eskimo languages: adds the last Inuit (Greenlandic) and
  Sami (Lappish) letters that were missing in Latin 4 to cover the entire
  Nordic area.
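One way to check which parts of the series can represent a given text is to try each part in turn. The Python sketch below does this with the codecs shipped in the standard library; the function name is hypothetical, and the library's tables are assumed to be close enough to the code tables referred to here (they may reflect later revisions of the standards).

```python
# Parts 1 to 10 of the ISO 8859 series, as listed above.
ISO_8859_PARTS = [f"ISO-8859-{n}" for n in range(1, 11)]


def parts_covering(text: str) -> list[str]:
    """Return the ISO 8859 parts able to encode every character of text."""
    covering = []
    for part in ISO_8859_PARTS:
        try:
            text.encode(part)  # Python accepts the part name as a codec name
        except UnicodeEncodeError:
            continue  # at least one character is missing from this part
        covering.append(part)
    return covering
```

A Polish word containing "Ł" is reported for ISO-8859-2 but not for ISO-8859-1, and a text containing a character absent from all ten parts yields an empty list.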
A list of characters used by a large number of languages is provided in
"Characters and character sets for various languages" (Alvestrand, 1995).
See also "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995).
Shortcomings of the ISO 8859 series
The ISO 8859 series lacks the ligatures Dutch ij, French oe and
,,German`` quotation marks, as well as several other characters.
There are also Bulgarian and Ukrainian characters missing from ISO 8859-5.
[THIS SECTION IS UNDER DEVELOPMENT]
The recommendations above do not provide for Asian languages,
including Chinese, Japanese, and Korean. Independent standards have been
developed for these languages. The CES specifications for these cases are under development.
If it is necessary to encode a text in a language
not covered by the ISO 8859-X series, it is required to use
- an ISO standard character set, if one exists; or
- a Writing System Declaration (see TEI P3, chapter 25) documenting the use
of any non-ISO character set.
It is also required that the character set
used is fully documented in the header providing the encoding description for
the corpus; see the description of <wsdUsage>.
Note that the TEI provides several pre-defined Writing System Declarations, including:
- The official languages of the European Community, using the character set ISO 8859-1;
- Hebrew (using ISO 8859-8);
- Russian (using ISO 8859-5).
Characters not available in the character set that has been selected for the
document as a whole must be represented by entity references,
which take the form of an ampersand (&) followed by a mnemonic for
the character, and terminated by a semicolon (;) where this is necessary
to resolve ambiguity. All entities used in a document must be declared in the
DTD.
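The substitution described above can be sketched in Python as follows. The mapping is a small illustrative subset using names from the ISO entity sets (oelig, OElig and ijlig from ISOlat2, mdash from ISOpub, amp and lt from ISOnum); the function name and the choice of ISO 8859-1 as the default base character set are assumptions made for the example.

```python
# Illustrative subset only; a real corpus would need the full entity sets.
ENTITY_NAMES = {
    "\u0153": "oelig",  # small oe ligature (ISOlat2)
    "\u0152": "OElig",  # capital OE ligature (ISOlat2)
    "\u0133": "ijlig",  # small ij ligature (ISOlat2)
    "\u2014": "mdash",  # em dash (ISOpub)
}
# Characters that belong to the SGML markup syntax (ISOnum).
MARKUP_NAMES = {"&": "amp", "<": "lt"}


def encode_with_entities(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text in charset, using entity references for everything else."""
    out = bytearray()
    for ch in text:
        if ch in MARKUP_NAMES:
            out += f"&{MARKUP_NAMES[ch]};".encode("ascii")
            continue
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            name = ENTITY_NAMES.get(ch)
            if name is None:
                raise ValueError(f"no entity name known for {ch!r}") from None
            # The semicolon is always written; it is required only where
            # needed to resolve ambiguity, so this is the conservative choice.
            out += f"&{name};".encode("ascii")
    return bytes(out)
```

With ISO 8859-1 as the base character set, the French word for "heart" comes out as c&oelig;ur, while characters available in the base set are left untouched.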
We recommend the use of ISO entities. Standard public entity names can be
declared by a reference to a standard public entity, e.g.,
<!ENTITY % ISOLat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
%ISOLat1;
<!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN">
%ISOLat2;
<!ENTITY % ISOGrk1 PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN">
%ISOGrk1;
<!ENTITY % ISOGrk2 PUBLIC "ISO 8879-1986//ENTITIES Monotoniko Greek//EN">
%ISOGrk2;
<!ENTITY % ISOCyr1 PUBLIC "ISO 8879-1986//ENTITIES Russian Cyrillic//EN">
%ISOCyr1;
<!ENTITY % ISOCyr2 PUBLIC "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN">
%ISOCyr2;
etc.
Many of the characters that commonly need to be represented are included in the ISO entity sets ISOpub and ISOnum. These sets include, for example, the entities "&amp;" and "&lt;" for the special characters "&" and "<", which are part of the
SGML markup syntax and cannot be entered literally in an SGML document.
They also contain entities such as "&mdash;" (for the dash the width of an "m"), "&pound;" (for British sterling), etc.
The ISOpub and ISOnum entity sets are declared as follows:
<!ENTITY % ISOPUB PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN">
%ISOPUB;
<!ENTITY % ISONUM PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN">
%ISONUM;
Note that these entity sets are declared in all the CES DTDs.
If no standard entity name exists or a standard entity is to be renamed, normal
SGML syntax can be used to declare an appropriate entity, as follows:
<!ENTITY foo '[unprintable]'> <!-- weird character -->
Declarations of entities and entity sets not already included in the DTD for the document are added at the top of the encoded document, as in this example:
<!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN" [
<!ENTITY igcy "i`" --=small i grave, Cyrillic-- >
<!ENTITY Igcy "I`" --=capital I grave, Cyrillic-- >
<!ENTITY % ISOcyr1 PUBLIC
"ISO 8879-1986//ENTITIES Russian Cyrillic//EN" >
%ISOcyr1;
<!ENTITY % ISOcyr2 PUBLIC
"ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN" >
%ISOcyr2;
]>
<cesDoc version="3.9">...
Notes:
- SGML entities should not be used to replace characters of the base
character set (this applies to local character sets only, not blind
interchange).
- Transliteration should not be used to replace appropriate character
sets, either for local processing or interchange.
When different character sets are mixed in a single document, the following
alternative methods can be used (possibly in conjunction):
- Explicitly:
  - A wsd attribute can be used on any tag to indicate that the tag's
    content is encoded in the specified character set. The value of the
    attribute is the character set name (ISO-8859-1, etc.). WSD stands for
    "writing system declaration", a term borrowed from the TEI.
  - A lang attribute can be used on any tag to indicate that the tag's
    content is in the specified language. This method assumes a mapping between
    languages and character sets. The value of the attribute is composed of one
    of the following (compatible with the "HyperText Markup Language
    Specification Version 3.0"):
    - a two-letter code from ISO 639 (e.g., "en" for English);
    - a three-letter code from ISO 639-2 (e.g., "eng" for English);
    - one of the above extended by a country code from ISO 3166 (e.g., "en.uk"
      or "eng.uk" for English as spoken in the United Kingdom).
- Implicitly:
  - All instances of a given element can be associated with a particular
    character set, using the wsd attribute on the <tagUsage> element in the
    header.
  - All instances of a given element can be associated with a particular
    language (using the lang attribute on <tagUsage>), which is in turn
    associated with a particular character set (using the wsd attribute on the
    corresponding <language> element in the header).
These implicit methods are useful when there is a systematic
mapping between tags and character sets (e.g., a list of words in one character
set, with their translations in another).
The CES provides global lang and wsd attributes, as
well as appropriate mechanisms to document correspondences between languages or
tags and particular character sets in the CES header.
Note that the language tagging mechanism will still be valid with UCS. "Unicode
characters do not specify the language of the text they represent; that is,
they are completely language neutral. If the language of a character or
character string must be known to accomplish a particular type of process (e.g.
language sensitive collation), then a higher-level protocol must be used to
specify the language." [from Unicode's "Basic Principles"].
[THIS SECTION IS UNDER DEVELOPMENT]
The TEI provides a pre-defined Writing System Declaration (WSD) for transcribing the International Phonetic Alphabet. This is distributed by the TEI both as an SGML entity set and as a TEI Writing System Declaration documenting the entity set:
-//TEI P3: 1994//ENTITIES International Phonetic Alphabet//EN
The CES recommends using the SGML entities and providing the TEI WSD (with reference to it in the <wsdUsage> element in the header) when the IPA system is used in a document.