[Mirrored from: Part 3. The Header]
Corpus Encoding Standard
Document CES 1. Part 3. Version 1.4. Last modified 6 April 1996.
The header provides information about the electronic text that has been encoded, including not only its title, author, etc., but also information about its encoding. The TEI header provided the first means of documenting electronic texts and has been widely adopted and adapted for use in text and corpus encoding.
The CES adopts the use of the TEI header, customizing it using the TEI customization mechanisms to suit the specific needs of corpus-based research. In the CES (as in the TEI) headers are provided for the SGML document containing the entire corpus, as well as for each individual text within a corpus.
The CES header is, for the most part, a subset of the TEI header (see TEI P3, chapter 5, "The Header",
and chapter 23, "Language Corpora"). There are the following exceptions:
- elements have been added for more precision in the specifications;
- attributes have been added to existing elements;
- attribute values have been constrained to allow only a given set of values;
- element content models are simplified, to contain either a sequence of tags in sub-categories, or plain text (PCDATA).
The minimal requirements of the CES header are the same as those for the TEI header.
The CES header needs attention to determine exactly which elements and information are appropriate for corpora. We intend to develop a more constrained model with a precise template, to facilitate and regularize the creation of corpus and text headers.
Three global attributes are defined, which may appear on any element in the header:
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily unique within the corpus.
- lang
- indicates that the tag's content is in the specified language. The value of the lang attribute is composed of one of the following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
Note that the values for the lang attribute are compatible with the HyperText Markup Language Specification Version 3.0.
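As a minimal sketch of the three value forms listed above, the fragment below places lang on a header element. The titles are invented placeholders, and <h.title> is chosen only as one header element that can carry the global attributes.

```xml
<!-- Hypothetical content; only the lang values illustrate the forms above. -->
<h.title lang="en">A sample text</h.title>     <!-- two-letter ISO 639 code -->
<h.title lang="eng">A sample text</h.title>    <!-- three-letter ISO 639-2 code -->
<h.title lang="en.uk">A sample text</h.title>  <!-- language code extended by an ISO 3166 country code -->
```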
The global attributes for elements in the header are defined at the top of header.elt and represented by an entity, A.HEADER. This entity is used to represent the list of global attributes in the attribute declarations for most elements in the header of all CES DTDs.
Each text in the corpus (i.e. each <cesDoc> element) has its own
header, referred to as a text header.
The whole corpus also has
a header, referred to as the corpus header, which contains
information applicable to the whole corpus (possibly with some local
overriding). Both corpus and text headers are represented by
<cesHeader> elements. The type attribute is used to
distinguish the two.
The root of the CES header element tree is the <cesHeader> element, defined as follows:
- <cesHeader>
- contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.
- type
- specifies the kind of document to which the header is attached.
- CORPUS the header is attached to the corpus.
- TEXT* the header is attached to a single text.
- creator
- specifies the agency responsible for creating the header.
- version
- specifies the version and revision of the CES
header.elt used to encode this header. This number is found near the top of
the header.elt itself.
- status
- specifies the revision status of the header.
- NEW* this is the first version of the header
- UPDATE header has been updated.
- date.created
- specifies the date on which the header content was created.
- date.updated
- specifies the date on which the header content was last updated.
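The attributes above might be combined on a text header as in the following sketch. All values are illustrative assumptions; in particular, the source does not prescribe a format for the two date attributes, so the ISO 8601 style used here is only one plausible choice.

```xml
<!-- Hypothetical values throughout. -->
<cesHeader type="TEXT"
           creator="Example Corpus Project"
           version="2.0"
           status="NEW"
           date.created="1996-04-06">
  <!-- fileDesc and the other header components go here -->
</cesHeader>
```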
The <cesHeader> element contains the following four elements:
- <fileDesc>
- contains a full bibliographic description of the corpus itself or of a
text within it.
- <encodingDesc>
- documents the relationship between an electronic text and the source
or sources from which it was derived.
- <profileDesc>
- provides further information about various aspects of a text,
specifically the language used, the situation and date of its production, the
participants and their setting, and a descriptive classification for it.
- <revisionDesc>
- summarizes the revision history for a file.
These elements are tagged as follows:
<cesHeader>
<fileDesc></fileDesc>
<encodingDesc></encodingDesc>
<profileDesc></profileDesc>
<revisionDesc></revisionDesc>
</cesHeader>
The file description, represented by the <fileDesc> element, is the first of the four main constituents of the header and the only one that is required.
The file description documents the
electronic file itself, i.e. (in the case of a corpus header) the whole corpus,
or (in the case of a text header) the individual text to which the header applies.
It contains the following elements:
- <titleStmt>
- groups information concerning the title of the corpus or the individual text and its
constituent texts.
- <editionStmt>
- contains any additional information relating to a particular version
of a text.
- <extent>
- provides the size of the electronic text as stored on
some carrier medium.
- <publicationStmt>
- groups information concerning the publication or distribution of the
corpus and its constituent texts.
- <sourceDesc>
- supplies a bibliographic description of the copy text(s) from which
an electronic text was derived or generated.
Further detail is given in the following subsections. Note that the elements other than <sourceDesc> relate only to the electronic file (the
corpus text itself); bibliographic and other details of the written or
spoken text from which it derives are given in the source description.
Note that the <titleStmt> describes the machine-readable file,
while the source text is specified in the <sourceDesc>. The title
in the <titleStmt> should indicate that this is a machine-readable
version and should not be identical to the title of the source text.
<titleStmt>, <publicationStmt>, and <sourceDesc> are required.
The minimal header has the following structure:
<cesHeader version="2.0">
<fileDesc>
<titleStmt>
<h.title></h.title>
</titleStmt>
<publicationStmt>
<distributor></distributor>
<pubAddress></pubAddress>
<availability></availability>
<pubDate></pubDate>
</publicationStmt>
<sourceDesc>
<biblStruct>
<monogr>
<h.title></h.title>
<h.author></h.author>
<imprint>
<pubPlace></pubPlace>
<publisher></publisher>
<pubDate></pubDate>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
</cesHeader>
Note that if the lang or wsd attributes are used on elements in the main text, the header must include a <profileDesc> element containing
<langUsage> (for use of lang) and/or <wsdUsage> (for use of wsd).
The <titleStmt> element consists of an <h.title> element followed by zero or
more <respStmt> elements. These sub-elements are used throughout
the header, wherever the title of a work or a statement of responsibility is
required.
- <h.title>
- the title of the electronic file, including alternative titles or
subtitles.
- <respStmt>
- supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
<respStmt> in turn contains the following elements:
- <respType>
- contains a phrase describing the nature of a person's or institution's
intellectual responsibility.
- <respName>
- the person or institution bearing the responsibility (for example, the publisher of the corpus or text), expressed as a proper name.
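A title statement built from these elements might look like the sketch below. The title and names are invented, and the order of <respType> and <respName> inside <respStmt> is an assumption rather than something the text above fixes.

```xml
<!-- Hypothetical title and names. -->
<titleStmt>
  <h.title>A Sample Novel: machine-readable version</h.title>
  <respStmt>
    <respType>Electronic version created by</respType>
    <respName>Example Corpus Project</respName>
  </respStmt>
</titleStmt>
```

Note that the title marks the file as a machine-readable version, as recommended above for distinguishing it from the title of the source text.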
In the corpus header, the version attribute on the <editionStmt> element indicates both a version number and a revision number, in
the form "version.revision", where "version" changes if texts are added to or
removed from the corpus, and "revision" changes if amendments are made within
texts or the corpus header.
In individual text headers, the version attribute carries only a revision number.
The <editionStmt> tag can be empty. For example:
<editionStmt version='1'>
This element corresponds to the TEI <editionStmt>, except that its
content is an unstructured note.
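The two uses of the version attribute can be contrasted in a short sketch. The numbers are hypothetical, and the note content is an invented example of the unstructured note the element may hold.

```xml
<!-- Corpus header: version 2 of the corpus, revision 3 (hypothetical numbers). -->
<editionStmt version="2.3">Second release of the corpus.</editionStmt>

<!-- Text header: a revision number only. -->
<editionStmt version="3"></editionStmt>
```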
This element corresponds to the TEI <extent> element in that it
describes the number of words in the whole corpus or in an individual text. It
differs in that it contains specific tags for specifying the size of the text
or corpus in terms of words and bytes.
- <extent>
- describes the approximate size of the electronic text as stored on
some carrier medium, specified in words (corpus header) and additionally in Kb
(corpus texts).
The <extent> tag contains:
- <wordCount>
- contains the count of words in the text.
- <byteCount>
- contains the count of bytes in the file containing the text together with its markup.
- units
- gives the unit in which the bytecount is measured.
- BYTES bytes
- KB* kilobytes
- MB megabytes
- GB gigabytes
- <extNote>
- a descriptive note supplying additional information of any kind
relating to the extent information provided within a corpus or text header.
For the purposes of the word count value, a "word" is considered to be an orthographic word--i.e., a string of characters surrounded by blanks. Punctuation not surrounded by white space is not considered as a word. This criterion is used as a default since this sort of count can be achieved fairly simply by automatic means. If any other definition is used it should be documented in the optional <extNote> tag; e.g.,
<extNote>Punctuation marks counted separately in the wordcount.</extNote>
The <byteCount> tag gives the size of the text including its tags, in its representation as a text file encoded in an 8-bit ISO
character set. This figure is useful for calculating media requirements or file
download times.
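Put together, an extent statement for a single text might read as follows. The counts are invented, and the order of the sub-elements simply follows the order in which they are listed above.

```xml
<!-- Hypothetical counts for one text. -->
<extent>
  <wordCount>12500</wordCount>
  <byteCount units="KB">96</byteCount>
  <extNote>Punctuation marks counted separately in the wordcount.</extNote>
</extent>
```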
This corresponds to the TEI <publicationStmt> but has a narrower
focus, since it relates only to the public availability of the electronic
text.
It contains the following sub-elements:
- <distributor>
- gives the name of the person or institution who distributes the text or corpus.
- <pubAddress>
- contains a postal address of the
distributor.
- <telephone>
- gives the telephone number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.
- <fax>
- gives the fax number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.
- <eAddress>
- gives an electronic address of the person or institution who distributes the text or corpus. Note that more than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included.
Attribute:
- type
- gives the type of the electronic address (email address, web site, ftp site, etc.). Suggested values include:
- EMAIL* the value is an electronic mail address.
- WWW the value is a web site address.
- FTP the value is an ftp address.
- <idno>
- supplies a standard (e.g., ISBN) number used to identify a bibliographic item.
- type
- a name or abbreviation (e.g., ISBN) identifying what type of identifying number is given. Unless provided explicitly the default value is:
- ISBN* the value is an ISBN number.
- <availability>
- supplies information about the availability of a text, for example,
any restrictions on its use or distribution, its copyright status, etc.
- region
- specifies the territories within which rights in the
electronic text apply. Suggested values include:
- WORLD* rights apply worldwide.
- EU European Union only
- status
- supplies a code identifying the current availability of the
text. Values are:
- RESTRICTED the text is not freely available.
- UNKNOWN* the status of the text is unknown.
- FREE the text is freely available.
- <pubDate>
- the publication date, expressed in any format.
- value
- specifies a standard value for this date in ISO 8601 (Representation of dates and times) format.
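The following sketch shows how these sub-elements and attributes might be filled in. Every name, address and number is a placeholder, and the element order follows the listing above rather than a verified content model.

```xml
<!-- All names, addresses and numbers are invented placeholders. -->
<publicationStmt>
  <distributor>Example Corpus Project</distributor>
  <pubAddress>1 Example Street, Exampletown</pubAddress>
  <telephone>+00 000 000 0000</telephone>
  <eAddress type="EMAIL">corpus@example.org</eAddress>
  <eAddress type="WWW">http://www.example.org/corpus/</eAddress>
  <availability region="WORLD" status="FREE">Freely available for research use.</availability>
  <pubDate value="1996-04-06">6 April 1996</pubDate>
</publicationStmt>
```

The value attribute on <pubDate> carries the normalized ISO 8601 date, while the element content keeps the date in whatever form the distributor prefers.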
This element corresponds to the TEI <sourceDesc>, except that its
content is constrained to include only the following possible sub-elements:
- <biblStruct>
- contains a structured bibliographic citation, in which only
bibliographic sub-elements appear and in a specified order.
- <biblFull>
- contains a bibliographic citation for a text which has been previously encoded in electronic form. This element contains the same elements as the
<fileDesc> element, and is intended to include the header of the electronic text from which the current document is derived.
The headers of individual texts each contain at least one of the above elements to
specify their source. When a particular text contains items derived from more
than one bibliographic source or recording, all relevant sources for which
information is available are listed in the text header, and individual
<div> elements are associated with the correct citation or recording by means of the decls
attribute.
If an electronic text has been derived from a previous electronic version of the text, then the source description will contain a <biblFull> element. If this version had itself been derived from another electronic version, then this <biblFull> element could contain yet another <biblFull> element, and so on for as many recursive levels as required. If the electronic text described in any <biblFull> element is derived from a
print source, it contains a <biblStruct> element describing that source.
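The nesting described above can be sketched for a text derived from one earlier electronic version, which was itself keyed from a print edition. All titles, names and dates are hypothetical.

```xml
<sourceDesc>
  <!-- The earlier electronic version from which the current text derives. -->
  <biblFull>
    <titleStmt>
      <h.title>A Sample Novel: earlier electronic edition</h.title>
    </titleStmt>
    <publicationStmt>
      <distributor>Example Text Archive</distributor>
      <pubAddress>1 Example Street, Exampletown</pubAddress>
      <availability status="FREE">Freely available.</availability>
      <pubDate>1994</pubDate>
    </publicationStmt>
    <sourceDesc>
      <!-- The print source of that earlier electronic version. -->
      <biblStruct>
        <monogr>
          <h.title>A Sample Novel</h.title>
          <h.author>Author, Anne</h.author>
          <imprint>
            <pubPlace>Exampletown</pubPlace>
            <publisher>Example Press</publisher>
            <pubDate>1950</pubDate>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </biblFull>
</sourceDesc>
```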
The <biblStruct> element
The <biblStruct> element has the following component sub-elements:
- <analytic>
- contains bibliographic elements describing an item (e.g. an article or
poem) published within a monograph, journal, or periodical and not as an
independent publication.
- <monogr>
- contains bibliographic elements describing an item (e.g. a book or
journal) published as an independent item (i.e. as a separate physical
object).
At least one <monogr> element must be present in a
<biblStruct> element. It may contain the following elements:
- <h.title>
- the title of a work.
- <h.author>
- in a bibliographic reference, contains the name of an author
(personal or corporate) of a work; names should be given in a canonical form,
with surnames preceding forenames.
- <respStmt>
- supplies information about any person or institution responsible for
the intellectual content of a text, edition, or electronic transcription.
- <edition>
- provides bibliographic details for an edition of some text.
- <imprint>
- groups information relating to the publication or distribution of a
bibliographic item.
- <biblScope>
- defines the scope of a bibliographic reference, for example as a list
of page numbers, or a named subdivision of a larger work.
- type
- identifies the type of information conveyed by the element.
- PP the element contains a page number or page range.
- VOL the element contains a volume number.
- ISSUE the element contains an issue number, or volume and issue numbers.
- <biblNote>
- a descriptive note supplying additional information of any kind
relating to a bibliographic item described within a corpus or text header.
Published texts must contain at least one <imprint> element,
which can contain the following elements:
- <publisher>
- proper name of a person, place or institution.
- type
- categorises the name. Legal values are:
- PERSON name of a person
- PLACE name of a place
- ORG name of an organization
- <pubDate>
- a calendar date in any format.
- value
- specifies standard value for this date in ISO 8601 format
- <pubPlace>
- place of publication for a book, article, etc.
The <analytic> element is used when multiple monographic records are
grouped together into single items. When the item described by a bibliographic
citation forms a part of some other bibliographic item (as, for example, a
newspaper article within a newspaper, or a journal article within a
collection), a monographic description should be given for the newspaper or
collection, prefixed by an analytic description for the individual component,
enclosed within an <analytic> element. The <analytic> element contains a mixture of
the elements <h.author>, <respStmt>, and
<h.title> in any order and repeated as necessary.
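For a newspaper article, the analytic and monographic levels might be combined as in this sketch. The bibliographic details are invented, and the order of elements within <monogr> follows the listing above rather than a verified content model.

```xml
<!-- Hypothetical article in a hypothetical newspaper. -->
<biblStruct>
  <analytic>
    <h.title>A Sample Article</h.title>
    <h.author>Author, Anne</h.author>
  </analytic>
  <monogr>
    <h.title>The Example Gazette</h.title>
    <imprint>
      <pubPlace>Exampletown</pubPlace>
      <publisher type="ORG">Example Press</publisher>
      <pubDate value="1995-03-14">14 March 1995</pubDate>
    </imprint>
    <biblScope type="PP">12-13</biblScope>
  </monogr>
</biblStruct>
```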
The second major component of the header, the encoding description, contains
information about the relationship between an encoded text and its original
source and describes the editorial and other principles employed throughout the
corpus.
The <encodingDesc> element has the following six components:
- <projectDesc>
- describes in detail the purpose for which an electronic file
was encoded.
- <samplingDecl>
- contains a prose description of the rationale and methods used in
sampling texts in the creation of the corpus.
- <editorialDecl>
- provides details of editorial principles and practices applied
during the encoding of a text.
- <tagsDecl>
- provides detailed information about the tagging applied to an SGML
document.
- <refsDecl>
- specifies how canonical references are constructed for this text.
- <classDecl>
- contains a series of <category> elements, defining the
classification codes used for texts within the corpus.
The <projectDesc> element provides information about the project for and by which the text or corpus was created, together with any other relevant information concerning the
process by which it was assembled or collected. The content of this element is an unstructured note. Example:
<projectDesc>
The MULTEXT project is assembling a corpus consisting of
mono-lingual texts in seven Eastern and Western European
languages, together with parallel translations in each of
these languages. The original texts were acquired in various
forms and marked up for conformance with the MULTEXT/EAGLES
Corpus Encoding Standard, to test and validate that scheme.
MULTEXT has also developed a suite of annotation tools which
have been tested on the texts in the corpus.
</projectDesc>
A minimal encoding description can contain only the <projectDesc> element. In this case, a prose description of the encoding methods can be provided. If documentation of encoding principles exists in another location (a manual, etc. in printed form, at a given URL, in an ftp site, etc.) this information should be provided.
If no <conformance> element is provided in an <editorialDecl> element within the encoding description, the CES conformance level must be provided here.
The <samplingDecl> element is also an unstructured note, which contains information about the methods used for text sampling in the corpus. It is relevant only in the corpus header.
It provides details about the systematic inclusion or exclusion of portions of texts, the rationale, and the means by which this is noted in the encoding, if any. For example (adapted from the English-Norwegian Parallel Corpus Project
manual):
<samplingDecl>
The texts of the core corpus are mostly extracts from books.
The extracts are between 10,000 and 15,000 words long (30 - 40
pages), and are taken from the beginning of the texts. The front
matter, prefaces, forewords, list of contents, etc., are not
included in the extracts. In some cases, introductions have been
left out as well, e.g. introductions by scholars to works of
fiction.
Omission of passages in the text may be marked by an
<omit> tag.
</samplingDecl>
The <editorialDecl> element contains the following elements, each
specifying a particular kind of editorial practice used for some portion of the
corpus.
Where the same principles apply across the whole corpus (e.g., for the
<segmentation> element), they can be documented only once within the
corpus header.
Where different parts of the corpus apply different practices (as for example
with the <quotation> or <hyphenation> elements), all
possible practices can be defined in the corpus header, and particular parts of the corpus can specify the editorial practices applicable to
them by using the decls attribute. When this method is used, if a
practice is not explicitly associated with a part of the corpus in this way, it
is assumed not to apply to it.
- <conformance>
- provides the CES level of conformance for the text or corpus.
- level
- gives the level of CES conformance (legal values are 1, 2, or 3).
- <transduction>
- describes the principles according to which the text has been
transduced, either in transcribing it from audio tape to written form, or in
converting from an electronic original.
- <correction>
- specifies a set of correction practices applied in creating one or more
components of the corpus.
- <quotation>
- specifies editorial practice adopted with respect to quotation marks
in the original.
- marks
- indicates whether or not quotation marks are retained as tag
content in the text.
- NONE no quotation marks have been retained
- SOME some quotation marks have been retained
- ALL* all quotation marks have been retained
- form
- specifies how quotation marks are indicated within the
text.
- STD use of quotation marks has been standardized; open and close quote
marks are distinct.
- NONSTD open and close quote marks are represented indiscriminately by the
- UNKNOWN* use of quotation marks is unknown.
- <hyphenation>
- summarizes the way in which end-of-line hyphenation in a source text
has been treated in an encoded version of it.
- <segmentation>
- describes the principles according to which the text has been
segmented, for example into sentences, tone-units, graphemic strata, etc.
- <normalization>
- specifies a set of normalization practices applied in creating one or more
components of the corpus.
- method
- indicates whether normalization was made without notation or marked
by including editorial tags.
- TAGS normalization indicated with tags
- SILENT* normalization made silently
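The sketch below illustrates how alternative practices might be declared once in the corpus header and then selected from a text through the decls attribute. The id values and the prose inside each element are assumptions made for illustration; the source does not show the content of these elements.

```xml
<!-- Corpus header: two alternative quotation practices, each with its own id. -->
<editorialDecl>
  <conformance level="1">CES Level 1</conformance>
  <quotation id="quot.all" marks="ALL" form="STD">All quotation marks are retained; opening and closing marks are distinct.</quotation>
  <quotation id="quot.none" marks="NONE">Quotation marks have been removed.</quotation>
  <normalization method="TAGS">Normalized forms are marked with editorial tags.</normalization>
</editorialDecl>

<!-- In a text: this division selects the first practice. -->
<div decls="quot.all">
  <p>Sample paragraph.</p>
</div>
```

Under the rule stated above, a division that names neither declaration would be taken to follow neither practice.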
The <tagsDecl> element is used differently in corpus and in text headers. In the corpus
header, it lists all the element names actually used within the
corpus, each with a brief description of its function. In text headers, the
same element specifies the number of occurrences of each SGML element actually tagged
within the text. In both cases it consists of a number of
<tagUsage> elements, defined as follows:
- <tagUsage>
- supplies information about the usage of a specific element within the
corpus or text with which this header is associated.
- gi
- the name (generic identifier) of the element indicated by the
tag.
- occurs
- specifies the number of occurrences of this element within
the text.
- wsd
- can be used on a <tagUsage> element to indicate that for
every appearance of the described element in the text, the content
defaults to the specified character set. Therefore the declaration
<tagUsage gi=term occurs=5 wsd="ISO 8859-5">
indicates that the content of all <term> elements is in the ISO 8859-5 character set.
Note that the global attribute lang can similarly be used in a
<tagUsage> element to indicate that for
every appearance of the described element in the text, the content
defaults to the specified language.
In the corpus header, each <tagUsage> element
contains a brief description of the element specified by its gi
attribute; the occurs attribute is not supplied. In text
headers, the <tagUsage> elements may be empty, but the
occurs attribute is always supplied.
A typical written text has a tag declaration like the following:
<tagsDecl>
<tagUsage gi=name occurs=256>
<tagUsage gi=div occurs=7>
<tagUsage gi=head occurs=7>
<tagUsage gi=p occurs=705>
<tagUsage gi=reg occurs=2>
<tagUsage gi=sic occurs=1>
<tagUsage gi=body occurs=1>
</tagsDecl>
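By contrast, a corpus header omits occurs and supplies a description for each element. The following is a sketch with invented descriptions, written with quoted attribute values and explicit end tags.

```xml
<!-- Corpus header variant: descriptions instead of counts (hypothetical wording). -->
<tagsDecl>
  <tagUsage gi="div">Marks a major division of the text.</tagUsage>
  <tagUsage gi="head">Marks the heading of a division.</tagUsage>
  <tagUsage gi="p">Marks a paragraph.</tagUsage>
  <tagUsage gi="term" wsd="ISO 8859-5">Marks a technical term; content defaults to the ISO 8859-5 character set.</tagUsage>
</tagsDecl>
```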
A Perl script that automatically generates <tagUsage> elements
with appropriate values for the tags in any SGML text is available at
<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>
The <refsDecl> element is useful for encoding corpora, since it provides information about the references that are often used in the alignment of parallel texts. In particular, it is common to use ID values on tags marking paragraphs and sentences as references in links associating two parallel texts. See, for example, the English-Norwegian Parallel Corpus Project and the Lingua Parallel Concordancing Project.
<refsDecl>
A reference system is built up using the identifiers of the
following text units: text, division, paragraph, s-unit.
Each nested division has an identifier which is built up by
successively adding to the identifier of the text. Each
paragraph has an identifier which adds yet another layer to the
immediately superordinate identifier. S-units are numbered
within the nearest division, as shown above. After alignment,
each s-unit in the core corpus has a "corresp"
attribute containing a reference to the corresponding unit(s) in
the parallel text.
</refsDecl>
The following scheme outlines means to define a set of text categories for
classifying texts in the corpus. A standardized set of text categories is under
development by the EAGLES Corpus Working Group on Text Typology, which may eventually eliminate the need to explicitly provide a descriptive taxonomy in
the corpus header.
The standard text categories and means to use them to classify texts in the
corpus will be specified in the final CES recommendations. The following can be
used to extend that taxonomy where necessary.
The <classDecl> element contains the descriptive taxonomy used to
classify texts within the corpus. It occurs once, in the corpus header, and
consists of one or more <taxonomy> elements.
The <taxonomy> element in turn contains
a set of <category> elements, each representing a
particular textual classification feature and a value for that feature.
- <taxonomy>
- defines a typology used to classify texts.
- <category>
- contains an individual descriptive category or feature-value pair.
The global id attribute is required for the <category>
element, since it is used to associate a <catRef> within a text
header with the descriptive category appropriate to it. The category element
contains a set of <catDesc> elements:
- <catDesc>
- describes a category within the text typology, in the form of a brief
prose description.
The <catDesc> element is used to contain
the value for a feature within a <category>, unless that category
is further subdivided, in which case a nested <category> element
may be used.
Within the <textClass> element of the header for each text, a
<catRef> element is provided, the target attribute of which
lists the identifiers of all <category> elements applicable to
that text.
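A small taxonomy and a reference to it might be written as in the sketch below. The category identifiers and descriptions are invented, and <catRef> is assumed to be an empty element; in SGML its tag would be written without the trailing slash.

```xml
<!-- Corpus header: a hypothetical two-category taxonomy. -->
<classDecl>
  <taxonomy>
    <category id="fict">
      <catDesc>Fiction</catDesc>
    </category>
    <category id="news">
      <catDesc>Newspaper text</catDesc>
    </category>
  </taxonomy>
</classDecl>

<!-- Text header: the text is classified as newspaper text. -->
<textClass>
  <catRef target="news"/>
</textClass>
```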
When a standard set of text categories is developed, it is anticipated that an
attribute on <textClass> will provide the category. Unless the
standard categories are extended, no pointer to <category>
elements in the corpus header will be required.
The third component of the header is the profile description. The
<profileDesc> element has the following components:
- <creation>
- contains information about the origination of a text.
- <langUsage>
- groups information describing the languages, sublanguages, registers,
dialects etc. represented within a text.
- <wsdUsage>
- groups information describing the character set(s) used within a text.
- <textClass>
- groups information which describes the nature or topic of a text in
terms of a standard classification scheme, thesaurus, etc.
- <translations>
- groups information about existing translations of the text.
- <annotations>
- groups information about existing annotation files associated with the text.
These components appear in individual text headers, since they describe features of
particular texts.
The <creation> element is used to record details concerning the origination of the text, whether or not covered elsewhere.
The <langUsage> element contains one or more
<language> elements, each identifying a language used in the text:
- <language>
- characterizes a language, sublanguage, register, dialect,
etc., used within a single text.
- iso639
- gives the standard language code from ISO 639 in one of the following forms:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- type
- indicates the type of language, e.g., sublanguage, dialect,
etc.
Example:
<langUsage>
<language id="fr" iso639="fr">French</language>
<language id="en" iso639="en">English</language>
<language id="la" iso639="la">Latin</language>
</langUsage>
The value of the id attribute on any <language> element should be given as a value for the global lang attribute when it is used on a tag in the text or header to refer to this language.
For example,
She ate <foreign lang=fr>croissants</foreign>
When more than one character set is used in a text, the wsd attribute should be used on each <language> tag to associate the language with a particular character set.
The <wsdUsage> element contains one or more
<writingSystem> elements, each identifying a character set used in the text:
- <writingSystem>
- characterizes a character set used within a single text.
Example:
<wsdUsage>
<writingSystem id="ISO 8859-1">ISO character set for western
European languages</writingSystem>
<writingSystem id="ISO 8859-5">ISO character set for
Cyrillic</writingSystem>
</wsdUsage>
The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set.
For example,
This is a patch of Cyrillic:
<foreign lang=bg wsd="ISO 8859-5">
Големия
брат
те наблюдава
</foreign>
When a Writing
System Declaration describing a transcription scheme is provided
as an auxiliary document, the value of the wsd attribute on the
<writingSystem> element must be an entity that points to
this document. Usually, the entity expands to be the name of the file
in which the Writing System Declaration is stored. Note that for this
reason, the type of the wsd attribute on the
<writingSystem> element is ENTITY (indicating that its
value must be an SGML entity). In all other instances, whether in the
header or text, the type of the wsd attribute is CDATA.
The <textClass> element contains references to
the text classification scheme and descriptive keywords which together describe
the text concerned. The following elements are used for these purposes:
- <catRef>
- specifies one or more defined categories within some taxonomy or text
typology.
- target
- identifies the text category or categories, by means of an IDREF pointing to one or more <category> elements defined in the corpus header.
- scheme
- identifies the classification scheme.
- <h.keywords>
- contains a list of keywords or phrases identifying the topic or
nature of a text, each of which is tagged as a term. To be provided by EAGLES/PAROLE.
The <h.keywords> element contains one or more technical terms:
- <keyTerm>
- contains a technical term or phrase, particularly in a
list of descriptive keywords.
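A keyword list inside a text classification might look like the following sketch. The keywords are invented for illustration.

```xml
<!-- Hypothetical keywords for a newspaper text. -->
<textClass>
  <h.keywords>
    <keyTerm>elections</keyTerm>
    <keyTerm>parliament</keyTerm>
  </h.keywords>
</textClass>
```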
The <translations> element groups information about existing translations of the text, usually within the same corpus. The following elements are used for these purposes:
- <translation>
- gives information about a translation of the text. The global
lang attribute and the wsd attribute are required on
this tag. Additionally, this tag has the following optional
attribute:
- trans.loc
- provides, in an entity
reference, information (path/file name, URL, etc.) about the location
of the translation.
- <translator>
- gives the name of the translator.
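As a rough sketch only, a translation record might be written as follows. The entity name sampleFr is hypothetical and is assumed to be declared elsewhere; placing <translator> inside <translation>, and the prose content, are also assumptions, since the text above does not state how the two elements nest.

```xml
<!-- sampleFr is a hypothetical entity assumed to point at the translation file. -->
<translations>
  <translation lang="fr" wsd="ISO 8859-1" trans.loc="sampleFr">
    French translation of the text.
    <translator>Translator, Anne</translator>
  </translation>
</translations>
```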
The <annotations> element groups information about annotation documents associated with the text. The following elements are used for these purposes:
- <annotation>
- gives information about an annotation file associated with the text.
Attributes:
  - type
  - indicates the type of annotation. Values include:
    - SEGMENT: the annotation file contains segmentation into sentences and words.
    - GRAM: the annotation file contains morpho-syntactic category information for the words in the text.
    - ALIGN: the annotation file contains alignment links to a parallel translation.
  - ann.loc
  - provides, in an entity reference, information (path/file name, URL, etc.) about the location of the annotation file.
  - trans.loc
  - for annotation files containing alignment information, provides, in an entity reference, information (path/file name, URL, etc.) about the location of the file containing the aligned text.
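The three annotation types listed above could be declared as in the following sketch. The grouping element name (<annotations>), the descriptive text inside each <annotation>, and the entity names (`orwell.seg`, `orwell.gram`, `orwell.align`, `orwell.fr`) are all hypothetical; the entities are assumed to be declared elsewhere and to resolve to the files concerned.

```sgml
<annotations>
  <!-- segmentation into sentences and words -->
  <annotation type="SEGMENT" ann.loc="orwell.seg">sentence and word boundaries</annotation>
  <!-- morpho-syntactic categories for the words of the text -->
  <annotation type="GRAM" ann.loc="orwell.gram">morpho-syntactic tags</annotation>
  <!-- alignment file: trans.loc additionally names the aligned text -->
  <annotation type="ALIGN" ann.loc="orwell.align" trans.loc="orwell.fr">alignment with the French version</annotation>
</annotations>
```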
The revision description is the fourth element in the header. It is used to
record details of any significant change to the corpus. The
<revisionDesc> element has the following component:
- <change>
- summarizes a particular change or correction made to a particular
version of an electronic text which is shared between several
researchers.
One <change> element should appear for each change; multiple <change> elements are allowed.
Unlike its counterpart in the TEI scheme, the
<change> element must here contain the following elements:
- <changeDate>
- gives the date of the change. Attribute:
  - value
  - specifies a standard value for this date in ISO 8601 format.
- <respName>
- specifies the person responsible for the change.
- <h.item>
- specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element.
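For illustration, a revision description recording a single change might be encoded as follows. The date, the name, and the descriptions of the changes are invented for the example.

```sgml
<revisionDesc>
  <change>
    <!-- value carries the normalized ISO 8601 form of the date -->
    <changeDate value="1996-04-06">6 April 1996</changeDate>
    <respName>A. Student</respName>
    <!-- one <h.item> per aspect of the change -->
    <h.item>corrected typing errors in the second division</h.item>
    <h.item>added missing DATE tags</h.item>
  </change>
</revisionDesc>
```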
When any significant change is made to any component of the corpus, the
following steps should be taken:
- a <change> element is added to the
<revisionDesc> of the text affected;
- the update attribute of the text header is changed to the date of
the change;
- the value of the status attribute of the text header is set to
UPDATE;
- the revision number specified on the version attribute of the
<editionStmt> of the corpus header is incremented.
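The effect of these four steps can be sketched as below. The attribute names update, status and version are those used in the steps above; the date, the date format on the update attribute, and the revision numbers are assumptions made for the example.

```sgml
<!-- Text header after a change made on a hypothetical date -->
<cesHeader status="UPDATE" update="1996-04-06">
  <!-- fileDesc, encodingDesc and profileDesc omitted -->
  <revisionDesc>
    <change>
      <!-- changeDate, respName and h.item describing this change -->
    </change>
  </revisionDesc>
</cesHeader>

<!-- Corpus header: revision number incremented, here from 1.0 to 1.1 -->
<editionStmt version="1.1">
  <!-- content of the edition statement unchanged -->
</editionStmt>
```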
The decls attribute is specified for the element <body> and
the larger division elements (<div>).
It is used for two purposes:
- to supply a specific title for parts of composite works;
- to specify encoding or other declarations applicable to all or part of a
text where a number of possibilities have been provided for in the
header.
Its value is a list of identifiers, each of which has been supplied
elsewhere in a text or corpus header as the identifier for one of the following
elements: <biblStruct>, <editorialDecl> and its
constituents (<correction>, <hyphenation>,
<quotation>, <segmentation> and
<transduction>), and <textClass>.
For these elements, the corpus header will normally contain several mutually
incompatible options, for example, several editorial declarations. Individual
texts, or portions of texts, specify explicitly which of the available options
applies to them by using the decls attribute. In cases where the set
of declarable elements applies only within portions of a single text, they will
be specified in the text header rather than the corpus header.
Declarable elements, once specified, are inherited by all sub-components. That
is, if the decls attribute of a <body> element specifies a
particular value for some declarable element, that value is understood to apply
to all components of the text unless overridden. If the decls attribute
of a <div> within that text specifies a different value, the new
value applies to the contents of that <div> only; the value
specified by the <body> applies to all subsequent
<div> elements in the same text, unless they also specify a
different decls value.
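The inheritance rule can be illustrated with the following sketch. The identifiers `ed.a` and `ed.b`, and the attribute values on the second <correction>, are hypothetical. The two <editorialDecl> elements are assumed to appear in the corpus header, and the <body> in an individual text.

```sgml
<!-- Corpus header: two alternative editorial declarations -->
<editorialDecl id="ed.a">
  <correction status="medium" method="silent"></correction>
</editorialDecl>
<editorialDecl id="ed.b">
  <correction status="high" method="silent"></correction>
</editorialDecl>

<!-- Text: the body selects ed.a; one division overrides it -->
<body decls="ed.a">
  <div>
    <!-- no decls attribute: ed.a is inherited from the body -->
    <p>...</p>
  </div>
  <div decls="ed.b">
    <!-- ed.b applies to the contents of this division only -->
    <p>...</p>
  </div>
  <div>
    <!-- ed.a applies again -->
    <p>...</p>
  </div>
</body>
```

Because the value of decls is a list of identifiers, a single attribute can in principle select several declarations at once, for example an editorial declaration and a text classification.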
For non-declarable elements, the header of an individual text will specify only
those respects (if any) in which it differs from the defaults stated in the
corpus header.
This is a simplification of the decls mechanism described in
the TEI Guidelines.
An example CES header for a single text:
<cesHeader version="2.0">
  <fileDesc>
    <titleStmt>
      <h.title>Machine-readable version of 1984, ch. 1</h.title>
      <respStmt>
        <respType>typed in and marked with CES tags </respType>
        <respName>A. Student</respName>
      </respStmt>
    </titleStmt>
    <extent>
      <wordCount>6571 </wordCount>
      <byteCount units="bytes">6571 </byteCount>
    </extent>
    <publicationStmt>
      <distributor>Laboratoire Parole et Langage, CNRS</distributor>
      <pubAddress>29, avenue Robert Schuman
        Aix-en-Provence, France</pubAddress>
      <telephone>+33 42 95 36 33</telephone>
      <fax>+33 42 59 50 96</fax>
      <eAddress>phonetic@univ-aix.fr</eAddress>
      <availability status="restricted">
        internal use only--cannot be distributed</availability>
      <pubDate>6571</pubDate>
    </publicationStmt>
    <sourceDesc>
      <biblStruct>
        <monogr>
          <h.title>Nineteen Eighty-four</h.title>
          <h.author>George Orwell</h.author>
          <imprint>
            <pubPlace>New York</pubPlace>
            <publisher>New American Library</publisher>
            <pubDate>1949; reprinted 1961</pubDate>
          </imprint>
        </monogr>
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <projectDesc>
      This English version of the first chapter of Orwell's 1984 is
      encoded for use in the MULTEXT-EAST project. The English is
      to serve as the base for the parallel corpus, and will be aligned
      to versions of the text in Romanian, Bulgarian, Estonian,
      Slovenian, Czech, and Hungarian.
    </projectDesc>
    <editorialDecl>
      <conformance level="1">CES Level 1</conformance>
      <correction status="medium" method="silent"></correction>
      <quotation marks="none" form="std">Rendition attribute values on Q
        and QUOTE tags are adapted from ISOpub and ISOnum standard
        entity set names
      </quotation>
      <segmentation>Marked up to the level of paragraph plus
        marking of particular sub-paragraph elements: NAME, DATE,
        FOREIGN.
      </segmentation>
    </editorialDecl>
    <tagsDecl>
      <tagUsage gi="body" occurs="1"></tagUsage>
      <tagUsage gi="date" occurs="5"></tagUsage>
      <tagUsage gi="div" occurs="2"></tagUsage>
      <tagUsage gi="foreign" occurs="4"></tagUsage>
      <tagUsage gi="hi" occurs="4"></tagUsage>
      <tagUsage gi="name" occurs="149"></tagUsage>
      <tagUsage gi="note" occurs="1"></tagUsage>
      <tagUsage gi="num" occurs="2"></tagUsage>
      <tagUsage gi="p" occurs="41"></tagUsage>
      <tagUsage gi="ptr" occurs="1"></tagUsage>
      <tagUsage gi="q" occurs="22"></tagUsage>
      <tagUsage gi="quote" occurs="3"></tagUsage>
    </tagsDecl>
  </encodingDesc>
  <profileDesc>
    <langUsage>
      <language id="fr" iso639="fr">French</language>
      <language id="en" iso639="en">English</language>
      <language id="la" iso639="la">Latin</language>
      <language id="ns">Newspeak</language>
    </langUsage>
  </profileDesc>
</cesHeader>
The CES Header element definitions
The CES Header element definitions in hypertext navigable format