Pino, Marta: SGML
Encoding two large Spanish corpora with the TEI scheme: design and
technical aspects of textual markup
[Mirrored from:
http://www.cs.vassar.edu/~ide/DL96/pino.txt]
Marta Pino. Computational Linguistics Department. Instituto de
Lexicografía. Real Academia Española. E-mail:
mpino@crea.rae.es
1. Introduction
The Lexicographic Institute of the Royal Spanish Academy is compiling
two large corpora: a reference corpus of modern Spanish, called CREA
(Corpus de Referencia del Español Actual), and a historical
corpus, known as CORDE (Corpus Diacrónico del Español).
CREA is a monitor corpus that covers the last 25 years of the language.
This means that once the corpus is completely compiled, it will cover
all the varieties of Spanish language use from 1975 to 2000. It will
contain 200 million words of running text, providing an empirical basis
for lexicographic and grammatical research. In the present stage, it has
8 million words, partially encoded. CORDE is a corpus of 80 million
words that covers the rest of the history of Spanish: from the origins
to 1975. Since CREA is a monitor corpus, it will periodically pass the
oldest texts to CORDE.
Both corpora are being encoded and morphosyntactically tagged. In the
future, syntactic and pragmatic information may also be associated with
the texts.
The aim of this paper is to show the main principles of the encoding
scheme applied to these corpora, focusing on some particular encoding
problems and their TEI or non-TEI solutions.
2. Overall structure of the encoding scheme for CREA and CORDE
2.1. Structural classification of the TEI.2 documents
Within each corpus there are two main types of TEI.2 documents, which
correspond to the difference between unitary (UT) and composite texts
(CT). On the one hand, there are autonomous textual units, such as single
books or any other object published independently. These are called
"unitary texts", and correspond to TEI.2 documents whose structure is
composed of a header and a text, subdivided into front, body and back.
On the other hand, there are texts that, although they are also
independent objects, have a more complex structure, made up of different
and relatively self-contained texts, such as newspapers, magazines or
anthologies. Since these texts are made of texts, they are "composite
texts", and give rise to TEI.2 elements that consist of a header and a
text, which in turn includes a front and a group of texts, each with its
own structure of front, body and back. In this paper, we will use the
term "nested texts" (NT) to refer to the texts included in a composite
text, to differentiate them from the unitary and the composite ones.
2.2. General structure of each corpus as an SGML document
In order to convert a corpus into an SGML document, it has been
necessary to add a certain type of markup to the texts, and also to
associate the corpus with an SGML declaration and a formal document
type definition (DTD). Figure 1 shows the way the texts are organized
in the corpus. There is a header for the whole corpus and a series of
TEI.2 elements. Some of these TEI.2 documents are unitary, and some
are composite.
Fig. 1. Overall structure of a corpus as an SGML document:
[Figure markup not preserved in this copy. Recoverable outline: an
instance of a unitary text, followed by an instance of a composite text
containing two instances of component texts.]
As the figure shows, the SGML document has another component apart from
the corpus. It consists, on the one hand, of an SGML declaration, which
contains certain technical details concerning the variety of SGML
parameters selected, and some other codes especially useful for text
interchange. On the other hand, there is a DTD associated with the
corpus, which is the document where all the SGML elements and entities
used in the texts are declared. Here are some examples of these two
aspects:
Fig. 2. Fragment of the TEI SGML Declaration adopted for CREA and CORDE:
Fig. 3. Fragment of the DTD written for the two corpora:
The corpus itself consists of a TEI header and a series of TEI.2
documents, which are also divided into TEI header and text (unitary
TEI.2 documents) or into TEI header and group of texts (composite TEI.2
documents). The next sections will analyse different encoding aspects of
the corpus.
2.3. Types of references associated to the corpus
As figure 1 showed, some elements have an identifier attribute. The
function of this information is to differentiate parts of the corpus by
means of a unique code. But not all the references used in the corpus
are of this type. We have distinguished four main types of references,
which correspond to the following four subsections.
2.3.1. Identification references
The first type of reference is the one that identifies TEI.2 documents
or nested texts within a corpus. It is a unique code assigned to each
text (nested, composite, unitary) with the aim of making it
recognizable within the two corpora. The codes must also be different
between the two corpora, so that there is no collision when a text from
CREA passes to CORDE. These references are constructed by the design
department of the corpora, and consist of the following parts:
In CREA, the code for unitary or composite, but not journalistic, texts
specifies the corpus name, the medium, the superfield and thematic area,
and the number within the thematic area, followed by an optional second
thematic area:
CR.L.1.01.001
CR.L.1.01.001.2.01
The code of composite journalistic texts indicates the name of the
corpus, the medium, the title and date of the publication, and the
number within this title:
CR.P.PA1995.001
The code for an analytic text within a composite journalistic text is
like the last one, but with the addition of a number:
CR.P.PA1995.001.001
In CORDE, the system is the same for journalistic texts, but not for the
rest. This corpus specifies the name of the corpus, the age or period
the text belongs to, the genre and subgenre, the number within the
subgenre, and an optional second genre or subgenre:
CO.M.11.A11.001
CO.M.11.A11.001.12.A12
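As a rough illustration of how such identification codes could be split
into their parts, the following Python sketch parses the CREA examples
given above. The regular expressions, the field names and the function
parse_crea_code are assumptions inferred only from the sample codes; the
real field widths and alphabets may well differ.

```python
import re

# Hypothetical patterns inferred from the sample codes in this section;
# the actual field widths and alphabets are not specified in the paper.
NON_JOURNALISTIC = re.compile(
    r"^CR\.(?P<medium>[A-Z])"
    r"\.(?P<superfield>\d)\.(?P<area>\d{2})"
    r"\.(?P<number>\d{3})"
    r"(?:\.(?P<superfield2>\d)\.(?P<area2>\d{2}))?$"
)
JOURNALISTIC = re.compile(
    r"^CR\.(?P<medium>[A-Z])"
    r"\.(?P<title>[A-Z]{2})(?P<year>\d{4})"
    r"\.(?P<number>\d{3})"
    r"(?:\.(?P<nested>\d{3}))?$"
)


def parse_crea_code(code):
    """Split a CREA identification code into named parts."""
    for kind, pattern in (("non-journalistic", NON_JOURNALISTIC),
                          ("journalistic", JOURNALISTIC)):
        match = pattern.match(code)
        if match:
            # Drop optional groups that did not take part in the match.
            parts = {k: v for k, v in match.groupdict().items()
                     if v is not None}
            parts["kind"] = kind
            return parts
    raise ValueError(f"unrecognized CREA code: {code!r}")


if __name__ == "__main__":
    for sample in ("CR.L.1.01.001", "CR.L.1.01.001.2.01",
                   "CR.P.PA1995.001.001"):
        print(sample, parse_crea_code(sample))
```

A CORDE parser could follow the same pattern, with period, genre and
subgenre fields in place of medium, superfield and thematic area.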
2.3.2. Internal location references
The second type of reference provides a code that serves to find a
fragment of text within the corpus. Every part of the texts will carry
this information, so that any researcher can trace any example or
fragment extracted from the corpus back to its source. This reference
consists of the following data:
In unitary or composite non-journalistic texts, this reference specifies
the name of the text, a number assigned to this name, and the page of
the example:
ade001:56 El adefesio, by Rafael Alberti.
cas001:80 Castilla, by Azorín.
In composite journalistic texts, the code specifies the name of the
publication, the year, and the number within the year:
vo1995.001 (First number of 1995 included in the corpus of the
newspaper La Voz de Galicia)
Nested texts always add a number before the indication of page:
vo1995.001:78
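A location reference of this shape can be separated into its text code
and page with very little code. The sketch below is only an
illustration: the function name parse_location is hypothetical, and it
assumes that the page always follows the last colon and is purely
numeric, as in the examples above.

```python
def parse_location(ref):
    """Split 'code:page' into the text code and an integer page number.

    Assumes the page is the numeric part after the last colon.
    """
    code, sep, page = ref.rpartition(":")
    if not sep or not code or not page.isdigit():
        raise ValueError(f"not a location reference: {ref!r}")
    return code, int(page)


if __name__ == "__main__":
    for sample in ("ade001:56", "cas001:80", "vo1995.001:78"):
        print(sample, parse_location(sample))
```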
2.3.3. References of the TEI header
Some parts of the TEI header also have a code, which is intended to
link them to the texts of the corpus they correspond to. Because the
TEI header can be presented as an independent file, a code is needed to
link the bibliographical information to the texts. These are the
references contained within the TEI headers of the TEI.2 documents.
Within the element teiHeader, the reference starts with "th", followed
by the code of the text. Examples:
thade001
thcas001
thvo1995.001
thvo1995.001.001
Within the element fileDesc, the reference starts with "fd", followed by
the code of the text. Examples:
fdade001
fdcas001
fdvo1995.001
fdvo1995.001.001
Within the element sourceDesc, the reference starts with "sd", followed
by the code of the text. Examples:
sdade001
sdcas001
sdvo1995.001
sdvo1995.001.001
Within the element encodingDesc, the reference starts with "ed", followed
by the code of the text. Examples:
edade001
edcas001
edvo1995.001
edvo1995.001.001
Within the element revisionDesc, the reference starts with "rd", followed
by the code of the text. This part is not yet developed, since no text
has been revised so far. Examples:
rdade001
rdcas001
rdvo1995.001
rdvo1995.001.001
This aspect will be developed below in more detail.
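The header references listed above all follow one pattern: a two-letter
prefix plus the code of the text. A minimal Python sketch of that
pattern is given here. The prefixes are taken from the examples; pairing
"fd", "sd", "ed" and "rd" with fileDesc, sourceDesc, encodingDesc and
revisionDesc is an assumption based on their initials, and the function
names are hypothetical.

```python
# Prefixes as they appear in the examples above. The element each one
# belongs to is an assumption based on the initials of the element name.
HEADER_PREFIXES = {
    "teiHeader": "th",
    "fileDesc": "fd",
    "sourceDesc": "sd",
    "encodingDesc": "ed",
    "revisionDesc": "rd",
}


def header_ids(text_code):
    """Build the identifier of each header element for one text."""
    return {element: prefix + text_code
            for element, prefix in HEADER_PREFIXES.items()}


def text_code_from_id(header_id):
    """Recover the text code that a header identifier points to."""
    prefix, code = header_id[:2], header_id[2:]
    if prefix not in HEADER_PREFIXES.values() or not code:
        raise ValueError(f"not a header identifier: {header_id!r}")
    return code


if __name__ == "__main__":
    print(header_ids("ade001"))
    print(text_code_from_id("sdvo1995.001.001"))
```

Stripping the prefix is what lets a header stored in a separate file be
matched back to the text it describes.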
2.3.4. References of structural or non-structural textual markup
Some elements that occur within the texts also have identifier or
number attributes:
pb
hand
note
3. Classification of the texts
3.1. Text typologies in CREA and CORDE
There are several taxonomies that classify the texts of the two corpora
according to different parameters. The corpus CREA has three different
taxonomies, called "crea", "medio" and "oral". The corpus CORDE has
four: "corde", "medio", "modal" and "epoc".
The taxonomy "crea" of the corpus CREA classifies texts into superfields
and thematic areas (see figure below). The taxonomy "corde" of the
corpus CORDE also classifies the texts, but with other criteria: genre
and subgenre, instead of thematic area.
The most important taxonomies are "crea" and "corde". The taxonomy
"oral", which differentiates types of oral texts, is not yet developed.
All three agree with the design principles of the corpora, since the
categories are the basis for sampling and organization of the texts.
Fig. 4. Example of the taxonomy "crea":
Ciencias y tecnología
  Biología
  Veterinaria
  Ecología
  Tecnología
  Física
  Agricultura, ganadería, pesca
  Meteorología
  Redes de comunicación
  Geología
  Química
  Informática
Ciencias sociales, creencias y pensamientos
  Religión
  Lingüística, Lenguaje
  Historia
  Sociología
  Literatura
  Memorias, testimonios
  Erotismo, sexología
  Psicología
  Ética
  Geografía
  Problemática social
  Civilización, etnología
  Antropología
  Mitología
  Folklore
  Educación
  Mujer
Fig. 5. Example of the taxonomy "corde":
Prosa
Prosa lírica
Prosa narrativa
Prosa narrativa breve
Relato breve tradicional
Relato breve culto
Prosa narrativa extensa
Relato extenso novela y otras formas similares
Relato extenso diálogo y miscelánea
Otros
other categories
The other taxonomies differentiate fewer categories. Thus "medio"
differentiates four categories in CREA, and only three in CORDE;
"modal" is exclusive to CORDE and opposes verse to prose; "epoc" is used
only in CORDE, and differentiates Middle Ages, Golden Age and
Contemporary texts.
The classification codes assigned to each text are declared in an
element of the profileDesc of the TEI header, called textClass. This is
the way the categories are declared:
Fig. 6.
As the figure shows, the element textClass contains several catRef, each
of them specified in scheme and target. These two attributes associate a
value (target) to a particular taxonomy (scheme).
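To make the relation between textClass, catRef, scheme and target
concrete, here is a small Python sketch that builds such an element. It
is only an approximation: it serializes XML-style markup with the
standard library rather than the SGML used in the corpora, and the
target values "crea.1.01" and "medio.L" are invented placeholders, not
the actual category identifiers.

```python
import xml.etree.ElementTree as ET


def build_text_class(categories):
    """Build a textClass element holding one catRef per (scheme, target)."""
    text_class = ET.Element("textClass")
    for scheme, target in categories:
        ET.SubElement(text_class, "catRef",
                      {"scheme": scheme, "target": target})
    return text_class


if __name__ == "__main__":
    # Placeholder targets; the real category codes are not given here.
    element = build_text_class([("crea", "crea.1.01"),
                                ("medio", "medio.L")])
    print(ET.tostring(element, encoding="unicode"))
```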
There are other classification systems applied to the corpus, such as
the ISBN, the ISSN or the Spanish "Depósito Legal". These values are
declared within the sourceDesc, in an element called idno. This element
has a type attribute, which specifies the category. Example:
Fig. 7
Code ISBN
Code ISSN
Code Depósito Legal
3.2. Movement of texts from CREA to CORDE
As mentioned above, CREA is a monitor corpus that periodically passes
its oldest texts to CORDE. The need to move texts from one corpus to the
other makes it necessary to define a system that converts the old codes
of the text into the new ones.
The system we have designed consists of associating a CORDE code to each
CREA text in an idno element, as the figure shows:
Fig. 8
CORDE code assigned to a CREA text
This assignment is made at the same time that the TEI header of the text
is written. This ensures that the movement of texts can be made
automatically. The only thing that a program must do is to replace the
identifying reference of the text with the value this idno element used
to have. The result is a text with a new identifying reference and a new
idno value, which is the original identifying reference in the corpus
CREA.
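The swap described here can be sketched in a few lines of Python. The
record layout (a dictionary with "id" and "idno" keys) and the function
move_to_corde are hypothetical. The idno type "ccorde" is the one
mentioned for CREA texts in section 4.1.2, while "ccrea", used here to
keep the original CREA code, is an assumed name, and the CORDE code in
the usage example is a made-up value.

```python
def move_to_corde(text):
    """Return a copy of a CREA text record re-identified for CORDE.

    `text` is a hypothetical record of the form
    {"id": <CREA code>, "idno": {"ccorde": <CORDE code>, ...}}.
    """
    idno = dict(text["idno"])
    corde_code = idno.pop("ccorde")   # code assigned in advance
    idno["ccrea"] = text["id"]        # "ccrea" is an assumed type name
    moved = dict(text)
    moved["id"] = corde_code
    moved["idno"] = idno
    return moved


if __name__ == "__main__":
    # The CORDE code below is invented for illustration only.
    record = {"id": "CR.L.1.01.001",
              "idno": {"ccorde": "CO.C.11.A11.001"}}
    print(move_to_corde(record))
```

Because the function works on a copy, the original CREA record is left
untouched until the move has been checked.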
4. Bibliographical information included in the TEI header and indexed
in the data base (COSMAS 2.0)
4.1. Elements of the TEI header structure defined in the TEI Guidelines
4.1.1. TEI header of the corpus
Some general information is declared within the TEI header associated
with the whole corpus. It is distributed mainly among the following
elements: fileDesc, encodingDesc, and revisionDesc.
The fileDesc indicates the title, the party responsible, the edition
number, the publication status and the extent of the corpus.
The encodingDesc reports the aims of the corpus as a project, the
sampling principles, the editorial principles in correction, quotation,
hyphenation, segmentation and interpretation, and finally the
taxonomies used to classify texts.
The revisionDesc, which has not been developed yet, will inform about
revisions of the corpus.
4.1.2. TEI header of each text
Each text has its own TEI header, as a way to introduce some
bibliographical information concerning the electronic texts and their
source editions.
The fileDesc includes information about the electronic text, such as
title, edition number and extent, and also information about the source
edition of the text, that is to say, the paper or electronic version of
the text included in the corpus. In unitary texts, the source
description informs about the title, the author, the editor, the
publisher, the publication place and date, the year of the first
edition, the pages it has, some classification codes, such as isbn, dl,
ccorde (for CREA texts), and some data concerning the series a text can
belong to. There is also a field for notes. In composite texts, several
source descriptions are needed: one referring to the whole composite
text (monogr), and one for each nested text (analytic followed by a
monogr element). Within each source description of a composite text, the
data provided are more or less the same as for a unitary text, although
there are some changes. The identifiers differentiate the several source
descriptions and link them to the part of text they correspond to (see
above).
The encodingDesc declares the tags actually used in the text, shows the
location references that this text will have, and describes the profile
of the text, that is to say, some creation details, the language used in
it, the classes the text belongs to according to the taxonomies defined
in the TEI header, the list of hands that take part in the text, and
some other details if the text is spoken. The revisionDesc will describe
the changes made to the text, once the corpus is compiled and revision
starts.
4.2. Elements added to the TEI header structure
All the elements described follow the TEI scheme. However, it has been
necessary to modify the Guidelines at certain points, in order to
include some additional data which are considered important for the
purposes of the corpus.
First of all, some data about the origin, the country and the sex of the
author, provided they are known, have been added by means of an idno
element, within the sourceDesc, as the next figure shows:
Fig. 9
(Spanish or Latin-american??)
Secondly, the date of nested texts is considered important, particularly
when there is a long chronological distance between nested texts
belonging to the same composite text:
Fig. 10
Date of the nested text
When it is preferable to classify nested texts by their own date,
instead of taking the date of the composite text, then nested texts are
treated as independent texts belonging to the same collection.
Another unsolved problem in the TEI is the indexing of the date of
creation of a text when it is the traditional reference date for it. The
solution for CREA and CORDE is to interpret date n=1.0 as the date of
the first edition or the date of creation of the text, depending on
which one is the reference date. In this way only one original date
(date n=1.0) and one date of the source edition (date n=x.y) are
indexed, and the two may or may not coincide. Any other explanation can
be given within the element creation or in note.
5. Types of structural information included in the texts
The main problem of the structural markup of the texts is that there are
very different types of text within the corpus. This is the reason why
the DTD of these corpora does not choose only one possible structure,
such as prose, verse or drama, but introduces elements from all of them,
making many different combinations acceptable. This encoding scheme is
similar to the one found in TEI Lite, although it adds some elements
not included in that DTD.
The main structural division of a text is the unit div, numbered from 1
to 7 to indicate the division level. The element text may or may not
have divisions, depending on its internal organization. The markup must
always adapt to the original structure of the text.
6. Types of non-structural information included in the texts
A text of CORDE can be in prose or in verse. It will select s or l
units depending on this condition. In CREA this problem does not exist,
since there is no verse. Drama and spoken texts can also be found, which
are quite special from the structural point of view. And, of course,
there can be very different mixtures, such as prose within verse or
written parts in spoken texts. In these cases, a particular division of
a text changes the elements used in other divisions to respond to the
requirements of the new text. As can be seen, text modalities are not
treated as watertight compartments.
Basically, these are the main non-structural elements found in the texts
of the corpora:
p and s in prose texts;
l in verse texts;
u and s in spoken texts;
sp, p and s in drama texts.
Apart from these elements, there are other kinds of non-structural
markup within the texts. First of all, there are some elements used to
separate highlighted parts of a text from the rest of the corpus. There
have been some changes on this point: at the beginning all the
highlighted expressions used to be interpreted, whereas now some of them
are only being treated as emphatic (emph) elements. This change has been
motivated by the need to finish the first stage of these corpora by
the end of 1997. In the next stage, some of the emph elements will be
differentiated according to the previous encoding proposal. The
categories for highlighted text distinguished now are the following:
cit
quote
emph
q
Other SGML and TEI elements found in the texts of the corpora are these:
abbr
note and anchor
list
table
formula
caption
corr sic=
sic
add
del
restore
gap
supplied
It has been necessary to change the TEI scheme in the encoding of some
tables, since it proved complicated and slow. Some tools automatically
convert tables into the TEI scheme. However, it is very common to
find complicated tables, especially in old texts, that do not fit
the typical grid. The correct markup of these special cases requires too
much time and effort. Consequently, a small change has been
introduced: the element table can have plain text as content.
Apart from the tags we have described, there are others that can be
found only in certain types of text. Thus, in spoken texts, there are
these special tags:
u, for each utterance;
pause, for pauses;
vocal, for expressions that are not lexical units, but communicate
something;
kinesic, for gestures or movements of the participant;
event, for noises or other non-communicative events.
In dramatic texts, there can also be a cast list before the play itself.
This unit requires tags such as castList, castItem, castGroup, role, and
roleDesc. In the text of the play, elements such as sp, speaker and
stage are very common.
7. Conversion of the text to a plain text format
7.1. Treatment of hard, soft and end-of-line hyphens
In the texts of the corpora, end-of-line hyphens are suppressed.
Accordingly, a line like this one:
Fig. 11
¿Habrá en este comentario una crítica velada a mi
apariencia?, pensó Onofre Bouvila al oír lo que
decía el señor Braulio. Aunque la actitud cordial del
fondista parecía desmentir esta suposición, la susceptibilidad de
Onofre Bouvila estaba plena-
mente justificada
will be encoded like this:
Fig. 12
¿Habrá en este comentario una crítica velada
a mi apariencia?,
pensó Onofre Bouvila al oír lo que
decía el señor Braulio. Aunque la actitud cordial del
fondista parecía desmentir esta suposición, la susceptibilidad de
Onofre Bouvila estaba plenamente
justificada.
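The suppression of end-of-line hyphens illustrated in figures 11 and 12
could be approximated with a regular expression, as in the Python sketch
below. This is not the tool used for the corpora; the pattern and the
function name remove_eol_hyphens are assumptions, and the rule is
deliberately naive: a hard hyphen that happens to fall at the end of a
line would be removed as well, so the result still needs manual review.

```python
import re

# A hyphen between two word characters, directly followed by a line
# break, is treated as an end-of-line hyphen. Simplification: a hard
# hyphen that lands at the end of a line is removed too.
EOL_HYPHEN = re.compile(r"(?<=\w)-\n(?=\w)")


def remove_eol_hyphens(text):
    """Rejoin words split by a hyphen at the end of a line."""
    return EOL_HYPHEN.sub("", text)


if __name__ == "__main__":
    sample = "Onofre Bouvila estaba plena-\nmente justificada"
    print(remove_eol_hyphens(sample))
```

Hyphens inside a line, as in compound words, dates or intervals, are not
matched by this pattern and therefore stay in place.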
Hard hyphens such as the ones found in composite words, intervals or
dates, are preserved as normal hyphens. Therefore, expressions like the
following will appear in the corpus this way:
Fig. 13
Composite words:
Original text:
El vestíbulo era pequeño: sólo cabían
allí un mostrador de madera clara con su escribanía de
latón y su libro-registro.
Encoded text:
El vestíbulo era pequeño: sólo cabían
allí un mostrador de madera clara con su escribanía de
latón y su libro-registro.
Dates:
2-5-95
Intervals:
pages 2-20
Soft hyphens used in direct speech or thinking are replaced by the tag
q. Example:
Fig. 14
Original text:
Es este barrio ruin lo que nos obliga a poner unos precios muy por
debajo de la categoría del establecimiento -se lamentó.
Fig. 15
Encoded text:
Es este barrio ruin lo que nos obliga a poner unos precios muy por
debajo de la categoría del establecimiento
se
lamentó.
There are also soft hyphens that correspond to items of lists. These
ones are replaced by the tag item. The rest of the soft hyphens, which
indicate the beginning of a parenthetical comment, are replaced by low
hyphens followed by space, so that they can be distinguished from hard
hyphens.
7.2. Treatment of quotation marks, italics, bold face, small capitals,
capitals or underlined characters
Any highlighted piece of text will be encoded with the tag emph, unless
it is a quotation (quote, cit), or direct speech (q). This implies that
the quotation marks will be suppressed, since the information is
preserved by other means. No difference will be made between types of
rendition. There is an attribute available for this purpose, but it
requires manual intervention to some extent, since there are several
typographical combinations.
Examples:
Rosa, sueño de nadie bajo tantos
párpados
, escribe Rilke.
Todo joven es un parvenu de la
fisiología.
Me interesa menos el habla del conjunto de la
población que lo que podríamos llamar
escribidura particular de ese pequeño sector de
hombres públicos.
Todos han venido esta tarde
There is a program that introduces the tag emph wherever there is any
special typographical rendition. The manual correction of the text
allows the replacement of emph by quote, cit or q whenever necessary.
As will be shown below, two copies of the text are saved: an already
revised copy is saved in a non-ISO 646 character set, without SGML
markup, and a copy that will form part of the corpus is saved as a
fully encoded 7-bit text.
7.3. Treatment of characters not included in the set ISO 646
Characters not belonging to ISO 646 are not valid for interchange. For
this reason, it is necessary to convert them into another format. The
standard writing system declaration useful for Spanish is the entity set
known as "Latin-1", described in ISO 8859-1. An automatic character
conversion is always made, in the corpora, from the special Spanish
orthographical signs to the SGML standard entities. This process
operates once the text has been revised.
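A conversion of this kind can be sketched in Python with the standard
library. The sketch is an illustration, not the converter used for the
corpora: it relies on the HTML entity names in html.entities, which
coincide with the Latin-1 names for the Spanish characters shown in this
paper (aacute, ntilde, iquest and so on), and it falls back to a numeric
character reference for any character without a name. The function name
to_entities is hypothetical.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace every non-ASCII character with an SGML entity reference.

    Characters without a known entity name become numeric references.
    Ampersands already present in the text are left untouched.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(f"&#{code};")
    return "".join(out)


if __name__ == "__main__":
    print(to_entities("¿Habrá en este comentario una crítica velada?"))
```

Running the conversion only after revision, as described above, keeps
the text readable for the people correcting it.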
8. Technical issues: mechanism to introduce markup in the texts:
people, hardware, software.
The CREA and CORDE research group has 14 people working in the text
input department. This section is in charge of the input and encoding
of the texts of the corpus.
The compiling process has several stages. First of all, there is a
conversion of medium, from paper to electronic format, by means of an
OCR. The second stage is an automatic introduction of s, p, pb and emph
tags. The third step requires more human than machine work, since it is
the revision and correction of the texts after the first two processes.
After the correction, all the non-ASCII characters are automatically
converted into ISO 8859-1, so that the text can already be exported to
an SGML editor, and interpreted according to the DTD of the corpus. In
the fifth stage, new bibliographic, structural and non structural markup
is introduced within the SGML editor. Once this is finished, the SGML
text is validated and stored.
After this process, there is an automatic tagging of the texts, which is
followed by a new validation. Syntactical analysis will be added with a
parser, just before the last validation of the text.
9. Conclusion
This paper has tried to show that the TEI scheme is very suitable for
encoding large amounts of electronic text, as in the case of the
Spanish corpora. The TEI provides encoding solutions for many different
types of application, but it is almost impossible to use the whole tag
set in a particular text or collection of texts. The use of the TEI
requires a thorough analysis of its principles and a selection of a
reduced tag set, according to the purposes of the text that is going to
be encoded.
Some aspects of the encoding principles of the large Spanish corpora
have been described, such as the structure of the textual documents, the
classification and referential systems, the bibliographical information
stored and processed, the internal tags, the movement of texts from one
corpus to the other, and the conversion of texts to 7-bit ASCII format.
All the solutions to encoding problems have followed TEI principles to
some extent, although some of them are not treated in the TEI
Guidelines.
It has been necessary to develop some tools to make the markup process
easier. All the tags that can be reduced to formal rules are being
introduced automatically. The selection of tags should always take into
account the cost in time and effort of introducing the tags in the
texts. A balance between information retrieval possibilities and markup
effort should be found.
Similarly, a good SGML parser and data base should be used to edit,
store and retrieve encoded information. If all these conditions are
fulfilled, the TEI scheme is a good standard basis for the edition and
interchange of electronic texts.
10. References
Bryan, M. (1988). SGML: An author's guide. New York: Addison-Wesley.
Burnage, G., Dunlop, D. (1993). "Encoding the British National
Corpus", in J. Aarts, P. de Haan and N. Oostdijk (eds.), English Language
Corpora: Design, Analysis and Exploitation, Amsterdam: Rodopi.
Burnard, L., Sperberg-McQueen, C. M. (1995). TEI Lite: An Introduction to
Text Encoding for Interchange. Document No: TEI U 5, Groningen:
Groningen University.
Burnard, L. (1992). "The Text Encoding Initiative: a progress report", in
G. Leitner (ed.), New Directions in English Language Corpora.
Methodology, Results, Software Developments, Berlin: Mouton de Gruyter.
Burnard, L. (1995). "The Text Encoding Initiative: an overview", in
Leech, G., Thomas, J. (eds.), Spoken English on Computer: Transcription,
Markup and Applications, Harlow: Longman.
Burnard, L. (1987). "CAFS: a new solution to an old problem", in W.
Meijs (ed.), Corpus Linguistics and Beyond. Proceedings on the Seventh
International Conference on English Language Research on Computerized
Corpora, Amsterdam: Rodopi.
Cover, R. (1991). "The progress of SGML (Standard Generalized Markup
Language): extracts from a comprehensive bibliography", Literary &
Linguistic Computing, 6/3, 197-209.
Goldfarb, C.F. (1990). The SGML Handbook. Oxford: Clarendon Press.
Ide, N., Véronis, J. (1994). "Corpus Encoding", EAGLES Document
EAG-CSG/IR-T2.1 in EAGLES Interim Report.
Ide, N., Véronis, J. (1995). The Text Encoding Initiative:
Background and contexts. Computers and the Humanities, 29, 1-3.
Johansson, S. (1994). "Continuity and change in the encoding of computer
corpora", in N. Oostdijk, P. De Haan (eds.), Corpus-based Research into
Language, Amsterdam/Atlanta: Rodopi.
Johansson, S. (1993). "Some aspects of the recommendations of the Text
Encoding Initiative, with special reference to the encoding of language
corpora", in M. Kytö, M. Rissanen, S. Wright (eds.), Corpora Across the
Centuries. Proceedings of the First International Colloquium on English
Diachronic Corpora, 25-27 March 1993.
Pino, M. (1996). Manual de codificación textual para los corpus
CREA y CORDE. Normas de marcación en SGML según las
recomendaciones de la TEI. Versión 1.0. Internal document of the
Lexicographic Institute of the Spanish Royal Academy.
Pino, M. (1996). 'Document Type Definition' para los corpus CREA y CORDE.
Versión 1.0. Internal document of the Lexicographic Institute of
the Spanish Royal Academy.
Sperberg-McQueen, C.M., Burnard, L. (eds.) (1994). Guidelines for
Electronic Text Encoding and Interchange. TEI-P3. Chicago / Oxford: Text
Encoding Initiative.
Van Herwijnen, E. (1994). Practical SGML. Boston: Kluwer Academic
Publishers.