Living with the Guidelines: An Introduction to TEI Tagging
[An Excerpt]
Lou Burnard & C. M. Sperberg-McQueen
Note: This document is an excerpt prepared for the 1993 University
of Virginia Rare Books School from a longer document written in
1990. It summarizes some basic issues and problems in the encoding
of electronic texts for interchange, and provides a brief introduction
to the recommendations of the Text Encoding Initiative. [David
Seaman, University of Virginia, July 18, 1993]
What is the TEI?
The Text Encoding Initiative is a cooperative effort in the textual
research community to develop and disseminate guidelines for the
encoding and interchange of machine-readable texts. It is sponsored
by the Association for Computers and the Humanities, the
Association for Computational Linguistics, and the Association for
Literary and Linguistic Computing. Funding comes in part from the
U.S. National Endowment for the Humanities, Directorate XIII of the
Commission of the European Communities, the Andrew W. Mellon
Foundation, and the Social Science and Humanities Research Council
of Canada.
The TEI Guidelines are designed to be used in a broad range of
applications; they are flexible, easy to use, and extensible.
What is Markup?
By markup we mean all the information contained in a computer
file other than the text itself, by means of which computer programs
are able to manipulate texts in useful ways; the term is borrowed from
the history of printing, where markup referred to the notations made
in the margins of a text to guide the compositor in the layout of the
text. Markup intended to specify the proper layout or presentation of
a text is still the most common type of markup in computer files.
Because computers can be used for far more than printing the
text out on paper, however, markup can be used to guide processing
of any type, not just printing. Markup of texts for research purposes
may frequently specify not the proper font and leading for a text but
(for example) its rhetorical or syntactic structure. Indeed, any aspect
of a text of importance to a researcher can be signalled by markup so
that software can treat it in an appropriate way.
In general, markup in the TEI scheme is not intended as a way of
controlling any one piece of software. Although convenient, such
markup gets in the way as soon as one wishes to use some other
program to work on the text. It also makes it difficult to change the
way one treats all the pieces of text of a certain type. It is easier to
work flexibly with text, and easier to use many different kinds of
software with the same machine-readable text, if (a) the markup in a
text is clearly distinguishable from the text itself, and (b) the markup
specifies not how to process the bit of text being marked (procedural
markup) but what it is (descriptive markup). Given markup which
describes the text itself, rather than what a particular program is to do
with it, any piece of software can decide for itself how to process the
text. A common method is to use a lookup table which associates the
generic markup tags of the text with specific processing instructions;
by analogy with similar shorthands used in publishing, such tables are
often called style sheets.
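The lookup-table idea can be sketched in a few lines of Python. This is only an illustration, not part of the Guidelines: the table names QUOTE_STYLE and BRACKET_STYLE, the render function, and the strings assigned to each tag are all assumptions made for the example. The point is that the same descriptively tagged text can be processed in two different ways simply by swapping the table.

```python
import re

# Two hypothetical "style sheets": each maps a generic tag name to the
# (prefix, suffix) strings that one particular program should emit.
QUOTE_STYLE = {"speech": ('"', '"'), "q": ("'", "'")}
BRACKET_STYLE = {"speech": ("[", "]"), "q": ("[", "]")}

# Matches a start-tag or end-tag and captures the slash (if any) and the name.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^>]*>")

def render(text, style):
    """Replace every tag with the string the style sheet assigns to it."""
    def substitute(match):
        is_end_tag = match.group(1) == "/"
        name = match.group(2)
        prefix, suffix = style.get(name, ("", ""))  # unknown tags are dropped
        return suffix if is_end_tag else prefix
    return TAG.sub(substitute, text)

sample = ("Rosalind's remarks <speech>This is the silliest stuff"
          "</speech> clearly indicate")
print(render(sample, QUOTE_STYLE))
print(render(sample, BRACKET_STYLE))
```

Because the markup says only what the element is, neither table needs to know anything about the other, and the text itself is never altered.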
What is SGML?
The Standard Generalized Markup Language (SGML) is a
language for defining markup languages, i.e. sets of markup tags with
rules defining when they are applicable and how they can interrelate.
SGML does not itself define a markup language. It merely allows its
users to define one. Using SGML, for example, one may specify that
a novel must begin with front matter, followed by a body which
consists of a series of chapters, and so on.
The TEI encoding scheme uses SGML to define a set of markup
tags, and to define how they can be used.
There are three characteristics of SGML which distinguish it
from other markup languages: it is designed for descriptive rather
than procedural markup; it allows one to define distinct document
types with distinct rules for their structures and the markup they can
contain; and it is independent of any one system for representing
characters. The terms procedural and descriptive markup have
already been encountered. The notion of document types allows
SGML to verify that the markup in a text actually follows the rules
laid down (by the user) for that type; equally important, it allows
software developers to exploit the knowledge about text structures
which is embodied in the rules for different document types, and to
create more intelligent software as a result. SGML's independence of
specific character sets is important for its role in the interchange of
documents among scholars using different types of machines.
SGML-based markup languages, including that of the TEI,
regard text not as an undifferentiated sequence of words, much less of
bytes, but as a consistently arranged hierarchy of many different
units, of different types or sizes. A prose text such as this one might
be divided into sections, chapters, paragraphs, and sentences. A verse
text might be divided into cantos, stanzas, and lines. Once printed,
sequences of prose and verse might be divided into volumes,
gatherings, and pages. Unlike other markup languages which share
this view of text as a complex hierarchical structure, SGML and the
TEI allow more than one single hierarchical structure to be discerned
and marked up in a single text.
The technical term used in the SGML standard for a textual unit,
viewed as a structural component, is element. Different types of
elements are given different names, but SGML provides no formal
way of expressing the meaning of a particular type of element, other
than its relationship to other element types. That is, all one can say
about an element called (for instance) "blort" is that instances of it
may (or may not) occur within elements of type "farble," and that it
may (or may not) be decomposed into elements of type "blortette." It
should be stressed that the SGML standard is entirely unconcerned
with the semantics of textual elements: these are application
dependent. It is up to the creators of SGML conformant tag sets (such
as the TEI Guidelines) to choose intelligible names for the elements
they identify and to document their proper use in text markup.
From the need to choose element names indicative of function
comes the technical term for the name of an element type, which is
generic identifier, or GI.
Within a marked up text (a document instance), each element
must be explicitly marked or tagged in some way. The standard
provides for a variety of different ways of doing this, the most
commonly used being to insert a tag at the beginning of the element
(a start-tag) and another at its end (an end-tag). The start- and end-tag
pair are used to bracket off the element occurrences within the
running text, in rather the same way as different types of parentheses
or quotation marks are used in conventional punctuation. For
example, an embedded speech element in a text might be tagged as
follows:
... Rosalind's remarks <speech>This is the silliest stuff
that ere I heard of!</speech> clearly indicate ...
As this example shows, a start-tag takes the form <name>, where < is
a string indicating the start of the start-tag, "name" is the generic
identifier of the element which is being delimited, and > is the string
indicating the end of a tag. An end-tag takes the form </name>,
where </ is a string marking the start of an end-tag, "name" is the
generic identifier of the element being closed and, as before, > is the
string indicating the end of a tag.
Other than start-tags and end-tags, only one type of SGML
markup need concern us here: SGML entity references. SGML
entities are a simple and flexible method of encoding and naming
arbitrary strings of characters. An SGML entity has a name and a
definition. When an entity is referred to in an SGML document, its
name appears in the document; in the output, the SGML processor
replaces the name of the entity with its definition.
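As a rough sketch of what that replacement step involves, the following Python fragment expands entity references of the form &name; (the usual SGML notation). It is not a real SGML processor: the ENTITIES table, the expand_entities function, and the definitions chosen for agrave and mdash are assumptions made for illustration.

```python
import re

# Hypothetical entity table: each entity has a name and a definition.
ENTITIES = {
    "agrave": "\u00e0",  # a with grave accent
    "mdash": "\u2014",   # long dash
}

# An entity reference: ampersand, entity name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, definitions):
    """Replace each entity reference with the entity's definition."""
    def substitute(match):
        name = match.group(1)
        # References to undefined entities are left untouched in this sketch.
        return definitions.get(name, match.group(0))
    return ENTITY_REF.sub(substitute, text)

print(expand_entities("she's noan fa&agrave;l, and varry good-natured", ENTITIES))
```

Changing a definition in the table changes every occurrence in the output, which is what makes entities convenient for characters that a given system cannot represent directly.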
A Short Example
In this example, we first demonstrate how a passage of prose
might be entered by someone aware of the need to be faithful to
typographic appearances, but with little sense of the purpose of
mark-up. In an ideal world, such output might be generated by a very
accurate optical scanner. It attempts to be faithful to the appearance
of the printed text, by retaining the original line breaks, by
introducing blanks to represent the layout of the original headings and
page breaks, and so forth. Where characters not available on the
keyboard are needed (such as the accented letter `a' in `faàl' or the
long dash), it attempts to mimic their appearance. Such tricks are
rarely portable, and for analytic purposes their use introduces
needless complications.
CHAPTER 38
READER, I married him. A quiet wedding we had: he and I, the par-
son and clerk, were alone present. When we got back from church, I
went into the kitchen of the manor-house, where Mary was cooking
the dinner, and John cleaning the knives, and I said --
`Mary, I have been married to Mr Rochester this morning.' The
housekeeper and her husband were of that decent, phlegmatic
order of people, to whom one may at any time safely communicate a
remarkable piece of news without incurring the danger of having
one's ears pierced by some shrill ejaculation and subsequently stunned
by a torrent of wordy wonderment. Mary did look up, and she did
stare at me; the ladle with which she was basting a pair of chickens
roasting at the fire, did for some three minutes hang suspended in air,
and for the same space of time John's knives also had rest from the
polishing process; but Mary, bending again over the roast, said only --
`Have you, miss? Well, for sure!'
A short time after she pursued, `I seed you go out with the master,
but I didn't know you were gone to church to be wed'; and she
basted away. John, when I turned to him, was grinning from ear to ear.
`I telled Mary how it would be,' he said: `I knew what Mr Ed-
ward' (John was an old servant, and had known his master when he
was the cadet of the house, therefore he often gave him his Christian
name) -- `I knew what Mr Edward would do; and I was certain he
would not wait long either: and he's done right, for aught I know. I
wish you joy, miss!' and he politely pulled his forelock.
`Thank you, John. Mr Rochester told me to give you and Mary
this.'
I put into his hand a five-pound note. Without waiting to hear
more, I left the kitchen. In passing the door of that sanctum some time
after, I caught the words --
`She'll happen do better for him nor ony o' t' grand ladies.' And
again, `If she ben't one o' th' handsomest, she's noan faa\l, and varry
good-natured; and i' his een she's fair beautiful, onybody may see
that.'
I wrote to Moor House and to Cambridge immediately, to say what
I had done: fully explaining also why I had thus acted. Diana and
474
JANE EYRE 475
Mary approved the step unreservedly. Diana announced that she
would just give me time to get over the honeymoon, and then she
would come and see me.
`She had better not wait till then, Jane,' said Mr Rochester, when I
read her letter to him; `if she does, she will be too late, for our honey-
moon will shine our life long: its beams will only fade over your
grave or mine.'
How St John received the news I don't know: he never answered
the letter in which I communicated it: yet six months after he wrote
to me, without, however, mentioning Mr Rochester's name or allud-
ing to my marriage.
This transcription suffers from a number of shortcomings:
* it was taken, without much thought, from an inexpensive, readily
available paperback edition, which means its text is reasonable
but not authoritative; for the same amount of effort in
transcription, a critical text could have been used which would be
more useful to others (we assume that the purpose of the
transcription is to study Brontë's text -- if the point is to study the
particularities of the paperback edition, of course, then the
paperback should be transcribed)
* the page numbers and running titles are intermingled with the
text in a way which makes it difficult for software to disentangle
them
* no distinction is made between single quotation marks and
apostrophes, so it is difficult to know exactly which passages are in
direct speech
* the preservation of the copy text's hyphenation means that
simple-minded search programs will not find the broken words
* the accented letter in "faàl" has been rendered by an improvised
key sequence which follows no standard pattern and will be
processed correctly only if the transcriber remembers to mention
it in the documentation (sad experience suggests that he or she
will quite likely forget)
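The hyphenation problem in the list above is easy to demonstrate. The Python sketch below searches a fragment of the transcription for the word "parson" before and after rejoining words broken at line ends. The function name rejoin_broken_words is hypothetical, and the rule it applies is deliberately naive: it would also close up a genuinely hyphenated word (such as "manor-house") that happened to break at its hyphen, so a real project would need to check each case against the copy text.

```python
import re

raw = ("READER, I married him. A quiet wedding we had: he and I, the par-\n"
       "son and clerk, were alone present.")

def rejoin_broken_words(text):
    """Naively rejoin words split by a hyphen at the end of a line."""
    # A word character, a hyphen, a line break, then another word character.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print("parson" in raw)                       # simple search on the raw text
print("parson" in rejoin_broken_words(raw))  # same search after rejoining
```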
We now present the same passage, tagged at a minimal level of
detail using the tag set recommended by the Guidelines. Paragraph
divisions (implied by indented lines in the first example) have been
marked explicitly; apostrophes are distinguished from closing
quotation marks; and the accented letter has been represented by an
entity reference. The long dash, represented above by two
consecutive hyphens, has also been rendered by an entity reference.
Because we are interested in Brontë's text, not in the printing of one
particular edition, the appearance and form of the chapter heading,
running titles, etc. have not been transcribed. To make it easier to
proofread and to refer to the copy text, its page divisions have been
marked with an empty <pb> tag. To simplify searching and
processing, the lineation of the original has not been retained and words
broken by typographic accident at the end of a line have been
re-assembled without comment. For convenience of proofreading, a
new line has been introduced at the start of each paragraph, but the
indentation is removed.
<![ CDATA [
<pb n='474'>
<div1 name=chapter n='38'>
Reader, I married him. A quiet wedding we had: he and I, the parson
and clerk, were alone present. When we got back from church, I went
into the kitchen of the manor-house, where Mary was cooking the dinner,
and John cleaning the knives, and I said &mdash;
<q>Mary, I have been married to Mr Rochester this morning.</q>
The housekeeper and her husband were of that decent, phlegmatic
order of people, to whom one may at any time safely communicate
a remarkable piece of news without incurring the danger of
having one's ears pierced by some shrill ejaculation and
subsequently stunned by a torrent of wordy wonderment. Mary did
look up, and she did stare at me; the ladle with which she was
basting a pair of chickens roasting at the fire, did for some
three minutes hang suspended in air, and for the same space of
time John's knives also had rest from the polishing process; but
Mary, bending again over the roast, said only &mdash;
<q>Have you, miss? Well, for sure!</q>
A short time after she pursued, <q>I seed you go out with the
master, but I didn't know you were gone to church to be wed</q>;
and she basted away. John, when I turned to him, was grinning
from ear to ear. <q>I telled Mary how it would be,</q> he said:
<q>I knew what Mr Edward</q> (John was an old servant, and had
known his master when he was the cadet of the house, therefore
he often gave him his Christian name) &mdash; <q>I knew what Mr
Edward would do; and I was certain he would not wait long
either: and he's done right, for aught I know. I wish you joy,
miss!</q> and he politely pulled his forelock.
<q>Thank you, John. Mr Rochester told me to give you and Mary this.</q>
I put into his hand a five-pound note. Without waiting to hear
more, I left the kitchen. In passing the door of that sanctum some
time after, I caught the words &mdash;
<q>She'll happen do better for him nor ony o' t' grand ladies.</q> And
again, <q>If she ben't one o' th' handsomest,
she's noan fa&agrave;l, and varry good-natured; and i' his een
she's fair beautiful, onybody may see that.</q>
I wrote to Moor House and to Cambridge immediately, to say what I
had done: fully explaining also why I had thus acted. Diana and
<pb n='475'> Mary approved the step unreservedly.
Diana announced that she would just give me time to get over the
honeymoon, and then she would come and see me.
<q>She had better not wait till then, Jane,</q> said Mr
Rochester, when I read her letter to him; <q>if she does, she
will be too late, for our honeymoon will shine our life long:
its beams will only fade over your grave or mine.</q>
How St John received the news I don't know: he never answered the
letter in which I communicated it: yet six months after he wrote to me,
without, however, mentioning Mr Rochester's name or alluding to my
marriage.
]]>
What is a TEI Text?
What does it mean to say that a text is "TEI conformant?" A full
answer to this question involves an understanding of the various
contexts or environments in which electronic texts may be used. At
one extreme, a text may be prepared using a particular version of a
particular software package on a particular machine, for use with that
software package only. Its users and preparers may never have any
intention of sharing the text with others, nor of using any texts
prepared elsewhere. At the other extreme, a text may be prepared on
many different systems as part of a co-operative data capture exercise,
for use by several different people, all with differing objectives and
different software systems. Most projects fall between these two
extremes, often with different priorities at different times. How does
the TEI project help either of them?
As we suggested above, encoding a text is fundamentally a
process of deciding which textual features should be distinguished by
markup of some kind, and of deciding on a suitable markup for them.
The TEI Guidelines may be thought of as a codification of the
distinctions which have been found helpful by most people most of
the time when faced with this task.
The Notion of Conformance
Returning to the question of conformance: if the Guidelines do
not require that every distinction they specify be made in encoding a
text, what in fact do they require? They say, in effect: if you wish to
distinguish this feature in your text, then this is the tag you should use
to identify it, and (possibly) this is the way that this textual feature
should be related to other textual features in the text. If, for example,
you wish to distinguish proper names that are embedded in your text,
the Guidelines advise you to use the tag they provide for the purpose; they
do not propose, however, that all proper names in a text should be marked. A
TEI-conformant text must, as a minimum, be parseable by an SGML
processor using one or other of the published TEI document type
definitions (DTDs).
Conformance in Different Environments
Strict conformance to the TEI interchange format may be desired
or required when you are sending files to someone about whose
system you know no details, when you are depositing a text in a text
archive, or when you are working with software which accepts only
TEI interchange-format texts.
In many cases, a less strict adherence to the rules of the TEI
interchange format may be appropriate. If you have SGML software,
for example, then it is unnecessary to limit yourself, in the work you
do on your own machine, to the subset of SGML features allowed in
TEI interchange-format documents, since it is easy to use SGML
software to produce an interchange-format version of any SGML
document which uses the TEI document type declarations. You may
use some other software, on the other hand, which accepts most
TEI-conforming documents, but places some further restriction on the
SGML features which can be accepted. In this case prudence will
dictate that you restrict yourself to the SGML features your software
can handle.
If you do not have SGML software, you may wish to use some
markup scheme designed around the software you use most: Word
Perfect or Nota Bene users might develop a set of Word Perfect styles
or Nota Bene styles corresponding to the TEI tags they use most
often. As long as the mark-up scheme you use makes at least the
same set of distinctions as those recommended by the Guidelines,
then it will be simple to translate from your local scheme to the TEI
scheme, and back.
The construction of a sensible local scheme depends entirely on
the hardware and software you are using. What makes sense for a
Macintosh user who shuttles constantly between Word and
Hypercard, will not necessarily be the best approach for a PC user
who seldom leaves Nota Bene, and neither will necessarily be apt for
someone using a VAX. We have tried to be non-partisan in our
examples of the many possible shortcuts and keyboarding
conventions available, but you should remember that these are only
examples of techniques we have found useful -- your environment
will be different, and you will probably find better horses for your
courses.
Character Sets and Conformance
Character set incompatibilities pose serious problems for the
exchange of machine-readable texts among scholars; many common
methods of exchanging texts fail for texts which contain characters
other than the twenty-six basic letters of the Latin alphabet, the ten
Arabic numerals, and some common punctuation marks. Accented
characters, braces and brackets, and many other characters may not
arrive at all, or may arrive as undecipherable nonsense. The TEI
Guidelines define a "safe" set of characters for interchange using
today's systems, and recommend the use of entity references for all
other characters. Because the shortcomings of current systems will
not (we hope!) be with us forever, however, adherence to these
restrictions is not a necessary part of TEI-conformance, though it may
be highly desirable in certain situations.
In your own work on your own machine, however, there is no
reason not to use all the characters available in your machine's
character set. When you wish to exchange texts with users of other
systems, you can transform any such characters into SGML entity
references, by using a simple global search and replace function for
example.
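A minimal Python sketch of such a search-and-replace step follows. The CHAR_TO_ENTITY table and the to_entity_references function are assumptions made for this example; the entity names agrave, euml, and mdash follow the widely used ISO entity sets, but which characters actually need converting depends on the systems involved.

```python
# Hypothetical table mapping characters outside the "safe" interchange set
# to the names of the entities that should stand in for them.
CHAR_TO_ENTITY = {
    "\u00e0": "agrave",  # a with grave accent
    "\u00eb": "euml",    # e with diaeresis
    "\u2014": "mdash",   # long dash
}

def to_entity_references(text, mapping):
    """Replace each mapped character with an entity reference (&name;)."""
    return "".join(
        "&" + mapping[ch] + ";" if ch in mapping else ch
        for ch in text
    )

print(to_entity_references("she's noan fa\u00e0l", CHAR_TO_ENTITY))
```

Characters not listed in the table pass through unchanged, so the same function can be run safely over text that is already restricted to the safe set.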
Just as special purpose programs may be needed to convert from
the form in which it is convenient to enter text into a TEI-conformant
one, so it is likely that special-purpose programs will be developed to
convert a TEI-conformant text into one that can be reliably
transported across networks, possibly involving some data
compression as well as translation of "awkward" characters, together
with similar programs to do the opposite. Such programs have yet to
be written, however.
The Structure of a TEI Text
All TEI-conformant texts contain (a) a TEI header and (b) the
transcription of the text proper. The TEI header provides information
analogous to that provided by the title page of a printed text. It
contains a description of the machine-readable text <fileDesc>, a
description of the methods and editorial principles which governed
the transcription or encoding of the text <encDesc>, a description of
non-bibliographic elements (optional) <profDesc>, and a revision
history <revDesc>.
In the TEI document type declarations, the text transcription is always
divided into <front>, <body>, and <back> sections, of which the first
and the last are optional. The overall structure of a typical TEI text,
therefore, is this:
<![ CDATA [
<TEI.1>
<TEI.Header>
<fileDesc>
<!-- ... bibliographic description of an electronic file -->
</fileDesc>
<encDesc>
<!-- ... Description of the encoding conventions -->
</encDesc>
<profDesc>
<!-- ... Description of the non-bibliographic aspects of a text
(language, purpose, genre, etc) -->
</profDesc>
<revDesc>
<!-- notes on revisions to the electronic text -->
</revDesc>
</TEI.Header>
<text>
<front>
<!-- front matter ... -->
</front>
<body>
<!-- body of text ... -->
</body>
<back>
<!-- back matter ... -->
</back>
</text>
</TEI.1>
]]>