SGML: Giordano on TEI Headers


The Documentation of Electronic Texts Using Text Encoding Initiative
Headers: An Introduction.

Richard Giordano 
Lecturer, Information Systems
Department of Computer Science
University of Manchester 
Oxford Road
Manchester M13 9PL 
United Kingdom
email: rich@cs.man.ac.uk
                          
Biographical statement:  

The author has been a member of the Text Documentation Committee since it was
established in 1988.  He has held positions at Butler Library of Columbia
University, Computing and Information Technology at Princeton University,
and the Laboratory for Research in Academic Information at the Johns
Hopkins University Medical School.  His teaching positions have included
Barnard College, Princeton and, since 1990, the Computer Science Department
at the University of Manchester.  His most recent publication is Martin
Lea and Richard Giordano, "Frameworks for modelling the real world: object
orientation, bandwidth and representations of the group," in Social science
research, technical systems and cooperative work (Paris: CNRS, 1993).

KEYWORDS: Text Encoding Initiative; TEI header; Header; Electronic
Documentation; Cataloging; SGML.

Abstract:

 The article gives a general introduction to the form and function of the
TEI header and its relationship to the MARC record.  The article points out
some of the reasoning of the Text Documentation Committee that went into
the Header's design, and discusses some of its limitations.  The TEI
header's major strength is that it documents electronic text in a standard
interchange format that should be understandable to both librarian
catalogers and text encoders outside of librarianship.  It gives encoders
the ability to document the electronic text itself, its source, its
encoding principles, revisions, as well as non-bibliographic
characteristics of the text that can support both scholarly analysis and
retrieval.  Its bibliographical descriptions can be loaded into standard
remote bibliographic databases, which should make electronic texts as easy
to find for researchers as texts in other media, including print.  Finally,
it represents a long collaboration between librarians and members from a
range of academic disciplines outside of librarianship, and may thus
represent a model of such collaboration.  The Header's major weakness is
that the default header does not provide the fine-grained retrieval within
or across texts that users may want now or in the future as networked
research environments improve.


Introduction.

The Text Encoding Initiative (TEI) is a multilingual, international project
that developed guidelines for the preparation and interchange of electronic
texts for scholarly research.  Since its work began in 1987, its goals have
served humanities scholarship generally, and have become important to a
range of applications in language industries (including publishing) as well
as across academic disciplines and applications.

In April 1994, the TEI issued the first full version of its Guidelines for
the Encoding and Interchange of Machine-Readable Texts. [Sperberg-McQueen &
Burnard] This report provides encoding conventions for a range of text
types and features relevant for research in language technology, the
humanities, computational linguistics, and the social and behavioral
sciences.  It represents a major milestone, for never before the TEI had it
been possible to achieve consensus among research communities about encoding
conventions to support the interchange of electronic texts.  It is
reasonably safe to assume that many electronic texts will be encoded
according to the recommendations of the Guidelines (some major text-based
projects are already following the Guidelines), and that TEI-encoded texts
will form an increasing share of collections in libraries and other
repositories.

Every TEI-conformant text is prefixed by a 'prolog' that documents the
encoded text itself, its source(s), its encoding practices, contextual
non-bibliographic information and its revision history.  This 'prolog' is
called the TEI Header.  The TEI Header provides information for people
using texts, for software processing them, and for catalogers and
archivists collecting them.[1] The TEI Header, thus, can be viewed as a set
of descriptions and declarations that provide the electronic equivalent to
the title page attached to a printed book.  Additionally, the TEI header
constitutes the equivalent to codebooks or introductory manuals customarily
accompanying electronic datasets.
 
The TEI Header was designed by a committee, the Text Documentation
Committee, initially composed of archivists and librarians from both Europe
and North America who had experience in cataloging, collecting and
describing electronic sources.  The committee's work was guided by
longstanding documentation principles for describing both texts on paper
and machine-readable data files.  Their work was also influenced by other
TEI committees and work groups unrelated to the library and archival
professions, specifically the work group on spoken language which did much
of the work on the profile description part of the TEI header. [Johansson]
As a consequence of input from other members of the Text Encoding
Initiative, the TEI header describes both bibliographic and
non-bibliographic information, and supports, in addition to the
identification and retrieval of an encoded text, the machine analysis of
encoded text.


Brief overview of the TEI header.

This overview is meant to explain the main features of the TEI header, what
it is and what it does.  It will not attempt to document it fully.  Because
this article will be most readers' first point of contact with the TEI
header, it will of necessity be a minimal description rather than a full
discussion or formal analysis.  The elements of the TEI header are
described here only to give an introduction to what they are and how they
could be used, but they are not a guide to constructing headers.  The
reader should consult the Guidelines for detailed and full information and
recommended practices.

The TEI header, or <teiHeader>, is composed of four major functional parts
that document the bibliographic description of the electronic text and its
source, the encoding of the text, non-bibliographical information that
characterizes the text, and a history of updates and changes made to the
electronic text.  These major elements are referred to as descriptions.

The bibliographic description of the encoded text and its source, which is
essential for the retrieval of an item, forms the File Description, or
<fileDesc>; documentation on the relationship of the encoded text to its
source, for instance a documentation of editorial decisions or procedures,
is provided in the Encoding Description, or <encodingDesc>;
non-bibliographic information characterizing various descriptive aspects of
a text, and which are useful for the human or machine-assisted analysis of
text, form the Profile Description, or <profileDesc>; and, the history of
updates and changes to the machine readable text comprise the Revision
Description, or <revisionDesc>.  Together, these elements allow people to
identify an encoded text, understand the editorial decisions when the text
was encoded, have documentation of the characteristics of the text, and
maintain a history of any revisions to the transcription of the text.  A
skeletal representation of the Header looks like this:

<teiHeader>
	<fileDesc> ... </fileDesc>
	<encodingDesc> ... </encodingDesc>
	<profileDesc> ... </profileDesc>
	<revisionDesc> ... </revisionDesc>
</teiHeader>

Of these major elements, only the file description is mandatory.
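
The rule that only the file description is mandatory lends itself to a mechanical check.  The following is a minimal sketch in Python, not something the Guidelines prescribe.  It assumes the header has been rendered as well-formed XML (the SGML examples in this article leave some attribute values unquoted and some end tags implied), and the function name check_header is hypothetical.

```python
import xml.etree.ElementTree as ET

# The four major parts of the header, in the order given above.
PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]
# Only the file description is mandatory.
MANDATORY = {"fileDesc"}


def check_header(xml_text):
    """Return (parts present, mandatory parts missing) for a <teiHeader>.

    Hypothetical helper: assumes a well-formed XML rendering of the header.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "teiHeader":
        raise ValueError("expected a <teiHeader> element")
    present = [part for part in PARTS if root.find(part) is not None]
    missing = sorted(MANDATORY - set(present))
    return present, missing


if __name__ == "__main__":
    skeleton = "<teiHeader><fileDesc/><revisionDesc/></teiHeader>"
    print(check_header(skeleton))
```

A header that reports nothing missing is not thereby a good header; the check only confirms that the one required description is present.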


The File Description.

The <fileDesc> is best thought of as an electronic analogy to the title
page of a book because it provides users of an encoded text with at least
the information normally found on a title page of a printed work.

Because any work without a title page is difficult or impossible to
identify accurately, the <fileDesc> is the sole required element of the TEI
header.  The <fileDesc> should not be confused with a finished
cataloging record, but can be used by a librarian cataloger for the
creation of a cataloging record, or by anyone to derive the correct
bibliographic citation for the encoded text.  In addition to providing
bibliographic information about the encoded text itself, it provides
citation information, as prescribed in the Anglo-American Cataloging Rules
[AACR, 1988], of the source (or sources) from which the encoded text was
derived.

Because the bibliographic description of a machine-readable text resembles
the bibliographic description of a book or manuscript, the <fileDesc> has
been closely modelled on existing descriptive standards in library
cataloging, specifically the Anglo-American Cataloging Rules, the
International Standard Bibliographic Description (ISBD) [ISBD, 1977] and,
to the extent that it itself is modelled on ISBD, the USMARC record
format.[2] [Crawford, 1989; USMARC, 1987] Elements of the <fileDesc> have
been given explicit names that, where possible, parallel the names of areas
in ISBD and AACR, and fields in MARC.  Anyone familiar with a MARC record
and ISBD should immediately recognize the elements of a TEI Header and
should, it is hoped, be able to derive accurate local cataloging copy from
it.

The file description is composed of three mandatory and four optional
elements.  (Equivalent MARC fields are provided in parentheses.[3]) These
are,

<titleStmt> (mandatory, equivalent to the 240 or 245 MARC fields)
<editionStmt> (optional, equivalent to the 250 MARC field)
<extent> (optional, equivalent to the 'physical description' MARC field,
	256 or 3XX depending on local practice)
<publicationStmt> (mandatory, equivalent to the 260 MARC field)
<seriesStmt> (optional, equivalent to 4XX MARC fields)
<notesStmt> (optional, equivalent to 5XX MARC fields)
<sourceDesc> (mandatory, can be mapped to the 'source of data' note (537 in
	RLIN MDF format))

Because these elements are familiar to catalogers, it is superfluous to
discuss all of them in detail here.  We will discuss instead the three
mandatory elements.
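
The correspondences in the list above can be held in a small lookup table.  The sketch below is illustrative only: the MARC equivalents are copied from the list, while the table name FILEDESC_TO_MARC and the helper missing_mandatory are hypothetical names introduced here.

```python
# Sub-elements of the file description: (mandatory?, MARC equivalent as
# given in the list above).  The table and helper names are hypothetical.
FILEDESC_TO_MARC = {
    "titleStmt": (True, "240 or 245"),
    "editionStmt": (False, "250"),
    "extent": (False, "256 or 3XX"),
    "publicationStmt": (True, "260"),
    "seriesStmt": (False, "4XX"),
    "notesStmt": (False, "5XX"),
    "sourceDesc": (True, "537 (RLIN MDF source of data note)"),
}


def missing_mandatory(element_names):
    """List the mandatory sub-elements absent from element_names."""
    present = set(element_names)
    return [
        name
        for name, (mandatory, _marc) in FILEDESC_TO_MARC.items()
        if mandatory and name not in present
    ]


if __name__ == "__main__":
    print(missing_mandatory(["titleStmt", "extent"]))
```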

The three required elements of the <fileDesc> will allow a user to
identify a unique electronic text, access it from a publisher or
distributor, and provide a reference to the source from which the
electronic text was derived.  The <titleStmt> element, like the title
statement in a bibliographic record, contains information about the title
of a work and those responsible for its intellectual content--that is, a
title and one or more 'statements of responsibility.' Formally, the
elements of the <titleStmt> are
 
<title> (equivalent to 24X$a)
<author> (equivalent to a 245 $c in MARC, not the 1xx) 
<sponsor> (equivalent to 24X $c)
<funder> (equivalent to 24X $c)
<principal> (as in the principal investigator, 24X $c)

Alternatively, encoders can use a general <respStmt> to identify those
responsible for the intellectual content of a work where specialized
elements do not suffice or do not apply.  For instance, a <titleStmt> in a
TEI header attached to an encoded version of Thomas Paine's Common Sense
would look something like this:

<titleStmt>
	<title>Common sense, a machine-readable
		transcription</title>
	<respStmt><resp>compiled by</resp><name>Jon K Adams</name>
	</respStmt>
</titleStmt>

A more formal title statement may look like this:

<titleStmt>
	<title>Notebooks of a computer pioneer, Tom Kilburn; a machine-readable
		transcription</title>
	<author>Tom Kilburn</author>
	<sponsor>National Archive for the History of 
		Computing</sponsor>
	<funder>Simon Engineering Fund</funder>
	<principal>Martin Campbell-Kelly</principal>
	<respStmt><name>Jon Shapiro</name>
		<resp>data entry, scanning and proof
			correction</resp>
	</respStmt>
	<respStmt><name>Carole Goble</name>
		<resp>created and maintained pre-SGML full text
			and image database</resp>
	</respStmt>
	<respStmt><name>Anna Garry</name>
		<resp>converted full text database to TEI 
			markup</resp>
	</respStmt>
</titleStmt>
    
The <publicationStmt> (the equivalent to the 260 MARC field), the second
mandatory element in the file description, groups information concerning
the publication or distribution of an electronic or other text.  Like the
<titleStmt>, it can contain a simple prose description or groups of formal
elements.  At least one of the following three elements (all equivalent to
the 260$b in MARC) must be present unless the <publicationStmt> is given as
prose.  These are,

	<publisher>
	<distributor>
	<authority> (the person or entity responsible for making an
		electronic file available, other than the publisher or distributor)

If one or more of these elements are used, then each may be followed by one
or more of the following elements:

	<pubPlace> (260$a)
	<address> 
	<idno> (such as ISBN or ISSN)
	<availability>
	<date> (other than the creation date)[4].

Thus, a prose <publicationStmt> can look like this,

<publicationStmt>
	<p>Oxford: Oxford University Press, 1989.</p>
</publicationStmt>

A <publicationStmt> using elements to group information would look like this,

<publicationStmt>
	<publisher>Oxford University Press</publisher>
	<pubPlace>Oxford</pubPlace>
	<date>1989</date>
	<idno type=ISBN>0192547054</idno>
	<availability><p>To be distributed for purposes of
		teaching and research only.</p>
	</availability>
</publicationStmt>
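
The content rule for the <publicationStmt> (either prose, or at least one of <publisher>, <distributor> or <authority>) can be expressed as a short test.  This is a sketch under stated assumptions rather than a validator: it expects a well-formed XML rendering of the element, it does not check the order of the following elements, and the function name publication_stmt_ok is hypothetical.

```python
import xml.etree.ElementTree as ET

# At least one of these must be present unless the statement is prose.
AGENTS = {"publisher", "distributor", "authority"}


def publication_stmt_ok(xml_text):
    """True if the statement is prose only, or names at least one agent.

    Hypothetical helper; assumes well-formed XML input.
    """
    stmt = ET.fromstring(xml_text)
    children = [child.tag for child in stmt]
    # Prose form: one or more <p> elements and nothing else.
    if children and all(tag == "p" for tag in children):
        return True
    # Structured form: publisher, distributor or authority is required.
    return any(tag in AGENTS for tag in children)


if __name__ == "__main__":
    prose = ("<publicationStmt><p>Oxford: Oxford University Press, "
             "1989.</p></publicationStmt>")
    print(publication_stmt_ok(prose))
```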

The <sourceDesc>, the last mandatory element in the <fileDesc>, is used to
record details of the source or sources from which an electronic text is
derived.  This might be a manuscript or printed text, another computer
file, an audio or video recording of some kind, or a combination of these.
An electronic file may also have no source, if what is being cataloged is
an original text in electronic form.

The <sourceDesc> may contain a simple prose description or, more usefully,
a structured bibliographic citation specifying the provenance of the
text.[5] A <sourceDesc> can look like the following,

<sourceDesc>
	<p>No source: created in machine-readable form.</p>
</sourceDesc>

<sourceDesc>
	<bibl>The first folio of Shakespeare, prepared by 
		Charlton Hinman (The Norton Facsimile, 	
		1968)</bibl>
</sourceDesc>

<sourceDesc>
	<biblStruct lang=FR>
	<monogr>
	<author>Eug&egrave;ne Sue</author>
	<title>Martin, l'enfant trouv&eacute;</title>
	<title type=sub>M&eacute;moires d'un valet de 
		chambre</title>
	<imprint>
		<pubPlace>Bruxelles et Leipzig</pubPlace>
		<publisher>C. Muquardt</publisher>
		<date>1846</date>
	</imprint>
	</monogr>
	</biblStruct>[6]
</sourceDesc> 
		
If an encoder chooses to create minimal headers, using prose instead of
grouping information in formal elements, very little cataloging expertise
is needed.  Note that there is no equivalent to the MARC 1XX field in the
TEI Header.  We chose this course primarily because the creator of a TEI
header may not be trained in cataloging, but may simply be documenting the
electronic text in an electronic equivalent to the title page. An encoder
should not necessarily be expected to provide full descriptive cataloging
of an encoded text when constructing a header.  For example, an encoder
might be expert in Old Norse poetry, but not know what should go in an
<author> element because, as many librarians know only too well, deciding
on the author is not always a simple matter, especially when there are
cases of multiple authorship, corporate authors, pseudonyms, or where
arcane cataloging rules apply.  And then there is always the vexing
question of the form of the author's name once the author is determined.
For instance, it is unreasonable to think that an encoder without reference
to a Name Authority File would know the correct form for the name,
T.S. Eliot.  If the Text Documentation Committee had recommended
an explicit <author> element when the author was known, some encoders might
construe this to mean that the correct author and the correct form of entry
must be given.  The most likely outcome in this circumstance, of course, is
that encoders would give nothing.

The Text Documentation Committee believed that if it were possible for an
encoder uninitiated or uninterested in the rules of cataloging to include
everyone in the statement of responsibility who was seen to be responsible
for the intellectual content of a work, and the role made clear (such as
compiler, editor or, perhaps, person who sent for pizzas) then a cataloger,
not an encoder, could apply both standard AACR2 rules and local practice to
determine the appropriate author and others responsible for the work, and
the form of their names.
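
The division of labour described above, in which the encoder records everyone responsible and the cataloger decides on the entry, suggests a simple extraction step.  The sketch below gathers every name and role from a <titleStmt> so that a cataloger can review them.  It assumes a well-formed XML rendering of the title statement, and the function name responsibilities is hypothetical; it makes no attempt to choose an author or to normalise the form of a name.

```python
import xml.etree.ElementTree as ET

# Specialized responsibility elements of the <titleStmt> listed earlier.
SPECIALIZED = ("author", "sponsor", "funder", "principal")


def _clean(text):
    """Collapse line breaks and runs of spaces left over from the markup."""
    return " ".join((text or "").split())


def responsibilities(title_stmt_xml):
    """Return (role, name) pairs for everyone named in a <titleStmt>.

    Hypothetical helper; assumes well-formed XML input.
    """
    stmt = ET.fromstring(title_stmt_xml)
    pairs = []
    for tag in SPECIALIZED:
        for element in stmt.findall(tag):
            pairs.append((tag, _clean(element.text)))
    # General statements carry their own description of the role.
    for resp_stmt in stmt.findall("respStmt"):
        role = _clean(resp_stmt.findtext("resp"))
        name = _clean(resp_stmt.findtext("name"))
        pairs.append((role, name))
    return pairs


if __name__ == "__main__":
    sample = """<titleStmt>
      <title>Notebooks of a computer pioneer, Tom Kilburn</title>
      <author>Tom Kilburn</author>
      <respStmt><name>Anna Garry</name>
        <resp>converted full text database to TEI
          markup</resp></respStmt>
    </titleStmt>"""
    for role, name in responsibilities(sample):
        print(f"{name}: {role}")
```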

The file description exemplifies a principle of shared responsibility for
the documentation of scholarly material.  The intention was not to place
the burden of documentation squarely on the shoulders of the encoder, but
to encourage the encoder to provide enough accurate information to
librarians and others in the documentation community so that professional
cataloging could be carried out both effectively and efficiently.


The Encoding Description.

The relationship between the encoded text and its source or sources is
described in the Encoding Description, or <encodingDesc>.  The roots of the
<encodingDesc> can be found in social science data archives that collect
codebooks that document how phenomena are encoded in a particular dataset.
Such information as variable names, value labels, record layout, sampling
procedures, etc. are the sine qua non of every machine readable data
file.[7]

Similarly, the <encodingDesc> documents editorial rationales, decisions and
practices made both before and during the transcription of a source text
into encoded machine readable form.  Such documentation can include a prose
description of the aim or purpose for which the file was encoded, the
method or rationale used in sampling texts in the creation of a corpus or
collection, editorial principles and practices, including whether or how
the text was normalized during transcription, how the encoder resolved
ambiguities in the source, what levels of encoding or analysis were
applied, how canonical references are constructed, and definitions of any
classification codes, if any, introduced in the text by the encoder.

A minimal <encodingDesc> might look something like this:

<encodingDesc><p>Blank lines and multiple blank spaces, 
	including paragraph indents, have not been 
	preserved.</p>
</encodingDesc>

The <encodingDesc>, moreover, can hold structured information in its
sub-elements.  If different editorial practices were applied to
different parts of the text (for instance, sampling procedures may differ
throughout a large corpus [Dunlop, 1994]), one can repeat the
<encodingDesc> element in the TEI header to reflect different editorial
practices, and assign an ID attribute to each one.  The value of the ID
attribute can then be linked to the specific part of the text (which would
have the same ID value) where those editorial principles apply.

A simple, but structured, example of the <encodingDesc> for a typical
project that converts texts from sources of American history for use in a
course on historical methodology might look something like this:

<encodingDesc>
	<projectDesc><p>Transcription of the US Constitution 
		for the teaching of a first-year course in 
		historical methodology at Barnard College, 
		Columbia University.</p>
	</projectDesc>
	<editorialDecl>
		<correction><p>Errors in scanning and 
			transcription controlled by using the
			Microsoft Word, v.5.0, spell checker.</p>
		</correction>
	</editorialDecl>
</encodingDesc>

The 567 (notes on methodology) appears to be the most appropriate MARC
field for this information, although this field is normally intended for
methodologies in the social sciences.  Practically, it would be wise to map
the elements of the <encodingDesc> as separate 567 fields.
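
One way to carry out the suggested mapping is to emit one 567 field for each sub-element of the encoding description.  The following sketch assumes a well-formed XML rendering of the <encodingDesc>.  The function name encoding_desc_to_567 and the choice to prefix each note with the TEI element name are assumptions made for illustration, not practices laid down by the Guidelines or by MARC.

```python
import xml.etree.ElementTree as ET


def encoding_desc_to_567(xml_text):
    """Return one ("567", note) pair per child of an <encodingDesc>.

    Hypothetical helper; assumes well-formed XML input.  The note text is
    the child's prose with markup removed and white space collapsed.
    """
    desc = ET.fromstring(xml_text)
    fields = []
    for child in desc:
        note = " ".join("".join(child.itertext()).split())
        if note:
            fields.append(("567", f"{child.tag}: {note}"))
    return fields


if __name__ == "__main__":
    sample = """<encodingDesc>
      <projectDesc><p>Transcription of the US Constitution for the
        teaching of a first-year course in historical
        methodology.</p></projectDesc>
      <editorialDecl><correction><p>Errors in scanning and transcription
        controlled by using a spell checker.</p></correction></editorialDecl>
    </encodingDesc>"""
    for tag, note in encoding_desc_to_567(sample):
        print(tag, note)
```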


The Profile Description.

The <profileDesc> is an optional element that provides a detailed
description that characterizes various descriptive non-bibliographic
aspects of the text, such as language usage, the situation in which the
text was produced, and the participants and their setting.  The purpose of
the <profileDesc> is to enable descriptive aspects which do not identify
the work, as bibliographic elements would, and which go beyond information
typically found in codebooks, to be recorded within a single unified
framework.  The <profileDesc>, which resulted largely from input from the
Spoken Language workgroup, is of most use to linguistic-based spoken text
projects, although many of its features can be applied to written text,
such as drama, or to those projects that desire to track multiple speakers,
or voices, through prose.  (See [Johansson, 1994] and [Dunlop, 1994] in this
issue for a fuller discussion.)

The core <profileDesc> element has three optional components, represented
by the following sub-elements:

<creation>: contains information about the creation of a text.  This may
	vary from the publication date in the bibliographic description in that it
	gives the date and place of composition, and may be of acute relevance to
	studies of linguistic variation across space and time.
<langUsage>: describes the languages, sublanguages, dialects, etc.
	represented within the text.
<textClass>: groups information which describes the nature of the text in
	terms of a standard classification scheme.

A brief profile description might look like this:

<profileDesc>
	<creation>
		<date value='1989-08'>August 1989</date>
		<place>Brooklyn, New York</place>
	</creation>
	<langUsage>
		<language id=EN wsd=wsd.en>
		<language id=SP wsd=wsd.sp>
		<p>Approximately 95% of the text is in American 
			English with quotations from first and second 
			generation Italian immigrants to Brooklyn; the 
			remainder is in transcribed Spanish spoken 
			by first and second generation Puerto Rican
			immigrants to Brooklyn.</p>
	</langUsage>
	<textClass>
		<keywords scheme=LCSH>
			<list><item>Brooklyn (New York, N.Y.)--
				Biography.</item>
				<item>Brooklyn (New York, N.Y.)--Social life
						and customs.</item>
			</list>
		</keywords>
		<classCode scheme=LC>F129.B7</classCode>
	</textClass>
</profileDesc>

Such a classification system, while useful to most projects, may be too
coarse when applied to the analysis of language in spoken text, or when
applied to some written text (such as drama), collections or corpora.  In
these cases the <profileDesc> allows the encoding of a high degree of
classificatory information about the text itself, the voices of characters
or participants within it, and the setting within which a language
interaction takes place.  This information can be recorded using the
optional extensions to the <profileDesc>: The Text Description or
<textDesc>; the Participants Description or <particDesc>; and the Setting
Description or <settingDesc>.

Formal situational information may be included in the <textDesc> to support
the analysis of speech or written text.  Such information--the medium by
which the text is delivered or experienced, the internal composition of a
text or of a text sample (for instance, documenting a complete text, a
fragment, a composite text), the nature and extent of indebtedness or
derivation of the text to others, the social context for which the text was
realized or intended (for instance, as entertainment, or religious and
ceremonial purposes, etc.), the interaction between those producing and
experiencing the text, whether or not a text was prepared or spontaneous,
and the purpose of the text--can be referred to as situational
parameters.[8]

Situational parameters are a description of the situation within which the
text was produced or experienced, and thus characterize it in a way
relatively independent of any a priori theory of text types.  Rather than
insisting on a system of discrete text types, which in practice would be
impossible to formulate, the Guidelines recommend the use of situational
parameters that can be used in combination to supply distinguishing
descriptive features of individual texts.  When text types are used in
combination with situational parameters, the internal structure of each
text type can be specified in terms of the parameters proposed.  This
allows for the relatively continuous characterization of texts, in contrast
to discrete categories based on type or topic, supports meaningful
comparisons across corpora, allows analysts to build their own text types
based on the particular parameters of interest to them, and may be equally
applicable to both spoken and written texts.

An informal domestic conversation might be characterized as follows:

<textDesc id=t1 n='Informal domestic conversation'>
	<channel mode=s>informal face-to-face 
		conversation</channel>
	<constitution type=single>each text represents a 
		continuously recorded interaction among the 
		specified participants
	</constitution>
	<derivation type=original>
	<domain type=domestic>plans for coming week, local 
		affairs</domain>
	<factuality type=mixed>mostly factual, some 
		jokes</factuality>
	<interaction type=complete active=plural passive=many>
	<preparedness type='spontaneous'>
	<purpose type=entertain degree=high>
	<purpose type=inform degree=medium>
</textDesc>

We have noted that situational parameters may be applied to texts other
than to spoken texts.  Consider this example of situational parameters
applied to a novel:

<textDesc n='novel'>
	<channel mode=w>print</channel>
	<constitution type=single>
	<derivation type=original>
	<domain type=art>
	<factuality type=fiction>
	<interaction type=none>
	<preparedness type=prepared>
	<purpose type=entertain degree=high>
	<purpose type=inform degree=medium>
</textDesc>
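
Because situational parameters can be used in combination, an analyst can in effect define a text type as a set of parameter values and select the texts that match it.  The sketch below illustrates the idea with the two descriptions above reduced to their attribute values.  The dictionary layout, the text identifiers and the function name select are hypothetical and are not part of the Guidelines.

```python
# Attribute values taken from the two <textDesc> examples above.
# The dictionary layout and identifiers are hypothetical.
CORPUS = {
    "conversation": {
        "channel": "s", "constitution": "single", "derivation": "original",
        "domain": "domestic", "factuality": "mixed",
        "interaction": "complete", "preparedness": "spontaneous",
    },
    "novel": {
        "channel": "w", "constitution": "single", "derivation": "original",
        "domain": "art", "factuality": "fiction",
        "interaction": "none", "preparedness": "prepared",
    },
}


def select(corpus, **wanted):
    """Return ids of texts whose parameters equal every wanted value."""
    return [
        text_id
        for text_id, params in corpus.items()
        if all(params.get(key) == value for key, value in wanted.items())
    ]


if __name__ == "__main__":
    # An analyst's own "text type": written, prepared fiction.
    print(select(CORPUS, channel="w", factuality="fiction",
                 preparedness="prepared"))
```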

It is possible, in addition, to document information about the participants
in a spoken text or persons named or depicted in written text, including
demographic and descriptive information about their individual
characteristics and the relationships among them.  The <particDesc> element
is used for this purpose.  Individual speakers, or groups of speakers, can
be named and identified by a code (or ID attribute), which can then be used
to identify the speaker throughout the text.  This allows an analyst to
identify multiple speakers or voices in the text, provide detailed
information about them, and then to track the participants' speech
throughout the text.[9]

An individual appearing in a text might be described either informally or
formally.  For instance, consider the following informal prose description
of a character:

<participant id=P1 sex=M age=39>
	<p>Male, well-educated, born in Newark, NJ, 28 
		September 1953, son of toolmaker and seamstress, 
		speaks pitiful but passable Italian taught to him by his
		maternal  grandmother, good French,
		lived on Upper West Side of Manhattan for past
		fourteen years, Social-Economic status high white
		collar (HWC) from Thernstrom's classification
		scheme, works as an endocrinologist.</p>
</participant>

Although a prose description may be sufficient for most projects,
demographic information may be described formally, and lend itself to
machine capture and analysis.  For example, consider the formal description
of the same character:

<participant id=P1 sex=M age=39>
	<birth date='1953-09-28'>
		<date>28 Sep 1953</date>
		<place>Newark, NJ</place>
	</birth> 
	<firstLang>English</firstLang>
	<langKnown>Italian</langKnown>
	<langKnown>French</langKnown>
	<residence>New York City</residence>
	<education>medical school</education>
	<occupation>Endocrinologist</occupation>
	<socecstatus source=Thernstrom code=HWC>
</participant>
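
A formal description of this kind can be read directly into a record for analysis.  The following sketch assumes a well-formed XML rendering of the <participant> element (quoted attribute values, explicit end tags); the function name participant_record and the keys of the resulting dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET


def participant_record(xml_text):
    """Turn a formal <participant> description into a plain dictionary.

    Hypothetical helper; assumes well-formed XML input.
    """
    person = ET.fromstring(xml_text)
    birth = person.find("birth")
    age = person.get("age")
    return {
        "id": person.get("id"),
        "sex": person.get("sex"),
        "age": int(age) if age is not None else None,
        "birth_date": birth.get("date") if birth is not None else None,
        "first_language": person.findtext("firstLang"),
        "languages_known": [el.text for el in person.findall("langKnown")],
        "occupation": person.findtext("occupation"),
    }


if __name__ == "__main__":
    sample = """<participant id="P1" sex="M" age="39">
      <birth date="1953-09-28"><date>28 Sep 1953</date>
        <place>Newark, NJ</place></birth>
      <firstLang>English</firstLang>
      <langKnown>Italian</langKnown>
      <langKnown>French</langKnown>
      <occupation>Endocrinologist</occupation>
    </participant>"""
    print(participant_record(sample))
```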


Finally, the setting or settings in which the language interaction takes
place can be described in the <settingDesc> element. The information here
may contain prose or it may be grouped in a series of sub-elements,
depending on the level of analysis required by the encoder.  For instance,
the setting can be described informally as follows:

<settingDesc>
	<p>The time is early summer, 1993.  P1 is doing the 
	dishes.  P2 is in the living room chair reading an
	unidentified newspaper. P3 is watching the news on 
	television. P4 (a television news broadcaster) is in a 
	broadcasting studio in New York.</p>
</settingDesc>

This description is useful as codebook information and may be sufficient
for the purposes of most projects, but it would be difficult to use in
machine analysis.  The following formal setting description may make computer
processing tractable.

<settingDesc>
	<setting who="P1">
		<place>New York City</place>
		<date value=1993>early summer, 1993</date>
		<locale>kitchen sink of New York apartment</locale>
		<activity>washing dishes</activity>
	</setting>
	<setting who="P2">
		<place>New York City</place>
		<date value=1993>early summer, 1993</date>
		<locale>living room chair of New York 
			apartment</locale>
		<activity>reading newspaper</activity>
	</setting>
	<setting who="P3">
		<place>New York City</place>
		<date value=1993>early summer, 1993</date>
		<locale>living room of New York apartment</locale>
		<activity>watching news on television</activity>
	</setting>
	<setting who="P4">
		<place>New York City</place>
		<date value=1993>early summer, 1993</date>
		<locale>broadcasting studio, New York City</locale>
		<activity>reading news</activity>
	</setting>
</settingDesc>

The <profileDesc> can thus provide researchers with a powerful tool to
identify and track each participant throughout the text by the use of a
formal description, or can be used solely for documentation by using
informal prose descriptions.
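
As an indication of what a formal setting description makes possible, the sketch below indexes each <setting> by the participant it refers to.  It assumes a well-formed XML rendering of the <settingDesc>; the function name settings_by_participant is hypothetical, and only the locale and activity are kept.

```python
import xml.etree.ElementTree as ET


def settings_by_participant(xml_text):
    """Map each participant id (the who attribute) to locale and activity.

    Hypothetical helper; assumes well-formed XML input.
    """
    root = ET.fromstring(xml_text)
    return {
        setting.get("who"): {
            "locale": setting.findtext("locale"),
            "activity": setting.findtext("activity"),
        }
        for setting in root.findall("setting")
    }


if __name__ == "__main__":
    sample = """<settingDesc>
      <setting who="P1"><place>New York City</place>
        <locale>kitchen sink of New York apartment</locale>
        <activity>washing dishes</activity></setting>
      <setting who="P4"><place>New York City</place>
        <locale>broadcasting studio, New York City</locale>
        <activity>reading news</activity></setting>
    </settingDesc>"""
    index = settings_by_participant(sample)
    print(index["P4"]["activity"])
```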

The <profileDesc> is the most problematic element in the TEI header for
librarian catalogers, because it provides a detailed description of
non-bibliographic aspects of the text, which can be used for both retrieval
and analysis.  There is no place in the MARC record designed to hold this
information.  Catalogers have a number of alternatives which will be based
on local practice and local cataloging philosophy.  They can, of course,
ignore the <profileDesc>, or they can map the <profileDesc> (with its TEI
tags) into one or more 590 fields and develop software for retrieving them.
They may also keep a copy of the TEI Header (such as an Independent Header,
discussed below) intact on some machine, and have the MARC record point to
that Header, which in turn will point to the encoded text.  Tactics for
dealing with the <profileDesc> might be an area of some discussion in this
journal in future months.


The Revision Description.

The final part of the TEI header, the Revision Description or
<revisionDesc>, provides a detailed change log in which each change made to
a text can be recorded.  This log is very similar to logs kept of data files
in the social sciences.  It is an especially important element for recording
changes to a file as that file is passed from system to system or from
researcher to researcher.

The revision description consists of the following tags:

<revisionDesc>: summarizes the revision history of a file.
<change>: summarizes a particular change or correction made to a particular
	version of an electronic text which is shared among several researchers.
<date>: contains a date in any format.
<respStmt>: supplies a statement for someone responsible for the
	intellectual content of text, edition, etc., where specialized elements for
	authors, editors, etc. do not apply.
<item>: contains one component of a list.

An example of a change log might look like this:

<revisionDesc>
	<change><date>6/4/93</date>
		<respStmt><name>RG</name><resp>ed.</resp></respStmt>
		<item>proofread SJW's work</item>
	</change>
	<change><date>6/2/93</date>
		<respStmt><name>SJW</name><resp>data entry</resp></respStmt>
		<item>Changes to pretty-up printed version</item>
	</change>
</revisionDesc>
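
A change log recorded in this way can be read and put into chronological order by program.  The sketch below assumes a well-formed XML rendering of the <revisionDesc>.  It also assumes that dates such as 6/4/93 are written month/day/year with a two-digit year; the <date> element may hold a date in any format, so this reading is an assumption, as is the function name change_log.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Assumption: dates in the log are month/day/two-digit-year.
DATE_FORMAT = "%m/%d/%y"


def change_log(xml_text):
    """Return (date, name, description) tuples, oldest change first.

    Hypothetical helper; assumes well-formed XML input.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for change in root.findall("change"):
        when = datetime.strptime(change.findtext("date"), DATE_FORMAT).date()
        who = change.findtext("respStmt/name")
        what = change.findtext("item")
        entries.append((when, who, what))
    return sorted(entries, key=lambda entry: entry[0])


if __name__ == "__main__":
    sample = """<revisionDesc>
      <change><date>6/4/93</date>
        <respStmt><name>RG</name><resp>ed.</resp></respStmt>
        <item>proofread SJW's work</item></change>
      <change><date>6/2/93</date>
        <respStmt><name>SJW</name><resp>data entry</resp></respStmt>
        <item>Changes to pretty-up printed version</item></change>
    </revisionDesc>"""
    for when, who, what in change_log(sample):
        print(when.isoformat(), who, what)
```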

Like the <profileDesc>, the <revisionDesc> is problematic for catalogers
for two reasons.  First, there are no MARC fields that deal specifically
with changes of these sorts, and it appears that the most appropriate field
for this would be a 59X field.  Second, it is unclear how revisions might
affect the 'version' of an electronic text, and the current edition of the
Guidelines offers little help in this regard.
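
One way a site might carry the change log into local notes is sketched
below in Python.  This is an illustration only: the function name
revision_notes, the layout of the note, and the default tag 590 are
assumptions, not recommendations of the Guidelines or of any MARC
documentation.  The sketch also assumes that every <change> in the log is
explicitly closed, so that an ordinary XML parser can read it.

```python
import xml.etree.ElementTree as ET

# The change log from the example above, with every <change> closed so that
# an XML parser can read it (full SGML would need an SGML-aware parser).
REVISION_DESC = """
<revisionDesc>
  <change><date>6/4/93</date>
    <respStmt><name>RG</name><resp>ed.</resp></respStmt>
    <item>proofread SJW's work</item></change>
  <change><date>6/2/93</date>
    <respStmt><name>SJW</name><resp>data entry</resp></respStmt>
    <item>Changes to pretty-up printed version</item></change>
</revisionDesc>
"""


def revision_notes(revision_desc, tag="590"):
    """Return one local-note string per <change> element."""
    root = ET.fromstring(revision_desc)
    notes = []
    for change in root.findall("change"):
        date = change.findtext("date", default="").strip()
        name = change.findtext("respStmt/name", default="").strip()
        resp = change.findtext("respStmt/resp", default="").strip()
        item = change.findtext("item", default="").strip()
        # The note layout below is illustrative only, not a MARC rule.
        notes.append(f"{tag} $a Revision {date} ({name}, {resp}): {item}")
    return notes


for note in revision_notes(REVISION_DESC):
    print(note)
```

Each change becomes a separate note, so a local system could index or
display the revisions individually; whether that is desirable is a matter of
local practice.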


The Size and Complexity of the TEI header.

From the above overview, it is easy to see that the TEI header can become
quite large, and in some cases may even exceed the size of the text it is
documenting!  The size of the TEI header, however, depends on the nature of
the project and the amount of documentation that encoders wish to attach to
a text.  It is not intended that all of the elements recommended
in the Guidelines, nor even the formal structures illustrated in this
article, be present in every TEI header.  At one extreme, an encoder may
expect that the TEI header will be needed only to provide minimal
bibliographic information about the text adequate to local needs, or to be
shared with a small number of close colleagues.  Similarly, encoders may
wish to describe their texts only in detailed prose, leaving it to
professional catalogers, archivists and others to create structured
sub-elements that are tractable by machine.  At the other extreme, wishing
to ensure that their texts can be used for the widest range of
applications, encoders may want to document both bibliographic and
descriptive information as explicitly as possible such that no prior or
ancillary knowledge about the text is needed in order to process it.  The
TEI header, in the latter case, will be very full, approximating to the
kind of documentation often supplied in the form of a manual.  Most texts
will lie somewhere between these extremes; large corpora and
linguistics-based research projects in particular will tend toward the
latter. 

The Guidelines make no recommendations on how a particular project
must encode a TEI header other than to require that at least an informal
<fileDesc> be included for all encoded texts.  Having said that, the
Guidelines offer guidance on creating minimal and recommended headers,[10] as
well as creating freestanding or Independent Headers that can be sent to
libraries and other sites without text attached to them.[11]  Contingent
resources such as time, staff and money, as well as the intended purposes
of the encoded texts, should determine the level of encoding, not printed
guidelines.  Once the participants on a project decide which information is
to be encoded, the Guidelines will provide recommendations on ways to do
it.  Some elements of the TEI header may be difficult to understand for
those unfamiliar with bibliographic description, and potential users
uninitiated in library practices may be wary of even attempting to
construct a TEI header.  It is not the TEI's intention that everyone who
creates a TEI header spend a couple of semesters in library school in order
to know which elements to include.  We anticipate that encoders outside of
librarianship, rather than grappling with the intricacies of bibliographic
description, will provide as much information as they can in the <fileDesc>
in prose, and leave it to a librarian cataloger at a text repository to
structure the information into appropriate sub-elements, or to input (or
automatically convert) that information into a local bibliographic system.
What matters when encoders construct a TEI header is not the structure of
the information, but instead its completeness and fidelity to the encoded
text.  Highly structured and accurate bibliographic information, especially
that contained in the <fileDesc> of the TEI header, however, will greatly
ease the burden of catalogers because such information can be loaded
directly into online catalogs with relatively little human intervention,
with the exception of the <profileDesc> and possibly the <revisionDesc>.
If Header information is already contained in a database, it might be a
matter of generating a TEI header automatically from that database, as was
done with the British National Corpus [Dunlop, 1994], and transferring it
to a bibliographic file.
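
Generating a header from a database can be sketched in a few lines of
Python.  The sketch is not a description of how the British National Corpus
headers were produced.  The record, its field names, and the function
build_header are hypothetical, and the header it emits contains only a
minimal <fileDesc> with prose in the <sourceDesc>.

```python
from xml.sax.saxutils import escape


def build_header(record):
    """Build a minimal TEI header string from a flat database record."""
    def el(tag, text):
        # Escape &, < and > so field values cannot break the markup.
        return f"<{tag}>{escape(text)}</{tag}>"

    return "\n".join([
        "<teiHeader>",
        " <fileDesc>",
        "  <titleStmt>" + el("title", record["title"])
        + el("author", record["author"]) + "</titleStmt>",
        "  <publicationStmt>" + el("distributor", record["distributor"])
        + "</publicationStmt>",
        "  <sourceDesc>" + el("p", record["source"]) + "</sourceDesc>",
        " </fileDesc>",
        "</teiHeader>",
    ])


# Hypothetical record; the field names are not those of any real database.
record = {
    "title": "Sample text: an electronic edition",
    "author": "Doe, Jane",
    "distributor": "Example Text Archive",
    "source": "Transcribed from a printed edition held locally.",
}
print(build_header(record))
```

Because the output is built from named fields, the same record could also
feed a bibliographic file, which is the point of keeping the information
structured in the first place.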


Use of the TEI header to Support Document Retrieval and Analysis: 
Prospects and Problems.

TEI headers can be distributed and published in paper or electronic form as
Independent Headers (that is, with the TEI header physically separated from
the text it describes) and sent to libraries and other repositories
to support online bibliographic retrieval.  Two advantages of this
approach, mentioned by [Dunlop, 1994], are that users save local storage by
not needing the full text in order to make selections, and users need not
undertake any copyright responsibilities defined in end-user agreements.
Most important, TEI headers loaded into or referenced from standard
bibliographic databases will allow users to identify electronic texts as
easily as texts in other formats, including print.  Linked with easily
accessible bibliographic databases, the Independent TEI header should, in
the near future, engender widespread use of electronic texts, and encourage
people to encode texts for distribution, rather than simply for local use.

As we have seen, it is not a simple matter to load all of the elements from
a TEI Header into a MARC record, although the majority of structured file
descriptions might be mapped to a MARC record with little or no human
intervention.  There are also limitations in the current TEI header's
ability to provide fine-grained retrieval that might be important for some
applications.  The Guidelines provide encoders with a relatively fixed set
of tags and attributes (which, of course, can be extended by the user), but
the values given to those tags and attributes are, for the most part, open.
For instance, one encoder might encode <locale>New York City</locale> while
another might encode <locale>Manhattan</locale> while yet another might
encode <locale>Upper West Side</locale> to designate the same place.  This
uncontrolled vocabulary, and its form, may seriously affect retrieval
unless it is controlled through the use of an authority file or its
equivalent.
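
The role an authority file plays here can be sketched in Python.  The
table below is hypothetical: its entries, the preferred heading, and the
function normalize_locale are assumptions made for illustration, and a real
authority file would be far larger and maintained by a cataloging agency.

```python
# Hypothetical authority file: each variant form maps to one preferred
# heading. The entries and the preferred form are illustrative only.
AUTHORITY = {
    "new york city": "New York (N.Y.)",
    "manhattan": "New York (N.Y.)",
    "upper west side": "New York (N.Y.)",
}


def normalize_locale(value):
    """Return the preferred heading for a <locale> value, or None if the
    value is not in the authority file and needs a cataloger's review."""
    key = " ".join(value.split()).lower()  # collapse whitespace, fold case
    return AUTHORITY.get(key)


for raw in ["New York City", "Manhattan", "Upper  West Side", "Brooklyn"]:
    print(repr(raw), "->", normalize_locale(raw))
```

A search on the preferred heading then retrieves all three encoders' texts,
at the cost of losing the finer distinction each encoder originally made.
Values missing from the table are returned as None rather than guessed.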

The form of personal names is, as we have seen, another potential problem.
This lack of specificity in the Guidelines for determining the form of
names and subjects (as well as uniform titles or serials) is the result of
twin circumstances: first, the Text Encoding Initiative was concerned
primarily with providing users with a set of recommended tags and
guidelines for using them instead of rules for determining the values that
encoders would use with those tags; second, and more significantly, the
encoders themselves and relevant international and professional bodies are
best positioned to address the syntactic and semantic questions of
standardized content formats and values because the specific form of values
is very much related to the intended use of the encoded texts, the purposes
of a project, and the 'language' of the discipline to which the text or
encoding project is connected.  The TEI could not issue specific guidelines
or make recommendations that would apply across a range of disciplinary or
professional orientations.  The Guidelines therefore assume that encoders,
or documentation professionals who encounter TEI encoded texts, will refer
to the work of professional organizations, such as the Library of Congress,
the American Library Association, the International Organization for
Standardization, or discipline-specific organizations, such as art
documentation associations, for guidance.

The model of 'documentation' that guided much of the work of the Text
Documentation Committee was informed by standard library and archival
practice, and this resulted in the default structure of the TEI header
which is adequate to the needs of many applications.  The reliance on
standard library practice is both a strength and a weakness.  Typically,
documentation in the library community involves providing enough
descriptive information to users so that a text itself can be uniquely
identified among many.  The Text Documentation Committee thus made
reference to an existing working model as defined in the Anglo-American
Cataloging Rules [AARC, 1988] for the cataloging and description of machine
readable data files.  Consequently, the Committee began work under the
assumptions embedded in those rules which implied that users would need
information in the form of an index and codebook to identify a text for
retrieval, and in general they would use that index and codebook
information the same way as one would when identifying a social science
machine readable data file stored on tape in a data archive.  This
assumption led to the construction of the <fileDesc>, <encodingDesc>, and
<revisionDesc> which document the bibliographic aspects of texts, their
encodings and revisions, and serve as pointers to encoded texts.  That is
to say, the mental model of documentation presupposed the traditional
notion of an index, which is a physical object (such as a catalog) that
points users to another physical object (such as a book) [Gorman, 1992].
The problem is that this model of an index points to the electronic text,
and assumes that electronic texts exist only as separate unified physical
objects.  Users are characterized as being interested in retrieving only
the entire physical object, rather than, say, only parts of it.
Thus, in many ways the TEI header at present functions as little more than
an electronic analogue of the catalog card: one can retrieve the encoded
text with information from the <fileDesc>, but there is no mechanism in the
default header to navigate through the text, or to retrieve portions of it
that fit some analytic criteria.

The work of the Text Encoding Initiative in general has been to support
multiple views of texts (as physical objects, typographic objects,
linguistic objects, rhetorical objects, etc.), and it recognizes that in many
situations more than one view of a text is needed.  It should then come as no
surprise to observers that outsiders to library science, notably the Spoken
Language workgroup and members of the Corpus Linguistics community, made
recommendations for the <profileDesc> which are extremely useful to
researchers for the retrieval and analysis of portions of texts (for
instance, linguistic objects such as phonemes) which fall outside the
traditional boundaries of retrieval one normally associates with
librarianship.  Consequently, the information contained in the
<profileDesc> falls outside that normally contained in the MARC record and,
by extension, what would normally be found in online catalogs. The
<profileDesc>, for instance, gives encoders the ability to track
participants and voices through a text, and extensions to the TEI header
that support corpora allow users to isolate sub-portions of the text
[Dunlop, 1994].

Such features, though useful, can be seen to be under-specified.  Although
voices can be traced through a text using the TEI header, no unified
structure exists at present in the default header to track the occurrences
of intellectual or editorial responsibilities throughout the text.  Take a
case where a text has multiple editors, each responsible for
different parts of the text, perhaps edited at different times.  The
default TEI header does not at present give encoders the ability to
identify each editor, isolate the parts of the text that each one edited,
and retrieve those parts.  There are, of course, ways of encoding intellectual
interventions, but none are to be found in the default TEI header without
extending its content model.  Further, the default TEI header at present
does not provide tools for the tracking of editors, voices, or participants
across texts.  For example, it does not give encoders the ability to
isolate an editor in a text, retrieve those parts of the text for which
s/he is responsible, and to retrieve parts from other texts for which that
editor is also responsible.  Such a facility would give researchers the
ability to compare editorial styles across works, for example from one
genre to another.  It could also be used to investigate the effects of
different situations on the register of the speech of a given speaker.

The potential solutions to these problems are deceptively simple: there is
no reason why one might not re-define the TEI header so that those listed
in the <respStmt> are given an ID attribute, and their 'interventions'
isolated in the text using much the same mechanism that isolates speech by
participants in spoken text.  The Text Documentation Committee, however,
did not take this approach because it demands that the <respStmt> do
something that the library community never intended, and the
Guidelines would therefore be at variance with standard library practice.
Such information can be carried in the <profileDesc>, but encoders then run
the risk of encoding redundant information in the <respStmt> (by listing
editors of a text) and again in the <profileDesc> (by pointing to areas of
the text that they had in fact edited).
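
The kind of retrieval such identifiers would permit can be sketched in
Python.  Nothing here is part of the default TEI header: the text ids,
editor ids, line ranges, and the function index_by_editor are all
hypothetical, and the sketch assumes that each edited span has already been
extracted from the encoded texts as a simple record.

```python
from collections import defaultdict

# Hypothetical records: (text id, editor id, start line, end line). In an
# extended header the editor id would be the ID given in the <respStmt>.
INTERVENTIONS = [
    ("text1", "ED1", 1, 120),
    ("text1", "ED2", 121, 300),
    ("text2", "ED1", 40, 75),
]


def index_by_editor(interventions):
    """Group edited spans by editor id so they can be retrieved across texts."""
    index = defaultdict(list)
    for text_id, editor_id, start, end in interventions:
        index[editor_id].append((text_id, start, end))
    return dict(index)


index = index_by_editor(INTERVENTIONS)
print(index["ED1"])
```

Looking up one editor returns spans from more than one text, which is the
cross-text comparison the default header cannot support.  The sketch works
only because the same identifier is used for an editor in every text.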

One can hazard the proposition, however, that once textual collections are
in an electronic environment, in particular a networked electronic
environment where texts in arbitrary locations can easily be combined in a
screen buffer, the rules that guide the identification and retrieval of
texts are altered.  Users need not only the ability to retrieve texts in
the traditional sense (as one would retrieve a book), but also standard
mechanisms for fine-grained retrieval and analysis that support multiple
views of texts.

Although the international library and documentation community will sooner
or later have to focus on this problem, there are mundane issues that will
need to be addressed soon.  The difficulty, however, is that no one really
knows yet what those mundane issues are.  There is no large-scale empirical
field experience with the TEI header, only a set of guidelines and a lot of
good will.  How people will make use of the TEI header; whether people will
make use of it at all; the ability of the TEI header to provide the
documentation that researchers and scholars need; the willingness of
encoders to create accurate TEI headers: these are issues that might
occupy these pages in the coming years.


FOOTNOTES.
 
[1]This article is intended primarily for librarian catalogers and others
who collect and catalog electronic texts.  A treatment for humanities
encoders by the same author is contained in vol. 28, no. 4 of the Computers
and the Humanities special triple issue on the Text Encoding Initiative.
In 1988, four committees were initially responsible for appropriate
sections of the Guidelines.  These were the Committee on Representation
(which provided for the adequate representation of printed and manuscript
versions of text), the Committee on Text Analysis and Interpretation (which
provided tags for textual features not conventionally represented
typographically in a text), the Committee on Metalanguage Issues (which
provided a syntax for the tag set for the Guidelines) and the Committee on
Text Documentation, which designed the Header. [Ide & Sperberg-McQueen]


[2]UNIMARC was not sufficiently stable at the time of the Header's
development.

[3]For details on mapping TEI Headers into MARC records, see Chapter 24 of
the Guidelines, [Sperberg-McQueen & Burnard] pp. 672-676.

[4]Local practice will determine the appropriate MARC fields for <address>,
<idno> and <availability>. Restrictions on access should normally be placed
in the 506 field, while the place where an item may be ordered may be
located in a local notes (590) field. If local practice warrants it, the
address of the publisher should be indicated in the 260 field.


[5]Recommendations on the form of citations are given in section 5.2.7 of
the Guidelines.

[6]Note that a source description containing a full bibliographic reference,
like the one in this example, using the <biblioStruct> element might be
mapped to a 581 field (note on primary publication), using the ISBD format
to separate each data element.

[7]An encoded text, however, is not just another electronic file that can be
categorized like a data set of numbers because text, unlike a file of
numbers, can contain multiple hierarchies and multiple meanings.  For this
reason, the Profile Description, which is discussed later in this article,
was developed for the TEI header largely through the efforts of the Spoken
Language and Corpus Linguistics workgroups.

[8]See [Sperberg-McQueen & Burnard], pp. 648-658 for more details.

[9]The Guidelines [Sperberg-McQueen & Burnard] provide a set of elements
to identify the name of the speaker, place of birth, residence, education,
occupation and so on, but in practice such details will vary enormously for
different forms of analysis, and users of TEI P3 are encouraged to
customize them to fit the needs of their projects.

[10]See [Sperberg-McQueen & Burnard], pp. 134-135.

[11]See [Sperberg-McQueen & Burnard], pp. 667-678.