Susan Hockey
CETH

[This document is mirrored from http://lcweb.loc.gov:80/catdir/semdigdocs/hockey.html.]

Susan Hockey has been Director of the Center for Electronic Texts in the Humanities (CETH) since 1991 when she moved to the United States from Oxford University. Her academic background is in Classics and Ancient Near Eastern Languages. She has been active in humanities computing for 25 years and is the author of A Guide to Computer Applications in the Humanities and SNOBOL Programming for the Humanities as well as some thirty articles. She has managed OCR and typesetting facilities for the humanities and directed the development of the Oxford Concordance Program (OCP). She is Chair of the Association for Literary and Linguistic Computing and is a member (past Chair) of the Steering Committee for the Text Encoding Initiative. Her current interests (and CETH's research) center around the use of SGML as a tool for developing and accessing electronic texts for humanities scholarship.

Describing Electronic Texts: The Text Encoding Initiative and SGML

1. Electronic Texts and Markup

Markup or encoding is necessary for electronic texts. It makes explicit for computer processing things which are implicit to the human reader. It can be used to specify areas or fields of the text to be searched and to identify text which has been retrieved. It can be used to designate "hot text" for hypertext links and, most obviously, to provide formatting information for display or printing of the text. Attempting to use an electronic text without markup would be like using a bibliographic record where the fields are not delimited in any way.

Electronic texts have been used for research and teaching in the humanities ever since 1949 when Father Roberto Busa began work on his Index Thomisticus. Early applications included concordances and text retrieval, lexicography, stylistic analyses, the preparation of scholarly editions, studies of sound patterns and some other forms of linguistic analysis. Until recently these projects have mostly been carried out by individuals and by groups within federally funded research institutes located mainly in Europe. This has led to the creation of large corpora of texts in many languages. Notable examples include the Thesaurus Linguae Graecae, the Corpus of Old English, the Global Jewish Database, as well as many individual texts.

It is only recently that these texts have begun to be used in the library environment. Putting electronic texts in the library implies permanence in that the texts are expected to be available for a very long time. It also implies reusability which means that the texts must be able to serve users who have different, and possibly conflicting, theoretical perspectives. The latter is especially important since much of the scholarly process in the humanities is concerned with the interpretation of primary source material. A scholar publishes his or her interpretation as a journal article or monograph. Subsequent scholars will then disagree with that interpretation and publish their own analyses, citing earlier work on the same text.

Humanities texts are complex in nature. They can include a critical apparatus, marginal notes, variant spellings, words in other languages and character sets. They also exhibit different logical structures, including, in some cases, multiple parallel referencing schemes. These features must be represented adequately in the text in order for the text to be useful. Many different ways have been devised for encoding these features, but by the mid 1980's it was recognized that there were problems with all of them. It was not possible to extend any of them easily to deal with other kinds of texts. Some represented only one theoretical view of the text. Others concentrated on the typographic appearance of the text, leading to ambiguity when, for example, italic can be used for a title, a foreign word or an emphasized word. All were specific to one or two programs. In addition, there was no accepted mechanism for providing information about the source of a text and the rationale for the encoding. Much time was wasted in converting texts from one encoding scheme to another.

In late 1987 a planning meeting, held at Vassar College and attended by international experts in humanities computing, determined that it was time to create a common encoding scheme which would satisfy as many purposes as possible. Previous efforts to create a common encoding scheme had foundered but they had all been before the existence of the Standard Generalized Markup Language (SGML).

2. The Standard Generalized Markup Language (SGML)

SGML became an international standard in 1986. It is not itself an encoding scheme, but a metalanguage within which encoding schemes can be defined. It far surpasses other encoding mechanisms in power and flexibility and provides a way of using the same electronic text for many different purposes. Since an SGML-encoded text is a plain ASCII file, it is completely independent of any particular hardware or software, and can be transmitted across all networks.

The basic principle of SGML is "descriptive", not "prescriptive" markup. An SGML-encoded text is viewed as a collection of objects within an overall hierarchic structure. It is up to the designer of an SGML document to determine what objects should be encoded or tagged. Typical objects could be title, chapter, page, verse, stanza, act, scene, quotation, name, date, list etc. They can also be analytical features such as word class tags or other forms of linguistic, literary or historical interpretation. Metadata or descriptive information about the text can also be encoded as SGML objects. The applications program determines what happens to those objects when the text is processed. An object called a title could be italicized if the application is to display or print the text. It could be searched if the application is to retrieve all titles that contain a specific term. It could even be a hypertext link to another document.
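The point that the application, not the markup, decides what happens to an object can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression stands in for a real SGML parser, the function names are hypothetical, and asterisks stand in for italics.

```python
import re

# A tiny SGML-like fragment in which only the <title> element is tagged.
TEXT = "... the novel <title>Northanger Abbey</title> is associated with ..."

# Simplification: a regular expression instead of a real SGML parser.
TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

def render_for_display(text):
    """Display application: show each title in 'italics' (asterisks here)."""
    return TITLE.sub(r"*\1*", text)

def find_titles(text, term):
    """Retrieval application: return the titles that contain the term."""
    return [t for t in TITLE.findall(text) if term.lower() in t.lower()]

print(render_for_display(TEXT))
print(find_titles(TEXT, "abbey"))
```

The same tagged text serves both functions unchanged; only the processing differs.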

SGML has formal properties which make the processing of SGML-encoded texts less error-prone than other systems. Its basic components are entities and elements. An entity is any named bit of text and the definition of an entity associates the name with the text. One use for entities is non-standard characters, for example &beta; for the Greek letter beta. This ensures that the character can be transmitted across all networks but it is expected that a program will translate from the entity to the correct character for display. A second use is for expanding abbreviations or for boiler-plate text, for example &LOC; for "Library of Congress".
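The translation from entity reference to replacement text can be sketched as a simple table lookup. The table below and the function name are assumptions made for illustration; in a real SGML system the definitions come from entity declarations and are resolved by the parser.

```python
import re

# Hypothetical entity table: entity name -> replacement text.
ENTITIES = {
    "beta": "\u03b2",              # Greek small letter beta
    "LOC": "Library of Congress",  # boiler-plate expansion
}

# An entity reference has the form &name;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, table=ENTITIES):
    """Replace each &name; with its definition; leave unknown names as they are."""
    return ENTITY_REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)

print(expand_entities("Catalogued at the &LOC;; see the &beta; variant."))
```

The stored text stays in plain ASCII, and only the display step needs to know how to produce the Greek character.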

Elements are the objects within the text. Each element is marked by a start and end tag, for example

... the novel <title>Northanger Abbey</title> is associated with ...

Angle brackets normally delimit the tags. Entity references can be used if real angle brackets appear in the text. Attributes may be associated with elements to give further information. A simple example is

<chapter n='6'> ... text of chapter 6 ... </chapter>

for numbering chapters. Attributes may also be used to control indexing. In

<name type=personal normal='SmithJ'>Jack Smyth</name>

the name Jack Smyth could be listed under SmithJ in an index of personal names. SGML also has mechanisms for using attributes as generic cross-references which are only resolved into concrete references when the text is processed.
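A minimal sketch of attribute-controlled indexing follows. It assumes that Python's standard html.parser module is an acceptable stand-in for an SGML parser on this small fragment; the class and function names, and the sample sentence, are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

SAMPLE = ("<name type=personal normal='SmithJ'>Jack Smyth</name> met "
          "<name type=personal normal='SmithJ'>J. Smith</name> in "
          "<name type=place>Oxford</name>.")

class NameIndexer(HTMLParser):
    """Collect personal names under the key given by their 'normal' attribute."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(list)
        self._key = None      # index key of the <name> currently open, if any
        self._chunks = []     # text collected inside that <name>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "name" and attrs.get("type") == "personal":
            self._key = attrs.get("normal")
            self._chunks = []

    def handle_data(self, data):
        if self._key is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "name" and self._key is not None:
            self.index[self._key].append("".join(self._chunks))
            self._key = None

def index_personal_names(text):
    parser = NameIndexer()
    parser.feed(text)
    parser.close()
    return dict(parser.index)

print(index_personal_names(SAMPLE))
```

Both spellings are filed under the single key SmithJ, while the place name is ignored because its type attribute differs.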

The set of elements permissible in an SGML-encoded text is defined in a Document Type Definition (DTD), which provides a formal specification of the document structure. The structure is basically a single nested hierarchy. Multiple hierarchies, as commonly exist in humanities texts, can be represented in various, if rather inelegant, ways. The DTD is used by an SGML parser to validate the markup by checking that it conforms to the model which has been defined. Software to aid the creation of an SGML-encoded text uses the DTD to offer a user, who is tagging the text, only those tags which are valid at a particular point in the document. SGML-browsing software also derives information about the document structure from the DTD. The same SGML software can thus use the DTD to enable it to operate on many different document structures.
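The idea of validating markup against a declared model can be sketched with a toy content model. Everything here is an assumption for illustration: the dictionary is a drastic simplification of a DTD (a real one also specifies order, repetition and attributes), the element names are invented, and the tokenizer is a regular expression rather than an SGML parser.

```python
import re

# Toy stand-in for a DTD: each element maps to the children it may contain.
CONTENT_MODEL = {
    "book": {"chapter"},
    "chapter": {"title", "p"},
    "title": set(),
    "p": {"title"},
}

# Matches start and end tags; attributes are skipped, not checked.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^>]*>")

def validate(text, root="book", model=CONTENT_MODEL):
    """Return a list of error messages; an empty list means the markup conforms."""
    errors, stack = [], []
    for closing, name in TAG.findall(text):
        if closing:
            if not stack or stack[-1] != name:
                errors.append(f"unexpected </{name}>")
            else:
                stack.pop()
            continue
        if name not in model:
            errors.append(f"undeclared element <{name}>")
        elif not stack:
            if name != root:
                errors.append(f"<{name}> cannot be the root element")
        elif name not in model.get(stack[-1], set()):
            errors.append(f"<{name}> not allowed inside <{stack[-1]}>")
        stack.append(name)
    errors.extend(f"unclosed <{name}>" for name in reversed(stack))
    return errors

GOOD = "<book><chapter n='1'><title>One</title><p>Text</p></chapter></book>"
BAD = "<book><p>stray paragraph</p></book>"
print(validate(GOOD))
print(validate(BAD))
```

The same lookup that rejects an invalid tag could also drive an editor that offers only the tags permitted at the current point, since the permitted set is simply the entry for the innermost open element.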

3. The TEI Project

The Vassar meeting resulted in the Text Encoding Initiative (TEI), which became a major international project within the humanities and language industries. Sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics and the Association for Literary and Linguistic Computing, the TEI attracted funding totalling over $1.2 million from the National Endowment for the Humanities, the Commission of the European Union, the Social Sciences and Humanities Research Council of Canada and the Andrew W. Mellon Foundation as well as indirect support from the host institutions of participants.

The TEI was established as a community volunteer effort, but it is co-ordinated by two editors, appointed by a Steering Committee which has two members from each of the three sponsoring organizations. Feedback was obtained continuously from an Advisory Board representing fifteen scholarly associations in the humanities, and library and information science. For the first phase of the project four Working Committees were established to look at Text Documentation, Text Representation, Text Analysis and Interpretation, and Syntax and Metalanguage Issues. Early on the Syntax and Metalanguage Committee determined that SGML should form the basis of the new encoding scheme.

In July 1990 the TEI published the first draft of the TEI Guidelines (document P1) which was circulated widely for comment. It included initial recommendations for documenting electronic texts and features common to most text types. Some topics in it were treated in more depth than others. Some needed revision in the light of comments, and some were omitted completely.

In the second stage of the work, the two major objectives were to test the guidelines and to extend their scope. A number of small work groups were set up to tackle specific areas. These included character sets, text criticism, hypertext and hypermedia, language corpora, transcription of manuscripts, spoken texts, general linguistics, print dictionaries, drama, verse, literary prose, historical analysis, and terminological databases. During this phase, a Technical Review Committee met three times to discuss the work group proposals and to integrate their recommendations.

The TEI also made arrangements with a number of affiliated projects which were engaged in the creation of electronic texts. Each project was assigned a consultant to help them and workshops were held for the projects and consultants. The projects were asked to encode samples of their text and report back to the TEI on the results of these tests.

Individual chapters of the next version of the Guidelines were distributed electronically over a period of two years and in May 1994 the TEI published a definitive version of the Guidelines, two volumes totalling almost 1300 pages and the result of collaborative work involving well over a hundred people. The volumes contain an introduction to the TEI and SGML, descriptions of some 400 SGML elements, technical specifications for using the TEI SGML application, and a reference section.

4. The TEI Guidelines

The TEI Guidelines give recommendations both on what features to encode and how to encode them. They include features which are explicitly marked and those which are the result of analysing the text. Very few of the 400 tags are mandatory. The basic philosophy is "if you want to encode this feature, do it this way". The encoding process is incremental so that new markup can be added to a text without altering what is already there. Multiple and possibly conflicting views can be encoded within the same text and a method of documenting the encoding is provided.

The Guidelines are built on the assumption that virtually all texts share a common core of features to which can be added tags for specific disciplines or applications. The TEI's modular DTD makes this possible. The user chooses an appropriate base tag set for which at present prose, verse, drama, spoken texts, print dictionaries and terminological data are provided. The common core and documentation tags are automatically included. The user may then select additional tagsets if he or she needs, for example, the critical apparatus, hypertext linking, or names and dates. The construction of the TEI DTD has thus been likened to the preparation of a pizza, where the base and the toppings are chosen.
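The pizza model amounts to a union of tag sets: core and documentation tags always, exactly one base, and any number of toppings. The sketch below shows only that composition. The function name is hypothetical, and the element names listed for each set are illustrative placeholders, not the actual contents of the TEI tag sets.

```python
# Always included, whatever the user chooses (illustrative contents only).
CORE = {"p", "q", "list", "name", "note", "bibl"}
HEADER = {"teiHeader"}

# Exactly one base must be chosen (illustrative contents only).
BASES = {
    "prose": set(),
    "verse": {"l", "lg"},
    "drama": {"sp", "speaker", "stage"},
}

# Optional additional tag sets, the "toppings" (illustrative contents only).
TOPPINGS = {
    "critical apparatus": {"app", "lem", "rdg"},
    "linking": {"link", "ptr"},
}

def build_tag_set(base, toppings=()):
    """Combine core, header, one base and any number of toppings."""
    if base not in BASES:
        raise ValueError(f"unknown base tag set: {base!r}")
    tags = CORE | HEADER | BASES[base]
    for topping in toppings:
        tags |= TOPPINGS[topping]
    return tags

print(sorted(build_tag_set("verse", ["critical apparatus"])))
```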

A TEI conformant text consists of a header followed by the text itself. The header consists of a set of SGML elements which provide documentation about the text and its encoding. It represents the first attempt to provide a method of in-file documentation for electronic texts which can be processed by the same software as the text itself. It provides metadata which is needed by librarians who will catalogue the text, scholars who will use the text, and software programs which will operate on the text.

The header has four major sections. The file description contains a bibliographic description of the electronic text and can be used as a chief source by cataloguers. It is the only part of the header that has mandatory elements. These include the title statement which gives the title of the work and those responsible for its intellectual content, the publication statement which provides information about the publication or distribution of the text, and the source description which records details of the source from which the electronic text was derived. The encoding description documents the editorial principles used in the transcription of the text, for example the treatment of hyphenation, quotations and spelling variation, and any sampling methods used. The profile description is most relevant for spoken texts where it documents the participants in the conversation. The revision history provides a change log indicating who made each change to the text and when. For a corpus or composite text, there is one header which includes elements common to the entire corpus and individual headers for each text within the corpus.
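Because the header is itself marked up, software can check it in the same way as the text. The sketch below tests for the three mandatory parts of the file description. It rests on several assumptions: the element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc) are the ones the Guidelines use as far as I recall and should be verified against them, the sample header content is invented, and Python's XML parser stands in for an SGML parser, which works only because this sample happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented sample header; element names should be checked against the Guidelines.
HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Northanger Abbey: an electronic edition</title></titleStmt>
    <publicationStmt><p>Unpublished sample.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>"""

# The mandatory parts of the file description named in the text above.
REQUIRED = ("titleStmt", "publicationStmt", "sourceDesc")

def missing_mandatory(header_text):
    """Return the names of mandatory header parts that are absent."""
    root = ET.fromstring(header_text)
    file_desc = root.find("fileDesc")
    if file_desc is None:
        return ["fileDesc"]
    return [name for name in REQUIRED if file_desc.find(name) is None]

print(missing_mandatory(HEADER))
```

An empty list means the file description is complete as far as this check goes; it says nothing about whether the content of each part is adequate for cataloguing.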

The text itself consists of optional front matter, followed by the body, followed by optional back matter. Instead of trying to define elements for every possible subdivision within all text types, the TEI uses a generic subdivision element <div> which carries an attribute indicating the type of subdivision, for example <div type=stanza>. <Div>s can be numbered as <div0>, <div1> etc if this is convenient. Within a prose text a <div> contains one or more paragraphs tagged <p>. In verse, <div>s contain lines tagged <l> which are optionally grouped into line groups tagged <lg>. Depending on the nature of the text, paragraphs or lines are found within <div>s in drama. Core elements such as quotations, lists, names, abbreviations, notes and bibliographic citations may appear anywhere.

5. The TEI and Digital Libraries

The TEI's application of SGML satisfies many requirements of the digital library. Its scope already covers the major text types and, because of the modular DTDs, it is easily extended to new text types. It can handle multiple views and interpretations of a text and, through the header, it provides mechanisms for documenting the text. Furthermore, the use of SGML is not restricted to text. It can be used to describe images and other non-textual material and thus provides the link between a digital image of a text and its transcription. The TEI has extended the cross-referencing systems within SGML to enable them to point to complete texts or sections of text stored elsewhere as images or transcriptions.

It is important to understand the relationship between the TEI and the Hypertext Markup Language (HTML). HTML is a somewhat limited SGML application, which concentrates on markup for the presentation of information. It is less suitable for encoding text which is to be analysed in some way, for example by a retrieval program, but it can be used to display the results of an analysis of a more richly marked up text such as one encoded in the TEI scheme. However HTML has introduced many more people to the concept of structured text, and has contributed significantly to the spread of SGML.

Many publishers are now adopting SGML for their electronic publications and much more SGML-based software is available now than a few years ago. Tools such as Author/Editor and Omnimark can be used to aid the production of SGML-encoded text. Products like PAT, Dynatext and Explorer allow the user to search and browse structured text and some, like Panorama, a public domain version of Explorer, can be launched from the World Wide Web and work with any arbitrary DTD.

Using a powerful SGML-encoding scheme like the TEI provides an opportunity to rethink the way we work. At any level, creating an electronic text implies some interpretation of the source material. SGML allows more levels of interpretation to be embedded in a text than other markup schemes. It thus provides many more possibilities for using that text. However those who insert the markup become responsible for some of the intellectual content of the text. Decisions need to be made as markup is inserted, and it is the markup which determines how words are indexed and thus how they can be retrieved. At present it is not clear whose role this is. In current projects, it is variously being handled by scholars, student assistants, librarians, publishers, and software developers, some of whom may be more familiar with the source material than others. What is important is that documentation such as that provided in the TEI header is supplied so that users of the text know what they have got.

6. Using the TEI Header

At the Center for Electronic Texts in the Humanities (CETH), we see the header information as central to all the functions which are performed on an electronic text, since it provides the intellectual rationale for the encoding of the text as well as a description of the text. We are now working on the design of an enhanced header which will provide a direct mapping to all the MARC fields which we use for cataloguing electronic texts, as well as encoding, profile and revision descriptions which can be used by computer software as well as human users. Our cataloguer will have responsibility for the information which maps on to the MARC fields. From this master SGML format we will be able to generate records automatically for the Rutgers Inventory of Machine-Readable Texts in the Humanities in several formats, including MARC records for RLIN, relational database records which can be accessed via forms on our World Wide Web server, and even a printed directory. Using SGML for the master format means that we are not dependent on any particular hardware or software and that the data structure can easily be modified in the future, if enhancements are needed.

Appendix: Information about the TEI

The TEI has a listserv TEI-L@UICVM for disseminating information and discussing the Guidelines. Its fileserver has information about how to obtain the Guidelines electronically. Printed copies of the Guidelines are available from

(in USA), C.M. Sperberg-McQueen, Computer Center (M/C 135), University of Illinois at Chicago, 1940 West Taylor Street, Chicago, IL 60680, e-mail u35395@uicvm.uic.edu, price $75 ($50 for members of ACH, ACL and ALLC) (make checks payable to Association for Computers and the Humanities).

(in Europe), Lou Burnard, Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6NN, England, e-mail lou@vax.ox.ac.uk, price 50 pounds (35 pounds for members of ACH, ACL, ALLC) (make checks payable to Oxford University Computing Services).

References

Burnard, Lou. "Report of Workshop on Text Encoding Guidelines." Literary and Linguistic Computing, 3 (1988): 131-3.

Burnard, Lou. "What is SGML and How Does It Help?" TEI document TEI ED W25, available from TEI fileserver tei-l@uicvm, 1991.

Coombs, J.H., A.H. Renear, and S.J. DeRose. "Markup Systems and the Future of Scholarly Text Processing." Communications of the ACM, 30 (1987): 933-947.

Giordano, Richard. "The Documentation of Electronic Texts Using Text Encoding Initiative Headers: an Introduction." Library Resources and Technical Services, 38 (1994): 389-402.

Hockey, Susan. A Guide to Computer Applications in the Humanities. London: Duckworth and Baltimore: Johns Hopkins, 1980.

Hockey, Susan. "Electronic Texts in the Humanities: A Coming of Age." Forthcoming in Proceedings of Annual Clinic on Data Processing in Libraries 1994, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.

Hoogcarspel, Annelies. Guidelines for Cataloging Monographic Electronic Texts at the Center for Electronic Texts in the Humanities. CETH Technical Report No. 1, 1994.

Horowitz, Lisa. CETH Workshop on Documenting Electronic Texts, May 16-18, 1994. CETH Technical Report No. 2, 1994.

Lancashire, Ian (Ed.). The Humanities Computing Yearbook. Oxford: Oxford University Press, 1991.

Rubinsky, Yuri. "Electronic Texts the Day After Tomorrow." In Ann Okerson (Ed.), Visions and Opportunities in Electronic Publishing: Proceedings of the Second Symposium. Association of Research Libraries, 5-13, 1993.

Sperberg-McQueen, C. Michael. "Text in the Electronic Age: Textual Study and Text Encoding with Examples from Medieval Texts." Literary and Linguistic Computing, 6 (1991): 34-46.

Sperberg-McQueen, C. Michael and Lou Burnard. (Eds). Guidelines for the Encoding and Interchange of Electronic Texts. Chicago and Oxford: ACH, ACL, ALLC, 1994.

van Herwijnen, Eric. Practical SGML. Second edition. Dordrecht: Kluwer, 1994.
