CETH Newsletter, Fall 1994

Summary of CETH Workshop on Documenting Electronic Texts

May 16-18, 1994 -- Radisson Hotel, Somerset, NJ

by Lisa R. Horowitz

In May 1994, the Center for Electronic Texts in the Humanities (CETH) sponsored an invitational workshop on documenting electronic primary source materials in the humanities. The goal of the workshop was to work toward a clearer understanding of the relationship between the TEI header, the MARC record, and the current international cataloging rules, with an objective of establishing how far they meet the needs of scholars, librarians, publishers and software developers who work with these materials.

Background Information

Since people may be unfamiliar with either the TEI or MARC, a brief explanation of each is in order. The Text Encoding Initiative (TEI) is a major international project to develop and disseminate guidelines for the interchange of machine-readable texts among researchers in the humanities, and to make recommendations for the encoding of new texts. It is sponsored jointly by the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). The guidelines were formally published as Guidelines for Electronic Text Encoding and Interchange (referred to as TEI P3) in May 1994. In these guidelines, a structure is proposed for an electronic text "header," somewhat equivalent to a printed book's title page. The header contains the traditional elements of a title page, such as title, author, and publisher. However, it also includes more extensive information specific to the electronic texts used by humanities scholars, such as explanations of the text encoding (e.g., what features were marked up: proper names, abbreviations, quotations, foreign words, bibliographic references, editorial comments, etc.); its source, if it was transcribed; and a revision history (who did what to the file when). This header can alleviate problems caused by the lack of documentation so common in informally-created (and even some formally-created) electronic texts.
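
To make the idea of a header concrete, the following is a minimal sketch in Python, not an example taken from the TEI guidelines. The element names follow the main divisions of the TEI header (file description, encoding description, revision description), but the text, author, publisher and revision entry are invented for illustration, the content of each element is simplified, and the header is written as XML so that a standard parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical header for an invented text. Element names follow the TEI
# header's main divisions; all content is illustrative and simplified.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Text: an electronic transcription</title>
      <author>Example, Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Archive</publisher>
    </publicationStmt>
    <sourceDesc>
      <p>Transcribed from a hypothetical 1850 printed edition.</p>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <p>Proper names, abbreviations and quotations are marked up.</p>
  </encodingDesc>
  <revisionDesc>
    <change>1994-05-16, A. Editor: corrected transcription errors.</change>
  </revisionDesc>
</teiHeader>
"""


def summarize_header(xml_text):
    """Collect the title-page details plus the electronic-text extras."""
    root = ET.fromstring(xml_text.strip())

    def text(path):
        # Return the stripped text of the first matching element, or None.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text("fileDesc/titleStmt/title"),
        "author": text("fileDesc/titleStmt/author"),
        "publisher": text("fileDesc/publicationStmt/publisher"),
        "source": text("fileDesc/sourceDesc/p"),
        "encoding": text("encodingDesc/p"),
        "revisions": [c.text.strip() for c in root.findall("revisionDesc/change")],
    }


summary = summarize_header(HEADER)
for key, value in summary.items():
    print(key, "->", value)
```

The first three entries correspond to what a title page would carry; the source, encoding and revision entries are the additional documentation that a printed book has no place for.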

MARC (MAchine-Readable Cataloging) is a definition of a structure for formatting data, originally designed as a standard for the bibliographic information found in library catalogs, although its uses have multiplied. In each record, MARC defines fields and subfields which represent specific kinds of information and which require certain syntax. Software for online library catalogs is based on MARC record formats. The software will generate a screen, based on each field's syntax and content, which is easy for users to read and understand.
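
The field-and-subfield structure, and the way catalog software turns it into a readable screen, can be sketched roughly as follows. This is an assumption-laden simplification: the tags 100, 245 and 260 are the MARC fields for main entry (personal name), title statement and publication information, but the record content, the Field class, the display labels and the display function are all hypothetical, and real MARC records also carry a leader, indicators and many other fields that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class Field:
    tag: str          # three-digit MARC field tag
    subfields: list   # list of (subfield code, value) pairs


# Invented record for illustration only.
RECORD = [
    Field("100", [("a", "Example, Author.")]),
    Field("245", [("a", "An example text /"), ("c", "Author Example.")]),
    Field("260", [("a", "Somewhere :"), ("b", "Example Text Archive,"), ("c", "1994.")]),
]

# Hypothetical labels a catalog screen might show for each tag.
LABELS = {"100": "Author", "245": "Title", "260": "Published"}


def display(record):
    """Render each field as one labelled line, joining its subfields."""
    lines = []
    for field in record:
        label = LABELS.get(field.tag, field.tag)
        value = " ".join(value for _, value in field.subfields)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


print(display(RECORD))
```

The punctuation stored inside the subfields ("/", ":", ",") is part of the syntax the fields require, which is why simply joining the subfields already yields a line a patron can read.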


Originally, a workshop was proposed that would address issues surrounding the cataloging of electronic texts. As plans developed, the purpose of the workshop was clarified and broadened to include the convergence and/or divergence of MARC and TEI, because it became clear to CETH that the goals of these two electronic standards were similar, but the extent of the relationship between them was unclear. It was even possible to envision a single electronic file that would represent all the bibliographic information contained in the TEI header while also functioning as an access point, the way the MARC record currently does.

Two other issues that CETH considered important to both catalogers and users of the TEI guidelines were what constitutes a new edition of an electronic text and what the requirements are for new electronic material. It was hoped that workshop participants would confront these issues and examine the implications for the computing, publishing, humanities, and library communities.

The Program

In preparation for the program, participants were asked to read a number of materials introducing the subjects under discussion. To illustrate the cataloging issues, the newly printed Guidelines for Cataloging Monographic Electronic Texts at the Center for Electronic Texts in the Humanities (informally called the CETH cataloging guidelines) and a booklet explaining USMARC formats (the MARC standard used in the United States) were included. To give background on TEI and SGML, three chapters from TEI P3 were included which explained SGML (Standard Generalized Markup Language, the markup language on which the TEI guidelines are based) and the TEI header. Additionally, the TEI header of an electronic text was included, with a related MARC record and a catalog record as might be viewed by a library patron.

The one-and-a-half day program combined a great deal of new information with much discussion. For the first part of the workshop, experts knowledgeable about SGML, TEI and MARC presented overviews to ensure that all workshop participants, most of whom were expert in one or two of these fields, had a background in all three. The overviews began with an introduction by Allen Renear of Brown University on the needs of the humanities scholar. Michael Sperberg-McQueen of the University of Illinois at Chicago, one of the editors of TEI P3, gave a general introduction to SGML and its use in humanities materials, followed by Rich Giordano of the University of Manchester, England, a member of the TEI Text Documentation Committee, who discussed the purpose and contents of the TEI header. Clifford Lynch of the University of California described how the TEI guidelines could benefit networked resources, and what needs to be developed to link bibliographic information with actual locations on the Internet. Randall Barry of the Library of Congress gave the group an introduction to MARC formats.

Following the overviews, three presentations described projects that applied the principles of TEI and MARC to electronic texts. Dominic Dunlop explained how the British National Corpus, a national corpus of language used in writing, reading and speaking, used TEI headers, describing the difficulties the header presents for spoken texts and the issues of generating bibliographic records from the information contained in the headers. Daniel Pitti of the University of California demonstrated the Berkeley Finding Aids Project, a DynaText-based prototype interface used to search and examine unpublished collections of primary source materials at Berkeley. (DynaText is an SGML-aware browser produced by Electronic Book Technologies.) John Price-Wilkin and Edward Gaynor described the process used by catalogers at the University of Virginia to create TEI headers for electronic texts held by the university. The catalogers then use those headers to create MARC records.

The rest of the program was devoted to discussion. The participants were divided into four groups of approximately twelve people each. An attempt was made to include people with different perspectives in each group. Each group's moderator kept the discussion focused on a prescribed set of questions. A reporter recorded his or her group's viewpoints for presentation at the final discussion. Break-out discussions lasted four hours, followed by a general discussion of an hour and a half. All four groups discussed the same set of questions. The questions covered topics such as the information needed to describe or access electronic texts, the special needs required by various formats such as images or sound as opposed to text, and dealing with different versions or editions of electronic materials.

Conclusions and Recommendations

Generally, the groups came to very similar conclusions, an outcome which was unexpected considering the wide range of perspectives held by the participants. A number of key points were raised in the discussions. The consensus was that the TEI header is an invaluable tool for controlling and managing the information related to an electronic text. However, it was agreed that the people who know the most about the electronic text are not the catalogers but the creators of the text, and that a way must be found to motivate scholars and publishers to include headers in their texts. Catalogers could then perform authority work on the information found in the header, the same way that they do for a title page of a printed book. The possibility of mapping directly from the TEI header into a MARC record was seriously considered and supported. However, although the prospect of using just one electronic record for use both as a "title page" to the document and as a way to access materials was considered by some of the groups, it was only acceptable to a minority of participants. The main closing recommendation was that more meetings must take place between people of many different backgrounds.
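
A direct mapping of the kind the groups discussed might look something like the sketch below. It is not a mapping proposed at the workshop: the crosswalk table, the header_to_marc function and the sample header are hypothetical, and only three elements are covered. The output is a draft on which a cataloger would still need to perform authority work, as noted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: path inside the header -> (MARC tag, subfield code).
CROSSWALK = {
    "fileDesc/titleStmt/title": ("245", "a"),
    "fileDesc/titleStmt/author": ("100", "a"),
    "fileDesc/publicationStmt/publisher": ("260", "b"),
}

# Invented header content, written as XML so a standard parser can read it.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Text</title>
      <author>Example, Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Archive</publisher>
    </publicationStmt>
  </fileDesc>
</teiHeader>
"""


def header_to_marc(xml_text):
    """Return draft (tag, subfield code, value) triples in tag order."""
    root = ET.fromstring(xml_text.strip())
    fields = []
    for path, (tag, code) in CROSSWALK.items():
        node = root.find(path)
        # Skip elements that are absent or empty rather than guessing.
        if node is not None and node.text:
            fields.append((tag, code, node.text.strip()))
    return sorted(fields)


for tag, code, value in header_to_marc(SAMPLE_HEADER):
    print(f"{tag} ${code} {value}")
```

Header information with no obvious bibliographic counterpart, such as the encoding description or revision history, is deliberately left out of this sketch; how to carry it into a catalog record is one of the open questions such a mapping would have to settle.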


A full report of the workshop was published by CETH in early fall as CETH Technical Report #2.

CETH is developing its research program on documenting electronic texts in the humanities and expects to organize future workshops on this topic. Your comments on the topic, addressed to CETH, are welcomed.

