[Mirrored from: http://www.acls.org/n44hock.htm]


Volume 4, Number 4 (February 1997)

Internet-Accessible Scholarly Resources
for the Humanities and Social Sciences

This issue focuses on the presentations of a program session on Internet-accessible scholarly resources held at the 1996 ACLS Annual Meeting.

Electronic Texts:
The Promise and the Reality

by Susan Hockey
Center for Electronic Texts in the Humanities

With the current level of interest in the Internet, World Wide Web, and digital libraries, electronic texts and the technology needed to use them are now at the center of the scholarly arena. But their use in humanities scholarship is not new. It is now almost fifty years since the first electronic text project began in 1948, when Father Busa started to prepare electronic texts of St Thomas Aquinas and related authors for his monumental Index Thomisticus (Note 1). Since then a considerable body of expertise has developed among scholars who work with electronic texts, but so far less of this expertise than one might hope has found its way into the texts available on the Internet today.

This piece concentrates on primary source texts, which are studied for many purposes and are the most varied in nature. The solutions to problems encountered in working with electronic primary source material apply just as well to simpler material such as journal articles and monographs. Humanities primary source texts may take the form of literary works, historical documents, manuscripts, papyri, inscriptions, coins, transcriptions of spoken texts, or dictionaries, and they may be in any natural language. Much of this material is complex. It may include variant readings, variant spellings, marginal notes, annotation of various kinds, cancellations, and interlineations, as well as nonstandard characters. Much of the scholarship carried out on this material consists of interpretation and annotation, often involving fine detail as well as broad overviews.

Representing this material in electronic form in such a way that it can be used for traditional scholarly processes in the humanities poses significant intellectual and technological challenges, especially when, as is often the case, scholars differ in their opinions about the material.

Electronic texts can be used for many purposes in the humanities (Note 2). Content retrieval, where, for example, the computer is instructed to find all the documents about a particular topic, is only one application. Other scholarly uses include rapid retrieval of words and phrases for the study of lexis and syntax, and the production of concordances and word frequency lists on which analyses of vocabulary, style, disputed authorship, and poetic rhythm and meter can be based. Other projects have used concordances in the compilation of historical dictionaries and for scholarly editions. These fairly mechanical processes are relatively well understood and have been used intelligently by many projects, but they have significant limitations. Words are seen only as sequences of letters. There are no facilities for separating homographs (e.g., "might" as an auxiliary verb and as a noun), for morphological analysis (e.g., for inflected languages such as Latin and Greek), for lemmatization (putting words under their dictionary headings), or for bringing together variant spellings of the same word.
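As a modern illustration (not the software any of these projects actually used), the mechanical processes just described can be sketched in a few lines of Python: a word frequency count and a simple keyword-in-context concordance. Note how both occurrences of the homograph "might" fall under a single heading, exactly the limitation described above.

```python
import re
from collections import Counter

def tokenize(text):
    # Words as plain letter sequences -- precisely the limitation noted
    # above: no lemmatization, no homograph separation, no normalization
    # of variant spellings.
    return re.findall(r"[a-zA-Z]+", text.lower())

def word_frequencies(text):
    """Word frequency list as a Counter."""
    return Counter(tokenize(text))

def kwic(text, keyword, width=3):
    """Keyword-in-context concordance lines: `width` words each side."""
    words = tokenize(text)
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

sample = "The king might act, but the king's might was spent."
freqs = word_frequencies(sample)
contexts = kwic(sample, "might", width=2)
```

The concordance treats the auxiliary verb and the noun "might" identically; separating them would require the kind of linguistic analysis the text notes is missing.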

Automatic concept-based searching, which is what most scholars really want to do, is still a long way off.

Markup makes explicit for computer processing things which are implicit for the human reader. We are of course most familiar with typographic markup, but this is ambiguous for most kinds of analysis of electronic texts (Note 3). For example, italics can be used for titles, foreign words, emphasized words, or stage directions. It would not be possible to use italic markup to make an index of titles or to exclude stage directions from a word search.

Structural markup distinguishes titles, foreign words, emphasized words, stage directions, and anything else. It can be used to refine searches and can also be mapped onto typographic markup for display or printing. HTML is a specific markup format which is interpreted by World Wide Web browsers to display text or provide links to graphics, sound, or other HTML files. HTML is a mixture of structural and typographic markup, but the structure which it uses is not rich enough for most kinds of searches or analyses of humanities texts.
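A small sketch in Python may make the distinction concrete. The tags below are invented for illustration (and the fragment is XML rather than full SGML, since Python's standard library parses XML): all three elements might display in italics, yet structural markup lets a program index the titles and exclude the stage directions from a word search.

```python
import xml.etree.ElementTree as ET

# A toy fragment with structural markup. All three tagged elements might
# render as italics, but each can be processed differently.
# (Illustrative tag names, not the actual TEI tag set.)
doc = ET.fromstring("""
<text>
  <p>She quoted <title>Paradise Lost</title> while
     <stage>walking upstage</stage> and murmured
     <foreign>carpe diem</foreign>.</p>
</text>
""")

# An index of titles -- impossible with italic markup alone.
titles = [t.text for t in doc.iter("title")]

# A word search that excludes stage directions.
def searchable_words(elem):
    words = []
    if elem.tag != "stage":          # skip stage directions entirely
        if elem.text:
            words += elem.text.split()
        for child in elem:
            words += searchable_words(child)
    if elem.tail:                    # text after the element belongs
        words += elem.tail.split()   # to the surrounding context
    return words

words = searchable_words(doc)
```

With typographic markup only, "walking upstage" would be indistinguishable from the title and the foreign phrase.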


SGML is a kind of computer language which permits the definition of many different markup schemes (Note 4). HTML is in fact a simple SGML-based scheme. The TEI has defined an SGML-based markup scheme which can handle many different types of humanities texts. The TEI Guidelines for the Encoding and Interchange of Electronic Text were first published in 1994 after six years of work based on the experiences of an international community of humanities computing scholars (Note 5).

The TEI scheme can be used in a very simple fashion, but it can also handle very complex texts and represent multiple and conflicting views within those texts. SGML makes the computer processing of texts easier because of the SGML Document Type Definition, a formal specification of the document structure which is used by any SGML-based computer program to derive information about what features the program can expect to find in the text. SGML files are important for the longevity of electronic texts since they are plain ASCII and can thus move very easily from one computer system to another. They can also be processed by any SGML-aware program and so the user of SGML texts is not dependent on any one software vendor. SGML is now being widely adopted in industry as companies realize the cost of moving from one software system to another. Many academic publishers are now also adopting SGML. Notable among these is Chadwyck-Healey, who use a version of the TEI for the English Poetry Database, the Patrologia Latina, and other textbases.

Creating an electronic text can be a very time-consuming process. Keyboarding is the best method to reach the levels of accuracy that humanities scholars would expect to find. Even on texts which are clearly printed and all in the same typeface, optical character recognition (OCR scanning) is rarely more than 99.9% accurate, that is, one error per 1,000 characters, or roughly one every 10-12 lines (Note 6). OCR systems typically have problems with early printed books, manuscripts, newspapers, dictionaries and bibliographies (where there are many changes of typeface), and microfilm prints. Even if the letters in the text are recognized accurately by the OCR program, work still needs to be done to create an accurate and usable text for processing. This may entail inserting markup and deleting extraneous information such as page numbers, line numbers, and other marginal information. It is worth noting that most of the major humanities electronic text projects (Thesaurus Linguae Graecae, Women Writers Project, Perseus, Oxford English Dictionary, etc.) do not use OCR, but prefer to keyboard their texts. OCR may be more appropriate for content-based document retrieval, where a completely accurate text is less important. In this case OCR is used together with digital images of the text: a search is performed on the uncorrected text, but the result is delivered to the user as a digital image of the page where the match is found. This is the approach taken by the JSTOR project funded by the Andrew W. Mellon Foundation, which will initially consist of about 10 journal titles in the areas of economics and history, comprising approximately 750,000 journal page images.
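The arithmetic behind these figures is worth spelling out; the characters-per-line and lines-per-page values below are assumptions chosen purely for illustration.

```python
# 99.9% accuracy means one error per 1,000 characters. At an assumed
# 85 characters per printed line, that is one error every 11-12 lines,
# and several errors on an assumed 40-line page.
accuracy = 0.999
errors_per_1000_chars = (1 - accuracy) * 1000

chars_per_line = 85                      # assumed typical printed line
lines_per_error = 1000 / chars_per_line  # roughly one error per ~12 lines

chars_per_page = chars_per_line * 40     # assumed 40-line page
errors_per_page = chars_per_page * (1 - accuracy)
```

Even this best-case rate leaves every page in need of proofreading, which is why the keyboarding route is often no slower in practice.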

Before the TEI there was no common methodology for describing and documenting electronic texts. Scholars who created electronic texts for their own purposes did not feel the need to prepare documentation, since they knew what the material was in any case. The large European archives of electronic text which began in the 1960s mostly use simple database tools to keep records of their holdings. A scholar looking for an electronic text of an author is particularly interested in what source text was used, what features are encoded in the text, and what changes have been made to the electronic text. None of these items of information fits well into library cataloging models such as MARC, which are also expensive to create and maintain. At the other extreme, the current version of HTML has very little provision for metadata, which is why the current WWW crawlers can perform only very crude searches. The TEI has developed a specification of an electronic text file header which provides all of this information and can also act as a work-sheet for a full catalog record if one is needed. Since it is also in SGML, the TEI header has the added advantage of being processable by the same software as the rest of the text. It can thus remain part of the text file rather than be separated from it. In an attempt to make some progress with metadata for WWW documents, in 1995 OCLC convened a meeting of experts who defined a set of core elements for the description of networked document-like objects, known as the "Dublin Core" (Note 7). A further meeting in early April 1996 discussed ways of implementing this, possibly using SGML.
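The idea of a header that travels with the text and is processed by the same software can be sketched as follows. The element names are loosely modelled on the kinds of information a TEI header records (source, encoding decisions, revision history) but are not the actual TEI tag set, and the content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Header and text live in one file and are parsed by one tool -- the
# metadata cannot become separated from the text it describes.
# (Illustrative element names and content, not real TEI markup.)
doc = ET.fromstring("""
<document>
  <header>
    <source>First folio edition, 1623</source>
    <encoding>Long s normalized to s; marginalia omitted</encoding>
    <revision>1996-04-01: proofread against source</revision>
  </header>
  <body>
    <p>To be, or not to be</p>
  </body>
</document>
""")

# The same parser serves cataloger and scholar alike.
source = doc.findtext("header/source")
encoding_note = doc.findtext("header/encoding")
body_text = doc.findtext("body/p")
```

A cataloger could generate a full catalog record from such a header, while an analysis program simply skips over it.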

Electronic texts are distributed either as raw text files, which then require a third-party program, or packaged together with a retrieval program. Most publishers' electronic text CD-ROMs (Chadwyck-Healey English Poetry Database, OED, CETEDOC CD-ROM, etc.) are packaged with proprietary software. It may take some time to learn how to use this software, and more experienced users often find that it does not answer some of the questions they want to ask. Libraries are also finding it expensive to support many different products. Raw text files need to be indexed for most applications; otherwise searches can be very time-consuming. Indexing large textbases with Open Text's Pat program or similar tools is not a trivial task, and the burden of deciding which options to choose in the indexing program, and thus what words can be retrieved, often falls on a computer systems person who may not fully understand what scholars want to do with the texts.
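A minimal inverted index makes visible the kind of decisions at stake. This sketch is purely illustrative and bears no relation to the internals of Pat or any other production indexing program; the point is that each option silently fixes what can later be retrieved.

```python
import re
from collections import defaultdict

def build_index(documents, keep_case=False, min_length=1):
    """Minimal inverted index: word -> set of document ids.

    The keyword options stand for the kinds of indexing decisions the
    systems person must make: fold case or not, drop short words or not.
    Whatever is decided here determines what a scholar can find later.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        if not keep_case:
            text = text.lower()
        for word in re.findall(r"[A-Za-z]+", text):
            if len(word) >= min_length:
                index[word].add(doc_id)
    return index

docs = {
    "letter1": "The harvest was poor this Year.",
    "letter2": "Prices rose after the poor harvest.",
}
index = build_index(docs)
```

With case folded at indexing time, a scholar interested in the capitalized form "Year" (say, as evidence of a scribal habit) can no longer retrieve it: the decision was made before any query was asked.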

The electronic texts which are available on the Internet today represent only a small portion of the total number of texts in existence. The World Wide Web's strength is that it can bring to the desktop material which would otherwise be inaccessible. However, the Web is basically a delivery mechanism: it provides only limited means of manipulating and analyzing texts. At some institutions the Web and HTML are used to deliver the results of retrieval programs which operate on a richer markup scheme such as the TEI. The results of these searches are normally converted ("dumbed down") to HTML for display to the user, but they can also be delivered in SGML for further manipulation (Note 8). The Web's limited facilities for metadata are an added problem for the humanities. Current Web technology also encourages the fragmentation of material into short files which may not map well to the logical structure of a text.

It is worth noting that some of the major electronic textbases in the humanities are not yet on the Internet. These include the Thesaurus Linguae Graecae, Packard Humanities Institute Latin texts, and the Women Writers Project. This is partly because of licensing issues but also because of the inadequacies of current searching tools.


For publication in electronic form, it is no longer necessary to write in a single linear sequence, as is the case for publication in print. Hypertext permits links between associated, yet randomly distributed, items of information (Note 9). It has become very popular in the humanities largely because it can model the way scholars follow a thread from one source document to another. Yet many hypertexts, including much material on the World Wide Web, at present use electronic material as if it were in print form. Some groups of scholars are beginning to devise publications which can only exist in electronic form. Foremost among these in the humanities are editions where the electronic form permits multiple versions of the same text, with links between transcriptions of the text, digital images of the manuscripts, and annotation of various kinds (Note 10).

The Model Editions Partnership (MEP) is a consortium of seven documentary editing projects, CETH, and the TEI (Note 11). Funded by the NHPRC, the MEP is developing a set of models for electronic historical editions which will exploit the electronic medium to address questions of intellectual access to source documents and their context while maintaining current standards of scholarly editorial excellence. Scholars will be able to go from a clear reading text to the most conservative diplomatic transcription at the click of a mouse. An editor can embed variants and footnotes within the text so that scholars can view or suppress them at will. Instead of being limited to a single organizing principle such as chronology, an edition can allow readers to organize documents dynamically into subsets relating to their own interests. Annotation can be linked to many documents and indexes prepared cumulatively. The MEP sees markup as the key to the functions it needs to provide and is creating a specialized version of the TEI encoding scheme. This will also enable print versions to be produced very easily if needed. In the first phase of the MEP, the three coordinators visited all seven partner projects to examine their methods of working. Building on the information acquired during these visits, the MEP then drafted a prospectus for electronic historical editions, which was circulated widely among the documentary editing community for comment. The markup scheme is now being drafted and will also be circulated and tested widely.
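One of these functions can be sketched in miniature: variants embedded in the transcription are shown or suppressed at display time. The tag names and text below are invented for illustration and are not the MEP's actual markup scheme.

```python
import xml.etree.ElementTree as ET

# Variants live inside the transcription; the display layer decides
# whether to show them. (Illustrative tags, not the MEP's markup.)
passage = ET.fromstring("""
<p>The delegates <reading>assembled</reading>
<variant>were assembled</variant> at noon.</p>
""")

def render(elem, show_variants=False):
    """Build a display string, including or suppressing variants."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "variant":
            if show_variants:
                parts.append(f"[var: {child.text}]")
        else:
            parts.append(child.text or "")
        parts.append(child.tail or "")
    # Normalize whitespace for a clean reading text.
    return " ".join("".join(parts).split())

clear_text = render(passage)                        # clean reading text
with_variants = render(passage, show_variants=True)  # scholar's view
```

The same encoded file yields both the clear reading text and the apparatus view; nothing is lost by suppressing what a given reader does not need.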

CETH is also participating in the Orlando project at the University of Alberta. Orlando is creating an Integrated History of Women's Writing in the British Isles, which will comprise four volumes and a chronology. SGML is being used throughout the project, including for note-taking as the research is carried out. The authors of the volumes will draw on a centralized SGML-encoded database of information as they write. The chronology can be produced almost automatically from this database, and it is envisaged that there will be several other spin-off hypertext products, all created by using the markup.

As the use of the Internet and electronic technology develops, we can expect to see many more projects like the MEP and Orlando, that is, projects which really use the medium in ways that would not otherwise be possible. At present much of the excitement about the Internet is generated by the easy access to material. As more material becomes available and more scholars use it, this excitement will begin to wear off. Usage will become commonplace, no different from how the library catalogue is used now. It is questionable how much of the current material on the Internet will be really usable and useful in this new environment. We must begin now to think about what humanities scholars will want to do in the twenty-first century and plan the building of electronic resources to meet those needs.

1. Father Busa's remarks on the lack of intellectual progress since he began his project in 1948 are salutary reading for those embarking on ambitious projects. See "Half a Century of Literary Computing: Towards a `New' Philology", in Reports of Colloquia at Tübingen, Literary and Linguistic Computing 7 (1992): 69-73. [Back to text.]

2. For an overview of electronic text applications in literature, see Susan Hockey, "Electronic Texts in the Humanities: A Coming of Age", pp. 21-34 in Literary Texts in an Electronic Age: Scholarly Implications and Library Services, edited by Brett Sutton, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. [Back to text.]

3. Renear, Allen, "Representing Text on the Computer: Lessons for and from Philosophy," Bulletin of the John Rylands University Library, 74 (1992), 221-248. [Back to text.]

4. Eric van Herwijnen, Practical SGML, second edition, Kluwer, 1994, is a good introduction to SGML. For a comprehensive review of the SGML world, consult the WWW site compiled and maintained by Robin Cover. [Back to text.]

5. Sperberg-McQueen, C.M. and Burnard, L. (eds), Guidelines for the Encoding and Interchange of Machine-Readable Texts, TEI document P3, Chicago and Oxford: ACH-ACL-ALLC, 1994. More information is available online at http://www.uic.edu/orgs/tei. See especially the document TEI U5, TEI Lite: An Introduction to Text Encoding for Interchange, for an introduction to the TEI SGML tag set. [Back to text.]

6. See Optical Character Recognition in the Historical Discipline: Proceedings of an International Workgroup, Netherlands Historical Data Archive, Nijmegen Institute for Cognition and Information, 1993 for a collection of papers on this topic. Most conclude that OCR is not suitable for humanities source material. [Back to text.]

7. The set of metadata elements defined by the first OCLC meeting has become known as the Dublin Core. See Stuart Weibel, Jean Godby, Eric Miller and Ron Daniel, OCLC/NCSA Metadata Workshop Report. [Back to text.]

8. Notable among these are the Universities of Michigan and Virginia. At the Center for Electronic Texts in the Humanities, we have developed a similar prototype system for the Princeton Library of Electronic Texts in the Humanities (PLETH). [Back to text.]

9. The Perseus project is perhaps the best-known humanities hypertext project. It is worth noting that Perseus uses SGML for the archival form of its data. [Back to text.]

10. The Literary Text in the Digital Age, edited by Richard Finneran (University of Michigan Press), contains a collection of essays on the potential of electronic editions. Peter Robinson's work, most recently with the electronic Wife of Bath's Prologue published on CD-ROM by Cambridge University Press, is particularly noteworthy. [Back to text.]

11. The MEP is directed by David Chesnutt, editor of the Papers of Henry Laurens. Susan Hockey, Director of CETH, and C. Michael Sperberg-McQueen, Editor of the TEI, serve as coordinators. The partner edition projects are The Documentary History of the First Federal Congress, The Documentary History of the Ratification of the Constitution and the Bill of Rights, The Papers of General Nathanael Greene, The Papers of Henry Laurens, Lincoln Legal Papers, The Papers of Margaret Sanger, and The Papers of Elizabeth Cady Stanton and Susan B. Anthony. [Back to text.]



Building the Scene: Words, Images, Data, and Beyond by David Green
Images on the Internet: Issues and Opportunities by Jennifer Trant
The World Wide Web as a Resource for Scholars and Students by Richard C. Rockwell
The American Arts and Letters Network (AALN) by Charles Henry
The National Initiative for a Networked Cultural Heritage (NINCH) by David Green
Because It's Time: A Commentary on the Program Session by Willard McCarty
Online Scholarly Resources Mentioned in this Issue

Visit the ACLS website for further information on the American Council of Learned Societies and its publications.