The Electronic Archive of Early American Fiction

[This local archive copy mirrored from the canonical site: http://etext.lib.virginia.edu/eaf/eaf-about.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

The Electronic Archive of Early American Fiction

at the University of Virginia

In 1996, the University of Virginia Library received a grant from The Andrew W. Mellon Foundation to digitize and put on the Web 558 rare volumes of early American fiction, and to study the economics of electronic versions of rare books. The texts chosen for the project include first printings of works by James Fenimore Cooper, Edgar Allan Poe, and Nathaniel Hawthorne, as well as novels and short stories by other authors. Two versions of each text will be made available: a TEI-conformant SGML-tagged text and color images of the pages of the first editions--a total of 118,000 pages. The project will conclude in 1998 with an economic study of usage of the e-texts compared with usage of the original rare books.

The University of Virginia Library (UVA) is in the process of creating electronic texts of rare books, and will be comparing the usage and costs of electronic texts with the usage and costs of original paper texts of rare books. Called the Early American Fiction project, the work is focusing on e-texts of a well-defined and comprehensive collection of early American fiction.

In the U.S. and Canada the largest university libraries house over a million rare books. From the ancient library in Alexandria, Egypt, to the present time, preserving unique and rare books has been taken as a principal raison d'être of research libraries. But this objective is a costly part of the libraries' mission. And from the standpoint of the patron, using original rare books requires traveling to the places where the physical objects are held.

The World Wide Web offers the possibility of greatly expanded access to computer versions of rare books. The electronic versions offer the added value that every word in the rare books can be indexed and searched. It is possible in an online collection of early American fiction to find in seconds every instance of a word or concept, while in the original rare books such a search might take years. And while computer images of rare book pages alone can only serve as pointers to the rich actuality of the original physical artifacts, the combination of searchable text and high-resolution color page images provides a detailed and flexible view of the material to teacher and scholar alike.

Early American Fiction presents the opportunity to study scholarly use of original rare books and of their computer simulacra, and to determine the extent to which electronic texts of rare books can serve scholars. We expect to create an online collection, to focus on use by faculty and student scholars, and to obtain objective data supporting reliable comparisons of usage of e-texts with usage of original rare books.

The Early American Fiction Collection

The 75 years up to 1850 produced some masterpieces of the American novel and short story, such as James Fenimore Cooper's The Last of the Mohicans, Edgar Allan Poe's Tales of the Grotesque and Arabesque, and Nathaniel Hawthorne's The Scarlet Letter. In addition to these well-known works, there were also numerous publications popular in their own time but now forgotten. All of these works, however, cast light on the early days of the United States, and they are worth studying in order to examine patterns of thought when the U.S. was still young.

Two standard bibliographies describe classic American literature:

Wright's American Fiction 1774-1850 lists all works of fiction published from the first story up to 1850.¹

Bibliography of American Literature (BAL) lists the original editions of the most important authors of American literature, as chosen by a committee of the Modern Language Association of America.²

The University of Virginia Library is fortunate to have two of the world's major collections of rare first editions of American fiction in its Barrett and Taylor collections. Most of the first editions listed in Wright and BAL are available in these collections, and for some editions UVA holds one of the few existing copies. In the EAF project, therefore, we are using first editions from UVA that meet the following criteria:

the author is in BAL;
the edition is listed in Wright;
UVA has a first edition of the work.

When these criteria are applied, the project will cover 421 titles in 558 volumes by 81 authors, containing 118,000 pages.

The Production of Digital Images of Rare Books

There are two major tasks in the creation of the electronic archive of early American fiction. First, we are making digital images of every page of each of the 558 volumes. Second, these page images are being converted to SGML-tagged ASCII text. The final product will be both digital images and searchable tagged texts of every book in the project.

Often, non-rare books can be scanned on a flat-bed scanner, and sometimes the ASCII text can be created by Optical Character Recognition from the digital images. Rare books present the special problem that the physical books must be handled with extreme care. Most rare books cannot be scanned on a flat-bed scanner, for example, because on such a scanner their spines or pages might be damaged. Some of the rare books are so fragile, in fact, that they cannot be opened wider than about 120 degrees. As a result, this project has needed to develop methods for large-scale production of digital images of fragile materials. This part of the project is being carried out in the UVA Library's Special Collections Digital Center.

The imaging work is being done with digital cameras mounted above light tables. The camera backs that we are using are manufactured by Phase One, and provide a maximum resolution of 5000 x 7000 pixels with 24-bit color. The backs are attached to Tarsia Technical Industries Prisma 45 4x5 cameras, on TTI Reprographic Workstations. To protect the books, we use book cradles specially designed for rare books. A full-time digitizing supervisor and part-time student assistants keep the cameras busy all day long, in order to stay on our project schedule. Over an eight-hour day, each of the two cameras averages about 20 images per hour, or one image every three minutes per camera. Of course, within this average there is a range of imaging speeds that depends on the type of book, among other things. For example, a book that is tightly bound or exceptionally fragile takes longer to digitize than a book that can be opened wider on the book cradle.
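As a rough check on the schedule, the quoted imaging rate can be turned into an estimate of total camera time. The sketch below is illustrative only: it assumes the 118,000-page count as a lower bound on the number of images (covers, spines, and test sheets add more), and it assumes every working day reaches the average rate. The function name and parameters are hypothetical.

```python
def imaging_days(images, cameras=2, images_per_hour=20, hours_per_day=8):
    """Estimate the working days needed to photograph `images` pages.

    Defaults come from the figures quoted in the text; treating every day
    as an average day is a simplifying assumption.
    """
    images_per_day = cameras * images_per_hour * hours_per_day
    return images / images_per_day


# 118,000 pages is the page count given for the 558 volumes.
print(f"{imaging_days(118_000):.0f} working days")
```

At 320 images per day this comes to roughly 370 working days of two-camera operation for the pages alone.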

Since a goal of Early American Fiction is to test whether digital versions of rare books can substitute for the original editions in some cases, we are methodically making images of all parts of every book--the spines, front and rear covers, and all pages of the book, including copies of any blank pages. For each book we do a test sheet on which we film the cover of the book with a ruler and Kodak grayscale and color strips for color comparisons. From the EAF copy of a book, it should be possible to get an idea of the appearance of every part of the book. In the future, we hope to use these images to create virtual reality images of each of the rare books.

The pages are being scanned in 24-bit color at a resolution of 500 dots per inch. The images are saved as TIFF files for long-term storage. Each TIFF is then converted by Photoshop 4.0 software into a high-quality JPEG image and a smaller-size JPEG for display on the World Wide Web. Each page image requires 20-25 megabytes or more for the TIFF file. The JPEG images are approximately 300 and 100 kilobytes each. On the average, therefore, a book in this project requires at least 4.2 gigabytes for storing its TIFF images; 63 megabytes for its 300KB JPEGs; and 21MB for its 100KB JPEGs. The entire collection of 558 volumes should require about 2,500 gigabytes of storage for TIFFs, 37GB of storage for the high-quality JPEGs, and about 12GB of storage for the smaller JPEGs.
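The storage figures above follow from the page count and the per-image sizes. The sketch below reproduces the arithmetic under stated assumptions: decimal units (1 GB = 1000 MB = 1,000,000 KB), the 20 MB lower bound for each TIFF, and an even spread of the 118,000 pages over the 558 volumes. The function and key names are illustrative.

```python
PAGES = 118_000
VOLUMES = 558


def storage_estimate(tiff_mb=20.0, jpeg_large_kb=300.0, jpeg_small_kb=100.0):
    """Return (per_volume, collection) storage estimates as dictionaries."""
    pages_per_volume = PAGES / VOLUMES  # about 211 pages per volume
    per_volume = {
        "tiff_gb": pages_per_volume * tiff_mb / 1000,
        "jpeg_large_mb": pages_per_volume * jpeg_large_kb / 1000,
        "jpeg_small_mb": pages_per_volume * jpeg_small_kb / 1000,
    }
    collection = {
        "tiff_gb": PAGES * tiff_mb / 1000,
        "jpeg_large_gb": PAGES * jpeg_large_kb / 1_000_000,
        "jpeg_small_gb": PAGES * jpeg_small_kb / 1_000_000,
    }
    return per_volume, collection


per_volume, collection = storage_estimate()
for label, values in (("per volume", per_volume), ("collection", collection)):
    print(label, {name: round(value, 1) for name, value in values.items()})
```

With the 20 MB lower bound this gives about 4.2 GB of TIFFs per volume and about 2,360 GB for the collection; the 2,500 GB figure quoted above corresponds to an average TIFF size a little over 21 MB per page.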

Both the TIFF and the JPEG images are stored on writeable CD-ROMs. The TIFF CD-ROMs are used for archival storage of those images, and for additional security EXABYTE tapes are used to back up the CD-ROM TIFF images.

The Production of SGML-Tagged Electronic Texts

To summarize the preceding section: the 558 volumes that comprise the print originals for this electronic text collection will be scanned as high-quality color images, and then "working copy" JPEG files will be created from the archival TIFF image original. These copies are the source for the keyboarding of the text by a commercial service bureau. This method is much cheaper than transcribing the documents ourselves, and much faster. The work is done in large professional typing operations overseas, and is the way in which practically all large commercial e-text projects are created. This part of the project will be overseen by the UVA Library's Electronic Text Center.

Both optical character recognition (OCR) and keyboarding are variously useful and reliable methods for the creation of machine-readable transcriptions of print materials. For Early American Fiction, the obvious practical choice is to use a commercial keyboarding company, because of both the physical nature of the source material and its bulk.

OCR works by taking a digital image of a page of type and interpreting the shapes on it, turning clusters of image pixels into ASCII characters. OCR works well with modern typefaces, and often copes reasonably well with later 19th-century printed matter, but its effectiveness decreases with earlier material. The ability of the software to recognize letters is principally challenged by printing flaws that disrupt the integrity of the letter form, such as uneven inking and broken type; and such features are typical of earlier printed material. For sample typefaces and the results of OCR on them, see the following Web site:

http://etext.lib.virginia.edu/helpsheets/scan-train.html

Even with clean modern type, one or two errors per page are not atypical with OCR. This makes it suitable for small runs of materials, especially when they can be effectively corrected using a modern spell-checker. But both the bulk and the non-20th-century spelling of the Early American Fiction texts would make OCR output cumbersome to correct, and their typographical features would result in more numerous errors than we would see in modern print items.

The workflow for creating e-texts is as follows:

Using images of the pages, the vendor types in all texts at least twice, and the two versions are electronically compared to catch discrepancies. This double-keying is the accepted "standard" in the humanities text industry at present, and is the way in which works such as the Oxford English Dictionary have been prepared. Double-keying results in an accuracy rate of 99.995% or better.

As the texts are created, Standard Generalized Markup Language (SGML) tagging is added to record the physical and structural characteristics of the text: title-page layout, pagination, paragraphs, verse lines, italics, accented letters, etc. The vendor checks the accuracy of the tagging with a computer program that makes sure the tags are properly formed.

Each page of the electronic text has the location of its corresponding image marked, so that the two can be linked together hypertextually.
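The electronic comparison of two keyed versions described in the workflow above can be illustrated with a short sketch. This is not the vendor's software, whose details are not given here; it is a minimal example that uses Python's standard difflib module, and the function name and sample sentences are invented for illustration.

```python
import difflib


def keying_discrepancies(first_keying, second_keying):
    """List the places where two independent transcriptions disagree.

    Returns tuples of (line number in the first keying, differing lines
    from the first keying, differing lines from the second keying).
    """
    a = first_keying.splitlines()
    b = second_keying.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    discrepancies = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append((a_start + 1, a[a_start:a_end], b[b_start:b_end]))
    return discrepancies


# Invented sample text: the second typist keyed "gold" for "cold".
first = "It was a dark night.\nThe wind blew cold.\n"
second = "It was a dark night.\nThe wind blew gold.\n"
for line_no, from_first, from_second in keying_discrepancies(first, second):
    print(line_no, from_first, from_second)
```

A comparison of this kind flags only the places where the two typists disagree; an error keyed identically in both versions passes unnoticed, which is one reason spot-checking remains useful. For scale, an accuracy rate of 99.995% corresponds to at most one error in every 20,000 characters.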

The texts are returned as they are finished to UVA, where the following happens:

They are spot-checked -- sections are proof-read -- to verify accuracy of input.
The SGML encoding is checked for completeness.
For each work, a standard bibliographical header is created by staff in the Electronic Text Center following the Text Encoding Initiative's (TEI) Guidelines for Electronic Text Encoding and Interchange (P3). The header records all the details of the print source and of the electronic version, and includes some keyword information that will be valuable as the items are searched.
For each pictorial illustration in a work, a standardized description of type and subject matter is created.
The final text is re-parsed, indexed, and put online on the World Wide Web.
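The tag checks mentioned above (the vendor's check that tags are properly formed, and the re-parse at UVA) would in practice be done with an SGML parser working against the TEI document type definition. As a much simpler illustration of the idea, the sketch below checks only that paired tags are properly nested. It assumes fully tagged text with no omitted end tags, and it treats pb and lb (the TEI page-break and line-break elements) as empty elements; the function name and sample strings are illustrative.

```python
import re

# Matches a start tag or an end tag and captures the element name.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")


def check_nesting(text, empty_elements=frozenset({"pb", "lb"})):
    """Return a list of nesting problems; an empty list means none found."""
    stack = []
    problems = []
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in empty_elements:
            continue  # empty elements have no end tag to pair with
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}> at offset {match.start()}")
    problems.extend(f"unclosed <{name}>" for name in reversed(stack))
    return problems


print(check_nesting('<p>Some <hi rend="italic">text</hi><pb n="2"> more</p>'))
print(check_nesting("<p>Some <hi>text</p>"))
```

The first sample is properly nested and yields an empty list; the second reports the misplaced end tag and the elements left open.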

The final online texts will be searchable, like the other texts at the E-Text Center Web site.

Future Directions for the EAF Project

As online texts become available in the EAF project, two issues will be the focus of attention: cost recovery for the project and measurement of the usage and benefits of the texts.

We have already begun to address the questions of cost recovery by arranging with a commercial publisher to make the texts available. We have contracted with Chadwyck-Healey Inc. to publish Early American Fiction on CD-ROM and to make the collection available on the World Wide Web as part of Chadwyck-Healey's Literature Online (LiOn) Web service. Through this arrangement we hope to recover at least some of our investment in the EAF texts, in order to regain funds to use for creating additional e-texts -- probably of American fiction published from 1851 to 1900.

A critical part of the EAF project is assessment of the costs and usage of the texts, as compared with costs and usage of the original rare books. We plan to test the hypothesis that for most uses of rare books, high-quality electronic texts and digital images are adequate substitutes.

In order to survey EAF users, we will put on the World Wide Web in the spring of 1998 a forms-based survey instrument. The online questionnaire will be used to collect information on the demographics, knowledge, attitudes, and behavior of users of the online American fiction. Questions about behavior, for example, will focus on use of the searchable ASCII texts vs. the images of pages of the books. Other questions will elicit information on perceptions of ease of use of online texts compared with original paper texts. People who have used the online American fiction will be asked to fill in the questionnaire. Standard procedures for ensuring the reliability and validity of survey results will be followed, such as follow-up of non-respondents.

These data will be compared with data collected in a survey in spring 1998 of users of original rare books in the UVA and other libraries. We expect to do a sample survey of users, again covering the topics of demographics, knowledge, attitudes, and behavior. The questions on this survey will be designed to mirror those on the online survey. For example, users will be asked if they are familiar with the online texts and if those texts could satisfy their needs. Demographic questions will elicit information on topics like distance traveled to use the original texts; this information can be used for inferences about costs.

From the segment of the project on measurement, the data should be available to allow us to consider costs per use. Consideration of costs needs to take account of the traditional way of getting at rare books: users travel to where the books are and look at them in a library. That is, the costs of access are mainly paid by the individual user. Even here, however, the maintenance of rare books imposes special burdens and costs on research libraries, especially the older and larger research libraries.

For example, in 1994-95 the unit cost of a purchased monograph in U.S. university libraries was $45.07.³ In the same year, the UVA Library spent an average of $373 for each purchased rare book. So the typical rare book costs over 8 times as much as an ordinary monograph. And this initial cost disparity persists throughout the life of the books. Conventional wisdom is that it costs three times as much to house and maintain a rare book as a regular book. So in a cost-per-use model, a typical rare book would need to be used 3 to 8 times as much as a regular monograph for the unit costs of acquisitions and maintenance to be equal. But in fact, the per-volume circulation of rare books is considerably lower than the per-volume circulation of ordinary monographs. As a result, the costs of acquiring, maintaining and providing access to rare books are disproportionately high for research libraries; and for users there is also a cost differential to use rare materials.
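The "3 to 8 times" range follows directly from the two cost ratios just given. The sketch below works through the arithmetic; the function name is illustrative, and the threefold maintenance figure is the conventional-wisdom estimate cited in the text, not a measured value.

```python
def break_even_use_multipliers(rare_purchase=373.00,
                               monograph_purchase=45.07,
                               maintenance_multiplier=3.0):
    """Return (low, high) multiples of use at which unit costs are equal.

    The low end comes from the maintenance cost ratio and the high end
    from the purchase cost ratio, as in the paragraph above.
    """
    acquisition_multiplier = rare_purchase / monograph_purchase
    return maintenance_multiplier, acquisition_multiplier


low, high = break_even_use_multipliers()
print(f"{low:.1f}x to {high:.1f}x")  # prints "3.0x to 8.3x"
```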

Though the initial cost of creating an e-text and image facsimile of a rare book is also high (about $1,000), the e-text offers considerably greater opportunities for distributed use, and thus much lower unit costs per use. From the standpoint of patrons, the cost of traveling to UVA to see the first edition of The Scarlet Letter may be avoided through the availability of an online version. And the online edition also accommodates many classes of users -- such as high school students -- who could never access the original text because the cost to them is too high. It is significant that 70% of the books to be put online in the EAF project are not in print, and are available in only a few university libraries.

Early American Fiction therefore offers an opportunity for testing whether principles of digital libraries can be applied to a very traditional area of librarianship, the area of rare books and special collections.

A version of this paper was delivered by Kendon Stubbs and David Seaman at the Digital Library Workshop, University of Library and Information Science, Tsukuba Science City, Japan; and at Keio University, Tokyo, Japan (March 1997).

1. Lyle H. Wright. American Fiction 1774-1850: A Contribution Toward a Bibliography. San Marino, CA: Huntington Library, 1969. Second revised edition.

2. Bibliography of American Literature, compiled by Jacob Blanck for the Bibliographical Society of America. New Haven: Yale University Press, 1955-1990. 9 volumes.

3. ARL Statistics 1994-95. Washington: Association of Research Libraries, 1996. Page 46.