[Mirrored from: http://www.lib.virginia.edu/speccol/mellon/texts.html]

Producing SGML-Tagged ASCII Texts


The 582 volumes that make up the print originals for this extraordinary electronic text collection will be scanned as high-quality color images, and "working copy" JPEG files will then be created from the archival TIFF originals. These copies are the source for the keyboarding of the text by a commercial service bureau. This method is much cheaper and much faster than transcribing the documents ourselves. The work is done in large professional typing operations overseas, and this is the way in which practically all large commercial e-text projects are created. This part of the project will be overseen by David Seaman, Coordinator of the Library's Electronic Text Center, and by David Gants, Assistant Coordinator of the Center; any post-creation processing work will be done by student assistants in the Center.

Both optical character recognition (OCR) and keyboarding are variously useful and reliable methods for creating machine-readable transcriptions of print materials. For the early American fiction project, the obvious practical choice is a commercial keyboarding company, because of both the physical nature of the source material and its bulk.

OCR works by taking a digital image of a page of type and interpreting the shapes on it, turning clusters of image pixels into ASCII characters. OCR works well with modern typefaces, and often copes reasonably well with later 19th-century printed matter, but its effectiveness decreases with earlier material. The software's ability to recognize letters is challenged principally by printing flaws that disrupt the integrity of the letter forms, such as uneven inking and broken type, and such flaws are typical of earlier printed material. For sample typefaces and the results of OCR on them, see the following Web site:

http://etext.lib.virginia.edu/helpsheets/scan-train.html

Even with clean modern type, one or two errors per page are not atypical with OCR. This makes it suitable for small runs of material, especially when the errors can be corrected effectively with a modern spell-checker. But the sheer bulk of the American Fiction to 1850 texts and their pre-twentieth-century spelling would make such correction cumbersome, and their typographical features would produce more numerous errors than we would see in modern print items.
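
For illustration only -- the project itself relies on keyboarding rather than OCR -- the following minimal sketch shows the kind of recognition step described above. It assumes the open-source Tesseract engine, its Python wrapper pytesseract, and the Pillow imaging library; none of these tools is named in this proposal, and the file name is hypothetical.

    from PIL import Image          # image handling
    import pytesseract             # wrapper around the Tesseract OCR engine

    # Open a scanned page (hypothetical file name) and let the OCR engine
    # interpret the letter shapes on it as ASCII characters.
    page = Image.open("page_042.jpg")
    text = pytesseract.image_to_string(page)

    # Uneven inking and broken type in early printed matter typically surface
    # here as misread or dropped characters, which is why older material
    # performs poorly.
    print(text)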

The keyboarding companies provide highly accurate text (one error per 20 pages on average) and add most of the SGML encoding as they type. And they produce the output quickly. To duplicate that accuracy in-house we would need a large and expensive team of staff to scan, tag, and proofread -- and the job would take much longer. The team would need to be assembled, trained, and overseen, all of which would add to the cost. Multiple computers, OCR software packages, and scanners would have to be bought, as well as separate machines for the proofreading stage. Some of the texts in this project would perform so poorly as input for OCR that they would have to be typed manually. Even using student labor, and leaving space and equipment costs out of the equation, we could not come close to the $1.97 per page average commercial charge for data entry, rigorous proofing, and SGML markup.

To date, none of the large full-text literary databases produced commercially has been generated using OCR; publishers such as Chadwyck-Healey and Oxford University Press have concluded, as we have, that the use of a keyboarding service is the cheapest option.

Concerning personnel, the project will become part of the ongoing work of two of our existing electronic centers, the Special Collections Digital Center and the Electronic Text Center. Both Centers are already engaged in the kind of work that this project will involve: using the Kontron camera to make digital images of rare books and processing SGML-tagged ASCII texts for our World Wide Web e-text site.

In the Special Collections Digital Center, the fiction project will follow already established routines for digitizing rare books. The Center's coordinator, Edward Gaynor, will be available for policy questions and overall direction, which will require about 15% of his time. The day-to-day work of supervising the student assistants who are using the cameras to make images of the book pages will fall to the full-time Digitizing Supervisor. This person, not yet hired, will be responsible for seeing that schedules are maintained in the process of making digital images, saving them on disk, and transferring the files to CD-ROMs.

The Electronic Text Center will receive the page images from the Special Collections Center on CD-ROMs and will send the CD-ROMs to the vendor who will create SGML-tagged texts and return the texts to the Etext Center. The returned texts will then be checked, cataloging headers will be prepared, and the texts will be added to the Etext Center's Web site. This kind of process is already being carried out in the daily work of the Etext Center, which has processed some 10,000 texts for the World Wide Web. The Coordinator of the Etext Center, David Seaman, can add the American fiction project to his purview with an expenditure of 15% of his time per week. As part of their responsibilities, Mr. Gaynor and Mr. Seaman will be available to speak for the project and to answer queries about it.

The workflow for creating e-texts and the costs are as follows:

U.Va. creates the page images as archival-quality 24-bit color TIFF files, and then makes JPEG working copies. All pages, as well as the cover and spine, will be digitized for a full visual record of the book.

  • These images are backed up onto writable CD-ROMs.

  • U.Va. also provides a CD-ROM with JPEG page images to the vendor.

  • The vendor prints out page images.

  • The texts are all typed in twice, and the two versions are electronically compared to catch discrepancies (a minimal illustration of such a comparison follows this list). This double-keying is the accepted "standard" in the humanities text industry at present, and is the way in which works such as the Oxford English Dictionary have been prepared. The result is an accuracy rate of 99.995% or better, which means the final text will exhibit typographic errors at a rate of approximately one per twenty pages.

  • As the texts are created, standard SGML markup is added to record the physical and structural characteristics of the text: title-page layout, pagination, paragraphs, verse lines, italics, accented letters, etc. The vendor checks the validity of the tagging with a computer program that makes sure the tags are properly formed (a sketch of such a check, and of the page-image reference described in the next item, follows this list).

  • Each page of the electronic text has the location of its corresponding image marked, so that the two can be linked together hypertextually.

  • The texts are returned as they are finished to U.Va., where the following happens:

    They are spot-checked -- sections are proofread -- to verify the accuracy of the input.

    The SGML encoding is checked for completeness.

    The text is then completed in-house:

    The title-page transcription is finished (this is a little too complex to ask the keyboarders to do).

  • For each work, a standard bibliographical header is created by staff in the Electronic Text Center following the guidelines established by the Text Encoding Initiative's (TEI) Guidelines for Electronic Text Encoding and Interchange (P3). The header will record all the details of the print source and of the electronic version, and will include some keyword information that will be valuable as the items are searched. Gayle Cooper and Eva Chandler in Special Collections Cataloging will enhance the TEI header with additional bibliographical information and will standardize access points according to national library cataloging standards. The project catalogers will then create MARC (Machine Readable Cataloging) records which will be added to VIRGO, the Library's online catalog, and to OCLC and RLIN, the national bibliographic utilities. The cataloging records for all titles will be freely available to libraries around the world.

  • For each pictorial illustration in a work, a standardized description of type and subject matter is created.

  • A copy of this header is added to the binary data of each image file in a given volume. This text is hidden from the user -- it is not visible on the surface of the image -- but it is integral to the image file and travels with it as a cataloging and attribution record (a sketch of embedding such a record also follows this list).

  • The final text is parsed, indexed, and put online on the World Wide Web.
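
    As a minimal illustration of the electronic comparison used in double-keying, the sketch below uses Python's standard difflib module to report the lines on which two independently keyed versions of the same page disagree. The module, file names, and workflow details are assumptions made for the example, not the vendor's actual tools.

        import difflib

        # Two independently keyed transcriptions of the same page
        # (hypothetical file names).
        keying_a = open("page_042_keyA.txt", encoding="utf-8").read().splitlines()
        keying_b = open("page_042_keyB.txt", encoding="utf-8").read().splitlines()

        # Report every line on which the two keyings disagree; each discrepancy
        # would then be resolved by checking the page image.
        for line in difflib.unified_diff(keying_a, keying_b,
                                         fromfile="keying A", tofile="keying B",
                                         lineterm=""):
            print(line)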
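
    The tagging check and the page-to-image linking described above can be sketched in the same spirit. The project's texts are SGML, and the vendor uses its own checking software; as a stand-in, the example below applies Python's built-in XML parser to a small, invented fragment in which a page-break element carries a reference to the corresponding page image.

        import xml.etree.ElementTree as ET

        # A small, hypothetical fragment of tagged text.  The <pb/> element
        # records the page number and the file name of the corresponding page
        # image, so that text and image can be linked hypertextually.
        fragment = """
        <div type="chapter" n="1">
          <pb n="42" entity="page_042.jpg"/>
          <p>The opening paragraph, with a word in <hi rend="italic">italics</hi>.</p>
        </div>
        """

        # fromstring() raises ParseError if the tags are not properly formed
        # (mismatched, unclosed, or badly nested) -- the kind of check the
        # vendor's program performs on the SGML.
        try:
            ET.fromstring(fragment)
            print("Tagging is properly formed.")
        except ET.ParseError as err:
            print("Tagging error:", err)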
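
    Embedding a copy of the header in each image file, described above, can also be illustrated. The sketch below uses the Pillow library to write a short cataloging record into the standard ImageDescription field of a JPEG working copy; the library, the field choice, the wording of the record, and the file names are all assumptions for the example, not the project's actual method.

        from PIL import Image

        IMAGE_DESCRIPTION = 0x010E   # standard Exif/TIFF tag for a text description

        # A short cataloging and attribution record (hypothetical wording).
        record = ("Early American Fiction project, University of Virginia Library. "
                  "Volume 42, plate facing p. 112.")

        img = Image.open("plate_042.jpg")
        exif = img.getexif()                 # existing metadata, if any
        exif[IMAGE_DESCRIPTION] = record     # attach the record to the image file
        img.save("plate_042_tagged.jpg", exif=exif.tobytes())

        # The record is invisible when the image is displayed, but it travels
        # with the file and can be read back from its metadata.
        print(Image.open("plate_042_tagged.jpg").getexif().get(IMAGE_DESCRIPTION))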

    An interim report on the image and SGML text creation segments of the project will be presented at the invitational international Digital Libraries conference in March, 1997. Kendon Stubbs and David Seaman have been invited to deliver papers at the 9th Digital Libraries conference in Tokyo on March 5th, 1997. If the early American fiction project is funded, we propose to use this international conference for our first report on the progress of the project. The paper will be made available in both English and Japanese.

    The Digital Libraries conferences are a continuing series sponsored by the University of Library and Information Science (ULIS) in Tsukuba Science City near Tokyo. The most recent international meeting, ISDL '95, was a four-day conference in August 1995, with invited talks from faculty at Carnegie-Mellon, Bellcore-USA, Princeton, Ohio State, the National Diet Library of Tokyo, the Bibliothèque Nationale, and others. Detailed information on the Digital Libraries conferences is available at the Web address

    http://www.dl.ulis.ac.jp

    ULIS also publishes the journal Digital Libraries. The February 1996 issue, for example, has an article by Gary Olson and Daniel Atkins on "Digital Library and Educational Initiatives at the University of Michigan."

    At the end of the project in the summer of 1998, a written report will be prepared on the project as a model for creating images and tagged ASCII texts of rare books.

    Size and cost estimates:

    Special Collections staff have surveyed the texts and estimate an average page size of approx. 1,800 characters.

    To this is added a tagging overhead of about 20-25% -- so the SGML tags for an average page account for about 450 keystrokes.

    On this estimate, the average page is 2,250 billable keystrokes.
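
    As a quick arithmetical restatement of these figures (everything below simply repeats numbers already given in this document -- the 1,800-character page, the 20-25% tagging overhead, and the $1.97 average commercial charge per page quoted earlier):

        # All figures are taken from the estimates in this proposal.
        chars_per_page = 1800                 # average page size from the survey
        tagging_overhead = 0.25               # upper end of the 20-25% SGML overhead
        keystrokes_per_page = chars_per_page * (1 + tagging_overhead)
        print(keystrokes_per_page)            # 2250.0 billable keystrokes per page

        cost_per_page = 1.97                  # average commercial charge per page
        print(cost_per_page / keystrokes_per_page)   # roughly $0.0009 per keystroke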

    The storage costs include the need for several large hard drives to house the online collection, and long-term offline storage for data security and for the large archival master copies of the digital images. The on-line storage hard drives have an expected life of 3-4 years and will be much cheaper to replace than they are to buy initially -- disk storage is getting cheaper all the time. The offline archival storage will be on CD-ROM disks, which have an archival life of at least 10 years.

    As this project is being added to a thriving current digital library, we already have in place for other texts the expertise, experience, web servers, search tools, and backup systems that ensure a cost-effective and standards-driven operation.



