[Mirrored from: http://www.lib.virginia.edu/speccol/mellon/texts.html]
The 582 volumes that comprise the print originals for this extraordinary electronic text collection will be scanned as high-quality color images, and "working copy" JPEG files will then be created from the archival TIFF originals. These copies are the source from which a commercial service bureau will keyboard the text. This method is much cheaper and much faster than transcribing the documents ourselves. The work is done in large professional typing operations overseas, which is how practically all large commercial e-text projects are created. This part of the project will be overseen by David Seaman, Coordinator of the Library's Electronic Text Center, and by David Gants, Assistant Coordinator of the Center; any post-creation processing work will be done by student assistants in the Center.
Both optical character recognition (OCR) and keyboarding are useful and reliable methods, to varying degrees, for creating machine-readable transcriptions of print materials. For the early American fiction project, the obvious practical choice is a commercial keyboarding company, because of both the physical nature of the source material and its bulk.
OCR works by taking a digital image of a page of type and interpreting the shapes on it, turning clusters of image pixels into ASCII characters. OCR works well with modern typefaces, and often copes reasonably well with later 19th-century printed matter, but its effectiveness decreases with earlier material. The software's ability to recognize letters is challenged principally by printing flaws that disrupt the integrity of the letter form, such as uneven inking and broken type, and such flaws are typical of earlier printed material. For sample typefaces and the results of OCR on them, see the following Web site:
http://etext.lib.virginia.edu/helpsheets/scan-train.html
Even with clean modern type, one or two errors a page is not atypical with OCR. This makes OCR suitable for small runs of material, especially when the results can be corrected effectively with a modern spell-checker. But both the bulk and the pre-20th-century spelling of the texts in the American Fiction to 1850 project would make them cumbersome to correct, and their typographical features will result in more numerous errors than we would see in modern print items.
The keyboarding companies both provide highly accurate text (one error per 20 pages on average) and add most of the SGML encoding as they type. They also produce the output quickly. To match that accuracy in-house we would need a large and expensive team of staff to scan, tag, and proofread -- and the job would take much longer. The team would need to be assembled, trained, and overseen, all of which would add to the cost. Multiple computers, OCR software packages, and scanners would have to be bought, as well as separate machines for the proofreading stage. Some of the texts in this project would perform so poorly as input for OCR that they would have to be typed manually. Even using student labor, and leaving the space and equipment costs out of the equation, we could not come close to the $1.97 per page average commercial charge for data entry, rigorous proofing, and SGML markup.
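The two error rates quoted above can be put on a common footing with a little arithmetic. The sketch below is illustrative only: it assumes the 1,800-character average page that this proposal estimates for the project's texts and applies it to both methods, and it uses the OCR figure for clean modern type, so the OCR rate for earlier material would be worse. The function and variable names are hypothetical.

```python
# Rough comparison of the error rates quoted in the text.
# Assumption: an average page holds about 1,800 characters (the project's
# own page-size estimate), applied here to both OCR and keyboarding.
CHARS_PER_PAGE = 1_800


def errors_per_page(errors: float, pages: float) -> float:
    """Average number of errors on a single page."""
    return errors / pages


def character_accuracy(page_errors: float, chars_per_page: int = CHARS_PER_PAGE) -> float:
    """Fraction of characters transcribed correctly."""
    return 1 - page_errors / chars_per_page


# Figures from the text: OCR on clean modern type gives one or two errors
# a page; keyboarding gives one error per 20 pages on average.
ocr_best = errors_per_page(1, 1)
ocr_worst = errors_per_page(2, 1)
keyboarding = errors_per_page(1, 20)

for label, rate in [("OCR (1 error/page)", ocr_best),
                    ("OCR (2 errors/page)", ocr_worst),
                    ("Keyboarding", keyboarding)]:
    print(f"{label}: {character_accuracy(rate):.4%} of characters correct")

# How many times more errors OCR leaves behind than keyboarding does.
print(f"OCR leaves {ocr_best / keyboarding:.0f}x to "
      f"{ocr_worst / keyboarding:.0f}x as many errors per page")
```

On these figures OCR leaves roughly 20 to 40 times as many errors to find and correct, before allowing for the older typefaces in this collection.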
To date, none of the large full-text literary databases produced commercially have been generated using OCR; publishers such as Chadwyck-Healey and Oxford University Press have concluded, as we have, that a keyboarding service is the cheapest option.
As for personnel, the project will become part of the ongoing work of two of our existing electronic centers, the Special Collections Digital Center and the Electronic Text Center. Both Centers are already engaged in the kind of work that this project will involve: using the Kontron camera to make digital images of rare books and processing SGML-tagged ASCII texts for our World Wide Web etext site.
In the Special Collections Digital Center, the fiction project will follow already established routines for digitizing rare books. The Center's coordinator, Edward Gaynor, will be available for policy questions and overall direction, which will require about 15% of his time. The day-to-day work of supervising the student assistants who are using the cameras to make images of the book pages will fall to the full-time Digitizing Supervisor. This person, not yet hired, will be responsible for seeing that schedules are maintained in the process of making digital images, saving them on disk, and transferring the files to CD-ROMs.
The Electronic Text Center will receive the page images from the Special Collections Center on CD-ROMs and will send the CD-ROMs to the vendor who will create SGML-tagged texts and return the texts to the Etext Center. The returned texts will then be checked, cataloging headers will be prepared, and the texts will be added to the Etext Center's Web site. This kind of process is already being carried out in the daily work of the Etext Center, which has processed some 10,000 texts for the World Wide Web. The Coordinator of the Etext Center, David Seaman, can add the American fiction project to his purview with an expenditure of 15% of his time per week. As part of their responsibilities, Mr. Gaynor and Mr. Seaman will be available to speak for the project and to answer queries about it.
The workflow for creating e-texts and the costs are as follows:
The title page transcription is completed (this is a little too complex to ask the
keyboarders to do).
An interim report on the image and SGML text creation segments of the project will be
presented at the invitational international Digital Libraries conference in March, 1997.
Kendon Stubbs and David Seaman have been invited to deliver papers at the 9th Digital
Libraries conference in Tokyo on March 5th, 1997. If the early American fiction
project is funded, we propose to use this international conference for our first report on
the progress of the project. The paper will be made available in both English and
Japanese.
The Digital Libraries conferences are a continuing series sponsored by the University of
Library and Information Science (ULIS) in Tsukuba Science City near Tokyo. The last
international conference, ISDL '95, was a four-day conference in August, 1995, with
invited talks from faculty at Carnegie-Mellon, Bellcore-USA, Princeton, Ohio State, the
National Diet Library of Tokyo, the Bibliothèque Nationale, etc. Detailed information
on the Digital Libraries conferences is available at the Web address
ULIS also publishes the journal Digital Libraries. The February 1996 issue, for
example, has an article by Gary Olson and Daniel Atkins on "Digital Library and
Educational Initiatives at the University of Michigan."
At the end of the project in the summer of 1998, a written report will be prepared on the
project as a model for creating images and tagged ASCII texts of rare books.
Size and cost estimates:
Special Collections staff have surveyed the texts and estimate an average page size of
approx. 1,800 characters.
To this is added a tagging overhead of about 20-25%; at the upper end of that range, the
SGML tags for an average page comprise about 450 keystrokes (25% of 1,800).
On this estimate, the average page is 2,250 billable keystrokes.
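The estimate above can be checked with a short calculation. This is a sketch only: the per-page figures and the $1.97 average charge come from this proposal, while the rate per 1,000 keystrokes is derived here on the assumption that the $1.97 average applies to a page of average size. It is not a quoted vendor rate, and the function names are hypothetical.

```python
# Figures taken from the proposal.
CHARS_PER_PAGE = 1_800   # average characters of text on a page
TAGS_PER_PAGE = 450      # SGML tagging keystrokes on an average page
PRICE_PER_PAGE = 1.97    # average commercial charge per page, in dollars


def billable_keystrokes(chars: int, tags: int) -> int:
    """Text characters plus tagging keystrokes for one page."""
    return chars + tags


def tagging_overhead(chars: int, tags: int) -> float:
    """Tagging keystrokes as a fraction of the text characters."""
    return tags / chars


def implied_rate_per_thousand(price: float, keystrokes: int) -> float:
    """Derived cost per 1,000 keystrokes (an inference, not a quoted rate)."""
    return price / keystrokes * 1000


total = billable_keystrokes(CHARS_PER_PAGE, TAGS_PER_PAGE)
overhead = tagging_overhead(CHARS_PER_PAGE, TAGS_PER_PAGE)
rate = implied_rate_per_thousand(PRICE_PER_PAGE, total)

print(f"Billable keystrokes per page: {total}")           # 2250
print(f"Tagging overhead: {overhead:.0%}")                # 25%
print(f"Implied rate: ${rate:.2f} per 1,000 keystrokes")  # $0.88
```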
The storage costs include the need for several large hard drives to house the online
collection, and long-term offline storage for data security and for the large archival master
copies of the digital images. The online storage hard drives have an expected life of 3-4
years and will be much cheaper to replace than they are to buy initially -- disk storage is
getting cheaper all the time. The offline archival storage will be on CD-ROM disks, which
have an archival life of at least 10 years.
As this project is being added to a thriving current digital library, we already have in place
for other texts the expertise, experience, web servers, search tools, and backup systems
that ensure a cost-effective and standards-driven operation.
Last Modified: Wednesday, 21-Aug-96 09:02:24 EDT
URL: http://www.lib.virginia.edu/speccol/mellon/texts.html