
OCR and SGML Mark-up of Documents from the Making of America Project

Report on a Directed Field Experience at Humanities Text Initiative

University of Michigan
December, 1996
Elizabeth Shaw

Table of Contents

Overview
Sample Documents in Various Stages of SGML Markup
Analysis of Process Strengths and Weaknesses
Analysis of Costs and Feasibility
Summary

Overview

The purpose of this project has been to explore the feasibility and costs of doing an automated OCR (optical character recognition) conversion of scanned TIFF images for the Making of America Project and automating initial SGML mark-up of the documents.

About Making of America (MOA)

MOA is drawing on primary materials at the Michigan and Cornell libraries to develop a thematically related digital library documenting American social history from the antebellum period through Reconstruction. Over 5,000 volumes, with imprints primarily between 1850 and 1877, are being selected, scanned, and made available to the academic communities at each institution. When the initial phase of the project is completed, the MOA collection will include over 1.5 million images. The materials in the MOA collection are scanned from the original paper source; the images are captured at 600 dpi in TIFF format and compressed using CCITT Group 4. Minimal document structuring occurs at the point of conversion, primarily linking image numbers to pagination and tagging self-referencing portions of the text.

For more information, see the Making of America Project.

Project Goals

The goals of this project were to:

Documents

Approximately 15-35 documents, constituting approximately 4,000 images, are stored on each CD-ROM. The documents vary greatly in length, format, typeface, and the condition of the original. Few of the documents contain non-text images, but those that do present special handling challenges. Format variations include:

In addition to differences of format, variations in typeface and the condition of the original pages affect the effectiveness of the OCR process.

Tools

The initial development of the process outlined in this report was conducted on an RS6000 with 64 MB of memory using Xerox's ScanWorx. Perl version 4 was used to write scripts that controlled both the ScanWorx process and the initial SGML mark-up of the documents. Author/Editor was used to move sample documents through the remaining mark-up stages to fully proofed and validated SGML versions of the documents.

Results

Automation of OCR Process

Although it initially seemed that converting the TIFF images to text would be fairly straightforward using the ScanWorx toolset, our desire to retain as much information about the individual pages as possible, and our need to track exceptions such as non-text images for later processing, presented unique challenges. In addition, the varied nature of the documents made it difficult to fully automate the process while maximizing character recognition. ScanWorx is not designed to process such a large number of diverse images without human intervention. Nonetheless, we developed a process whereby ScanWorx can process an entire CD-ROM unattended in 7-10 hours, producing a minimal number of exceptions that must be handled individually. Handling these exceptions requires an average of 1 to 2 hours of staff time per CD-ROM. The percentage of each page that is OCR'ed correctly varies greatly depending on the quality of the image and the typeface. Although we have not done a thorough analysis, a spot check of samples indicates that over 95 percent of each page was recognized correctly, excluding pages such as title pages and advertisements that contain multiple typefaces and sizes.
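
The driver that makes this unattended run possible is not reproduced here, but its overall shape can be sketched. The following is a minimal sketch in modern Perl 5 (the project scripts were written in Perl 4); the scanworx_batch command, the directory layout, and the exceptions log are hypothetical stand-ins for the actual ScanWorx invocation and project file conventions.

    #!/usr/bin/perl
    # Sketch of the per-CD-ROM OCR driver: walk each document directory,
    # hand its TIFF pages to ScanWorx in one batch, and log any document
    # that fails so staff can handle it individually later.
    # "scanworx_batch" and the directory layout are hypothetical.
    use strict;
    use warnings;

    my $cdrom_dir = shift @ARGV or die "usage: ocr_driver.pl /path/to/cdrom\n";
    my $exception_log = "exceptions.log";

    open my $exc, '>>', $exception_log or die "cannot open $exception_log: $!";

    # Each subdirectory of the CD-ROM holds the 600 dpi TIFF images for one document.
    opendir my $dh, $cdrom_dir or die "cannot read $cdrom_dir: $!";
    my @documents = grep { -d "$cdrom_dir/$_" && !/^\./ } readdir $dh;
    closedir $dh;

    for my $doc (sort @documents) {
        my @tiffs = sort glob("$cdrom_dir/$doc/*.tif");
        next unless @tiffs;

        # Hypothetical ScanWorx batch invocation: OCR every page of the
        # document into one text file per image.
        my $status = system('scanworx_batch', '-o', "out/$doc", @tiffs);

        if ($status != 0) {
            # ScanWorx exited before finishing the document (for example, on a
            # page dominated by a non-text image); record it for manual handling.
            print {$exc} "$doc: ScanWorx exited with status $status\n";
        }
    }
    close $exc;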

Automated Tagging Process

Despite the numerous variations in the documents, we developed a process that:

This process is performed by perl scripts invoked from a single controlling perl script and takes approximately 12 minutes for an entire CD-ROM.
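
To give a sense of what the initial mark-up step produces, the sketch below wraps the per-page OCR output for one document in a minimal SGML shell, emitting an empty page-break tag that links each image number to its text. It is written in modern Perl 5 rather than the Perl 4 used for the project scripts, and the element names (TEXT, BODY, PB) and file naming are hypothetical stand-ins for the conventions the actual scripts follow.

    #!/usr/bin/perl
    # Sketch of the initial SGML mark-up step: concatenate the OCR text for
    # each page of a document inside a minimal SGML wrapper, recording the
    # source image number on a page-break tag. Element names and file
    # naming are hypothetical stand-ins for the MOA conventions.
    use strict;
    use warnings;

    my $doc_dir = shift @ARGV or die "usage: rough_markup.pl out/DOCID > DOCID.sgm\n";

    print "<TEXT>\n<BODY>\n";

    # One OCR text file per scanned image, named by image number (e.g. 0001.txt).
    for my $page (sort glob("$doc_dir/*.txt")) {
        my ($image_no) = $page =~ /(\d+)\.txt$/;

        # Link the image number to its pagination with an empty page-break tag.
        print qq{<PB REF="$image_no">\n};

        open my $in, '<', $page or die "cannot open $page: $!";
        while (my $line = <$in>) {
            # Escape SGML-significant characters in the raw OCR text.
            $line =~ s/&/&amp;/g;
            $line =~ s/</&lt;/g;
            $line =~ s/>/&gt;/g;
            print $line;
        }
        close $in;
    }

    print "</BODY>\n</TEXT>\n";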

Sample Documents

After completing the automated processes above, we took two sample documents of approximately 30 pages each and completed SGML mark-up and proofing for them. The results of the various levels of mark-up, and the approximate staff time each required, are shown below:

The two sample documents are the History of the Emigrant Aid Society and The Chicago Fire.

Automated Processing with Minor or No Intervention
  Intervention time: none (Emigrant Aid); 5 minutes (Chicago Fire)
  Versions: SGML Emigrant Aid, SGML Chicago Fire, HTML Emigrant Aid, HTML Chicago Fire

Rough Mark-up Including Major Structural Elements but No Proofing
  Intervention time: 3 minutes per page (Emigrant Aid); 2 minutes per page (Chicago Fire)
  Versions: SGML Emigrant Aid, SGML Chicago Fire, HTML Emigrant Aid, HTML Chicago Fire

Mark-up Completed and Proofed
  Intervention time: 9 minutes per page (Emigrant Aid); 9 minutes per page (Chicago Fire)
  Versions: SGML Emigrant Aid, SGML Chicago Fire, HTML Emigrant Aid, HTML Chicago Fire

Making of America Page Images: Emigrant Aid, Chicago Fire

Analysis of Process Strengths and Weaknesses in the Existing Conversion Process

Trade-Offs inherent in automated OCR of multiple document types, formats and typefaces

The existing conversion process runs independently of the variations in the document collection. This is both its strength and its weakness. Because the process runs without human intervention to distinguish variations in typeface and document layout, an entire CD-ROM can be processed with little staff time. However, this limits our ability to "train" ScanWorx to improve character recognition, which means staff must spend more time proofing each document during full mark-up. Despite the lack of training, a limited sample of pages indicates that well over 95% of characters are being recognized correctly.

Formatting variations are also ignored. This allows us to use a single script to do initial mark-up on a document, but again it slows the process once staff take over for full mark-up. Page headers and footers that might otherwise have been removed in the automated mark-up process cannot be removed, because there is no consistency among documents and therefore no reliable pattern to drive mark-up and text manipulation.

Handling Exceptions with Images

We have managed to work around many of the issues created by ScanWorx's inability to handle pages with no recognizable text, images, or unusual font sizes. However, ScanWorx still exits when it attempts to process a page with significant non-text images. In our current processing scheme this means that ScanWorx exits before it completes a document, and staff must then restart the process on that document. We considered restarting ScanWorx for every page, but its start-up time would have increased processing time significantly. In addition, when we later began running two simultaneous instances of ScanWorx, a second instance started at the wrong moment would occasionally crash the first. Although this may be a problem specific to our development machine, we avoid it by starting ScanWorx only once for each document.
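
In the project this restart is done by staff, but the resume step itself is simple enough to sketch. The following assumes the same hypothetical scanworx_batch command as above and zero-padded page file names; it skips the page that caused the crash, logs it as an exception, and restarts ScanWorx once for the remainder of the document.

    #!/usr/bin/perl
    # Sketch of resuming a document after ScanWorx exits on an image page:
    # skip the offending page, log it as an exception, and continue with the
    # remaining pages. "scanworx_batch" is the same hypothetical command as
    # in the driver sketch; in the project this restart is done by staff.
    use strict;
    use warnings;

    my ($doc_dir, $failed_page) = @ARGV;
    die "usage: resume_doc.pl DOC_DIR FAILED_TIFF\n" unless $doc_dir && $failed_page;

    my @tiffs = sort glob("$doc_dir/*.tif");

    # Everything after the page that crashed ScanWorx; assumes zero-padded
    # file names so string order matches page order.
    my @remaining = grep { $_ gt $failed_page } @tiffs;

    open my $exc, '>>', 'exceptions.log' or die "cannot open exceptions.log: $!";
    print {$exc} "$failed_page: skipped (non-text image), handle manually\n";
    close $exc;

    # Restart ScanWorx once for the rest of the document rather than once per
    # page; per-page start-up overhead made that alternative too slow.
    if (@remaining) {
        system('scanworx_batch', '-o', "out/resume", @remaining) == 0
            or warn "ScanWorx exited again; check the next page in the log\n";
    }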

Handling pages with images is the most time-consuming portion of the conversion process. We were able to pull the vast majority of images out of the processing queue (thereby minimizing the number of times ScanWorx crashes) by identifying plates that do not have their own page numbers, but several documents still contain images that we cannot identify before processing. These pages both crash the process and require individual handling, since a number of them contain both image and text, and we still want to capture the text in the OCR process.
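
The pre-filter draws on the pagination data captured at conversion time, when image numbers are linked to page numbers. The sketch below assumes a simple tab-separated mapping file (here called pagination.map, a hypothetical name); images with no page number are treated as plates and set aside for manual handling.

    #!/usr/bin/perl
    # Sketch of pulling plates out of the OCR queue before processing:
    # images that were not linked to a page number at conversion time are
    # treated as plates and set aside for manual handling. The tab-separated
    # "pagination.map" format is a hypothetical stand-in for the actual
    # structuring data captured at scan time.
    use strict;
    use warnings;

    my $map_file = shift @ARGV or die "usage: filter_plates.pl pagination.map\n";
    open my $map, '<', $map_file or die "cannot open $map_file: $!";

    my (@queue, @plates);
    while (my $line = <$map>) {
        chomp $line;
        my ($image_no, $page_no) = split /\t/, $line;

        if (defined $page_no && $page_no ne '') {
            push @queue, $image_no;     # ordinary text page: OCR it
        } else {
            push @plates, $image_no;    # unpaginated plate: handle manually
        }
    }
    close $map;

    print "queue: @queue\n";
    print "plates: @plates\n";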

One change to the ScanWorx program that I would recommend to Xerox is to enable ScanWorx to continue processing a script even when it cannot process an individual page: it could log an error message and move on to the next page. This would greatly enhance our automated process.

Costs and Feasibility of Completed SGML Mark-up of University of Michigan's Making of America Documents

The University of Michigan expects to process approximately 650,000 images for the Making of America project. The table below illustrates estimated staff costs for various levels of processing for the MOA project. For the assumptions underlying these estimates, see the notes below the table. These are rough estimates based on our experience so far; they are not intended to be comprehensive and do not include hardware or license costs. I have tried to provide realistic estimates that fall at neither extreme.

Staff Costs (650,000 pages in the project)

Running automated OCR conversion and mark-up (1)
  Time per page: 2 seconds (maximum)
  Total hours to process: 361
  Wages per hour (including employer benefits/taxes as appropriate): $16.60
  Total cost: $6,000

SGML mark-up completed (2)
  Time per page: 2-3 minutes
  Total hours to process: 21,700-32,500
  Wages per hour (including employer benefits/taxes as appropriate): $9.66
  Total cost: $209,600-$313,950

Proofing and corrections (2)
  Time per page: 8-9 minutes
  Total hours to process: 86,700-97,500
  Wages per hour (including employer benefits/taxes as appropriate): $9.66
  Total cost: $837,500-$942,000

In addition, there would be overhead costs for some portion of a professional staff member's salary to manage the student staff.

Note 1: Each CD-ROM contains approximately 4,000 pages and requires between 1 and 2 hours of human intervention in this process. That works out to less than 2 seconds of human intervention per page to bring the pages from scanned TIFF images to roughly marked-up, unproofed pages. It will take approximately 360 hours of staff time - 18% of one full-time staff person's time in a year - to run the automated OCR conversion and initial SGML mark-up for the entire University of Michigan portion of the MOA collection. Although this mark-up is not ready for use as a readable document on-line, it could be used in an indexing process. At an annual salary of $25,000 for a technician (plus 33% for associated benefits, for a total of $33,250), this will cost about $6,000.

Note 2: On our test samples, mark-up took 2-3 minutes per page and proofing took 8-9 minutes per page. Most of this work is done on an hourly basis by students at $8.40 per hour; adding employer-paid taxes, we estimate $9.66 per hour for mark-up and proofing.
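
The totals in the table follow directly from these per-page times and wage rates. The short Perl calculation below reproduces them; the small differences from the table figures are rounding.

    #!/usr/bin/perl
    # Sketch reproducing the staff-cost arithmetic behind the table: hours
    # from per-page minutes, dollars from hours times the loaded hourly wage.
    use strict;
    use warnings;

    my $pages = 650_000;

    # Automated OCR/mark-up: roughly 360 staff hours for the whole collection
    # at the technician rate of $16.60/hour (salary plus benefits).
    printf "automated run: \$%.0f\n", 360 * 16.60;            # about $6,000

    # Student wage of $8.40/hour plus employer-paid taxes is about $9.66/hour.
    my $wage = 9.66;

    for my $step (['mark-up', 2, 3], ['proofing', 8, 9]) {
        my ($name, $low_min, $high_min) = @$step;
        my $low_hours  = $pages * $low_min  / 60;
        my $high_hours = $pages * $high_min / 60;
        printf "%-9s %.0f-%.0f hours, \$%.0f-\$%.0f\n",
            $name, $low_hours, $high_hours, $low_hours * $wage, $high_hours * $wage;
    }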

Summary

Using the automation we have developed, we can process a CD-ROM of approximately 4,000 pages into roughly marked-up documents with an average of less than 2 hours of human intervention per CD-ROM. Moving that unproofed rough mark-up to a finished, valid SGML document takes an additional 2-3 minutes per page for mark-up and 8-9 minutes per page for proofing and entering corrections. Documents that differ significantly from the norm (two-column formats or a significant number of images) would take additional processing time. However, initial analysis of the documents indicates that these anomalies are in the minority, ranging from 0 to 3 documents per CD-ROM, and most of the two-column documents are less than 30 pages in length.

Although there are some significant limitations in using ScanWorx as a tool to batch-convert TIFF images to text while retaining specific information about individual images, many of these limitations have been overcome using perl scripts to manage the process. The processes created as part of this project allow over 99% of the pages to be processed automatically, with over 95% accuracy in text recognition.

The process that takes the text files to initial mark-up requires the user to run a single script. Although, as demonstrated above, the text is not yet ready for use as a readable document, it does enable the development of a searchable index based on the text files. In addition, it prepares over 95% of the documents for follow-up mark-up that would produce a readable document.

While the costs of taking these documents to fully marked-up versions using this method may be prohibitive at this time, the interim step of providing at least partial indexing functionality at a modest cost is worthwhile. Should funds become available at some point to fully proof and mark up these documents, much of the work will already be done, along with the documentation necessary to take them the rest of the way.

Processing instructions and the scripts used in the project are available to Digital Library Production Service staff.