

Automating the Structural Markup Process in the Conversion of Printed Documents to Electronic Texts

Casey Palowitch (cjp+@pitt.edu)
Darin Stewart (dlsst51+@pitt.edu)

Electronic Text Project
University of Pittsburgh Library System
Pittsburgh PA USA 15260

(Draft 19 March 1995, submitted to Digital Libraries '95)

ABSTRACT: This paper presents early results from a research initiative constructing a system that automatically identifies structural features and applies SGML tagging, based on the Text Encoding Initiative DTD, to text generated from the scanning and OCR processing of print documents. The system interprets typographical and geometric analysis output from a specialized OCR software system and maps combinations of these characteristic features to TEI constructs based on a user-generated document analysis specification. The system is being developed as part of a pilot project to create, from the original paper document, a TEI-encoded edition of the Transactions of the American Medical Association, Vol. 2, 1849, a research resource for the study of 19th-century United States medical and urban history. Although this project focuses on one specific text, an important goal of the project is to create a software system that can process and at least minimally tag many types of printed documents, given a proper document analysis specification, and thus allow a more rapid process of retrospective conversion of printed documents into SGML texts in libraries.

1. Introduction

Retrospective conversion of printed documents into digital forms is an increasingly necessary and desirable task facing those building digital libraries. Networked access technologies like the World Wide Web (WWW) allow remote use of digital collections and a variety of new methodologies for interaction and manipulation. Digital conversion efforts along these lines are becoming commonplace, with hundreds of projects to digitally capture existing printed monographs, serials, and special-format items underway around the world. Digital imaging technology is inexpensive and can provide high-resolution and color representations, but images alone cannot be indexed or rapidly searched for content. Many institutions are experimenting with optical character recognition (OCR) and structured markup systems such as the Standard Generalized Markup Language, ISO 8879 (Goldfarb 1990), to create character-based digital documents that can be indexed and searched as textual databases. However, for the high-production needs of libraries, the inaccuracy of commercial OCR software on all but the most modern and uniform documents still demands thorough proofreading and error correction, and the intensive manual labor required for SGML text markup is often a barrier to its more widespread use. While some commercial projects may be able to justify investment in manual markup effort (Fawcett 1993), libraries that wish to provide their patrons with the access, searching, and manipulation possibilities inherent in structurally marked-up texts must find other alternatives. One alternative is to automate the process.

Much current research in the field of document image analysis and structure and pattern recognition takes a top-down approach, identifying structural elements at the page or text-block level before proceeding to the recognition of individual characters (Baird et al. 1992, Dengel et al. 1992). The present system relies on information generated by the OCR software Aurora[1] in recognizing structures from their physical characteristics. In addition to its extremely accurate character recognition, Aurora is designed to provide important page-layout information from which text structures can be inferred, specifically line-geometry information (the vertical and horizontal positions, lengths, and heights of lines on the page) and typographic information (font changes). One recent, similar effort to define a system architecture for translating the Xerox XDOC format (word-level geometry and font data) into structural markup has influenced the present approach (Taghva et al. 1994). Other efforts in this area have used rigorous lexical analysis and have had success separating highly formatted text for entry into a relational database (Crawford and Lee 1990).

2. System architecture


Figure 1. System architecture

Original print document. Our project is focused on creating an electronic-text edition of the Transactions of the American Medical Association, vol. 2, 1849, the precursor to the Journal of the AMA. This document was chosen for the project primarily because of its research importance and the interlibrary-loan demand for the volume at the Falk Medical Library of the University of Pittsburgh. The rare 1000-page volume is also fragile: the paper crumbles with normal use, and its spine and binding have disintegrated. A searchable e-text edition will allow remote use of the document without requiring its physical transfer.

Optical scan. Because the original document was available only for a limited time and was so fragile, we needed to obtain the highest possible quality archival image data in a single pass. A high-resolution TIFF image of each page was captured using a Ricoh IS-50 scanner on a Macintosh IIcx running the Optix document image management software from Blueridge Technologies Inc. All pages were scanned at the scanner's native 400 pixels per inch in 8-bit grayscale. This produced very large individual page image files, approximately 6 megabytes each, but allowed the derivation of other formats as required without rescanning. It also provided the data Aurora needs to perform its most accurate OCR processing (see Fig. 2 for a sample page image portion). Due to the large size of these files and the limitations of available disk space, a maximum of 150 pages could be scanned at one time, and because of the fragility of the document the scanning procedure was painstaking. Each scanning session lasted approximately three hours. The resulting 5.5 gigabytes of image data were stored on 4mm DAT tapes and retrieved for later processing.

Figure 2. Greyscale page image portion from original document (360K GIF)

Document Analysis. In order for the post-processing application to associate geometric and typographical data with the presence of a structural unit, the application must be provided with a document analysis specification. The specification contains three parts. The first is a precise specification of the presentational and geometric characteristics that uniquely identify each structural component. The second is a structural-hierarchical model of the text, which must identify the hierarchical relationships between the structural components (i.e., the nesting of elements). The third is a mapping of font changes to the desired markup (e.g., emphasis, quotations) and a mapping of the document's glyph vocabulary to SGML entities for substitution. The document analysis specification is prepared in much the same manner as document analysis prior to SGML document type definition (DTD) preparation, referring back to the original document (or document images) to determine characteristics and relationships.
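
For illustration, such a specification can be pictured as a set of declarative tables. The following Perl fragment sketches one possible in-memory representation; the rule fields, threshold values, and mappings shown are hypothetical and do not reproduce the project's actual specification format.

    # A minimal sketch of a document analysis specification as Perl
    # data.  Every field name, threshold, and mapping below is
    # hypothetical; it only illustrates the three parts just described.
    my %spec = (
        # Part 1: geometric/presentational signatures of structural units.
        elements => {
            HEAD => { centered => 1, all_caps => 1 },  # section heading
            P    => { first_line_indent => 10 },       # paragraph
        },
        # Part 2: structural hierarchy (permitted nesting of elements).
        hierarchy => {
            DIV => [ 'HEAD', 'P' ],
        },
        # Part 3: typeface and glyph mappings.
        fonts    => { i => 'EMPH' },               # italics -> <EMPH>
        entities => { chr(0xE6) => '&aelig;' },    # 'ae' ligature glyph
    );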

Because the Transactions was to be marked up using the Text Encoding Initiative DTD (Burnard and Sperberg-McQueen 1994b), the specification phase produced a mapping of the structural elements of the Transactions to the elements provided for in the TEI DTD[2]. For the first phase of the project, the following TEI elements could be uniquely identified using a combination of geometric and presentational characteristics:

In addition, emphasis markup (EMPH elements) and quotations (Q elements) were mapped to the corresponding typefaces where this was possible.

OCR. Aurora's output consists of alternating lines of geometric data and character data; the character stream is also interspersed with presentation data. Each original page image resulted in its own independently coded output file. This CGP data format (characters, geometry, and presentation) is presented in Figure 3. The geometric data line contains a line number, line height, the vertical and horizontal positions of the line start and end relative to the left and right page edges and the page top, and overall length. For example, the second geometry line in Figure 3 records line 25, a line height of 14, a start 178 units from the left page edge, a vertical position of 486 from the page top, an end at 313 units, and an overall length of 135. Although none are marked in the example, font changes are indicated with an escape symbol '@' and a typeface indicator placed immediately before the affected characters in the character stream.

Markup Processing. In the system architecture in Figure 1, the structural and presentation markup phases are represented as an iterative process, due to the nature of the current implementation. The markup processing is performed by a program written in the Perl programming language (Wall and Schwartz 1991) and executed in a UNIX operating system environment. Perl was chosen as the prototype language for its strengths in line-oriented input processing and in textual string processing with regular expressions. In its current implementation, the program operates on a window of four lines of input: two lines of geometric data and two lines of character data. The geometric data are parsed and stored in variables, and the text data are stored in temporary buffers. Using a combination of comparisons, the contents of the variables and buffers are matched against the document analysis specification to indicate the beginning (or end) of a structural division. Unlike the raw OCR output, which is coded independently on a per-page basis, the structural information detected at this stage can cross page boundaries. A stack is maintained to preserve the structural hierarchy and to ensure a parsable final output; a simplified sketch of this loop follows Figure 3. The stack operations are governed by the structural hierarchy, which is also provided in the document analysis specification. Elements that are currently open are maintained on the stack, and the opening of a higher-level structural unit forces the closing of the remaining elements on the stack.


@24:    h 13 d -8 x 43 y 445 x 311 l 268
tageous their use will be eminently injurious.
@25:    h 14 d -4 x 178 y 486 x 313 l 135
  CONCLUDING REMARKS.
@26:    h 16 d -4 x 60 y 514 x 447 l 387
 With these general observations your committee will introduce the
@27:    h 14 d -3 x 43 y 532 x 448 l 405
special reports from individual members, embracing an account of
@28:    h 14 d -3 x 44 y 549 x 446 l 402
the sanitary condition of the cities of Portland, Concord, Boston,
@29:    h 15 d -3 x 42 y 567 x 445 l 403
Lowell, New York, Philadelphia, Baltimore, Charleston, New Or-
@30:    h 14 d -4 x 42 y 584 x 445 l 403
leans, and Louisville, so far as it may be developed by answers to
@31:    h 13 d -4 x 42 y 601 x 377 l 335
the questions propounded in the circular issued by them.
@32:    h 14 d -3 x 55 y 619 x 445 l 390
 They are aware, however, that the investigation into this interest-
@33:    h 14 d -4 x 41 y 636 x 444 l 403
ing and almost unexplored region of medical inquiry, has but just
@34:    h 14 d -3 x 41 y 654 x 444 l 403
commenced; and that their labours have accomplished little more
@35:    h 14 d -4 x 40 y 671 x 444 l 404
than to open the way for its farther, and, as they hope, more suc-
@36:    h 13 d -4 x 41 y 688 x 443 l 402
cessful prosecution.  The subject they conceive to be one eminently
@37:    h 13 d -4 x 41 y 705 x 444 l 403
congenial with the purposes of the Association, inasmuch as it has
@38:    h 14 d -3 x 40 y 723 x 445 l 405
for its object the preservation of human life, and the removal of
@39:    h 14 d -4 x 41 y 740 x 441 l 400
those causes of disease and death which it is in the power of legisla-
@40:    h 14 d -4 x 41 y 758 x 441 l 400
tion to eradicate, and to which the public mind must be directed by
@41:    h 13 d -5 x 68 y 775 x 142 l 74
 VOL. II.-29

Figure 3. Aurora's CGP output for the previous example.
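
The following Perl fragment sketches this detection loop in simplified form. It reads one geometry line and one character line at a time (rather than the four-line window used by the actual program), and its body margins, centering tolerances, and heading test are illustrative assumptions rather than values taken from the project's document analysis specification.

    #!/usr/bin/perl
    # Simplified sketch of the markup-processing loop: parse CGP
    # geometry lines (as in Figure 3), detect centered upper-case
    # heading lines, and keep a stack of open elements so that the
    # emitted SGML remains parsable.  All thresholds are illustrative.
    use strict;

    my ($LEFT, $RIGHT) = (40, 446);   # assumed text-body bounds (pixels)
    my @stack = ('DIV', 'P');         # elements currently open
    print "<DIV type='subsection'><P>";

    sub close_all { print '</', pop @stack, '>' while @stack }

    while (my $geom = <>) {
        defined(my $text = <>) or last;
        chomp $text;
        # Geometry line, e.g. "@25:  h 14 d -4 x 178 y 486 x 313 l 135"
        my ($x1, $x2) = $geom =~ /^\@\d+:.*?x\s+(\d+)\s+y\s+\d+\s+x\s+(\d+)/
            or next;
        my ($lgap, $rgap) = ($x1 - $LEFT, $RIGHT - $x2);
        # A short, centered, upper-case line is taken to be a heading.
        if ($lgap > 30 and $rgap > 30 and abs($lgap - $rgap) < 25
                and $text =~ /^\s*[A-Z][A-Z\s.,'-]*$/) {
            close_all();              # emits "</P></DIV>"
            @stack = ('DIV', 'P');
            print "\n<DIV type='subsection'><HEAD>$text</HEAD>\n<P>";
        } else {
            print "$text\n";
        }
    }
    close_all();

Run against the CGP data of Figure 3, a loop of this kind yields output of the form shown in Figure 4; the actual program additionally consults the hierarchy tables of the specification rather than hard-coding the DIV/P nesting.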

After structures have been identified, the temporary text buffers are flushed, and font-change indicators are replaced with the proper tags, which are likewise maintained on the stack. Finally, SGML entities are substituted for special characters, as specified in the document analysis specification (a sketch of this pass follows Figure 4). The final output is a parsable (if at the moment simple) SGML document.

tageous their use will be eminently injurious.
</P></DIV><DIV type='subsection'><HEAD>CONCLUDING
REMARKS.</HEAD>
<P> With these general observations your committee will introduce
the special reports from individual members, embracing an account of
the sanitary condition of the cities of Portland, Concord, Boston,
Lowell, New York, Philadelphia, Baltimore, Charleston, New Or-
leans, and Louisville, so far as it may be developed by answers to
the questions propounded in the circular issued by them.
</P><P> They are aware, however, that the investigation into this interest-
ing and almost unexplored region of medical inquiry, has but just
commenced; and that their labours have accomplished little more
than to open the way for its farther, and, as they hope, more suc-
cessful prosecution.  The subject they conceive to be one eminently
congenial with the purposes of the Association, inasmuch as it has
for its object the preservation of human life, and the removal of
those causes of disease and death which it is in the power of legisla-
tion to eradicate, and to which the public mind must be directed by
</P><P> VOL. II.-29

Figure 4. Tagged output for the previous example.
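
The substitution pass can be sketched in the same style. Aurora marks font changes only with the '@' escape convention described above; the particular typeface codes ('i' for italic, 'r' for a return to roman) and the glyph-to-entity table used here are illustrative assumptions.

    # Sketch of the final substitution pass over a flushed text buffer.
    # Typeface codes and the entity table are assumed for illustration.
    use strict;

    my %font_tag = ( i => 'EMPH' );              # '@i' opens <EMPH>
    my %entity   = ( chr(0xE6) => '&aelig;' );   # 'ae' ligature glyph
    my @font_stack;                              # open font-change tags

    sub substitute {
        my ($line) = @_;
        # Turn font-change escapes into properly nested tags.
        $line =~ s{\@(\w)}{
            exists $font_tag{$1}
              ? do { push @font_stack, $font_tag{$1}; "<$font_tag{$1}>" }
              : (@font_stack ? '</' . pop(@font_stack) . '>' : '')
        }ge;
        # Substitute SGML entities for special glyphs.
        for my $glyph (keys %entity) {
            $line =~ s/\Q$glyph\E/$entity{$glyph}/g;
        }
        return $line;
    }

    print substitute("cases of \@ih" . chr(0xE6) . "morrhage\@r were noted\n");
    # prints: cases of <EMPH>h&aelig;morrhage</EMPH> were noted

A production version would drive both tables from the document analysis specification rather than from literals.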

3. System evaluation

Although the project is still underway, an early evaluation can be made of the effort along three lines: practical considerations, system design considerations, and representation and standards considerations.

From a practical point of view, the OCR component of the system requires a relatively large volume of temporary and permanent storage, as well as significant CPU power, especially for large documents or document groups. In addition, if the printed document is fragile or access to it is inconvenient, the digital images must be consulted repeatedly during the document analysis phase, which can be burdensome if the image files are large. In the present case it was necessary to create a set of easily manipulable page images at 1/16 the file size for viewing purposes. However, the Perl implementation of the markup processing program is quite fast: it can process a group of 50 CGP files (each containing the data output from the processing of one page image) in under 5 seconds on a 486/66-based system.

The current system reduces by approximately one third the number of element identifications and tagging operations that would otherwise have been performed manually in the present Transactions project. It is expected that further refinements can increase that proportion to approximately one half. The savings are achieved at the cost of an up-front investment in preparing the document analysis specification, and at the moment the approach is cost-effective only for large documents or document collections. It is important to point out, however, that not all encoding projects can benefit equally from systems such as this; projects producing critical or scholarly editions, with a high proportion of tagging requiring expert evaluation, may see only marginal benefit relative to the overall tagging effort. Such projects may nevertheless use a system of this type for initial markup of basic structures. Indeed, the initial automatic markup processing of the Transactions was followed by intellectually intensive markup performed by librarians and subject specialists. Furthermore, the current system can only discern structures that are unambiguously definable in terms of their geometric or typographical characteristics. While for most print documents this set is quite large, it is by no means the complete set of regularly occurring structures.

The current project has also pointed out difficulties with the use of strict hierarchical representations such as SGML, and applications of SGML such as the TEI, in encoding documents with concurrent and complex organic structures, a characteristic shared by a surprising number of printed texts[3]. A paragraph that begins on one page and ends on the next, for example, belongs simultaneously to the logical hierarchy of divisions and paragraphs and to the physical hierarchy of pages, and only one of the two can be represented by element nesting in a single document instance. In at least two cases so far, automatic markup of the Transactions has been complicated by such problems.

4. Future directions

At this point, the focus of the present effort has been to develop a system for correctly recognizing major structural features of a target text, and our implementation has been successful enough to proceed with enhancing and expanding the system's competence in handling other textual features that can be programmatically identified, such as dates. Other directions for the project include: 1) the design of a 'standard' document analysis specification file format, which would draw on the existing standards for document output formats and for SGML-syntax-directed editing systems (ISO/DTR 10037); the structural-hierarchy component of the document analysis specification should be derived directly from a given DTD; and 2) implementation in a compiled language such as C++, to enable deployment apart from an interpreter, to enable run-time interaction with the Aurora OCR application and with other applications such as an SGML parser (Clark 1995), and to utilize object-oriented software engineering methodologies.

Notes

[1] The Aurora OCR system is under development by Dr. Art Wetzel, currently with Carnegie Mellon University. The authors wish to recognize his contributions to the project and to this paper.

[2] The project is using a smaller subset of the full TEI DTD, known at the moment as TEILITE (Burnard and Sperberg-McQueen 1994a), which is extended with markup for tables and for other content-specific elements.

[3] Although SGML provides for the markup of concurrent structures using CONCUR, few if any of the SGML-aware editors, databases, and browsers support the feature.

5. Works referenced

[Baird et al. 1992] Baird, H. S., H. Bunke, and K. Yamamoto (eds.). Structured Document Image Analysis. Berlin: Springer-Verlag, 1992.

[Burnard and Sperberg-McQueen 1994a] Burnard, Lou, and C. M. Sperberg-McQueen. Encoding for Interchange : An Introduction to the TEI, TEI U5. Chicago: Text Encoding Initiative, 1994.

[Burnard and Sperberg-McQueen 1994b] Burnard, Lou, and C. M. Sperberg-McQueen (eds.). Guidelines for Electronic Text Encoding and Interchange, TEI P3. Chicago: Text Encoding Initiative, 1994.

[Clark 1995] Clark, James. "SP, a next-generation SGML parser." http://www.jclark.com/sp.html

[Crawford and Lee 1990] Crawford, R. G. and Susan Lee. "A prototype for fully automated entry of structured documents." Canadian Journal of Information Science 15(4): 39-50, December 1990.

[Dengel et al. 1992] Dengel, Andreas, Rainer Bleisinger, Rainer Hoch, Frank Fein, and Frank Hönes. "From paper to office document standard representation." IEEE Computer 25(7): 63-67, July 1992.

[Fawcett 1993] Fawcett, Heather. "The new Oxford English Dictionary project." Technical Communication 40(3): 379-382, August 1993.

[Goldfarb 1990] Goldfarb, Charles. The SGML Handbook. Oxford: Clarendon Press, 1990.

[ISO/DTR 10037] International Standards Organization. Guidelines for SGML syntax-directed editing systems. Geneva: ISO/IEC, 1988.

[Taghva et al. 1994] Taghva, Kazem, Allen Condit, Julie Borsack, and Srinivasulu Erva. "Structural markup of OCR Generated Text." UNLV Information Science Research Institute 1994 Annual Report. Las Vegas: ISRI.

[Wall and Schwartz 1991] Wall, Larry, and Randal L. Schwartz. Programming Perl. Sebastopol, CA: O'Reilly & Associates, Inc., 1991.




This document created March 20, 1995

Copyright (c) 1995 Casimir J. Palowitch and Darin L. Stewart, all rights reserved