Page Image and SGML: Alternatives for the Digital Library

Terry R. Noreault and Mark A. Crook

Office of Research, OCLC Online Computer Library Center, Inc.

Abstract

As the Digital Library evolves, important questions about technology selection for information capture and delivery arise. Today, the two most widely used technologies are tagged (e.g., SGML) documents and document page images. OCLC Online Computer Library Center, Inc. (OCLC) has successfully applied both techniques in the Electronic Journals Online and SiteSearch Image Extension systems. OCLC's experience with both technologies is a source of practical insight into the relative strengths and weaknesses of the respective approaches. This paper compares these methods of electronic access to and delivery of digitized journal literature and highlights considerations for the evolving Digital Library.

Keywords: SGML, digital imaging, libraries, OCLC

1. Introduction

1.1 The Library's Place in an Electronic World

What is the place of the library in the electronic world? "Place" can be considered from two perspectives, namely "place" in terms of physical location and "place" from the perspective of the library's role in an electronic world. Since it is generally accepted that electronic tools help a library transcend its physical location, we will briefly consider the library's role. A recent Business Week article said that in an electronic world, libraries would either languish or adapt and thrive. We believe that the latter will happen because library staff have the unique skills necessary to organize, access, and deliver information regardless of its format. These skills are especially needed in an electronic world where non-integrated offerings proliferate, sometimes in a fashion that actually impedes information access and delivery to patrons. However, with the dizzying array of electronic resources available to today's libraries, it becomes difficult to select among the alternatives and to choose the resource that best meets the needs of the library patron and library staff.

In this paper, we will help the reader consider the advantages and disadvantages of two popular forms of representing electronic documents: page image and SGML. We will provide an overview of OCLC's practical experience with these two forms, give general definitions of SGML and imaging and then evaluate the two on the basis of four types of costs and three usability perspectives. We conclude the paper with a view of the future of electronic documents and what such a future means to the library's role in an electronic world.

2. OCLC's Experience

Since the early 1980s, OCLC has been developing demonstration systems and commercial offerings based on image, SGML, and hybrid databases. OCLC's ability in building and providing access to large, text-based (ASCII) databases such as the Online Union Catalog (OLUC) has gained us worldwide recognition as a bibliographic utility. However, we have long known that the need of library patrons to have "information where and when they want it, in the form they want it, at a price that they can afford" would lead us beyond ASCII databases and beyond bibliography to images and full text.

Our first effort to combine text and images was a research project, GraphText® (1984-1989), which used TeX to mark up the document database and proprietary software to render the documents on an IBM AT workstation with a Hercules graphics adapter card. This approach was extended when we placed an SGML version of the Kirk-Othmer Encyclopedia of Chemical Technology under GraphText, again using TeX and proprietary software for on-screen rendering. By 1989, our experience in electronic publishing allowed us to begin work on the SGML-based Electronic Journals Online (EJO); document pages are displayed on-screen using Guidon or a Web browser such as Mosaic as the user interface. Now a commercially available service, EJO offers 35 titles of journals published wholly or in part in an online environment.

Two significant projects using image technology to deliver electronic documents were begun in 1992: an image project at Virginia Polytechnic Institute using Elsevier's TULIP database, and SiteSearch. The purpose of the Virginia Tech research project is to investigate the network delivery and use of electronic journals. The database consists of page images, an ASCII version of those pages generated using OCR, and table of contents and abstract information. Both the page images and ASCII text are managed by OCLC's Newton search engine, enabling users to search for keywords and retrieve the respective page images. OCLC SiteSearch software is a commercially available family of software tools that enables customer sites to build and maintain a local information repository of text and image data and to access that information through standard searching protocols using a site-selected user interface ranging from glass TTY to a graphical Web browser.

Finally, beginning in 1993, we explored a multiplicity of issues related to a large (450,000 pages) electronic journal database through our participation in the CORE project. These documents were marked up in SGML and managed by Newton. The distinguishing characteristic of this project was the SCEPTER X-Windows graphical user interface which rendered on-screen displays of these SGML documents and their embedded images on the fly rather than relying on a pre-formatted representation. Given OCLC's experience, we now turn to some of the practical considerations of electronic documents stored in image and SGML and their implications for an electronic library.

3. SGML

SGML (Standard Generalized Markup Language) is a language which allows an author to define a markup environment in the flow of a document to relate something about the document, that is, the intention of the text. SGML is geared for the reusability of content; it is a hardware- and software-independent international standard which has been adopted by industries and supported by vendors.

A basic SGML text consists of a Document Type Definition (DTD) and the document content. Document content is composed of markup elements and data. The DTD defines the valid markup elements for a document type and what they mean. In other words, a DTD is a descriptive envelope for the contents of the document. For example:

<!DOCTYPE ARTICLE [

<!ELEMENT ARTICLE - - (AUTHOR, TEXT)>

<!ELEMENT AUTHOR - - (#PCDATA)>

<!ELEMENT TEXT - - (#PCDATA)>

<!ATTLIST TEXT style CDATA #implied> ]>

The preceding DTD defines a document type, ARTICLE, as consisting of two structural parts, AUTHOR and TEXT. The elements AUTHOR and TEXT are further defined by the terminal symbol #PCDATA, which means that the content of each element is parsed character data, i.e., data characters in the text that are not markup. Finally, the TEXT element has an attribute, style.

In a "document instance," markup tags surround pieces of document content and describe the function of the pieces in the document's overall structure. Attributes modify the elements by carrying additional information about the document section. In the following example, the tags are the names enclosed in angle brackets and the attribute is style:

<article>

<author>Terry Noreault</author>

<text style=emphasized>This is a SHORT article</text>

</article>

The resultant output could be:

This is a SHORT article

Note that we said that the resultant output could appear like the string above. This is because an (output) application would use the style attribute value "emphasized" to actually specify how the text would be formatted. There is no physical manifestation presumed by the style attribute value. Document Style Semantics and Specification Language (DSSSL) is an international standard that defines how to transform the marked up text into formatted text.

The key thing to remember about DTDs with respect to the above example is that a DTD allows you to validate attributes and attribute values for each element without marking up the document again for a special output application. The SGML parser determines whether the document structure is valid, then the output application (print, large print, Braille, Mosaic, etc.) determines what the special markup attributes mean to it. In addition, "SGML is now often expected to have the parts of a document best represented as images, such as figures and photographs linked to it so that they can also be displayed."
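The division of labor described above, where a parser checks structure and each output application interprets the attributes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library XML parser stands in for a true SGML parser (so the attribute value is quoted and tags are lowercase), the content model is hand-copied from the sample DTD rather than read from it, and the two style tables ("screen" and "plain") are hypothetical.

```python
import xml.etree.ElementTree as ET

# Content model hand-copied from the sample DTD: ARTICLE holds AUTHOR then TEXT.
CONTENT_MODEL = {"article": ["author", "text"], "author": [], "text": []}

# Hypothetical style tables: each output application decides for itself
# what the attribute value "emphasized" means.
RENDERERS = {
    "screen": {"emphasized": lambda s: "*" + s + "*"},
    "plain": {"emphasized": lambda s: s.upper()},
}

INSTANCE = (
    "<article>"
    "<author>Terry Noreault</author>"
    '<text style="emphasized">This is a SHORT article</text>'
    "</article>"
)


def is_valid(element):
    """Check that the element and its descendants match the content model."""
    expected = CONTENT_MODEL.get(element.tag)
    if expected is None:
        return False
    children = list(element)
    if [child.tag for child in children] != expected:
        return False
    return all(is_valid(child) for child in children)


def render(root, target):
    """Format the TEXT element for one output target."""
    text = root.find("text")
    style = text.get("style")
    formatter = RENDERERS[target].get(style, lambda s: s)
    return formatter(text.text)


root = ET.fromstring(INSTANCE)
print(is_valid(root))
print(render(root, "screen"))
print(render(root, "plain"))
```

The document instance is marked up once; adding a third output target would mean adding one more style table, not touching the document.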

4. Imaging

"Page images are most easily thought of as facsimile images of the pages, and as a matter of fact they are typically stored in CCITT Group 4 format because that is one of the most compact ways to store black and white scanned images of text." Basically, digital imaging is an analog of the microfilming process where a picture of the page is captured by a scanning device and stored (in this case on magnetic or optical media). In the remainder of this section, we will highlight four important factors to consider with imaging: encoding, color, resolution, and compression. These factors ultimately combine to impact production costs, storage requirements, and display quality. The display quality is a function of the amount of visual information that the image conveys.

Encoding: Images can be represented as vectors, pixels, or a hybrid of the two. By far the most common is pixel representation. Vector representation is primarily used in CAD (Computer Aided Design) systems. Hybrid, sometimes called compound, images consist of both pixel and vector. The pixel, or raster, approach breaks the image into a series of single points. The remainder of this paper will discuss only pixel images.

Color: Depending on the sophistication (and price) of the electronic document scanning equipment, there are three "color" options with respect to the resultant digital image: black and white, grey-scale, and color. A black and white image has each pixel represented by 1 bit (i.e., black or white). Grey-scale can be thought of as a black and white photograph of the image with 2 to 8 bits per pixel of black, white, and grey shades, which provide more definition to the picture. Half tones, commonly used in newspapers, are black and white pixels which vary in size, giving the visual impression of grey-scale. True grey-scale has much greater definition than half tones. Finally, color images can be thought of as color photographs with increasingly rich color palettes: 8-bit color offers 256 colors, 16-bit gives approximately 65 thousand, and 24-bit produces approximately 16 million. As previously mentioned, each color progression conveys more information but incurs higher costs.
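The palette sizes follow directly from the number of bits stored per pixel: n bits can distinguish 2 to the power n values. A minimal check (the function name is ours, chosen for illustration):

```python
def color_count(bits_per_pixel):
    """Number of distinct values representable with the given bit depth."""
    return 2 ** bits_per_pixel


for bits in (1, 8, 16, 24):
    print(bits, color_count(bits))
```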

Resolution: Most of us are familiar with resolution, that is, how many dots per inch (dpi) there are in the image. In terms of conveying more information, the higher the dpi, the sharper the image will be. As with color, increased dpi brings increased scanning and storage costs. The most common resolution for scanned images is 300 dpi; however, this resolution is inadequate to capture most grey-scale images.
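To see how resolution and bit depth combine to drive storage, the uncompressed size of a scanned page can be estimated as pixels times bits per pixel. The sketch below assumes an 8.5 by 11 inch page, which is our assumption and not a figure from the projects described here, and it ignores any row padding or file headers a real image format would add.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page (no padding, no headers)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels * bits_per_pixel + 7) // 8  # round up to whole bytes


for dpi, bits in ((300, 1), (300, 8), (600, 1)):
    size = raw_page_bytes(8.5, 11, dpi, bits)
    print(f"{dpi} dpi, {bits}-bit: {size:,} bytes")
```

Doubling the dpi quadruples the pixel count, and moving from 1-bit to 8-bit multiplies the size by eight, which is why compression matters so much for page images.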

Compression: Compression refers to how many bytes of repetitive data can be squeezed out of the image using numerical algorithms. Compression algorithms reduce the stored size of image files; the corresponding decompression step is generally invoked when an image is displayed. Commonly used image compression algorithms include LZ (a modified version of this algorithm is used in GIF images), Group 4 facsimile, and JPEG. LZ and JPEG can be used to compress grey-scale or color images, whereas Group 4 is effective only for black and white.
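The idea of squeezing out repetitive data can be illustrated with simple run-length encoding of one black and white scan line. This is not the Group 4 algorithm, which is considerably more elaborate; it is only a toy sketch showing why long runs of white and black pixels, typical of scanned text, compress well. The scan line below is invented for the example.

```python
def rle_encode(row):
    """Collapse a row of 0/1 pixels into [pixel_value, run_length] pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs


def rle_decode(runs):
    """Expand [pixel_value, run_length] pairs back into the original row."""
    row = []
    for pixel, length in runs:
        row.extend([pixel] * length)
    return row


# Hypothetical scan line: white margin, a stroke, white space, a stroke, margin.
scan_line = [0] * 40 + [1] * 6 + [0] * 30 + [1] * 4 + [0] * 20
runs = rle_encode(scan_line)
print(len(scan_line), "pixels stored as", len(runs), "runs:", runs)
```

Decoding restores the row exactly, so nothing is lost; the saving comes purely from the repetition in the data.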

5. Comparative Evaluation

5.1 Costs

There is no single best way to capture information in digital form. The decision involves a tradeoff between image capture costs and display capabilities which depends on the value and anticipated use requirements of the information.

5.1.1 Creation

The most reliable figures that we have for converting existing material are up to $15 per page to produce SGML for technical material and $0.50 per page to create page images. For SGML to be a cost-effective alternative, the production process currently used to prepare paper copy must be retooled. This retooling allows the creation of SGML as a part of the normal production stream and only marginally increases the cost of production. Still, the increased costs caused by SGML markup cannot be justified unless the document in question is likely to go to several output devices. In spite of the added costs to create an underlying SGML document, many publishers are moving in this direction. A library that wants to build its own collection does not have the option of retooling the production process and therefore faces a huge cost difference between image and SGML.

5.1.2 Acquisition

There are two document acquisition issues to consider in creating an electronic document database: copyright and database loading. With both page image and SGML, the costs of securing the rights to the material are about the same. Similarly, the mechanics of data preparation, indexing and database loading do not differ significantly between page image and SGML. As such, acquisition costs are not distinguishing factors in deciding which format to use.

5.1.3 Storage

Storage costs favor SGML-tagged documents. Since SGML documents are ASCII documents with embedded tags, their storage requirements are quite similar to straight ASCII documents. On the other hand, the average page image requires 80-100 kilobytes to represent a single page in the source document. We estimate that there is a 10:1 storage savings using SGML. For a document database the size of CORE, this means a 40 gigabyte savings, or in dollar terms, around $20,000.
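The storage estimate can be reproduced with simple arithmetic. The sketch below takes the upper figure of 100 kilobytes per page image, the 450,000-page size of the CORE database, and the 10:1 ratio, and it assumes decimal units (1 gigabyte = 1,000,000 kilobytes). The cost per megabyte is not stated in the text; it is simply what the $20,000 figure implies for a 40 gigabyte saving.

```python
PAGES = 450_000           # size of the CORE database in pages
IMAGE_KB_PER_PAGE = 100   # upper end of the 80-100 KB range
IMAGE_TO_SGML_RATIO = 10  # estimated 10:1 storage advantage for SGML


def storage_gb(pages, kb_per_page):
    """Total storage in gigabytes, using decimal units."""
    return pages * kb_per_page / 1_000_000


image_gb = storage_gb(PAGES, IMAGE_KB_PER_PAGE)
sgml_gb = image_gb / IMAGE_TO_SGML_RATIO
savings_gb = image_gb - sgml_gb

# Derived, not quoted: dollars per megabyte implied by $20,000 for 40 GB.
implied_dollars_per_mb = 20_000 / (40 * 1000)

print(f"page images: {image_gb:.1f} GB, SGML: {sgml_gb:.1f} GB")
print(f"savings: {savings_gb:.1f} GB at about ${implied_dollars_per_mb:.2f} per MB")
```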

5.1.4 Display

As previously mentioned, encoding electronic documents in SGML makes sense when the documents are destined for a variety of output devices. Typically, SGML output applications such as FrameMaker and SoftQuad's Enabler allow the end-user to configure the document display to the equipment which is available. Although you are required to purchase this application software, additional investment in specialized equipment is not required. On the other hand, page images do not generally display well on computer screens because they are same-sized images of the printed page. Although the scanned image is faithful to the original, the end-user must scroll the screen to view the entire page contents. As a result, larger, higher-resolution display devices are required for page images than for pages rendered by SGML output applications.

5.2 Usability

In general, the end-user or document creator is more in control of the SGML document whereas with page image documents, the editor makes a final decision for the way images are displayed. The placement and finality of this control has very real implications for document usability.

5.2.1 User-apparent Response Time

When comparing user-apparent response time between page image and SGML documents, two aspects must be considered: telecommunications (delivery) and rendering. We define user-apparent response time as the amount of time elapsed between the final keystroke entry which requests the document and the complete screen painting of a page of the requested document.

Because of their high storage requirements, page images will take longer to deliver to the end-user's workstation than SGML documents. In the case of images, higher bandwidth (in excess of 28.8 kilobits per second) is required to download a document at an acceptable rate, whereas an SGML document (without embedded images) can comfortably be downloaded over a dialup line running at 9600 baud. Although telecommunications and modem costs continue to fall, the 10:1 size differential will continue to be a bottleneck until all "telephone" lines are replaced with high-speed connections.
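A rough calculation shows the scale of the difference. The sketch below assumes a 100 kilobyte page image and a 10 kilobyte SGML page (the 10:1 ratio applied to the image size), treats 9600 baud as 9,600 bits per second, and ignores protocol overhead and modem compression, so the results are best read as lower bounds.

```python
def transfer_seconds(size_kb, bits_per_second):
    """Idealized transfer time: payload bits divided by line speed."""
    return size_kb * 1000 * 8 / bits_per_second


cases = (
    ("page image at 28.8 kbit/s", 100, 28_800),
    ("page image at 9.6 kbit/s", 100, 9_600),
    ("SGML page at 9.6 kbit/s", 10, 9_600),
)
for label, size_kb, speed in cases:
    print(f"{label}: {transfer_seconds(size_kb, speed):.1f} s")
```

Under these assumptions a single page image needs close to half a minute even on the faster line, while the SGML page arrives over the slower line in well under ten seconds.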

Document usability is also affected by rendering time: the faster a screen is painted, the better it holds the end-user's attention and the more productive they are. Because SGML documents are smaller, their delivery is significantly faster. Moreover, once the SGML document has been transferred, it can generally be displayed more quickly than a page image because there is either no need to decompress the information or at least that need is reduced because the embedded images in an SGML document will typically be smaller than page images. Nevertheless, "on the fly" (SGML) document rendering increases the amount of time required to paint the screen and, depending on the complexity of the document, the user-apparent response time could approximate that of page images. Therefore, while SGML documents have an advantage in terms of delivery speed and costs, a high on-screen rendering time could cause productivity to fall, although in most cases the delivery speed, the cost, and the user-apparent response time are superior to those of page images.

5.2.2 Display Quality

Imaging clearly has the advantage with respect to true representation of the original document. However, this faithfulness is a function of the scanning resolution and the ability of the scanning device to capture color. On the other hand, the display of text marked up in SGML will be better vis-à-vis the text in the page image because the output software uses machine-native fonts for display. As with many of the preceding issues, the layout and display characteristics of page images are totally under the document editor's control, whereas with SGML, the document end-user makes the final decisions about display (output).

5.2.3 Searching

Perhaps the most telling difference between page images and SGML documents lies in the extensiveness of searching the underlying document base. Consider a simple example of a series of imaged pages of a journal versus an SGML version of the same text. If the editor hand-built an index of the keywords on the page of the imaged document, the end-user could query the database for a given keyword and (ideally) retrieve the page on which the keyword existed. By comparison, the marked up version of the same information allows the editor to apply full-text indexing software to automatically build a word index. Granted, this index might add in excess of 100% overhead to the database, but this "cost" enables the end-user to search on a word (or truncated word stem) and locate all occurrences of the desired word in the database at a more precise level rather than simply getting a hit on a keyword and a full page image. Admittedly, such additional recall might be undesirable in some applications, but nevertheless the additional "hits" give the end-user options for further exploration. In addition, the availability of all of a document's words permits the possibility for special term weighting or the application of other search algorithms to make the end-user's searching, browsing, and document navigation more productive.
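The full-text word index described above can be sketched as a small inverted index that records every occurrence of every word and supports truncated stems. The three sample pages and the trailing-asterisk truncation syntax are our own illustrative choices, not features of any particular OCLC system.

```python
import re
from collections import defaultdict

# Hypothetical page texts, invented for the example.
PAGES = {
    1: "Page images are facsimile images of the printed page.",
    2: "SGML markup supports searching and structured browsing.",
    3: "Imaging costs fall as scanning equipment improves.",
}


def build_index(pages):
    """Map each word to every (page, word position) where it occurs."""
    index = defaultdict(list)
    for page_no, text in pages.items():
        for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[word].append((page_no, position))
    return index


def search(index, term):
    """Exact word match, or stem match when the term ends with '*'."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        hits = [hit for word in index if word.startswith(stem)
                for hit in index[word]]
    else:
        hits = list(index.get(term, []))
    return sorted(hits)


index = build_index(PAGES)
print(search(index, "images"))
print(search(index, "imag*"))
```

Each hit identifies the exact position of the word, not just the page, which is the added precision the paragraph above refers to; the stem search also illustrates the extra recall that truncation brings.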

But more importantly, the page image document does not support structured browsing and searching. That is to say that the end-user is not able to specify where in the document to search for the desired word. Since SGML retains or imposes a document structure, the document user has increased flexibility in locating the information in question by searching for a term in author, title, chapter headings, footnotes, or body text, for example.
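Structured searching can be sketched by keeping the text of each element separate, so that a query can be restricted to one part of the document. The two records below are hypothetical and reuse the AUTHOR and TEXT elements of the sample DTD; a real system would index many more elements (title, headings, footnotes) in the same way.

```python
# Hypothetical records keyed by element name.
DOCUMENTS = [
    {"id": 1, "author": "Terry Noreault", "text": "This is a SHORT article"},
    {"id": 2, "author": "A. Author", "text": "Noreault wrote a short article"},
]


def search_field(documents, field, term):
    """Return ids of documents whose given element contains the word."""
    term = term.lower()
    return [doc["id"] for doc in documents
            if term in doc[field].lower().split()]


def search_anywhere(documents, term):
    """Return ids of documents containing the word in any element."""
    term = term.lower()
    return [doc["id"] for doc in documents
            if any(term in value.lower().split()
                   for key, value in doc.items() if key != "id")]


print(search_anywhere(DOCUMENTS, "noreault"))
print(search_field(DOCUMENTS, "author", "noreault"))
```

The unrestricted search returns both records, while the search limited to the author element returns only the one actually written by that author, a distinction a page image cannot support.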

Finally, even if the scanned page image output is subsequently run through an OCR process to create ASCII full text, it is not yet feasible to automatically uncover the document structure, and even at best the OCR process itself introduces 1 error in every other line of text. Nevertheless, this approach can provide limited full text access for page image documents in some cases.

6. Conclusions

Clearly, the future of electronic documents includes both page images and SGML. As we have seen, SGML documents are enriched through embedded images which can in fact be page images themselves. Demands for quick and inexpensive availability of electronic documents will drive the creation of page image resources. As research into automatic document structuring and error correction advances, page image databases become a viable, intermediate step towards structured, full text. Because of their relatively low cost, page image databases will likely be used for digitizing retrospective items, if those items are captured at all. It is entirely possible that libraries will develop digitizing strategies within a discipline or subject area in which high-use materials will be marked up and lower-use materials scanned with an option to convert them at some future date.

At the present state of technology, SGML has an important place because of its flexibility. Its neutral format allows automatic generation of new formats while maintaining document structure. Because SGML is marked up ASCII text, storage costs and transmission times are dramatically lower than for page images. However, the storage and transmission advantages are somewhat diminished when images are included as part of an SGML document. Finally, and most importantly from a library patron's perspective, SGML offers possibilities for structured searching and browsing.

As we consider how we will improve access to our document collections in places outside of the confines of the traditional library, we must avoid document myopia since electronic documents will continue to change. We need to keep our options open by choosing a format that enables us to take advantage of technological developments at the least cost of converting to that format. We must watch developing standards in multimedia such as the Virtual Reality Markup Language (VRML) and Java, as we consider their implications and application in the library of the electronic world. From this vantage point it is difficult to say how these technological advances will ultimately impact libraries, but certainly they will change the way electronic documents are created, stored, accessed, and displayed. As a result, the library's role will change as we continue to provide value to our patrons in a rapidly evolving electronic library.