Principal Investigator: Thomas B. Hickey, Chief Scientist
Abstract
Virginia Tech is one of the universities participating in Elsevier's TULIP project. OCLC is supplying the software to electronically distribute the collection to their campus. For this purpose, we have modified the Guidon electronic journal interface client software to display page images and have written a program to load Elsevier's data into OCLC's Newton text retrieval program. To make working with image databases as natural as possible, we have made a number of enhancements, including "thumbnail" page images and a magnifying option in Guidon.
The University Licensing Program (TULIP) is a cooperative research project, conducted by Elsevier Science and nine U.S. universities, that tests a system for networked delivery and use of journals. Virginia Polytechnic Institute and State University (Virginia Tech), one of these universities, contracted with OCLC to supply the software needed to support the project on its campus in Blacksburg, Virginia. OCLC is supplying Virginia Tech with the following software:
In addition, OCLC is supplying software support needed to get the system running at Virginia Tech. On its side, Virginia Tech is supplying the hardware for the project and coordinating arrangements on campus to ensure adequate use of the system as it becomes available.
The TULIP data comes from Elsevier in datasets, each of which covers the issues of 43 materials science journals published during two weeks. A dataset is a collection of articles, which in turn are collections of pages. Each page is represented by two files:
All page files for an issue of a journal are grouped into a directory. All issue directories for a journal are grouped into a journal directory. All journal directories are grouped into a dataset. A Table of Contents file locates pages for each journal, issue, and article in the hierarchy of directories. It contains bibliographic information about the journals, issues, and articles in the dataset.
The Elsevier data must be converted into an ASN1.BER (OSI Abstract Syntax Notation 1, Basic Encoding Rules) format for entry into the TULIP Newton database. An ASN1.BER file is created for a dataset with a set of ASN1.BER records; each record contains a full set of information for an entire article. This information includes the image of each page of the article, the ASCII text for each page, a thumbnail image of each page, and bibliographic information about the article. The bibliographic information includes the article title, authors, publisher, date of publication, abstract, journal title, and issue. These pieces of information are identified in the record by different ASN1.BER tags so that they can be used as access points for retrieving articles.
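The tagged record layout described above can be illustrated with a minimal sketch of BER tag-length-value encoding. This is not the OCLC loader; the tag numbers (TITLE, AUTHOR) and function names are hypothetical, and the sketch handles only context-specific primitive elements with tag numbers below 31.

```python
# Hypothetical tag numbers, chosen only for illustration.
TITLE = 1
AUTHOR = 2


def encode_length(n):
    """Encode a BER definite length (short form below 128, long form otherwise)."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_field(tag, value):
    """Encode one context-specific primitive element: identifier, length, value."""
    if not 0 <= tag < 31:
        raise ValueError("this sketch only handles tag numbers below 31")
    return bytes([0x80 | tag]) + encode_length(len(value)) + value


def decode_fields(record):
    """Walk a concatenation of encoded fields and return {tag: value}."""
    fields = {}
    pos = 0
    while pos < len(record):
        tag = record[pos] & 0x1F
        first = record[pos + 1]
        pos += 2
        if first < 0x80:
            length = first
        else:
            count = first & 0x7F
            length = int.from_bytes(record[pos:pos + count], "big")
            pos += count
        fields[tag] = record[pos:pos + length]
        pos += length
    return fields


record = encode_field(TITLE, b"Holographic storage") + encode_field(AUTHOR, b"A. Author")
print(decode_fields(record)[TITLE])
```

Because every field carries its own tag and length, a reader can skip directly to the element it needs, which is what makes the tags usable as access points.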
Our design for the TULIP data differs substantially from the designs of the other sites. Our approach stores page images in the database itself, rather than maintaining them as separate files. The images are broken up into pieces of approximately 10,000 bytes before storage. During database building, the compressed page image is decoded only far enough to determine scan line boundaries, so that a separate TIFF file can be generated for each segment of the page. Breaking the image up enables us to send only the portion of a page being viewed, thus speeding the initial display of the pages.
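The segmentation step can be sketched as follows. This is only an illustration under a simplifying assumption: that the compressed size of each scan line, in bytes, is already known from the partial decode. The function name and the tuple layout are hypothetical.

```python
def split_at_scan_lines(line_sizes, target=10_000):
    """Group consecutive scan lines into segments of roughly `target` bytes.

    `line_sizes` holds the compressed size of each scan line (an assumed input).
    Returns a list of (first_line, end_line_exclusive, byte_count) tuples.
    A segment is closed as soon as it reaches the target, so cuts always
    fall on scan line boundaries.
    """
    segments = []
    start = 0
    size = 0
    for i, n in enumerate(line_sizes):
        size += n
        if size >= target:
            segments.append((start, i + 1, size))
            start, size = i + 1, 0
    if start < len(line_sizes):
        # Remaining lines form a final, shorter segment.
        segments.append((start, len(line_sizes), size))
    return segments


print(split_at_scan_lines([4000] * 7))
```

Only the segments that intersect the visible part of a page would then need to be fetched for the initial display.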
We also process each page to produce the thumbnail images that are used for navigation within an article. This reduction is done using an adaptation of the display software, which passes a filter over the image to perform the reduction, but does only minimal decoding of the image. Thumbnail generation now takes about three seconds per image on a Sun SparcStation 10. Originally we used a public domain image software library, but at 60 seconds per image it was judged too slow for reasonable build times by the staff at Virginia Tech.
Maintaining the images within the database reduces the number of files for dataset storage by a factor of 10,000. This reduction should substantially reduce the difficulties in working with such a large database. The processing needed for storage and thumbnail production also serves as a quality check to make sure that the software processes all of the page images. We encounter very few errors in the images, but it is much better to identify any problems when the database is built, rather than waiting until a user stumbles upon one.
Our TULIP system uses a single Newton logical database made up of multiple Newton physical databases. Each Newton physical database typically includes three or four datasets. All the information in the ASN1.BER record for an article is loaded into the Newton database. The various pieces of information such as the image of a given page can be retrieved using the ASN1.BER tags as identifiers. The Newton database also contains 19 indexes, allowing Guidon users to locate information easily. Examples of indexes are: journal id, issue id, article id, article title, article author, abstract, and ASCII text. Each word in these fields is indexed so that a user can locate all articles containing the word "holographic" in the title or all pages with the word "lunar" in the ASCII text.
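Word-level indexing of this kind can be illustrated with a small inverted index. This is a generic sketch, not Newton's implementation; the record layout, field names, and function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(records, fields):
    """Map (field, word) to the set of record ids containing that word."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for field in fields:
            for word in re.findall(r"[a-z0-9]+", rec.get(field, "").lower()):
                index[(field, word)].add(rec_id)
    return index


def search(index, field, word):
    """Return the sorted ids of records whose `field` contains `word`."""
    return sorted(index.get((field, word.lower()), set()))


# Hypothetical records: one per article, keyed by an article id.
records = {
    "a1": {"title": "Holographic data storage", "text": "optical media"},
    "a2": {"title": "Lunar soil analysis", "text": "holographic imaging of lunar samples"},
}
index = build_index(records, ["title", "text"])
print(search(index, "title", "holographic"))
print(search(index, "text", "lunar"))
```

Keeping the field name in the key is what lets a title search and a full-text search for the same word return different results.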
Guidon was originally written to display full-text SGML documents using DVI (device-independent) typesetting information generated by TeX. It was also designed to be easily modified to support different types of databases. Because Guidon could already display TIFF images (for figures in the typeset documents), most of the necessary pieces were in place.
Two major obstacles were encountered in making Guidon display documents from the TULIP image databases. First, the ASN1.BER record structure for the TULIP image database contains data elements different from those found in the full-text records for typeset documents. These data elements had to be mapped onto the internal document structure used by Guidon in a way that made sense. For instance, Guidon expects to find a Table of Contents in the full-text record, which it uses to access the different sections and subsections of the document. The image database records do not have a Table of Contents; they do, however, have a list of the pages that occur in the document. This list of pages then can be mapped internally onto the Guidon document Table of Contents structure so that Guidon can use the list of pages to access the beginning of each page.
The second obstacle was that because the documents were scanned at 300 dpi (dots per inch) resolution, simply decompressing and displaying the segments of the page images on screens (which typically display at 96 dpi) showed the documents at over three times their original size. This required significant scrolling and navigating by the user, making even a single page cumbersome to read. The solution was to scale the 300 dpi images to a lower resolution before displaying them, producing 100 dpi anti-aliased, gray-scale images of the pages that are quite readable and much easier to navigate.
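One common way to obtain an anti-aliased gray-scale image from a bilevel scan is to average each block of pixels. The sketch below reduces by a factor of three (300 dpi to 100 dpi) with a simple box filter. The source does not say which filter Guidon uses, so the box filter, the function name, and the list-of-rows bitmap layout are all assumptions.

```python
def downscale_bilevel(bitmap, factor=3):
    """Reduce a bilevel bitmap to gray-scale by averaging factor x factor blocks.

    `bitmap` is a list of rows of 0 (white) and 1 (black) pixels.
    Returns rows of gray values from 0 (black) to 255 (white).
    Partial blocks at the right and bottom edges are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor if bitmap else 0
    area = factor * factor
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            black = sum(
                bitmap[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            # More black pixels in the block gives a darker gray.
            row.append(255 - (255 * black) // area)
        out.append(row)
    return out


page = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(downscale_bilevel(page))
```

The intermediate gray levels are what keep thin strokes visible after reduction, where simply dropping pixels would erase them.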
Modifying Guidon to the proof-of-concept point where image documents could be viewed took minimal effort. Only a few iterations were needed during a three-week period. After that, several enhancements were made to improve the user's access to the documents.
One enhancement lets users choose the reduction factor used in scaling down the document from the original 300 dpi. As an image is scaled down, more of the document can be viewed at one time, making reading from column-to-column and page-to-page easier. However, due to the unavoidable loss of information inherent in reducing an image, individual characters become harder to discern. Because the optimal point in this trade-off may differ for different documents or even for different users viewing the same document, we decided to let users select the scaling factor for displaying documents.
A second enhancement allows users to view thumbnail images of document pages and scan through sketches of the pages for a quick overview of the article structure. These thumbnail images also function as a navigation aid. If the user clicks the mouse on a thumbnail, Guidon jumps to the top of that page. Additionally, by clicking and dragging the mouse slightly within a thumbnail image, the user can cause Guidon to jump to a specific position on the corresponding page in the document. For example, a user can quickly scan the thumbnails of an entire document, find a graph or figure of interest, click on that portion of the thumbnail, and immediately view that part of the document to determine whether the graph or figure contains the information being sought.
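The thumbnail navigation described above amounts to a proportional coordinate mapping. The following sketch shows one way it might work. The function name, the argument layout, and the choice to use the press position as the target of a drag are assumptions, not details taken from Guidon.

```python
def jump_target(page, press, release, thumb_size, page_size):
    """Translate a mouse gesture on a thumbnail into a position on the full page.

    A plain click (press == release) jumps to the top of the page.
    A drag jumps to the point on the page that corresponds to the press position.
    Returns (page, x, y) in full-resolution page pixels.
    """
    if press == release:
        return page, 0, 0
    thumb_w, thumb_h = thumb_size
    page_w, page_h = page_size
    # Clamp to the thumbnail so a gesture that starts on the border still maps onto the page.
    tx = min(max(press[0], 0), thumb_w - 1)
    ty = min(max(press[1], 0), thumb_h - 1)
    return page, tx * page_w // thumb_w, ty * page_h // thumb_h


# An 85 x 110 pixel thumbnail of a 2550 x 3300 pixel page (8.5 x 11 inches at 300 dpi).
print(jump_target(3, (10, 20), (10, 20), (85, 110), (2550, 3300)))
print(jump_target(3, (10, 20), (12, 22), (85, 110), (2550, 3300)))
```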
A third enhancement was made because scaling a page image down from 300 dpi for display makes small characters unreadable. This is especially true for subscripts and superscripts, or for captions and labels in figures or graphs. Using the rescaling feature causes significant interruption of reading if the user simply wants to determine whether a portion of an equation is x² or x³. The user must stop reading, select the Enlarge Text menu item, wait for the entire screen to be redrawn using the new scaling factor, and possibly scroll the window to find the text in question. To continue reading, the user must select the Reduce Text menu item to return the image to the original, more navigable scaling factor, and again wait for the screen to be redrawn. If another equation occurs later, the same cumbersome process must be repeated.
To avoid the excessive effort required to read a few tiny letters at a scaling factor appropriate for the rest of the document, we provided an electronic magnifying glass that lets a user closely examine a small section of the document. To use it, the user presses the right mouse button over a section of text and a window pops up displaying an enlarged version of that text. (This enlarged version is actually a segment from the original 300 dpi image.) The user then releases the mouse button and continues reading.
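Since the magnified view is cut from the original 300 dpi image, the main computation is mapping a screen position back to a rectangle in that image. A minimal sketch, assuming an integer reduction factor and a fixed lens size (both hypothetical parameters, as is the function name):

```python
def magnifier_region(x, y, scale, lens_w, lens_h, img_w, img_h):
    """Return the (left, top, right, bottom) crop of the full-resolution image.

    (x, y) is the mouse position in the reduced display image, `scale` is the
    reduction factor (3 for a 300 dpi page shown at 100 dpi), and the lens
    dimensions are given in full-resolution pixels. The rectangle is centered
    on the mouse position and shifted as needed to stay inside the image.
    """
    cx, cy = x * scale, y * scale
    left = min(max(cx - lens_w // 2, 0), max(img_w - lens_w, 0))
    top = min(max(cy - lens_h // 2, 0), max(img_h - lens_h, 0))
    return left, top, min(left + lens_w, img_w), min(top + lens_h, img_h)


print(magnifier_region(100, 100, 3, 300, 150, 2550, 3300))
```

No rescaling is needed for the pop-up itself: displaying the cropped 300 dpi pixels directly on a screen of roughly 100 dpi already gives about a threefold enlargement.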
Undoubtedly the biggest problem encountered has been integrating Virginia Tech's optical jukebox into their computer system. Virginia Tech is now relying on magnetic disks for storage until the optical system works reliably. Software problems encountered by OCLC have been the common ones expected when working with a large amount of data in a new format. For example, unexpected combinations of fields and system limits were encountered because the Newton database building software was originally designed for smaller documents.
The Guidon conversion went fairly smoothly. We encountered a problem merging databases, but easily made the required changes in the server software and in Guidon.
Currently (December 1994) all the software is in place, sufficient magnetic storage for a year's data has been installed, and building of databases from the Elsevier datasets is underway. Students at Virginia Tech should begin using the system in the spring semester 1995.
American National Standards Organization. Information retrieval service and protocol: American national standard information retrieval application service definition and protocol specification for open systems interconnection. ANSI standard Z39.50, Version 2, 1992.
Information on TULIP: The University Licensing Program. Elsevier Science. URL: http://www.elsevier.nl/.
International Organization for Standardization (ISO). Information technology, open systems interconnection, specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1). ISO/IEC standard 8825, Second edition, 1990.
OCLC Project Staff: Robert J. Haschart, Senior Systems Analyst; Thomas L. Terrall, Systems Analyst
Virginia Tech Project Staff: William Dougherty, Systems Engineer; Dr. Edward A. Fox, Associate Professor, Department of Computer Science; Janice Gibbs, Systems Engineer; Charles Litchfield, Co-Head, Library Automation Department