Scholarly Publishing on the World Wide Web

Project Manager: Stuart L. Weibel, Consulting Research Scientist

Abstract

The explosive growth of the World Wide Web (WWW) is due in part to the ease with which information can be made available to Web users. The simplicity of HTML and HTTP servers lowers the barriers to network publishing.

The high-quality rendering of HTML in WWW browsers such as Mosaic raises the aesthetic appeal of information and makes it more useful by virtue of enhanced readability. But the simplicity that makes WWW technology so appealing also makes it difficult to represent the complex markup and typography necessary for scholarly publishing. The need for extensive character sets and more effective interface facilities for inter- and intra-document navigation stretches the limits of the current standards that underlie the Web and its clients. In addition, the stateless nature of WWW client-server interactions complicates the implementation of the search and retrieval functionality that is so important to document retrieval systems.

OCLC distributes several scholarly journals under its Electronic Journals Online service, acting, in effect, as an electronic printer for scholarly publishers. As part of this effort, OCLC has prototyped a WWW-accessible version of these journals.

This report describes the problems encountered, details some of the short-term solutions, and highlights changes to existing standards that will enhance the use of the Web for scholarly electronic publishing. [Note: a version of this paper will appear in the forthcoming proceedings of the 1994 Chicago World Wide Web Conference, to be published as a special issue of Elsevier Science's Computer Networks and ISDN Systems.]

Scholarly Publishing

Publishers are increasingly turning to Standard Generalized Markup Language (SGML) as the lingua franca for electronic representation of their products. SGML allows the representation of the logical structure of a document and is sufficiently flexible to support arbitrary rendering models, either paper-based or electronic.

HTML (HyperText Markup Language), a simple application of SGML-like markup, is the standard method for expressing document structure in the WWW (Berners-Lee, 1993). Its simplicity has contributed to its popularity and made Web publishing more accessible, but that same simplicity makes it difficult to express the full richness of conventionally published scholarly documents.

Three major problems loom large for the publication of technical scholarly information on the Web:

Mathematical representation
A formal model for expressing tabular material
Representation of non-standard character sets

The first two are difficult, but are currently the subject of active development, and standardization of the resulting solutions is part of the agenda of the Internet Engineering Task Force (IETF) working group on HTML. An acceptable work-around is conversion of equations and tables to graphics.

The character-set problem, however, will require a consensus among the networking community and implementation on all client platforms. No straightforward solution can be expected in the near term. A sufficient work-around is to generate bitmaps for unsupported characters and insert them into the running text. This results in a mixture of small bitmaps interspersed with normal characters (the disadvantages are discussed later), but allows correct representation of complex text which may include unusual characters and mathematical equations.

In spite of these problems, there is great incentive to offer scholarly electronic publishing via the Web. Mosaic has set a new standard by providing platform-independent access to networked information through an aesthetically pleasing, intuitive interface. In recognition of this, the American Institute of Physics and OCLC introduced WWW access to the Applied Physics Letters Online journal in January 1995 (fig. 1). To accomplish this goal, it was necessary to develop a process for translating from a rich SGML markup to the relatively less rich HTML representation without compromising accuracy.

The architecture described here involves two areas of development:

Translation of a rich, descriptive markup format (SGML) to the simpler, rendering-oriented markup of HTML
Creation of a gateway from the stateless Web environment to a session-based document search and retrieval system

Translation Process

A translation facility was developed to automatically convert SGML documents to an HTML representation. Several functions are integrated under the control of a Tcl (Tool Command Language) shell (Ousterhout, 1994). Tcl is a scripting language providing generic, embeddable programming facilities that can easily be incorporated into applications. The translator described here integrates the powerful scripting facilities of Tcl with the tools developed in OCLC's SGML Document Grammar Builder Project (Shafer, 1994); namely, grammar extraction, decomposition, analysis, and a generic translation language.

Fig. 1 Sample Page from Applied Physics Letters Online

Context-Sensitive Mapping

Mapping of SGML to HTML requires a context-sensitive approach: the same tag can mean different things, depending on its context. For example, the <title> tag may be applied both to the title of an article and to a series of titles that occur in reference citations. Thus, it is not sufficient to map all instances of <title> to, for example, the HTML major heading tag <H1>. To do so would give equal display emphasis to the titles listed in the bibliography of an article and to the title of the article itself.
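
The idea of context-sensitive mapping can be illustrated with a minimal Python sketch. It is not the Tcl-based translator described in this report; the rule table, the element names other than <title>, and the choice of output tags are assumptions made only for illustration. Each rule matches the tail of the element's ancestor path, so the same tag maps differently in different contexts.

```python
# Hypothetical context rules: (tail of the ancestor path) -> HTML tag.
# Element names other than "title" are assumed for illustration.
RULES = [
    (("article", "front", "title"), "H1"),  # article title: major heading
    (("citation", "title"), "I"),           # cited title: inline emphasis
]
DEFAULT_TAG = "P"


def map_tag(path):
    """Return the HTML tag for the element at the end of `path`.

    `path` is the tuple of SGML element names from the document root
    down to the current element. The first rule whose pattern matches
    the tail of the path wins; otherwise DEFAULT_TAG is used.
    """
    for pattern, html_tag in RULES:
        if tuple(path[-len(pattern):]) == pattern:
            return html_tag
    return DEFAULT_TAG


def render(path, text):
    """Wrap `text` in the HTML tag chosen for its SGML context."""
    tag = map_tag(path)
    return f"<{tag}>{text}</{tag}>"
```

With these rules, a <title> directly under the article front matter is rendered as a major heading, while a <title> inside a citation receives only inline emphasis.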

Source Documents: 12083 SGML

The source documents in this effort are provided by the American Institute of Physics in 12083 SGML. The 12083 proposed standard is a recommended Document Type Definition (DTD) for markup of scholarly information, and includes specifications for marking up mathematical notation (Electronic Manuscript Preparation and Markup, 1994).

The translator reads each 12083 SGML document and decomposes it into a grammar tree. Each SGML character entity is translated into either its HTML-specific counterpart or a Universal Resource Locator (URL) pointing to a Graphics Interchange Format (GIF) image of the appropriate TrueType™ character. Each formula (i.e., an equation or other mathematical notation) is extracted and translated from 12083 SGML to TeX to generate a corresponding image. A URL for this image is substituted in the appropriate location in the text, causing web browsers to display the image when the document is loaded. Figure 2 shows incorporation of inline images, as represented by both inline and display equations, in the running text of a Web browser.
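
The entity-translation step might be sketched as follows. This is a simplified Python illustration rather than the production translator: the small entity table, the /glyphs base path, and the image file naming are hypothetical, and the sketch ignores formulas entirely.

```python
import re

# Entities assumed to have a direct HTML counterpart (illustrative subset).
HTML_ENTITIES = {"amp": "&amp;", "lt": "&lt;", "eacute": "&eacute;"}

# Hypothetical location of the pre-rendered glyph images.
GLYPH_BASE = "/glyphs"

# Matches an SGML character entity reference such as &alpha;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def translate_entities(text):
    """Replace each entity with its HTML form or an inline glyph image."""

    def replace(match):
        name = match.group(1)
        if name in HTML_ENTITIES:
            return HTML_ENTITIES[name]
        # No HTML counterpart: point at a bitmap of the character instead.
        return f'<IMG SRC="{GLYPH_BASE}/{name}.gif" ALT="{name}">'

    return ENTITY_RE.sub(replace, text)
```

Supplying the entity name as ALT text is a design choice of this sketch; it gives non-graphical clients something to show in place of the bitmap.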

Fig. 2 Inline Images in HTML Text Viewed by a Web Browser

Figures and Tables

Figures and tables are handled similarly. In these cases, however, the URL points to a subsampled thumbnail GIF image hyperlinked to a corresponding full-size GIF image. The thumbnail image is reduced to a height of 125 pixels while maintaining the aspect ratio, thereby reducing initial loading burdens and providing a better-proportioned page display (full-resolution figures in electronic documents are typically of awkward proportion when included inline in running text). The full-sized image is displayed by selecting the thumbnail image, thereby invoking the appropriate external viewer.
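
The thumbnail arithmetic is simple enough to show directly. The Python sketch below assumes only the 125-pixel height given above; the function names and the markup produced are hypothetical, and the image subsampling itself is left to whatever imaging tool is in use.

```python
THUMB_HEIGHT = 125  # pixels, the thumbnail height stated in the text


def thumbnail_size(width, height, thumb_height=THUMB_HEIGHT):
    """Return (width, height) of a thumbnail that keeps the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = thumb_height / height
    # Round to whole pixels, but never collapse the width to zero.
    return max(1, round(width * scale)), thumb_height


def figure_link(full_url, thumb_url):
    """Return HTML that shows the thumbnail and links to the full image."""
    return f'<A HREF="{full_url}"><IMG SRC="{thumb_url}"></A>'
```

For example, a 1000 x 500 pixel figure is scaled by 125/500 = 0.25, giving a 250 x 125 pixel thumbnail.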

Hyperlinks

Forward and backward hyperlinks are established using the tag, attribute, and value information of each forward- and backward-referencing SGML tag. These hyperlinks provide a mechanism for traversing the individual document and allow the user to jump forward and backward between the text and its references, figures, and tables.

Rendering of Glyphs, Figures, and Tables

The image-rendering process is based on an existing software product, Guidon, developed at OCLC as the display client for the Electronic Journals Online (EJO) electronic publishing system. Guidon software is provided with each subscription to any EJO journal. It is a Microsoft Windows application optimized to display EJO documents. Since OCLC designs TrueType fonts for any characters that publishers request (as part of its electronic publication service), OCLC controls the characters needed to display EJO documents. Our rendering and display engine has been adapted to function as a bitmap server, rendering a single character or a complex equation as required. The resulting glyphs are stored as transparent GIF images, linked to the text by URLs.

Retaining SGML Structure

Once the translation is complete, the newly created HTML file and all associated image files are written into the HTML store. The original SGML versions of the documents are used to build an inverted-file database for use by Guidon. This same database is used to search the collection and generate pointers to the HTML version of the document. It is important to note that the original SGML markup is retained in this database, and this markup supports rich searching capabilities. Thus, although the delivery of scholarly journals into the Web involves some display formatting compromises, it need not result in loss of structured document searching capabilities.

The database system is OCLC's Newton, a database and search engine designed for tree-structured data of arbitrary complexity. It is currently in use for databases as large as 30 million records, and is accessible via Z39.50 requests.

Pitfalls

The translation architecture is not without problems, and should be viewed as an interim solution until WWW protocols evolve to support scholarly publishing. Some of the specific pitfalls include:

Searching within documents: The searching facilities currently available within browsers do not work properly with embedded bitmap characters. For example, an author's name containing a nonstandard character can be represented using an embedded bitmapped glyph, but searching and highlighting functions do not work properly for such terms.
Resizing of fonts: Since bitmapped glyphs are a fixed size, a user who resizes the display font finds that the glyphs do not change in size proportionately. This is acceptable for small changes in font size, but the use of large fonts, important for those with visual impairments, results in unappealing page displays.
Anomalous behavior among browsers: HTML is only now undergoing a rigorous standardization process, and minor anomalies in browser behavior sometimes result in document rendering that is unaesthetic. For the near term, publishers need to advise their subscribers on the suitability of specific Web browsers for use with their journals.

A Stateful Union

The statelessness of the Web is one of the virtues of the model: its simplicity has contributed to its popularity. Servers maintain no context information and tear down a connection immediately after providing a response. This virtue becomes a vice for an information service that benefits from (or requires) a session-based interaction, as do most reference databases and document retrieval systems.

Statefulness is essential to retain session context for the user (reusable result-sets, for example) and, in the case of fee-based services, eliminates the need to reauthenticate a user for each transaction. One solution lies in a hybrid HTTP-Z39.50 server, a stateless gateway to the session-based Z39.50 world.

Z39.50 and HTTP

The Z39.50 standard provides the basis for assuring interoperability among disparate document and database query systems, allowing vendors to implement search and retrieval systems that understand queries expressed in the standard manner, and supporting a common understanding of a retrieval session.

Normally, an HTTP server terminates after responding to a client request. Because our server is acting as a Z39.50 client on behalf of the HTTP client, it cannot terminate; if it did, the Z39.50 session would be lost. The connection to the HTTP client, however, is torn down after each response, as the HTTP protocol requires, so the server must have a way of recognizing returning clients. In our architecture, this is accomplished by putting a session ID in all URLs produced by the gateway and returned by the client.
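
A minimal Python sketch of carrying a session ID in URLs follows. The query-string layout, the parameter name "session", and the /gateway path are assumptions for illustration; the precise form of the URLs used by the OCLC gateway is not specified here.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


def new_session_id():
    """Return an unguessable identifier for a new gateway session."""
    return secrets.token_hex(8)


def gateway_url(base, session_id, **params):
    """Build a URL that carries the session ID along with the request."""
    query = {"session": session_id, **params}
    return f"{base}?{urlencode(query)}"


def session_from_url(url):
    """Recover the session ID from a returning client's URL, if present."""
    values = parse_qs(urlsplit(url).query).get("session")
    return values[0] if values else None
```

Because every link in a generated page is built with gateway_url, any link the user follows hands the session ID back to the server, which can then look up the matching session.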

Session Reliability

The initial implementation of our architecture used a multiuser server that maintained Z39.50 sessions for each Web browser. Although this approach was generally functional, any server failure resulted in the loss of all "open" sessions, clearly an undesirable result. The alternative is to provide some isolation of code- and data-space by initiating individual gateway sessions for users.

The central multiuser HTTP server still exists, but it now acts simply as a message router between the Web browser and a Z39.50 gateway (fig. 3). The message router needs only enough functionality to determine whether a database request is an initial request (which initializes a Z39.50 gateway) or a subsequent request (which is routed to an existing Z39.50 session). This functional simplicity results in a more dependable piece of code, important for sustaining production-quality service. A separate Z39.50 gateway is started for each session, and is maintained until the session is terminated.

Fig. 3 Architecture for an HTTP-Z39.50 Gateway
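
The routing decision can be sketched in a few lines of Python. This is an illustration only: the class and method names are hypothetical, and each gateway is modeled as an in-process object, whereas the architecture described above starts a separate gateway per session precisely to isolate code and data space.

```python
class GatewaySession:
    """Stand-in for one per-user Z39.50 gateway (hypothetical)."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.requests = []

    def handle(self, request):
        self.requests.append(request)
        return f"session {self.session_id}: handled {request!r}"


class MessageRouter:
    """Routes each request to a new or an existing gateway session."""

    def __init__(self):
        self._sessions = {}
        self._next_id = 1

    def route(self, request, session_id=None):
        if session_id is None:
            # Initial request: start a gateway for this user.
            session_id = str(self._next_id)
            self._next_id += 1
            self._sessions[session_id] = GatewaySession(session_id)
        elif session_id not in self._sessions:
            raise KeyError(f"unknown or expired session: {session_id}")
        # Subsequent request: forward to the existing gateway.
        return session_id, self._sessions[session_id].handle(request)

    def close(self, session_id):
        """Discard a session once it has been terminated."""
        self._sessions.pop(session_id, None)
```

The router holds no search state of its own, which reflects the point made above: keeping it this simple is what makes it dependable.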

Interactive Transactions in a Stateless Environment

Database search requests are initiated by the HTTP client using HTML forms. These forms are provided by the HTTP message router, which can satisfy simple HTTP GET requests. Upon receiving the completed form, the message router either initiates a Z39.50 gateway session or routes the query through an existing Z39.50 session. The Z39.50 server then searches the database, identifies result sets, and creates a dynamically generated HTML document which it sends to the Web client via the message router.

The HTML document returned in response to a search request either indicates that no documents satisfy the request or contains a short list of titles and associated URLs that can be used to retrieve the complete documents. If a large number of documents satisfy the request, then the HTML result-set document also includes URLs that can be returned for another list of titles. This is done to avoid delivering unmanageably large result-sets to the client.
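
The following Python sketch shows one way such a result-set document could be generated. The page size of 10, the URL parameters, and the markup are assumptions; the report states only that a short list of titles is returned, with further URLs when the result set is large.

```python
from html import escape
from urllib.parse import urlencode

PAGE_SIZE = 10  # assumed; the text says only "a short list of titles"


def result_page(titles, session_id, start=0, page_size=PAGE_SIZE,
                base="/gateway"):
    """Return the HTML lines for one slice of a result set."""
    if not titles:
        return ["<P>No documents satisfy the request.</P>"]
    lines = ["<UL>"]
    page = titles[start:start + page_size]
    for offset, title in enumerate(page, start=start):
        query = urlencode({"session": session_id, "doc": offset})
        url = escape(f"{base}?{query}")
        lines.append(f'<LI><A HREF="{url}">{escape(title)}</A>')
    lines.append("</UL>")
    next_start = start + page_size
    if next_start < len(titles):
        # More records remain: offer a URL for the next list of titles.
        query = urlencode({"session": session_id, "start": next_start})
        url = escape(f"{base}?{query}")
        lines.append(f'<A HREF="{url}">More titles</A>')
    return lines
```

Only one slice of titles is ever sent to the client, so the size of the response stays bounded however large the underlying result set is.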

Session-based URLs

Marrying the session-oriented world of Z39.50 with the stateless model of the Web gives rise to interesting issues related to the form and function of URLs. The precise form of Z39.50 URLs remains under discussion in standardization circles. OCLC has implemented a URL-Z39.50 scheme which serves the intended purpose, though it may change to reflect experience in a production environment. Several points concerning these URLs merit attention.

Z39.50 URLs are not persistent; the maintenance of session information is achieved by embedding a session ID in the URL, which is aged by the Z39.50 session gateway. A URL corresponding to a particular document in one session would not be valid for use at a later time without reauthentication of the user.
Navigational URLs are generated dynamically to manage the session interactions and have no meaning outside the context of the session in which they are generated. For example, a URL defining a result-set or request for additional records in a result-set has no meaning in isolation.
URLs containing queries, though also session-specific, may be retained by a client for reuse (assuming reauthentication of the user). Thus, a standard profile query may be issued periodically by a user as part of a current awareness program.
Result-set URLs (again, nonpersistent in nature) include a URL for a complete result-set, as well as dynamically generated sublist URLs used to manage the presentation of small chunks of very large retrievals.
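
The aging of session IDs mentioned above might work along the following lines. This Python sketch assumes an idle-timeout policy with a 15-minute default; the actual aging rule used by the Z39.50 session gateway is not described in this report, and the class and method names are hypothetical.

```python
import time


class SessionTable:
    """Tracks session IDs and expires those idle for too long."""

    def __init__(self, max_idle_seconds=900, clock=time.monotonic):
        self._max_idle = max_idle_seconds  # assumed policy, not from the text
        self._clock = clock
        self._last_used = {}

    def open(self, session_id):
        """Register a newly authenticated session."""
        self._last_used[session_id] = self._clock()

    def touch(self, session_id):
        """Return True and refresh the session if it is still valid."""
        last = self._last_used.get(session_id)
        if last is None or self._clock() - last > self._max_idle:
            # Unknown or aged out: the URL is no longer usable.
            self._last_used.pop(session_id, None)
            return False
        self._last_used[session_id] = self._clock()
        return True
```

A URL whose session ID fails this check would require the user to reauthenticate, which is why such URLs cannot be treated as persistent.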

Conclusion

The architecture presented here provides a means of delivering complex scholarly publications using current WWW protocols. The solutions are workable methods that will serve the needs of many information providers until the capabilities of this environment evolve. In exchange for the necessary compromises, publishers gain access to the fastest growing segment of the Internet (WWW) and the growing array of client software across all major hardware platforms.

Notes

Berners-Lee, Tim. Hypertext Markup Language (HTML). ftp://info.cern.ch/pub/www/doc/http-spec.ps, March 1993.

Electronic Manuscript Preparation and Markup. Bethesda, Maryland: National Information Standards Organization. NISO/ANSI/ISO 12083, 1994.

Information Processing—Text and Office Systems—Standard Generalized Markup Language (SGML). Geneva, Switzerland: International Organization for Standardization. ISO 8879, 1986.

Ousterhout, John K. Tcl and the Tk Toolkit. Reading, Massachusetts: Addison-Wesley Publishing Company, 1994.

Shafer, Keith. 1994. "SGML Grammar Structure." Annual Review of OCLC Research July 1992-June 1993, 39-40. Dublin, Ohio: OCLC Online Computer Library Center, Inc.

Project Staff: Jean Godby, Associate Research Scientist; Eric Miller, Research Assistant; Ralph LeVan, Senior Consulting Systems Analyst; Vincent M. Tkac, Programmer Analyst