An Architecture for Scholarly Publishing on the World Wide Web

Stuart Weibel, Eric Miller, Jean Godby, Ralph LeVan
Office of Research, OCLC Online Computer Library Center, Inc., Dublin, Ohio
Questions or comments: Stuart Weibel, weibel@oclc.org

Abstract

The explosive growth of the World Wide Web (WWW) is due in part to the ease with which information can be made available to Web users. The simplicity of HTML and HTTP servers lowers the barriers to network publishing.

The high-quality rendering of HTML in WWW browsers such as Mosaic raises the aesthetic appeal of information and makes it more useful by virtue of enhanced readability. But the simplicity that makes WWW technology so appealing also makes it difficult to represent the complex markup and typography necessary for scholarly publishing. The need for extensive character sets and for better interface facilities for inter- and intra-document navigation stretches the limits of the current standards that underlie the Web and its clients. In addition, the stateless nature of WWW client-server interactions presents certain challenges to implementing the search and retrieval functionality that is so important to document retrieval systems.

OCLC distributes several scholarly journals under its Electronic Journals Online program, acting, in effect, as an "electronic printer" for scholarly publishers. As part of this effort, OCLC is prototyping a WWW-accessible version of these journals.

This presentation will describe the problems encountered, detail some of the short-term solutions, and highlight changes to existing standards that will enhance the use of the Web for scholarly electronic publishing.

Scholarly Publishing on the Web

Publishers are increasingly turning to Standard Generalized Markup Language (SGML) [2] as the lingua franca for electronic representation of their products. SGML allows the representation of the logical structure of a document and is sufficiently flexible to support arbitrary rendering models, either paper-based or electronic.

HTML (HyperText Markup Language), a simple application of SGML-like markup, is the standard method for expressing document structure in the WWW [5]. The simplicity of HTML has contributed to its popularity and made Web publishing more accessible, but that same simplicity makes it difficult to express the full richness of conventionally published documents.

There are three major problems that loom large for the publication of technical scholarly information on the Web:

mathematical representation,
a formal model for expressing tabular material, and
the representation of non-standard character sets.

The first two are difficult, but are currently the subject of active development, and standardization of the resulting solutions is part of the agenda of the IETF working group on HTML. Until then, an acceptable work-around for these problems is to convert equations and tables to graphics.

The character-set problem, however, will require a consensus among the networking community and implementation on all client platforms; no straightforward solution can be expected in the near term. The work-around here is to generate bitmaps for unsupported characters and insert them into the running text. This results in a mixture of small bitmaps interspersed with normal characters (disadvantageous, as discussed below), but allows correct representation of complex text which may include unusual characters and mathematical equations.

In spite of these problems, there is great incentive to offer scholarly electronic publishing via the Web. Mosaic has set a new standard by providing platform-independent access to networked information by way of an aesthetically pleasing, intuitive interface. In recognition of this, the American Institute of Physics and OCLC will introduce WWW access to the American Physical Letters Online journal in January of 1995. To accomplish this goal, it has been necessary to develop a translation process from a rich SGML markup to the relatively less-rich HTML representation without compromising accuracy.

The architecture described here involves two areas of development:

translation of a rich, descriptive markup format (SGML), to the simpler, rendering-oriented markup of HTML, and,
the creation of a gateway from the stateless Web environment to a session-based document search and retrieval system.

The Translation Process

A translation facility has been developed in the OCLC Office of Research to automatically convert SGML documents to an HTML representation. Several functions are integrated under the control of a Tcl (Tool Command Language) shell [3]. Tcl is a scripting language providing generic, embeddable programming facilities that can easily be incorporated into applications. Each application can extend the core Tcl features with additional commands specific to that application. The translator described here integrates the powerful scripting facilities of Tcl with the tools developed in OCLC's SGML Document Grammar Builder Project [4]; namely, grammar extraction, decomposition, analysis, and a generic translation language.

Context-Sensitive Mapping

Mapping of SGML to HTML requires a context-sensitive approach: the same tag can mean different things, depending on the context in which it is found. For example, the <title> tag may be applied both to the title of an article and to a series of titles that occur in reference citations. Thus, it is not sufficient to map all instances of <title> to, for example, the HTML major heading tag <H1>; to do so would give the same display emphasis to the titles of documents referenced in the bibliography of the articles as is given to the title of the article itself.
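The idea of context-sensitive mapping can be illustrated with a minimal Python sketch. It is not the Tcl-based translator described in this paper; the element names ("front", "citation") and the rule table are hypothetical and are not taken from the 12083 DTD. The sketch simply chooses an HTML tag for an SGML element by looking at the elements that enclose it.

```python
# Hypothetical rule table: (enclosing element, element) -> HTML tag.
# These names are illustrative assumptions, not the 12083 DTD.
CONTEXT_RULES = {
    ("front", "title"): "h1",     # title of the article itself
    ("citation", "title"): "i",   # title of a cited work in the bibliography
}

def map_tag(tag, ancestors):
    """Choose an HTML tag for `tag` given its enclosing elements, outermost first.

    Returns None when no rule applies, meaning the tag is simply dropped.
    """
    # The nearest enclosing element wins, so search from the inside out.
    for parent in reversed(ancestors):
        html_tag = CONTEXT_RULES.get((parent, tag))
        if html_tag is not None:
            return html_tag
    return None

def render(tag, ancestors, text):
    """Wrap `text` in the HTML tag selected for this context, if any."""
    html_tag = map_tag(tag, ancestors)
    if html_tag is None:
        return text
    return f"<{html_tag}>{text}</{html_tag}>"
```

With these assumed rules, a <title> inside the front matter becomes a major heading, while a <title> inside a citation is only italicized, so the two no longer receive the same display emphasis.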

Source Documents: 12083 SGML

The source documents in this effort are provided by the American Institute of Physics in "12083 SGML". The 12083 proposed standard is a recommended DTD for markup of scholarly information, and includes specifications for marking up mathematical notation [1].

The translator reads each 12083 SGML document and decomposes it into a grammar tree. Each SGML entity in the document is translated into either its HTML-specific counterpart or a URL pointing to a bitmap of the appropriate TrueType (tm) character. Each formula (i.e., an equation or other mathematical notation) is extracted, translated from 12083 SGML to TeX, and then rendered to generate a corresponding bitmap. A URL for the bitmap is substituted in the appropriate location in the text, causing the bitmap object to be loaded by the Web browser when the document is loaded. The incorporation of inline images, representing both inline and display equations, in the running text of a Web browser can be seen in Figure [1].
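The entity substitution step can be sketched in Python as follows. This is an illustration under stated assumptions, not the translator itself: the short entity table, the "/glyphs/" URL prefix, and the ".gif" file naming are all hypothetical. Entities that have an HTML counterpart are passed through; any other entity is replaced by an inline image reference to a glyph bitmap.

```python
import re

# Entities assumed to have a direct HTML counterpart (a tiny illustrative subset).
HTML_ENTITIES = {"amp": "&amp;", "lt": "&lt;", "eacute": "&eacute;", "ouml": "&ouml;"}

# Hypothetical location of the rendered glyph bitmaps.
GLYPH_BASE_URL = "/glyphs/"

# Matches SGML entity references such as &alpha; or &eacute;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")

def translate_entities(text):
    """Replace each entity with its HTML form or with an inline glyph image."""
    def replace(match):
        name = match.group(1)
        if name in HTML_ENTITIES:
            return HTML_ENTITIES[name]
        # No HTML counterpart: point the browser at a bitmap of the character.
        return f'<img src="{GLYPH_BASE_URL}{name}.gif" alt="{name}">'
    return ENTITY_RE.sub(replace, text)
```

Under these assumptions, "&ouml;" is left for the browser to render, while "&alpha;" becomes an image reference that the browser loads along with the document.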

Figures and Tables

Figures and tables are handled similarly. However, in these cases, the URL points to a subsampled thumbnail GIF image, which is, in turn, hyperlinked to a corresponding full-size GIF image. The thumbnail image is reduced to a height of 125 pixels while maintaining the aspect ratio, thereby reducing initial loading burdens and providing a better-proportioned page display (full-resolution figures in electronic documents are typically of awkward proportion when included inline in running text). The full-sized image is displayed by selecting the thumbnail image, thereby invoking the appropriate external viewer.
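The thumbnail arithmetic is simple enough to show as a short Python sketch. The function names and URLs are illustrative; the decision to leave images already shorter than 125 pixels unscaled is an assumption, since the text only describes reduction.

```python
THUMBNAIL_HEIGHT = 125  # target height in pixels, as stated in the text

def thumbnail_size(width, height, target_height=THUMBNAIL_HEIGHT):
    """Return (width, height) of a thumbnail that preserves the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if height <= target_height:
        # Assumption: small images are never enlarged.
        return width, height
    scale = target_height / height
    return max(1, round(width * scale)), target_height

def figure_markup(thumbnail_url, full_size_url):
    """Inline thumbnail that is hyperlinked to the full-size image."""
    return f'<a href="{full_size_url}"><img src="{thumbnail_url}"></a>'
```

For example, an 800 by 500 pixel figure is scaled by 125/500 = 0.25, giving a 200 by 125 pixel thumbnail.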

Hyperlinks

Forward and backward hyperlinks are established using the tag, attribute, and value information of each forward- and backward-referencing SGML tag. From these tags and their reference attributes and values, corresponding HTML anchors and hotlinks are generated with the HTML ANCHOR tag. These ANCHORs provide a mechanism for traversing the individual document and allow the user to jump forward and backward among references, figures, and tables.
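One way to picture the paired anchors is the following Python sketch. The naming scheme (a target identifier plus a ".ref" suffix for each occurrence) is a hypothetical convention chosen for illustration; the paper does not specify how the anchor names are formed.

```python
def forward_link(target_id, occurrence, text):
    """Anchor placed at a reference in the running text.

    It jumps forward to the target and is itself named, so that the
    target can link back to this occurrence.
    """
    return f'<a name="{target_id}.ref{occurrence}" href="#{target_id}">{text}</a>'

def backward_target(target_id, text, occurrence=1):
    """Anchor placed at the target (a bibliography entry, figure, or table).

    It is the destination of forward links and jumps back to one
    occurrence of the reference (the first, by default).
    """
    return f'<a name="{target_id}" href="#{target_id}.ref{occurrence}">{text}</a>'
```

In this sketch the identifier would come from the reference attribute value of the SGML tag, so a citation and its bibliography entry share the same base name.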

Rendering of Glyphs, Figures, and Tables

The image-rendering process is based on an existing software product developed at OCLC as the display client for the Electronic Journals Online (EJO) electronic publishing system. This software, named Guidon (tm), is provided with each subscription to any of the EJO journals. It is a Microsoft Windows (tm) application that is optimized to display EJO documents. Since OCLC designs TrueType (tm) fonts for any characters that publishers request (as part of our electronic publication service) we have control of all the characters necessary to display EJO documents correctly. Our rendering and display engine has been adapted to function as a bitmap server, rendering a single character or a complex equation as required. The resulting glyphs are stored as transparent GIF images, linked to the text by URLs.

Retaining SGML structure

Once the entire translation is complete, the newly created HTML file and all associated image files are written into the HTML store. The original SGML versions of the documents are used to build an inverted-file database for use by OCLC's dedicated MS-Windows client, Guidon. This same database is used to search the collection (as described below) and generate pointers to the HTML version of the document. It is important to note that the original SGML markup is retained in this database, and this markup supports rich fielded searching capabilities, even though much of this document structure is lost in the translation to HTML for screen rendering. Thus, although the delivery of scholarly journals into the Web involves some display formatting compromises, it need not result in loss of structured document searching capabilities.

The database system employed in this project is OCLC's Newton database, a database and search engine designed for tree-structured data of arbitrary complexity. It is currently in use for databases as large as 30,000,000 records, and is accessible via Z39.50 requests.
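The value of retaining the SGML structure for fielded searching can be illustrated with a toy inverted file in Python. This is a conceptual sketch only and does not describe the design of the Newton database; the field names and sample records are invented for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted file keyed by (field, term).

    `documents` maps a document id to a dict of field name -> text,
    standing in for the content of tagged SGML elements.
    """
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index

def search(index, field, term):
    """Return the sorted ids of documents with `term` in the given field."""
    return sorted(index.get((field, term.lower()), set()))
```

Because each posting records the field in which a term occurred, a search for "laser" as a title word can be distinguished from a search for "Laser" as an author name, a distinction that the flattened HTML rendering alone would not support.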

Pitfalls

The translation architecture described here is not without problems, and should be viewed as an interim solution until WWW protocols evolve sufficiently to support the needs of scholarly publishing. Some of the specific pitfalls include:

Searching within documents: The searching facilities currently available within browsers will not work properly with embedded bitmap characters. For example, an author's name containing a non-standard character can be represented using an embedded bitmapped glyph, but searching and highlighting functions will not work properly for such terms.
Resizing of fonts: Since bitmapped glyphs are of fixed size, a user who resizes the display font will find that the glyphs do not change in size proportionately. This is acceptable for small changes in font size, but the use of large fonts, important for those with visual impairments, will result in unappealing page displays.
Anomalous behavior among browsers: HTML is only now undergoing a rigorous standardization process, and minor anomalies in browser behavior sometimes result in document rendering that is unaesthetic at best. Publishers will, for the near term, need to advise their subscribers on the suitability of specific Web browsers for use with their journals.

A Stateful Union

The statelessness of the Web is one of the virtues of the model, an expression of the simplicity that has contributed to its popularity. Servers maintain no context information and tear down a connection immediately after providing a response. This virtue becomes a vice for an information service that benefits from (or requires) a session-based interaction, as is generally the case with reference databases and document retrieval systems.

Statefulness is essential to retain session context for the user (reusable result-sets, for example) and, in the case of fee-based services, eliminates the need to re-authenticate a user for each transaction. One solution lies in a hybrid HTTP-Z39.50 server, a stateless gateway to the session-based Z39.50 world.

Z39.50 and HTTP

The Z39.50 standard provides the basis for assuring interoperability among disparate document and database query systems, allowing vendors to implement search and retrieval systems that understand queries expressed in the standard manner, and supporting a common understanding of a retrieval session.

Normally, an HTTP server process terminates after responding to a client request. Because our server acts as a Z39.50 client on behalf of the HTTP client, it cannot terminate; if it did, the Z39.50 session would be lost. However, the connection to the HTTP client is torn down after each response, as the HTTP protocol requires, so the server must have a way of recognizing returning clients. In our architecture, this is accomplished by putting a session ID in all URLs produced by the gateway and returned by the client.
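The session-ID technique can be sketched in Python as follows. The query parameter name "session" and the paths shown are assumptions made for illustration; the actual URL syntax used by the gateway is not given in this paper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def add_session(path, session_id, **params):
    """Build a URL that carries the session ID along with other parameters."""
    query = urlencode({"session": session_id, **params})
    return f"{path}?{query}"

def session_of(url):
    """Recover the session ID from a URL sent back by a client, or None."""
    values = parse_qs(urlsplit(url).query).get("session")
    return values[0] if values else None
```

Every link in a generated page would be built with add_session, so that whichever link the user selects, the request that comes back identifies the session it belongs to.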

Session Reliability

An initial implementation of our architecture resulted in a multi-user server that maintained Z39.50 sessions for each Web browser. Although this design was functional, any server failure resulted in the loss of all open sessions, clearly an undesirable result. The alternative is to isolate the code and data space of each user by initiating an individual gateway session for each one.

The central multi-user HTTP server still exists, but it now acts simply as a message router between the Web browser and a Z39.50 gateway (Figure [2]). The message router needs sufficient functionality to determine if the database request is an initial request (which initializes a Z39.50 gateway) or a subsequent request (the request is routed to an existing Z39.50 session). This functional simplicity results in a more dependable piece of code, important for sustaining production-quality service. A separate Z39.50 gateway is started for each session, and is maintained until the session is terminated (sessions are "aged" and terminated when a specified interval without further requests has passed).
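The routing decision and the aging of sessions can be sketched in Python. This is a simplified model under stated assumptions: the class name, the 15-minute default interval, and the sequential session IDs are hypothetical, and where the real router would start or contact a separate gateway process, the sketch only records a timestamp.

```python
import itertools

class MessageRouter:
    """Toy model of the router: tracks sessions and ages out idle ones."""

    def __init__(self, max_idle_seconds=900):
        self.max_idle = max_idle_seconds   # assumed aging interval
        self.last_seen = {}                # session id -> time of last request
        # Sequential ids keep the example readable; a production service
        # would need identifiers that cannot be guessed.
        self._ids = itertools.count(1)

    def expire(self, now):
        """Terminate sessions that have been idle longer than the interval."""
        stale = [sid for sid, seen in self.last_seen.items()
                 if now - seen > self.max_idle]
        for sid in stale:
            del self.last_seen[sid]        # the gateway process would be stopped here
        return stale

    def route(self, session_id, now):
        """Return (session_id, is_new) for a request arriving at time `now`."""
        self.expire(now)
        if session_id in self.last_seen:
            self.last_seen[session_id] = now   # subsequent request: reuse session
            return session_id, False
        new_id = f"s{next(self._ids)}"         # initial request: start a gateway
        self.last_seen[new_id] = now
        return new_id, True
```

A request carrying an unknown or expired session ID is treated as an initial request, which in a fee-based service is the point at which the user would be asked to authenticate again.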

Interactive Transactions in a Stateless Environment

Database requests are initiated by the HTTP client (mediated via HTML forms). These forms are provided by the HTTP message router, which has the ability to satisfy simple HTTP GET requests. Upon receiving the completed form, the message router either initiates a Z39.50 gateway session, or routes the query through an existing Z39.50 session. The Z39.50 server then executes the search of the database, identifies the result sets necessary to satisfy the request, and creates a dynamically generated HTML document which it sends to the Web client via the message router.

The HTML document returned in response to a search request will either indicate that no documents could be found to satisfy the request, or contain a short list of titles (and associated URLs that can be used to retrieve the complete documents). If a large number of documents satisfy the request, then the HTML result-set document will also include URLs that can be returned for another list of titles. This is done to avoid delivering unmanageably large result-sets to the client.
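The three cases described above (no hits, a short list, and a short list with a link to further titles) can be sketched in Python. The page size of ten, the URL paths, and the parameter names are assumptions for illustration and do not reflect the URL scheme actually implemented.

```python
from html import escape

def result_page(hits, session_id, start=0, page_size=10):
    """Build the HTML for one chunk of a result set.

    `hits` is a list of (document id, title) pairs for the whole result set.
    """
    if not hits:
        return "<p>No documents matched the request.</p>"
    lines = ["<ul>"]
    for doc_id, title in hits[start:start + page_size]:
        url = f"/doc?session={session_id}&id={doc_id}"
        lines.append(f'<li><a href="{escape(url)}">{escape(title)}</a>')
    lines.append("</ul>")
    next_start = start + page_size
    if next_start < len(hits):
        # More titles remain: offer a session-specific URL for the next chunk.
        url = f"/results?session={session_id}&start={next_start}"
        lines.append(f'<a href="{escape(url)}">More titles</a>')
    return "\n".join(lines)
```

Because the "More titles" URL refers to a result set held by the session, it is meaningful only while that session exists.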

The Data Store

Note that in this initial architecture, the HTML store is maintained separately from the database. This is done currently as a production expedient. As the production process is stabilized, the HTML store will be subsumed into the database, thereby simplifying administration of the data store. This problem has received only modest attention in these early days of Web sites, but will loom ever larger as significant stores of documents are brought into the Web.

Another interesting possibility is not to store the HTML version of a document at all, but rather to translate the SGML to HTML on-the-fly. This option was considered, but rejected, in favor of a production process that would support more deliberate examination of the translation process output. One advantage of the on-the-fly approach would be to eliminate future translations of legacy documents subsequent to changes in the HTML standards. The Electronic Journals Online databases do, in fact, include SGML, and it is conceivable that at some point this option will be revisited.

Session-based URLs

Marrying the session-oriented world of Z39.50 with the stateless model of the Web gives rise to interesting issues related to the form and function of URLs. The precise form of Z39.50 URLs (or even whether it is necessary to have a canonical form) continues to be a subject of discussion in standardization circles. OCLC has implemented a URL-Z39.50 scheme which serves the intended purpose, though it may change to reflect experience in a production environment. Several points concerning these URLs merit attention.

Z39.50 URLs will not be persistent; the maintenance of session information is achieved by embedding a session ID in the URL, which is aged by the Z39.50 session gateway. A URL corresponding to a particular document in one session would not be valid for use at a later time without re-authentication of the user.
Navigational URLs will be generated dynamically to manage the session interactions, and will have no meaning outside the context of the particular session in which they are generated. For example, a URL defining a result-set or request for additional records in a result-set would have no meaning in isolation.
URLs containing queries, though also session-specific, might well be retained by a client for reuse (assuming reauthentication of the user). Thus, a standard profile query might be issued periodically by a user as part of a current awareness program.
Result-set URLs (again, non-persistent in nature) would include a URL for a complete result-set, as well as dynamically generated sub-list URLs to manage the presentation of small chunks of very large retrievals.

Conclusion

The architecture presented here provides a means of delivering complex scholarly publications using current (and somewhat limited) WWW protocols. The solutions are workable methods that will serve the needs of many information providers until the capabilities of this environment evolve. In exchange for the necessary compromises, publishers gain access to the fastest growing segment of the Internet (the World Wide Web) and the growing array of client software across all major hardware platforms.

Figures

[1] An example of 12083 SGML to HTML conversion as shown in a Web browser. Inline and display equations are examples of the incorporation of inline images in the running text.

[2] An architecture for an HTTP-Z39.50 Gateway

References

[1] Electronic Manuscript Preparation and Markup. International Organization for Standardization. NISO/ANSI/ISO 12083, 1994.

[2] Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). International Organization for Standardization. Ref. No. ISO 8879:1986, 1986.

[3] John K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley Publishing Company, 1994.

[4] Keith Shafer. SGML Grammar Structure. In Annual Review of OCLC Research, 1993.

[5] Tim Berners-Lee. Hypertext Markup Language (HTML), March 1993.

Biographies

Stuart Weibel
Senior Research Scientist

Stuart Weibel has worked in the Office of Research at OCLC since 1985. During this time he has managed projects in the areas of automated cataloging, document capture and structure analysis, and electronic publishing, including the Chemistry Online Retrieval Experiment (CORE Project), a collaborative experiment to republish American Chemical Society journals electronically. He currently coordinates networked information services research projects in the Office of Research, including applications of World Wide Web technology and Internet protocol standardization efforts such as HTML and Uniform Resource Identifiers.

Eric Miller
Research Associate II

Eric Miller is the lead implementor for the X-windows SCEPTER interface, OCLC's experimental SGML browser and viewer, developed for the CORE project. He is a doctoral candidate in the Ohio State University Department of Geography. His research interests include the design and development of geographical information systems and volume visualization as well as electronic publishing and structured markup of text. He is primary implementor of the SGML-HTML translation process described in this paper.

Jean Godby
Associate Research Scientist

Jean Godby has worked at OCLC since 1988, and in the Office of Research since 1990. She has provided technical leadership and systems support in database design and implementation of a 500,000 page SGML database for the Chemistry Online Retrieval Experiment (CORE project). In addition, she manages a project for lexical analysis of text databases aimed at improving access to full-text documents. She is responsible for the image server that supports the translation of SGML to HTML in this paper.

Ralph LeVan
Senior Consulting Analyst

Ralph LeVan has been at OCLC since 1987, during which time he developed the Newton database software and Z39.50 server that support most OCLC database products. He is a member of the Z39.50 Implementors Group and Chairman of the OIW Special Interest Group for Library Automation. He implemented the HTTP-Z39.50 gateway described in this paper.