Project Manager: Stuart L. Weibel, Consulting Research Scientist
Abstract
The explosive growth of the World Wide Web (WWW) is due in part to the ease with which information can be made available to Web users. The simplicity of HTML and HTTP servers lowers the barriers to network publishing.
The high-quality rendering of HTML in WWW browsers such as Mosaic raises the aesthetic appeal of information and makes it more useful by virtue of enhanced readability. But the simplicity that makes WWW technology so appealing also makes it difficult to represent the complex markup and typography necessary for scholarly publishing. The need for extensive character sets and more effective interface facilities for inter- and intra-document navigation stretches the limits of the current standards that underlie the Web and its clients. In addition, the stateless nature of WWW client-server interactions complicates the implementation of the search and retrieval functionality on which document retrieval systems depend.
OCLC distributes several scholarly journals under its Electronic Journals Online service, acting, in effect, as an electronic printer for scholarly publishers. As part of this effort, OCLC has prototyped a WWW-accessible version of these journals.
This report describes the problems encountered, details some of the short-term solutions, and highlights changes to existing standards that will enhance the use of the Web for scholarly electronic publishing. [Note: a version of this paper will appear in the forthcoming proceedings of the 1994 Chicago World Wide Web Conference, to be published as a special issue of Elsevier Science's Computer Networks and ISDN Systems.]
Publishers are increasingly turning to Standard Generalized Markup Language (SGML) as the lingua franca for electronic representation of their products. SGML allows the representation of the logical structure of a document and is sufficiently flexible to support arbitrary rendering models, either paper-based or electronic.
HTML (HyperText Markup Language), a simple application of SGML-like markup, is the standard method for expressing document structure in the WWW (Berners-Lee, 1993). Its simplicity has contributed to its popularity and made Web publishing more accessible, but that same simplicity makes it difficult to express the full richness of conventionally published scholarly documents.
Three major problems loom large for the publication of technical scholarly information on the Web: the display of equations, the display of tables, and the limited character set.
The first two are difficult, but are currently the subject of active development, and standardization of the resulting solutions is part of the agenda of the Internet Engineering Task Force (IETF) working group on HTML. An acceptable work-around is conversion of equations and tables to graphics.
The character-set problem, however, will require a consensus among
the networking community and implementation on all client
platforms. No straightforward solution can be expected in the
near term. A sufficient work-around is to substitute inline images for characters that HTML cannot represent.
In spite of these problems, there is great incentive to offer
scholarly electronic publishing via the Web. Mosaic has set a new
standard by providing platform-independent access to networked
information through an aesthetically pleasing, intuitive interface.
In recognition of this, the American Institute of Physics and OCLC
introduced WWW access to the Applied Physics Letters Online journal
in January 1995 (fig. 1). To accomplish this goal, it was necessary
to develop a process for translating from a rich SGML markup to the
relatively less rich HTML representation without compromising
accuracy.
The architecture described here involves two areas of development: the translation of SGML documents to HTML, and a gateway between HTTP and Z39.50.
A translation facility was developed to automatically
convert SGML documents to an HTML representation. Several functions
are integrated under the control of a Tcl
(Tool Command Language) shell (Ousterhout,
1994). Tcl is a scripting language providing generic,
embeddable programming facilities that can easily be incorporated
into applications. The translator described here integrates the
powerful scripting facilities of Tcl with the tools developed
in OCLC's SGML Document Grammar Builder Project (Shafer,
1994); namely, grammar extraction, decomposition, analysis, and a
generic translation language.
Fig. 1 Sample Page from Applied Physics Letters Online
Mapping of SGML to HTML requires a context-sensitive
approach: the same tag can mean different things, depending on its
context. For example, the <title> tag may be applied both to the
title of an article and to a series of titles that occur in
reference citations. Thus, it is not sufficient to map all
instances of <title> to, for example, the HTML major heading tag
<H1>. To do so would give equal display emphasis to the titles
listed in the bibliography of an article and to the title of the
article itself.
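The idea of context-sensitive mapping can be sketched in a few lines of Python. This is an illustration only, not the translation language used in the project: the element names (front, citation), the rule table, and the functions map_tag and render are all assumptions made for the example.

```python
import html

# Hypothetical rule table: (enclosing element, tag) -> HTML element.
# The element names are illustrative and are not taken from the 12083 DTD.
RULES = {
    ("front", "title"): "h1",       # title of the article itself
    ("citation", "title"): "cite",  # title inside a reference citation
}
# Fallback used when no enclosing element has a rule for the tag.
DEFAULTS = {"title": "b"}


def map_tag(tag, ancestors):
    """Return the HTML element for `tag`, given its ancestors (outermost first)."""
    # Walk from the nearest ancestor outward and take the first matching rule.
    for parent in reversed(ancestors):
        element = RULES.get((parent, tag))
        if element is not None:
            return element
    return DEFAULTS.get(tag, "p")


def render(tag, ancestors, text):
    """Wrap `text` in the HTML element chosen for `tag` in this context."""
    element = map_tag(tag, ancestors)
    return f"<{element}>{html.escape(text)}</{element}>"
```

With these assumed rules, a title under front becomes a major heading, while a title under citation receives only citation emphasis, so the bibliography no longer competes with the article title.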
The source documents in this effort are provided by
the American Institute of Physics in 12083 SGML. The 12083
proposed standard is a recommended Document Type Definition (DTD)
for markup of scholarly information, and includes
specifications for marking up mathematical notation (Electronic
Manuscript Preparation and Markup, 1994).
The translator reads each 12083 SGML document and decomposes it
into a grammar tree. Each SGML character entity is translated into
either its HTML-specific counterpart or a Uniform Resource
Locator (URL) pointing to a Graphics Interchange Format (GIF) image of the
appropriate TrueType character. Each formula (i.e., an equation or other
mathematical notation) is extracted and translated from 12083 SGML
to TeX to generate a corresponding image. A URL for this image
is substituted in the appropriate location in the text, causing
Web browsers to display the image when the document is loaded.
Figure 2 shows incorporation of inline images, as represented
by both inline and display equations, in the running text of a Web
browser.
Fig. 2 Inline Images in HTML Text Viewed by a Web Browser
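The entity substitution described above might look roughly like the following Python sketch. The entity table, the base URL for glyph images, and the function name translate_entities are assumptions chosen for illustration; the actual translator and its file layout are not described in this report.

```python
import re

# Hypothetical table of entities that HTML can express directly.
HTML_ENTITIES = {"amp": "&amp;", "lt": "&lt;", "gt": "&gt;", "eacute": "&eacute;"}

# Hypothetical location of the pre-rendered glyph images.
GLYPH_BASE_URL = "http://example.org/glyphs/"

# Matches SGML entity references such as &alpha; or &eacute;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def translate_entities(text):
    """Replace each SGML entity with its HTML counterpart or an inline image."""

    def replace(match):
        name = match.group(1)
        if name in HTML_ENTITIES:
            return HTML_ENTITIES[name]
        # No HTML counterpart: point at an image of the rendered character.
        return f'<img src="{GLYPH_BASE_URL}{name}.gif" alt="{name}">'

    return ENTITY_RE.sub(replace, text)
```

The alt attribute carries the entity name so that a client unable to load images still shows something meaningful.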
Figures and tables are handled similarly. In these
cases, however, the URL points to a subsampled thumbnail GIF
image hyperlinked to a corresponding full-size GIF image.
The thumbnail image is reduced to a height of 125 pixels while
maintaining the aspect ratio, thereby reducing initial loading
burdens and providing a better-proportioned page display
(full-resolution figures in electronic documents are typically of
awkward proportion when included inline in running text). The
full-size image is displayed by selecting the thumbnail image,
thereby invoking the appropriate external viewer.
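As a worked example of the thumbnail arithmetic, the sketch below computes the thumbnail dimensions for a fixed height of 125 pixels and builds the hyperlinked thumbnail markup. Only the 125-pixel height comes from the text; the function names and the URLs passed in are hypothetical.

```python
THUMBNAIL_HEIGHT = 125  # pixels, as stated in the text


def thumbnail_size(width, height, target_height=THUMBNAIL_HEIGHT):
    """Return (width, height) of a thumbnail that preserves the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Integer arithmetic with rounding to the nearest pixel.
    scaled_width = (width * target_height + height // 2) // height
    return max(1, scaled_width), target_height


def thumbnail_link(full_url, thumb_url):
    """Return HTML in which the thumbnail is a hyperlink to the full-size image."""
    return f'<a href="{full_url}"><img src="{thumb_url}"></a>'
```

For instance, a 1000 x 500 pixel figure yields a 250 x 125 pixel thumbnail, and a 640 x 480 figure yields 167 x 125.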
Forward and backward hyperlinks are established
using the tag, attribute, and value information corresponding
to each of the appropriate forward- and backward-referencing SGML
tags. These hyperlinks provide a mechanism for traversing the
individual document and allow the user to jump forward and backward
from references, figures, and tables.
The image-rendering process is based on Guidon, an
existing software product developed at OCLC as the display
client for the Electronic Journals Online (EJO) electronic
publishing system. Guidon software is provided with each
subscription to any EJO journal. It is a Microsoft Windows
application optimized to display EJO documents. Since OCLC designs
TrueType fonts for any characters that publishers request (as
part of its electronic publication service), OCLC controls the
characters needed to display EJO documents. Our rendering and display
engine has been adapted to function as a bitmap server, rendering
a single character or a complex equation as required. The
resulting glyphs are stored as transparent GIF images, linked to
the text by URLs.
Once the translation is complete, the newly created
HTML file and all associated image files are written into the HTML
store. The original SGML versions of the documents are used to build an
inverted-file database for use by Guidon. This same database is
used to search the collection and generate pointers to the HTML
version of the document. It is important to note that the original
SGML markup is retained in this database, and this markup supports
rich searching capabilities. Thus, although the delivery of
scholarly journals into the Web involves some display formatting
compromises, it need not result in loss of structured document
searching capabilities.
The database system is OCLC's Newton database and search engine,
designed for tree-structured data of arbitrary complexity. It is
currently in use for databases as large as 30 million records, and
is accessible via Z39.50 requests.
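To illustrate why retaining the markup matters for searching, the following Python sketch builds a toy inverted file keyed on both the enclosing tag and the term, so that a query can be restricted to a structural field. It is not a description of Newton; the data layout and the function names build_index and search are assumptions for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted file mapping (tag, term) to the set of document IDs.

    `documents` maps a document ID to a list of (tag, text) pairs taken
    from its markup.
    """
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for tag, text in fields:
            for term in text.lower().split():
                index[(tag, term)].add(doc_id)
    return index


def search(index, tag, term):
    """Return the sorted IDs of documents whose `tag` field contains `term`."""
    return sorted(index.get((tag, term.lower()), set()))
```

Because the tag is part of the key, a search for "smith" as an author is kept distinct from a search for "smith" in a title, a distinction that would be lost if only the HTML rendering were indexed.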
The translation architecture is not without problems,
and should be viewed as an interim solution until WWW protocols
evolve to support scholarly publishing. Some of the specific
pitfalls include:
The statelessness of the Web is one of the virtues of
the model: its simplicity has contributed to its popularity.
Servers maintain no context information and tear down a connection
immediately after providing a response. This virtue becomes a
vice for an information service that benefits from (or requires) a
session-based interaction, as do most reference databases and
document retrieval systems.
Statefulness is essential to retain session context for the user
(reusable result-sets, for example) and, in the case of fee-based
services, eliminates the need to reauthenticate a user for each
transaction. One solution lies in a hybrid HTTP-Z39.50 server, a
stateful gateway to the session-based Z39.50 world.
The Z39.50 standard provides the basis for assuring
interoperability among disparate document and database query
systems, allowing vendors to implement search and retrieval systems
that understand queries expressed in the standard manner, and
supporting a common understanding of a retrieval session.
Normally, an HTTP server terminates after responding to a client
request. Because our server is acting as a Z39.50 client on behalf of the
HTTP client, it cannot terminate; if it did, the Z39.50 session
would be terminated as well. The connection to the HTTP client, however,
has been torn down, as per the HTTP protocol, so the server must have a way of
recognizing returning clients. In our architecture, this is
accomplished by putting a session ID in all URLs produced by the
gateway and returned by the client.
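A minimal Python sketch of carrying a session ID in URLs follows. The query-string layout, the parameter name "session", and the example host are assumptions; the form of the URLs actually used by the gateway is not specified in this report.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


def new_session_id():
    """Return a random 16-character hexadecimal session identifier."""
    return secrets.token_hex(8)


def gateway_url(base, session_id, **params):
    """Build a URL that carries the session ID along with other parameters."""
    query = urlencode({"session": session_id, **params})
    return f"{base}?{query}"


def session_from_url(url):
    """Recover the session ID from a URL returned by the client, if present."""
    values = parse_qs(urlsplit(url).query).get("session")
    return values[0] if values else None
```

Every link in a generated page would be produced by gateway_url, so whichever link the user selects, the request that comes back identifies its session even though the connection itself carries no state.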
The initial implementation of our architecture used
a multiuser server that maintained Z39.50 sessions for each Web
browser. Although this approach was generally functional, any
server failure resulted in the loss of all "open" sessions,
clearly an undesirable result. The alternative is to provide some
isolation of code- and data-space by initiating individual gateway
sessions for users.
The central multiuser HTTP server still exists, but it now acts
simply as a message router between the Web browser and a Z39.50
gateway (fig. 3). The message router needs sufficient functionality
to determine if the database request is an initial request (which
initializes a Z39.50 gateway) or a subsequent request (the request
is routed to an existing Z39.50 session). This functional
simplicity results in a more dependable piece of code, important
for sustaining production-quality service. A separate Z39.50
gateway is started for each session, and is maintained until the
session is terminated.
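The routing decision can be sketched as follows in Python. This is a simplified, single-process illustration: in the architecture described, each gateway is a separate process so that code and data space are isolated, whereas here the Gateway class is a hypothetical in-memory stand-in. The class and method names are assumptions.

```python
class Gateway:
    """Stand-in for one per-session Z39.50 gateway (hypothetical)."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.history = []  # queries seen so far in this session

    def handle(self, query):
        self.history.append(query)
        return f"session {self.session_id}: query {len(self.history)}: {query}"


class MessageRouter:
    """Routes each request to a new or an existing gateway session."""

    def __init__(self):
        self._gateways = {}
        self._next_id = 1

    def route(self, query, session_id=None):
        gateway = self._gateways.get(session_id)
        if gateway is None:
            # Initial request (or unknown session): start a new gateway.
            session_id = str(self._next_id)
            self._next_id += 1
            gateway = Gateway(session_id)
            self._gateways[session_id] = gateway
        # Subsequent request: reuse the gateway that holds the session state.
        return session_id, gateway.handle(query)

    def close(self, session_id):
        """Discard the gateway when its session is terminated."""
        self._gateways.pop(session_id, None)
```

The router itself keeps only a table of live sessions, which is consistent with the point made above: the less the shared component does, the less there is to fail.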
Fig. 3 Architecture for an HTTP-Z39.50 Gateway
Database search requests are initiated by the HTTP
client using HTML forms. These forms are provided by the HTTP
message router, which can satisfy simple HTTP GET requests. Upon
receiving the completed form, the message router either initiates a Z39.50
gateway session or routes the query through an existing Z39.50
session. The Z39.50 server then searches the database, identifies
result sets, and creates a dynamically generated HTML document,
which it sends to the Web client via the message router.
The HTML document returned in response to a search request either
indicates that no documents satisfy the request or contains a short
list of titles and associated URLs that can be used to retrieve the
complete documents. If a large number of documents satisfy the
request, then the HTML result-set document also includes URLs that
the client can return to request a further list of titles. This is done to avoid
delivering unmanageably large result-sets to the client.
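The shape of such a result-set document can be sketched in Python as below. The page size of 10, the query parameter names, and the function names are assumptions for illustration; the report does not state how many titles the gateway returns per page or how its URLs are formed.

```python
import html
from urllib.parse import urlencode

PAGE_SIZE = 10  # assumed number of titles per page


def link(base, label, **params):
    """Return an HTML anchor whose URL carries the given query parameters."""
    url = f"{base}?{urlencode(params)}"
    return f'<a href="{html.escape(url)}">{html.escape(label)}</a>'


def result_page(base, session_id, records, start=0, page_size=PAGE_SIZE):
    """Render one page of a result set; `records` is a list of (title, id)."""
    if not records:
        return "<p>No documents satisfy the request.</p>"
    lines = ["<ul>"]
    for title, record_id in records[start:start + page_size]:
        anchor = link(base, title, session=session_id, record=record_id)
        lines.append(f"<li>{anchor}</li>")
    lines.append("</ul>")
    next_start = start + page_size
    if next_start < len(records):
        # More titles remain: include a URL the client can return for them.
        more = link(base, "More titles", session=session_id, start=next_start)
        lines.append(f"<p>{more}</p>")
    return "\n".join(lines)
```

Because the result set stays on the server side of the gateway, the follow-up URL needs to carry only the session ID and an offset, not the query itself.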
Marrying the session-oriented world of Z39.50 with
the stateless model of the Web gives rise to interesting issues
related to the form and function of URLs. The precise form of
Z39.50 URLs remains under discussion in standardization
circles. OCLC has implemented a URL-Z39.50 scheme which serves the
intended purpose, though it may change to reflect experience in a
production environment. Several points concerning these URLs merit
attention.
The architecture presented here provides a means of
delivering complex scholarly publications using current WWW
protocols. The solutions are workable methods that will serve the
needs of many information providers until the capabilities of this
environment evolve. In exchange for the necessary compromises,
publishers gain access to the fastest growing segment of the
Internet (WWW) and the growing array of client software across all
major hardware platforms.
Berners-Lee, Tim. Hypertext Markup Language (HTML), March 1993.
ftp://info.cern.ch/pub/www/doc/http-spec.ps
Electronic Manuscript Preparation and Markup.
Bethesda, Maryland: National Information Standards Organization.
NISO/ANSI/ISO 12083, 1994.
Information Processing - Text and Office
Systems - Standard Generalized Markup Language (SGML). Geneva,
Switzerland: International Organization for Standardization. ISO
8879:1986.
Ousterhout, John K. Tcl and the Tk Toolkit. Reading,
Massachusetts: Addison-Wesley Publishing Company, 1994.
Shafer, Keith. 1994. "SGML Grammar Structure." Annual
Review of OCLC Research July 1992-June 1993, 39-40. Dublin, Ohio:
OCLC Online Computer Library Center, Inc.
Project Staff: Jean Godby, Associate Research
Scientist; Eric Miller, Research Assistant; Ralph LeVan, Senior
Consulting Systems Analyst; Vincent M. Tkac, Programmer Analyst
Translation Process
Context-Sensitive Mapping
Source Documents: 12083 SGML
Figures and Tables
Hyperlinks
Rendering of Glyphs, Figures, and Tables
Retaining SGML Structure
Pitfalls
A Stateful Union
Z39.50 and HTTP
Session Reliability
Interactive Transactions in a Stateless Environment
Session-based URLs
Conclusion
Notes