An Architecture for Scholarly Publishing on the World Wide Web
Stuart Weibel, Eric Miller, Jean Godby, Ralph LeVan
Office of Research, OCLC Online Computer Library Center, Inc., Dublin, Ohio
Questions or comments: Stuart Weibel, weibel@oclc.org
Abstract
The explosive growth of the World Wide Web (WWW) is due in part to the
ease with which information can be made available to Web users. The
simplicity of HTML and HTTP servers lowers the barriers to network
publishing.
The high-quality rendering of HTML in WWW browsers such as Mosaic
raises the aesthetic appeal of information and makes it more useful by
virtue of enhanced readability. But the simplicity that makes WWW
technology so appealing also makes it difficult to represent the
complex markup and typography necessary for scholarly publishing. The
need for extensive character sets and more effective interface
facilities for inter- and intra-document navigation stretch the limits
of the current standards that underlie the Web and its clients. In
addition, the stateless nature of WWW client-server interactions
presents certain challenges to the effective implementation of search
and retrieval functionality so important to effective document
retrieval systems.
OCLC distributes several scholarly journals under its Electronic
Journals Online program, acting, in effect, as an 'electronic printer'
for scholarly publishers. As part of this effort, OCLC is prototyping a
WWW-accessible version of these journals.
This presentation will describe the problems encountered, detail some
of the short-term solutions, and highlight changes to existing
standards that will enhance the use of the Web for scholarly electronic
publishing.
Scholarly Publishing on the Web
Publishers are increasingly turning to Standard Generalized Markup
Language (SGML) [2] as the lingua franca for electronic
representation of their products. SGML allows the representation of
the logical structure of a document and is sufficiently flexible to
support arbitrary rendering models, either paper-based or electronic.
HTML (HyperText Markup Language), a simple application of SGML-like
markup, is the standard method for expressing document structure in
the WWW [5]. The simplicity of HTML has contributed to its popularity and
made Web publishing more accessible, but that same simplicity makes it
difficult to express the full richness of conventionally published
documents.
There are three major problems that loom large for the publication of
technical scholarly information on the Web:
- mathematical representation,
- a formal model for expressing tabular material, and
- the representation of non-standard character sets.
The first two are difficult, but are currently the subject of active
development, and standardization of the resulting solutions is part of
the agenda of the IETF working group on HTML. The acceptable
work-around for these problems is conversion of equations and tables to
graphics.
The character-set problem, however, will require a consensus among the
networking community and implementation on all client platforms; no
straightforward solution can be expected in the near term. The
work-around here is to generate bitmaps for unsupported characters
and insert them into the running text. This results in a mixture of small
bitmaps interspersed with normal characters (disadvantageous, as
discussed below), but allows correct representation of complex text
which may include unusual characters and mathematical equations.
In spite of these problems, there is great incentive to offer scholarly
electronic publishing via the Web. Mosaic has set a new standard by
providing platform-independent access to networked
information by way of an aesthetically pleasing, intuitive interface. In
recognition of this, the American Institute of Physics and OCLC will
introduce WWW access to the Applied Physics Letters Online
journal in January of 1995. To accomplish this goal, it has been
necessary to develop a translation process from a rich SGML markup to
the relatively less-rich HTML representation without compromising
accuracy.
The architecture described here involves two areas of development:
- translation of a rich, descriptive markup format (SGML),
to the simpler, rendering-oriented markup of HTML, and,
- the creation of a gateway from the stateless Web environment to a
session-based document search and retrieval system.
The Translation Process
A translation facility has been developed in the OCLC Office of
Research to automatically convert SGML documents to an HTML
representation. Several functions are integrated under the control of
a Tcl (Tool Command Language) shell [3]. Tcl is a
scripting language providing generic, embeddable programming
facilities that can easily be incorporated into applications. Each
application can extend the core Tcl features with additional commands
specific to that application. The translator described here integrates
the powerful scripting facilities of Tcl with the tools developed
in OCLC's SGML Document Grammar Builder Project [4];
namely, grammar extraction, decomposition, analysis, and a generic
translation language.
Context-Sensitive Mapping
Mapping of SGML to HTML requires a context-sensitive approach: the same
tag can mean different things, depending on the context in which it is
found. For example, the <title> tag may be applied both to the title
of an article and to a series of titles that occur in reference
citations. Thus, it is not sufficient to map all instances of <title>
to, for example, the HTML major heading tag <H1>; to do so would give
the same display emphasis to the titles of documents referenced in
the bibliography of the articles as is given to the title of the
article itself.
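The idea can be illustrated with a minimal Python sketch. It is not the Tcl-based translator described in this paper; the rule table, the parent tag names (front, citation), and the chosen HTML targets are assumptions made only for illustration.

```python
# Hypothetical rule table: (parent tag, tag) -> HTML tag.
# A parent of None acts as a fallback for any context.
CONTEXT_RULES = {
    ("front", "title"): "H1",       # article title: major heading
    ("citation", "title"): "CITE",  # cited title: inline emphasis only
    (None, "title"): "B",           # fallback when the context is unknown
}


def map_tag(tag, ancestors):
    """Return the HTML tag for an SGML tag, given its ancestors (outermost first)."""
    # Walk from the nearest ancestor outward; the closest matching context wins.
    for parent in reversed(ancestors):
        rule = CONTEXT_RULES.get((parent, tag))
        if rule is not None:
            return rule
    # No contextual rule matched: use the fallback, else upper-case the tag name.
    return CONTEXT_RULES.get((None, tag), tag.upper())


print(map_tag("title", ["article", "front"]))             # article title
print(map_tag("title", ["article", "back", "citation"]))  # cited title
```

Keying the rules on the enclosing element keeps the article title and the cited titles on separate display paths without any change to the source markup.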
Source Documents: 12083 SGML
The source documents in this effort are provided by the American
Institute of Physics in "12083 SGML". The 12083 proposed standard is a
recommended DTD for markup of scholarly information, and includes
specifications for marking up mathematical notation [1].
The translator reads each 12083 SGML document and decomposes it into a
grammar tree. Each SGML entity in the document is translated into
either its HTML-specific counterpart or a URL pointing to a bitmap of
the appropriate TrueType(tm) character. Each formula (i.e., an equation
or other mathematical notation) is extracted and translated from 12083 SGML
to TeX and subsequently rendered to generate a corresponding bitmap. A
URL for the bitmap is substituted in the appropriate location in the
text, causing the bitmap object to be loaded by the Web browser when
the document is loaded. The incorporation of inline images,
as represented by both inline and display equations, in the
running text of a Web browser can be seen in Figure [1].
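The entity step can be sketched in Python as follows. This is an illustration under stated assumptions, not the translator itself: the entity table, the glyph URL prefix, and the .gif naming are all hypothetical.

```python
import re

# Hypothetical table of entities that have a direct HTML counterpart.
HTML_ENTITIES = {"amp": "&amp;", "lt": "&lt;", "eacute": "&eacute;"}
# Hypothetical location of the pre-rendered glyph bitmaps.
GLYPH_BASE_URL = "/glyphs/"

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def translate_entities(text):
    """Replace each SGML entity with an HTML entity or an inline image."""
    def replace(match):
        name = match.group(1)
        if name in HTML_ENTITIES:
            return HTML_ENTITIES[name]
        # No HTML counterpart: reference a pre-rendered bitmap instead.
        return '<IMG SRC="%s%s.gif" ALT="%s">' % (GLYPH_BASE_URL, name, name)
    return ENTITY_RE.sub(replace, text)


print(translate_entities("caf&eacute; at angle &thetas;"))
```

The ALT text in this sketch carries the entity name, so a reader without image support still sees which character was intended.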
Figures and Tables
Figures and tables are handled similarly. However, in these cases, the
URL points to a subsampled thumbnail GIF image, which is,
in turn, hyperlinked to a corresponding full-size GIF image. The
thumbnail image is reduced to a height of 125 pixels while maintaining
the aspect ratio, thereby reducing initial loading burdens and
providing a better-proportioned page display (full-resolution figures
in electronic documents are typically of awkward proportion when
included inline in running text). The full-sized image is displayed
by selecting the thumbnail image, thereby invoking the appropriate
external viewer.
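The thumbnail arithmetic is simple enough to state directly. The Python sketch below assumes only the 125-pixel height given above; the function names and the URLs passed in are illustrative.

```python
THUMBNAIL_HEIGHT = 125  # pixels, as stated in the text


def thumbnail_size(width, height, target_height=THUMBNAIL_HEIGHT):
    """Return (width, height) scaled to target_height, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale the width by the same factor as the height; never go below 1 pixel.
    scaled_width = max(1, round(width * target_height / height))
    return scaled_width, target_height


def thumbnail_link(full_url, thumb_url):
    """Build an inline thumbnail image hyperlinked to the full-size image."""
    return '<A HREF="%s"><IMG SRC="%s"></A>' % (full_url, thumb_url)


print(thumbnail_size(1000, 500))
print(thumbnail_link("/figs/fig1.gif", "/figs/fig1-thumb.gif"))
```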
Hyperlinks
Forward and backward hyperlinks are established using the tag,
attribute, and value information of each forward- and
backward-referencing SGML tag. From these tags and their reference
attributes and values, corresponding HTML anchors and hotlinks are
generated with the HTML anchor tag (<A>). These anchors provide a mechanism
for traversing the individual document and allow the user to jump forward
and backward from references, figures, and tables.
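A paired set of anchors might be generated along the following lines. This Python sketch assumes that each reference carries the identifier of its target; the "ref-" naming convention is hypothetical, and the sketch handles only one reference site per target.

```python
def forward_link(target_id, label):
    """Anchor placed at a reference site.

    It jumps forward to the target and is itself named, so that the
    target can link back to it.
    """
    return '<A NAME="ref-%s" HREF="#%s">%s</A>' % (target_id, target_id, label)


def backward_link(target_id, label):
    """Anchor placed at the target (a figure, table, or citation).

    It is named for forward jumps and links back to the reference site.
    """
    return '<A NAME="%s" HREF="#ref-%s">%s</A>' % (target_id, target_id, label)


print(forward_link("fig1", "Figure 1"))   # emitted in the running text
print(backward_link("fig1", "Figure 1"))  # emitted at the figure itself
```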
Rendering of Glyphs, Figures, and Tables
The image-rendering process is based on an existing software product developed
at OCLC as the display client for the Electronic Journals
Online (EJO) electronic publishing system. This software, named
Guidon (tm), is provided with each subscription to any of the EJO journals. It
is a Microsoft Windows (tm) application that is optimized to display
EJO documents. Since OCLC designs TrueType (tm) fonts for any
characters that publishers request (as part of our electronic
publication service), we have control of all the characters necessary to
display EJO documents correctly. Our rendering and display engine has
been adapted to function as a bitmap server, rendering a single
character or a complex equation as required. The resulting glyphs are
stored as transparent GIF images, linked to the text by URLs.
Retaining SGML Structure
Once the entire translation is complete, the newly created HTML file
and all associated image files are written into the HTML store. The
original SGML versions of the documents are used to build an
inverted-file database for use by OCLC's dedicated MS-Windows client,
Guidon. This same database is used to search the collection (as
described below) and generate pointers to the HTML version of the
document. It is important to note that the original SGML markup is
retained in this database, and this markup supports rich fielded
searching capabilities, even though much of this document structure is
lost in the translation to HTML for screen rendering. Thus, although
the delivery of scholarly journals into the Web involves some display
formatting compromises, it need not result in loss of structured document
searching capabilities.
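Fielded searching over retained structure can be illustrated with a toy inverted file in Python. It is not the Newton database; the field names stand in for SGML element names and are assumptions, as is the whitespace tokenization.

```python
from collections import defaultdict


def build_index(documents):
    """Map (field, term) -> set of document IDs.

    `documents` maps a document ID to {field name: text}. The field
    names are hypothetical stand-ins for SGML element names.
    """
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            for term in text.lower().split():
                index[(field, term)].add(doc_id)
    return index


def search(index, field, term):
    """Return the sorted IDs of documents containing `term` in `field`."""
    return sorted(index.get((field, term.lower()), set()))


docs = {
    "d1": {"title": "Laser cooling", "author": "Smith"},
    "d2": {"title": "Smith charts", "author": "Jones"},
}
index = build_index(docs)
print(search(index, "author", "smith"))  # matches the author field only
print(search(index, "title", "smith"))   # matches the title field only
```

Because each posting is keyed by field as well as term, a query for an author named Smith does not retrieve a document that merely has "Smith" in its title, a distinction that the HTML rendering alone could not support.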
The database system employed in this project is OCLC's
Newton database, a database and search engine designed for tree-structured
data of arbitrary complexity. It is currently in use for databases as
large as 30,000,000 records, and is accessible via Z39.50 requests.
Pitfalls
The translation architecture described here is not without problems,
and should be viewed as an interim solution until WWW protocols evolve
sufficiently to support the needs of scholarly publishing. Some of the
specific pitfalls include:
- Searching within documents: The searching facilities currently
available within browsers will not work properly with embedded bitmap
characters. For example, an author's name containing a non-standard
character can be represented using an embedded bitmapped glyph, but
searching and highlighting functions will not work properly for
such terms.
- Resizing of fonts: Since bitmapped glyphs are of fixed size, a user who
resizes the display font will find that the glyphs do not change
in size proportionately. This is acceptable for small changes in font size,
but the use of large fonts, important for those with visual impairments,
will result in unappealing page displays.
- Anomalous behavior among browsers: HTML is only now undergoing a
rigorous standardization process, and minor anomalies in browser
behavior sometimes result in document rendering that is unaesthetic
at best. Publishers will, for the near term, need to advise their
subscribers on the suitability of specific Web browsers for use with
their journals.
A Stateful Union
The statelessness of the Web is one of the virtues of the model, an
expression of the simplicity that has contributed to its popularity.
Servers maintain no context information and tear down a connection
immediately after providing a response. This virtue becomes a vice for
an information service that benefits from (or requires) a session-based
interaction, as is generally the case with reference databases and
document retrieval systems.
Statefulness is essential to retain session context for the user
(reusable result-sets, for example) and, in the case of fee-based
services, eliminates the need to re-authenticate a user for each
transaction. One solution lies in a hybrid HTTP-Z39.50 server, a
stateless gateway to the session-based Z39.50 world.
Z39.50 and HTTP
The Z39.50 standard provides the basis for assuring interoperability
among disparate document and database query systems, allowing vendors
to implement search and retrieval systems that understand queries
expressed in the standard manner, and supporting a common understanding
of a retrieval session.
Normally, an HTTP server terminates after responding to a client
request. Because our server is acting as a Z39.50 client on behalf of the HTTP
client, it cannot terminate; if it did, the Z39.50 session would be
terminated as well. Since the connection to the client is torn down after
each response, as per the HTTP protocol, the server must have a way of
recognizing returning clients. In our architecture, this is accomplished by putting a
session ID in all URLs produced by the gateway and returned by the client.
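The following Python sketch shows one way a session ID could be carried in every URL and recovered when the client returns. The /gateway path and the parameter names are assumptions, not the URL scheme used by OCLC.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GATEWAY_PATH = "/gateway"  # hypothetical path


def make_url(session_id, action, **params):
    """Build a gateway URL that carries the session ID with every request."""
    query = dict(params, session=session_id, action=action)
    # Sort the parameters so that the same request always yields the same URL.
    return GATEWAY_PATH + "?" + urlencode(sorted(query.items()))


def session_from_url(url):
    """Recover the session ID from a returning client's URL, or None."""
    values = parse_qs(urlsplit(url).query).get("session")
    return values[0] if values else None


url = make_url("s42", "search", term="laser")
print(url)
print(session_from_url(url))
```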
Session Reliability
An initial implementation of our architecture resulted in a multi-user
server that maintained Z39.50 sessions for each Web browser. Although
functional, any server failure resulted in the loss of all "open"
sessions, clearly an undesirable result. The alternative is to provide
some isolation of code and data space among the users by initiating
individual gateway sessions for users.
The central multi-user HTTP server still exists, but it now acts simply
as a message router between the Web browser and a Z39.50 gateway
(Figure [2]).
The message router needs sufficient functionality to determine if the
database request is an initial request (which initializes a Z39.50
gateway) or a subsequent request (the request is routed to an existing
Z39.50 session). This functional simplicity results in a more
dependable piece of code, important for sustaining production-quality
service. A separate Z39.50 gateway is started for each session, and is
maintained until the session is terminated (sessions are "aged" and
terminated when a specified interval without further requests has passed).
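The routing and aging logic can be sketched in Python as follows. The sketch makes several simplifying assumptions: gateways are in-process objects rather than separately started programs, session IDs are sequential rather than unguessable, and the caller supplies the current time.

```python
import itertools


class MessageRouter:
    """Routes requests to per-session gateways and ages out idle sessions."""

    def __init__(self, start_gateway, max_idle_seconds):
        self._start_gateway = start_gateway  # callable returning a new gateway
        self._max_idle = max_idle_seconds
        self._sessions = {}                  # session ID -> (gateway, last used)
        self._ids = itertools.count(1)

    def route(self, session_id, now):
        """Return (session_id, gateway), starting a gateway for new sessions."""
        self.expire(now)
        entry = self._sessions.get(session_id)
        if entry is None:
            # Initial request, or a session that has already been aged out.
            session_id = "s%d" % next(self._ids)
            gateway = self._start_gateway()
        else:
            gateway = entry[0]
        # Record the time of this request so that the session stays alive.
        self._sessions[session_id] = (gateway, now)
        return session_id, gateway

    def expire(self, now):
        """Drop sessions idle for longer than the configured interval."""
        stale = [sid for sid, (_, last) in self._sessions.items()
                 if now - last > self._max_idle]
        for sid in stale:
            del self._sessions[sid]


router = MessageRouter(start_gateway=object, max_idle_seconds=60)
sid, gateway = router.route(None, now=0)   # initial request
print(router.route(sid, now=30)[0] == sid)   # subsequent request, same session
print(router.route(sid, now=200)[0] == sid)  # aged out, so a new session starts
```

The router itself holds nothing but the session table, which reflects the functional simplicity described above.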
Interactive Transactions in a Stateless Environment
Database requests are initiated by the HTTP client (mediated via HTML
forms). These forms are provided by the HTTP message router, which has
the ability to satisfy simple HTTP GET requests. Upon receiving a
submitted form, the message router either initiates a Z39.50 gateway session or
routes the query through an existing Z39.50 session. The Z39.50
server then executes the search of the database, identifies the
result-sets necessary to satisfy the request, and creates a dynamically
generated HTML document, which it sends to the Web client via the message
router.
The HTML document returned in response to a search request will either
indicate that no documents could be found to satisfy the request, or
contain a short list of titles (and associated URLs that can be used to
retrieve the complete documents). If a large number of documents
satisfy the request, then the HTML result-set document will also
include URLs that can be returned for another list of titles. This is
done to avoid delivering unmanageably large result-sets to the client.
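A result-set document of this kind might be generated as in the Python sketch below. The list length of ten, the URL layout, and the link text are assumptions; the text does not specify them.

```python
import html

PAGE_SIZE = 10  # hypothetical; the text does not give the list length


def result_page(titles, start, session_id, page_size=PAGE_SIZE):
    """Render one chunk of a result-set as a simple HTML list."""
    if not titles:
        return "<P>No documents satisfied the request.</P>"
    chunk = titles[start:start + page_size]
    lines = ["<UL>"]
    for offset, title in enumerate(chunk):
        # Hypothetical URL layout: session ID plus the record's position.
        lines.append('<LI><A HREF="/gateway?session=%s&amp;record=%d">%s</A>'
                     % (session_id, start + offset, html.escape(title)))
    lines.append("</UL>")
    next_start = start + page_size
    if next_start < len(titles):
        # More records remain: offer a URL for the next list of titles.
        lines.append('<A HREF="/gateway?session=%s&amp;start=%d">More titles</A>'
                     % (session_id, next_start))
    return "\n".join(lines)


titles = ["Title %d" % i for i in range(25)]
print(result_page(titles, 20, "s1"))
```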
The Data Store
Note that in this initial architecture, the HTML store is maintained
separately from the database. This is done currently as
a production expedient. As the production process is stabilized, the
HTML store will be subsumed into the database, thereby simplifying
administration of the data store. This problem has received only
modest attention in these early days of Web sites, but will loom ever
larger as significant stores of documents are brought into the Web.
Another interesting possibility is not to store the HTML version of a
document at all, but rather to translate the SGML to HTML on-the-fly.
This option was considered, but rejected, in favor of a production
process that would support more deliberate examination of the
translation process output. One advantage of the on-the-fly approach
would be to eliminate future translations of legacy documents
subsequent to changes in the HTML standards. The Electronic Journals
Online databases do, in fact, include SGML, and it is conceivable that
at some point this option will be revisited.
Session-based URLs
Marrying the session-oriented world of Z39.50 with the stateless model
of the Web gives rise to interesting issues related to the form and
function of URLs. The precise form of Z39.50 URLs (or even whether it
is necessary to have a canonical form) continues to be a subject of
discussion in standardization circles. OCLC has implemented a
URL-Z39.50 scheme which serves the intended purpose, though it may
change to reflect experience in a production environment. Several
points concerning these URLs merit attention.
- Z39.50 URLs will not be persistent; the maintenance of session
information is achieved by embedding a session ID in the URL, which
is aged by the Z39.50 session gateway. A URL corresponding to a
particular document in one session would not be valid for use at a
later time without re-authentication of the user.
- Navigational URLs will be generated dynamically to manage the session
interactions, and will have no meaning outside the context of the
particular session in which they are generated. For example, a URL
defining a result-set or request for additional records in a result-set
would have no meaning in isolation.
- URLs containing queries, though also session-specific, might well be
retained by a client for reuse (assuming reauthentication of the
user). Thus, a standard profile query might be issued periodically by
a user as part of a current awareness program.
- Result-set URLs (again, non-persistent in nature) would include a URL
for a complete result-set, as well as dynamically generated sub-list
URLs to manage the presentation of small chunks of very large
retrievals.
Conclusion
The architecture presented here provides a means
of delivering complex scholarly publications using current (and
somewhat limited) WWW protocols. The solutions are workable methods
that will serve the needs of many information providers until the
capabilities of this environment evolve. In exchange for the necessary
compromises, publishers gain access to the fastest growing segment of
the Internet (the World Wide Web) and the growing array of client
software across all major hardware platforms.
Figures
[1] An example of 12083 SGML to HTML conversion as shown in a Web browser.
Inline and display equations are examples of the incorporation
of inline images in the running text.
[2] An architecture for an HTTP-Z39.50 Gateway
References
[1] Electronic Manuscript Preparation and Markup. International Organization for Standardization.
NISO/ANSI/ISO 12083, 1994.
[2] Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML).
International Organization for Standardization. Ref. No. ISO 8879:1986, 1986.
[3] John K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley Publishing Company, 1994.
[4] Keith Shafer. SGML Grammar Structure. In Annual Review of OCLC Research, 1993.
[5] Tim Berners-Lee. Hypertext Markup Language (HTML), March 1993.
Biographies
Stuart Weibel
Senior Research Scientist
Stuart Weibel has worked in the Office of Research at OCLC since 1985.
During this time he has managed projects in the areas of automated
cataloging, document capture and structure analysis, and electronic
publishing, including the Chemistry Online Retrieval Experiment (CORE
Project), a collaborative experiment to republish American Chemical
Society journals electronically. He currently coordinates networked
information services research projects in the Office of Research,
including applications of World Wide Web technology and Internet
protocol standardization efforts, including HTML and Uniform Resource
Identifiers.
Eric Miller
Research Associate II
Eric Miller is the lead implementor for the X-windows SCEPTER
interface, OCLC's experimental SGML browser and viewer, developed for
the CORE project. He is a doctoral candidate in the Ohio State
University Department of Geography. His research interests include the
design and development of geographical information systems and volume
visualization as well as electronic publishing and structured markup of
text. He is the primary implementor of the SGML-HTML translation process
described in this paper.
Jean Godby
Associate Research Scientist
Jean Godby has worked at OCLC since 1988, and in the Office of Research
since 1990. She has provided technical leadership and systems support
in database design and implementation of a 500,000 page SGML database
for the Chemistry Online Retrieval Experiment (CORE project). In
addition, she manages a project for lexical analysis of text databases
aimed at improving access to full-text documents. She is responsible
for the image server that supports the translation of SGML to HTML in
this paper.
Ralph LeVan
Senior Consulting Analyst
Ralph LeVan has been at OCLC since 1987, during which time he developed
the Newton database software and Z39.50 server that support most OCLC
database products. He is a member of the Z39.50 Implementors Group and
Chairman of the OIW Special Interest Group for Library Automation. He
implemented the HTTP-Z39.50 gateway described in this paper.