The future of HTML, from the perspective of Elsevier Science

[This local archive copy mirrored from the canonical site: http://www.w3.org/MarkUp/future/papers/rahtz-im-72115.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

Sebastian Rahtz and Herbert van Zijl

April 9, 1998

1 Background

Elsevier Science is one of the main scientific publishers in the world. We publish primary research journals in the major fields, including medicine, chemistry, social sciences, mathematics, and physics, producing over 30,000 papers a year. Since the early 1990s, the company has been committed to long-term storage of all its core material in SGML. While almost all Elsevier journals are currently printed in the traditional way, they are also stored in two formats, SGML and PDF, preserving on the one hand the detailed structure of the source, and on the other the exact typography used for the printed page. The SGML conforms to a DTD which has been developed over the last 6-7 years to meet our needs in areas such as math and bibliographies.

2 Document storage in HTML?

Clearly, it would be quite impossible for us, in general, to use HTML as our primary storage medium, since this would not let us perform essential tasks like:

validation of document data structures
imposition of editorial control
generation of navigational aids such as tables of contents directly from the document itself
generation of cross-document links (or even intra-document links, such as the bibliographical citations which are very important in this field) in more than an ad hoc and manual manner
addressing or management of objects smaller or larger than a single document
efficient re-use of document components
search within semantically significant components of a document

The last point highlights a truism about the HTML specification: it is semantically impoverished and presentation-oriented. Its semantic poverty is trivially demonstrated by the fact that it lacks any way of distinguishing (for example) personal names, which are essential for indexing in our material. Worse, it allows for an increasingly complex range of ways of specifying how a range of text should be rendered, rather than any way of specifying what kind of object the text is.

In some ways, the definition and implementation of Cascading Style Sheets provides a way around some of the HTML problems, since it allows semantics to be derived in a roundabout way from the presentation style markup. Thus
<H2 CLASS="sectionHead">Results</H2>

is more useful than just
<H2>Results</H2>

but this poor man's architectural mapping is hardly flexible enough. Applications like scientific publishing require features like re-ordering the components of a document, and selecting subsets of it.
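To make the limits of this approach concrete, the following is a minimal sketch, in Python, of how an indexing program might recover a heading's role from its CLASS attribute. The helper names (ClassedHeadingCollector, collect_headings) are hypothetical and chosen only for illustration; the sketch relies on the standard html.parser module and assumes headings are marked up as H2 elements, as in the example above.

```python
from html.parser import HTMLParser


class ClassedHeadingCollector(HTMLParser):
    """Collect (class, text) pairs for every H2 element in a document."""

    def __init__(self):
        super().__init__()
        self.headings = []   # list of (class value or None, heading text)
        self._in_h2 = False  # True while inside an open H2
        self._class = None   # CLASS value of the H2 currently open
        self._text = []      # text fragments of the H2 currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "h2":
            self._in_h2 = True
            self._class = dict(attrs).get("class")
            self._text = []

    def handle_data(self, data):
        if self._in_h2:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append((self._class, "".join(self._text).strip()))
            self._in_h2 = False


def collect_headings(html):
    """Return the (class, text) pairs for all H2 elements in html."""
    parser = ClassedHeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


if __name__ == "__main__":
    sample = '<H2 CLASS="sectionHead">Results</H2>\n<H2>Results</H2>'
    for css_class, text in collect_headings(sample):
        print(css_class, text)
```

The class name is the only hint available to the program: the second heading yields no class at all, and nothing in HTML itself validates that "sectionHead" is used consistently, which is why we regard this as a workaround rather than a solution.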

3 Elsevier's current usage of HTML

Since 1995, we have been experimenting with electronic journals, produced by using a mixture of HTML (generated from SGML) and PDF, and since 1997 we have offered an extremely large inter-linked database of articles via Science Direct. All the projects to date have a number of features in common:

Pre-extraction of key fields (such as author names) for external indexing and searching
Linking of papers to backup resources, such as abstract databases
Use of PDF to provide better quality printout than that which can be derived from HTML
Use of fixed size GIF images to display mathematics and special symbols.

While these systems `work', they are less than ideal in all of the above areas:

Because key fields are pre-extracted, no information is left in the target HTML file about where they came from, so no re-indexing can be done by the readers, and back-referencing is clumsy to set up
Linking to external resources is static, in the simple HTML model, and cannot easily accommodate changes in the resources
We cannot use the target HTML to provide flexible and dynamic printing, because of the lack of semantic information in the markup
The fixed rendering of math and special symbols is expensive in development and production time, and is seriously inflexible.

In general, our production processes have been costly to set up, and are not producing products that are flexible enough for the user. All the flexibility that we introduce is at the generation end of the process, which has a potentially worrying impact on our business, since we cannot offer enough to the client. For instance, allowing a client a choice in document display between full article, summary of headings, or just front matter forces us either to maintain parallel `canned' variants, or to perform on-the-fly reconversion, which is inefficient and limited to those features provided by the system authors.
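By way of contrast, the sketch below suggests how the three display choices could be derived on demand from a single semantically marked-up source. It is illustrative only: the element names (article, front, title, author, section, head, para) and the function name render are assumptions made for this example and are not taken from our DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are assumptions for illustration.
ARTICLE = """<article>
  <front><title>A study</title><author>A. Author</author></front>
  <section><head>Methods</head><para>We did things.</para></section>
  <section><head>Results</head><para>It worked.</para></section>
</article>"""


def render(xml_text, view):
    """Return the lines of one view: 'front', 'headings', or 'full'."""
    root = ET.fromstring(xml_text)
    # Front matter: one "tag: text" line per child of <front>.
    front = [f"{el.tag}: {el.text}" for el in root.find("front")]
    if view == "front":
        return front
    if view == "headings":
        # Summary of headings: every <head> in document order.
        return [head.text for head in root.iter("head")]
    if view == "full":
        body = []
        for section in root.findall("section"):
            body.append(section.findtext("head"))
            body.extend(para.text for para in section.findall("para"))
        return front + body
    raise ValueError(f"unknown view: {view}")


if __name__ == "__main__":
    for view in ("front", "headings", "full"):
        print(view, render(ARTICLE, view))
```

Each view is a selection over the same source, so no parallel variants need to be stored; the same selection cannot be made reliably once the structure has been flattened into presentational HTML.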

4 Future plans and requirements

Elsevier Science can develop its Web-based offerings in at least three ways (although the priority of these is arguable):

Switching flexibility in presentation to the client side of the process, by preserving semantic markup across the delivery, instead of pre-rendering it to HTML
Increasing sophistication and richness in linking, either inside the document database, or to external resources
More interactive documents, with embedded applications

Not surprisingly, these three directions coincide with developments in the last year related to XML:

If we switched to XML as a delivery medium, we would obviously supply style sheets to cover the common requirements, but we could also allow third parties to develop applications to render the information in different ways.
Linking is increasingly seen as adding value to basic material. The extended link and pointer mechanism proposed for XML has many applications in scientific publication, and has the potential to considerably simplify a whole swathe of application development within the company. For example, a simple feature like `back-referencing' from bibliographies currently requires pre-processing, and devious add-in scripts to hard-wire an interface; but if the functionality were a standard feature of XLink-enabled XML browsers, we could reduce our work to style sheets.
Special-purpose markup languages are immensely important in scientific publishing. The two obvious applications are mathematics and chemistry. If we capture and preserve the semantics of math formulae, or molecular models, we can produce much more interactive products.
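The back-referencing mentioned in the second point can be sketched as follows. This is a rough illustration of the pre-processing we currently have to do ourselves, not of XLink: the element and attribute names (section, cite, ref, bibitem, id) and the function name back_references are hypothetical, and sections are assumed not to be nested.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical markup: <cite ref="..."/> points at <bibitem id="...">.
DOC = """<article>
  <section id="s1"><para>As shown in <cite ref="b2"/> and <cite ref="b1"/>.</para></section>
  <section id="s2"><para>See also <cite ref="b2"/>.</para></section>
  <bibliography>
    <bibitem id="b1">First reference</bibitem>
    <bibitem id="b2">Second reference</bibitem>
    <bibitem id="b3">Uncited reference</bibitem>
  </bibliography>
</article>"""


def back_references(xml_text):
    """Map each bibliography item id to the ids of the sections citing it."""
    root = ET.fromstring(xml_text)
    cited_in = defaultdict(list)
    for section in root.iter("section"):
        section_id = section.get("id")
        for cite in section.iter("cite"):
            target = cite.get("ref")
            # Record each citing section once, in document order.
            if section_id not in cited_in[target]:
                cited_in[target].append(section_id)
    # Include uncited items with an empty list so every entry is covered.
    return {
        item.get("id"): cited_in.get(item.get("id"), [])
        for item in root.iter("bibitem")
    }


if __name__ == "__main__":
    for bib_id, sections in back_references(DOC).items():
        print(bib_id, sections)
```

With the citations preserved in the delivered markup, a table of this kind could be computed by the browser or by a third-party tool, rather than being hard-wired into each page at generation time.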

It is relatively clear in what directions Elsevier can enhance its products, and it is also clear that we can do all that is required using existing technology -- at a price. At present, it seems at least likely that adoption of XML, and its companion forthcoming linking and style sheet standards, will be a sensible move for Elsevier. It is also very likely that the specialized applications of XML for mathematics (MathML) and chemistry (CML) will soon be utilized in our publications.

5 Do we need HTML?

If, as suggested in the last section, our future lies in providing flexibility on the client side of delivery, we could become essentially independent of HTML. It then becomes a browser decision whether to convert XML markup into HTML for presentation, or render it directly.

It might be appropriate to compare HTML to PostScript. The latter language was a huge breakthrough in providing developers with a single interface language for many different rendering engines, and enabled typesetting to be a consumer product. In the same way, HTML brought multi-media authoring to the average consumer, by providing a single, accessible language. Subsequently, PostScript gave birth to Display PostScript, and then PDF, which added all the functionality needed for screen, as well as paper, rendering. HTML, for its part, has slowly acquired more and more presentation features, and a style sheet language. If we look at the same file rendered to HTML+CSS and to PDF, it is clear that the differences are quite small:

PDF files usually carry fonts with them, which makes them larger than HTML files; thus this document is about 9K in HTML, and 28K in PDF -- but PDF without the fonts is only 14K.
While PDF files contain line and page breaking information, this is a fairly small addition to document size and rendering complexity. The real difference is probably the word-level justification, hyphenation, etc., and white-space placement.
Neither format retains much worthwhile semantic structure -- any indexing is either crude crunching of every word, or uses predefined catalogue structures.
HTML's style sheet mechanism is more efficient than the fixed layout of PDF, but only allows simplistic adjustment by the end-user.

We suggest that, far from eschewing presentation or being ashamed of it, HTML might evolve further in the direction of PDF. Just as the designers of PostScript and PDF sought to provide access to all the features needed by typographers, HTML can define the low-level functionality of screen documents, freeing browser writers to concentrate on the user interface instead of the complexities of rendering XML directly.

HTML could become an invisible language in the next millennium, written only by other, very specialized, software. However, if we make HTML more presentational, so that it can serve as the output language for XSL, it will always carry its legacy around. Originally, HTML was not meant to be a language to specify presentation, and it does not seem like an ideal basis for such a language. A fresh start on what could be dubbed XPL (eXtensible Presentation Language) might be better made starting from something like PostScript, Lout or troff.