[This local archive copy mirrored from the canonical site: http://www.w3.org/MarkUp/future/papers/rahtz-im-72115.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
Elsevier Science is one of the main scientific publishers in the world. We publish primary research journals in the major fields, including medicine, chemistry, social sciences, mathematics, and physics, producing over 30,000 papers a year. Since the early 1990s, the company has been committed to long-term storage of all its core material in SGML. While almost all Elsevier journals are currently printed in the traditional way, they are also stored in two formats, SGML and PDF, preserving on the one hand the detailed structure of the source, and on the other the exact typography used for the printed page. The SGML conforms to a DTD which has been developed over the last 6-7 years to meet our needs in areas such as math and bibliographies.
Clearly, it would be quite impossible for us, in general, to use HTML as our primary storage medium, since this would not let us perform essential tasks like:
The last point highlights a truism about the HTML specification: it is semantically impoverished and presentation-oriented. Its semantic poverty is trivially demonstrated by the fact that it lacks any way of distinguishing (for example) personal names, which are essential for indexing in our material. Worse, it offers an increasingly complex range of ways to specify how a span of text should be rendered, rather than any way of specifying what kind of object the text is.
In some ways, the definition and implementation of Cascading Style Sheets provide a way around some of the HTML problems, since CSS allows semantics to be derived in a roundabout way from the presentation style markup. Thus
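One way to picture this roundabout route is a style class that happens to carry meaning, which an indexing tool can then read back out of the HTML. The following is a minimal sketch only, not a description of any Elsevier system: the class name `persname`, the sample markup, and the helper `extract_by_class` are all hypothetical, and it uses nothing beyond Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.buffer = []    # text pieces of the element being collected
        self.found = []     # completed strings, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested element inside a match
        elif self.wanted_class in classes:
            self.depth = 1              # start of a new match
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.found.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_by_class(html, wanted_class):
    """Return the text of all elements whose class list includes wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical sample: the style rule is purely presentational, but the
# class name it hangs on is what an indexer would actually rely on.
SAMPLE = (
    "<style>.persname { font-variant: small-caps }</style>"
    "<p>Written by <span class=\"persname\">A. N. Author</span> and "
    "<span class=\"persname\">B. <i>Writer</i></span>.</p>"
)

names = extract_by_class(SAMPLE, "persname")
```

The sketch also shows the weakness of the approach: nothing in HTML or CSS says that `persname` denotes a person, so the meaning lives only in a private naming convention shared between producer and consumer.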
Since 1995, we have been experimenting with electronic journals, produced by using a mixture of HTML (generated from SGML) and PDF, and since 1997 we have offered an extremely large inter-linked database of articles via Science Direct. All the projects to date have a number of features in common:
While these systems 'work', they are less than ideal in all of the above areas:
In general, our production processes have been costly to set up, and are not producing products that are flexible enough for the user. All the flexibility that we introduce is at the generation end of the process, which has a potentially worrying impact on our business, since we cannot offer enough to the client. For instance, allowing a client a choice in document display between full article, summary of headings, or just front matter forces us either to maintain parallel 'canned' variants or to perform on-the-fly reconversion, which is inefficient and limited to those features provided by the system authors.
Elsevier Science can develop its Web-based offerings in at least three ways (although the priority of these is arguable):
Not surprisingly, these three directions coincide with developments in the last year related to XML:
It is relatively clear in what directions Elsevier can enhance its products, and it is also clear that we can do all that is required using existing technology -- at a price. At present, it seems at least likely that adoption of XML, and its companion forthcoming linking and style sheet standards, will be a sensible move for Elsevier. It is also very likely that the specialized applications of XML for mathematics (MathML) and chemistry (CML) will soon be utilized in our publications.
If, as suggested in the last section, our future lies in providing flexibility on the client side of delivery, we could become essentially independent of HTML. It then becomes a browser decision whether to convert XML markup into HTML for presentation, or render it directly.
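To make the idea of client-side flexibility concrete, the sketch below derives three views (front matter, a summary of headings, and the full text) from a single structured source, without any pre-generated variants. It is only an illustration under stated assumptions: the element names (`article`, `front`, `head`, and so on) are invented for this example and are not taken from the Elsevier DTD, and the code uses Python's standard `xml.etree.ElementTree` module in place of whatever a real browser or style sheet engine would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element names are assumptions for this sketch.
ARTICLE = """\
<article>
  <front>
    <title>A Sample Article</title>
    <author>A. N. Author</author>
    <abstract>One-paragraph summary.</abstract>
  </front>
  <body>
    <section><head>Introduction</head><para>First paragraph.</para></section>
    <section><head>Results</head><para>Second paragraph.</para></section>
  </body>
</article>
"""


def front_matter(root):
    """Return (element name, text) pairs for each child of <front>."""
    return [(child.tag, child.text) for child in root.find("front")]


def heading_summary(root):
    """Return the text of every <head> element, in document order."""
    return [head.text for head in root.iter("head")]


def full_text(root):
    """Return all character data in document order, whitespace-normalised."""
    return " ".join(" ".join(root.itertext()).split())


root = ET.fromstring(ARTICLE)
views = {
    "front": front_matter(root),
    "headings": heading_summary(root),
    "full": full_text(root),
}
```

Each view is a few lines over the same tree, which is the point: once the structured source reaches the client, the choice of display no longer requires the publisher to maintain parallel copies or reconvert on demand.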
It might be appropriate to compare HTML to PostScript. The latter language was a huge breakthrough in providing developers with a single interface language for many different rendering engines, and it made typesetting a consumer product. In the same way, HTML brought multi-media authoring to the average consumer by providing a single, accessible language. Subsequently, PostScript gave birth to Display PostScript and then PDF, which added all the functionality needed for screen as well as paper rendering, while HTML has slowly acquired more and more presentation features and a style sheet language. If we look at the same file rendered to HTML+CSS, and PDF, it is clear that the differences are quite small:
We suggest that, far from eschewing, and being ashamed of, presentation, HTML might evolve further in the direction of PDF. Just as the designers of PostScript and PDF sought to provide access to all the features needed by typographers, HTML can define the low-level functionality of screen documents, freeing browser writers to concentrate on the user interface instead of the complexities of rendering XML directly.
HTML could become an invisible language in the next millennium, written only by other, very specialized, software. However, if we make HTML more presentational, so it can serve as the output language for XSL, it will always carry its legacy around. Originally, HTML was not meant to be a language to specify presentation, and it does not seem like an ideal basis for such a language. A fresh start on what could be dubbed XPL (eXtensible Presentation Language) might be better made starting from something like PostScript, Lout or troff.