SGML on the Web: too little too soon, or too much too late?

[Mirrored from: http://www.sgmlbelux.be/96/burnard.htm]

Lou Burnard
Oxford University Computing Services

E-mail : lou@vax.oxford.ac.uk

Keywords: SGML on the Web, HTML, DSSSL, XML

What's the beef?

This audience probably does not need to be reminded what SGML is (Standard Generalized Markup Language: the international standard for structured document interchange, ISO 8879 (1986) if you've forgotten). However, it might be helpful to remind you of what exactly the Web is, particularly now that URLs and cyberspace have become an established part of mass media, junk capitalism, and other life as we know it. The best definition of the World-Wide Web I have come across was from Dave Raggett, who pointed out that the Web has exactly three components:

a set of protocols for exchanging data (HTTP, FTP, Gopher, telnet, etc.)
a name space within which data objects can be identified (URLs)
an interchange format (HTML)

Given the immense success of the World Wide Web, it is not unreasonable to ask what more anyone could reasonably require. As they say, "If it ain't broke, why fix it?" I'd like to begin by rehearsing some of the things that have proved to be wrong with that simple architecture.

First, the use of existing protocols. This has always seemed to me one of the greatest strengths of the Web's original design. By allowing from the start for an object to specify, either directly or by implication, that a client intending to deal with it must be able to launch some specific application, the Web has always been extensible in new and unpredictable ways. This makes it attractive to vendors wishing to make sure we all use package X, without at the same time preventing us from glorying in the CGI craziness provided by hacker Y. When the Web was first designed, it was, for many, simply a way of conveniently integrating many existing TCP/IP-based tools (hence, I suspect, the name 'Mosaic'). Then, as it evolved into the great docuverse imagined by Ted Nelson and other hyperheroes, the HTTP protocol began to seem the most important. Unfortunately, in at least one respect, this protocol is Broken As Designed: its object granularity is fixed and file-based (necessitating large amounts of clever cache management to maintain acceptable performance in an era of ever-shrinking bandwidth). While immense ingenuity has gone into implementing such necessary components of the docuverse as authentication, encryption, synchronous transmission of sound and video, etc., the results are inevitably complex, ad hoc, and in a permanent state of evolution. This may not, of course, be entirely a bad thing (as a system design principle, for example, it has served the natural world pretty well) but it makes life difficult for those concerned with the longer term view of our emerging global information systems.

The URL naming system has been so successful that I am a little shy of making any criticism of it. Nevertheless, I am not the first to wonder whether it might not be improvable. The number of broken links in the universe, and the difficulties constantly encountered in keeping them unbroken; the difficulty of identifying objects on the Web other than by a fragile addressing scheme, tied to specific instantiations of objects rather than an abstract name; the impossibility of reliably identifying higher level groups of objects; all point to something fundamentally wrong. If we think of the Web as a kind of library, it is one in which books have no ISBNs, no bibliographic control, no accession numbers, and no agreed set of subject-level descriptors. It took several centuries for those necessary mechanisms to evolve for print-based information-carriers: it is depressing to think that none of that expertise seems to be carried forward into non-print based media.

Lastly, what is wrong with HTML? Well, rather a lot, if we compare it with other general purpose document type definitions. At the risk of reminding you of something rather obvious, the HTML DTD tries to cater for the immense and glorious variety of structures that exist in electronic resources by taking the line of least resistance, and pretending that documents have no structure at all. Compare for example the following two declarations:

<!ELEMENT Book - - ((Title, TitleAbbrev?)?, BookInfo?, ToC?, LoT*, Preface*,
                (((%chapter.gp;)+, Reference*) | Part+ | Reference+ |
                Article+), (%appendix.gp;)*, Glossary?, Bibliography?,
                (%index.gp;)*, LoT*, ToC? ) +(%ubiq.gp;) >

<!ENTITY % html.content "HEAD, BODY">
<!ELEMENT HTML O O  (%html.content)>
<!ENTITY % body.content "(%heading | %text | %block | HR | ADDRESS)*">
<!ELEMENT BODY O O  %body.content>

The first, from the DocBook DTD, makes explicit that books potentially contain a number of subcomponents, each of which is distinguishable, and has a proper place. The second, which is from the HTML 2.0 DTD, states that the body of an HTML document contains just about anything in just about any order. (I have often wondered why HTML did not simply use ANY as the content model for the body and have done with it.) There is a place, of course, for such content models (particularly in DTDs such as the TEI, where an unpredictable richness of element types is available), but their downside in the HTML world should not be forgotten.
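The contrast can be made concrete with a rough sketch in Python. It is not an SGML parser: the two regular expressions below are hypothetical, heavily simplified stand-ins for the DocBook and HTML content models (the element names and the reduced set of alternatives are assumptions made for illustration), and each child element is represented simply by its name.

```python
import re

# Hypothetical, simplified stand-ins for the two content models.
# Each child element is written as its name followed by a single space.
BOOK_MODEL = re.compile(
    r"(title )?(preface )*(chapter )+(appendix )*(glossary )?(bibliography )?(index )*"
)
BODY_MODEL = re.compile(r"((h1|p|ul|hr|address) )*")


def conforms(model, children):
    """Return True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return model.fullmatch(sequence) is not None


ordered = ["title", "preface", "chapter", "chapter", "index"]
jumbled = ["index", "chapter", "title"]
anything = ["address", "hr", "p", "h1"]

print(conforms(BOOK_MODEL, ordered))   # the strict model accepts a well-ordered book
print(conforms(BOOK_MODEL, jumbled))   # ... and rejects the same parts out of order
print(conforms(BODY_MODEL, anything))  # the permissive model accepts any order
```

With these stand-ins the strict model rejects the jumbled sequence, while the permissive model accepts its elements in whatever order they arrive, which is why a validator has so little to say about an HTML body.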

HTML's permissiveness makes it difficult or impossible to do many of exactly those things for which we go to the trouble of making information digitally accessible. Specifically, it is hard to:

validate document data structures (for example where documents are to be managed by database software)
impose editorial control (for example in co-operatively authored projects)
generate navigational aids such as tables of contents directly from the document itself
generate or manage cross-document (or even intra-document) links in anything other than an ad hoc and manual manner
address or manage objects smaller or larger than a single document
efficiently re-use document components
search within semantically significant components of a document

This last difficulty highlights a further major class of drawbacks resulting from the nature of the HTML document type definition: it is semantically impoverished, and it is presentation-oriented. By semantically impoverished, I do not simply mean that HTML lacks any way of distinguishing, say, personal names and institutional names, or even names at all; indeed it provides no way of marking up any kind of textual object other than headings, lists, and (arguably) paragraphs. By presentation-oriented, I mean that HTML compensates for this serious lack only by allowing for an increasingly complex range of ways of specifying how a span of text should be rendered, rather than any way of specifying what kind of an object the span is. The relationship between what an object is and how it is rendered has exercised much theoretical debate, which I will not rehearse here, but one key fact remains: all SGML systems are predicated on the assumption that markup is introduced in order to distinguish semantic categories of various kinds, the meanings of which are rarely limited to how they should be rendered. On the contrary, the assumption is that they may be rendered in many different ways by different applications. This is hard, or impossible, with HTML.
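Two of the difficulties listed earlier, generating a table of contents from the document itself and searching within semantically significant components, become almost trivial once the markup says what things are. The following is a minimal sketch only: it uses Python's standard XML parser as a stand-in for an SGML system, and the element names (book, chapter, title, p, name) and the type attribute are hypothetical, not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# A hypothetical descriptively marked-up document (XML syntax used as a stand-in).
SAMPLE = """<book>
  <chapter><title>Protocols</title>
    <p>The paper cites <name type="person">Ted Nelson</name>.</p></chapter>
  <chapter><title>Names</title>
    <p>Nelson is also a town; <name type="org">SGML Open</name> is a consortium.</p></chapter>
</book>"""


def table_of_contents(root):
    """Derive a table of contents directly from the document structure."""
    return [chapter.findtext("title") for chapter in root.iter("chapter")]


def find_names(root, kind, word):
    """Search only inside name elements of the requested kind."""
    return [n.text for n in root.iter("name")
            if n.get("type") == kind and word in n.text]


root = ET.fromstring(SAMPLE)
print(table_of_contents(root))
print(find_names(root, "person", "Nelson"))

# A plain string search cannot tell the person from the town.
plain_hits = [p for p in root.iter("p") if "Nelson" in "".join(p.itertext())]
print(len(plain_hits))
```

The structured query returns only the personal name, whereas the plain search matches both paragraphs; with markup that records only bold and italic there is nothing for the first kind of query to hold on to.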

This focus on bold and italic, on headings and bulleted items, would matter less if HTML were extensible (or if its host environment allowed for its substitution by a more expressive DTD). It would also matter less if HTML were even adequate as a data format for large-scale commercial publishing. But neither of these is the case. If we compare even the best of HTML tools with even the worst of generic SGML tools, we note that the hardwiring of the HTML tool to a particular set of tags (with or without proprietary extensions) makes it impossible for the user to extend the tool's functionality in any way. By separating out formatting and structuring issues, even the humblest of SGML tools allows the user to retain complete control over the data.

The advent of HTML stylesheets appears to address this limitation, by extending the choice of formatting options available to HTML tools in a number of useful ways. However, the stylesheet mechanism as so far defined lacks several aspects of output control typically supported by generic SGML tools. It cannot, for example, be used as a means of re-ordering the components of a document, or of selecting parts of it in some application-specific manner -- both of which are perfectly reasonable requirements in mature technical publishing environments, and both of which are easily achieved by current generic SGML document processing systems.

So why don't we just drop HTML?

Leaving aside the economic, political, and sociological answers to this question, there is at least one important respect in which I have rather undersold the case for HTML in the discussion so far. HTML really only suffers when compared to generic (i.e. extensible) SGML from an author's or publisher's standpoint. Readers, on the other hand, don't care whether the display on their screen came from a state of the art object-oriented database, from a PostScript file, or by the careful application of black magic, as long as it looks nice. But all readers would like to be authors and publishers too -- that empowerment is after all what the Web was supposed to offer us. Moreover, the quality of service delivered by a network publication surely is not solely measured by the dramatic presentational effects it uses: sooner or later the reliability and sophistication of its content becomes a marketing advantage. Despite this interdependence, it may be helpful to re-assess the usefulness of HTML on either side of the client/server divide.

As a server format, HTML has some fairly evident drawbacks. Despite its cheapness, and its low start-up costs, any serious long term investment in service provision based on HTML documents as the primary storage method is unlikely to be wise. The headaches of maintaining consistent links in any moderately dynamic collection simply do not bear thinking about. A hybrid system, in which document management and control are carried out by a database system linked to a static collection of HTML documents, is possible, but will require as much investment as would a stand-alone native SGML document system, without any of the intrinsic benefits. At the risk of rehearsing the obvious, the advantages for server management of using a generic SGML database system are manifold:

SGML is an international standard; the products of vendors supporting it are therefore immune to current and future Internet politics, vendor wars, and ad hoc HTML extensions alike.
The extensibility of generic SGML means that documents can be marked-up according to publishers' particular needs, whether these are to satisfy niche markets or to gain competitive advantage, and also in ways appropriate to the particular type of document.
Off-the-shelf SGML tools are available to assist in the authoring of formally validated documents, and the enforcement of in-house editorial principles.
Links, indexes, and similar navigational aids can be generated directly from the structure of documents.
Queries against document databases can be more precise, for example by specifying context in SGML terms; this leads to quicker and cheaper query processing with better results, at little or no additional cost.
Documents can easily be reused for a variety of purposes; variant versions (for example, printed or online, scholarly or school, full or abridged) of the same document can be generated as required, with minimal problems of internal consistency.
On-demand documents can be configured in different user-specified ways (not just different typographical treatments).
Management and administration of large document repositories is facilitated.
SGML provides a necessary bridge to the future deployment of object-oriented authoring/publishing systems.

On the client side, the balance is in favour of HTML:

Sophisticated and feature-rich browsers are already widely deployed on almost every platform.
Customization and extension of HTML browsers, whether by use of style sheets, plug-ins, add-ons, or mothers' little helpers, is a familiar notion to the Web user community.
Simple local customization is simply done and, with the availability of style sheets, can become reasonably sophisticated, comparable with what is now available with the current generation of SGML browsers.

For the moment, it seems reasonable therefore to try to get the best of both worlds, by using SGML on the server side, with HTML as a delivery vehicle. Not only does this seem reasonable, it is indeed what a number of serious electronic publishers are already doing. Before asking whether this is indeed the optimal solution, I would like to compare two variations on this basic strategy: one in which the server does all the work, and the other in which the client does.

Getting the best of both worlds

Figure 1

Figure 1 shows the kind of architecture already widely deployed by a significant number of information providers on the Web. Information is stored, managed, retrieved and displayed under the control of an SGML aware DBMS of some kind. For the purposes of this argument it makes no difference whether this DBMS in fact handles native SGML or uses some kind of surrogate storage such as a traditional relational database interfaced with a document store. Queries against the database are made via an HTML interface, translated on the fly into SGML queries at the server end; results from the database are dynamically down-translated to HTML for transmission to the client. Several major SGML vendors already market products which perform in this way (examples include EBT's DynaWeb and Open Text's Latitude); moreover the complexity of implementing this solution with existing programming tools via the CGI interface is well within the scope of any competent Web programmer.

The strengths of this approach are largely self evident. Any existing investment in a substantial SGML repository is protected, while the benefits of future investment and development in such a system can be passed on in terms of an enhanced service. Serving HTML is essentially no different from choosing any of the other many different publishing media which such central repositories expect to have to support during their lifetime. Service providers enjoy all the advantages of using SGML for resource management as outlined above. The major drawback to this solution lies in its high set-up cost, the need to develop application-specific translators, and the possible need to fine-tune the system for optimal performance, all of which may make it unattractive for small repositories. These costs should however be weighed against the benefits of providing a more responsive and sophisticated enquiry system, able, for example, to deliver just the portion of a large document which is required. Less obviously, it should be noted that the effect of this architecture is to reduce the HTML client to the status of a second class citizen, unable to access directly the full complexity of the document repository.
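The down-translation step at the heart of this architecture can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how DynaWeb, Latitude, or any other product works: the repository element names and the TAG_MAP table are hypothetical, and Python's standard XML parser stands in for a real SGML engine.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from repository element types to HTML tags.
TAG_MAP = {"chapter": "div", "title": "h1", "para": "p", "emph": "em", "name": "b"}


def down_translate(elem):
    """Recursively render one repository element as an HTML string."""
    tag = TAG_MAP.get(elem.tag, "span")  # unknown element types fall back to a neutral tag
    parts = ["<" + tag + ">", escape(elem.text or "")]
    for child in elem:
        parts.append(down_translate(child))
        parts.append(escape(child.tail or ""))  # text following the child element
    parts.append("</" + tag + ">")
    return "".join(parts)


source = ("<chapter><title>Names</title>"
          "<para>See <name>Ted Nelson</name> &amp; others.</para></chapter>")
root = ET.fromstring(source)

print(down_translate(root))               # the whole chapter
print(down_translate(root.find("para")))  # just the portion that was asked for
```

Note what the client receives: the name element arrives as a bare b, so the distinction the repository maintains is lost in transit. That is the sense in which the HTML client is a second class citizen.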

Figure 2

In the architecture shown in Figure 2, the responsibility for generating SGML-aware enquiries and for handling the SGML result stream is devolved to the client. In practice, most existing implementations of this strategy focus only on the second part of the problem. SGML objects are regarded as special kinds of document, with an appropriate MIME type which the browser must be configured to handle appropriately. Any SGML-aware browser or application could be used to do this, but the world's current favourite appears to be SoftQuad's Panorama, no doubt largely because a free version of this browser is available.

The success of this strategy depends crucially on the extent to which the helper application is tightly coupled with the user agent which has launched it. If, for example, the SGML document contains links to other documents, it is rather important that the browser can hand these back to the user agent for processing in a seamless and efficient manner. Equally, when the user agent invokes the SGML browser it is rather important that it provides all the environmental resources needed for it to do its job, for example a DTD and a stylesheet, or their equivalents.

SGML-aware Java and ActiveX add-ons for existing user agents will no doubt come to market sooner rather than later, with consequent improvements in performance and facilities. (Some have remarked that "SGML gives Java something to do".) It remains to be seen, however, how effectively they can address the first half of the problem touched on above: the need to address SGML objects or document fragments of which only the server has knowledge. This is particularly true of metadata, and essential for true distributed document processing.

One does not need to think very long or hard about the functionality needed before it becomes apparent that the combination of user agent plus SGML-aware browser is beginning to look very much like a new form of user agent entirely: a generic SGML client. I will conclude therefore by discussing some aspects of the case for expecting full generic SGML support from Web clients.

Why not do the job properly?

If SGML-to-HTML servers are complex, expensive, and CPU-intensive database applications which only large corporations can afford, and hybrid clients like Panorama provide only half of the functionality needed, at twice the cost, we clearly need to see a new breed of software before we can deliver on some of the promises of the world of structured documents. However sophisticated our servers, existing user agents will remain unable to take full advantage of the potential richness of the SGML documents already existing in the world, still less those which are being created, so long as they persist in regarding anything beyond HTML as outside their preserve.

What is required for future Web user agents to be able to receive and process any SGML object in the way that they are currently able only to handle HTML? There are two halves to the answer. Firstly, servers providing SGML objects need to deliver along with them some kind of wrapper indicating the document's structural description (DTD) and stylesheet information defining its rendition. Secondly, clients must be able to unpack this package of information correctly, delegating the actual processing of the document to appropriate subcomponents responsible for parsing, constructing, and rendering it. This approach necessitates the creation of a number of specifications, key elements of which are listed below:

specifications for the packaging of SGML objects and fragments: this work is currently being undertaken by a technical committee of the SGML Open consortium of vendors;
specifications for the transmission of SGML entities: the SGML Open Catalog mechanism goes a long way to meeting this need, though its interoperability with MIME-based mechanisms remains unclear;
specifications for document structuring and rendering
specifications for document link semantics based on HyTime
a simplified version of SGML

Work on specifying all of these components, in the context of the Web, is already well advanced, as a result of substantial discussion and serious work within the appropriate expert communities, most notably within working groups of the W3C consortium. For an overview of the current situation, see http://www.w3.org/pub/WWW/MarkUp/SGML. I will conclude with a few remarks on two of these only: those concerned with document rendering, and the need for a simplified SGML.
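Before turning to those two, the packaging requirement can be illustrated with a small sketch. It assumes, purely for the sake of argument, that the wrapper is a MIME multipart message built with Python's standard email package; the actual packaging specification was still under development, and the file names used to identify each part are hypothetical.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pack(document, dtd, stylesheet):
    """Server side: bundle an SGML document with its DTD and stylesheet."""
    package = EmailMessage()
    # The file names identify each part's role; they are illustrative only.
    package.add_attachment(document, subtype="sgml", filename="document.sgml")
    package.add_attachment(dtd, subtype="sgml", filename="document.dtd")
    package.add_attachment(stylesheet, subtype="plain", filename="document.sty")
    return package.as_bytes()


def unpack(raw):
    """Client side: recover the parts so each can go to the right subcomponent."""
    package = message_from_bytes(raw, policy=policy.default)
    return {part.get_filename(): part.get_content().strip()
            for part in package.iter_parts()}


raw = pack("<memo>Ship it.</memo>",
           "<!ELEMENT memo - - (#PCDATA)>",
           "memo: bold")
parts = unpack(raw)
print(sorted(parts))
```

A client built along these lines would hand the DTD and document to its parser and the stylesheet to its renderer; how it does so is exactly what the specifications listed above have to settle.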

At present, each SGML browser has its own proprietary SGML stylesheet language. No content provider could reasonably be expected to design and supply a different stylesheet for every possible target. Some kind of generic stylesheet mechanism is thus clearly essential. At present two candidates for this mechanism present themselves: the Cascading Style Sheet (CSS) mechanism, and the ISO standard Document Style Semantics and Specification Language (DSSSL). The advantage of the latter is not simply that it has emerged from the standards community after nearly a decade of very hard work, nor that real implementations of it are now freely available. It is simply that cascading styles don't have enough power for the job.

As currently defined, Cascading Style Sheets lack the concept of a parse tree essential to correct processing of an SGML document. Consequently:

you cannot take an element (a chapter title perhaps) from one part of the tree for re-use in another (say, a page header);
you cannot treat all sibling elements (say all but the first paragraph in a division) in a particular way;
you cannot treat elements differently dependent on their context (for example headings of a figure as opposed to headings of a chapter).

Because there are no programming language features, a CSS style sheet lacks decision structures, modularization, variables, and arithmetic calculation. As a way of improving the way that HTML texts are rendered on screen (provided that they are in Western alphabets), it is adequate, but as a generic solution to the problem of rendering SGML documents, it lacks a lot.
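By way of contrast, the three operations listed above are straightforward once a parse tree and a programming language are available. The sketch below is in Python rather than in any stylesheet language, and the element names and style labels are hypothetical; it is meant only to show what kind of access to the tree each operation needs.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<chapter><title>Rendering</title>"
    "<div><p>First.</p><p>Second.</p><p>Third.</p></div>"
    "<figure><title>Data flow</title></figure></chapter>"
)


def page_header(chapter):
    """Re-use the chapter title in another part of the output (a running header)."""
    return "[" + chapter.findtext("title") + "]"


def indent_flags(division):
    """Mark every paragraph except the first in a division for indentation."""
    return [(p.text, index > 0) for index, p in enumerate(division.findall("p"))]


def title_styles(root):
    """Choose a style for each title according to the element that contains it."""
    styles = {"chapter": "bold-large", "figure": "italic-small"}
    return [(child.text, styles[parent.tag])
            for parent in root.iter() for child in parent if child.tag == "title"]


print(page_header(DOC))
print(indent_flags(DOC.find("div")))
print(title_styles(DOC))
```

Each function needs to look beyond the element being rendered, to a distant node, to siblings, or to a parent, and makes a decision on the basis of what it finds.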

The key advantage of DSSSL lies in its modular design. It integrates three key components:

a language for querying SGML documents
a language for specifying transformations from one SGML document into another
a language for associating formatting characteristics with an SGML document

These components interact as shown in figure 3 below:

Figure 3

A full description of DSSSL is beyond the scope of this paper: a good description of the DSSSL-Online subset (from which the above figure is taken), and a number of other tutorials are freely available from James Clark's DSSSL pages (at http://www.jclark.com/dsssl) and elsewhere. Its key features for the present argument are as follows:

it incorporates document transformation as a distinct exercise from document rendering;
the rendering component retains access to all parts of the SGML input;
free software tools implementing key parts of the specification are already available.

Consequently, anything which can be expressed in the SGML definitions underlying a document repository can be used in the creation of the view of it which a particular client sees. A user agent with a suitable DSSSL specification can handle whatever SGML structures are obtained from a true SGML server, reordering, selecting, combining, and rendering SGML elements according to a formally complete specification.
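The separation of transformation from rendering can be suggested by a short sketch. It is written in Python, not in DSSSL, and makes no claim to reproduce DSSSL's own syntax or semantics; the report, section, summary, and digest element names are hypothetical. The first function selects and reorders elements into a new tree, and the second attaches presentation to the result.

```python
import xml.etree.ElementTree as ET

SOURCE = ET.fromstring(
    "<report>"
    "<section><title>Costs</title><para>High.</para><summary>Costly.</summary></section>"
    "<section><title>Benefits</title><para>Many.</para><summary>Worth it.</summary></section>"
    "</report>"
)


def transform(report):
    """Transformation step: build a new tree that selects and reorders elements."""
    digest = ET.Element("digest")
    sections = sorted(report.findall("section"), key=lambda s: s.findtext("title"))
    for section in sections:
        entry = ET.SubElement(digest, "entry")
        # Summary first, then title; the full paragraphs are left out.
        ET.SubElement(entry, "summary").text = section.findtext("summary")
        ET.SubElement(entry, "title").text = section.findtext("title")
    return digest


def render(digest):
    """Rendering step: a separate pass that decides how each entry is presented."""
    return [entry.findtext("summary") + " (" + entry.findtext("title").upper() + ")"
            for entry in digest]


digest = transform(SOURCE)
print(ET.tostring(digest, encoding="unicode"))
print(render(digest))
```

Because the intermediate digest is itself a structured document, a different rendering pass (for print, say, or for speech) could be applied to it without touching the transformation.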

The big question in all this remains: if SGML is so great, why has it not taken over the world already? Amongst (varyingly sensible) answers to this which I won't pursue further are the argument that it has, at least as far as serious document management is concerned; the argument that taking over the world is not the object of the exercise since SGML vendors and advocates are culpably uninterested in developing software for the common man or woman; and the argument that there is an inherent contradiction between the goals of SGML and the goals of the politico-industrial-military complex which currently runs the data processing industry. However, the question requires an answer, and perhaps the development of XML will provide it.

XML (eXtensible Markup Language) is a new activity of the W3C SGML working group, which is due to see the light of day at the end of 1996, with a targeted implementation date of March 1997. Its goal is to define a leaner, simpler subset of the SGML metalanguage, better suited to use on the Internet, able to support a wide variety of applications, and with a concise formal design. A set of design principles (available at http://www.textuality.com/sgml-erb/dd-1996-0001.html) spells out what is meant by "leaner and simpler", but is a little less clear on what is meant by a "subset of SGML". Over the last three or four months, a select group of about fifty SGML experts have been debating, with all the vigour and obsessive attention to detail so characteristic of the breed, exactly which parts of the SGML elephant should be cast to the wolves following the sledge on its way towards the promised land of XML implementability.

Amongst topics which have been discussed I list only a few to give some flavour of the radical nature of what is being proposed:

prohibition of variant concrete syntaxes
rationalization of the rules about where whitespace and record boundaries are significant
abolition of most optional SGML features
abolition of most minimization conventions
abolition of the need for a DTD for all kinds of processing
mandatory support for wide character sets such as Unicode
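The practical effect of two of these proposals, dropping tag minimization and dropping the requirement for a DTD, can be illustrated with a short sketch. It uses the XML parser in Python's standard library as a stand-in for a processor built to the proposals under discussion, which is an assumption on my part; the list and item element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text can be parsed with no DTD at all."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


explicit = "<list><item>one</item><item>two</item></list>"
minimized = "<list><item>one<item>two</list>"  # end tags omitted, SGML style

print(is_well_formed(explicit))
print(is_well_formed(minimized))
```

The first document can be processed by a parser that knows nothing about its document type, because every element boundary is explicit. The second relies on omitted end tags, which full SGML can resolve only by consulting the DTD, and so it is rejected.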

A smaller subset of the group has also been voting on about a hundred specific aspects of ISO 8879 which need to be dropped, revised, or retained, to support these objectives. This electronic electoral process (carried out over the Web, needless to say) was completed at the start of October, and the XML editors will presumably be spending the next few months either reconciling their own decisions with the views expressed, or coming up with some pretty convincing ideas as to why they have not followed them. Publication of a complete XML specification early in the new year will, it is hoped, remove the last obstacle to the emergence of a new breed of truly SGML-aware user agents on the Web, able to take full advantage of the true potential of the information revolution that began ten years ago with the publication of ISO 8879.