HTML to the Max: A Manifesto for Adding SGML Intelligence to the World-Wide Web

C. M. Sperberg-McQueen and Robert F. Goldstein, 15 Sep 1994

Abstract

HTML demonstrates that SGML markup is useful for networked information. How can it be made even more useful? One way is to extend the tag set from HTML to HTML2, etc. We argue here for a more radical approach: full SGML awareness in WWW. We believe the difficulties are small, the cost affordable, and the advantages overwhelming.

SGML is a metalanguage for defining markup languages; HTML is just one instance of this infinite family. At present, documents in other SGML document types must be translated into HTML for display by a Mosaic client --- sometimes this imposes unacceptable information loss.

WWW browsers could handle other SGML document types without translation by launching a general-purpose SGML browser to view them, as they now launch graphics viewers; a better solution overall would be to build SGML display into the WWW browsers themselves. Either way, display of an SGML document would be controlled by a style sheet using a small number of display primitives ('bold', 'line break', etc.) to specify the rendition of each element type. For 'well-known' document type definitions (DTDs) like HTML, style sheets could be distributed with the browser, or built in. For other DTDs, the browser would fetch a style sheet from the server. Using style sheets, browser software can also make it easy to customize document display.

DTDs and style sheets can be designed to accommodate extensions, ensuring that authors can make small extensions to the tag set with no change whatsoever in the target browsers and virtually no performance penalty.

A Simple Proposal

Note: this is an opinionated paper, not so much because we think the issues are all black and white as because (a) the ideas we are pushing will be clearer to most readers if we exaggerate them slightly, (b) we don't have enough space to expound all the nuances, and (c) black and white contrasts are more fun to talk and hear about.

Let us start with a simple proposal. The current generation of Web client software, when it receives an HTML document, does something like this:

In the current generation of software, the information used in deciding how to process the tag is hard-coded into the Web client: at code-writing time, the programmer decides how to format each tag, based on the descriptions of typical renderings given in the HTML specification. (Of course this isn't quite true, since default fonts can often be changed at run time or even interactively. But the list of tags and their semantics is fixed by the definition of HTML. Also, decisions about line breaks, justification, etc., are not left to the user.)

Our simple proposal is that all of this continue much as it does now, but that the rules for processing each tag (and by implication, the complete list of tags that can be processed) be loaded dynamically at the time the document is fetched from the server. Web browsers should implement a table of processing options, which they can load from disk (or remote server) and use when processing an HTML (or HTML++++) document.
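The idea of a dynamically loaded table of processing options can be sketched as follows. This is a minimal illustration, not a proposed format: JSON is used here only as a convenient stand-in notation, and the tag names, the property names ("display", "bold", "space_before"), and the functions load_rules and rule_for are all hypothetical.

```python
import json

# Hypothetical rule table, as it might arrive from disk or from a server.
# Tag names and display properties are illustrative assumptions only.
RULES_JSON = """
{
  "h1":   {"display": "block",  "bold": true,  "space_before": 1},
  "p":    {"display": "block",  "bold": false, "space_before": 1},
  "em":   {"display": "inline", "bold": true,  "space_before": 0},
  "tune": {"display": "block",  "bold": false, "space_before": 2}
}
"""

# Rule applied to any tag the table does not mention.
DEFAULT_RULE = {"display": "inline", "bold": False, "space_before": 0}


def load_rules(text):
    """Parse a rule table; nothing about the tag set is compiled in."""
    return json.loads(text)


def rule_for(rules, tag):
    """Look up the processing rule for a tag, falling back to a default."""
    return rules.get(tag.lower(), DEFAULT_RULE)


rules = load_rules(RULES_JSON)
```

The point of the sketch is that "tune", a tag no HTML browser knows, is handled by exactly the same code path as "h1": the browser consults the table rather than a hard-coded list.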

The two key points are:

Perhaps the most important advantage of this approach is that each author can define his or her own tag set, with its own attached "meaning", without caring (much) about which browser will be used to view the document. Note that "meaning" can have two meanings -- (1) how the tagged object is displayed, and (2) what the relationship of the tagged object is to other tagged objects and to the human. We need not wait for an endless succession of HTML extensions to wend their way through a committee. In our view, the power of semantics belongs at the authors' fingertips, not the programmers'.

All sorts of information can be encoded in a rich tag set, but no one can master a superset-of-everything explicitly. In our proposal, a musician can annotate music, a mathematician can mark an equation, a programmer can comment on code, a statistician can select a column of data, and a database query can have as many different "submit" buttons as one wants.

Currently in HTML, however, there are at most two solutions for these problems: <pre> or <img src=>, both of which preserve a simple display but destroy the vital information necessary for a wide variety of sophisticated displays or other post-processing. If, however, a browser can preserve information from arbitrary tags, then all sorts of post-processing is possible, not excluding the mundane display-on-a-terminal.

The musician might view the document, and then import a few bars into a MIDI program. The mathematician might view an equation, then import it into Mathematica and solve it. The statistician might import a few rows from a table into SAS and compute a standard deviation. And so on. SGML doesn't make it happen, but it does make it possible.

Some other obvious advantages:

The upside of our proposal is that authors will have complete control over their tag sets; the downside is that this control must be expressed in terms of a new style-sheet language. We believe the trade is a substantial net gain, because a sensible language that can define tags is much richer than any pre-defined tag set. But a widespread implementation must come about either by widespread agreement, or by one person doing a great job and giving away a gazillion copies of a new (or improved) browser.

How is This Related to SGML at All?

SGML is a metalanguage for defining markup languages; HTML is one example of an SGML-defined language. Our proposal above does not require browsers to parse and handle and validate arbitrary SGML in the usual way that real SGML editors do. Indeed it does not even force a newly-defined tag set to be SGML compatible. But restricting Web documents to bona fide SGML documents is the only sensible way to go, for two reasons.

First, sufficiency. SGML is rich enough to provide good solutions to virtually all of the network's markup requirements for many years to come. SGML provides a public, non-proprietary method for interchange of data of all kinds. It is particularly suited for capturing the structured nature of text, and it coexists well with graphics (and other special encoded files) in any format. Roughly speaking, SGML is naturally suited for defining a record type in an arbitrary object-oriented database. If sending tree-structured objects (such as HTML+++ documents) is useful to the Web, SGML is the tool of choice.

Second, necessity. Although browsers need not validate SGML documents, they (or various external viewers) may find it useful to do so. Error detection and recovery is just one use of validation. Without the assurance of a valid SGML document, it would be fairly difficult for a browser to post-process a document --- for example to export a complicated mathematical equation to a clipboard for import into Mathematica. Although a style sheet suffices for many applications, it cannot replace a proper SGML document type definition for other uses. Allowing non-SGML documents opens a large can of worms for viewers to be built in the next few years, but doesn't seem to have any real advantages.

Using External SGML Browsers

There are two obvious approaches to providing better support for SGML on the Web. The first is to treat it like any specialized data format, and to launch specialized browsers to display data in that form. This approach is described in this section. The other approach, integrating SGML awareness, beyond HTML awareness, into Web browsers, is described in the next section.

Using existing software, it is easy to support SGML as a specialized data type. For example, we have implemented a demonstration of SGML on the Web, using the commercial SGML editor Author/Editor, by SoftQuad, Inc., by taking the following two steps:

Using this approach, we can exploit SGML for a number of uses to which HTML is not now suited:

Unfortunately, processing SGML with an external browser does have some limitations and drawbacks. Most important, we cannot, with current software and protocols, use an external SGML browser to browse hyperlinked documents: or rather, we could, but the browser has no way of notifying the client that the user has clicked on a link end, so there is no way to traverse the hyperlinks, which tends to defeat the purpose of network-based hypertext. Of course, this limitation applies with equal force to all data formats handled with external browsers. It might be removed by defining callback functions or some other method of communication between the WWW client and the external browser, as suggested recently by Bruce R. Schatz and Joseph B. Hardin.

If we wish to distribute SGML documents over the Web without restricting the set of SGML tags which can be used, however, it is not enough simply for the Web server to label the data with its proper type, and for the client to launch an SGML browser to display it. SGML documents are, formally speaking, instances of general document types, which in turn are formally defined by SGML document type definitions (DTDs). When public DTDs are used, the DTD may not need to be transmitted, because the recipient (the client) may already have a copy. The DTD may not be a public one, however: it may be an ad hoc DTD designed for a particular purpose, like the DTD for software documentation we mentioned earlier. And even if the DTD is a public one, the client may not have a copy handy. So it is essential that the client be able to find a copy of the DTD used by an SGML document.

We propose that any HTTP server which distributes an SGML document should be responsible for providing the DTD for that document, on demand, and that the HTTP header for the document itself provide a universal resource identifier for the DTD. If we read the HTTP specification right, the WWW-Link field should be used for this purpose:

WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';
           rel='DTD'

In practice it is equally essential that the SGML browser know or be told how to handle the document in question. In SGML systems, the desired handling or processing of a given document type is kept rigorously separate from the specification of its legal form; the document is tightly bound to its formal specification, but only loosely bound to any particular method of processing it. Different processing specifications, or style sheets, can thus easily be introduced, and the same input document can be processed in multiple ways. SGML thus enables texts to have the same kind of controlled redundancy and processing independence which databases have secured for other types of data.

If a Web client is going to be able to handle any arbitrary SGML DTD, then it must always be able to find a description of what kinds of 'handling' will be required --- at the simplest level, the external SGML browser needs to know how to display the document on screen. We propose that once again, the HTTP server be responsible for supplying a style sheet on demand, for any document it has provided. Style sheet specifications for public DTDs (such as HTML, the Text Encoding Initiative encoding scheme, or ISO 12083) will presumably become widespread common property; users could easily customize them locally, and servers might well provide access to more than one style for a given DTD.

We need to choose a standard language in which these style sheets are to be written, however. Most existing SGML editors and browsers do have style sheet mechanisms; unfortunately, they are currently product-specific, not standardized. It would be insane, however, to expect authors or publishers to formulate multiple style sheets for their documents, one in each proprietary style-sheet language. If the World Wide Web is to take serious advantage of SGML, this means that a common style-sheet language for browsing and forms, at least, must be agreed on.

As with DTDs, we propose that the WWW-Link field of the HTTP header be used to identify one or more style sheets suitable for a given document. For example:

WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';
           rel='style-sheet'; title='Basic TEI Style Sheet'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2beta.style';
           rel='style-sheet'; title='TEI Style Sheet, alternate form'
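A client would need to pull the href, rel, and title values out of such header fields. The following is a rough sketch of how that might be done, assuming the quoting and continuation-line layout shown in the examples in this paper (not checked against the HTTP specification); the function name parse_www_link is hypothetical.

```python
import re

# Matches name='value' pairs as written in the example header fields.
PARAM = re.compile(r"(\w[\w-]*)\s*=\s*'([^']*)'")


def parse_www_link(header_text):
    """Return one dict of parameters per WWW-Link field.

    Continuation lines (those starting with whitespace) are folded
    into the preceding field before the parameters are extracted.
    """
    fields = []
    for line in header_text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and fields:
            fields[-1] += " " + line.strip()
        else:
            fields.append(line.strip())
    links = []
    for field in fields:
        name, _, value = field.partition(":")
        if name.strip().lower() == "www-link":
            links.append(dict(PARAM.findall(value)))
    return links


EXAMPLE = (
    "WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';\n"
    "           rel='DTD'\n"
    "WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';\n"
    "           rel='style-sheet'; title='Basic TEI Style Sheet'\n"
)
links = parse_www_link(EXAMPLE)
style_sheets = [link["href"] for link in links if link.get("rel") == "style-sheet"]
```

Because each field carries its own rel value, the same routine serves for locating both the DTD and any number of alternative style sheets.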

More on style sheets below.

In summary, successful widespread use of external SGML browsers seems to require that a number of steps be taken.

Integrating SGML Support in the Web Browser without Losing Your Mind

Some users, and some implementors, may prefer another approach to SGML support: namely, integrating SGML knowledge and support in the Web client, rather than externalizing it into an external browser. Such integration offers the same advantages as the integration of ftp and gopher knowledge, or support for common graphics formats, into the Web client: it allows the user access to more function through a single unified user interface. Equally, it has the same drawbacks: a Web browser is unlikely to provide SGML support on a par with a stand-alone SGML system, just as it is unlikely to provide graphics facilities which equal those of specialized graphics programs. We are not going to try to argue the case pro and con here; we think it does make sense to integrate SGML intelligence directly into Web clients, and we propose in this section to outline rather briefly what would be involved, and how to keep the task manageable. (N.B. some SGML technical terms are used without warning or definition in the discussion which follows; implementors will need to learn those terms if they want to write conforming software, but other readers can just skip over the technical bits.)

In a Web browser with full support for SGML, everything would work pretty much as described in the previous section, except that upon seeing data labeled, for example, "text/x-sgml-tei", the browser would not launch an external viewer. Instead, it would locate a style sheet for the DTD in question (first looking locally, and then requesting the style sheet from the server if need be), and then display the document for the user, as specified by the style sheet.
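The lookup order just described (local copy first, then the server) might be sketched like this. The function name, the shape of the link dictionaries, and the caller-supplied fetch function are all assumptions made for illustration; storing a fetched sheet for later reuse is likewise an assumption, not part of the proposal as stated here.

```python
def find_style_sheet(media_type, local_sheets, links, fetch):
    """Return (source, style_sheet_text) for a document.

    media_type:   label on the data, e.g. "text/x-sgml-tei"
    local_sheets: dict mapping media type -> style sheet text held locally
    links:        list of dicts with "href" and "rel" keys, taken from
                  the document's HTTP header
    fetch:        callable taking a URL and returning text (caller-supplied)
    """
    if media_type in local_sheets:
        return "local", local_sheets[media_type]
    for link in links:
        if link.get("rel") == "style-sheet":
            text = fetch(link["href"])
            local_sheets[media_type] = text  # assumption: keep it for next time
            return "server", text
    raise LookupError("no style sheet available for " + media_type)


# Demonstration with a stand-in fetch function that records its calls.
fetched_urls = []


def fake_fetch(url):
    fetched_urls.append(url)
    return "style sheet from " + url


local = {}
header_links = [{"href": "ftp://example.invalid/tei2.style", "rel": "style-sheet"}]
first = find_style_sheet("text/x-sgml-tei", local, header_links, fake_fetch)
second = find_style_sheet("text/x-sgml-tei", local, header_links, fake_fetch)
```

In the demonstration the first request goes to the (fake) server and the second is answered locally, so the server is contacted only once.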

In order to make it easier for Web clients to support SGML, we propose the following Web-wide conventions:

If these conventions are followed, the client need not implement a full SGML parser, merely a non-validating minimal SGML parser, which is somewhat simpler. The implementation becomes simpler still, however, if we adopt a couple of strict application conventions, which make the parser's obligations even simpler to fulfill. Even minimal parsers are required to know that empty elements have no end-tags; even non-validating parsers are required to treat newline sequences in different ways, depending on how various elements were declared in the DTD. We can eliminate these requirements by adopting these rules:

Together with the commitment to server-side validation, these two conventions allow the client's SGML parser to ignore the document type definition (DTD) for a document almost entirely. The start and end of every non-empty element in the document is explicitly marked, and the empty elements are all identified in the style sheet. Because newlines are allowed to be significant to the application only in well defined restricted areas, the parser need not attempt to implement the newline rules of the SGML standard. The parser must scan the DTD only for SGML entity declarations, since it must be able to expand entity references in the document instance.
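To suggest how little is left for the client to do under these conventions, here is a rough sketch of a tree builder for fully normalized markup. It assumes every non-empty element has an explicit start-tag and end-tag and that the set of empty elements is handed in from outside (in our proposal, from the style sheet). It handles only start-tags, end-tags, and character data; attributes are skipped, and comments, processing instructions, marked sections, and entity references are not treated at all. The function name and the list-based tree representation are hypothetical.

```python
import re

# A start-tag or end-tag (attributes skipped), or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>|([^<]+)")


def parse_normalized(text, empty_elements):
    """Build a tree of [name, children] nodes from fully normalized markup.

    empty_elements is the set of element names that never have an end-tag;
    it is supplied by the caller rather than read from a DTD.
    """
    root = ["#document", []]
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            stack[-1][1].append(chars)
        elif closing:
            if len(stack) == 1 or stack[-1][0] != name.lower():
                raise ValueError("unexpected end-tag: " + name)
            stack.pop()
        else:
            node = [name.lower(), []]
            stack[-1][1].append(node)
            if name.lower() not in empty_elements:
                stack.append(node)  # wait for the matching end-tag
    if len(stack) != 1:
        raise ValueError("missing end-tag for: " + stack[-1][0])
    return root


tree = parse_normalized("<p>one<br>two <em>three</em></p>", {"br"})
```

Note that the routine never consults a content model: with minimization ruled out, a stack of open element names is all the structural knowledge it needs.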

We could eliminate the need to scan the DTD even for entity declarations, if we adopt the rule that the server will expand all entity references, except for those needed to provide access to Latin characters with diacritics, Greek characters, and the like, which are generally included in standard entity sets issued by ISO. It is probably better, however, to allow for at least the possibility of client-side expansion of entities. This will reduce bandwidth requirements in some cases (functioning like a client-side INCLUDE), but a more important reason is that entity declarations are crucial to SGML interfaces to data in other notations (such as graphics files). Where it is possible, of course, server-side expansion of entity references is desirable, since it completely eliminates the client's need to read the DTD. We propose, therefore, that the HTTP header specify (in ways to be determined, possibly by a content-encoding field) whether entity references will be pre-expanded by the server or not.
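Client-side expansion of entities could look roughly like the sketch below. It assumes only the simplest form of declaration, an internal general entity with a quoted replacement text; parameter entities, external (SYSTEM or PUBLIC) entities, and entity references nested inside replacement text are deliberately left out. References to undeclared entities are left untouched, on the assumption that standard character entities would be resolved by some other table. The function names are hypothetical.

```python
import re

# Matches simple declarations of the form <!ENTITY name "replacement text">.
ENTITY_DECL = re.compile(
    r"""<!ENTITY\s+([A-Za-z][\w.-]*)\s+(["'])(.*?)\2\s*>""", re.S
)
# Matches references of the form &name; in the document instance.
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def scan_entities(dtd_text):
    """Collect internal general entity declarations, ignoring all else."""
    return {name: value for name, _quote, value in ENTITY_DECL.findall(dtd_text)}


def expand_entities(text, entities):
    """Replace each declared &name; reference; leave unknown ones as-is."""
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)


DTD_TEXT = (
    "<!ELEMENT p - - (#PCDATA)>\n"
    '<!ENTITY tei "Text Encoding Initiative">\n'
    '<!ENTITY % local.bits "ignored here">\n'
)
entities = scan_entities(DTD_TEXT)
expanded = expand_entities("The &tei; uses &eacute; too.", entities)
```

Everything in the DTD other than the entity declarations is simply skipped, which is what allows the client to avoid parsing element and attribute declarations.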

We are placing as much as possible of the responsibility for validation and processing on the server, rather than on the client. Documents are (on average) read by clients more often than they are updated on the server, so server-side validation requires less work overall. Servers typically run on more powerful machines, so they have more capacity to do the work; keeping the client's responsibilities simple is critical if we want browsers to be usable even on low-end machines. (And remember, no matter what population you are looking at, half the machines in question will be below the median in speed and power.)

Of course, server-side validation will also complicate the process of publishing documents on the Web: they will have to be validated before becoming available. This will require that publishers be given software which can validate and normalize SGML documents; fortunately, the public-domain parser SGMLS can readily be used for validation, and with a few tweaks it can also be used as the basis for an SGML normalizer. Of course, most people who will be interested in providing access to SGML documents over the Web already have SGML software, and validation and normalization of documents may already be part of their normal routine. (Of course, server-side normalization is necessary only in order to make it easier to include SGML intelligence in the Web client itself; existing stand-alone SGML browsers are normally able to handle markup minimization, i.e. to perform their own normalization.)

If the conventions we propose are adopted, the task of the SGML parser in the WWW client is reduced to recognizing and processing the following types of SGML markup:

There is no need for detailed discussion of these kinds of markup. Existing HTML software typically handles start-tags, end-tags, entity references, and comments; the HTML specification documents but deprecates both processing instructions and marked sections. Conforming parsers must, however, properly recognize and process them. Fortunately, their syntax is relatively straightforward and adds very little to the complexity of the parser.

If the principles of minimal SGML, server-side validation, and the application conventions regarding newlines and empty elements are accepted, then it will be relatively simple to integrate SGML conforming parsers into existing or new WWW client software. Like SGML support using external browsers, of course, this method also requires use of the HTTP header to indicate the location of the DTD and style sheet, and the adoption by the Web community of a common style-sheet language for use in WWW display of documents. We turn now to the description of that style-sheet language.

Style Sheet Languages

Style sheets, like DTDs, are auxiliary documents, meta-documents, which describe the structure and processing of some set of base documents. DTDs have a standard notation prescribed by an ISO standard; style sheets have no standard notation as yet, though ISO is currently balloting an international standard called the Document Style Semantics and Specification Language (DSSSL), which will provide a standard syntax for style sheets and other specifications for SGML processing.

Semantics

Semantically, the style sheet language to be supported must be kept simple, at least at its bottom level. To be useful with existing software, the style sheet language has to allow the specification of fonts, typesizes, colors, and the like, but clients must not be required to support every high-end feature mentioned in the style-sheet language. There have to be specific and well understood methods of specifying the fallback processing to be performed if certain style primitives are not available.
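One simple way to make fallback behavior well defined is for the style sheet to list primitives in order of preference and for the client to take the first one it supports. The sketch below illustrates that idea only; the primitive names, the "plain" last resort, and the function choose_primitive are assumptions, not part of any existing style-sheet language.

```python
def choose_primitive(preferences, supported, last_resort="plain"):
    """Return the first preferred primitive the client supports.

    preferences: primitives named by the style sheet, best first
    supported:   set of primitives this client can actually render
    """
    for primitive in preferences:
        if primitive in supported:
            return primitive
    return last_resort


# A hypothetical low-end client with a small repertoire.
CLIENT_SUPPORTS = {"bold", "underline", "plain"}
emphasis = choose_primitive(["italic", "underline", "bold"], CLIENT_SUPPORTS)
highlight = choose_primitive(["color-red", "reverse-video"], CLIENT_SUPPORTS)
```

Here the client cannot render italics, so emphasis falls back to underlining; a request it cannot honor at all degrades to plain text rather than failing.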

Ideally, the style sheet language should be declarative, not procedural, and should allow style sheets to exploit the structure of SGML documents to the fullest. Styles must be able to vary with the structural location of the element: paragraphs within notes may be formatted differently from paragraphs in the main text. Styles must be able to vary with the attribute values of the element in question: a quotation of type "display" may need to be formatted differently from a quotation of type "inline". They may even need to vary with the attribute values of other elements: items in numbered lists will look different from items in bulleted lists.
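The three kinds of context sensitivity named above (structural location, the element's own attributes, and the attributes of another element) can be expressed declaratively as a list of rules, each naming the conditions under which it applies. The sketch below is one possible encoding, with hypothetical rule keys, element names, and style properties; for brevity it looks only at the immediate parent, and later rules simply override earlier ones.

```python
# Each rule names an element and, optionally, conditions on its context.
# Later rules override earlier ones where both apply.
RULES = [
    {"element": "p", "style": {"indent": 0}},
    {"element": "p", "parent": "note", "style": {"indent": 2}},
    {"element": "q", "attrs": {"type": "inline"}, "style": {"display": "inline"}},
    {"element": "q", "attrs": {"type": "display"}, "style": {"display": "block"}},
    {"element": "item", "parent": "list",
     "parent_attrs": {"type": "ordered"}, "style": {"marker": "number"}},
    {"element": "item", "parent": "list",
     "parent_attrs": {"type": "bulleted"}, "style": {"marker": "bullet"}},
]


def _attrs_match(required, actual):
    """True if every required attribute has the required value."""
    return all(actual.get(key) == value for key, value in required.items())


def style_for(element, attrs=None, parent=None, parent_attrs=None, rules=RULES):
    """Combine the styles of all rules whose conditions are satisfied."""
    attrs = attrs or {}
    parent_attrs = parent_attrs or {}
    style = {}
    for rule in rules:
        if rule["element"] != element:
            continue
        if "parent" in rule and rule["parent"] != parent:
            continue
        if not _attrs_match(rule.get("attrs", {}), attrs):
            continue
        if not _attrs_match(rule.get("parent_attrs", {}), parent_attrs):
            continue
        style.update(rule["style"])
    return style
```

The rules themselves say nothing about how to traverse the document; that is left to the browser, which is what makes the notation declarative while keeping the interpreter a short loop.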

At the same time, the language has to be reasonably easy to interpret in a procedural way: implementing the style sheet language should not become the major challenge in implementing a Web client.

The semantics should be additive: It should be possible for users to create new style sheets by adding new specifications to some existing (possibly standard) style sheet. This should not require copying the entire base style sheet; instead, the user should be able to store locally just the user's own changes to the standard style sheet, and they should be added in at browse time. This is particularly important to support local modifications of standard DTDs.
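Additive semantics of this kind amount to a merge performed at browse time. As a minimal sketch, assume a style sheet is represented as a mapping from element names to property settings (a representation chosen here purely for illustration); the user's local file then holds only the differences.

```python
def merge_style_sheets(base, delta):
    """Return a new style sheet: base rules overridden or extended by delta.

    Neither input is modified, so the standard sheet can be shared.
    """
    merged = {element: dict(props) for element, props in base.items()}
    for element, props in delta.items():
        merged.setdefault(element, {}).update(props)
    return merged


# Hypothetical standard sheet and a user's locally stored changes.
BASE = {
    "h1": {"bold": True, "size": 18},
    "p": {"bold": False, "size": 12},
}
USER_DELTA = {
    "p": {"size": 14},               # override one property of a standard element
    "tune": {"display": "block"},    # add a rule for a locally defined element
}
merged = merge_style_sheets(BASE, USER_DELTA)
```

The delta both overrides a property of an existing element and supplies a rule for an element the base sheet has never heard of, which is the case that matters for local modifications of standard DTDs.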

The style-sheet language semantics must be full enough to allow the description of current HTML browsers, restricted enough to make implementation feasible, and constructed with a view to later extension and expansion. As a first step toward understanding what would be required, we have extracted a set of semantic primitives from the descriptions of 'typical rendering' for each element in the specification of HTML. We discovered that a set of fourteen or so primitives suffices to express all the processing described there; these primitives include indications of whether the element is laid out inline or processed as a block; whether internal newlines are respected; whether vertical space is generated before or after it; the margins within which the contents are to be set; etc. We won't go into further detail, since the point of this paper is to talk about incorporating SGML intelligence into the Web, not to propose a specific style-sheet language for adoption by the Web community. A quick description of the style primitives, with examples of style-sheet specifications for some HTML elements, has been posted on the UIC Web server at http://tigger.cc.uic.edu/~cmsmcq/style-primitives.doc; a revised version, in HTML, will be posted later.

Syntax

Syntactically, the style sheet language must be very simple, preferably trivial to parse. One obvious possibility: formulate the style sheet language as an SGML DTD, so that each style sheet will be an SGML document. Since the browser already knows how to parse SGML, no extra effort will be needed.

Another approach would involve adopting the syntax of some existing language with a very simple syntax: TCL and Lisp come to mind, but Scheme is probably the most plausible candidate in this line of thought, since DSSSL incorporates Scheme for some purposes (e.g. for variables and to express conditionality).

We recommend strongly that a subset of DSSSL be used to formulate style sheets for use on the World Wide Web; with the completion of the standards work on DSSSL, there is no reason for any community to invent their own style-sheet language from scratch. The full DSSSL standard may well be too demanding to implement in its entirety, but even if that proves true, it provides only an argument for defining a subset of DSSSL that must be supported, not an argument for rolling our own. Unlike home-brew specifications, a subset of a standard comes with an automatically predefined growth path. We expect to work on the formulation of a usable, implementable subset of DSSSL for use in WWW style sheets, and invite all interested parties to join in the effort.

How Do We Get There from Here?

We envisage the Web moving towards full SGML support in stages, each stage providing a bit of added function. The first stage will see three very loosely coupled developments:

The second stage will naturally grow upon the first:

The third stage will be the explosive one.

Ultimately, we will emerge into a Web with a small number of standard DTDs, each with many individual variations. Awareness of SGML, at the very least use of dynamic style sheets, will be commonly built into the basic browser, although commercial spawnable viewers will provide specialized function. The style sheets for each major DTD will be available on many different servers, and will probably be cached on most clients; only the individual deltas will be transmitted with each document. And the fact that more information is preserved in each document download will mean increased use of specialized secondary "interactive viewers" like SAS, Mathematica or Maple or gnuplot, and so forth.

References

  1. ACH/ACL/ALLC (Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
  2. Berners-Lee, Tim, and Daniel Connolly. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. (Draft, expired 14 January 1994.)
  3. ISO (International Organization for Standardization). ISO 8879-1986 (E). Information processing --- Text and Office Systems --- Standard Generalized Markup Language (SGML). First edition --- 1986-10-15. [Geneva]: ISO, 1986.
  4. Schatz, Bruce R., and Joseph B. Hardin. NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet. Science 265 (12 August 1994): 895-901.