HTML to the Max: A Manifesto for Adding SGML Intelligence to the World-Wide Web

C. M. Sperberg-McQueen, Robert F. Goldstein, 15 Sep 1994

Abstract

HTML demonstrates that SGML markup is useful for networked information. How can it be made even more useful? One way is to extend the tag set from HTML to HTML2, etc. We argue here for a more radical approach: full SGML awareness in WWW. We believe the difficulties are small, the cost affordable, and the advantages overwhelming.

SGML is a metalanguage for defining markup languages; HTML is just one instance of this infinite family. At present, documents in other SGML document types must be translated into HTML for display by a Mosaic client --- sometimes this imposes unacceptable information loss.

WWW browsers could handle other SGML document types without translation by launching a general-purpose SGML browser to view them, as they now launch graphics viewers; a better solution overall would be to build SGML display into the WWW browsers themselves. Either way, display of an SGML document would be controlled by a style sheet using a small number of display primitives ('bold', 'line break', etc.) to specify the rendition of each element type. For 'well-known' document type definitions (DTDs) like HTML, style sheets could be distributed with the browser, or built in. For other DTDs, the browser would fetch a style sheet from the server. Using style sheets, browser software can also make it easy to customize document display.

DTDs and style sheets can be designed to accommodate extensions, ensuring that authors can make small extensions to the tag set with no change whatsoever in the target browsers and virtually no performance penalty.

A Simple Proposal

Note: this is an opinionated paper, not so much because we think the issues are all black and white as because (a) the ideas we are pushing will be clearer to most readers if we exaggerate them slightly, (b) we don't have enough space to expound all the nuances, and (c) black and white contrasts are more fun to talk and hear about.

Let us start with a simple proposal. The current generation of Web client software, when it receives an HTML document, does something like this:

Scan the HTML document for tags.
Read the tag element name (generic identifier) which indicates what kind of tag it is (a p tag, a ul tag, etc.).
Decide how to process the tag: cause a line break, change the font, etc.
Process the tag and look for the next one.

In the current generation of software, the information used in deciding how to process the tag is hard-coded into the Web client: at code-writing time, the programmer decides how to format each tag, based on the descriptions of typical renderings given in the HTML specification. (Of course this isn't quite true, since default fonts can often be changed at run time or even interactively. But the list of tags and their semantics is fixed by the definition of HTML. Also, decisions about line breaks, justification, etc., are not left to the user.)

Our simple proposal is that all of this continue much as it does now, but that the rules for processing each tag (and by implication, the complete list of tags that can be processed) be loaded dynamically at the time the document is fetched from the server. Web browsers should implement a table of processing options, which they can load from disk (or remote server) and use when processing an HTML (or HTML++++) document.
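
A minimal sketch of this table-driven approach, in Python. The table format, the element names in it, and the function name render are illustrative assumptions, not part of any existing browser; the point is only that the scanning loop stays the same while the table can be swapped at run time.

```python
import re

# Hypothetical processing table: element name -> (text emitted at the
# start-tag, text emitted at the end-tag). Under the proposal this table
# would be loaded along with the document, not compiled into the client.
HTML_TABLE = {
    "p": ("\n\n", ""),
    "br": ("\n", ""),
    "em": ("*", "*"),
    "li": ("\n  - ", ""),
}

# Simplified tag pattern; attributes are skipped, not interpreted.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")

def render(document, table):
    """Scan for tags, look each one up in the table, and emit plain text."""
    out = []
    pos = 0
    for match in TAG.finditer(document):
        out.append(document[pos:match.start()])    # character data before the tag
        closing, name = match.group(1), match.group(2).lower()
        before, after = table.get(name, ("", ""))  # unknown tags produce no output
        out.append(after if closing else before)
        pos = match.end()
    out.append(document[pos:])
    return "".join(out)
```

Calling render with HTML_TABLE handles the familiar tags; calling it with a different table, say {"note": ("[", "]")}, handles a tag set the code has never seen, with no change to the loop.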

The two key points are:

Each document would have an attached list of allowed tags, and a table describing how to process the tags. (We'll discuss performance and implementation issues later.)
Code to process an arbitrary set of tags is scarcely more complicated than what browsers do now.

Perhaps the most important advantage of this approach is that each author can define his or her own tag set, with its own attached "meaning", without caring (much) about which browser will be used to view the document. Note that "meaning" can have two meanings -- (1) how the tagged object is displayed, and (2) what the relationship of the tagged object is to other tagged objects and to the human. We need not wait for an endless succession of HTML extensions to wend their way through a committee. In our view, the power of semantics belongs at the authors' fingertips, not the programmers'.

All sorts of information can be encoded in a rich tag set, but no one can master a superset-of-everything explicitly. In our proposal, a musician can annotate music, a mathematician can mark an equation, a programmer can comment on code, a statistician can select a column of data, and a database query can have as many different "submit" buttons as one wants.

Currently in HTML, however, there are at most two solutions for these problems: pre or img src=, both of which preserve a simple display but destroy the vital information necessary for a wide variety of sophisticated displays or other post-processing. If, however, a browser can preserve information from arbitrary tags, then all sorts of post-processing is possible, not excluding the mundane display-on-a-terminal.

The musician might view the document, and then import a few bars into a MIDI program. The mathematician might view an equation, then import it into Mathematica and solve it. The statistician might import a few rows from a table into SAS and compute a standard deviation. And so on. SGML doesn't make it happen, but it does make it possible.

Some other obvious advantages:

The tables serve as simple style sheets; we can maintain more than one of them, so that the user can customize the display in different ways for different kinds of documents.
The same document could be displayed in different ways for different audiences. Many manuals, for example, have paragraphs marked "Do not read this your first time through."

The upside of our proposal is that authors will have complete control over their tag sets, but the downside is that this control must be expressed in terms of a new style-sheet language. We believe this is a substantial gain, because a sensible language that can define tags is much richer than any pre-defined tag set. But a widespread implementation must come about either by widespread agreement, or by one person doing a great job and giving away a gazillion copies of a new (or improved) browser.

How is This Related to SGML at All?

SGML is a metalanguage for defining markup languages; HTML is one example of an SGML-defined language. Our proposal above does not require browsers to parse and handle and validate arbitrary SGML in the usual way that real SGML editors do. Indeed it does not even force a newly-defined tag set to be SGML compatible. But restricting Web documents to bona fide SGML documents is the only sensible way to go, for two reasons.

First, sufficiency. SGML is rich enough to provide good solutions to virtually all of the network's markup requirements for many years to come. SGML provides a public, non-proprietary method for interchange of data of all kinds. It is particularly suited for capturing the structured nature of text, and it coexists well with graphics (and other special encoded files) in any format. Roughly speaking, SGML is naturally suited for defining a record type in an arbitrary object-oriented database. If sending tree-structured objects (such as HTML+++ documents) is useful to the Web, SGML is the tool of choice.

Second, necessity. Although browsers need not validate SGML documents, they (or various external viewers) may find it useful to do so. Error detection and recovery is just one use of validation. Without the assurance of a valid SGML document, it would be fairly difficult for a browser to post-process a document --- for example to export a complicated mathematical equation to a clipboard for import into Mathematica. Although a style sheet suffices for many applications, it cannot replace a proper SGML document type definition for other uses. Allowing non-SGML documents opens a large can of worms for viewers to be built in the next few years, but doesn't seem to have any real advantages.

Using External SGML Browsers

There are two obvious approaches to providing better support for SGML on the Web. The first is to treat it like any specialized data format, and to launch specialized browsers to display data in that form. This approach is described in this section. The other approach, integrating SGML awareness, beyond HTML awareness, into Web browsers, is described in the next section.

Using existing software, it is easy to support SGML as a specialized data type. For example, we have implemented a demonstration of SGML on the Web, using the commercial SGML editor Author/Editor, by SoftQuad, Inc., by taking the following two steps:

tell the server to identify the documents appropriately: this involves adding lines of the form "AddType text/x-sgml-tei tei" to the srm.conf file in the /etc/httpd directory
tell the client to accept text/x-sgml-tei, etc., and to launch a dedicated SGML browser for them, by adding lines of the form "text/x-sgml-tei; ae -I %s" to the user's .mailcap file

Using this approach, we can exploit SGML for a number of uses to which HTML is not now suited:

Mathematical equations in the text can be displayed properly without use of graphics or PRE elements; they can also be exported to a file, in Maple format, imported into Maple, and solved or plotted. In the interests of full disclosure, we should point out that the DTD used for this demonstration is a mockup, not a full DTD for math.
Tables can be edited more conveniently, using SGML-based table editing facilities, and displayed with normal formatting, including dynamic resizing, rather than as a PRE element. Like equations, tables can be exported in some standard format, and re-imported into other application programs, such as spreadsheets or database management systems.
We could allow computer center staff members to edit the SGML documents which we maintain to document the software available on our central systems; these documents use a specialized DTD designed for the application, and while it is possible to write an editor for them using an elaborate series of Mosaic forms and the Common Gateway Interface, writing style sheets for a general SGML editor is much less work.

Unfortunately, processing SGML with an external browser does have some limitations and drawbacks. Most important, we cannot, with current software and protocols, use an external SGML browser to browse hyperlinked documents: or rather, we could, but the browser has no way of notifying the client that the user has clicked on a link end, so there is no way to traverse the hyperlinks, which tends to defeat the purpose of network-based hypertext. Of course, this limitation applies with equal force to all data formats handled with external browsers. It might be removed by defining callback functions or some other method of communication between the WWW client and the external browser, as suggested recently by Bruce R. Schatz and Joseph B. Hardin.

If we wish to distribute SGML documents over the Web without restricting the set of SGML tags which can be used, however, it is not enough simply for the Web server to label the data with its proper type, and for the client to launch an SGML browser to display it. SGML documents are, formally speaking, instances of general document types, which in turn are formally defined by SGML document type definitions (DTDs). When public DTDs are used, the DTD may not need to be transmitted, because the recipient (the client) may already have a copy. The DTD may not be a public one, however: it may be an ad hoc DTD designed for a particular purpose, like the DTD for software documentation we mentioned earlier. And even if the DTD is a public one, the client may not have a copy handy. So it is essential that the client be able to find a copy of the DTD used by an SGML document.

We propose that any HTTP server which distributes an SGML document should be responsible for providing the DTD for that document, on demand, and that the HTTP header for the document itself provide a universal resource identifier for the DTD. If we read the HTTP specification right, the WWW-Link field should be used for this purpose:

WWW-Link:  href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';
           rel='DTD'

In practice it is equally essential that the SGML browser know or be told how to handle the document in question. In SGML systems, the desired handling or processing of a given document type is kept rigorously separate from the specification of its legal form; the document is tightly bound to its formal specification, but only loosely bound to any particular method of processing it. Different processing specifications, or style sheets, can thus easily be introduced, and the same input document can be processed in multiple ways. SGML thus enables texts to have the same kind of controlled redundancy and processing independence which databases have secured for other types of data.

If a Web client is going to be able to handle any arbitrary SGML DTD, then it must always be able to find a description of what kinds of 'handling' will be required --- at the simplest level, the external SGML browser needs to know how to display the document on screen. We propose that once again, the HTTP server be responsible for supplying a style sheet on demand, for any document it has provided. Style sheet specifications for public DTDs (such as HTML, the Text Encoding Initiative encoding scheme, or ISO 12083) will presumably become widespread common property; users could easily customize them locally, and servers might well provide access to more than one style for a given DTD.

We need to choose a standard language in which these style sheets are to be written, however. Most existing SGML editors and browsers do have style sheet mechanisms; unfortunately, they are currently product-specific, not standardized. It would be insane, however, to expect authors or publishers to formulate multiple style sheets for their documents, one in each proprietary style-sheet language. If the World Wide Web is to take serious advantage of SGML, this means that a common style-sheet language for browsing and forms, at least, must be agreed on.

As with DTDs, we propose that the WWW-Link field of the HTTP header be used to identify one or more style sheets suitable for a given document. For example:

WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';
           rel='style-sheet'; title='Basic TEI Style Sheet'
WWW-Link:  href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2beta.style';
           rel='style-sheet'; title='TEI Style Sheet, alternate form'
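
A client could pick the DTD and style-sheet locations out of such fields with very little code. The following Python sketch assumes the field syntax is exactly what the examples in this paper show (semicolon-separated key='value' pairs, possibly continued across lines); the function names are hypothetical.

```python
import re

# One key='value' parameter, as written in the examples in this paper.
PARAM = re.compile(r"(\w[\w-]*)\s*=\s*'([^']*)'")

def parse_www_link(value):
    """Return the key='value' parameters of one WWW-Link field as a dict."""
    return {key.lower(): val for key, val in PARAM.findall(value)}

def links_by_rel(field_values):
    """Group parsed WWW-Link fields by their rel parameter (lower-cased)."""
    grouped = {}
    for value in field_values:
        params = parse_www_link(value)
        if "href" in params and "rel" in params:
            grouped.setdefault(params["rel"].lower(), []).append(params)
    return grouped

fields = [
    "href='ftp://ftp-tei.uic.edu/pub/tei/dtd/tei2.dtd';\n           rel='DTD'",
    "href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2.style';\n"
    "           rel='style-sheet'; title='Basic TEI Style Sheet'",
    "href='ftp://gluon.cc.uic.edu//pub/tei/styles/tei2beta.style';\n"
    "           rel='style-sheet'; title='TEI Style Sheet, alternate form'",
]
links = links_by_rel(fields)
```

Here links["dtd"] holds the single DTD link and links["style-sheet"] holds both style sheets in the order the server listed them, so a browser could offer the titles to the user as a menu.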

Integrating SGML Support in the Web Browser without Losing Your Mind

Some users, and some implementors, may prefer another approach to SGML support: namely, integrating SGML knowledge and support in the Web client, rather than externalizing it into an external browser. Such integration offers the same advantages as the integration of ftp and gopher knowledge, or support for common graphics formats, into the Web client: it allows the user access to more function through a single unified user interface. Equally, it has the same drawbacks: a Web browser is unlikely to provide SGML support on a par with a stand-alone SGML system, just as it is unlikely to provide graphics facilities which equal those of specialized graphics programs. We are not going to try to argue the case pro and con here; we think it does make sense to integrate SGML intelligence directly into Web clients, and we propose in this section to outline rather briefly what would be involved, and how to keep the task manageable. (N.B. some SGML technical terms are used without warning or definition in the discussion which follows; implementors will need to learn those terms if they want to write conforming software, but other readers can just skip over the technical bits.)

In a Web browser with full support for SGML, everything would work pretty much as described in the previous section, except that upon seeing data labeled, for example, "text/x-sgml-tei", the browser would not launch an external viewer. Instead, it would locate a style sheet for the DTD in question (first looking locally, and then requesting the style sheet from the server if need be), and then display the document for the user, as specified by the style sheet.
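
The lookup order described here (local copy first, server second) can be sketched in a few lines of Python. The function and parameter names are hypothetical, and keeping the fetched sheet for later reuse is an added assumption.

```python
def locate_style_sheet(doc_type, local_sheets, fetch_from_server):
    """Return a style sheet for doc_type, preferring a local copy.

    local_sheets is a dict of style sheets already on the client;
    fetch_from_server is any callable that asks the server for one.
    """
    sheet = local_sheets.get(doc_type)
    if sheet is None:
        sheet = fetch_from_server(doc_type)
        local_sheets[doc_type] = sheet  # assumption: keep it for next time
    return sheet
```

A browser would call this once per document, for example locate_style_sheet("text/x-sgml-tei", sheets, fetch), and the server would be contacted only for document types the client has not seen before.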

In order to make it easier for Web clients to support SGML, we propose the following Web-wide conventions:

The client should not need to validate the document against the DTD: the server should guarantee that what it sends is a valid SGML document.
The client should not need to support SGML's sometimes baroque rules for the omission and abbreviation of tags: all tags should be physically present in their full form in the text as transmitted from the server. That is: the server should guarantee that what it sends is a minimal SGML document.

If these conventions are followed, the client need not implement a full SGML parser, merely a non-validating minimal SGML parser, which is somewhat simpler. The implementation becomes simpler still, however, if we adopt a couple of strict application conventions, which make the parser's obligations even simpler to fulfill. Even minimal parsers are required to know that empty elements have no end-tags; even non-validating parsers are required to treat newline sequences in different ways, depending on how various elements were declared in the DTD. We can eliminate these requirements by adopting these rules:

WWW SGML applications are allowed to distinguish newlines from other white space only in contexts where SGML tags are not allowed (e.g. in pre-formatted elements such as PRE, or in CDATA marked sections).
The style sheet specification for a given document type must indicate explicitly and accurately which element types are declared as EMPTY elements in the definition of that document type.

Together with the commitment to server-side validation, these two conventions allow the client's SGML parser to ignore the document type definition (DTD) for a document almost entirely. The start and end of every non-empty element in the document is explicitly marked, and the empty elements are all identified in the style sheet. Because newlines are allowed to be significant to the application only in well defined restricted areas, the parser need not attempt to implement the newline rules of the SGML standard. The parser must scan the DTD only for SGML entity declarations, since it must be able to expand entity references in the document instance.
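
To illustrate how little the client needs under these conventions, here is a Python sketch that builds an element tree from a fully tagged document using nothing but the list of EMPTY elements taken from the style sheet. The tree representation and the name build_tree are assumptions made for the example, and the tag pattern is simplified (it does not allow '>' inside attribute values).

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>")

def build_tree(text, empty_elements):
    """Build a nested (name, children) tree from fully tagged SGML.

    empty_elements is the set of element names the style sheet declares
    EMPTY; no other information from the DTD is consulted.
    """
    root = ("#root", [])
    stack = [root]
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            stack[-1][1].append(text[pos:m.start()])  # character data
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError("unexpected end-tag: " + name)
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            if name not in empty_elements:
                stack.append(node)  # only non-empty elements wait for an end-tag
        pos = m.end()
    if text[pos:]:
        stack[-1][1].append(text[pos:])
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return root
```

With empty_elements={"br"}, the input "<p>a<br>b</p>" parses cleanly; with an empty set the same input is rejected, because the parser has no way of knowing that br will never be closed. That is why the style sheet must list the EMPTY elements accurately.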

We could eliminate the need to scan the DTD even for entity declarations, if we adopt the rule that the server will expand all entity references, except for those needed to provide access to Latin characters with diacritics, Greek characters, and the like, which are generally included in standard entity sets issued by ISO. It is probably better, however, to allow for at least the possibility of client-side expansion of entities. This will reduce bandwidth requirements in some cases (functioning like a client-side INCLUDE), but a more important reason is that entity declarations are crucial to SGML interfaces to data in other notations (such as graphics files). Where it is possible, of course, server-side expansion of entity references is desirable, since it completely eliminates the client's need to read the DTD. We propose, therefore, that the HTTP header specify (in ways to be determined, possibly by a content-encoding field) whether entity references will be pre-expanded by the server or not.

The HTTP header will indicate whether entity references in the SGML document are pre-expanded by the server, or whether they must be expanded by the client.
If the HTTP header indicates that all entity references are pre-expanded, the client should not need to expand any SGML entities except those defined in standard public entity sets (such as the ISO entity sets Latin 1, Latin 2, Greek 1 and 2, etc.); the server will expand all entity references before transmitting the document.
If the HTTP header indicates that entity references are not pre-expanded, the client must be prepared to scan the DTD for entity declarations and expand references to entities declared in this way, as well as being prepared to expand references to standard entities such as those in ISO Latin 1, etc.
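
The two cases can be sketched as follows in Python. This is a simplified illustration under stated assumptions: only internal general entities with quoted literal values are recognized, expansion is a single pass, STANDARD_ENTITIES is a tiny stand-in for the ISO entity sets, and the pre_expanded flag stands for whatever the HTTP header ends up saying.

```python
import re

# Internal general entity declarations such as <!ENTITY uic "...">.
# Parameter entities and external (SYSTEM/PUBLIC) entities are not matched.
ENTITY_DECL = re.compile(
    r"""<!ENTITY\s+([A-Za-z][\w.-]*)\s+(?:CDATA\s+)?(["'])(.*?)\2\s*>""", re.S
)
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")

# Illustrative subset only; a real client would carry the full ISO sets.
STANDARD_ENTITIES = {"eacute": "\u00e9", "uuml": "\u00fc", "amp": "&"}

def scan_entity_declarations(dtd_text):
    """Collect name -> replacement text for the declarations we recognize."""
    return {name: value for name, _quote, value in ENTITY_DECL.findall(dtd_text)}

def expand_entities(text, dtd_text=None, pre_expanded=True):
    """Expand entity references; consult the DTD only if the server did not."""
    entities = dict(STANDARD_ENTITIES)
    if not pre_expanded:
        entities.update(scan_entity_declarations(dtd_text or ""))

    def replace(match):
        # Unknown references are left in place rather than dropped.
        return entities.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(replace, text)
```

When the header promises pre-expansion, the client never opens the DTD and only the standard entities are resolved; otherwise the DTD is scanned once and its declarations are added to the same lookup table.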

We are placing as much as possible of the responsibility for validation and processing on the server, rather than on the client. Documents are (on average) read by clients more often than they are updated on the server, so server-side validation requires less work overall. Servers typically run on more powerful machines, so they have more capacity to do the work; keeping the client's responsibilities simple is critical if we want browsers to be usable even on low-end machines. (And remember, no matter what population you are looking at, half the machines in question will be below the median in speed and power.)

Of course, server-side validation will also complicate the process of publishing documents on the Web: they will have to be validated before becoming available. This will require that publishers be given software which can validate and normalize SGML documents; fortunately, the public-domain parser SGMLS can readily be used for validation, and with a few tweaks it can also be used as the basis for an SGML normalizer. Of course, most people who will be interested in providing access to SGML documents over the Web already have SGML software, and validation and normalization of documents may already be part of their normal routine. (Of course, server-side normalization is necessary only in order to make it easier to include SGML intelligence in the Web client itself; existing stand-alone SGML browsers are normally able to handle markup minimization, i.e. to perform their own normalization.)

If the conventions we propose are adopted, the task of the SGML parser in the WWW client is reduced to recognizing and processing the following types of SGML markup:

start-tags, which indicate either the location of an empty element, or the beginning of an element with content; start-tags may have attributes, which must be parsed correctly and which may affect display processing
end-tags, which indicate the end of elements
entity references, which may be used to include external data, for boiler plate, or for non-transmittable characters
marked sections, which may indicate that a section of the document should be ignored, or included, or that markup of various kinds should not be recognized within the section; marked sections are commonly used for conditional version-specific text, examples of SGML tagging, etc.
comments (or, as the SGML standard calls them, comment declarations)
processing instructions, an SGML construct that allows processing-specific information to be embedded in documents, contrary to the normal practice of SGML systems, as long as it is disinfected by being explicitly delimited and locatable.

There is no need for detailed discussion of these kinds of markup. Existing HTML software typically handles start-tags, end-tags, entity references, and comments; the HTML specification documents but deprecates both processing instructions and marked sections. Conforming parsers must, however, properly recognize and process them. Fortunately, their syntax is relatively straightforward and adds very little to the complexity of the parser.
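
As a rough indication of the size of the task, the following Python sketch recognizes all six kinds of markup with one regular expression. It is an illustration under simplifying assumptions, not a conforming parser: it assumes the default SGML delimiters, requires the semicolon on entity references, treats comments only in their common <!-- ... --> form, and returns the body of a marked section unparsed whatever its keyword. The token names are our own.

```python
import re

TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<marked><!\[\s*(?P<keyword>[A-Za-z]+)\s*\[(?P<body>.*?)\]\]>)"
    r"|(?P<pi><\?.*?>)"
    r"|(?P<end></(?P<end_name>[A-Za-z][\w.-]*)\s*>)"
    r"""|(?P<start><(?P<start_name>[A-Za-z][\w.-]*)"""
    r"""(?P<attrs>(?:[^>"']|"[^"]*"|'[^']*')*)>)"""
    r"|(?P<entity>&(?P<entity_name>[A-Za-z][\w.-]*);)",
    re.S,
)
ATTR = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def tokenize(text):
    """Yield (kind, value) pairs for markup and character data, in order."""
    pos = 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            yield ("text", text[pos:m.start()])
        if m.group("comment") is not None:
            yield ("comment", m.group("comment")[4:-3])
        elif m.group("marked") is not None:
            yield ("marked-section", (m.group("keyword").upper(), m.group("body")))
        elif m.group("pi") is not None:
            yield ("pi", m.group("pi")[2:-1])
        elif m.group("end") is not None:
            yield ("end-tag", m.group("end_name").lower())
        elif m.group("start") is not None:
            attrs = {k.lower(): (dq or sq)
                     for k, dq, sq in ATTR.findall(m.group("attrs"))}
            yield ("start-tag", (m.group("start_name").lower(), attrs))
        else:
            yield ("entity-ref", m.group("entity_name"))
        pos = m.end()
    if pos < len(text):
        yield ("text", text[pos:])
```

A display routine would then consume these tokens one by one, ignoring comments, honoring IGNORE and CDATA marked sections, and passing processing instructions to whatever component understands them.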

If the principles of minimal SGML, server-side validation, and the application conventions regarding newlines and empty elements are accepted, then it will be relatively simple to integrate SGML conforming parsers into existing or new WWW client software. Like SGML support using external browsers, of course, this method also requires use of the HTTP header to indicate the location of the DTD and style sheet, and the adoption by the Web community of a common style-sheet language for use in WWW display of documents. We turn now to the description of that style-sheet language.

Style Sheet Languages

Style sheets, like DTDs, are auxiliary documents, meta-documents, which describe the structure and processing of some set of base documents. DTDs have a standard notation prescribed by an ISO standard; style sheets currently have no standard notation, though ISO is currently balloting an international standard called the Document Style Semantics and Specification Language (DSSSL), which will provide a standard syntax for style sheets and other specifications for SGML processing.

Semantics

Semantically, the style sheet language to be supported must be kept simple, at least at its bottom level. To be useful with existing software, the style sheet language has to allow the specification of fonts, typesizes, colors, and the like, but clients must not be required to support every high-end feature mentioned in the style-sheet language. There have to be specific and well understood methods of specifying the fallback processing to be performed if certain style primitives are not available.

Ideally, the style sheet language should be declarative, not procedural, and should allow style sheets to exploit the structure of SGML documents to the fullest. Styles must be able to vary with the structural location of the element: paragraphs within notes may be formatted differently from paragraphs in the main text. Styles must be able to vary with the attribute values of the element in question: a quotation of type "display" may need to be formatted differently from a quotation of type "inline". They may even need to vary with the attribute values of other elements: items in numbered lists will look different from items in bulleted lists.
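
These three kinds of dependence (on structural location, on the element's own attributes, and on the attributes of an enclosing element) can be expressed as declarative rules and matched with a short routine. The Python sketch below uses a rule format and property names invented for this illustration; it is not DSSSL and not a proposal for the actual language.

```python
# Hypothetical declarative rules. "inside" names a required ancestor,
# "attrs" constrains the element itself, "inside_attrs" the ancestor.
RULES = [
    {"element": "p", "style": {"indent": 0}},
    {"element": "p", "inside": "note", "style": {"indent": 4}},
    {"element": "q", "attrs": {"type": "display"}, "style": {"display": "block"}},
    {"element": "q", "attrs": {"type": "inline"}, "style": {"display": "inline"}},
    {"element": "item", "inside": "list",
     "inside_attrs": {"type": "numbered"}, "style": {"marker": "1."}},
    {"element": "item", "inside": "list",
     "inside_attrs": {"type": "bulleted"}, "style": {"marker": "*"}},
]

def style_for(path, rules):
    """Compute the style of the last element on path.

    path is a list of (name, attrs) pairs from the document root down to
    the element being formatted.
    """
    name, attrs = path[-1]
    ancestors = path[:-1]
    style = {}
    for rule in rules:
        if rule["element"] != name:
            continue
        if any(attrs.get(k) != v for k, v in rule.get("attrs", {}).items()):
            continue
        if "inside" in rule:
            wanted = rule.get("inside_attrs", {})
            if not any(
                anc_name == rule["inside"]
                and all(anc_attrs.get(k) == v for k, v in wanted.items())
                for anc_name, anc_attrs in ancestors
            ):
                continue
        style.update(rule["style"])  # later matching rules override earlier ones
    return style
```

The rules themselves say nothing about the order of processing; the routine that interprets them is a single loop, which suggests that a declarative language need not be hard to implement procedurally.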

At the same time, the language has to be reasonably easy to interpret in a procedural way: implementing the style sheet language should not become the major challenge in implementing a Web client.

The semantics should be additive: It should be possible for users to create new style sheets by adding new specifications to some existing (possibly standard) style sheet. This should not require copying the entire base style sheet; instead, the user should be able to store locally just the user's own changes to the standard style sheet, and they should be added in at browse time. This is particularly important to support local modifications of standard DTDs.
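
If a style sheet is modeled as a mapping from element names to properties (an assumption made only for this sketch), additive semantics amount to a layered merge performed at browse time. The function name and the sample properties below are hypothetical.

```python
def merge_style_sheets(base, *deltas):
    """Return a new style sheet: base plus each delta, applied in order.

    Neither the base nor the deltas are modified, so the standard sheet
    can be shared while each user stores only local changes.
    """
    merged = {element: dict(props) for element, props in base.items()}
    for delta in deltas:
        for element, props in delta.items():
            merged.setdefault(element, {}).update(props)
    return merged

standard = {"p": {"indent": 0, "font": "serif"}}
user_changes = {"p": {"font": "sans"}, "poem": {"break": "line"}}
effective = merge_style_sheets(standard, user_changes)
```

The user's file contains two entries, yet the effective sheet covers every element of the standard one, plus the locally added poem element.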

The style-sheet language semantics must be full enough to allow the description of current HTML browsers, restricted enough to make implementation feasible, and constructed with a view to later extension and expansion. As a first step toward understanding what would be required, we have extracted a set of semantic primitives from the descriptions of 'typical rendering' for each element in the specification of HTML. We discovered that a set of fourteen or so primitives suffices to express all the processing described there; these primitives include indications of whether the element is laid out inline or processed as a block; whether internal newlines are respected; whether vertical space is generated before or after it; the margins within which the contents are to be set, etc. We won't go into further detail, since the point of this paper is to talk about incorporating SGML intelligence into the Web, not to propose a specific style-sheet language for adoption by the Web community. A quick description of the style primitives, with examples of style-sheet specifications for some HTML elements, has been posted on the UIC Web server at http://tigger.cc.uic.edu/~cmsmcq/style-primitives.doc; a revised version, in HTML, will be posted later.

Syntax

Syntactically, the style sheet language must be very simple, preferably trivial to parse. One obvious possibility: formulate the style sheet language as an SGML DTD, so that each style sheet will be an SGML document. Since the browser already knows how to parse SGML, no extra effort will be needed.

Another approach would involve adopting the syntax of some existing language with a very simple syntax: TCL and Lisp come to mind, but Scheme is probably the most plausible candidate in this line of thought, since DSSSL incorporates Scheme for some purposes (e.g. for variables and to express conditionality).

We recommend strongly that a subset of DSSSL be used to formulate style sheets for use on the World Wide Web; with the completion of the standards work on DSSSL, there is no reason for any community to invent their own style-sheet language from scratch. The full DSSSL standard may well be too demanding to implement in its entirety, but even if that proves true, it provides only an argument for defining a subset of DSSSL that must be supported, not an argument for rolling our own. Unlike home-brew specifications, a subset of a standard comes with an automatically predefined growth path. We expect to work on the formulation of a usable, implementable subset of DSSSL for use in WWW style sheets, and invite all interested parties to join in the effort.

How Do We Get There from Here?

We envisage the Web moving towards full SGML support in stages, each stage providing a bit of added function. The first stage will see three very loosely coupled developments:

The makers of current Web browsers that allow external viewers will modify their clients to allow the external viewers to pass back information, such as URLs, to the main program. Although there are good reasons to do this, independently of SGML, this will make it much easier for those programmers willing to provide SGML-aware external viewers.
Some administrators of HTTP servers will start to use SGML internally, even if the end product is HTML. We are currently doing this -- as mentioned above, we maintain a set of SGML files listing many attributes of software loaded on our machine. A single SGML file might be preprocessed two or more different ways into different HTML (or flat ASCII) files, depending on what particular information we want to make available. But changes or additions to the information are only made to the original SGML files.
A committee, or maybe a couple of brave souls, will develop or adapt a style sheet language. They will then build dynamic style-sheet capability into a browser, not necessarily because they are desperate for SGML, but because it is a clean, versatile way to program, and because they want to easily customize the appearance of HTML documents on different physical displays.

The second stage will naturally grow upon the first:

The addition of two-way communication between WWW clients and external viewers will allow commercial SGML editors to be used as spawnable external viewers, although the style-sheet language they use may not be the final one adopted by the Web community. We will almost certainly do this to acquire new functions in forms processing, although we expect to pre-cache the style sheets for a small number of document types, rather than download the style sheets from the server every time.
Those sites that already use SGML internally will be tempted to transmit SGML documents over the Web to the external viewers, but will use this facility only within the organization that controls both client and server.
Some programmers will experiment with writing their own SGML-aware Web clients or stand-alone external viewers (public domain SGML parsers already exist), and various ideas on style sheet languages will develop further. Someone will figure out exactly how to make DSSSL apply.

The third stage will be the explosive one.

A popular style sheet language will emerge, probably by dint of a widely available external viewer. Those information providers with SGML experience will then make the original SGML documents available, probably in parallel with HTML-impoverished versions.
A small number of "standard" SGML Document Type Definitions will be published: HTML naturally, the TEI document type definition, some math extensions, some table extensions, some graphics extensions, and some forms extensions. Most immediate applications will be satisfied by one of these DTDs. This will make it easy for an author to pick the closest useful DTD, make a few additions, and publish a new document with the relevant information properly marked --- without having to first become an SGML guru.
As more information appears in SGML form, more people will obtain appropriate viewers. Whatever style sheet language happens to be most popular at that time will become the de facto standard.

Ultimately, we will emerge into a Web with a small number of standard DTDs, each with many individual variations. Awareness of SGML, at the very least use of dynamic style sheets, will be commonly built into the basic browser, although commercial spawnable viewers will provide specialized function. The style sheets for each major DTD will be available on many different servers, and will probably be cached on most clients; only the individual deltas will be transmitted with each document. And the fact that more information is preserved in each document download will mean increased use of specialized secondary "interactive viewers" like SAS, Mathematica or Maple or gnuplot, and so forth.

References

ACH/ACL/ALLC (Association for Computers and the Humanities, Association for Computational Linguistics, and Association for Literary and Linguistic Computing). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.
Berners-Lee, Tim, and Daniel Connolly. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. (Draft, expired 14 January 1994.)
ISO (International Organization for Standardization). ISO 8879-1986 (E). Information processing --- Text and Office Systems --- Standard Generalized Markup Language (SGML). First edition --- 1986-10-15. [Geneva]: ISO, 1986.
Schatz, Bruce R., and Joseph B. Hardin. NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet. Science 265 (12 August 1994): 895-901.