SGML and HTML: Eric Severson
****************** WELCOME TO SGML NEWSWIRE ******************
* To subscribe, send mail to email@example.com. *
* To receive a current table of contents and instructions *
* for ordering back issues, specify "send toc" *
* in the message body. *
* (Please pass along to interested colleagues) *
A PROPOSAL FOR THE SCALABILITY OF HTML
The following whitepaper discusses the fundamental differences
between HTML and SGML and seeks to influence the development of
the next version of HTML. The author welcomes all feedback --
his email address is included below.
HOW SGML AND HTML REALLY FIT TOGETHER:
THE CASE FOR A SCALABLE HTML
By Eric Severson, Avalanche/Interleaf
(c) Avalanche Development Company/Interleaf Inc.
Permission to redistribute this whitepaper is granted, provided
that no changes are made, and that this notice and the above
attributions are included in all copies.
As I look around the World Wide Web community I note three things:
1. Everyone is enthusiastic (absolutely wildly
enthusiastic?) about the Web, HTML, and all the
promise they offer.
2. Everyone wants the Web (and HTML) to be simple --
after all, that's a lot of the appeal.
3. When you pin them down, everyone actually has very
different applications in mind, covering a dauntingly
large range of requirements.
Though few realize it yet (they're perhaps still too enamored
with the "glitz" phase of all this), there is potentially a huge
trade-off to be dealt with in reconciling points (2) and (3)
above. In particular, corporate users (the focus of a lot of
the discussion) see the Web as a way to drastically change the
way by which their information is delivered and used. But as
someone who has dealt extensively with these corporate
information requirements, I am very aware that these needs --
and this information -- are often anything but simple.
To add to the confusion, people are continually asking about the
differences between HTML and SGML, and how they may or may not
fit together. It was striking to me that all four opening
keynotes at a recent publishing conference mentioned HTML and
the World Wide Web (we even had a demo!), and most of them also
mentioned SGML. Yet nobody seemed to know how to tie these two
things together in a coherent way. Q&A sessions were filled
with questions like: "What's the difference between SGML and
HTML?" and "If I have HTML, do I still need SGML?"
There is a strictly technical answer, of course, and we should
start there.
SGML is an international standard for encoding the logical
structure and content of documents that has been around since
1986. By focusing on the structure and intent of a document,
rather than proprietary formatting codes, it maximizes
accessibility and reusability of the underlying information.
SGML lets us find information more effectively because we can
use internal document structure to help guide our search. It
lets us reuse the information more effectively because SGML
documents are not tied to any particular format, application, or
platform.
To create an SGML application, you define a set of document
objects (e.g. "title," "chapter," "paragraph," "list") and their
relationships (e.g. "document title must always occur first,"
"chapters can contain a title followed by any number of
paragraphs and lists" etc.). These objects and relationships
are described in something called a Document Type Definition
(DTD), which forms the heart of the application. Objects in an
SGML document are then tagged according to the rules specified
in the DTD. Originally used primarily for complex technical
documentation (such as for the U.S. Department of Defense CALS
initiative), SGML has emerged as the format of choice for a
great variety of applications. It is especially powerful (and
popular) where information needs to be collaboratively authored,
managed as a large collection of documents, and delivered
electronically in multiple forms.
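The relationships a DTD expresses can be illustrated with a small sketch. The Python below is not SGML syntax and not any real DTD; it simply models the two example rules quoted above as patterns over child element names. The names CONTENT_MODELS and is_valid are hypothetical, and the assumption that a document consists of a title followed by one or more chapters is added here for illustration only.

```python
import re

# Hypothetical content models, written as regular expressions over the
# sequence of child element names (each name is followed by one space).
# A real SGML DTD would state these rules in <!ELEMENT ...> declarations.
CONTENT_MODELS = {
    # "document title must always occur first"; the chapters that
    # follow it are an assumption made for this sketch.
    "document": r"title (chapter )+",
    # "chapters can contain a title followed by any number of
    # paragraphs and lists"
    "chapter": r"title ((paragraph|list) )*",
}


def is_valid(element, children):
    """Return True if the child sequence satisfies the element's model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model, sequence) is not None
```

Under these assumed rules, a chapter whose children are a title, a paragraph, and a list passes the check, while a chapter that begins with a paragraph does not.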
Thus it is not surprising that HTML, the format that drives
documents across the World Wide Web, has been defined using
SGML. Formally speaking, HTML is a specific SGML application,
with its own Document Type Definition and a fixed tag set.
Therefore, the strictly technical answer is this: it doesn't
make sense to choose SGML versus HTML -- you're already using
SGML if you're using HTML.
However, what's really behind the questions is something more
fundamental: should HTML be viewed as a general interchange
format, or is it just a lowest common denominator useful
primarily for casual applications? Could it be used as the
archival format, or should we always produce HTML from some
other source, such as an SGML-based information repository?
Some of these answers seem pretty clear. At this point in time,
HTML is clearly not rich enough to be a mainstream repository
format. Furthermore, because each individual application has a
variety of unique needs, I would recommend that HTML never be
looked at this way. SGML in general (i.e. your own DTD) is the
right way to build a corporate information repository; HTML is a
specific vehicle to put that information on the Web. Yet, if
HTML is not meant to be reasonably rich, then a significant
amount of its appeal may be lost. That is, rather than being a
"universal solvent" for information being passed across the Web,
it becomes just another intermediate format driving a particular
set of browsing software.
The developers of HTML 3.0 have set out to ensure that it does
not repeat the complexity of CALS and other classic "tech doc"
applications. Having been on the committee that architected the
CALS standard, I think I understand what they mean. Faced with
an extremely large and diverse community of potential users, and
constantly deluged with specific user requests for "just one
more" added feature, it was all too easy to let the standard get
out of control. While significant progress has been made in
modularizing what was once a terribly complicated monolith, new
users and new requirements still demand a cumbersome
registration process within a central CALS "library."
The World Wide Web, of course, has a potential set of users and
requirements that are orders of magnitude larger than even the
U.S. Department of Defense. Furthermore, the idea of attempting
to keep everyone's favorite set of objects and attributes in a
worldwide library is ludicrous. So there is no doubt that HTML
developers are exactly right: the potential to add unconstrained
complexity to HTML is staggering, and the historical evolution
of CALS cannot be the model we follow. However, it isn't so
clear to me that we must go to the other extreme, making HTML
as simple as possible.
The problem with simplicity is that you lose information. Not
the words of the document themselves (the content), not even the
links that actually make the "web" between documents, but all
the other information necessary to do the really interesting
things. This is what defines the difference between an
intriguing but clearly casual application, and one which
actually carries the freight for corporate and government
users. For example:
1. A simple, base-level HTML dooms every document to
be rendered in "lowest common denominator" format.
The rich formatting characteristic of more "serious"
documents is simply not possible, since the only
distinctions available are between a few basic headings
and paragraphs. I may have twenty types of "paragraphs"
in my document (abstract, biography, preface, note,
introduction, glossary term and definition, etc.), all of
which are formatted differently to enhance readability and
communication power. But on the Web, they will all look
the same: I am forced to call all of them "paragraphs,"
and, of course, all objects called "paragraph" will
receive the same format.
2. A simple, base-level HTML takes away most of our
ability to intelligently search Web documents based on
their inner structure.
It is increasingly being recognized that, in terms of the
Information Explosion, the Web is at once our best friend
and our worst nightmare. Everyone can witness the power
of following links from a document that's been retrieved,
but no one really has an answer for how to find that
document in the first place. I can search "What's New"
URLs, or use an (almost certainly quite limited) indexing
service, maybe even have a computer program automatically
look for specific words and phrases. Across a mushrooming
worldwide collection of information, however, this is
literally like looking for a needle in a haystack.
Wouldn't it help if we could limit such automated searches
only to document abstracts, or look for keywords only if
they occur in some other structural context?
3. A simple, base-level HTML makes it very hard to do
anything interesting with a Web document once it's been
retrieved.
As people actually try to use (and reuse) information
they've found on the Web, this problem is just beginning
to be understood. Once I've reduced a document to its
lowest common denominator, I can't just snap my fingers
and get it back. If I've collapsed twenty kinds of
paragraphs into one, there's no good way to recover the
original twenty kinds. I've likened this to condensing
Hamlet into a simple "I Can Read Too" book suitable for
first graders, then trying to use that version to
reconstitute the original text. Obviously, it won't work.
We are thus sitting on the horns of a dilemma. On the one hand,
focusing on accommodating the entire range of needs could lead
to a horribly complicated standard, pieced together like a
patchwork quilt to satisfy the whims of the loudest political
factions. But on the other hand, the danger of focusing on
simplicity is that people may not have the power to do the
things that originally attracted them to the Web. If that
happens, there may be a backlash -- a "big chill" as it were --
when this becomes obvious.
In the face of this, some have suggested doing away with HTML,
or using it only for very simple documents. In their view, Web
documents should be based on SGML in general -- that is, using
an unlimited variety of DTDs at each information provider's
discretion. The argument, which makes some sense, is that any
one DTD -- even HTML -- can't possibly cover everyone's
application requirements. Rather than try to standardize what
is inherently uncontrollable, let's give people the full power
of SGML and let them design their own uses.
In fact, I am a supporter of generalized SGML on the Web. It
certainly is the way to go when building information
repositories within organizations, and should be allowed as a
Web format as well. However, on the Web this creates another
problem which is actually a result of SGML's intrinsic
flexibility. Since there is no guarantee that individual DTDs
will bear any resemblance to each other, a generalized-SGML Web
has its own potential to create a "Tower of Babel." When
everyone freely designs their own SGML applications, even the
most basic notions about backbone document structures and links
may not be shared in common. What you get when you search,
format and download information may be completely
unpredictable. Presumably each document will give you its DTD,
but you will not have a good way to immediately relate that
document and DTD to other documents and your own applications.
So which way should we go, simple or powerful? My answer is to
strongly support the notion of HTML as the backbone for the Web,
to be used for the vast majority of Web applications. However,
I believe that HTML can and must be made SCALABLE. We must take
specific steps to ensure that HTML 3.0 is architected simply,
elegantly, not biased to anyone's particular application, but
carefully engineered for extensibility that can run the gamut of
what most people will want to do on the Web.
To clarify what I mean by "scalable", let me propose an acid
test for this property:
1. If I want to send a message consisting only of the
words "Hello World!", it takes no more than two to
three HTML tags.
2. If I want to publish my corporate annual report,
looking reasonably close to its paper version, and
enabling powerful information retrieval keyed to text,
financial tables, and graphics, it takes a lot more tags
but I can still do it without a problem.
3. If I want to send my corporate annual report, via the
same HTML file, to both a highly sophisticated
graphical browser / retrieval engine, and to a simple
"dumb terminal" viewer, I can do this knowing that
both can easily do something reasonable with the
information. That is, nothing about HTML would
make it hard to build a simplistic viewer that could
deal with any document, no matter how sophisticated
the markup. Furthermore, I do not need to know the
properties of each potential browser/viewer when I
encode my HTML; all HTML-compliant browsers/viewers
should be able to handle any HTML file in an effective
manner.
One might observe that I want to have my cake, eat it too, and
still get a second choice for dessert. This would seem like a
lot to ask for, but it's actually not as hard as it sounds.
I like to solve problems by analogy, and in this case I like the
analogy of a television signal. Like the Web, television allows
an extremely large and diverse audience to access information in
an unprecedented way. It also started out simply,
consisting only of a black-and-white picture and mono sound.
However, technology advanced as time went on, viewers demanded
more, and things began to get complicated. Remember when a show
being "in color" was still new enough to brag about? Of course,
for a while most people still watched such shows in
black-and-white. In the 1980's the same thing happened with "in
stereo." Nowadays, we've added surround sound, closed
captioning, secondary audio program, etc., accessible only to
those with specialized receivers.
Now, if we were to follow the original CALS model in designing
standards for television, we might be requiring everyone to have
a full-blown home theater system, given that the standard has
evolved to incorporate all those features. Alternatively, if
the "keep it simple at all costs" model were followed, we would
all be required to watch everything in simple black-and-white,
since that's the lowest common denominator. Luckily, we did
something smarter with television: we made it modular and
scalable. In fact, I can still receive today's terribly rich
and complex television signal on my 1960's vintage
black-and-white set, or, if I want, can invest in a home theater
system or anything in between. It's completely my choice based
on what I think I really need. For the World Wide Web, simple
browsers like Lynx and Mosaic ought to be adequate for many
people and many applications, but corporate users (and others)
will want to add richer information accessible by users with
more sophisticated browsers, search programs, and downloading
tools.
Concretely, a simple but scalable HTML should rest upon three
principles:
1. Enable a reasonable framework for navigation and
linking, but impose no more explicit structural
constraints than necessary.
We certainly need nested heading levels, and a few basic
hierarchical constraints, but we should avoid at all costs
building a DTD that tries to enforce desired document
structure of any sort. Our job is always to describe
rather than prescribe. If somebody wants to check complex
(CALS-like) structure using SGML (e.g. "all lists must
have at least two items"), they can do it with their own,
more restrictive DTD. What is required from HTML is
enough basic structure to do effective top-level
navigation, form links, describe sensible tables, etc. --
and sufficient richness to describe semantic (not
structural) distinctions between other elements.
2. Use only the absolute minimum number of required
(vs. optional) tags and attributes.
In CALS, it takes a huge number of tags to encode the
message "Hello World" because so many elements are
explicitly required. CALS was not designed to be
scalable; it was designed to enforce military
specifications for complex documents, and to enable a
sophisticated database integrating text with product data
and other information. In HTML we should require elements
and attributes only where things wouldn't make sense
without them. For example, it should be required that
links always have a specified target. It should not be
required that documents always have a main heading, even
though that might seem like a generally good idea.
3. Most importantly, use an object-oriented approach in
which a small set of standard element classes are defined
(i.e. several levels of headings and standard paragraphs),
but within which users can add arbitrary complexity. Then
add more complex objects -- such as tables, figures, math,
forms, etc. -- as a series of optional, well-defined
layers.
This last is the heart of my proposal. The idea here is to
define a very simple base set of element classes -- no more
complicated than the base elements in HTML 2.0 -- while allowing
users to create a rich set of additional semantic distinctions
by creating their own subclasses.
For example, HTML currently contains a "paragraph" element to
cover most forms of text within heading levels. If you would
like a richer set of distinctions (e.g. preface, introduction,
abstract, note, warning, caution, etc.) -- distinctions that may
be crucial to effective information retrieval as well as
formatting -- you're out of luck. Today, all of these elements
must be mapped to "paragraph", because that's all there is to
choose from.
As alluded to above, the usual answer would be to create SGML
elements for each of these, and hope we covered all the
distinctions anyone would ever want to make. Of course, this is
going to make the DTD more and more complicated, and the list of
"needed" elements will in reality continue to grow. But what if
we treat "paragraph" as an element class, for which the user can
define application-specific subclass distinctions like preface,
introduction, etc.? We could also define a standard, simple set
of class attributes (e.g. "id" so we can always establish links
in a uniform way), but also allow users to add other attributes
(e.g. "user level") as they please. Essentially, the object
class model could look a lot like HTML+; the user-definable
subclasses would provide the extensibility.
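To make the element class idea concrete, here is a minimal sketch in Python using the standard html.parser module. The markup is hypothetical: the CLASS values ("abstract", "warning") and the USERLEVEL attribute are illustrative assumptions, not part of any published HTML DTD, and the names SubclassCollector and collect are invented for this example. It shows that the base class ("p") and the user-defined subclass can be read off the same tag.

```python
from html.parser import HTMLParser

# Hypothetical markup: the CLASS values and the USERLEVEL attribute are
# illustrative assumptions, not taken from a published HTML DTD.
SAMPLE = """
<H1>Annual Report</H1>
<P CLASS="abstract" ID="a1">Revenue grew this year.</P>
<P CLASS="warning" USERLEVEL="expert">Figures are unaudited.</P>
<P>Plain paragraph.</P>
"""


class SubclassCollector(HTMLParser):
    """Record (base element, subclass, id) for every start tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        attributes = dict(attrs)
        self.elements.append(
            (tag, attributes.get("class"), attributes.get("id"))
        )


def collect(markup):
    """Parse the markup and return the recorded element tuples."""
    parser = SubclassCollector()
    parser.feed(markup)
    parser.close()
    return parser.elements
```

Calling collect(SAMPLE) yields one tuple per element. The paragraph without a CLASS attribute still reports its base class "p", which is the point of the scheme: the subclass is optional extra information layered on a small, fixed set of classes.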
This sort of approach, for which there is precedent in SGML
(e.g. architectural forms in HyTime, semantic attributes in
SDL/HDL), gives us the best of both worlds. HTML stays simple,
yet is extremely extensible. There is tremendous latent power,
yet it's still quite easy to send simple memos and "Hello World"
messages. This also meets the requirement of being able to use
both simple and sophisticated viewers on the same HTML file. A
more complex, application-specific viewer could take full
advantage of all semantic distinctions and attributes to drive
powerful formatting and information retrieval. A simple,
generalized viewer can ignore the application-specific markup
yet use the basic object classes to make reasonable formatting
decisions and provide basic navigation and linking.
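The two kinds of viewer can be sketched side by side. The Python below is only an illustration under stated assumptions: the CLASS="abstract" subclass is hypothetical, as are the names SimpleViewer, AbstractSearcher, render, and search_abstracts. The simple viewer looks at base elements alone and ignores every attribute; the more specialized tool uses the subclass to restrict a keyword search to abstracts.

```python
from html.parser import HTMLParser

# Hypothetical document; CLASS="abstract" is an assumed subclass name.
DOC = (
    '<H1>Annual Report</H1>\n'
    '<P CLASS="abstract">Revenue grew this year.</P>\n'
    '<P>Revenue is discussed in detail below.</P>\n'
)


class SimpleViewer(HTMLParser):
    """Formats by base element only and ignores all attributes."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        self.prefix = "# " if tag == "h1" else ""

    def handle_endtag(self, tag):
        self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(self.prefix + text)


class AbstractSearcher(HTMLParser):
    """Collects text only from paragraphs whose CLASS is 'abstract'."""

    def __init__(self):
        super().__init__()
        self.in_abstract = False
        self.abstracts = []

    def handle_starttag(self, tag, attrs):
        self.in_abstract = (
            tag == "p" and dict(attrs).get("class") == "abstract"
        )

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_abstract = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_abstract and text:
            self.abstracts.append(text)


def render(markup):
    """Return the lines a simple, attribute-blind viewer would show."""
    viewer = SimpleViewer()
    viewer.feed(markup)
    viewer.close()
    return viewer.lines


def search_abstracts(markup, word):
    """Return abstract paragraphs containing the word (case-insensitive)."""
    searcher = AbstractSearcher()
    searcher.feed(markup)
    searcher.close()
    return [t for t in searcher.abstracts if word.lower() in t.lower()]
```

Both functions accept the same file. render(DOC) shows all three pieces of text, while search_abstracts(DOC, "revenue") returns only the abstract, even though the word also appears in the ordinary paragraph.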
Of course, like the generalized-SGML approach described above,
there is a potential "Tower of Babel" problem with uncontrolled
subclasses. However, the crucial difference is that the basic
hierarchy and linking mechanisms (like the black-and-white
signal) are always there, shared in common between all
applications and all browsers.
What about tables, equations, and other more complex objects?
Could these meet the two to three tag "Hello World" test? Well,
probably not, in the sense that we will need a few more than two
to three "required" tags and attributes to minimally specify
these objects. However, we can still keep their base models as
simple as possible, leaving room for extensibility as described
above. And, although marking up tables and equations adds
complexity to HTML as a whole, dealing with tables and equations
must be in itself completely optional. We should formalize
these features in precise layers, making it possible for a
browser to state whether it can handle math, for example, and
providing standard workarounds if not. This is already being
done for graphics in HTML 2.0: if a browser can't display the
image, there is a place to specify a textual description which
can be displayed instead.
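The graphics workaround mentioned above is the ALT attribute of the IMG element. As a rough sketch of how a text-only viewer might apply it, the Python below substitutes the ALT text for each image. The names TextOnlyViewer and render_text, the sample file names, and the "[IMAGE]" placeholder used when no ALT text is given are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextOnlyViewer(HTMLParser):
    """Shows text, replacing each image with its ALT description."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            # "[IMAGE]" is an assumed placeholder for a missing ALT.
            self.parts.append(alt if alt else "[IMAGE]")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def render_text(markup):
    """Return the pieces of text a viewer without graphics would show."""
    viewer = TextOnlyViewer()
    viewer.feed(markup)
    viewer.close()
    return viewer.parts
```

A layered treatment of math or tables could follow the same pattern: the markup carries a fallback, and a browser that cannot handle the richer object displays the fallback instead.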
So how do SGML and HTML really fit together? HTML is an SGML
application, a specific way to use SGML optimized for simplicity
and consistency in distributing electronic data over the Web.
It is not in competition with SGML, nor should it be confused
with a format suitable for archiving source information in a
corporate repository. However, it has the potential of being a
really smart use of SGML, taking advantage of SGML's flexibility
without giving up a common backbone structure that any browser
can readily interpret. In fact, it is precisely this
combination of simplicity and extensibility that will enable the
Web to be viable for serious online publishers.
Furthermore, I think the development of HTML 3.0 is the crucial
window of opportunity in which we must accomplish these goals.
Later will be too late, since people will already have built too
much around the 3.0 framework. Earlier would have been too
early, since we wouldn't have had the critical mass to consider
these questions properly. As they said in the movie
"Ghostbusters," we've got the tools, we've got the talent --
let's take this opportunity to do it right.
* * *
Mr. Severson is scheduled to speak on this topic at the Web
World conference in Orlando, Florida (January 30-31, 1995) and
Santa Clara, California (April 20-21, 1995). He is also
tentatively planning to speak at the Seybold Seminars in Boston,
Massachusetts (March 28-30, 1995).
* SGML NEWSWIRE LIST MANAGER *
* Linda Turner *
* Corporate Communications *
* Avalanche *
* 4999 Pearl East Circle *
* Suite 100 *
* Boulder, CO 80301 *
* firstname.lastname@example.org *
* email@example.com *
* Vox: (303) 449-5032 *
* Fax: (303) 449-3246 *