SGML and HTML: Eric Severson

****************** WELCOME TO SGML NEWSWIRE ******************
*                                                            *
*      To subscribe, send mail to sgmlinfo@avalanche.com.    *
*                                                            *
*   To receive a current table of contents and instructions  *
*         for ordering back issues, specify "send toc"       *
*                   in the message body.                     *
*                                                            *
*         (Please pass along to interested colleagues)       *
*                                                            *
**************************************************************

A PROPOSAL FOR THE SCALABILITY OF HTML 
======================================

The following whitepaper discusses the fundamental differences
between HTML and SGML and seeks to influence the development of
the next version of HTML.  The author welcomes all feedback --
his email address is included below.


               HOW SGML AND HTML REALLY FIT TOGETHER:
                   THE CASE FOR A SCALABLE HTML


              By Eric Severson, Avalanche/Interleaf
                        eric@avalanche.com
                         (303) 449-5032

         (c) Avalanche Development Company/Interleaf Inc.
                          January 1995

Permission to redistribute this whitepaper is granted, provided
that no changes are made, and that this notice and the above
attributions are included in all copies.


As I look around the World Wide Web community I note three 
things:

   1. Everyone is enthusiastic (absolutely wildly 
      enthusiastic?) about the Web, HTML, and all the 
      promise they offer.

   2. Everyone wants the Web (and HTML) to be simple -- 
      after all, that's a lot of the appeal.

   3. When you pin them down, everyone actually has very 
      different applications in mind, covering a dauntingly 
      large range of requirements.

Though few realize it yet (they're perhaps still too enamored
with the "glitz" phase of all this), there is potentially a huge
trade-off to be dealt with in reconciling points (2) and (3)
above.  In particular, corporate users (the focus of a lot of
the discussion) see the Web as a way to drastically change the
way by which their information is delivered and used.  But as
someone who has dealt extensively with these corporate
information requirements, I am very aware that these needs --
and this information -- are often anything but simple.

To add to the confusion, people are continually asking about the
differences between HTML and SGML, and how they may or may not
fit together.  It was striking to me that all four opening
keynotes at a recent publishing conference mentioned HTML and
the World Wide Web (we even had a demo!), and most of them also
mentioned SGML.  Yet nobody seemed to know how to tie these two
things together in a coherent way.  Q&A sessions were filled
with questions like: "What's the difference between SGML and
HTML?" and "If I have HTML, do I still need SGML?"

There is a strictly technical answer, of course, and we should
start there:

SGML is an international standard for encoding the logical
structure and content of documents that has been around since
1986.  By focusing on the structure and intent of a document,
rather than proprietary formatting codes, it maximizes
accessibility and reusability of the underlying information.
SGML lets us find information more effectively because we can
use internal document structure to help guide our search.  It
lets us reuse the information more effectively because SGML
documents are not tied to any particular format, application, or
publishing platform.

To create an SGML application, you define a set of document
objects (e.g. "title," "chapter," "paragraph," "list") and their
relationships (e.g. "document title must always occur first,"
"chapters can contain a title followed by any number of
paragraphs and lists" etc.).  These objects and relationships
are described in something called a Document Type Definition
(DTD), which forms the heart of the application.  Objects in an
SGML document are then tagged according to the rules specified
in the DTD.  Originally used primarily for complex technical
documentation (such as for the U.S. Department of Defense CALS
initiative), SGML has emerged as the format of choice for a
great variety of applications.  It is especially powerful (and
popular) where information needs to be collaboratively authored,
managed as a large collection of documents, and delivered
electronically in multiple forms.
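
As a rough illustration of what a DTD's rules do, the two example
relationships above can be sketched in Python.  This is not SGML
syntax: the content models are written as regular expressions over
child element names, the names CONTENT_MODELS and is_valid are
invented for this sketch, and the chapter title is assumed to be
required.

```python
import re

# Hypothetical content models, written as regular expressions over
# space-terminated child element names. A real DTD would express
# these rules in SGML declaration syntax instead.
CONTENT_MODELS = {
    # "document title must always occur first"
    "document": r"title (chapter )*",
    # "a title followed by any number of paragraphs and lists"
    "chapter": r"title ((paragraph|list) )*",
}


def is_valid(element, children):
    """Return True if the child sequence satisfies the element's model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[element], sequence) is not None


print(is_valid("chapter", ["title", "paragraph", "list", "paragraph"]))
print(is_valid("chapter", ["paragraph", "title"]))
```

The first call satisfies the chapter model; the second does not,
because the title does not come first.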

Thus it is not surprising that HTML, the format that drives
documents across the World Wide Web, has been defined using
SGML.  Formally speaking, HTML is a specific SGML application,
with its own Document Type Definition and a fixed tag set.
Therefore, the strictly technical answer is this: it doesn't
make sense to choose SGML versus HTML -- you're already using
SGML if you're using HTML.

However, what's really behind the questions is something more
fundamental: should HTML be viewed as a general interchange
format, or is it just a lowest common denominator useful
primarily for casual applications?  Could it be used as the
archival format, or should we always produce HTML from some
other source, such as an SGML-based information repository?
Some of these answers seem pretty clear.  At this point in time,
HTML is clearly not rich enough to be a mainstream repository
format.  Furthermore, because each individual application has a
variety of unique needs, I would recommend that HTML never be
looked at this way.  SGML in general (i.e. your own DTD) is the
right way to build a corporate information repository; HTML is a
specific vehicle to put that information on the Web.  Yet, if
HTML is not meant to be reasonably rich, then a significant
amount of its appeal may be lost.  That is, rather than being a
"universal solvent" for information being passed across the Web,
it becomes just another intermediate format driving a particular
set of browsing software.

The developers of HTML 3.0 have set out to ensure that it does
not repeat the complexity of CALS and other classic "tech doc"
applications.  Having been on the committee that architected the
CALS standard, I think I understand what they mean.  Faced with
an extremely large and diverse community of potential users, and
constantly deluged with specific user requests for "just one
more" added feature, it was all too easy to let the standard get
out of control.  While significant progress has been made in
modularizing what was once a terribly complicated monolith, new
users and new requirements still demand a cumbersome
registration process within a central CALS "library."

The World Wide Web, of course, has a potential set of users and
requirements that are orders of magnitude larger than even the
U.S. Department of Defense.  Furthermore, the idea of attempting
to keep everyone's favorite set of objects and attributes in a
worldwide library is ludicrous.  So there is no doubt that HTML
developers are exactly right: the potential to add unconstrained
complexity to HTML is staggering, and the historical evolution
of CALS cannot be the model we follow.  However, it isn't so
clear to me that we must go to the other extreme, making HTML
staggeringly simple.

The problem with simplicity is that you lose information.  Not
the words of the document themselves (the content), not even the
links that actually make the "web" between documents, but all
the other information necessary to do the really interesting
things.  This is what defines the difference between an
intriguing but clearly casual application, and one which
actually carries the freight for corporate and government
users.  For example:

   1. A simple, base-level HTML dooms every document to 
      be rendered in "lowest common denominator" format.

      The rich formatting of more "serious" documents is
      simply not possible, since the only
      distinctions available are between a few basic headings
      and paragraphs.  I may have twenty types of "paragraphs"
      in my document (abstract, biography, preface, note,
      introduction, glossary term and definition, etc.), all of
      which are formatted differently to enhance readability and
      communication power.  But on the Web, they will all look
      the same: I am forced to call all of them "paragraphs,"
      and, of course, all objects called "paragraph" will
      receive the same format.

   2. A simple, base-level HTML takes away most of our
      ability to intelligently search Web documents based on
      their inner structure.

      It is increasingly being recognized that, in terms of the
      Information Explosion, the Web is at once our best friend
      and our worst nightmare.  Everyone can witness the power
      of following links from a document that's been retrieved,
      but no one really has an answer for how to find that
      document in the first place.  I can search "What's New"
      URLs, or use an (almost certainly quite limited) indexing
      service, maybe even have a computer program automatically
      look for specific words and phrases.  Across a mushrooming
      worldwide collection of information, however, this is
      literally like looking for a needle in a haystack.
      Wouldn't it help if we could limit such automated searches
      only to document abstracts, or look for keywords only if
      they occur in some other structural context?

   3. A simple, base-level HTML makes it very hard to do
      anything interesting with a Web document once it's been
      downloaded.

      As people actually try to use (and reuse) information
      they've found on the Web, this problem is just beginning
      to be understood.  Once I've reduced a document to its
      lowest common denominator, I can't just snap my fingers
      and get it back.  If I've collapsed twenty kinds of
      paragraphs into one, there's no good way to recover the
      original twenty kinds.  I've likened this to condensing
      Hamlet into a simple "I Can Read Too" book suitable for
      first graders, then trying to use that version to
      reconstitute the original text.  Obviously, it won't
      work.
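
Points 2 and 3 can be made concrete with a small Python sketch.
The documents, element names, and helper functions below are
invented for illustration.  The sketch compares a search limited
to abstracts with an unrestricted one, then shows that the
restriction stops working once every element has been collapsed
to "paragraph".

```python
# Hypothetical documents: each is a list of (element_type, text) pairs.
DOCS = {
    "report-a": [
        ("abstract", "A proposal for scalable markup on the Web."),
        ("note", "Tables are discussed in a later section."),
    ],
    "report-b": [
        ("abstract", "Quarterly financial tables and graphics."),
        ("note", "Nothing here concerns scalable markup."),
    ],
}


def search(docs, word, within=None):
    """Return names of documents whose text contains `word`.

    If `within` is given, only elements of that type are searched.
    """
    hits = []
    for name, elements in docs.items():
        for kind, text in elements:
            if within is not None and kind != within:
                continue
            if word.lower() in text.lower():
                hits.append(name)
                break
    return hits


def flatten(docs):
    """Collapse every element type to 'paragraph' (the lossy step)."""
    return {
        name: [("paragraph", text) for _, text in elements]
        for name, elements in docs.items()
    }


print(search(DOCS, "scalable"))
print(search(DOCS, "scalable", within="abstract"))
print(search(flatten(DOCS), "scalable", within="abstract"))
```

The unrestricted search matches both reports, the abstract-only
search matches just the first, and after flattening there are no
abstracts left to search, so the list comes back empty.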

We are thus sitting on the horns of a dilemma.  On the one hand,
focusing on accommodating the entire range of needs could lead
to a horribly complicated standard, pieced together like a
patchwork quilt to satisfy the whims of the loudest political
factions.  But on the other hand, the danger of focusing on
simplicity is that people may not have the power to do the
things that originally attracted them to the Web.  If that
happens, there may be a backlash -- a "big chill" as it were --
when this becomes obvious.

In the face of this, some have suggested doing away with HTML,
or using it only for very simple documents.  In their view, Web
documents should be based on SGML in general -- that is, using
an unlimited variety of DTDs at each information provider's
discretion.  The argument, which makes some sense, is that any
one DTD -- even HTML -- can't possibly cover everyone's
application requirements.  Rather than try to standardize what
is inherently uncontrollable, let's give people the full power
of SGML and let them design their own uses.

In fact, I am a supporter of generalized SGML on the Web.  It
certainly is the way to go when building information
repositories within organizations, and should be allowed as a
Web format as well.  However, on the Web this creates another
problem which is actually a result of SGML's intrinsic
flexibility.  Since there is no guarantee that individual DTDs
will bear any resemblance to each other, a generalized-SGML Web
has its own potential to create a "Tower of Babel."  When
everyone freely designs their own SGML applications, even the
most basic notions about backbone document structures and links
may not be shared in common.  What you get when you search,
format and download information may be completely
unpredictable.  Presumably each document will give you its DTD,
but you will not have a good way to immediately relate that
document and DTD to other documents and your own applications.

So which way should we go, simple or powerful?  My answer is to
strongly support the notion of HTML as the backbone for the Web,
to be used for the vast majority of Web applications.  However,
I believe that HTML can and must be made SCALABLE.  We must take
specific steps to ensure that HTML 3.0 is architected simply,
elegantly, not biased to anyone's particular application, but
carefully engineered for extensibility that can run the gamut of
what most people will want to do on the Web.

To clarify what I mean by "scalable", let me propose an acid
test for this property:

   1. If I want to send a message consisting only of the 
      words "Hello World!", it takes no more than two to  
      three HTML tags.

   2. If I want to publish my corporate annual report,
      looking reasonably close to its paper version, and
      enabling powerful information retrieval keyed to text,
      financial tables, and graphics, it takes a lot more tags
      but I can still do it without a problem.

   3. If I want to send my corporate annual report, via the 
      same HTML file, to both a highly sophisticated 
      graphical browser / retrieval engine, and to a simple 
      "dumb terminal" viewer, I can do this knowing that 
      both can easily do something reasonable with the 
      information.  That is, nothing about HTML would 
      make it hard to build a simplistic viewer that could 
      deal with any document, no matter how sophisticated 
      the markup.  Furthermore, I do not need to know the 
      properties of each potential browser/viewer when I 
      encode my HTML; all HTML-compliant browsers/viewers
      should be able to handle any HTML file in an effective
      manner.

One might observe that I want to have my cake, eat it too, and
still get a second choice for dessert.  This would seem like a
lot to ask for, but it's actually not as hard as it sounds.

I like to solve problems by analogy, and in this case I like the
analogy of a television signal.  Like the Web, television allows
an extremely large and diverse audience to access information in
an unprecedented way.  It also started out simply,
consisting only of a black-and-white picture and mono sound.
However, technology advanced as time went on, viewers demanded
more, and things began to get complicated.  Remember when a show
being "in color" was still new enough to brag about?  Of course,
for a while most people still watched such shows in
black-and-white.  In the 1980's the same thing happened with "in
stereo."  Nowadays, we've added surround sound, closed
captioning, secondary audio program, etc., accessible only to
those with specialized receivers.

Now, if we were to follow the original CALS model in designing
standards for television, we might be requiring everyone to have
a full-blown home theater system, given that the standard has
evolved to incorporate all those features.  Alternatively, if
the "keep it simple at all costs" model were followed, we would
all be required to watch everything in simple black-and-white,
since that's the lowest common denominator.  Luckily, we did
something smarter with television: we made it modular and
scalable.  In fact, I can still receive today's terribly rich
and complex television signal on my 1960's vintage
black-and-white set, or, if I want, can invest in a home theater
system or anything in between.  It's completely my choice based
on what I think I really need.  For the World Wide Web, simple
browsers like Lynx and Mosaic ought to be adequate for many
people and many applications, but corporate users (and others)
will want to add richer information accessible by users with
more sophisticated browsers, search programs, and downloading
requirements.

Concretely, a simple but scalable HTML should rest upon three
fundamental principles:

   1. Enable a reasonable framework for navigation and
      linking, but impose no more explicit structural
      constraints than necessary.

      We certainly need nested heading levels, and a few basic
      hierarchical constraints, but we should avoid at all costs
      building a DTD that tries to enforce desired document
      structure of any sort.  Our job is always to describe
      rather than prescribe.  If somebody wants to check complex
      (CALS-like) structure using SGML (e.g. "all lists must
      have at least two items"), they can do it with their own,
      more restrictive DTD.  What is required from HTML is
      enough basic structure to do effective top-level
      navigation, form links, describe sensible tables, etc. --
      and sufficient richness to describe semantic (not
      structural) distinctions between other elements.

   2. Use only the absolute minimum number of required 
      (vs. optional) tags and attributes.

      In CALS, it takes a huge number of tags to encode the
      message "Hello World" because so many elements are
      explicitly required.  CALS was not designed to be
      scalable; it was designed to enforce military
      specifications for complex documents, and to enable a
      sophisticated database integrating text with product data
      and other information.  In HTML we should require elements
      and attributes only where things wouldn't make sense
      without them.  For example, it should be required that
      links always have a specified target.  It should not be
      required that documents always have a main heading, even
      though that might seem like a generally good idea.

   3. Most importantly, use an object-oriented approach in
      which a small set of standard element classes is defined
      (i.e. several levels of headings and standard paragraphs),
      but within which users can add arbitrary complexity.  Then
      add more complex objects -- such as tables, figures, math,
      forms, etc. -- as a series of optional, well-defined
      layers.

This last is the heart of my proposal.  The idea here is to
define a very simple base set of element classes -- no more
complicated than the base elements in HTML 2.0 -- while allowing
users to create a rich set of additional semantic distinctions
by creating their own subclasses.

For example, HTML currently contains a "paragraph" element to
cover most forms of text within heading levels.  If you would
like a richer set of distinctions (e.g. preface, introduction,
abstract, note, warning, caution, etc.) -- distinctions that may
be crucial to effective information retrieval as well as
formatting -- you're out of luck.  Today, all of these elements
must be mapped to "paragraph", because that's all there is to
choose from.

As alluded to above, the usual answer would be to create SGML
elements for each of these, and hope we covered all the
distinctions anyone would ever want to make.  Of course, this is
going to make the DTD more and more complicated, and the list of
"needed" elements will in reality continue to grow.  But what if
we treat "paragraph" as an element class, for which the user can
define application-specific subclass distinctions like preface,
introduction, etc.?  We could also define a standard, simple set
of class attributes (e.g. "id" so we can always establish links
in a uniform way), but also allow users to add other attributes
(e.g. "user level") as they please.  Essentially, the object
class model could look a lot like HTML+; the user-definable
subclasses would provide the extensibility.

This sort of approach, for which there is precedent in SGML
(e.g. architectural forms in HyTime, semantic attributes in
SDL/HDL), gives us the best of both worlds.  HTML stays simple,
yet is extremely extensible.  There is tremendous latent power,
yet it's still quite easy to send simple memos and "Hello World"
messages.  This also meets the requirement of being able to use
both simple and sophisticated viewers on the same HTML file.  A
more complex, application-specific viewer could take full
advantage of all semantic distinctions and attributes to drive
powerful formatting and information retrieval.  A simple,
generalized viewer can ignore the application-specific markup
yet use the basic object classes to make reasonable formatting
decisions and provide basic navigation and linking.
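
A minimal Python sketch of this two-viewer behavior follows.  It
assumes, purely for illustration, that a subclass is written as a
"class" attribute on the paragraph tag and that "userlevel" is a
user-added attribute; neither spelling comes from this proposal.
The Viewer class and view function are likewise invented, built on
Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical markup: "class" and "userlevel" stand in for the
# user-defined subclass and extra attribute described in the text.
SAMPLE = (
    '<h1>Annual Report</h1>'
    '<p class="abstract" id="a1">Revenue grew this year.</p>'
    '<p class="warning" userlevel="expert">Figures are unaudited.</p>'
    '<p>Thank you for reading.</p>'
)


class Viewer(HTMLParser):
    """Collects (label, text) pairs for each heading or paragraph."""

    def __init__(self, use_subclasses):
        super().__init__()
        self.use_subclasses = use_subclasses
        self.blocks = []
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag not in ("h1", "p"):
            return
        # Every viewer understands the base element classes.
        label = "heading" if tag == "h1" else "paragraph"
        # Only a richer viewer looks at the user-defined subclass.
        subclass = dict(attrs).get("class")
        if self.use_subclasses and subclass:
            label = subclass
        self._label = label

    def handle_data(self, data):
        if self._label is not None:
            self.blocks.append((self._label, data))

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._label = None


def view(markup, use_subclasses):
    """Parse the markup and return the labeled blocks."""
    viewer = Viewer(use_subclasses)
    viewer.feed(markup)
    viewer.close()
    return viewer.blocks


for label, text in view(SAMPLE, use_subclasses=False):
    print(label, "|", text)
for label, text in view(SAMPLE, use_subclasses=True):
    print(label, "|", text)
```

The simple viewer labels all three paragraphs "paragraph"; the
richer one distinguishes "abstract" and "warning" from the plain
paragraph, reading the same file.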

Of course, like the generalized-SGML approach described above,
there is a potential "Tower of Babel" problem with uncontrolled
subclasses.  However, the crucial difference is that the basic
hierarchy and linking mechanisms (like the black-and-white
signal) are always there, shared in common between all
applications and all browsers.

What about tables, equations, and other more complex objects?
Could these meet the two to three tag "Hello World" test?  Well,
probably not, in the sense that we will need a few more than two
to three "required" tags and attributes to minimally specify
these objects.  However, we can still keep their base models as
simple as possible, leaving room for extensibility as described
above.  And, although marking up tables and equations adds
complexity to HTML as a whole, dealing with tables and equations
must be in itself completely optional.  We should formalize
these features in precise layers, making it possible for a
browser to state whether it can handle math, for example, and
providing standard workarounds if not.  This is already being
done for graphics in HTML 2.0: if a browser can't display the
image, there is a place to specify a textual description which
can be displayed instead.
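
One way to picture such layering is the Python sketch below.  The
layer name "math", the dictionary layout, and the render function
are all assumptions made for this example.  A viewer declares
which optional layers it supports, and any object outside those
layers is replaced by its author-supplied textual fallback.

```python
# Hypothetical base kinds; the text proposes layers but does not name them.
BASE_KINDS = {"heading", "paragraph"}


def render(objects, supported_layers):
    """Render each object, or its textual fallback if unsupported."""
    lines = []
    for obj in objects:
        kind = obj["kind"]
        if kind in BASE_KINDS or kind in supported_layers:
            lines.append(f"[{kind}] {obj['content']}")
        else:
            # Fall back to the author's textual alternative.
            lines.append(f"[text] {obj['fallback']}")
    return lines


DOCUMENT = [
    {"kind": "paragraph", "content": "Energy and mass are related."},
    {"kind": "math", "content": "E = mc^2",
     "fallback": "E equals m times c squared"},
]

for line in render(DOCUMENT, supported_layers={"math"}):
    print(line)
for line in render(DOCUMENT, supported_layers=set()):
    print(line)
```

Both viewers show the paragraph unchanged; only the one that
declares the "math" layer shows the equation itself, while the
other shows the fallback text.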


So how do SGML and HTML really fit together?  HTML is an SGML
application, a specific way to use SGML optimized for simplicity
and consistency in distributing electronic data over the Web.
It is not in competition with SGML, nor should it be confused
with a format suitable for archiving source information in a
corporate repository.  However, it has the potential of being a
really smart use of SGML, taking advantage of SGML's flexibility
without giving up a common backbone structure that any browser
can readily interpret.  In fact, it is precisely this
combination of simplicity and extensibility that will enable the
Web to be viable for serious online publishers.

Furthermore, I think the development of HTML 3.0 is the crucial
window of opportunity in which we must accomplish these goals.
Later will be too late, since people will already have built too
much around the 3.0 framework.  Earlier would have been too
early, since we wouldn't have had the critical mass to consider
these questions properly.  As they said in the movie
"Ghostbusters," we've got the tools, we've got the talent --
let's take this opportunity to do it right.


                            * * *

Mr. Severson is scheduled to speak on this topic at the Web
World conference in Orlando, Florida (January 30-31, 1995) and
Santa Clara, California (April 20-21, 1995).  He is also
tentatively planning to speak at the Seybold Seminars in Boston,
Massachusetts (March 28-30, 1995).

**************************************************************
*                SGML NEWSWIRE LIST MANAGER                  *
*                                                            *
*                       Linda Turner                         *
*                 Corporate Communications                   *
*                         Avalanche                          *
*                  4999 Pearl East Circle                    *
*                          Suite 100                         *
*                     Boulder, CO 80301                      *
*                   sgmlinfo@avalanche.com                   *
*                    linda@avalanche.com                     *
*                    Vox: (303) 449-5032                     *
*                    Fax: (303) 449-3246                     *
**************************************************************