SGML: Revised HTML/SGML Whitepaper (Severson)


From: Eric Severson <erics@avalanche.com>
Date: Thu, 6 Apr 1995 12:55:36 -0600
Message-Id: <199504061855.MAA06159@vincent.avalanche.com>
To: sgml-internet@ebt.com
Subject: Revised HTML/SGML Whitepaper

Attached is a revised version of my HTML/SGML whitepaper, in
which I have clarified the way in which I propose SGML and
HTML fit together using architectural forms. I would appreciate
your comments...

Thanks,

Eric

*******************************************************************
Eric Severson                 TEL +1 (303) 449-5032
Interleaf/Avalanche           FAX +1 (303) 449-7381 or -3246
4999 Pearl East Circle        EMAIL eric@avalanche.com
Suite 100
Boulder, CO 80301
USA

Chair, SGML/Open Table Interchange Issues Technical Committee
*******************************************************************     

               HOW SGML AND HTML REALLY FIT TOGETHER:
               THE CASE FOR A SCALABLE HTML/SGML WEB


By Eric Severson, Interleaf/Avalanche, April 1995


As I look around the World Wide Web community I note three 
things:

   1. Everyone is enthusiastic (absolutely wildly 
      enthusiastic?) about the Web, HTML, and all the 
      promise they offer.

   2. Everyone wants the Web (and HTML) to be simple -- 
      after all, that's a lot of the appeal.

   3. When you pin them down, everyone actually has very 
      different applications in mind, covering a dauntingly 
      large range of requirements.

Though few realize it yet (they're perhaps still too enamored with
the "glitz" phase of all this), there is potentially a huge trade-off
to be dealt with in reconciling points (2) and (3) above.  In
particular, corporate users (the focus of a lot of the discussion)
see the Web as a way to drastically change the way by which their
information is delivered and used.  But as someone who has dealt
extensively with these corporate information requirements, I am
very aware that these needs -- and this information -- are often
anything but simple.

To add to the confusion, people are continually asking about the 
differences between HTML and SGML, and how they may or may 
not fit together.  It was striking to me that all four opening 
keynotes at a recent publishing conference mentioned HTML and 
the World Wide Web (we even had a demo!), and most of them 
also mentioned SGML.  Yet nobody seemed to know how to tie 
these two things together in a coherent way.  Q&A sessions were 
filled with questions like: "What's the difference between SGML 
and HTML?" and "If I have HTML, do I still need SGML?"

There is a strictly technical answer, of course, and we should
start there:

SGML is an international standard for encoding the logical 
structure and content of documents that has been around since 
1986.  By focusing on the structure and intent of a document, 
rather than proprietary formatting codes, it maximizes accessibility 
and reusability of the underlying information.  SGML lets us find 
information more effectively because we can use internal 
document structure to help guide our search.  It lets us reuse 
the information more effectively because SGML documents are 
not tied to any particular format, application, or publishing 
platform.

To create an SGML application, you define a set of document 
objects (e.g. "title," "chapter," "paragraph," "list") and their 
relationships (e.g. "document title must always occur first," 
"chapters can contain a title followed by any number of paragraphs 
and lists" etc.).  These objects and relationships are described in 
something called a Document Type Definition (DTD), which 
forms the heart of the application.  Objects in an SGML document 
are then tagged according to the rules specified in the DTD.  
Originally used primarily for complex technical documentation 
(such as for the U.S. Department of Defense CALS initiative), 
SGML has emerged as the format of choice for a great variety of 
applications.  It is especially powerful (and popular) where
information needs to be collaboratively authored, managed as a
large collection of documents, and delivered electronically
in multiple forms.
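
As a rough illustration of what such rules amount to, the two example
relationships above can be written as patterns over the sequence of
child elements.  This is a sketch in Python, not SGML: the
CONTENT_MODELS table and the is_valid helper are hypothetical names,
the chapter title is assumed to be required, and a real SGML parser
does a great deal more than this.

```python
import re

# Hypothetical content models, keyed by parent element name.
# Each model is a regular expression over a sequence of child names,
# where every name is followed by a single space.
CONTENT_MODELS = {
    # "document title must always occur first"
    "document": r"title (chapter )*",
    # "chapters can contain a title followed by any number of
    # paragraphs and lists" (title assumed required here)
    "chapter": r"title ((paragraph|list) )*",
}


def is_valid(parent, children):
    """Return True if the child sequence satisfies the parent's model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


print(is_valid("chapter", ["title", "paragraph", "list", "paragraph"]))  # True
print(is_valid("chapter", ["paragraph", "title"]))                       # False
```

Tagging a document "according to the rules specified in the DTD"
then means that every element's children pass a check of this kind.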

Thus it is not surprising that HTML, the format that drives 
documents across the World Wide Web, has been defined using 
SGML.  Formally speaking, HTML is a specific SGML 
application, with its own Document Type Definition and a fixed 
tag set.  Therefore, the strictly technical answer is this: it
doesn't make sense to choose SGML versus HTML -- you're already 
using SGML if you're using HTML.

However, what's really behind the questions is something more
fundamental: should HTML be viewed as a general interchange format,
or is it just a lowest common denominator useful primarily for
casual applications?  Could it be used as the archival format,
or should we always produce HTML from some other source, such as
an SGML-based information repository?  Some of these answers seem
pretty clear.  At this point, HTML is simply not rich
enough to be a mainstream repository format.  Furthermore, because
each individual application has a variety of unique needs, I would
recommend that HTML never be looked at this way.  SGML in general
(i.e. your own DTD) is the right way to build a corporate
information repository; HTML is a specific vehicle to put that
information on the Web.  Yet, if HTML is not meant to be reasonably
rich, then a significant amount of its appeal may be lost.
That is, rather than being a "universal solvent" for information
being passed across the Web, it becomes just another intermediate
format driving a particular set of browsing software.

The developers of HTML 3.0 have set out to ensure that it does not 
repeat the complexity of CALS and other classic "tech doc"
applications.  Having been on the committee that architected the
CALS standard, I think I understand what they mean.  Faced with
an extremely large and diverse community of potential users, and
constantly deluged with specific user requests for "just 
one more" added feature, it was all too easy to let the standard 
get out of control.  While significant progress has been made in 
modularizing what was once a terribly complicated monolith, new 
users and new requirements still demand a cumbersome 
registration process within a central CALS "library."

The World Wide Web, of course, has a potential set of users and 
requirements that are orders of magnitude larger than even the 
U.S. Department of Defense.  Furthermore, the idea of attempting 
to keep everyone's favorite set of objects and attributes in a 
worldwide library is ludicrous.  So there is no doubt that 
HTML developers are exactly right: the potential to add 
unconstrained complexity to HTML is staggering, and the 
historical evolution of CALS cannot be the model we follow.  
However, it isn't so clear to me that we must go to the other 
extreme, making HTML staggeringly simple.

The problem with simplicity is that you lose information.  Not the 
words of the document themselves (the content), not even the links 
that actually make the "web" between documents, but all the other 
information necessary to do the really interesting things.  This is 
what defines the difference between an intriguing but clearly 
casual application, and one which actually carries the freight
for corporate and government users:

   1. A simple, base-level HTML dooms every document to 
      be rendered in "lowest common denominator" format.

      The rich formatting of more "serious" documents is
      simply not possible, since the only distinctions
      available are between a few basic headings
      and paragraphs.  I may have twenty types of 
      "paragraphs" in my document (abstract, biography, 
      preface, note, introduction, glossary term and 
      definition, etc.), all of which are formatted differently 
      to enhance readability and communication power.  But 
      on the Web, they will all look the same: I am forced to 
      call all of them "paragraphs," and, of course, all 
      objects called "paragraph" will receive the same 
      format.

   2. A simple, base-level HTML takes away most of our 
      ability to intelligently search Web documents based on 
      their inner structure.

      It is increasingly being recognized that, in terms of the 
      Information Explosion, the Web is at once our best 
      friend and our worst nightmare.  Everyone can witness 
      the power of following links from a document that's 
      been retrieved, but no one really has an answer for 
      how to find that document in the first place.  I can 
      search "What's New" URLs, or use an (almost 
      certainly quite limited) indexing service, maybe even 
      have a computer program automatically look for 
      specific words and phrases.  Across a mushrooming
      worldwide collection of information, however, this is
      like looking for a needle in a haystack.
      Wouldn't it help if we could limit such automated 
      searches only to document abstracts, or look for 
      keywords only if they occur in some other structural 
      context?

   3. A simple, base-level HTML makes it very hard to do 
      anything interesting with a Web document once it's 
      been downloaded.

      As people actually try to use (and reuse) information 
      they've found on the Web, this problem is just 
      beginning to be understood.  Once I've reduced a 
      document to its lowest common denominator, I can't 
      just snap my fingers and get it back.  If I've collapsed 
      twenty kinds of paragraphs into one, there's no good 
      way to recover the original twenty kinds.  I've likened 
      this to condensing Hamlet into a simple "I Can Read 
      Too" book suitable for first graders, then trying to use 
      that version to reconstitute the original text.  
      Obviously, it won't work.
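
Points (2) and (3) can be made concrete with a small sketch.  It
uses Python's standard xml.etree.ElementTree on well-formed markup
as a stand-in for SGML; the sample document, the element names and
the search and flatten helpers are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical richly tagged document (well-formed markup used
# here as a stand-in for an SGML instance).
RICH = """<article>
  <abstract>Scalable markup for the Web.</abstract>
  <preface>Why markup matters.</preface>
  <note>Markup is not formatting.</note>
</article>"""


def search(doc, word, within=None):
    """Return the text of elements containing `word`.

    If `within` is given, only elements of that type are searched.
    """
    root = ET.fromstring(doc)
    hits = []
    for elem in root.iter(within):
        text = elem.text or ""
        if word.lower() in text.lower():
            hits.append(text.strip())
    return hits


def flatten(doc):
    """Collapse every child element into a generic paragraph."""
    root = ET.fromstring(doc)
    for elem in root:
        elem.tag = "p"  # the original element type is discarded here
    return ET.tostring(root, encoding="unicode")


print(search(RICH, "markup"))                     # every element matches
print(search(RICH, "markup", within="abstract"))  # only the abstract
print(search(flatten(RICH), "markup", within="abstract"))  # nothing left
```

Once flatten has run, the search limited to abstracts has nothing
to work with, and no function can turn the "p" elements back into
the original abstract, preface and note.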

We are thus sitting on the horns of a dilemma.  On the one hand, 
focusing on accommodating the entire range of needs could lead to 
a horribly complicated standard, pieced together like a patchwork 
quilt to satisfy the whims of the loudest political factions.  But on 
the other hand, the danger of focusing on simplicity is that people 
may not have the power to do the things that originally attracted 
them to the Web.  If that happens, there may be a backlash -- a "big 
chill" as it were -- when this becomes obvious.

In the face of this, some have suggested doing away with HTML, 
or using it only for very simple documents.  In their view, Web 
documents should be based on SGML in general -- that is, using an 
unlimited variety of DTDs at each information provider's 
discretion.  The argument, which makes some sense, is that any 
one DTD -- even HTML -- can't possibly cover everyone's 
application requirements.  Rather than try to standardize what is 
inherently uncontrollable, let's give people the full power of 
SGML and let them design their own uses.

Although I am also a strong SGML supporter, I disagree with
this conclusion.  For all the power within SGML, I believe that
HTML should remain as the primary data format that forms the
backbone of the Web.  This is true for three important reasons:

   1. People have already accepted HTML as the underlying Web
      format.  Attempting to shift the momentum away from HTML
      would be confusing, counter-productive, and likely to fail.

   2. While it is important that more complex documents can be
      included on the Web, most documents will continue to be
      relatively simple.  There is no reason to force people to
      use generalized SGML -- requiring them to create or select
      application-specific DTDs -- when HTML (with a few small
      enhancements) will be adequate for most real-life uses.

   3. Even if the Web consisted only of SGML documents, it would
      be necessary to invent something akin to HTML as a set of
      common semantics to be shared between all the various SGML
      applications.  Without these common semantics, it will not
      be possible to easily navigate, search, format, or reuse
      Web documents in a standardized way.

I believe we should maintain an HTML-centric Web, but one in
which HTML and SGML coexist and are used purposefully and
intelligently together.  Rather than choosing one over the other,
I would like to see HTML and SGML brought together to form a
truly SCALABLE architecture in which both simple and complex
information can be handled effectively without sacrificing HTML's
current simplicity.  In this vision, HTML will scale smoothly both
within itself and into the use of SGML in general.

To clarify what I mean by "scalable", let me propose an acid test 
for this property:

   1. If I want to send a message consisting only of the 
      words "Hello World!", it takes no more than two to  
      three HTML tags.

   2. If I want to publish my corporate annual report, 
      looking reasonably close to its paper version, and 
      enabling powerful information retrieval keyed to text, 
      financial tables, and graphics, it takes a lot more tags --
      maybe even its own customized SGML application -- but I
      can still do it on the Web without a problem.

   3. If I want to send my corporate annual report, via the 
      same HTML file, to both a highly sophisticated 
      graphical browser / retrieval engine, and to a simple 
      "dumb terminal" viewer, I can do this knowing that 
      both can easily do something reasonable with the 
      information.  That is, nothing about HTML (or its
      relationship to SGML) would make it hard to build a
      simplistic viewer that could deal with any document,
      no matter how sophisticated the markup.  Furthermore,
      this should remain true whether I have used straight
      HTML or a custom SGML DTD as the underlying encoding.
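
The third test is easy to sketch.  The viewer below, built on
Python's standard html.parser module, ignores every tag it meets and
keeps only the text, so it does "something reasonable" with both a
two-tag greeting and a heavily tagged report.  The PlainTextViewer
class, the render helper and the sample element names are
hypothetical, and real SGML features such as marked sections are
not handled.

```python
from html.parser import HTMLParser


class PlainTextViewer(HTMLParser):
    """A deliberately simplistic viewer: ignore all tags, keep text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize runs of whitespace
        if text:
            self.chunks.append(text)


def render(markup):
    """Return the text content of `markup`, one chunk per line."""
    viewer = PlainTextViewer()
    viewer.feed(markup)
    viewer.close()
    return "\n".join(viewer.chunks)


# Simple case: a greeting with a couple of tags.
print(render("<title>Greeting</title><p>Hello World!"))

# Complex case: element names this viewer has never heard of.
print(render(
    "<report><finsummary year='1994'>Revenue rose.</finsummary>"
    "<footnote>Unaudited.</footnote></report>"
))
```

A viewer like this loses all the added richness, of course, but the
point of the test is that it never fails outright.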

One might observe that I want to have my cake, eat it too, and still 
get a second choice for dessert.  This would seem like a lot to ask 
for, but it's actually not as hard as it sounds.

I like to solve problems by analogy, and in this case I like the 
analogy of a television signal.  Like the Web, television allows
an extremely large and diverse audience to access information in
an unprecedented way.  It also started out simply,
consisting only of a black-and-white picture and mono sound.  
However, technology advanced as time went on, viewers 
demanded more, and things began to get complicated.  Remember 
when a show being "in color" was still new enough to brag about?  
Of course, for a while most people still watched such shows in 
black-and-white.  In the 1980's the same thing happened with "in 
stereo."  Nowadays, we've added surround sound, closed 
captioning, secondary audio program, etc., accessible only to
those with specialized receivers.

Now, if we were to follow the original CALS model in designing 
standards for television, we might be requiring everyone to have a 
full-blown home theater system, given that the standard has 
evolved to incorporate all those features.  Alternatively, if the 
"keep it simple at all costs" model were followed, we would all be 
required to watch everything in simple black-and-white, since 
that's the lowest common denominator.   Luckily, we did 
something smarter with television: we made it modular and 
scalable.  In fact, I can still receive today's terribly rich and 
complex television signal on my 1960's vintage black-and-white 
set, or, if I want, can invest in a home theater system or anything 
in between.  It's completely my choice based on what I think I 
really need.  For the World Wide Web, simple browsers like Lynx
and Mosaic ought to be adequate for many people and many
applications, but corporate users (and others) will want to add
richer information accessible by users with more sophisticated
browsers, search programs, and downloading requirements.

Concretely, a simple but scalable HTML should rest upon three 
fundamental principles:

   1. Enable a reasonable framework for navigation and 
      linking, but impose no more explicit structural 
      constraints than necessary.

      We certainly need nested heading levels, and a few 
      basic hierarchical constraints, but we should avoid at 
      all costs building a DTD that tries to enforce desired 
      document structure of any sort.  Our job is always to 
      describe rather than prescribe.  If somebody wants to 
      check complex (CALS-like) structure using SGML 
      (e.g. "all lists must have at least two items"), they can 
      do it with their own, more restrictive DTD.  What is 
      required from HTML is enough basic structure to do 
      effective top-level navigation, form links, describe 
      sensible tables, etc. -- and sufficient richness to 
      describe semantic (not structural) distinctions between 
      other elements.

   2. Use only the absolute minimum number of required 
      (vs. optional) tags and attributes.

      In CALS, it takes a huge number of tags to encode the 
      message "Hello World" because so many elements are 
      explicitly required.  CALS was not designed to be 
      scalable; it was designed to enforce military 
      specifications for complex documents, and to enable a 
      sophisticated database integrating text with product 
      data and other information.  In HTML we should 
      require elements and attributes only where things 
      wouldn't make sense without them.  For example, it 
      should be required that links always have a specified 
      target.  It should not be required that documents 
      always have a main heading, even though that might 
      seem like a generally good idea.

   3. Most importantly, use an object-oriented approach in
      which a small set of standard element classes is
      defined (i.e. several levels of headings and standard
      paragraphs), but within which users can add arbitrary
      complexity.  Then add more complex objects -- such
      as tables, figures, math, forms, etc. -- as a series of
      optional, well-defined layers.

This last point is the heart of my proposal.  The idea here
is to define a very simple base set of element classes -- no
more complicated than the base elements in HTML 2.0 -- while
allowing users to create a rich set of additional semantic
distinctions by creating their own "subclasses."

For example, HTML currently contains a "paragraph" element to 
cover most forms of text within heading levels.  If you would like 
a richer set of distinctions (e.g. preface, introduction, abstract, 
note, warning, caution, etc.) -- distinctions that may be crucial to 
effective information retrieval as well as formatting -- you're out 
of luck.  Today, all of these elements must be mapped to "paragraph",
because that's all there is to choose from.

Hence the motivation for customized SGML -- it allows you to make
up any element names you wish.  But there's another problem.  SGML
gives you the ultimate flexibility and scalability but in itself
provides no standardization of even the most basic concepts of
navigation and linking.  IT WASN'T MEANT TO.  One DTD designer
can call a hyperlink an "a", the next may call it an "xref",
another might choose to call it a "joe" or a "schmoe".  

The point is, with fully general SGML there's no way to predict
the relationship between the elements in an arbitrary DTD and the
semantics my browser / searcher / download process needs to know.
Certainly, if I search across the Web (soon the world's largest
virtual document collection),  I would like to be able to follow
links and understand navigation hierarchy without having to
analyze each individual use of SGML and figure out how to map it
to a common set of semantics.  Having the DTD available -- in the
absence of some additional method of tying things together --
doesn't help me a bit.

So we must do something smarter: generalize HTML so that it
provides the common set of semantics (classes of objects) that can
be understood by any Web browser, search engine or other application.
Using this approach, all individual Web documents would be encoded
in SGML, using whatever DTD is desired by the user, but each SGML
element in the DTD would be mapped to one of the standard HTML
classes.  For example, my DTD may have elements called ARTICLE,
ABSTRACT, PREFACE, PARA and JUMP.  So that any Web browser could
immediately understand the basic structure of my documents, I would
provide a standard mapping which shows that ARTICLE is a heading
level, ABSTRACT, PREFACE and PARA are all a form of paragraph,
and JUMP is a hyperlink (most likely this would be accomplished
using the "architectural forms" approach from SGML's companion
HyTime standard).
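
To show the idea, though not the architectural forms syntax itself,
here is a sketch in which the mapping is an ordinary Python table.
The class names "heading", "paragraph" and "hyperlink", the
ClassifyingParser class, the classify helper and the TARGET
attribute are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical mapping from the example DTD's element names to the
# standard classes a browser is assumed to understand.
HTML_CLASS = {
    "ARTICLE": "heading",
    "ABSTRACT": "paragraph",
    "PREFACE": "paragraph",
    "PARA": "paragraph",
    "JUMP": "hyperlink",
}


class ClassifyingParser(HTMLParser):
    """Record each start tag together with its standard class."""

    def __init__(self, mapping):
        super().__init__()
        self.mapping = mapping
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so look them up in upper case.
        name = tag.upper()
        self.seen.append((name, self.mapping.get(name, "unknown")))


def classify(markup, mapping=HTML_CLASS):
    """Return (element, class) pairs in document order."""
    parser = ClassifyingParser(mapping)
    parser.feed(markup)
    parser.close()
    return parser.seen


doc = (
    "<ARTICLE><ABSTRACT>Short.</ABSTRACT>"
    "<PARA>See <JUMP TARGET='s2'>section 2</JUMP>.</PARA></ARTICLE>"
)
for element, html_class in classify(doc):
    print(element, "->", html_class)
```

A browser that knows only the standard classes can then follow JUMP
as a link and format ABSTRACT as a paragraph without ever having
seen this DTD before.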

The majority of simple Web documents would continue to use the
simple SGML DTD called HTML, in which the elements map one-for-one
to the common object classes.  However, even this DTD could be
made more extensible to allow a measure of scalability without
moving to a separate custom-defined SGML application (most likely
this would be accomplished using some form of class/subclass
attributes within the base HTML DTD).
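
One way to picture the class/subclass idea is as a formatting lookup
that falls back to the base class when a browser does not know the
subclass.  The sketch assumes a paragraph element carrying some kind
of subclass attribute; the STYLES table, the style names and the
style_for helper are hypothetical.

```python
# Hypothetical formatting table keyed by (element, subclass).
# A subclass of None stands for the plain base element.
STYLES = {
    ("p", None): "body text",
    ("p", "abstract"): "indented italic",
    ("p", "warning"): "bold, boxed",
}


def style_for(element, subclass=None, known=STYLES):
    """Use the subclass style if known, else fall back to the base class.

    Assumes the table always has an entry for the base element.
    """
    return known.get((element, subclass), known[(element, None)])


# A simple browser that knows only the base paragraph.
simple_browser = {("p", None): "body text"}

print(style_for("p", "abstract"))                  # indented italic
print(style_for("p", "abstract", simple_browser))  # body text
print(style_for("p", "glossary"))                  # body text
```

The sophisticated browser formats the abstract distinctively, the
simple one treats it as an ordinary paragraph, and neither has to
reject the document.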


So how do SGML and HTML really fit together?  HTML is an 
SGML application, a specific way to use SGML optimized for 
simplicity and consistency in distributing electronic data over the 
Web.  It is not in competition with SGML, nor should it be 
confused with a format suitable for archiving source information 
in a corporate repository.  However, it has the potential of being a 
really smart use of SGML, taking advantage of SGML's flexibility 
without giving up a common backbone structure that any browser 
can readily interpret.  In fact, it is precisely this combination of 
simplicity and extensibility that will enable the Web to be viable 
for serious online publishers.

* * *