SGML: Revised HTML/SGML Whitepaper (Severson)
From: Eric Severson <erics@avalanche.com>
Date: Thu, 6 Apr 1995 12:55:36 -0600
Message-Id: <199504061855.MAA06159@vincent.avalanche.com>
To: sgml-internet@ebt.com
Subject: Revised HTML/SGML Whitepaper
Attached is a revised version of my HTML/SGML whitepaper, in
which I have clarified the way in which I propose SGML and
HTML fit together using architectural forms. I would appreciate
your comments...
Thanks,
Eric
*******************************************************************
Eric Severson                  TEL   +1 (303) 449-5032
Interleaf/Avalanche            FAX   +1 (303) 449-7381 or -3246
4999 Pearl East Circle         EMAIL eric@avalanche.com
Suite 100
Boulder, CO 80301
USA
Chair, SGML/Open Table Interchange Issues Technical Committee
*******************************************************************
HOW SGML AND HTML REALLY FIT TOGETHER:
THE CASE FOR A SCALABLE HTML/SGML WEB
By Eric Severson, Interleaf/Avalanche, April 1995
As I look around the World Wide Web community I note three
things:
1. Everyone is enthusiastic (absolutely wildly
enthusiastic?) about the Web, HTML, and all the
promise they offer.
2. Everyone wants the Web (and HTML) to be simple --
after all, that's a lot of the appeal.
3. When you pin them down, everyone actually has very
different applications in mind, covering a dauntingly
large range of requirements.
Though few realize it yet (they're perhaps still too enamored with
the "glitz" phase of all this), there is potentially a huge trade-off
to be dealt with in reconciling points (2) and (3) above. In
particular, corporate users (the focus of a lot of the discussion)
see the Web as a way to drastically change how their
information is delivered and used. But as someone who has dealt
extensively with these corporate information requirements, I am
very aware that these needs -- and this information -- are often
anything but simple.
To add to the confusion, people are continually asking about the
differences between HTML and SGML, and how they may or may
not fit together. It was striking to me that all four opening
keynotes at a recent publishing conference mentioned HTML and
the World Wide Web (we even had a demo!), and most of them
also mentioned SGML. Yet nobody seemed to know how to tie
these two things together in a coherent way. Q&A sessions were
filled with questions like: "What's the difference between SGML
and HTML?" and "If I have HTML, do I still need SGML?"
There is a strictly technical answer, of course, and we should
start there:
SGML is an international standard (ISO 8879), in place since 1986,
for encoding the logical structure and content of documents.
By focusing on the structure and intent of a document,
rather than proprietary formatting codes, it maximizes accessibility
and reusability of the underlying information. SGML lets us find
information more effectively because we can use internal
document structure to help guide our search. It lets us reuse
the information more effectively because SGML documents are
not tied to any particular format, application, or publishing
platform.
To create an SGML application, you define a set of document
objects (e.g. "title," "chapter," "paragraph," "list") and their
relationships (e.g. "document title must always occur first,"
"chapters can contain a title followed by any number of paragraphs
and lists" etc.). These objects and relationships are described in
something called a Document Type Definition (DTD), which
forms the heart of the application. Objects in an SGML document
are then tagged according to the rules specified in the DTD.
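As a rough illustration of what such rules amount to, here is a minimal
Python sketch, not actual DTD syntax. It restates the two example
relationships above as patterns over the sequence of child element names.
The table CONTENT_MODELS and the function is_valid are hypothetical names
invented for this sketch, and the assumption that a document consists of
a title followed by one or more chapters is added only to make the
example complete.

```python
import re

# Hypothetical content models, written as regular expressions over the
# sequence of child element names (each name followed by one space).
# A real DTD would state these rules in SGML declaration syntax instead.
CONTENT_MODELS = {
    # Assumption for this sketch: a document is a title, then chapters.
    "document": r"title (chapter )+",
    # A chapter is a title followed by any number of paragraphs and lists.
    "chapter": r"title ((paragraph|list) )*",
}


def is_valid(element, children):
    """Return True if the child sequence satisfies the element's model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model, sequence) is not None
```

For instance, is_valid("chapter", ["title", "paragraph", "list"]) holds,
while a chapter whose title does not come first is rejected.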
Originally used primarily for complex technical documentation
(such as for the U.S. Department of Defense CALS initiative),
SGML has emerged as the format of choice for a great variety of
applications. It is especially powerful (and popular) where
information needs to be collaboratively authored, managed as a
large collection of documents, and delivered electronically
in multiple forms.
Thus it is not surprising that HTML, the format that drives
documents across the World Wide Web, has been defined using
SGML. Formally speaking, HTML is a specific SGML
application, with its own Document Type Definition and a fixed
tag set. Therefore, the strictly technical answer is this: it
doesn't make sense to choose SGML versus HTML -- you're already
using SGML if you're using HTML.
However, what's really behind the questions is something more
fundamental: should HTML be viewed as a general interchange format,
or is it just a lowest common denominator useful primarily for
casual applications? Could it be used as the archival format,
or should we always produce HTML from some other source, such as
an SGML-based information repository? Some of these answers seem
pretty clear. At this point in time, HTML is clearly not rich
enough to be a mainstream repository format. Furthermore, because
each individual application has a variety of unique needs, I would
recommend that HTML never be looked at this way. SGML in general
(i.e. your own DTD) is the right way to build a corporate
information repository; HTML is a specific vehicle to put that
information on the Web. Yet, if HTML is not meant to be reasonably
rich, then a significant amount of its appeal may be lost.
That is, rather than being a "universal solvent" for information
being passed across the Web, it becomes just another intermediate
format driving a particular set of browsing software.
The developers of HTML 3.0 have set out to ensure that it does not
repeat the complexity of CALS and other classic "tech doc"
applications. Having been on the committee that architected the
CALS standard, I think I understand what they mean. Faced with
an extremely large and diverse community of potential users, and
constantly deluged with specific user requests for "just
one more" added feature, it was all too easy to let the standard
get out of control. While significant progress has been made in
modularizing what was once a terribly complicated monolith, new
users and new requirements still demand a cumbersome
registration process within a central CALS "library."
The World Wide Web, of course, has a potential set of users and
requirements that are orders of magnitude larger than even the
U.S. Department of Defense. Furthermore, the idea of attempting
to keep everyone's favorite set of objects and attributes in a
worldwide library is ludicrous. So there is no doubt that
HTML developers are exactly right: the potential to add
unconstrained complexity to HTML is staggering, and the
historical evolution of CALS cannot be the model we follow.
However, it isn't so clear to me that we must go to the other
extreme, making HTML staggeringly simple.
The problem with simplicity is that you lose information. Not the
words of the document themselves (the content), not even the links
that actually make the "web" between documents, but all the other
information necessary to do the really interesting things. This is
what defines the difference between an intriguing but clearly
casual application, and one which actually carries the freight
for corporate and government users:
1. A simple, base-level HTML dooms every document to
be rendered in "lowest common denominator" format.
The rich formatting characteristics of more "serious"
documents are simply not possible, since the only
distinctions available are between a few basic headings
and paragraphs. I may have twenty types of
"paragraphs" in my document (abstract, biography,
preface, note, introduction, glossary term and
definition, etc.), all of which are formatted differently
to enhance readability and communication power. But
on the Web, they will all look the same: I am forced to
call all of them "paragraphs," and, of course, all
objects called "paragraph" will receive the same
format.
2. A simple, base-level HTML takes away most of our
ability to intelligently search Web documents based on
their inner structure.
It is increasingly being recognized that, in terms of the
Information Explosion, the Web is at once our best
friend and our worst nightmare. Everyone can witness
the power of following links from a document that's
been retrieved, but no one really has an answer for
how to find that document in the first place. I can
search "What's New" URLs, or use an (almost
certainly quite limited) indexing service, maybe even
have a computer program automatically look for
specific words and phrases. Across a mushrooming
worldwide collection of information, however, this is
like looking for a needle in a haystack.
Wouldn't it help if we could limit such automated
searches only to document abstracts, or look for
keywords only if they occur in some other structural
context?
3. A simple, base-level HTML makes it very hard to do
anything interesting with a Web document once it's
been downloaded.
As people actually try to use (and reuse) information
they've found on the Web, this problem is just
beginning to be understood. Once I've reduced a
document to its lowest common denominator, I can't
just snap my fingers and get it back. If I've collapsed
twenty kinds of paragraphs into one, there's no good
way to recover the original twenty kinds. I've likened
this to condensing Hamlet into a simple "I Can Read
Too" book suitable for first graders, then trying to use
that version to reconstitute the original text.
Obviously, it won't work.
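Points 2 and 3 above can be made concrete with a small Python sketch.
It is only an illustration under stated assumptions: the sample document,
the element names, and the functions search and flatten are all
hypothetical. A document is modeled as a list of (element type, text)
pairs; a search can be restricted to one element type, and flattening
every type to "paragraph" throws that ability away for good.

```python
# A tiny hypothetical document: (element type, text) pairs.
RICH_DOC = [
    ("abstract", "A proposal for a scalable Web."),
    ("note", "Scalable markup is discussed later."),
    ("paragraph", "Television signals are scalable too."),
]


def search(doc, word, within=None):
    """Return texts containing `word`, optionally limited to one element type."""
    return [
        text
        for kind, text in doc
        if (within is None or kind == within) and word.lower() in text.lower()
    ]


def flatten(doc):
    """Collapse every element type into a generic 'paragraph' (lossy)."""
    return [("paragraph", text) for _, text in doc]


flat_doc = flatten(RICH_DOC)
```

Searching RICH_DOC for "scalable" within "abstract" returns only the
abstract, whereas the same query against flat_doc finds nothing, because
the original element types are no longer recorded anywhere.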
We are thus sitting on the horns of a dilemma. On the one hand,
focusing on accommodating the entire range of needs could lead to
a horribly complicated standard, pieced together like a patchwork
quilt to satisfy the whims of the loudest political factions. But on
the other hand, the danger of focusing on simplicity is that people
may not have the power to do the things that originally attracted
them to the Web. If that happens, there may be a backlash -- a "big
chill" as it were -- when this becomes obvious.
In the face of this, some have suggested doing away with HTML,
or using it only for very simple documents. In their view, Web
documents should be based on SGML in general -- that is, using an
unlimited variety of DTDs at each information provider's
discretion. The argument, which makes some sense, is that any
one DTD -- even HTML -- can't possibly cover everyone's
application requirements. Rather than try to standardize what is
inherently uncontrollable, let's give people the full power of
SGML and let them design their own uses.
Although I am also a strong SGML supporter, I disagree with
this conclusion. For all the power within SGML, I believe that
HTML should remain the primary data format that forms the
backbone of the Web. This is true for three important reasons:
1. People have already accepted HTML as the underlying Web
format. Attempting to shift the momentum away from HTML
would be confusing, counter-productive, and likely to fail.
2. While it is important that more complex documents can be
included on the Web, most documents will continue to be
relatively simple. There is no reason to force people to
use generalized SGML -- requiring them to create or select
application-specific DTDs -- when HTML (with a few small
enhancements) will be adequate for most real-life uses.
3. Even if the Web consisted only of SGML documents, it would
be necessary to invent something akin to HTML as a set of
common semantics to be shared between all the various SGML
applications. Without these common semantics, it will not
be possible to easily navigate, search, format, or reuse
Web documents in a standardized way.
I believe we should maintain an HTML-centric Web, but one in
which HTML and SGML coexist and are used purposefully and
intelligently together. Rather than choosing one over the other,
I would like to see HTML and SGML brought together to form a
truly SCALABLE architecture in which both simple and complex
information can be handled effectively without sacrificing HTML's
current simplicity. In this vision, HTML will scale smoothly both
within itself and into the use of SGML in general.
To clarify what I mean by "scalable", let me propose an acid test
for this property:
1. If I want to send a message consisting only of the
words "Hello World!", it takes no more than two to
three HTML tags.
2. If I want to publish my corporate annual report,
looking reasonably close to its paper version, and
enabling powerful information retrieval keyed to text,
financial tables, and graphics, it takes a lot more tags --
maybe even its own customized SGML application --
but I can still do it on the Web without a problem.
3. If I want to send my corporate annual report, via the
same HTML file, to both a highly sophisticated
graphical browser / retrieval engine, and to a simple
"dumb terminal" viewer, I can do this knowing that
both can easily do something reasonable with the
information. That is, nothing about HTML (or its
relationship to SGML) would make it hard to build a
simplistic viewer that could deal with any document,
no matter how sophisticated the markup. Furthermore,
this should remain true whether I have used straight
HTML or a custom SGML DTD as the underlying encoding.
One might observe that I want to have my cake, eat it too, and still
get a second choice for dessert. This would seem like a lot to ask
for, but it's actually not as hard as it sounds.
I like to solve problems by analogy, and in this case I like the
analogy of a television signal. Like the Web, television allows
an extremely large and diverse audience to access information in
an unprecedented way. It also started out simply,
consisting only of a black-and-white picture and mono sound.
However, technology advanced as time went on, viewers
demanded more, and things began to get complicated. Remember
when a show being "in color" was still new enough to brag about?
Of course, for a while most people still watched such shows in
black-and-white. In the 1980's the same thing happened with "in
stereo." Nowadays, we've added surround sound, closed
captioning, secondary audio program, etc., accessible only to
those with specialized receivers.
Now, if we were to follow the original CALS model in designing
standards for television, we might be requiring everyone to have a
full-blown home theater system, given that the standard has
evolved to incorporate all those features. Alternatively, if the
"keep it simple at all costs" model were followed, we would all be
required to watch everything in simple black-and-white, since
that's the lowest common denominator. Luckily, we did
something smarter with television: we made it modular and
scalable. In fact, I can still receive today's terribly rich and
complex television signal on my 1960's vintage black-and-white
set, or, if I want, can invest in a home theater system or anything
in between. It's completely my choice based on what I think I
really need. For the World Wide Web, simple browsers like Lynx
and Mosaic ought to be adequate for many people and many
applications, but corporate users (and others) will want to add
richer information accessible by users with more sophisticated
browsers, search programs, and downloading requirements.
Concretely, a simple but scalable HTML should rest upon three
fundamental principles:
1. Enable a reasonable framework for navigation and
linking, but impose no more explicit structural
constraints than necessary.
We certainly need nested heading levels, and a few
basic hierarchical constraints, but we should avoid at
all costs building a DTD that tries to enforce desired
document structure of any sort. Our job is always to
describe rather than prescribe. If somebody wants to
check complex (CALS-like) structure using SGML
(e.g. "all lists must have at least two items"), they can
do it with their own, more restrictive DTD. What is
required from HTML is enough basic structure to do
effective top-level navigation, form links, describe
sensible tables, etc. -- and sufficient richness to
describe semantic (not structural) distinctions between
other elements.
2. Use only the absolute minimum number of required
(vs. optional) tags and attributes.
In CALS, it takes a huge number of tags to encode the
message "Hello World" because so many elements are
explicitly required. CALS was not designed to be
scalable; it was designed to enforce military
specifications for complex documents, and to enable a
sophisticated database integrating text with product
data and other information. In HTML we should
require elements and attributes only where things
wouldn't make sense without them. For example, it
should be required that links always have a specified
target. It should not be required that documents
always have a main heading, even though that might
seem like a generally good idea.
3. Most importantly, use an object-oriented approach in
which a small set of standard element classes is
defined (i.e. several levels of headings and standard
paragraphs), but within which users can add arbitrary
complexity. Then add more complex objects -- such
as tables, figures, math, forms, etc. -- as a series of
optional, well-defined layers.
This last point is the heart of my proposal. The idea here
is to define a very simple base set of element classes -- no
more complicated than the base elements in HTML 2.0 -- while
allowing users to create a rich set of additional semantic
distinctions by creating their own "subclasses."
For example, HTML currently contains a "paragraph" element to
cover most forms of text within heading levels. If you would like
a richer set of distinctions (e.g. preface, introduction, abstract,
note, warning, caution, etc.) -- distinctions that may be crucial to
effective information retrieval as well as formatting -- you're out
of luck. Today, all of these elements must be mapped to "paragraph",
because that's all there is to choose from.
Hence the motivation for customized SGML -- it allows you to make
up any element names you wish. But there's another problem. SGML
gives you the ultimate flexibility and scalability but in itself
provides no standardization of even the most basic concepts of
navigation and linking. IT WASN'T MEANT TO. One DTD designer
can call a hyperlink an "a", the next may call it an "xref",
another might choose to call it a "joe" or a "schmoe".
The point is, with fully general SGML there's no way to predict
the relationship between the elements in an arbitrary DTD and the
semantics my browser / searcher / download process needs to know.
Certainly, if I search across the Web (soon the world's largest
virtual document collection), I would like to be able to follow
links and understand navigation hierarchy without having to
analyze each individual use of SGML and figure out how to map it
to a common set of semantics. Having the DTD available -- in the
absence of some additional method of tying things together --
doesn't help me a bit.
So we must do something smarter: generalize HTML to be viewed as
providing the common set of semantics (classes of objects) that can
be understood by any Web browser, search engine or other application.
Using this approach, all individual Web documents would be encoded
in SGML, using whatever DTD is desired by the user, but each SGML
element in the DTD would be mapped to one of the standard HTML
classes. For example, my DTD may have elements called ARTICLE,
ABSTRACT, PREFACE, PARA and JUMP. So that any Web browser could
immediately understand the basic structure of my documents, I would
provide a standard mapping which shows that ARTICLE is a heading
level, ABSTRACT, PREFACE and PARA are all a form of paragraph,
and JUMP is a hyperlink (most likely this would be accomplished
using the "architectural forms" approach from SGML's companion
HyTime standard).
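The mapping idea can be sketched in a few lines of Python. This is not
architectural-form syntax, only an illustration of the effect such a
mapping would have: a generic program that knows nothing about the custom
DTD can still recover headings, paragraphs, and links. The names
CLASS_MAP, Element, and outline, the class labels, and the sample
document are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from one custom DTD's element names to the common
# classes a browser understands. The class labels are illustrative only.
CLASS_MAP = {
    "ARTICLE": "heading",
    "ABSTRACT": "paragraph",
    "PREFACE": "paragraph",
    "PARA": "paragraph",
    "JUMP": "hyperlink",
}


@dataclass
class Element:
    """A node in an already-parsed document tree."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def outline(element, class_map, depth=0):
    """List (depth, common class, text) per element, using only the map."""
    rows = [(depth, class_map.get(element.name, "unknown"), element.text)]
    for child in element.children:
        rows.extend(outline(child, class_map, depth + 1))
    return rows


doc = Element("ARTICLE", "Scalable Web", [
    Element("ABSTRACT", "Why HTML and SGML fit together."),
    Element("PARA", "See the details.", [Element("JUMP", "full paper")]),
])
```

Calling outline(doc, CLASS_MAP) walks the tree without any knowledge of
the DTD beyond the mapping table; elements missing from the table are
reported as "unknown" rather than causing a failure.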
The majority of simple Web documents would continue to use the
simple SGML DTD called HTML, in which the elements map one-for-one
to the common object classes. However, even this DTD could be
made more extensible to allow a measure of scalability without
moving to a separate custom-defined SGML application (most likely
this would be accomplished using some form of class/subclass
attributes within the base HTML DTD).
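One way to picture the class/subclass idea is the Python sketch below,
which uses the standard library's html.parser module. It assumes, purely
for illustration, that the subclass is carried in an attribute named
"class" on the paragraph element; the whitepaper does not fix a name or
syntax. ParagraphCollector, collect, simple_view, and rich_view are
hypothetical names. The same markup serves both a simple viewer, which
ignores the subclass, and a richer one, which uses it.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect [subclass, text] for each <p>; subclass is None if absent."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Assumption: the subclass lives in a "class" attribute.
            self.paragraphs.append([dict(attrs).get("class"), ""])
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][1] += data


def collect(markup):
    """Parse markup and return a list of (subclass, text) tuples."""
    parser = ParagraphCollector()
    parser.feed(markup)
    parser.close()
    return [(cls, text.strip()) for cls, text in parser.paragraphs]


def simple_view(markup):
    """A 'dumb terminal' viewer: ignores subclasses entirely."""
    return [text for _, text in collect(markup)]


def rich_view(markup):
    """A richer viewer: labels paragraphs that carry a subclass."""
    return [f"[{cls.upper()}] {text}" if cls else text
            for cls, text in collect(markup)]


SAMPLE = '<p class="abstract">A scalable Web.</p><p>Plain text.</p>'
```

Here simple_view(SAMPLE) yields the two paragraph texts unchanged, while
rich_view(SAMPLE) marks the first one as an abstract. Neither viewer
needs a different file.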
So how do SGML and HTML really fit together? HTML is an
SGML application, a specific way to use SGML optimized for
simplicity and consistency in distributing electronic data over the
Web. It is not in competition with SGML, nor should it be
confused with a format suitable for archiving source information
in a corporate repository. However, it has the potential of being a
really smart use of SGML, taking advantage of SGML's flexibility
without giving up a common backbone structure that any browser
can readily interpret. In fact, it is precisely this combination of
simplicity and extensibility that will enable the Web to be viable
for serious online publishers.
* * *