[This archive copy is from "Overview: XML, HTML, and all that", by Jon Bosak, Sun Microsystems, presented on April 11, 1997. Note that this text-only document is greatly impoverished; see the .ZIP archive if possible.]
Overview: XML, HTML, and all that
April 11, 1997
What is text, really?
Structured markup: the basic idea
Why structured markup?
The separation of presentation from structure and content makes possible:
- multiple targets (both mechanical and human)
- reusable information
- intelligent downstream document processing
- large-scale information management
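As a rough illustration of the "multiple targets" point, the sketch below renders one structured source for a human reader and for a program. The `memo` vocabulary and both function names are invented for this example; they do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured source: the tags say what each piece of content is,
# not how it should look.
SOURCE = """
<memo>
  <to>Engineering</to>
  <subject>Build freeze</subject>
  <body>The build freezes on Friday.</body>
</memo>
"""

def to_plain_text(memo):
    """Human target: a plain-text rendering of the memo."""
    return "To: {}\nSubject: {}\n\n{}".format(
        memo.findtext("to"), memo.findtext("subject"), memo.findtext("body"))

def to_record(memo):
    """Mechanical target: a dictionary a downstream program can consume."""
    return {child.tag: child.text for child in memo}

memo = ET.fromstring(SOURCE)
print(to_plain_text(memo))
print(to_record(memo))
```

Neither rendering is stored in the source, so a third target can be added later without touching the data.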
Is HTML structured markup?
- No, not really
- HTML maximizes ease of use by providing one general hardwired tag
set for all applications and predefining certain behaviors for each
- HTML is perfectly suited to the applications for which it was
designed, but it was not designed to handle all applications
- In particular, HTML was not designed to handle very large or
complex collections of data that need to be used in different ways
or maintained over a long period of time
Specific features missing from HTML
- HTML does not provide a way to validate document data structures
or impose editorial control in projects with multiple authors
- It is difficult or impossible in the general case to reliably
generate navigational aids from HTML documents
- As a result, navigation usually has to be implemented by adding
handcrafted hypertext links
- HTML browsers have no concept of entity management (no mechanism
for modular reuse)
- With just a general-purpose tag set, context searching or other
behavior based on semantic distinctions is difficult or impossible
- HTML cannot easily be made to model data structures residing in
HTML is not the optimum data format for database interchange or
certain kinds of large-scale commercial publishing.
What is SGML?
- Standard Generalized Markup Language: the international standard
for structured document interchange, ISO 8879 (1986)
- Developed out of a search for a universal typesetting language
begun in the late 1960s
- Descriptive, not procedural -- the procedures are left to
- Not a single markup language, but a metalanguage for the
specification of an unlimited number of markup languages, each
optimized for a particular category of documents
- The SGML description of a markup language is called a Document
Type Definition (DTD)
- In SGML, the DTD is a required part of the document
Major industry DTDs (markup languages)
| ATA 2100  | aircraft industry                      |
| SAE J2008 | automobile manufacturing               |
| TMC T2008 | truck manufacturing                    |
| EDGAR     | Securities and Exchange Commission     |
| ISO 12083 | journal, book, and magazine publishing |
| ICADD     | publishing for the print-disabled      |
| TEI       | academic and scholarly publishing      |
| HTML      | World Wide Web                         |
Advantages of generic SGML
- Based on international standards -- no vendor wars
- Fully extensible -- no tag limitations
- Supports formal validation and controlled authoring
- Highly structured -- can model any kind of data
- Links and navigational aids can be generated directly from the
structure of documents
- Context searching increases the speed of user access 10-20 times
- Documents can easily be reused for different purposes
- All versions of a document (printed and online) can be generated
from the same source
- User-selectable stylesheets allow dynamically configurable views
- System administration of large document repositories is vastly
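To make the claim about generated navigational aids concrete, here is a minimal sketch that derives a table of contents purely from document structure. It uses XML and Python's standard library rather than a full SGML system, and the `book`/`chapter`/`section` vocabulary is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; no hand-written links or numbering appear in it.
DOC = """
<book>
  <chapter><title>Introduction</title>
    <section><title>Scope</title></section>
  </chapter>
  <chapter><title>Markup</title>
    <section><title>Elements</title></section>
    <section><title>Attributes</title></section>
  </chapter>
</book>
"""

def table_of_contents(book):
    """Build numbered TOC lines from the chapter/section nesting alone."""
    lines = []
    for c, chapter in enumerate(book.findall("chapter"), start=1):
        lines.append("{}. {}".format(c, chapter.findtext("title")))
        for s, section in enumerate(chapter.findall("section"), start=1):
            lines.append("  {}.{} {}".format(c, s, section.findtext("title")))
    return lines

for line in table_of_contents(ET.fromstring(DOC)):
    print(line)
```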
Why not just put SGML on the Web?
SGML does provide the key features needed to support future
large-scale data-intensive Web applications...
- Late binding of presentation
...but SGML is too big and far too complex for most Web
software developers (not to mention site administrators).
The attempt to put SGML on the Web
W3C activity "Generic SGML on the Web" suggested at WWW5 in
Paris and initial participants recruited at SGML Europe in Munich
"The goal of the W3C SGML activity is to enable generic
SGML to be served, received, and processed on the Web. As in the case
of HTML, the implementation of SGML on the Web will require attention
not just to structure and content, but also to the standardization of
linking and display functions."
- Chunking (outsourced to SGML Open)
What actually happened
- July-August 1996: assembled 60 of the world's top SGML experts to
form W3C SGML WG; formed 11-member W3C SGML ERB to make decisions and
steer the activity
- September-November 1996: threw out about 80 percent of the SGML
- Result: Part 1 of the XML specification (the "red book")
XML Part 1 is a self-contained easy-to-implement subset of SGML for
use on the Web.
Current working draft of Part 1: http://www.w3.org/pub/WWW/TR/WD-xml-lang-970331.html
XML syntax vs. HTML syntax
- Freely extensible
- Arbitrarily deep structure
- Optional validation
- Strong separation of content and presentation
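The first two points can be sketched with a generic parser that has never seen the tag names it is given. The `assembly`/`part`/`bolt` vocabulary below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, application-specific tags nested to an arbitrary depth.
PARTS = "<assembly><part><part><part><bolt size='M6'/></part></part></part></assembly>"

def depth(element):
    """Return the nesting depth of an element tree (a lone element has depth 1)."""
    return 1 + max((depth(child) for child in element), default=0)

root = ET.fromstring(PARTS)
print([e.tag for e in root.iter()])  # tag names in document order
print(depth(root))
```

The parser needs no built-in knowledge of the tag set; it only requires that the elements nest properly.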
XML vs. SGML
- Vastly simpler (a very small conforming subset of SGML)
- Vastly easier to implement
- Design is formal and concise ("Lexable and Yaccable")
- Retains most of SGML's power as a document delivery medium
- DTD is optional
- Can (and will) function as an authoring format, but functions best
as a delivery format from databases (primary design goal)
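One consequence of the optional DTD can be sketched with a non-validating parser: a document is accepted as long as it is well-formed, with no DTD supplied. Python's `xml.etree.ElementTree` is used here as a convenient stand-in; the helper name `is_well_formed` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>x</b></a>"))   # properly nested
print(is_well_formed("<a><b>x</a></b>"))   # overlapping tags
print(is_well_formed("<p>unclosed"))       # missing end tag
```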
Current overview of the activity
The XML activity can be summed up as the adaptation of existing
international publishing standards for use on the Web.
- Phase I: Subset SGML (ISO 8879) for easy implementation
- Phase II: Design standard hypertext mechanisms based on HyTime
- Phase III: Design a standard stylesheet language for arbitrarily
structured information by subsetting DSSSL (ISO/IEC 10179)
Is XML intended to replace HTML?
In a word: No.
- HTML is well-suited to a wide variety of current applications.
It isn't broken!
- XML provides a more complex set of tools for addressing a
different set of applications:
- Database exchange
- Distribution of processing to clients
- Client-side manipulation of views into the data
- Customization of information by intelligent agents
- Management of document collections
Major XML application area #1: Database interchange
Example: Health care data
Bogus Web solution
Real Web solution
Why HTML can't handle the health care example
- The HTML tag set is too limited to represent or differentiate
between the multitude of database fields in the mixture of documents
making up the patient's medical history.
- HTML is incapable of representing the variety of structures in
- HTML lacks any mechanism for checking the data for structural
validity before the receiving application attempts to import it into
the target database.
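The last point, checking structure before import, might look like the following sketch. It is a hand-written check and not DTD validation, and the `patient` record layout and the `REQUIRED_CHILDREN` rule table are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: each element name maps to the children it must contain.
REQUIRED_CHILDREN = {
    "patient": ["name", "birthdate", "allergies"],
}

def structural_errors(element, path=""):
    """Collect a message for every required child that is missing."""
    here = path + "/" + element.tag
    errors = []
    for name in REQUIRED_CHILDREN.get(element.tag, []):
        if element.find(name) is None:
            errors.append("{}: missing <{}>".format(here, name))
    for child in element:
        errors.extend(structural_errors(child, here))
    return errors

good = ET.fromstring(
    "<patient><name>A. Smith</name><birthdate>1950-01-01</birthdate>"
    "<allergies/></patient>")
bad = ET.fromstring("<patient><name>A. Smith</name></patient>")

print(structural_errors(good))
print(structural_errors(bad))
```

A receiving application could refuse to import any record for which the error list is non-empty.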
Generalized data exchange
The role of a hub format
Other applications fitting the database interchange model
- Legal publishing
- The government drug approval process
- Collaborative CAD/CAM efforts
- Collaborative calendar management across different systems
- Any corporate network application that works across databases,
especially where policies must be enforced: purchase orders, expense
- Exchange of information between players in any broker-organized
business: insurance, securities, banking, etc.
Major XML application area #2: Distributed processing
Example: semiconductor data
Why HTML can't handle the semiconductor example
- It requires industry-specific markup that cannot be
implemented within the confines of the fixed HTML tag set.
- It requires that the data representation be platform- and
vendor-independent so that data from a variety of sources can be used
to drive a variety of distributed applications.
This is a widely applicable model!
Other applications fitting the distributed processing model
- Design applications where a designer considers various alternatives:
electronics, engineering, architecture, menu planning, etc.
- Scheduling applications where a customer explores various
possibilities: airlines, trains, buses, and subways; restaurants,
movies, plays, and concerts.
- Commercial applications that allow consumers to explore
alternatives: real estate, automobiles, appliances, etc.
- The entire spectrum of educational applications.
- The entire spectrum of customer-support applications, including
Major XML application area #3: User views of the data
Example: Views of our Solaris documentation, server-mediated...
... and client-mediated
Other applications that need user-controlled views
- An installation sheet that carries warnings in multiple languages
can be made to show just the ones in the language selected by the
- A document containing many annotations can be switched from a mode
that shows only the text, to a mode that shows only the annotations,
to a mode that shows both, just by making a menu selection.
- A phone book sorted by last name can instantly be changed into a
phone book sorted by first name.
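The phone book case can be sketched as follows: the data is delivered once, and the view is recomputed locally on request. The `phonebook` vocabulary and the `view` function are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical phone book delivered to the client as structured data.
PHONE_BOOK = """
<phonebook>
  <entry><first>Carol</first><last>Adams</last><phone>555-0101</phone></entry>
  <entry><first>Alice</first><last>Young</last><phone>555-0102</phone></entry>
  <entry><first>Bob</first><last>Miller</last><phone>555-0103</phone></entry>
</phonebook>
"""

def view(book, key):
    """Return display lines sorted by the given child element ('first' or 'last')."""
    entries = sorted(book.findall("entry"), key=lambda e: e.findtext(key))
    return ["{} {}: {}".format(e.findtext("first"), e.findtext("last"),
                               e.findtext("phone")) for e in entries]

book = ET.fromstring(PHONE_BOOK)
print(view(book, "last"))
print(view(book, "first"))
```

Switching views requires no second request to the server, because the structure needed for re-sorting travels with the data.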
Major XML application area #4: Web agents
Example: the 500-channel TV guide
Once again, a category of applications depends on the ability to
standardize on a form of data representation for a particular industry
or problem domain.
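A toy version of such an agent is sketched below. It assumes a shared listing format that every channel would publish; the `guide`/`program` vocabulary and its attributes are hypothetical and are not an actual industry standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical standardized listing format for program data.
GUIDE = """
<guide>
  <program channel="12" start="20:00" genre="news"><title>Evening Report</title></program>
  <program channel="207" start="20:00" genre="film"><title>Late Feature</title></program>
  <program channel="431" start="21:30" genre="film"><title>Classic Cinema</title></program>
</guide>
"""

def find_programs(guide, genre):
    """Return (channel, start, title) for every program of the given genre."""
    return [(p.get("channel"), p.get("start"), p.findtext("title"))
            for p in guide.iter("program") if p.get("genre") == genre]

print(find_programs(ET.fromstring(GUIDE), "film"))
```

The agent works only because every listing uses the same element and attribute names, which is the standardization the slide refers to.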
Hyperlinking in XML
(Tim Bray will present this.)
How do we make XML documents do something?
- Stylesheet-based approaches
- Programmatic approaches
CSS (Cascading Style Sheets)
- Easy syntax
- Good match for HTML
- Will be widely used for simple XML documents
- Embedded CSS style attributes will be a common feature in
Come to this afternoon's XML technology demo session to see what
can be done with CSS and XML.
Limitations of CSS in complex applications
- CSS cannot grab an item (such as a chapter title) from one place
and use it again in another place (such as a page header).
- CSS has no concept of sibling relationships. For example, it is
impossible to write a CSS stylesheet that will render every other
paragraph in bold.
- CSS is not a programming language; it does not support decision
structures and cannot be extended by the stylesheet designer.
- CSS cannot calculate quantities or store variables. This means,
at the very least, that it cannot store commonly used parameters in
one location that is easy to update.
- CSS cannot generate text (page numbers, etc.)
- CSS uses a simple box-oriented formatting model that works for
current Web browsers but will not extend to more advanced applications
of the markup, such as multiple column sets.
- CSS is oriented toward Western languages and assumes a horizontal
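For contrast, here is a sketch of what a programmatic approach can do with three of the items listed above: reusing the chapter title in a header, emphasizing every other paragraph, and generating numbers. It is plain Python, not DSSSL or CSS, and the `chapter` vocabulary and the asterisk convention for bold are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

CHAPTER = ("<chapter><title>Markup</title>"
           "<para>One.</para><para>Two.</para><para>Three.</para></chapter>")

def render(chapter):
    """Render a chapter as text lines, marking even-numbered paragraphs as bold."""
    title = chapter.findtext("title")
    # The title is fetched once and reused in two places.
    lines = ["[header: {}]".format(title), title.upper()]
    for number, para in enumerate(chapter.findall("para"), start=1):
        text = para.text
        if number % 2 == 0:          # decision based on sibling position
            text = "**{}**".format(text)
        lines.append("{}. {}".format(number, text))  # generated numbering
    return lines

for line in render(ET.fromstring(CHAPTER)):
    print(line)
```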
Document Style Semantics and Specification Language (DSSSL)
- The International Standard for specifying the formatting of
structured documents (ISO/IEC 10179:1996)
- Two distinct independent languages:
- Transformation language
- Style language
- Each of which uses two other languages defined by the standard:
- The DSSSL expression language
- The Standard Document Query Language (SDQL)
- DSSSL syntax is derived from Scheme
- Functional (side-effect free)
- Turing complete
Some DSSSL stylesheet snippets
(define monospace-font-family "Courier")
(declare-initial-value writing-mode 'left-to-right)
(declare-initial-value font-size 12pt)
(declare-initial-value line-spacing 14pt)
(declare-initial-value font-family "Helvetica")
(define paragraph-indent 24pt)
first-line-start-indent: (if (first-sibling?)
(element (chapter title)
  (literal "Chapter "
           (format-number (ancestor-child-number) "1")))
(define (expt b n)
  (if (= n 0)
      1
      (* b (expt b (- n 1)))))
DSSSL: Mixed languages
DSSSL: Mixed scripts
DSSSL: Top-to-bottom languages
DSSSL: Margin attachments
DSSSL: Rotated text areas
DSSSL: Multiple columns
DSSSL: Variant column flows
DSSSL: Column spans and zones
DSSSL: Synchronized columns
DSSSL: Asian language features
- glyph annotation
- multi-line inline notes
- emphasizing marks
DSSSL: Advanced mathematics formatting
DSSSL demonstration: Jade
XML/Java demonstration: Jumbo