XML: It's the Future of HTML

[This local archive copy mirrored from the canonical site: http://www.sun.com/980602/xml/; links may not have complete integrity, so use the canonical document at this URL if possible.]

by Todd Freter

Those who watch the XML (Extensible Markup Language) phenomenon have noted its rapid and impressive advances. As the first and second articles in this series have attempted to convey, XML is gaining what the marketers call "mindshare" at an amazing rate. Web futurists are satisfied that XML is The Next Thing.

But what about all of the HTML data that constitutes the web? Ever since the web itself was The Next Thing, HTML has been the target format for content developers around the world. If you master HTML, you can reach a worldwide audience.

The Demise of HTML?

Many arguments in support of XML have taken strength from a critical appraisal of HTML. HTML's limitations have fueled the call for and interest in a technology like XML. After all:

HTML is a fixed tag set. It only describes documents of a single type.
HTML data is hard to process. Browsers have permitted all manner of HTML messiness to pass, unchecked, into semi-permanent residency in cyberspace.
HTML documents that aspire to function like applications are clogging the internet with client-to-server traffic.

XML will change all of this, to be sure. But no one can realistically expect the volumes of active, useful HTML pages to become irrelevant overnight. In fact, HTML has an important role to play in the brave new world of XML. But what is that role?

W3C to the Rescue

To answer the question of whither HTML, the W3C (World Wide Web Consortium) recently convened a workshop about HTML's future. As the organization that has maintained and furthered the HTML standard since the IETF released the HTML 1.0 specification, the W3C has an abiding interest in HTML, as have many of the W3C's members. On May 4 and 5, many of those members and some unaffiliated but interested parties attended the W3C's "Future of HTML" workshop near San Francisco.

The W3C is undoubtedly rich in ideas for HTML, but the purpose of this workshop was for the W3C to listen to its members and to determine what actions best support its members' needs. What emerged from this workshop is a surprising and, many may agree, a positive program for HTML.

Does HTML Have a Future?

HTML certainly has its past and current relevance. But the enthusiastic acceptance of XML, together with the fact that the W3C dissolved its HTML working group after publishing the HTML 4.0 recommendation, may have left HTML's status as an ongoing and meaningful data format in question.

Two workshop participants who represented ISO (the International Organization for Standardization) brought this question home. In the view of ISO, HTML has become a de facto standard of sufficient heft that ISO has given HTML its standards treatment. "ISO HTML," based on HTML 4.0, is described in two key documents:

Together these documents codify a rigorous view of HTML, even if HTML is not always implemented as ISO describes it. ISO has standardized HTML in the conviction that HTML will persist for at least 25 years. Given ISO's long view of the situation, HTML's future has at least one substantial vote of confidence. Moreover, having made HTML into a standard, ISO expects the W3C to remain responsible for HTML.

But ISO is a conservative organization that records existing standards; it stands down from the task of driving innovation. How does ISO's or any other long view of HTML square with the innovative force of XML?

HTML or XML?

Many enterprises and consortia have been wrestling with HTML's perceived limitations, and those groups look to XML as a means for escaping them. C|NET's representative called for HTML, augmented with CSS (Cascading Style Sheets) for style and layout, to persist only as a machine-generated output format for web documents. In this view, documents would be authored in some other format, perhaps XML, SGML, or some proprietary format more tractable than HTML. C|NET's view is reflected in the practices of many companies that publish technical documentation on the web (such as Sun's own documentation site).

Other industry consortia have sponsored XML-based languages better suited to the needs of their information. Mathematicians have developed MathML, and chemists have advanced CML (Chemical Markup Language). Both use XML to define their content models. These are only two examples; many others are in the works.

Manufacturers of cellular phones, PDAs, and other small information devices have taken a different approach. They have championed Compact HTML in an effort to pare away the HTML features that are better suited to large-screen user agents such as browsers. While framed in terms of HTML today, this effort could easily turn to XML for a content model more closely suited to the information and devices that mobile HTML is meant to serve.

But HTML is a unitary standard that requires a W3C-convened working group to maintain and advance it. The process is not glacial, but neither is it instantaneous. Unlike HTML, XML enables users to develop the content models appropriate to their applications much more quickly. What is the motivation to stand by HTML when so much of the world is looking to XML to solve its problems?

HTML and XML?

Some powerful motivations to preserve HTML exist, despite XML's appeal. The obvious one is the need to support the many millions of active web pages. Beyond that, HTML is well understood as a format for authoring. The HTML Writers Guild's representatives were not the only workshop participants to make this point. With all its problems, HTML is nonetheless a highly successful lingua franca for expressing ideas on the web. HTML has given rise to far too powerful a communications medium to cede quickly and gracefully to something else.

Is XML an irresistible force and HTML an immovable object? How can the stability of HTML accommodate the fast-appearing XML-based tag sets like CML or MathML? The consortia that develop their own XML-based information models also complain that HTML is already too big and complex.

It would seem that HTML and XML are on conflicting courses. But do they need to be?

XML in HTML?

One serious proposal is for HTML documents to support the inclusion and processing of XML data. This would allow an author to embed within a standard HTML document some well delimited, well defined XML object. The HTML document would then be able to support some functions based on the special XML markup. This strategy of permitting "islands" of XML data inside an HTML document would serve at least two purposes:

To enrich the content delivered to the web and support further enhancements to the XML-based content models
To enable content developers to rely on the proven and known capabilities of HTML while they experiment with XML in their environments.

The result (for markup mavens) would look like this:


<html>
<body>
<!-- some typical HTML document with
<h1>, <h2>, <p>, etc. -->
<xml>
<!-- The <xml> tag introduces some XML-compliant
markup for some specific purpose. The markup is
then explicitly terminated with the </xml> tag.
The user agent would invoke an XML processor
only on the data contained in the <xml></xml>
pair. Otherwise the user agent would process
the containing document as an HTML document. -->
</xml>
<!-- more typical HTML document markup -->
</body>
</html>

User agents that normally process HTML data would have to swap in an XML processor to render that "island" of information between the <xml> and </xml> tags.
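
For readers who want to see the hand-off in motion, here is a minimal sketch in Python of what such a user agent might do. It is an illustration under stated assumptions, not a description of any real browser: the <order> vocabulary, the sample page, and the parse_islands helper are all invented for this example, and each island is assumed to hold exactly one root element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: ordinary, loosely written HTML with one XML island.
# The <order> vocabulary is invented for this example.
PAGE = """<html>
<body>
<h1>Order status</h1>
<p>Your order is on its way.
<xml>
<order id="A-100">
  <part-number>805-5412</part-number>
  <quantity>2</quantity>
</order>
</xml>
<p>Thank you for your business.
</body>
</html>"""

# Locate each island; everything outside it stays with the HTML processor.
ISLAND = re.compile(r"<xml>(.*?)</xml>", re.DOTALL | re.IGNORECASE)


def parse_islands(page):
    """Return one parsed XML element per <xml>...</xml> island in the page."""
    return [ET.fromstring(m.group(1).strip()) for m in ISLAND.finditer(page)]


islands = parse_islands(PAGE)
order = islands[0]
print(order.tag, order.get("id"), order.findtext("part-number"))
# prints: order A-100 805-5412
```

Note that the surrounding page is never handed to the XML processor. Its unclosed <p> tags would be rejected there, which is exactly why the island needs explicit delimiters.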

Another proposal, which met with more skepticism, is the idea of "sprinkling" XML data within an HTML document. This idea has been tossed off in the popular press without consideration of its fuller implications, and many people consider it more problematic than practical. For markup specialists, though, this is what XML "sprinkles" might look like:


<html>
<body>
<p>One would sprinkle some XML in a
document to indicate that
<part-number>805-5412</part-number>
requires special treatment because
it is a part number.
<p>Processing would be less straightforward
than for XML islands.
<!-- We at Sun contend that for these
sprinkles of XML, a different mechanism,
already in HTML 4.0, is more appropriate:
<span class="part-number">805-5412</span>
accomplishes a similar effect and does not
create processing challenges. -->
</body>
</html>
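
To suggest why the <span class="part-number"> alternative creates no processing challenges, here is a small sketch in Python that uses only the standard library's HTML parser. The ClassCollector name and the sample sentence are assumptions made for illustration. The point is simply that an ordinary HTML processor can already pick out the marked-up part number without knowing any new tags.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the text of every <span> that carries a given class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # greater than 0 while inside a matching <span>
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted in classes:
            self.depth += 1
            if self.depth == 1:
                self.found.append("")   # start a new captured string

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


collector = ClassCollector("part-number")
collector.feed('<p>Order <span class="part-number">805-5412</span> today.')
collector.close()
print(collector.found)
# prints: ['805-5412']
```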

But the controlled embedding of XML objects inside an HTML document suggests a practical means of mixing the supposedly immiscible HTML and XML.

HTML as XML?

Another proposal, more appropriate for long-term implementation than the XML "islands" in HTML, is to redo HTML as an XML application. That is, rewrite the HTML specification so that HTML documents must, like XML, be well formed and may optionally be valid. The reasons that HTML documents are not well formed today are technically dense and need not be elaborated here; they are a function of HTML's history. However, the consensus of W3C members at the "Future of HTML" workshop strongly favored this option.
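
A short sketch in Python may make the well-formedness requirement concrete. The two sample strings and the is_well_formed helper are invented for this illustration. The check covers well-formedness only; validity would also require a DTD and a validating parser, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Typical HTML of today: mismatched case, unclosed <p>, bare <br>.
LOOSE = "<HTML><body><p>First paragraph<p>Second<br></body></html>"

# The same content written as an XML application would require it.
STRICT = "<html><body><p>First paragraph</p><p>Second<br/></p></body></html>"

print(is_well_formed(LOOSE))    # prints: False
print(is_well_formed(STRICT))   # prints: True
```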

To support HTML in applications like XML browsers, a tool to convert today's amorphous, non-rigorous HTML documents into well formed XML documents is required. The W3C is working on such a tool right now; watch for details about it in the future.
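
What might such a conversion involve? The following Python sketch is a deliberately small illustration and makes no claim to resemble the tool mentioned above. The Normalizer class, its VOID list, and the rule that a new <p> closes an open one are simplifying assumptions. Comments and document type declarations are dropped, and duplicate attributes are not handled.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption: a short, incomplete list of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Normalizer(HTMLParser):
    """Re-emit loosely written HTML as well-formed XML (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of the output document
        self.stack = []    # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Simplification: a new <p> closes any <p> that is still open.
        if tag == "p" and "p" in self.stack:
            self._close_through("p")
        # Quote every attribute; valueless ones repeat their own name.
        rendered = "".join(
            ' %s="%s"' % (name, escape(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything opened since.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def _close_through(self, tag):
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def result(self):
        self.close()
        while self.stack:   # close whatever the author left open
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


normalizer = Normalizer()
normalizer.feed(
    "<HTML><BODY><P ALIGN=center>Fish & chips<BR>"
    "<P>Second<IMG SRC=a.gif></BODY></HTML>"
)
xml_text = normalizer.result()
root = ET.fromstring(xml_text)   # raises ParseError if not well formed
print(xml_text)
```

The parser lowercases tag and attribute names, so the output uses one consistent case, and the final ET.fromstring call serves as a check that an XML processor accepts the result.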

W3C at the Ready

So where does that lead? It leads to the workshop participants' support of a resolution that the W3C reconvene an HTML working group. The working group's scope would include, among others, objectives suggested throughout this essay:

XML objects in traditional HTML documents
HTML as an application of the XML standard
HTML as a modular rather than unitary content model

The goal would be HTML as a robust, well known data format for documents on the web, but with the benefits of extensibility, processability, and manageability that elude HTML documents today.