[This local archive copy is from the official and canonical URL, http://metalab.unc.edu/pub/sun-info/xml/why/4myths.htm, 1999-01-29; please refer to the canonical source document if possible.]

Media-Independent Publishing:
Four Myths about XML

Jon Bosak
Sun Microsystems

This article first appeared in IEEE Computer (Vol. 31, No. 10, October 1998, pp. 120-122).

Hailed as "the emerging technology of the year" after it was endorsed by the World Wide Web Consortium (W3C), XML burst onto the scene in February 1998. It was called the successor to HTML and, according to some, the future lingua franca for the exchange of structured data.

As XML emerged from the obscurity of its W3C beginnings, it was perhaps inevitable that this new data format would generate misconceptions as fast as it attracted enthusiasts. In this column, I'd like to head off some myths about XML before they become permanent misunderstandings.

MYTH 1: XML IS A CONSPIRACY LED BY MICROSOFT

XML is a conspiracy, but not Microsoft's. In fact, XML was produced by a group of markup language experts organized by Sun Microsystems to develop a form of the venerable ISO standard, SGML, for use on the Web.

It's true that Microsoft was a major participant in the XML effort, but so were a number of other large companies (Sun, Hewlett-Packard, Netscape, Adobe, and Fuji Xerox) as well as key SGML vendors and systems integrators (ArborText, Inso, SoftQuad, Grif, Texcel, and Isogen), representatives of the academic community (NCSA and the Text Encoding Initiative), early adopters (DataChannel and Vignette), and one of the world's leading SGML experts, James Clark, who is technical lead for the W3C's SGML activity.

The amazing thing about XML is that all of these people and organizations set aside personal and corporate agendas to cooperate in the construction of a genuinely open standard, driven entirely by user needs. These needs include:

Extensibility, to define new tags as needed.
Structure, to model data to any level of complexity.
Validation, to check data for structural correctness.
Media independence, to publish content in multiple formats.
Vendor and platform independence, to process any conforming document using standard commercial software or even simple text tools.

While I can't help admiring Microsoft's masterful adoption and marketing of the XML concept, XML doesn't belong to Microsoft. XML belongs to the world.

MYTH 2: XML IS AN EXTENSION OF HTML

Early generalizations about XML have led many to believe that XML is just a method for extending HTML by adding new tags. In fact, XML and HTML exist in entirely different layers of markup technology. HTML is a tag language (more formally, a markup language) -- a set of standard delimiters with standardized meanings that can be put into documents in order to indicate the role of particular pieces of the document. For example, anything between <H2> and </H2> in an HTML document is understood to be a second-level document head.

Tag languages

People whose experience of tag languages is limited to the Web are often surprised to learn that HTML is just one of a large number of standardized tag languages that have been developed over the years for use within particular industries. For example, the aircraft industry has a tag language for aircraft maintenance manuals called ATA-2100, the semiconductor industry has a tag language for circuit data called PCIS, and the computer industry has a tag language for software documentation called DocBook.

Some of these markup languages have been in use longer than HTML, and many of them take different approaches to the problems they solve. For example, consider this fragment of HTML:

  <H2>Second-level heading</H2>
  <P>This is a passage of text that probably
  belongs to the heading immediately above.</P>

An analogous fragment of DocBook might look like this:

  <SECT2>
    <TITLE>Second-level heading</TITLE>
    <PARA>This is a passage of text that certainly
    belongs to the heading above. We know this
    because both are contained in the same SECT2
    element.</PARA>
  </SECT2>
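
The practical value of that containment can be sketched in a few lines of code. The Python fragment below is only an illustration, not part of DocBook or of any standard tool: it assumes the generic XML parser in Python's standard library, and the function name paragraphs_by_title is hypothetical. It recovers the association between a heading and its paragraphs directly from the SECT2 element, with no guessing about what "probably" follows what.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the DocBook-style fragment shown above.
DOCBOOK_FRAGMENT = """
<SECT2>
  <TITLE>Second-level heading</TITLE>
  <PARA>This is a passage of text that certainly
  belongs to the heading above.</PARA>
</SECT2>
"""

def paragraphs_by_title(xml_text):
    """Map each SECT2 title to the paragraphs contained in the same SECT2."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    # root.iter() also visits the root itself, so a lone SECT2 is handled.
    for section in root.iter("SECT2"):
        title = section.findtext("TITLE", default="").strip()
        # Collapse line breaks and indentation inside each paragraph.
        paragraphs = [" ".join(p.text.split())
                      for p in section.findall("PARA") if p.text]
        result[title] = paragraphs
    return result

print(paragraphs_by_title(DOCBOOK_FRAGMENT))
```

The HTML fragment offers no equivalent query: a program would have to infer that the P element belongs to the preceding H2 from its position alone.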

While these various tag languages have their differences, all of them, including HTML, are similar in three ways:

Each one defines a standard set of tags with standardized meanings and standardized rules of use -- in other words, a standardized grammar.
Each one is designed to work best for a particular category of documents or data.
All use the 12-year-old international text processing standard, SGML, to define their standard tag sets and grammars.

All of these languages look similar, too, because they all use the familiar angle brackets inherited from SGML's reference concrete syntax.

The SGML layer

It should be clear from this description that SGML itself belongs to a conceptual layer that's different from any of the individual markup languages that are defined using SGML. The difference between SGML and specific markup languages is often summed up by saying that SGML is a metalanguage rather than a language. This is a rather loose characterization. SGML is not as abstract as a true metalanguage like Backus/Naur Form (BNF), which is used to define programming languages. Nevertheless, calling SGML a metalanguage does get the point across: SGML is not a particular tag language; it's a language for defining tag languages.

The key factor to understand about XML is that it belongs to the SGML layer, not the HTML layer. XML is a simplified form of SGML, not an extended form of HTML. The difference between XML and SGML is that the designers of XML took out a number of advanced SGML features that make a full SGML parser difficult to implement in a Web browser.

But the basic idea remains the same: XML is a technology that allows the creation of an unlimited number of different markup languages for different purposes. The point of XML -- and the reason that it's becoming so popular -- is that all the various special-purpose languages that can be defined using it can be parsed by a single standardized processor small enough to be built into every Web browser.
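
A small sketch may make the "single standardized processor" point concrete. It assumes Python's standard-library XML parser as a stand-in for such a processor; the two vocabularies (an invoice and a recipe) and the function name outline are invented for illustration. The same code reads both documents without knowing either tag set in advance.

```python
import xml.etree.ElementTree as ET

# Two made-up vocabularies; the parser has no prior knowledge of either.
INVOICE = "<invoice><customer>Acme</customer><total currency='USD'>42.50</total></invoice>"
RECIPE = "<recipe><title>Toast</title><step>Slice bread</step><step>Toast it</step></recipe>"

def outline(xml_text):
    """Return (depth, tag) pairs for every element, in document order."""
    def walk(element, depth):
        yield depth, element.tag
        for child in element:
            yield from walk(child, depth + 1)
    return list(walk(ET.fromstring(xml_text), 0))

# One generic routine handles both tag languages.
for document in (INVOICE, RECIPE):
    for depth, tag in outline(document):
        print("  " * depth + tag)
```

What the parser cannot supply is any notion of what an invoice or a recipe means; it delivers structure only.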

People who don't understand this distinction tend to jump to the conclusion that an XML-aware application will allow them simply to sprinkle new tags throughout their HTML documents. Attempts to "extend" HTML this way will lead to an even worse mess than we've already got.

MYTH 3: XML CAN DRIVE WEB BROWSERS BY ITSELF

Remember that the HTML concept is one of a markup language consisting of a relatively small set of standard tags that are associated with some more-or-less standard behaviors. The XML concept is one of an infinitely large set of possible tags that are associated with no standard behaviors at all. Specification of the behavior has to come from somewhere else. In publishing, that's usually a style sheet, but in other domains it can be something as flexible as JavaBeans or as specialized as an industry-standard protocol around which programmers write standardized applications.

Syntax not semantics

XML proponents sum this up by saying that XML specifies syntax, not semantics. Some theorists object that this easy formulation overlooks the semantic association of XML syntactic objects with the XML data constructs they represent (such as elements and attributes). However, the point the "syntax not semantics" slogan tries to make is larger and simpler: Unlike HTML tags, XML tags have no predefined meaning. The meaning or behavior has to be supplied in operational terms by programs or scripts or in declarative terms by style sheets or even good old prose.

Confusion over this point becomes evident when prospective XML users ask plaintively how XML is going to be displayed on their Web browsers. The answer is that it's not -- at least not by itself.

To get something going in a browser that emulates what is done today with HTML, you are going to have to provide separately what HTML provides as a unitary but difficult-to-manage whole: You will have to supply both the content of a document (expressed in XML) and its treatment, which you must specify either programmatically (with scripts) or declaratively (with style sheets).
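
The separation of content from treatment can be sketched as follows. This is a toy under stated assumptions, not a real style sheet mechanism: the memo vocabulary, the STYLE mapping, and the render function are all hypothetical, and the "treatment" here is nothing more than a table that maps element names to HTML tags.

```python
import xml.etree.ElementTree as ET
from html import escape

# The content: an invented vocabulary with no built-in display behavior.
CONTENT = "<memo><subject>Lunch</subject><body>Meet at noon.</body></memo>"

# The treatment: a toy declarative mapping, element name -> HTML tag.
STYLE = {"memo": "div", "subject": "h2", "body": "p"}

def render(element, style):
    """Render an element tree as HTML using the supplied tag mapping."""
    html_tag = style.get(element.tag, "span")  # unmapped tags fall back to span
    parts = [f"<{html_tag}>", escape(element.text or "")]
    for child in element:
        parts.append(render(child, style))
        parts.append(escape(child.tail or ""))  # text following the child
    parts.append(f"</{html_tag}>")
    return "".join(parts)

print(render(ET.fromstring(CONTENT), STYLE))
# prints <div><h2>Lunch</h2><p>Meet at noon.</p></div>
```

Swapping in a different mapping changes the presentation without touching the content, which is the property that HTML's unitary approach gives up.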

Style sheets

What's preventing the general use of XML for Web documents is the current lack of a style sheet language that is both powerful enough for XML and easy to use. Cascading Style Sheets (CSS), the style sheet language developed for HTML, can be used to apply styles to XML documents, but it doesn't have the power to transform and generate structures (such as tables of contents) needed for XML-based publishing in general.
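
To show what "generating structures" involves, here is a minimal sketch of a table-of-contents builder. It is written in Python rather than in any style sheet language, and the book, chapter, section, and title element names are assumptions made up for the example. The point is that the output is a new structure derived from the document, not a set of properties attached to existing elements.

```python
import xml.etree.ElementTree as ET

# A hypothetical document; the element names are invented for this sketch.
DOCUMENT = """
<book>
  <chapter><title>Markup</title>
    <section><title>Tag languages</title></section>
    <section><title>The SGML layer</title></section>
  </chapter>
  <chapter><title>Style</title></chapter>
</book>
"""

def table_of_contents(element, depth=0):
    """Collect (depth, title) for every chapter or section, in document order."""
    entries = []
    for child in element:
        if child.tag in ("chapter", "section"):
            entries.append((depth, child.findtext("title", default="")))
            # Recurse so nested sections appear under their chapter.
            entries.extend(table_of_contents(child, depth + 1))
    return entries

for depth, title in table_of_contents(ET.fromstring(DOCUMENT.strip())):
    print("  " * depth + title)
```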

The Document Style Semantics and Specification Language -- the ISO style sheet language designed for use with SGML -- has the power for advanced publishing projects. But DSSSL (rhymes with "whistle") has a syntax based on the Scheme programming language, which many people find hard to learn. It also lacks a rich declarative layer, which makes it almost impossible to guarantee that independently developed style sheet editors can interoperate.

This is where Extensible Style Language (XSL) comes in. Part of the larger XML project from the beginning, XSL is a new language that will combine the power of DSSSL with the simplicity of XML and the established "style-property" vocabulary of Cascading Style Sheets. A W3C XSL working group, formed in January 1998, is busy defining this enabling language for XML-based Web publishing.

While a finished XSL recommendation is still almost a year away, the first XSL working draft is now publicly available on the W3C Web site at http://www.w3.org/TR/WD-xsl. This nascent specification deserves the careful attention of anyone intending to engage in electronic publishing as it moves into the next century.

MYTH 4: XML IS JUST FOR DATA

Because we don't yet have a style sheet language powerful enough to allow XML to demonstrate its superiority as an approach to publishing, the first wave of XML applications is based on what it can do on its own: convey structured data.

A single, human-readable syntax

XML gives us a single, human-readable syntax for serializing just about any kind of structured data -- including relational data -- in a way that lets it be manipulated and displayed using simple, ubiquitous, standardized tools. The larger implications of a standard, easily processed serial data format are hard to imagine, but such a format is obviously going to have a large impact on electronic commerce. And it seems clear that electronic commerce is eventually going to become synonymous with commerce in general.
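
As an illustration of serializing relational data, the sketch below writes a few rows out as XML and reads them back. Everything in it is an assumption for the sake of the example: the products table, its columns, and the row/column element layout are invented, and real applications would agree on their own vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, as they might come back from a relational query.
COLUMNS = ("id", "name", "price")
ROWS = [(1, "widget", "9.99"), (2, "gadget", "19.50")]

def rows_to_xml(table_name, columns, rows):
    """Serialize rows as <table><row><column>value</column>...</row>...</table>."""
    table = ET.Element(table_name)
    for row in rows:
        row_element = ET.SubElement(table, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_element, column).text = str(value)
    return ET.tostring(table, encoding="unicode")

def xml_to_rows(xml_text, columns):
    """Read the rows back as tuples of strings."""
    table = ET.fromstring(xml_text)
    return [tuple(row.findtext(column) for column in columns)
            for row in table.findall("row")]

serialized = rows_to_xml("products", COLUMNS, ROWS)
print(serialized)
print(xml_to_rows(serialized, COLUMNS))
```

Note that the values come back as strings: the syntax carries the structure, while data types, like all other meaning, have to be supplied by the applications on either end.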

XML can do for data what Java has done for programs, which is to make the data both platform-independent and vendor-independent. This capability is driving a wave of middleware XML applications that will begin to wash over us around the beginning of 1999. However, the ability of XML to support data and metadata exchange shouldn't be allowed to distract us from the purpose for which XML was originally designed. The designers of XML had in mind not just a transport layer for data but a universal media-independent publishing format that would support users at every level of expertise in every language.

Media-independent publishing

Media-independent publishing is actually a much harder problem than data exchange. In fact, it's fair to say that the requirements for publishing in the general sense are a superset of the requirements for data exchange. The arrival of XSL will make possible a solution for publishing in general, with consequences that few people yet realize.

The key to understanding the revolutionary potential of XML is that it is just one piece of a larger picture. XML by itself can provide standardized interchange formats for databases and spreadsheets. This is significant. But XML and XSL together can replace existing word processing and desktop publishing formats as well. Together they can give us, in effect, a single, completely internationalized format of almost unlimited power for both print and online publishing that is fully interoperable across all products and all platforms. The implications of this go far beyond data exchange and far beyond the Web.

What standardized publishing means to users

The combination of XML and XSL is potentially vastly more complex and difficult to work with than today's HTML, so its use will at first be the domain of a few experts working on large, specialized publishing applications by hand. These will be the applications that demand the highest level of automation and media independence -- newspapers, business directories, encyclopedias, commercial catalogs, television schedules, and so on.

The standardized approach will begin to move beyond this specialized group of power users only when ordinary word processing and desktop publishing programs start saving out files as combinations of XML and XSL instead of proprietary formats. This is not a technical problem but an economic one, because the larger vendors of publishing tools have historically relied on proprietary formats to lock in their user base. The vendors will change to standardized, open formats only when ordinary users become aware of their benefits and demand support for them.

The benefits of a standardized format for data and presentation are overwhelming. They include

complete interoperability of both content and style across applications and platforms;
freedom of content creators from vendor control of production tools;
freedom of users to choose their own views into content;
easy construction of powerful tools for manipulating content on a large scale;
a level playing field for independent software developers; and
true international publishing across all media.

I believe that user awareness of these benefits will eventually force vendors to support the standardized approach, just as user demand for access to the Internet forced vendors to support the Web.

The consequent restructuring of the relationship between producers and consumers of many kinds of desktop software applications should prove enormously beneficial for all of us. It will mean an end to control of the market by a few big companies and, perhaps even more importantly, an end to control of the market by a few big countries.

The result will be better products and better communication between human beings.

Jon Bosak is the Online Information Technology Architect for the Solaris Products division of Sun Microsystems. He organized the W3C XML activity in 1996 and has chaired the W3C XML Working Group since its inception. His opinions are his own and do not necessarily represent those of Sun Microsystems or the World Wide Web Consortium.