Beyond HTML: XML and Automated Web Processing

[Archive copy mirrored from the URL: http://developer.netscape.com/news/viewsource/bray_xml.html; see this canonical version of the document.]

By Tim Bray

Send comments and questions about this article to View Source.

XML (Extensible Markup Language) was nowhere a year ago; now it seems to be everywhere. It's supposed to be the thing that "goes beyond HTML" -- but what does that mean? Since HTML is the most successful document format in history, why would anyone want to go beyond it? The people who are working on XML talk about "automating the Web" -- what does that mean?

XML is designed to do some jobs that HTML isn't built to handle but that really need doing. If you just want to display text, there's nothing wrong with HTML, but for automated Web processing -- enriching documents in a way that enables computer programs (like Web robots) to do something with them -- what's needed is XML.

XML was designed under the auspices of the World Wide Web Consortium (W3C). It went public in November 1996 and is already the basis for half a dozen proposals to automate Web processing. XML has a lot of people thinking really hard about what the future of the Web will look like. You need to start thinking about XML now, because a year from now you'll undoubtedly be using it a lot.

XML is extensible, easy, and (hard to believe, but true) guaranteed not to break your computer programs. In this article, I'll expand on these extravagant claims. Then I'll explain where XML came from and also take some guesses as to where it's going and what it might mean for you.

XML IS EXTENSIBLE

Extensibility is the reason for XML. HTML is great, but often it can seem to have either too many tags or not enough. It's got too many if you're trying to write a browser, a robot, or a general-purpose JavaScript utility, but not enough if you want to identify a <Part-Number>, <Exchange-Rate>, or <Aikido-Rank> in your Web page to allow automated processing. HTML doesn't have those tags, and it isn't going to get them. But in XML, you can make up any old tags you want to use.

Suppose I wanted to add some intelligence to the vast amount of e-mail stored on my computer. With XML I could mark it as shown in Example 1.

Example 1

<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body> 
<p>Here's a draft of that XML article. I'll be on the road but
connected to e-mail. Let me know if it hits the right level (i.e., are
major revisions in order?). If it's fine, proceed with editorial
nit-pickery. -Tim</p>
<attach encoding="mime" name="xml-draft.html"/>
</body>
</email>

This example should be pretty obvious. The <attach> tag looks a little weird, but we'll cover that in a moment. Some of the advantages should also be obvious. To start with, a Web robot could do a smart job of indexing this, and a Java applet could do all sorts of intelligent formatting (such as build a table-of-contents summary of a bunch of e-mail).
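
To make the indexing idea concrete, here is a minimal sketch of what such a robot might do with the message. It assumes Python and its standard xml.etree.ElementTree parser, neither of which comes from the original article; the function name summarize and the shortened body text are likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

# Example 1, with the body text shortened for this sketch.
EMAIL_XML = """<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article.</p>
<attach encoding="mime" name="xml-draft.html"/>
</body>
</email>"""


def summarize(email_xml):
    """Return the fields a robot might index, keyed by what they are."""
    root = ET.fromstring(email_xml)
    head = root.find("head")
    return {
        "from": head.findtext("from/address").strip(),
        "to": [to.findtext("address").strip() for to in head.findall("to")],
        "subject": head.findtext("subject").strip(),
        "attachments": [a.get("name") for a in root.iter("attach")],
    }


summary = summarize(EMAIL_XML)
print(summary["subject"], "from", summary["from"])
```

The code never guesses which text is the sender or the subject; the tags say so, which is the whole point.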

The basic idea here is called descriptive markup: the tags around a chunk of text don't say how to format it, or what to do when people click on it; they just say what it is. This is in dramatic contrast to HTML, where the tags do all these things at once.

The big win with descriptive markup is a bit subtle. Suppose you're processing some e-mail and you want to be able to display it both with Navigator on a big monitor and on the teeny screen of a cell phone. If the e-mail were marked up in XML, you could write one set of rules for the monitor and another for the cell phone, another to produce a professional-quality paper printout, and still another to drive a fax machine.

The idea is that you've decoupled the document from its presentation. This doesn't make designing good documents or good presentations easy, but it does mean that you can attack the problems separately, which is a big step forward.
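
As a rough illustration of "one document, several sets of rules," the sketch below renders the same e-mail two ways. It assumes Python's standard library; the function names render_monitor and render_phone, the 72-column and 20-column widths, and the shortened body text are all invented for the example, not taken from any real style-sheet system.

```python
import textwrap
import xml.etree.ElementTree as ET

EMAIL_XML = """<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
<subject> First draft of XML intro </subject>
</head>
<body>
<p>Here's a draft of that XML article. I'll be on the road but
connected to e-mail.</p>
</body>
</email>"""


def render_monitor(root):
    """Roomy layout: full header lines, paragraphs wrapped at 72 columns."""
    head = root.find("head")
    lines = [
        "From:    " + head.findtext("from/name"),
        "Subject: " + head.findtext("subject").strip(),
        "",
    ]
    for p in root.iter("p"):
        lines.extend(textwrap.wrap(" ".join(p.text.split()), width=72))
    return "\n".join(lines)


def render_phone(root, width=20):
    """Tiny layout: sender and subject only, truncated to the screen width."""
    head = root.find("head")
    sender = head.findtext("from/name")
    subject = head.findtext("subject").strip()
    return "\n".join(line[:width] for line in (sender, subject))


root = ET.fromstring(EMAIL_XML)
print(render_monitor(root))
print(render_phone(root))
```

Neither function touches the document itself, so adding a third presentation (paper, fax) means writing a third function and nothing else.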

Publish and Constrain Your Tags

Obviously, you don't want to make up a new set of tags every time you write a document. Furthermore, since this is the Web, you'd probably like to share your work with others.

XML has something called a document type definition (usually called a DTD) that allows you to define the tags you've created, for future use by yourself or others. Example 2 is the DTD for the e-mail shown in the earlier example.

Example 2

<!element email   (head, body)>
<!element head    (from, to+, cc*, subject)>
<!element from    (name?, address)>
<!element to      (name?, address)>
<!element name    (#PCDATA)>
<!element address (#PCDATA)>
<!element subject (#PCDATA)>
<!element body    (p | attach)*>
<!element p       (#PCDATA)>
<!element attach  EMPTY>
<!attlist attach  encoding (mime|binhex) "mime"
                  name     CDATA         #REQUIRED>

This should be easy to read, too. In English, it says:

An EMAIL has to have a HEAD and a BODY.
The HEAD has to have a FROM, one or more TOs, zero or more CCs, and a SUBJECT.
The FROM and the TO can both include a NAME, and they have to include an ADDRESS.
The NAME, ADDRESS, and SUBJECT are all just text.
The BODY is a mixture of Ps and ATTACHes.
A P contains just text.
An ATTACH doesn't contain anything, but it has an ENCODING attribute whose value can be either mime or binhex; if it's not there, the default is mime. An ATTACH also has a NAME attribute whose value can be any text, but has to be there.

I'm not going to explain all the details of the DTD syntax, but the ideas are pretty obvious. Clearly, you'd normally have one DTD that describes a lot of different documents; think of it as an SQL database schema for documents.

If this DTD were stored at some location -- say, http://home.netscape.com/DTDs/email.dtd -- then to associate the DTD with the e-mail message you'd insert a first line like this:

<!doctype email SYSTEM "http://home.netscape.com/DTDs/email.dtd">
<email>
<head>
<from> <name>Tim Bray</name> <address>tbray@textuality.com</address> </from>
<to> <name>Paul Dreyfus</name> <address>pdreyfus@netscape.com</address> </to>
...

The DTD might be useful to a program that received one of these e-mail messages and wanted to find out in advance what tags would be in it and how they fit together. But its most important use is to support smart editing programs, which could read the DTD and simply not let the author create a document that didn't match the DTD. (This isn't imaginary; such authoring tools already exist.)

An XML document for which there is a DTD, and which conforms to that DTD, is called valid. But a document doesn't have to be valid to be useful, as we'll see in a moment.
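
To give a feel for what checking validity involves, here is a small sketch that tests only the element nesting rules of Example 2. It assumes Python's standard library, which parses XML but does not validate against DTDs, so the content models are translated by hand into regular expressions; the CONTENT_MODELS table and the validate function are hypothetical, and attributes and text-only rules are not checked at all.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated from Example 2: each content model becomes a regular
# expression over the space-terminated names of an element's children.
# This table is an assumption of the sketch, not generated from the DTD.
CONTENT_MODELS = {
    "email": r"head body ",
    "head": r"from (to )+(cc )*subject ",
    "from": r"(name )?address ",
    "to": r"(name )?address ",
    "body": r"((p|attach) )*",
}


def validate(element):
    """Return a list of content-model violations found under element."""
    problems = []
    model = CONTENT_MODELS.get(element.tag)
    children = "".join(child.tag + " " for child in element)
    if model is not None and not re.fullmatch(model, children):
        problems.append(f"<{element.tag}> has children: {children.strip() or 'none'}")
    for child in element:
        problems.extend(validate(child))
    return problems


good = ET.fromstring(
    "<email><head><from><address>a@example.com</address></from>"
    "<to><address>b@example.com</address></to><subject>Hi</subject></head>"
    "<body><p>Hello</p></body></email>"
)
# Well-formed, but <head> lacks the required <to> and <subject>.
bad = ET.fromstring(
    "<email><head><from><address>a@example.com</address></from></head>"
    "<body/></email>"
)

print(validate(good))
print(validate(bad))
```

Note that the second message parses without complaint; it is only when it is compared against the content models that the missing elements show up.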

Extensible Hyperlinks, Too

Adding your own tags is nice, but that's only part of what makes the Web useful and XML interesting. Hyperlinks make the Web go; the <A HREF="whatever"> idiom has become universal. XML extends Web hyperlinks in a couple of useful directions. Example 3 is taken from a description of a tournament game of Go (which is an old, complex, popular Asian board game, Sakata being one of the most famous players of this century).

Example 3

<P>Faced with a tight situation, Sakata found a 
<X><L ROLE="EG" TITLE="English translation"
   SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
 <L ROLE="ToMove" TITLE="Jump to move in game record"
   SHOW="REPLACE" HREF="game.html#Move127" />
 <L ROLE="PIC" TITLE="Illustration"
   SHOW="EMBED"
   HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
 <L ROLE="CourseNotes" TITLE="Course Notes"
   HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>

Once again, we'll skip the syntactic details, which are explained in the Linking part of the XML Specification. In a browser, this would look something like:

Faced with a tight situation, Sakata found a tesuji.

When you clicked on "tesuji," though, instead of the usual Web behavior of charging off after that link, you'd get a menu with four entries: English translation, Jump to move in game record, Illustration, and Course Notes.

Choosing English translation would run an ordinary CGI script. The attribute SHOW="NEW" means that rather than replacing the current page, the results of the script would show up in a new window (as if you'd said TARGET=_NEW in an HTML page). By the way, the translation would reveal that "tesuji" is a Go term meaning a clever tactical maneuver.
Jump to move in game record, a link into an HTML page, would behave exactly as the Web does today.
The Illustration option is more interesting. First of all, it's a link into an XML file. The text after the # in the URL says that the link is to the first FIG element that has the attribute CAPTION="TESUJI". Also, because of the attribute SHOW="EMBED", rather than replacing the current page with the target of the link, that target material would be inserted in the display right here at the location of the link.
The Course Notes option links to a "span" of text in an XML file -- specifically, the first three paragraphs following a tag that has the attribute ID="def-Tesuji".

These straightforward extensions of the Web's current linking facilities, in my opinion, add a lot of richness and cost very little. (But then, I helped write the spec.)
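
For the curious, here is a sketch of how a program might build that four-entry menu from Example 3. It assumes Python's standard xml.etree.ElementTree; the function name link_menu is invented, and treating a missing SHOW attribute as REPLACE is a guess made for the sketch rather than something the article states. The locator after the # is only split off, not interpreted.

```python
import xml.etree.ElementTree as ET

# Example 3, unchanged.
PARAGRAPH = """<P>Faced with a tight situation, Sakata found a
<X><L ROLE="EG" TITLE="English translation"
   SHOW="NEW" HREF="/cgi-bin/xlate?term=tesuji" />
 <L ROLE="ToMove" TITLE="Jump to move in game record"
   SHOW="REPLACE" HREF="game.html#Move127" />
 <L ROLE="PIC" TITLE="Illustration"
   SHOW="EMBED"
   HREF="pix.xml#DESCENDENT(1,FIG,CAPTION,TESUJI)" />
 <L ROLE="CourseNotes" TITLE="Course Notes"
   HREF="notes.xml#ID(def-Tesuji)..DITTO,NEXT(3,P)" />
tesuji</X>.</P>"""


def link_menu(paragraph_xml):
    """Return (title, show, resource, locator) for each <L> element."""
    root = ET.fromstring(paragraph_xml)
    menu = []
    for link in root.iter("L"):
        # Split the HREF at the first "#"; locator is "" when there is none.
        resource, _, locator = link.get("HREF").partition("#")
        # Treating a missing SHOW as REPLACE is this sketch's assumption.
        menu.append((link.get("TITLE"), link.get("SHOW", "REPLACE"), resource, locator))
    return menu


for title, show, resource, locator in link_menu(PARAGRAPH):
    print(f"{title:30} {show:8} {resource}  [{locator}]")
```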

XML IS EASY

Most standards, even popular ones, never get read by most people. How many of us, for example, have actually read the basic HTML or TCP/IP specs, or even the electrical standards that allow you to plug a toaster in safely? The XML Specification, on the other hand, is short (fewer than 40 pages), and since it was designed for use by programmers, most readers of this article will find it straightforward.

The XML spec is available not only in HTML but also in RTF, PostScript, and PDF versions. This was easy to arrange because the spec is actually written in XML; all the other versions were auto-generated with a variety of formatting systems. (Remember our discussion above about the advantages of decoupling markup from a particular formatting semantic?)

For programmers, the HTML version is probably the most helpful. All the special terms are linked to their definitions, and all the "nonterminals" on the right-hand side of grammar productions are linked to their definitions. If you want to sit down and read the spec end-to-end, paper is the way to go.

XML had a design goal that it should be easy enough for a smart programmer to whip up a parser in a week. Since it was announced in November 1996, a ton of parsers have been whipped up. The one I wrote, named Lark, took a bit more than a week, but I was traveling when I wrote it. Besides, I didn't know Java when I started and had to learn it as I went along. The Java class files for Lark are only about 40K, and it does most of XML, with good error messages. This is simple stuff.

Elements, Tags, and Attributes

An XML document is made up of elements. Most elements have a start-tag, which may contain attributes, and an end-tag. Example 4 illustrates the XML terms element, element type, content, start-tag, end-tag, attribute name, attribute value, and empty element.

Example 4

<p secret="false">This sentence is in the content of an 
element whose type is "p"; the content is found between the 
start-tag and the end-tag. The paragraph has an attribute named "secret" 
whose value is "false". <IMG SRC='madonna.jpg'/> is an 
empty element, distinguished by the fact that it ends with "/>".</p>

That's about all there is to it. The only thing that will look a little weird to Web-folk is the <IMG> tag ending in />. (In HTML the <IMG> tag just ends with > like any other tag.)

The /> is important. Since any HTML processor "just knows" that <IMG> is an empty tag (one that doesn't depend on enclosing text and so doesn't have an end tag), no special syntax is required. But since in XML you can invent your own tags, empty elements (having no end-tags) need special syntax to keep parsers from getting confused. The /> trick allows simple programs to parse documents without knowing anything about them in advance.
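
Here is a small sketch of that "no advance knowledge" property, assuming Python's standard xml.etree.ElementTree. The parser is told nothing about p or IMG, yet it can report each element's type, attributes, and whether it has any content. The describe function is made up for the example, and the text of Example 4 is shortened.

```python
import xml.etree.ElementTree as ET

# A shortened Example 4; the parser is told nothing about "p" or "IMG".
SNIPPET = (
    '<p secret="false">This sentence is in the content of an element '
    "whose type is \"p\". <IMG SRC='madonna.jpg'/> is an empty element.</p>"
)


def describe(element, depth=0):
    """Yield one line per element: type, attributes, and whether it is empty."""
    # ElementTree does not record whether "/>" was used, so "empty" here
    # means "no child elements and no non-blank text".
    empty = len(element) == 0 and not (element.text or "").strip()
    kind = "empty" if empty else "has content"
    yield f"{'  ' * depth}{element.tag} {element.attrib} ({kind})"
    for child in element:
        yield from describe(child, depth + 1)


report = list(describe(ET.fromstring(SNIPPET)))
print("\n".join(report))
```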

Entities

XML documents don't have to live in a single file; they can be made up of multiple pieces, called entities. Example 5 illustrates entities in the master document for a short book.

Example 5

<!doctype book SYSTEM "book.dtd"
[
 <!entity toc SYSTEM "toc.xml">
 <!entity chap1 SYSTEM "chapters/c1.xml">
 <!entity chap2 SYSTEM "chapters/c2.xml">
]>
<book><head>&toc;</head>
<body>
&chap1;
&chap2;
</body></book>

In this case, the table of contents and chapters live in separate files (well, really, those are URLs), so different people can work on them in parallel. These are called external entities because their content is outside the main document.

Entities can also be used for reusable text and for referring to characters that are hard to type on the keyboard. Example 6 declares an entity that expands to the text "Extensible Markup Language." It also uses entities referring to characters that are different versions of the number "1"; these don't need to be declared since they're just numbers. These numbers come from the Unicode standard for international character sets. (This is as good a place as any to let you know that XML comes globalized: you can use any Unicode character in XML.)

Example 6

<!doctype eg 
[
 <!entity xml "Extensible Markup Language">
]>
<eg>The new &xml; standard is fully internationalized; the following
are all examples of the digit "1": &#49; (in ASCII),
&#x0661; (in Arabic), &#x0967; (in Devanagari), and
&#x0d67; (in Malayalam).</eg>
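
If you want to check what those numbers actually stand for, the following sketch resolves the four numeric references from Example 6 and looks each character up in the Unicode character database. It assumes Python's standard xml.etree.ElementTree and unicodedata modules; the <d> wrapper elements are hypothetical, added only so each character can be pulled out separately.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The four numeric references from Example 6, one per hypothetical <d> element.
DOC = "<eg><d>&#49;</d><d>&#x0661;</d><d>&#x0967;</d><d>&#x0d67;</d></eg>"

# The parser replaces each reference with the character it names.
digits = [d.text for d in ET.fromstring(DOC).iter("d")]
for ch in digits:
    # unicodedata reports each character's official name and digit value.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  value={unicodedata.digit(ch)}")
```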

Being Well-Formed

Now we're ready for one of XML's most important concepts: that of the well-formed document. This just means that:

All the tags are there.
The start- and end-tags match (with the exception of empty elements, which can use the /> trick to skip the end-tag).
All the attribute values are quoted.
All the entities are declared.

All the examples so far have been well-formed, but the following (lousy, but usable) HTML document isn't:

<title>Reasonable HTML</title>
Some text, and I <i>really</i> don't want a line&nbsp;break
between "line" and "break".
<p>Here's a picture: <IMG src=madonna.jpg>

The problems with this are:

There's no "root" element to enclose the whole thing (should be <HTML> ... </HTML>).
The entity nbsp is used without being declared.
There's a <p> with no corresponding </p>.
The <IMG> tag is missing the closing />, so the XML parser can't tell it's supposed to be empty.
The value of the src attribute, madonna.jpg, should be quoted but isn't.

There are a few more syntactic details to being well-formed, but these are the important ones. Well-formed documents are easy to parse, even for the tiniest applets.
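
A well-formedness check is about as small as a program gets. The sketch below assumes Python's standard xml.etree.ElementTree parser and feeds it the lousy HTML above, followed by a hand-repaired version. The function name well_formedness_error is invented, and the repair swaps &nbsp; for the numeric reference &#160; rather than declaring the entity, which is a shortcut taken for this example.

```python
import xml.etree.ElementTree as ET

# The "lousy, but usable" HTML from the text.
LOUSY_HTML = """<title>Reasonable HTML</title>
Some text, and I <i>really</i> don't want a line&nbsp;break
between "line" and "break".
<p>Here's a picture: <IMG src=madonna.jpg>
"""

# A hand-repaired version: root element, closed <p>, quoted attribute,
# empty-element syntax, and a numeric reference in place of &nbsp;.
REPAIRED = """<HTML><title>Reasonable HTML</title>
Some text, and I <i>really</i> don't want a line&#160;break
between "line" and "break".
<p>Here's a picture: <IMG src="madonna.jpg"/></p></HTML>
"""


def well_formedness_error(text):
    """Return None if text parses as XML, else the parser's first complaint."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


print(well_formedness_error(LOUSY_HTML))
print(well_formedness_error(REPAIRED))
```

The parser stops at the first problem it meets instead of guessing, so the first call reports a single error even though the document has several.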

XML IS GUARANTEED

Since browsers are so forgiving of bad HTML, there's a lot of bad HTML out there, which makes it hard to do automated processing with any reliability. (You could do it if you were to write as much code as there is in Netscape Navigator, but you don't want to write that much code!) Fortunately, XML comes with a built-in solution.

The XML spec says, very clearly, that if a document is supposed to be XML but isn't well-formed, then it's toast. That is to say, no conformant XML processor is allowed to recover, to go on and try to guess what the author meant. The idea is, basically, that it's pretty easy to make documents well-formed, the rewards for doing so are very high, and anybody who doesn't bother is a bozo whose material should be ignored anyway.

This was a controversial decision, but it was one that both Netscape and Microsoft demanded of the XML committee. HTML means never having to say you're sorry, which is just fine for lightweight low-overhead publishing, but a really lousy basis for trying to automate the Web. This decision won't change the way people work -- authors will continue to publish any old thing, no matter how bad, as long as it looks good in Navigator -- but when you're publishing XML, XML's error-handling rules will guarantee that once Navigator displays something, you can be sure it's well-formed.

THE HISTORY OF XML

If you look under the covers, it turns out that XML is actually a simplified form of SGML (Standard Generalized Markup Language). SGML is a big, complicated ISO standard that has been used to define HTML and lots of other languages. While SGML is a useful tool in a lot of industrial applications, it's just too complicated for Joe Homepage.

XML was cooked up by a combination of old publishing hacks who like SGML and Web-heads (some with IPOs under their belts) who understand how the Web works. Some of us are old publishing hacks and Web-heads at the same time. Basically, XML is SGML with the hard bits thrown out, explained simply and straightforwardly. We started in July 1996 and published the first draft in November 1996. The first parser appeared in January 1997. The first applications started bubbling up in March 1997, and now they're springing up like mushrooms everywhere. Is this living in Internet Time or what?

THE FUTURE

XML will probably be an official World Wide Web Consortium (W3C) recommendation by the end of 1997. It's already being applied in a variety of ways, including:

RDF (Resource Description Framework) -- still under development in the W3C, a framework for general-purpose Web metadata that's supported by Netscape and a variety of other companies. (Metadata is information about information: datestamps, security, subject headings, content ratings, Web maps, copyright notices, and the like. Right now, the Web doesn't have any metadata, but needs it terribly.) While the picture is a little hazy, a couple of other XML-based proposals, Microsoft's CDF (Channel Definition Format) and Marimba/Netscape's DRP (Distribution and Replication Protocol), will probably fit into the RDF framework.
OFX (Open Financial Exchange) -- the format used by Intuit Quicken and Microsoft Money to talk to banks.
CML (Chemical Markup Language) -- invented in Britain for chemists to interchange descriptions of molecules, formulas, and other chemical arcana.
MML (Mathematical Markup Language) -- more W3C work, designed to support the unglamorous (but commercially important) job of typesetting mathematics.
OSD (Open Software Description) from Marimba and Microsoft.

All this is very nice, but it's not what really turns the cranks of the XML people. We're waiting for the day when Navigator will be able to display XML natively, driven by a selection of style sheets, powered by Java applets running in the browser, and fueled by rich, user-defined document structures. That day isn't here yet, but it will be, sooner than you think.

FURTHER RESOURCES

View Source wants your feedback! Write to us and let us know what you think of this article.

Many thanks to Lauren Wood for support and sanity checks.

Tim Bray is a Canadian who has been working with computerized documents since joining the Oxford English Dictionary project in 1986. He co-founded Open Text Corporation, wrote one of the first big Web robots, and since 1996 has had a consulting practice under the name Textuality. He is a Seybold Fellow, edits The Gilbane Report, and is co-editor of the W3C XML specification. Tim represents Netscape (on a consulting basis) in the XML process, but does not speak for anyone but himself, and nothing in this article should necessarily be taken as representing Netscape's opinion.