[Archive copy mirrored from: http://www.cs.caltech.edu/~adam/papers/xml/]

X Marks the Spot

eXtensible Markup Language opens the door to a motherlode of automated Web applications

by Rohit Khare and Adam Rifkin

July 31, 1997

Abstract

The HyperText Markup Language (HTML) allows the structural markup of World Wide Web documents. Now, HTML's evolutionary successor, the eXtensible Markup Language (XML), takes document markup to the next level, offering human-readable semantic markup, which is also machine-readable. As a result, XML makes it dramatically easier to develop and deploy new mission-specific markup, enabling the automation of the authoring, parsing, and processing of networked data.

Introduction: The Evolution of the Web's Data Format Representation

In Japanese culture, one's meishi (business card) is not merely a clerical tickler. It is a weighty talisman, a proxy invested with every nuance of one's place in the clan, in the company, even in the world. That is to say, business cards matter.

These ephemeral shadows of relationships are a way to illuminate the predicament currently facing the World Wide Web. These meishis are bits of critical metadata about the people who carry them. We are extremely familiar with them, we know what they look like, and we know what to expect when we peruse one: items such as addresses, phone and fax numbers, titles, and logos.

Or do we? In the drawer where I keep my business cards scattered among the office supplies, I see many scores of different shapes, scripts, languages on two sides... and even superpositions such as Braille! Some are magnetic, some are calculators, some have photos, some have nothing more than a name and a public key (since these cards are pseudonyms rather than actual flesh and blood).

Despite all the different eccentricities associated with any individual's business card, we have learned to cope. We build islands of interoperability within companies (maintaining the same corporate "look and feel") and industries (using comparable titles where appropriate). In addition, the business card "data dictionaries" have evolved, too, to include such personal effects as pager numbers and home pages. In short, meishis represent a microcosm of all of the document management hassles currently facing the World Wide Web: handling multiple versions, dealing with multiple views of the same person, providing access control and security, and so on.

As artifacts, business cards seem pedestrian enough: mere scraps of paper, always a moment's inattention from a final, inglorious end in a wastebasket. As information, though, they provide insight into some of the fundamental knowledge-representation challenges ahead for the World Wide Web. In this paper, we consider the evolution of the Web's data representation from display formats to structural markup to semantic markup, a new frontier opened up by the eXtensible Markup Language (XML) specification.

The Limits of HTML

On the Web, we have the same situation with home pages as we had with business cards. In fact, the problems are compounded, since web pages can come in an even more staggering profusion of shapes and sizes and sounds. Fortunately, you can still identify a home page when you see one: the name in big type, the hobbies in a bulleted list, and of course the email address in typewriter font. The reason for this ease of recognition is structural markup: we can identify the elements of a page with HTML tags and declare the relationships among the various document elements (for example, "this is a new paragraph" or "this is emphasized from its surrounding text").

To a human reader with some practice, this markup makes sense; however, it is all just Greek to computers. They cannot extract rolodexes from this soup of tags and content. They need semantic markup to know what each particular element means on its own (for example, "this is a home street address" or "this is an email address"). We are not referring just to a primitive catchall such as <ADDRESS> in HTML, which lumps together a whole opaque string of bits. We need machine-readable semantic markup.

Wouldn't it be great if there were <job-title> and <cert type=X.509 src=ldap://...> elements? We could be very specific about their meaning, and we could push the look-and-feel of the display of these items out to corporate-wide stylesheets. The problem is that HTML is not extensible, so new tags are:

  1. ambiguous (offering no suggestions as to what is meant by a particular new tag),
  2. undefined (providing no source metadata about the ontology),
  3. not renderable (furnishing no hints about how to display data marked up by those tags), and
  4. not composable (for example, since there exists no external tagset with which to begin, and there exist no tag overlap resolution hooks, we cannot combine the tags of a bCard with tags defining fingerprint data).

We do want to specify what we mean, like the Cheshire Cat in Alice in Wonderland. But, we also need to share our specific semantics. How can we resolve this paradox: innumerable separate meanings within a single file format? The answer is, simply, by using a metaformat: a language construction kit.

GML, the Holy Grail

The idea that structured documents could be exchanged and manipulated if published in a standard, open format, dates back to multiple efforts in the 1960s. In one endeavor, a committee of the Graphic Communications Association (GCA) created GenCode to develop generic typesetting codes for clients who sent data typeset differently to different vendors. GenCode allowed them to maintain an integrated set of archives despite the records being set in multiple types.

In another effort, IBM developed the Generalized Markup Language (GML) for its big internal publishing problems, including the management of documents of all types, from manuals and press releases to legal contracts and project specifications. It was designed to be reusable for batch processors to produce books, reports, and electronic editions from the same source file(s).

GML had a "simple" input syntax for typists, including the tags we recognize today denoted by <> and </>. Of course, GML also permitted lots of "cheating": typists could easily elide obvious tags. This markup minimization erred on the side of making these documents easier to type and read for humans than for general purpose processing (as could be done if these documents were fed to computer applications). In fact, there were so few types of documents required at the time, people wrote special compilers bound to each particular kind of document to handle the inputting of the appropriate data formats.

As more document types emerged (each requiring specially suited tagsets), the need for a standard way to publish and manipulate each Document Type Definition (DTD) also emerged. Representatives of both the GenCode and GML communities joined in the early 1980s to form the American National Standards Institute (ANSI) committee on Computer Languages for the Processing of Text with the goal of standardizing the ways to specify, define, and use markup in documents.

SGML, the Standard Generalized Markup Language standardized in ISO 8879:1986 [Goldfarb, 1990] for defining and using portable document formats, was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories. The two key elements of SGML were its syntax (which evolved from IBM's GML), and its semantics (which came from the typesetters through the GCA).

SGML might seem to the casual observer like a pie-in-the-sky design by committee, but it was successful in furnishing an interchange language that could be used to manipulate and exchange text documents. By the late 1980s, organizations such as CERN used it; in a laboratory in Switzerland, a holiday hacker borrowed the funny-looking idiom for his then-new hypertext application.

Indeed, the World Wide Web's inventor Tim Berners-Lee picked a very small set of SGML's structural markup concepts with wide applicability for his own Hypertext Markup Language in 1990 -- to complement the style sheets he designed for his web browser to understand typesetting hints.

By the time Mosaic took off worldwide in 1993, people were using HTML as a hammer and were seeing nails everywhere. Unfortunately, HTML is not a hammer; even HTML 4.0, released in July 1997 [Raggett et al., 1997], only furnishes an elegant but small tagset, and no single tagset will suffice for all of the kinds of information on the web.

The obvious question, then, is why did the World Wide Web not formalize the ad-hoc angle-bracketing of early HTML to use generic SGML rigorously? Early HTML specification drafters Dan Connolly and Dave Raggett believed it should, and began to make HTML a real SGML application in the mid 1990s. This was seen as the first step to bringing the SGML community to the promised land of the World Wide Web, and some companies shifted their agendas to unite SGML with the Web [Rubinsky and Maloney, 1997].

Unfortunately, in ISO 8879:1986, SGML was a large, cumbersome specification designed as though the Personal Computer revolution had never happened. In fact, the specification was given as if the most basic results in machine parsing from the computer science compiler studies in the 1960s had not happened! Not only was the long specification difficult for people to read and understand, but it was also hard for computers to process and manipulate. By 1996, SGML was still not ready for interactive parsing, authoring, or incremental display (thanks to its batch heritage!). But we like some of the functionalities offered by SGML that are not available in HTML, so...

XML to the Rescue

To recap, there are several features SGML provides that are not available in HTML:

  1. Extensibility: Authors can define new tag names and attribute names for documents by specifying their syntax and semantics.
  2. Structure: Documents can be containers for other documents, with arbitrary nesting. This allows complex documents to be constructed from simpler documents.
  3. Validation: If desired, any SGML document can reference (XML can also include) a description of its grammar so applications can validate that the document conforms to its specified structure. Furthermore, this process of validation can be automated.

In 1996, a new team backed by the World Wide Web Consortium worked to realize these benefits for the Web in a format more usable, by both humans and computers, than SGML.

XML, the Extensible Markup Language [Connolly and Bosak, 1997], is a simplified (but strict) subset of SGML that maintains the SGML features of validation, structure, and extensibility. XML is a standardized text format designed specifically for transmitting structured data to Web applications. In addition, XML's goals of being easier to learn, use, and implement than full SGML will have clear benefits for World Wide Web users, making it easier to define and validate document types, to author and manage SGML-defined documents, and to transmit and share them across the Web.

The new subset, like SGML, is a meta-language for describing the markup of different types of documents. However, its specification is 26 pages (versus 500 for SGML!). It is, in short, a worthy successor, providing a practical solution to the Gordian knot of being precise and extensible without sacrificing simplicity.

XML is not a replacement for SGML; in fact, many features of SGML were left out to keep XML simple. XML also does not allow markup minimization, requires that empty elements be self-identifying, and does not support several other complex SGML standard features.

However, XML is being designed with network delivery concerns in mind, so current SGML users may choose XML for data exchange over a network: A well-formed XML document is unambiguous, so a browser or editor can read the tags and create a tree of the hierarchical structure without having to read its Document Type Definition. Since XML is a valid subset of SGML, the translation from SGML to XML is straightforward.
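
As a minimal sketch of that claim, the Python fragment below uses the standard library's xml.etree.ElementTree parser, which does not consult a DTD, to build the element tree of a small well-formed document. The sample document and the outline helper are assumptions made for illustration; neither comes from the XML draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed document; no DTD is referenced anywhere.
DOC = """<bCard>
  <name><first>Adam</first><last>Rifkin</last></name>
  <email>adam@cs.caltech.edu</email>
</bCard>"""

def outline(element, depth=0):
    """Return one indented line per element, mirroring the document tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```

The hierarchy is recovered from the balanced start and end tags alone, which is what lets a generic tool process a document type it has never seen.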

The working draft for XML 1.0 provides a complete specification in two parts: the extensible markup language itself [Bray and Sperberg-McQueen, 1997], and methods for associating hypertext linking [Bray and DeRose, 1997] and stylesheet mechanisms with XML. From this specification, we observe that expressive power, teachability, and ease of implementation were all major design considerations. And although XML is not backward-compatible with existing HTML documents, we note that documents that are HTML 4.0-compliant can easily be converted to XML.

In addition to modifying the syntax and semantics of document tag annotations, XML also changes our linking model by allowing authors to specify different types of document relationships: new linking technology allows the management of bidirectional and multiway links, as well as links to a span of text (within the same or other documents), as a supplement to the single point linking afforded by HTML's existing HREF-style anchors.

With XML's syntax for both linking and defining tagsets simplified over SGML, any parser can handle any document; as a result, anyone can issue a Document Type Definition (DTD) specialized to particular data needs. Furthermore, as with SGML, XML DTDs can be composed together to create more complex, validatable new document types from simpler, validated document type definitions. In addition, authors can stylize how the document's appearance is formatted in a web browser or other web-compliant application, using Cascading Style Sheets (CSS) or the Document Style Semantics and Specification Language (DSSSL).

A nice summary of the purpose of XML was posted to the newsgroup comp.infosystems.www.authoring.html on June 3, 1997, by Jim Cape:

"XML was designed to provide an easy-to-write, easy-to-interpret, and easy-to-implement subset of SGML. It was not designed to provide a "one Markup Language fits all" DTD, or a separate DTD for every tag. It was designed so that certain groups could create their own particular markup languages that meet their needs more quickly, efficiently, and (IMO) logically. It was designed to put an end once and for all to the tag-soup wars propagated by Microsoft and Netscape."

Sidebar: XML Timeline

The W3C is a vendor-neutral international industry consortium founded in 1994 to develop common protocol specifications for the evolving World Wide Web. W3C team members pursue many different Activities, or technical directions. A W3C Activity starts when a new issue relevant to the Web is discovered; then, W3C team members organize a workshop to collect information, opinions, and ideas. This leads to the formation of a working group with specific short-term goals, watched over by an Editorial Review Board elected from nominated experts by the Members, who review the resulting work as expert consultants. W3C results may then enter the IETF standards track; for example, W3C and the IETF are currently in partnership on the development of HTTP 1.1 and Distributed Authoring and Versioning.

XML was invented by a small band of renegades, most of whom had been longtime SGML supporters. The W3C SGML Working Group and Editorial Review Board were chartered for a single year (July 1996 through July 1997) with Jon Bosak of Sun Microsystems as chair, and Dan Connolly as the W3C staff contact person. XML's definition was expedited because the Web is amenable to fast, easy communication and collaboration. The result has been a snowball effect, allowing XML to gain momentum and exposure rapidly as its specification is refined by the working group. The following timeline denotes some of the milestones in the XML effort.

Milestones

July 1996
W3C work on SGML officially began (see the charter).
September 1996
Report on the Generic SGML activity at the Seybold Conference in San Francisco, CA.
November 1996
Initial XML draft presented at the SGML '96 Conference in Boston.
January 1997
Peter Flynn started maintaining a list of Commonly Asked Questions about XML.
February 1997
Imperial College, London formed (and continues to host) the xml-dev mailing list for XML developers.
March 1997
First XML Conference in San Diego, by the Graphic Communications Association.
March 1997
Revised XML Syntax Working Draft [Bray and Sperberg-McQueen, 1997].
April 1997
Initial XML Linking Working Draft [Bray and DeRose, 1997].
April 1997
XML was one of the most popular topics W3C presented at the Sixth International World Wide Web Conference in Santa Clara, CA.
May 1997
Approval of a Technical Corrigendum to ISO 8879:1986 to align features of XML with the SGML standard at the May 1997 meeting of ISO/IEC JTC1/SC18/WG8 in Barcelona.
June 1997
New versions of the XML Syntax and Linking Working Drafts available.
December 1997
SGML/XML 97 Conference in Washington, D.C. will herald the release of near-final specifications and the initial edition of a standard stylesheet language for XML publishing applications based on DSSSL (ISO/IEC 10179) along with the public text and extensions Web browsers will need to implement it.

A Brief Introduction to XML

Let's explore XML in a little more depth to see how it is similar to SGML and HTML, and how it differs. Recall that SGML is an internationally standardized language for defining sets of tags. HTML represents just one of the element type tagsets that can be created using SGML; now that HTML 3.2's DTD has been completely specified, documents can be verified to be HTML 3.2-compliant.

As with SGML documents, XML documents are composed of entities, which are storage units containing text and/or binary data [Bos, 1997a]. Text is composed of character streams that form both the document's character data content and the document's metadata markup. Markup describes the document's storage layout and logical structure. XML also provides a markup mechanism to impose constraints on the storage layout and logical structure of documents [Bos, 1997b], and it provides mechanisms that can be used for strong typing [Bray, 1997c].

In style and structure, XML documents look quite similar to HTML documents. However, when Web servers with XML content prepare data for transmission, they typically must generate a context wrapper with each XML fragment, including pointers to an associated Document Type Definition (DTD) and one or more stylesheets for formatting. Web clients that process XML must be able to unpack the content fragment, parse it in context according to the DTD (if needed), render it (if needed) in accordance with the specified stylesheet guidelines, and correctly interpret the hypertext semantics (such as links) associated with each of the different document tags.

Note that a DTD is not required for an XML document; instead, an author can simply use an application-specific tagset. However, a DTD is useful because it allows applications to validate the tagset for proper usage. A DTD specifies the set of required and optional elements (and their attributes) for documents to conform to that type; in addition, the DTD specifies the names of the tags and the relationships among elements in a document (for instance, nesting of elements).

Two examples of "electronic business card" (bCard) documents illustrate the power and simplicity of XML. In the first example, the DTD is given as part of the XML document; in the second example, the DTD is not given, but it exists in an externally defined document.

Example 1: Annotated Attribute-Value Pairs

Let's write a simple XML document that only contains tags annotated with attribute-value pairs; that is, there will be no content in the document other than the tags themselves. These tags can then be parsed and processed by software programs.

Our simple example is a document for maintaining a list of peoples' electronic business cards. Suppose, then, we want each "bCard list tag" to contain five attributes: a person's first name, surname, company, email address, and web page address.

We can specify default values for attributes to guarantee that every tag has the same number of attribute-value pairs (although some values may be null). The declaration of default attributes is lexically scoped by the bCard element (although in this case it has no effect, since none of the elements omit an attribute).


<!DOCTYPE bCard "http://www.cs.caltech.edu/~adam/schemas/bCard">
<bCard>

<?xml default bCard
        firstname = ""
        lastname  = ""
        company   = ""
        email     = ""
        homepage  = ""
?>

<bCard
        firstname = "Rohit"
        lastname  = "Khare"
        company   = "MCI"
        email     = "khare@mci.net"
        homepage  = "http://pest.w3.org/"
/>

<bCard
        firstname = "Adam"
        lastname  = "Rifkin"
        company   = "Caltech Infospheres Project"
        email     = "adam@cs.caltech.edu"
        homepage  = "http://www.cs.caltech.edu/~adam/"
/>
</bCard>

Note how XML's formatting is human-readable as well as machine-readable: empty lines immediately following a ">" or immediately preceding a "<" in the document are ignored by the parser, and whitespace inside tags is ignored (which is not true for HTML).
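
To illustrate how such tags can be processed by software, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The DOCTYPE line and the <?xml default ...?> instruction in the example follow the 1997 working-draft syntax, which this parser is assumed not to accept, so the sketch keeps only the elements and applies the empty-string defaults in code. The names FIELDS and read_cards are hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements from Example 1 only; the draft-era DOCTYPE line and the
# <?xml default ...?> instruction are left out (assumption: modern parser).
DOC = """<bCard>
<bCard firstname="Rohit" lastname="Khare" company="MCI"
       email="khare@mci.net" homepage="http://pest.w3.org/"/>
<bCard firstname="Adam" lastname="Rifkin"
       company="Caltech Infospheres Project"
       email="adam@cs.caltech.edu"
       homepage="http://www.cs.caltech.edu/~adam/"/>
</bCard>"""

FIELDS = ("firstname", "lastname", "company", "email", "homepage")

def read_cards(xml_text):
    """Return one dict per inner bCard, defaulting missing attributes to ""."""
    root = ET.fromstring(xml_text)
    return [{name: card.get(name, "") for name in FIELDS}
            for card in root.findall("bCard")]

cards = read_cards(DOC)
for card in cards:
    print(card["firstname"], card["lastname"], "-", card["email"])
```

Because every card carries the same five attribute names, a program can treat the list as a small table without any knowledge of how the cards are displayed.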

Example 2: Embeddable Tags

As a text-based format, XML is designed for storing and transmitting data. This can either be done through arbitrary attribute-value pairs, as demonstrated in the first example, or it can be done by strategically embedding tags around content to give that content more meaning.

For example, consider the following XML snippet:


<!doctype html>
<html version="-//W3C//DTD HTML Experimental 970324//EN">
<head>
<title> Adam's bCard List </title>
</head>
<body>

<h1> Adam's bCard List </h1>

<bCard MONTH="7" YEAR="1997">
<FIRSTNAME> Adam </FIRSTNAME>
<LASTNAME> Rifkin </LASTNAME>
<COMPANY> Caltech Infospheres Project </COMPANY>
<EMAIL> adam@cs.caltech.edu </EMAIL>
<HOMEPAGE> http://www.cs.caltech.edu/~adam/ </HOMEPAGE>
</bCard>

<bCard MONTH="8" YEAR="1997">
<FIRSTNAME> Rohit </FIRSTNAME>
<LASTNAME> Khare </LASTNAME>
<COMPANY> MCI </COMPANY>
<EMAIL> khare@mci.net </EMAIL>
</bCard>

<hr/>
<address><a href="mailto:adam@cs.caltech.edu">
Adam Rifkin</a></address>
<!-- Created: Wed Jul 16 12:22:32 MET DST 1997 -->
<!-- hhmts start -->
Last modified: Wed Jul 16 22:32:42 MET DST 
<!-- hhmts end -->
</body>
</html>

Note that the DTD is not embedded in the document; we could specify it elsewhere if we needed to validate the tagset and content data structures, or we could omit the DTD.

By binding a meaning to the XML tag <bCard>, we understand what is contained in that element: the start tag, the end tag, and the contents in between those tags. In this case, the bCard element has two attributes, MONTH and YEAR, the values of which correspond to the month and year that the bCard entry was added to the document. Now, the DTD might specify that the bCard element must contain the FIRSTNAME, LASTNAME, COMPANY, and EMAIL elements, and might contain a HOMEPAGE element as well. Additionally, the DTD might specify that any HOMEPAGE element that does appear in a valid electronic business card document must be given nested within a bCard element.
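
The constraints just described can be sketched in code. The Python standard-library parser used here does not validate against DTDs, so this sketch checks the rules by hand; it stands in for what a validating parser would do and is not a DTD. It also ignores element order, which a real content model could constrain. The function names and the incomplete second card are hypothetical.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("FIRSTNAME", "LASTNAME", "COMPANY", "EMAIL")
OPTIONAL = ("HOMEPAGE",)

def check_bcard(card):
    """Return a list of rule violations for one bCard element."""
    present = [child.tag for child in card]
    problems = [f"missing {tag}" for tag in REQUIRED if tag not in present]
    problems += [f"unexpected {tag}" for tag in present
                 if tag not in REQUIRED + OPTIONAL]
    return problems

def check_document(root):
    """Check every bCard, plus the rule that HOMEPAGE appears only in a bCard."""
    problems = []
    for card in root.iter("bCard"):
        problems += check_bcard(card)
    for parent in root.iter():
        if parent.tag != "bCard" and parent.find("HOMEPAGE") is not None:
            problems.append(f"HOMEPAGE outside bCard (in {parent.tag})")
    return problems

# Hypothetical test document: one complete card, one incomplete card,
# and one stray HOMEPAGE element.
DOC = """<body>
<bCard MONTH="8" YEAR="1997">
<FIRSTNAME> Rohit </FIRSTNAME><LASTNAME> Khare </LASTNAME>
<COMPANY> MCI </COMPANY><EMAIL> khare@mci.net </EMAIL>
</bCard>
<bCard MONTH="9" YEAR="1997"><FIRSTNAME> Nobody </FIRSTNAME></bCard>
<HOMEPAGE> http://example.org/ </HOMEPAGE>
</body>"""

problems = check_document(ET.fromstring(DOC))
print(problems)
```

Writing these checks by hand for every document type is exactly the burden a shared DTD removes: the rules are declared once and any validating parser can enforce them.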

Once a document type's elements have been specified in a DTD, then style sheets and scripts and programs can be associated with any element in that document type. For example, a custom script might execute when someone clicks on that business card entry, opening up a fancy separate window that displays that entry in a classy font, color, and arrangement. Or, a style sheet associated with business cards might display all entries of people at MCI with the MCI logo.

The forms of metadata provided in example 1 (attribute-value pairs) and example 2 (start-end tags) demonstrate the different ways that document content can be marked up with metadata to allow searching for information in the document, generating information from the document, and filtering the content of the document. Metadata spans a wealth of information, from digital signatures and authentication seals, to prices and timestamps, to links to related information.

Integrating XML Content with HTML

HTML, with its millions of users and billions of documents, will not be lost in the transition to XML. Even in an XML-centric world, most documents will use the idioms of paragraphs, headings, and lists -- and authors will use <P>, <H1>, and <LI> tags for those just as they do today. Newfangled "XML" markup will emerge around new, semantically significant data structures. We can expect to integrate XML business cards in the middle of an "HTML" home page with minor disruption -- or similarly digitally sign or encrypt individual portions of a document. Client software and network tools will evolve gracefully alongside these changes; because today's tools can cope reasonably well with unrecognized tags and today's HTTP/1.1 can compress down slightly more-verbose XML, there won't be any need for wholesale changes.

To summarize, XML allows authors to specify their own document syntax, hypertext link semantics, and presentation style. Once we can create new tags and elements with new attribute-value metadata, we can reencode any systematic, structured document format using XML.

XML Applications

Even just considering the syntax aspects of the XML specification alone, we observe that once-flat web applications are liberated from the tyranny of closed data containers. Some of the resulting XML-enabling applications arriving in this "first wave" [Bosak, 1997] will include applications that:

  1. Use web clients (browsers and other web-compliant user programs) to intermediate between multiple heterogeneous databases.
  2. Distribute some of the processing load from web servers to web clients.
  3. Use web clients to present different views of the same data to different users.
  4. Employ "intelligent" web agents to tailor information discovery and filtering to the customized needs of individual users.

XML: Marking Up Everything from Chemistry to Mathematics

Many communities have struggled to codify their tacit knowledge of how their data is structured and manipulated into common file formats. Many have even adopted SGML, realizing the value of textual, standardized interchange formats for long-term stability. However, user organizations often also borrowed the SGML committee design mentality of laboring over a single, "big bang" DTD release for their domain. The chemistry community went down that path to share molecular descriptions, but recently adopted XML as a more flexible base for its evolving Chemical Markup Language (CML).

CML (Chemical Markup Language) [Murray-Rust, 1997] uses XML to manage molecular information. Although CML is based strictly on SGML, its use as an XML application is compelling as well. It is capable of holding extremely complex information structures, acting as an interchange mechanism or for archival, and interfacing easily with modern relational and object-oriented database architectures.

CML takes advantage of the fact that XML documents need not be valid, and can simply be well-formed. Essentially, this means that a document is syntactically correct (for example, the start and end tags balance, attribute values are quoted, and so on), but the document itself might not be valid (for example, it might contain an unknown tag). XML is better suited than SGML to situations where the documents have already been validated, for instance because the authoring software is authenticated, or because the documents have already passed through a validating parser. Although all CML documents must be validatable against the CML Document Type Definition, it is possible to manipulate them without necessarily having to validate them.
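
The distinction between well-formed and valid can be seen with a non-validating parser. In the minimal Python sketch below, xml.etree.ElementTree accepts any syntactically correct document, including one with a tag no DTD has declared, and rejects documents that break the syntax rules. The element names are illustrative only and are not taken from CML.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses; no DTD is consulted, so validity is not checked."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical fragments; the element names are illustrative, not from CML.
samples = {
    "balanced, unknown tag": '<molecule><mystery/></molecule>',
    "unbalanced tags": '<molecule><atom></molecule>',
    "unquoted attribute": '<molecule id=m1/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the first fragment parses. Whether its unknown tag is acceptable is a question of validity, which would require the document type's DTD to answer.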

CML evolved out of SGML, but it gains some of its added power from XML's features. Now imagine another academic community living in the world after XML: new markup tagsets evolve out of rapid experimentation by the community that needs the DTDs. For example, XML fragments have evolved over the past year concurrently with the XML effort itself, with the explicit goal of supporting mathematics in documents. Note that because XML DTDs are composable, defining a math formula DTD will allow the mathematics tagset to be used with any other document that includes in its composition the mathematics DTD.

Mathematical Markup Language (MathML) [Ion and Miner, 1997] is an XML application for describing the structure and content of mathematical expressions, allowing the markup of complex formulas, something mathematicians and computer scientists have been clamoring for since the earliest days of HTML. Just as HTML has enabled the serving and processing of text on the Web, the goal of MathML is to enable mathematics to be served, received, and processed on the Web.

Sophisticated mathematical notation is highly symbolic in nature, and the relation between meaning and notation is often subtle. This has ramifications for the "say what you mean" aspect of semantic markup. To keep in line with the philosophy behind mathematical expressions, MathML describes expression structure together with its mathematical context. About two dozen MathML tags describe abstract notational structures, and another four dozen provide a way of unambiguously specifying the intended meaning of an expression. MathML content and presentation tags can interact to capture the nuances of meaning in traditional equations. The MathML working draft also discusses how renderers might be implemented and how they should interact with browsers.

XML for Pushing and Pulling

But XML's applicability extends beyond the academic communities who need custom markups for their documents. Recent efforts led by Microsoft and Netscape demonstrate that designing XML applications for pushing and pulling information over the Web is quick and easy to do at home.

Microsoft's proposed Channel Definition Format (CDF) [Ellerman, 1997] lets a Web site use XML to publish existing HTML content in a channel for desktop CDF-compliant push client browsers (from vendors such as Microsoft, PointCast, AirMedia, and BackWeb). XML also provides a way to embed arbitrary data and annotations within the broadcast HTML, for use with scripts. CDF permits a Web publisher to offer frequently updated collections of information from any Web server for automatic delivery to compatible receiver programs on PCs or other information appliances.

As an XML application, the CDF specification allows Web publishers to "push" information, by allowing them to specify the channels, the content available, the update schedule, and other information such as a delay period between when the data is received and when the data is browsable (to synchronize readers in multiple distributed locations, for instance). CDF overcomes a serious problem with the push platforms of today by such vendors as PointCast, BackWeb, Microsoft, and Marimba: that the publisher and the subscriber must use the same technology. If all of these content providers push their information as CDF documents and data streams, then any user with a CDF-compliant client browser can read that information! In one fell swoop, this open standard puts flexibility in the hands of the user, who can now pick a custom client application for reading, and it makes wider audiences available to the content providers.

Netscape is using XML for a different style of application: pulling the metadata about an organization along with other information about that organization. Called Meta Content Framework (MCF) [Guha and Bray, 1997], this endeavor provides the specification for a data model for describing the information organization structures (meta content) for collections of networked information, using XML syntax to represent instances of this data model.

Handheld Device Markup Language

But not all communities that could make use of XML have chosen to do so. There exist several examples of products and user groups whose specialized markup languages make good future targets for XML porting.

For example, the Handheld Device Markup Language (HDML) [Unwired Planet, 1997] addresses the constraints of pocket-size devices: a few lines of display, a limited keypad, tens of kilobytes of memory, and a wireless connection to the Internet. HDML, like HTML, is an information publishing and interaction description language, but it took the wrong approach: it extended HTML with new tags, the semantics of which were clear only to Unwired Planet. HDML ended up being quite different from HTML, and it was disingenuous to assume that it would resemble HTML at all.

After some pontification, therefore, Unwired Planet has recently reinterpreted its proposal: although HDML was designed before XML was available, they are presently looking into revising HDML to be based on XML. Such an open solution bodes well for them and for HDML, as this will enable them to use device-specific cascading style sheets and preloaded binary compression dictionaries to separately settle the "pocket-size" constraints issue across platforms.

The Motherlode

The applications of XML we have explored so far allow authors to design custom tagsets; as an author, you can define a DTD to say precisely what you mean about the information content of a document, by using the tags for interpreting and supplementing that content. However, XML is useful for another, entirely different reason: because it honors machine-readability (one of the basic tenets of the Web), it opens up new applications based on automatability.

Automated Interpretation

Put simply, XML automates the extraction of data; for example, "electronic business cards" embedded in people's Web home pages could automatically offer information in the same, commonly understood format to a variety of programs, Web forms, and scripts. As another example, imagine a flight checker that extracts the airline flight status reports from several different Web content provider services, and collates them into a single page formatted according to the reader's liking.

To perform such automation tasks, Web programmers could use operational hacks such as scripts, which leave room for plenty of errors and manageability problems. The alternative that XML suggests, as exemplified by the airline flight example, would allow the evolution of an airline community ontology for flight data.
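As a rough illustration of the flight checker, the sketch below collates status reports from two providers that have agreed on a shared tagset. The FLIGHTS, FLIGHT, CARRIER, NUMBER, and STATUS tags and the sample data are invented for this example; no such airline vocabulary is assumed to exist yet.

```python
import xml.etree.ElementTree as ET

# Two hypothetical providers publishing reports in the same invented tagset.
PROVIDER_A = """<FLIGHTS>
  <FLIGHT><CARRIER>UA</CARRIER><NUMBER>12</NUMBER><STATUS>On time</STATUS></FLIGHT>
  <FLIGHT><CARRIER>AA</CARRIER><NUMBER>100</NUMBER><STATUS>Delayed</STATUS></FLIGHT>
</FLIGHTS>"""

PROVIDER_B = """<FLIGHTS>
  <FLIGHT><CARRIER>DL</CARRIER><NUMBER>7</NUMBER><STATUS>Boarding</STATUS></FLIGHT>
</FLIGHTS>"""


def collate(*documents):
    """Merge every FLIGHT record from the given documents, sorted by carrier."""
    rows = []
    for text in documents:
        for flight in ET.fromstring(text).iter("FLIGHT"):
            rows.append((flight.findtext("CARRIER"),
                         int(flight.findtext("NUMBER")),
                         flight.findtext("STATUS")))
    return sorted(rows)


report = collate(PROVIDER_A, PROVIDER_B)
for carrier, number, status in report:
    print(f"{carrier}{number:<5} {status}")
```

No per-site screen scraping is involved: once the community agrees on the tags, adding a provider means adding one more document to the call.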

There are alternative approaches to imbuing a document with a structured ontology. At one extreme, the meaning of a document could be represented by its behavior alone; that is, its meaning is reflected only by what happens when it's processed. W3C's Document Object Model follows this route by systematically binding programs to parts of an HTML document, animating it like a puppet on a string (which explains why this approach is also marketed as "dynamic HTML").

It seems more robust to declare, deterministically, what parts of a document mean and what behaviors they have. The Web Interface Definition Language (WIDL) [webMethods, 1997] was developed by webMethods specifically to describe the inputs and outputs of programs on the Web. WIDL captures the meaning of a document by extracting relevant output fields and mapping inputs onto Web forms.

WIDL is a meta-data syntax implemented in XML that defines Application Programming Interfaces to web data and services, enabling automatic and structured web access by compatible client programs, including mainstream business applications, desktop applications, applets, web agents, and server-side web programs.

WIDL provides well-defined "machine-readable hooks" into the rapidly increasing volume of Web data and services on the Internet, Intranets, and Extranets. Most important, WIDL enables interfaces to be described for web sites that are not controlled by calling programs. WIDL files can reside on the client, the server, or be centrally managed by third-party naming services.

This simple example demonstrates how a package tracking service might be described in WIDL:


<WIDL NAME="PackageTracking">

<!-- The service: where to send the request and which bindings apply -->
<SERVICE NAME="TrackPackage"
 INPUT="InputData" OUTPUT="OutputData" METHOD="POST"
 URL="http://www.packages_r_us.com/cgi-bin/AirbillTrace" />

<!-- Input binding: the variables submitted to the Web form -->
<BINDING NAME="InputData">
<VAR NAME="trackNum" />
</BINDING>

<!-- Output binding: conditions on the returned page, then extracted fields -->
<BINDING NAME="OutputData">
<CONDITION TYPE="success" MATCH="Airbill Number:*"
  REF="doc.title" />
<CONDITION TYPE="failure" MATCH="*Blank Airbill*"
  REF="doc.p[0].value" REASONTEXT="Please provide an Airbill Number" />
<CONDITION TYPE="failure" MATCH="*should be*"
  REF="doc.p[0].value" REASONREF="doc.p[0].value" />
<CONDITION TYPE="failure"
  MATCH="*No information available*"
  REF="doc.p[0].value" REASONREF="doc.p[0].value" />
<CONDITION TYPE="failure"
  MATCH="*is not a valid*"
  REF="doc.p[0].value" REASONREF="doc.p[0].value" />
<VAR NAME="package"
  REF="doc.tables[1].tr[0].td[0].value" />
<VAR NAME="deliveredOn" REF="doc.tables[2].tr[3].td[1].value" />
<VAR NAME="signedForBy" REF="doc.tables[3].tr[2].td[1].value" />
</BINDING>

</WIDL>

Bindings between HTML/XML document elements and program variables can be defined using Document Object Model (DOM) references. Condition statements can initiate alternate binding attempts and other WIDL-defined service invocations. Together, these features provide fault tolerance and the ability to return meaningful error messages to calling programs.
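The following sketch shows one plausible reading of how an output binding could be evaluated; it is not webMethods' implementation. We assume the fetched page has already been parsed into nested dictionaries and lists, that MATCH patterns are simple wildcards, that a "failure" condition aborts the binding when it matches, and that a "success" condition aborts it when it does not. The names resolve, matches, apply_binding, and BindingError are hypothetical.

```python
import re
from fnmatch import fnmatchcase


class BindingError(Exception):
    """Raised when a page fails the conditions of an output binding."""


def resolve(doc, ref):
    """Follow a reference such as 'doc.p[0].value' through nested dicts and lists."""
    node = doc
    for name, index in re.findall(r"\.(\w+)(?:\[(\d+)\])?", ref[len("doc"):]):
        node = node[name]
        if index:
            node = node[int(index)]
    return node


def matches(doc, ref, pattern):
    """True when the referenced text exists and fits the wildcard pattern."""
    try:
        return fnmatchcase(resolve(doc, ref), pattern)
    except (KeyError, IndexError, TypeError):
        return False


def apply_binding(doc, conditions, variables):
    # Check failure conditions first so the caller gets the most specific reason.
    for kind, pattern, ref, reason in conditions:
        if kind == "failure" and matches(doc, ref, pattern):
            raise BindingError(reason or resolve(doc, ref))
    for kind, pattern, ref, _ in conditions:
        if kind == "success" and not matches(doc, ref, pattern):
            raise BindingError(f"{ref} did not match {pattern!r}")
    return {name: resolve(doc, ref) for name, ref in variables.items()}


# (type, MATCH, REF, reason text) tuples mirroring two of the conditions above.
CONDITIONS = [
    ("success", "Airbill Number:*", "doc.title", None),
    ("failure", "*Blank Airbill*", "doc.p[0].value", "Please provide an Airbill Number"),
]
VARIABLES = {"package": "doc.tables[1].tr[0].td[0].value"}

# Stand-ins for parsed pages; a real client would build these from HTML.
good_page = {
    "title": "Airbill Number: 12345",
    "p": [{"value": "Tracking results"}],
    "tables": [{}, {"tr": [{"td": [{"value": "Envelope"}]}]}],
}
bad_page = {"title": "Error", "p": [{"value": "Blank Airbill entered"}]}

result = apply_binding(good_page, CONDITIONS, VARIABLES)
try:
    apply_binding(bad_page, CONDITIONS, VARIABLES)
    error = None
except BindingError as exc:
    error = str(exc)
print(result, error)
```

The calling program sees either a dictionary of named values or an exception carrying a readable reason, never the raw page.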

WIDL provides abstract definitions for services that can be implemented in any language. webMethods WIDL-based tools generate application level function calls in C/C++, Java, Javascript, Visual Basic, and ActiveX directly from WIDL files.

Because WIDL is dynamically interpreted at runtime, client applications are insulated from changes in service locations and document structure. Transparency is achieved by changing Document Object references and Service URLs without re-generating client code.
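A small sketch can illustrate this insulation. The loader below reads a cut-down copy of the package tracking description at runtime and returns a function that prepares (but does not send) the request. The load_service and prepare names are hypothetical, and attribute values are quoted here so that a standard XML parser accepts the text.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

WIDL_TEXT = """<WIDL NAME="PackageTracking">
  <SERVICE NAME="TrackPackage" INPUT="InputData" OUTPUT="OutputData"
           METHOD="POST" URL="http://www.packages_r_us.com/cgi-bin/AirbillTrace" />
  <BINDING NAME="InputData">
    <VAR NAME="trackNum" />
  </BINDING>
</WIDL>"""


def load_service(widl_text, service_name):
    """Build a request-preparing function from a WIDL description."""
    root = ET.fromstring(widl_text)
    service = next(s for s in root.iter("SERVICE") if s.get("NAME") == service_name)
    binding = next(b for b in root.iter("BINDING")
                   if b.get("NAME") == service.get("INPUT"))
    inputs = [var.get("NAME") for var in binding.iter("VAR")]

    def prepare(**kwargs):
        missing = [name for name in inputs if name not in kwargs]
        if missing:
            raise TypeError("missing inputs: " + ", ".join(missing))
        return {
            "method": service.get("METHOD"),
            "url": service.get("URL"),
            "body": urlencode({name: kwargs[name] for name in inputs}),
        }

    return prepare


track = load_service(WIDL_TEXT, "TrackPackage")
request = track(trackNum="12345")
print(request)

# Relocating the service means editing the description, not the client code.
moved = load_service(
    WIDL_TEXT.replace("cgi-bin/AirbillTrace", "cgi-bin/v2/Trace"), "TrackPackage")
moved_request = moved(trackNum="12345")
print(moved_request["url"])
```

The client keeps calling track(trackNum=...) in both cases; only the description changed.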

Taken together, these facilities make WIDL arguably analogous to the Interface Definition Languages (IDLs) of the Common Object Request Broker Architecture (CORBA) and the Distributed Computing Environment (DCE). WIDL can define the name, inputs, outputs, data types, and exceptions for any "function" on the Web.

One system integrator is using WIDL to integrate other third-party web-based applications. Instead of coding directly to each specific package's API, they're using WIDL to use the web as a standard integration platform! This gives them a single mechanism for management of multiple product interfaces, and shortens delivery times.

Taken one step further, WIDL is being used to build a single application that automates the tracking of packages across twenty different shipping companies.

There are many potential applications for WIDL. A university student was able to easily set up a Web page that tracked the prices of computer chips at nine different Web locations. WIDL files automated the process of extracting the prices for components the student desired; furthermore, this extraction was done automatically every time he pulled up the page so he always had access to the most current data when he viewed the page in his Web browser!

In summary, there are two aspects to automatable mining. One is a metadata file describing an interface to other files. With this aspect, converting to XML is a format nicety for handling arbitrary new interface description files. The other aspect is metadata added as XML markup to the actual output information, for direct manipulation by an author or program.

Automated Publication

But automation is not limited to data mining: XML can also automate the generation of data from databases and other data stores, in a wide variety of publication formats. We could have databases and programs that manipulate them, and even extract the data back automatically from legacy HTML using expert system techniques. In turn, we could reuse and compose this data structure with others (answering such questions as "Who-owns-this-link?" and "Who-wrote-this-applet?"). We could write adapters for our new ontology to and from old personnel SQL databases or new, inscrutable X.509 identity certificates. XML provides an ideal medium for pickling (publishing) the state of distributed systems.
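An adapter from a personnel database to an XML tagset might look something like the sketch below. The personnel table, its columns, and the STAFF, CARD, NAME, TITLE, and PHONE tags are all invented for illustration, and an in-memory SQLite database stands in for the "old personnel SQL database".

```python
import sqlite3
import xml.etree.ElementTree as ET


def publish(connection):
    """Serialize every row of the (hypothetical) personnel table as XML."""
    root = ET.Element("STAFF")
    rows = connection.execute(
        "SELECT name, title, phone FROM personnel ORDER BY name")
    for name, title, phone in rows:
        card = ET.SubElement(root, "CARD")
        ET.SubElement(card, "NAME").text = name
        ET.SubElement(card, "TITLE").text = title
        ET.SubElement(card, "PHONE").text = phone
    return ET.tostring(root, encoding="unicode")


# Stand-in for a legacy database, populated with made-up records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE personnel (name TEXT, title TEXT, phone TEXT)")
db.executemany(
    "INSERT INTO personnel VALUES (?, ?, ?)",
    [("B. Writer", "Reporter", "555-0101"), ("A. Author", "Editor", "555-0100")],
)

xml_text = publish(db)
print(xml_text)
```

The same rows could just as easily be published under a different tagset or rendered through a style sheet, since the structure travels with the data.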

Automated Processing

In addition, XML automates the conversion of data, because it respects the social agreements of ontology rather than goring itself on the ox of universal data dictionaries by committee. These next few years will give rise to many such interpreters and translators. Such tasks are easy to do on the Web: observe what has already happened for cruder-scale content-types on the Web. The virtuous cycle of .ram data and RealAudio players is but one example of this. Furthermore, Jon Bosak demonstrates in his paper XML, Java, and the Future of the Web [Bosak, 1997] how XML can enable advanced Web applications, allowing Java applets to embed powerful, automatable data manipulation facilities directly into Web clients.
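In its simplest form, such a translator is little more than a table mapping one community's tags onto another's. The sketch below renames tags while copying a document tree; both vocabularies and the translate function are hypothetical, and real translators would also need to handle differences in structure, not just in naming.

```python
import xml.etree.ElementTree as ET

# Invented mapping between two communities' business card vocabularies.
TAG_MAP = {"CARD": "contact", "NAME": "fullName", "PHONE": "telephone"}


def translate(element, tag_map):
    """Return a copy of element with tags renamed; unmapped tags are kept as-is."""
    out = ET.Element(tag_map.get(element.tag, element.tag), dict(element.attrib))
    out.text, out.tail = element.text, element.tail
    for child in element:
        out.append(translate(child, tag_map))
    return out


source = ET.fromstring(
    "<CARD><NAME>A. Author</NAME><PHONE>555-0100</PHONE><FAX>555-0199</FAX></CARD>")
converted = ET.tostring(translate(source, TAG_MAP), encoding="unicode")
print(converted)
```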

"Leading the Evolution of the World Wide Web"

The World Wide Web Consortium, the driving force behind XML, sees its mission as leading the evolution of the Web. In the competitive market of Internet technologies, it is instructive to consider how the Web trounced competing species of protocols. Though it shared several adaptations common to Internet protocols ("free software spreads faster", "ASCII systems spread faster than binary ones", and "bad protocols imitate; great protocols steal"), it leveraged one unique strategy: "self-description". The Web, you see, can be built upon itself. Universal resource locators, machine-readable data formats, and machine-readable specifications can be knit together into an extensible system which assimilates any competitors.

The Web stole content-neutrality from MIME: it learned how to adapt to any document type equally. On the other hand, some types were more equal than others: the Web prefers HTML over Portable Document Format (PDF), Microsoft Word, and myriad others. That's because of a general trend over the last seven years of Web history from formatting to structural to semantic markup. Each step up in the Ascent of Formats, from PostScript (opaque, operational, formatting); to troff (readable, operational, formatting); to Rich Text Format (RTF) (readable, extensible, formatting); to "classic" HTML (readable, declarative structure); to HTML 1.x (readable, limited declarative semantics like <ADDRESS>); to XML; and on to intelligent metadata like Platform for Internet Content Selection (PICS) labels and Knowledge Interchange Format (KIF), adds momentum to Web applications.

As such, the Web is becoming a kind of cyborg intelligence: man and machine, harnessed together to generate and manipulate information. If automatability is to be a human right, then the drudge work involved in exchanging and manipulating knowledge must be eliminated by machine assistance, as indicated by MIT Laboratory for Computer Science Director Michael Dertouzos [Dertouzos, 1997].

In short, the shift from structural HTML markup to semantic XML markup is a critical phase in the struggle to transform the Web from a universal information space into a knowledge network.

References

  1. Bert Bos. The XML Data Model, 1997. (1997a) Available at http://www.w3.org/XML/Datamodel.html
  2. Bert Bos. XML Representation of a Relational Database, 1997. (1997b) Available at http://www.w3.org/XML/RDB.html
  3. Jon Bosak. XML, Java, and the Future of the Web, 1997. Available at http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm
  4. Tim Bray and C.M. Sperberg-McQueen. Extensible Markup Language (XML): Part I. Syntax, World Wide Web Consortium Working Draft (Work in Progress), March 1997. (1997a) Available at http://www.w3.org/TR/WD-xml-lang.html
  5. Tim Bray and Steve DeRose. Extensible Markup Language (XML): Part II. Linking, World Wide Web Consortium Working Draft (Work in Progress), April 1997. (1997b) Available at http://www.w3.org/TR/WD-xml-link.html
  6. Tim Bray. Adding Strong Data Typing to SGML and XML, May 1997. (1997c) Available at http://www.textuality.com/xml/typing.html
  7. Dan Connolly and Jon Bosak. Extensible Markup Language (XML), 1997. Available at http://www.w3.org/XML/
  8. Michael Dertouzos. What Will Be, HarperEdge, 1997.
  9. Castedo Ellerman. Channel Definition Format, W3C submission, March 1997. Available at http://www.w3.org/TR/NOTE-CDFsubmit.html
  10. Charles F. Goldfarb. The SGML Handbook, edited and with a foreword by Yuri Rubinsky, 688 pages, Oxford University Press, 1990. This volume contains the full annotated text of ISO 8879 (with amendments).
  11. R.V. Guha and Tim Bray. Meta Content Framework Using XML, June 1997. Available at http://www.w3.org/TR/NOTE-MCF-XML/
  12. Patrick Ion and Robert Miner. Mathematical Markup Language, W3C Working Draft, May 1997. Available at http://www.w3.org/pub/WWW/TR/WD-math
  13. Peter Murray-Rust. Chemical Markup Language (CML), Version 1.0, January 1997. Available at http://www.venus.co.uk/omf/cml/
  14. Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.0 Specification, World Wide Web Consortium Working Draft (Work in Progress), July 1997. Available at http://www.w3.org/TR/WD-html40/
  15. Yuri Rubinsky and Murray Maloney. SGML and the Web: Small Steps Beyond HTML, 502 pages, the Charles F. Goldfarb series on Open Information Management, Prentice Hall, 1997. An ideal introduction to SGML for HTML users.
  16. Unwired Planet. Proposal for a Handheld Device Markup Language (Working Draft), Version 2.0, May 1997. Available at http://www.uplanet.com/pub/hdml_w3c/hdml_proposal.html
  17. webMethods. Web Interface Definition Language Specification, 1997. Available at http://www.webmethods.com/technology/widl.html

Other Useful Links

  1. XML 1.0 working draft
  2. HTML page at W3C
  3. SGML page at SIL
  4. SGML Activity page at W3C
  5. World Wide Web Consortium
  6. Commonly Asked Questions about XML
  7. An Initial Investigation of XML
  8. webMethods

Author Addresses

Rohit Khare, khare@alumni.caltech.edu

Rohit Khare is a member of the MCI Internet Architecture staff in Boston, MA. He was previously on the technical staff of the World Wide Web Consortium at MIT, where he focused on security and electronic commerce issues. He has been involved in the development of cryptographic software tools and Web-related standards. Rohit received a B.S. in Engineering and Applied Science and in Economics from California Institute of Technology in 1995. He expects to join the Ph.D. program in computer science at the University of California, Irvine in Fall 1997.

Adam Rifkin, adam@cs.caltech.edu

Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of distributed active objects. His efforts with infospheres have won best paper awards both at the Fifth IEEE International Symposium on High Performance Distributed Computing in August 1996, and at the Thirtieth Hawaii International Conference on System Sciences in January 1997. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Griffiss Air Force Base, and the NASA-Langley Research Center.

This paper will appear in modified form in IEEE Internet Computing, and might appear in modified form in the autumn 1997 issue of the World Wide Web Journal special issue on XML.

This paper represents the authors' third publication together. Their second publication, Weaving a Web of Trust, is available for review.