The Evolution of Web Documents: The Ascent of XML

by Dan Connolly, Rohit Khare, and Adam Rifkin

$Id: ascent-of-xml.html,v 1.75 1997/09/21 19:12:10 adam Exp $

HTML is the ubiquitous data format for Web pages; most information providers are not even aware that there are other options. But now, with the development of XML, that is about to change. Not only will the choices of data formats become more apparent, but they will become more attractive as well. Although XML succeeds HTML in time, its design is based on SGML, which predates HTML and the Web altogether. SGML was designed to give information managers more flexibility to say what they mean, and XML brings that principle to the Web. Because it allows the development of custom tagsets, we can think of XML as HTML without the "training wheels." In this article, we trace the history and evolution of Web data formats, culminating in XML. We evaluate the relationship of XML, HTML, and SGML, and discuss the impact of XML on the evolution of the Web.

1. World-Wide Markup Language

The idea that structured documents could be exchanged and manipulated if published in a standard, open format dates back to multiple efforts in the 1960s. In one endeavor, a committee of the Graphic Communications Association (GCA) created GenCode to develop generic typesetting codes for clients who sent data typeset differently to different vendors. GenCode allowed them to maintain an integrated set of archives despite the records being set in multiple types.

In another effort, IBM developed the Generalized Markup Language (GML) for its big internal publishing problems, including the management of documents of all types, from manuals and press releases to legal contracts and project specifications. It was designed to be reusable for batch processors to produce books, reports, and electronic editions from the same source file(s).

GML had a "simple" input syntax for typists, including the tags we recognize today denoted by <> and </>. Of course, GML also permitted lots of "cheating": typists could easily elide obvious tags. This markup minimization erred on the side of making these documents easier to type and read for humans than for general purpose processing (as could be done if these documents were fed to computer applications). In fact, there were so few types of documents required at the time, people wrote special compilers, bound to each particular kind of document, to handle the inputting of the appropriate data formats.

As more document types emerged --- each requiring specially suited tagsets --- so too did the need for a standard way to publish and manipulate each Document Type Definition (DTD). In the early 1980s, representatives of both the GenCode and GML communities joined to form the American National Standards Institute (ANSI) committee on Computer Languages for the Processing of Text; their goal was to standardize the ways to specify, define, and use markup in documents.

SGML, the Standard Generalized Markup Language, became an ISO standard in 1986 [ISO 8879:1986]. Developed for defining and using portable document formats, SGML was designed to be formal enough to allow proofs of document validity, structured enough to handle complex documents, and extensible enough to support management of large information repositories. SGML might seem to the casual observer like a "design by committee," but it was successful in furnishing an interchange language that could be used for exchanging and manipulating text documents.

By the late 1980s, SGML had caught on in organizations such as CERN, where, in a laboratory in Switzerland, a holiday hacker borrowed the funny-looking idiom for his then-new hypertext application. Indeed, in 1990, Tim Berners-Lee, inventor of the World Wide Web, picked an assortment of markup tags from a sample SGML DTD used at CERN. In NeXUS, the original Web browser and editor, he used these tags with style sheets for typesetting --- and added the "killer feature": links.

By the time Mosaic took off worldwide in 1993, people were using HTML as a hammer and were seeing nails everywhere. Unfortunately, HTML is not a hammer; even HTML 4.0, released in July 1997 [Raggett et al., 1997], only furnishes a limited tagset, and no single tagset will suffice for all of the kinds of information on the Web.

Starting in 1992, HTML evolved from a somewhat ad-hoc syntax to a conforming SGML application. This did not happen for free, and it involved some rather ugly compromises. So why was this worth the effort? First, it gave the specifications a solid foundation. But moreover, the intent was that Web tools would implement HTML as a special case of generic SGML and stylesheet support. That way, changes to HTML could be dynamically propagated into the tools by just updating the DTD and stylesheets. This proved to be an idea before its time: the engineering cost was significant, but the information providers did not have the necessary experience to take advantage of the extra degrees of freedom.

In 1992, the Web was not ready for a powerful, generic markup language: in its nascent stage, the Web needed one small tagset that would be suitable for most of its intended documents and simple enough for the authoring community to understand. That small tagset is HTML.

Basing HTML on SGML was the first step to bringing the SGML community to the World Wide Web. At that point, forward-looking companies began to shift their agendas to unite SGML with the Web [Rubinsky and Maloney, 1997].

SGML on the Web is risky: because SGML has lots of optional features, the sender and receiver have to agree on some set of options to remain interoperable. The engineering costs are compounded because the SGML specification does not follow accepted computer-science conventions for the description of languages [Kaelbling, 1990]. For implementors, the specification is hard to read and contains many costly special cases.

The stage is set for XML, the eXtensible Markup Language [Connolly and Bosak, 1997], which addresses the engineering complexity of SGML and the limitations of the fixed tag set in HTML.

2. Community-Wide Markup Languages

For a document to communicate successfully from author to readers, all concerned parties must agree that words mean what all choose them to mean. Semantics can only be interpreted within the context of a community. Millions of HTML users worldwide, for example, agree that <B> means bold text, or that <H1> is a prominent top-level document heading. The same cannot be said, though, for the date 8-7-97, which reflects local culture. Or for <font FACE=Arial>, which is only usable by Microsoft Windows systems. The larger the community, the weaker the shared context; the smaller and more focused the community, the stronger the shared context becomes.

HTML is currently the only common tagset Web users can rely upon. Furthermore, HTML cannot be extended unilaterally, since the shared definition is maintained by a central standardization process which publishes new editions like 2.0, 3.2, and 4.0. Since semantics depend on shared agreements between readers and writers about the state of the world, there is a place for community-specific definitions. XML makes it cost-effective to capture community ontologies as Document Type Definitions to decentralize the control of specialized markup languages. The emergence of richly annotated data structures catalyzes new applications for storing, sharing, and processing ideas.

2.1 Semantic Markup

Descriptive markup indicates the role or meaning of some part of a document. While <H1> is a generic expression of importance, <WARNING TYPE=Fire_hazard> is a much more specific expression. Calling the former "structure" and the latter "semantics" is, indeed, a matter of semantics, but it seems clear that the more specific the markup, the more meaningful the document and the lower the potential for confusion.
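To make the point concrete, here is a minimal sketch of what specific markup buys a processing program. It assumes a hypothetical document: the manual element, the Shock_hazard value, and the sample sentences are invented for illustration, and the attribute value is quoted as XML requires. With Python's standard xml.etree.ElementTree, a program can select exactly the fire-hazard warnings, which would not be possible if they were marked up only as generic headings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only <H1> and <WARNING TYPE=Fire_hazard> come from
# the text; everything else is invented for this sketch.
DOC = """<manual>
  <H1>Operating the heater</H1>
  <WARNING TYPE="Fire_hazard">Keep away from curtains.</WARNING>
  <WARNING TYPE="Shock_hazard">Unplug before cleaning.</WARNING>
</manual>"""

root = ET.fromstring(DOC)

# Select warnings by their specific type rather than by generic importance.
fire_warnings = [
    warning.text
    for warning in root.findall(".//WARNING[@TYPE='Fire_hazard']")
]
```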

An ontology codifies the concepts that are noteworthy to a community so that everyone has a common level of understanding upon which future knowledge exchange can proceed. The reverse phenomenon is equally powerful: mastery of the jargon confers membership in the community. In this sense, community recapitulates ontology, but the tools to express private agreements have been late in coming. Communities are mirrored by ontology: when a large community has to use a single ontology, its value is diluted to the least common denominator (as exemplified by HTML itself).

When communities collide, ontological misunderstandings can develop for several reasons. Sometimes it is a matter of context, like the legal interpretation of "profit" according to national accounting and tax rules. Sometimes it is a matter of perception, like "offensive language" in a Platform for Internet Content Selection (PICS) content-rating [Resnick and Miller, 1996]. Sometimes it is a matter of alternative jargon: 10BaseT cable to a programmer is Category 5 twisted pair to a lineman. Sometimes it is a matter of intentional conflation, like Hollywood "profit," which refers to both the pile of cash in the studio account and the losses recorded in an actor's residuals.

The best remedy is to codify private ontologies that serve to identify the active context of any document. This is the ideal role for a well-tempered DTD. Consider two newspapers with specific in-house styles for bylines, captions, company names, and so on. Where they share stories on a wire service, for example, they can identify it as their story, or convert it according to an industry-wide stylebook. As competing DTDs are shared among the community, semantics are clarified by acclamation [Hale, 1997]. Furthermore, as DTDs themselves are woven into the Web, they can be discovered dynamically, further accelerating the evolution of community ontologies.

2.2 Generating New Markup Languages

As the Web evolved, people and companies indeed found themselves extending the HTML tagset to perform special tasks. A rich marketplace of server-side-includes and macro-preprocessing extensions to HTML demonstrates that users understand the benefit of using local markup conventions to automate their in-house information management practices. And the cost of "dumbing down" to HTML is becoming more apparent as more organizations want to go beyond information dissemination to information exchange.

The fundamental problem is that HTML is not unilaterally extensible. A new tag potentially has ambiguous grammar (is it an empty element or does it need an end-tag?), ambiguous semantics (no metadata about the ontology it is based on), and ambiguous presentation (especially without stylesheet hooks). Instead, investing in SGML offers three critical features: extensibility, structure, and validation.

XML is a simplified (but strict) subset of SGML that maintains the SGML features for extensibility, structure, and validation. XML is a standardized text format designed specifically for transmitting structured data to Web applications. Since XML aims to be easier to learn, use, and implement than full SGML, it will have clear benefits for World Wide Web users. XML makes it easier to define and validate document types, to author and manage SGML-compliant documents, and to transmit and share them across the Web. Its specification is less than a tenth of the size of SGML86's. XML is, in short, a worthy successor in the evolutionary sense.

The "well-formed" versus "valid" distinction is an important one. Since one can always extract and reflect the document structure from the document itself without its DTD, DTD-less documents are already self-describing containers. A DTD simply provides a tool for deciding whether the structure implicit in the body of the document matches the explicit structure (known in the vernacular as "validity"). This phenomenon is isomorphic to the interface/implementation separation in components; in the XML model, the DTD is the interface, and the body is the implementation. We discuss the implications of XML well-formedness in Section 3.1.
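The claim that a DTD-less document is a self-describing container can be sketched in a few lines. The example assumes a hypothetical memo document whose tag names are invented for illustration, and it uses Python's standard xml.etree.ElementTree, which checks well-formedness only and never consults a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD-less document; all tag names are invented for this sketch.
DOC = (
    "<memo><to>Ann</to>"
    "<body>Ship <part-number>A-113</part-number> today.</body></memo>"
)


def outline(element, depth=0):
    """Return the element structure as indented tag names, one per entry."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(text):
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The structure is recovered from the document body alone.
structure = outline(ET.fromstring(DOC))
```

Deciding validity would additionally require comparing this recovered structure against the declarations in a DTD, which this sketch does not attempt.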

The working draft for XML 1.0 provides a complete specification in several parts: the extensible markup language itself [Bray, Paoli, and Sperberg-McQueen, 1997], methods for associating hypertext linking [Bray and DeRose, 1997], and forthcoming stylesheet mechanisms for use with XML. From the XML specification, we observe that expressive power, teachability, and ease of implementation were all major design considerations. And although XML is not backward-compatible with existing HTML documents, we note that documents that are HTML 4.0-compliant can easily be converted to XML.

In addition to modifying the syntax and semantics of document tag annotations, XML also changes our linking model by allowing authors to specify different types of document relationships: new linking technology allows the management of bidirectional and multiway links, as well as links to a span of text (within the same or other documents), as a supplement to the single point linking afforded by HTML's existing HREF-style anchors.

2.3 Leveraging Community-Wide Markup

Accepting that community-specific DTDs can represent an ontology and that XML makes it cost-effective to deploy them, the potential of XML-formatted data will catalyze new applications for capturing, distributing, and processing knowledge [Khare and Rifkin, 1997].

Two communities using XML to capture field-specific knowledge have already chalked up early victories: the Chemical Markup Language (CML) [Murray-Rust, 1997] and the Mathematical Markup Language (MathML) [Ion and Miner, 1997]. Storing and distributing information in XML databases in conjunction with the eXtensible Linking Language (XLL) can ease data import and data export problems, facilitate aggregation from multiple sources (data warehousing), and enable interactive access to large corpora.

Web Automation promises the most dramatic leverage, though. Tools like webMethods' Web Interface Definition Language (WIDL) [Allen, 1997] bridge the gap between legacy Web data and structured XML data. WIDL encourages the extraction of information from unstructured data (such as HTML tables and forms) to produce more structured, meaningful XML reports; furthermore, employing WIDL, one can synthesize information already stored as structured data into new reports using custom programming, linking, and automated information extrapolation. Manipulating XML-formatted data leverages a cleaner, more rigorous object model for accessing entities within a document, when compared with the Document Object Model's references to windows, frames, history lists, and formats [Bosak, 1997].

3. On the Coevolution of HTML and XML

Now that we have compared the values of HTML for global markup needs and XML for community-specific markup, let's see how all this pans out in practice. How will HTML adapt to the presence of XML?

It will not be an either-or choice between HTML and XML; you do not have to plan for a Flag Day when your shop stops using HTML and starts using XML. Instead, as HTML tools evolve to support the whole range of XML, your choices will expand with them. Just as the value to information providers is becoming evident, the cost of generic markup is going down because XML is considerably simpler than SGML. In addition, the complementary piece of infrastructure, stylesheets, is finally being deployed.

If a browser (or editor or other tool) supports stylesheets [Culshaw et al., 1997], support for individual tags does not have to be hard-coded. If you decide to add <part-number> tags to your documents and specify in a stylesheet that part-numbers should display in bold, a browser with stylesheet support can follow those directions. It is clear how to add a <part-number> tag to XML, but what about HTML?
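The idea that support for a tag need not be hard-coded can be sketched as follows. This is only an illustration under stated assumptions: the STYLESHEET dictionary is a hypothetical stand-in for a real stylesheet language, the sample paragraph is invented, and asterisks stand in for bold type. The renderer knows nothing about part-number; it merely follows whatever the stylesheet says.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: element name -> display rule.
# A real stylesheet language would be far richer than this mapping.
STYLESHEET = {"part-number": "bold"}


def render(element, stylesheet):
    """Render an element as plain text, wrapping bold spans in asterisks."""
    text = element.text or ""
    for child in element:
        text += render(child, stylesheet) + (child.tail or "")
    if stylesheet.get(element.tag) == "bold":
        return "*" + text + "*"
    return text


doc = ET.fromstring("<p>Order <part-number>A-113</part-number> today.</p>")
rendered = render(doc, STYLESHEET)

# With an empty stylesheet the same renderer applies no emphasis at all.
plain = render(doc, {})
```

Changing how part numbers are displayed then means editing the stylesheet, not the tool.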

3.1 Platforms and Borders: Well-Formed HTML

HTML is built on the platform of SGML. The borders of SGML were set by ISO in 1986, and the borders of HTML were set originally by the IETF in 1995, and subsequently expanded by W3C in 1996 with HTML 3.2 and again in 1997 with HTML 4.0. But so far, the borders of HTML fit within the borders of SGML.

But only part of the ground inside the SGML borders is fertile --- the XML part. The rest is too expensive to maintain. Unfortunately, some of HTML is sitting on that infertile ground. But it should be a simple task to move a document from that crufty ground to the arable XML territory:

By the same token, it should be a simple task to move the HTML specification onto the XML platform. Let's look at those steps a bit more closely:

Is "More Text" inside the my-markup element or not? The SGML answer is: you have to look in the DTD to see whether my-markup is an empty element, or whether it can have content. The XML answer is: don't do that. Make it explicit, one way or the other:
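As a hedged illustration, assume a hypothetical paragraph containing the my-markup element (only the element name and the words "More Text" come from the discussion above; the rest is invented). The sketch uses Python's xml.etree.ElementTree to show that the two explicit forms parse without any DTD, while the ambiguous form is rejected by an XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments built around the my-markup element.
EMPTY = "<p>Some text <my-markup/> More Text</p>"
WITH_CONTENT = "<p>Some text <my-markup> More Text</my-markup></p>"
AMBIGUOUS = "<p>Some text <my-markup> More Text</p>"


def my_markup_content(fragment):
    """Return the text inside my-markup, or None if the element is empty."""
    root = ET.fromstring(fragment)
    return root.find("my-markup").text


def parses(fragment):
    """True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


empty_content = my_markup_content(EMPTY)          # nothing inside
explicit_content = my_markup_content(WITH_CONTENT)  # "More Text" is inside
ambiguous_ok = parses(AMBIGUOUS)                  # rejected: no end-tag
```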

Hence rule 1: match every start-tag with an end-tag. That is right, every p, li, dd, dt, tr, and td start-tag needs a matching end tag. If you are still using a text-editor to write HTML, you can take this as a hint to start looking at direct-manipulation authoring tools, or at least text editors with syntax support for this sort of thing.

Rule 3 takes the guesswork out of attribute value syntax. In HTML, quotation is only required in some cases, but it is always allowed. In XML, it is simply required.
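Moving a simple document onto XML ground can be sketched with Python's standard html.parser. This is a minimal sketch under loud assumptions, not a general converter: the AUTO_CLOSE set and the rule that a start-tag closes an open element of the same name are simplifications chosen for illustration, empty elements such as br and img are not treated specially, and lower-casing of names is simply what html.parser does.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: a start-tag in this set closes an element of
# the same name if that element is currently the innermost open one.
AUTO_CLOSE = {"p", "li", "dd", "dt", "tr", "td"}


class Normalizer(HTMLParser):
    """Re-emit simple HTML with explicit end-tags and quoted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the rewritten document
        self.open = []   # stack of currently open element names

    def _close_innermost(self):
        tag = self.open.pop()
        self.out.append("</%s>" % tag)
        return tag

    def handle_starttag(self, tag, attrs):
        if tag in AUTO_CLOSE and self.open and self.open[-1] == tag:
            self._close_innermost()
        quoted = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, quoted))
        self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end-tag: ignore it in this sketch
        while self._close_innermost() != tag:
            pass

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def normalize(html_text):
    """Return html_text with matched tags and quoted attribute values."""
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    while parser.open:  # close anything still open at end of input
        parser._close_innermost()
    return "".join(parser.out)


result = normalize("<UL><LI class=first>one<LI>two</UL>")
reparsed = ET.fromstring(result)  # an XML parser now accepts the output
```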

All of these rules are only a prediction of how the specifications will evolve, but rule 4 is especially uncertain: it may turn out to be "use only upper case" instead, depending on how HTML adapts to the rules about case sensitivity in XML. XML tag names and attribute names can use a wide variety of Unicode characters, and matching upper-case and lower-case versions of these characters is not as simple as it is in ASCII. As of this writing, the working group has decided to punt on the issue, so that names compare by exact match only.
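Both halves of this point can be checked in a short sketch. It uses Python's xml.etree.ElementTree to show exact-match comparison of names, and Python's built-in Unicode case mapping (an assumption of convenience; any Unicode-aware case mapping would serve) to show why case folding is harder outside ASCII.

```python
import xml.etree.ElementTree as ET


def well_formed(fragment):
    """True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Names compare by exact match, so <P> and </p> are different names.
exact_match = well_formed("<p>text</p>")
mixed_case = well_formed("<P>text</p>")

# Outside ASCII, case mapping is not a simple one-to-one pairing.
sharp_s_upper = "\u00df".upper()              # German sharp s becomes two letters
dotless_round_trip = "\u0131".upper().lower()  # dotless i does not survive
```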

3.2 Licensed to Tag

According to the official rules, extending HTML is the exclusive privilege of the central authorities. But everybody's doing it in various underground ways: they use , preprocessing extensions with <if> <then> <else> tags, and so on. Even Robin Cover, maintainer of the most comprehensive SGML bibliography on the Web, admits in [Cover, 1997]:

Once HTML and XML align, there will be legitimate alternatives to all these underground practices. You can add your <part-number> and <abstract> tags with confidence that your markup will be supported.

In fact, you have two choices regarding the level of confidence: you can make well-formed documents just by making sure all your tags are balanced and there are no missing quotes and such. A lot of tools will only check this much.

On the other hand, if you want additional support from your tools, you will have to keep your documents valid: remember to put a <title> in every document, an alt attribute on every <img> element, and so on. In that case, adding tags to a document also requires creating a modified DTD.
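The two levels of confidence can be contrasted in a small sketch. It is not a DTD validator (Python's standard library does not include one); it assumes a hypothetical html/head/title layout and hand-codes the two constraints named above as a stand-in for what a validating tool would derive from a DTD.

```python
import xml.etree.ElementTree as ET


def validity_problems(text):
    """Return a list of problems; an empty list means all checks passed."""
    try:
        root = ET.fromstring(text)  # level 1: well-formedness
    except ET.ParseError as err:
        return ["not well-formed: %s" % err]
    problems = []
    # Level 2: hand-coded stand-ins for constraints a DTD would express.
    if root.find("head/title") is None:
        problems.append("missing <title>")
    for img in root.iter("img"):
        if "alt" not in img.attrib:
            problems.append("<img> without alt attribute")
    return problems


# Hypothetical documents for illustration.
GOOD = (
    '<html><head><title>Report</title></head>'
    '<body><img src="a.gif" alt="Figure 1"/></body></html>'
)
WELL_FORMED_ONLY = '<html><head/><body><img src="a.gif"/></body></html>'

good_problems = validity_problems(GOOD)
weak_problems = validity_problems(WELL_FORMED_ONLY)
```

A document like WELL_FORMED_ONLY passes the first level and fails the second, which is exactly the gap a modified DTD is meant to cover when you add your own tags.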

Then you can validate that the document is not just any old HTML document, but it has a specific technical report structure for consistency with the other technical reports at your site. And you can use stylesheet-based typesetting tools to create professional looking PostScript or Portable Document Format (PDF) renditions.

3.3 Mix and Match, Cut and Paste

Not everyone who wants something different from standard HTML has to write his or her own DTD. Perhaps, in the best of Internet and Web tradition, you can leverage someone else's work. Perhaps you would like to mix elements of HTML with elements of DocBook [Allen and Maler, 1997] or a Dublin Core [Burnard et al., 1996] DTD. Unfortunately, achieving this mixture with DTDs is very awkward.

And yet the ability to combine resources that were developed independently is an essential survival property of technology in a distributed information system. The ability to combine filters and pipes and scripts has kept Unix alive and kicking long past its expected demise. In his keynote address at Seybold San Francisco [Berners-Lee, 1996], Tim Berners-Lee called this powerful notion "intercreativity".

Combining DTDs that were developed independently exposes limitations in the design of SGML for things like namespaces, subclassing, and modularity and reuse in general.

There is a great tension between the need for intercreativity and these limitations in SGML DTDs. One strategy under discussion is to introduce qualified names a la Modula, C++, or Java into XML. For example, you might want to enrich your home page by the use of an established set of business card element types. This strategy suggests markup a la:
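As a hypothetical illustration only, the sketch below assumes a "card" qualifier written with a colon and invented element names; the actual syntax under discussion may differ. It uses Python's xml.parsers.expat with namespace processing left off, so the qualified names are treated as ordinary names and the parser checks well-formedness alone.

```python
import xml.parsers.expat

# Hypothetical home page fragment; the "card:" qualifier, the element names,
# and the sample data are all invented for this sketch.
PAGE = (
    "<html><body>"
    "<card:card><card:name>Pat Doe</card:name>"
    "<card:email>pat@example.org</card:email></card:card>"
    "</body></html>"
)


def element_names(text):
    """Parse text and return every element name in document order."""
    names = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(text, True)
    return names


names = element_names(PAGE)

# Split each qualified name into its qualifier and local part.
qualified = [tuple(name.split(":", 1)) for name in names if ":" in name]
```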

This markup is perfectly well-formed, but the strategy does not address DTD validation. Another strategy for mixing-and-matching elements is to use SGML Architectures [Kimber, 1997]. Or, perhaps a more radical course of research is needed to re-think the connection between tag names, element names, and element types [Akpotsui et al., 1997].

3.4 The Future Standardization of XML

If it seems that XML is moving very fast, look again. The community is moving very fast to exploit XML, but the momentum against changes to XML itself is tremendous. XML is not a collection of new ideas; it is a selection of tried-and-true ideas. These ideas are implemented in a host of conforming SGML systems, and employed in truly massive SGML document repositories. Changes to a technology with this many dependencies are evaluated with utmost care.

XML is essentially just SGML with many of the obscure features thrown out (Appendix A of the specification lists SUBDOC, RANK, and quite a few others). The result is much easier to describe, understand, and implement, despite the fact that every document that conforms to the XML specification also conforms to the SGML specification.

In a few cases, the design of SGML has rules that would be difficult to explain in the XML specification. And they prohibit idioms that are quite useful, such as multiple <!ATTLIST ...> declarations for the same element type. In these cases, the XML designers have participated in the ongoing ISO revision of SGML. The result is the WebSGML Technical Corrigendum [Goldfarb, 1997], a sort of "patch" to the SGML standard.
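The usefulness of the idiom can be seen in a brief sketch. It assumes a hypothetical report element with invented attribute names and default values, declared in two separate <!ATTLIST ...> declarations in an internal DTD subset, and it uses Python's xml.parsers.expat, which by default reports attributes derived from such declarations.

```python
import xml.parsers.expat

# Hypothetical document: the element, attribute names, and defaults are
# invented. Note the two ATTLIST declarations for the same element type.
DOC = """<!DOCTYPE report [
  <!ATTLIST report status CDATA "draft">
  <!ATTLIST report lang CDATA "en">
]>
<report/>"""


def root_attributes(text):
    """Return the attributes reported for the first element in text."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = (
        lambda name, attrs: seen.append((name, dict(attrs)))
    )
    parser.Parse(text, True)
    return seen[0][1]


attributes = root_attributes(DOC)
```

Because the declarations accumulate, independently written fragments can each contribute attributes to the same element type.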

Every document that conforms to the XML specification does indeed conform to SGML-as-corrected, and the W3C XML Working Group and the ISO Working Group have an agreement to keep that constraint in place.

So the wiggle-room in the XML specification is actually quite small. The W3C XML working group is considering a few remaining issues, and they release drafts for public review every month or so. The next step in the W3C process, after the working group has addressed all the issues they can find, is for the W3C Director to issue the specification as a Proposed Recommendation and call for votes from the W3C membership. Based on the outcome of the votes, the director will then decide whether the document should become a W3C Recommendation, go back to the working group for more work, or be canceled altogether.

Outside the core XML specification, there is much more working room. The XLL specification [Bray and DeRose, 1997] is maturing, but there are still quite a few outstanding issues. And, work on the eXtensible Stylesheet Language (XSL) is just beginning.

4. The Ascent of XML in the Evolution of Knowledge from Information

The World Wide Web Consortium, the driving force behind XML, sees its mission as leading the evolution of the Web. In the competitive market of Internet technologies, it is instructive to consider how the Web trounced competing species of protocols. Though it shared several adaptations common to Internet protocols, such as "free software spreads faster," "ASCII systems spread faster than binary ones," and "bad protocols imitate; great protocols steal," it leveraged one unique strategy: "self-description." The Web can be built upon itself. Universal Resource Identifiers, machine-readable data formats, and machine-readable specifications can be knit together into an extensible system that assimilates any competitors. In essence, the emergence of XML on the spectrum of Web data formats caps the struggle toward realizing the original vision of the Web by its creators.

The designers of the Web knew that it must adapt to new data formats, so they appropriated the MIME Content Type system. On the other hand, some types were more equal than others: the Web prefers HTML over PDF, Microsoft Word, and myriad others, because of a general trend over the last seven years of Web history from stylistic formatting to structural markup to semantic markup. Each step up in the Ascent of Formats adds momentum to Web applications, from PostScript (opaque, operational, formatting); to troff (readable, operational, formatting); to Rich Text Format (RTF) (readable, extensible, formatting); to HTML (readable, declarative, limited descriptive semantics like <ADDRESS>); now to XML; and on to intelligent metadata formats such as PICS labels.

The Web itself is becoming a kind of cyborg intelligence: human and machine, harnessed together to generate and manipulate information. If automatability is to be a human right, then machine assistance must eliminate the drudge work involved in exchanging and manipulating knowledge, as indicated by MIT Laboratory for Computer Science Director Michael Dertouzos [Dertouzos, 1997]. As Douglas Adams described [Adams, 1979], the shift from structural HTML markup to semantic XML markup is a critical phase in the struggle to transform the Web from a global information space into a universal knowledge network.

Acknowledgments

This paper is based on our experiences over several years working with the Web community. Particular plaudits go to our colleagues at the World Wide Web Consortium, including Tim Berners-Lee; the teams at MCI Internet Architecture and Caltech Infospheres; and the group at webMethods, especially Charles Allen.

References

Author Addresses

Dan Connolly is the leader of the W3C Architecture Domain. His work on formal systems, computational linguistics, and the development of open, distributed hypermedia systems began at the University of Texas at Austin, where he received a B.S. in Computer Science in 1990. While developing hypertext production and delivery software in 1992, he began contributing to the World Wide Web project, and in particular, the HTML specification. He presented a draft at the First International World Wide Web Conference in 1994 in Geneva, and edited the draft until it was published as the HTML 2.0 specification, Internet RFC1866, in November 1995. Today he is the chair of the W3C HTML Working Group and a member of the W3C XML Working Group. His research interest is in the value of formal descriptions of chaotic systems like the Web, especially in the consensus-building process.

Rohit Khare is a member of the MCI Internet Architecture staff in Boston, MA. He was previously on the technical staff of the World Wide Web Consortium at MIT, where he focused on security and electronic commerce issues. He has been involved in the development of cryptographic software tools and Web-related standards development. Rohit received a B.S. in Engineering and Applied Science and in Economics from California Institute of Technology in 1995. He will enter the Ph.D. program in Computer Science at the University of California, Irvine in Fall 1997.

Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of distributed active objects. His efforts with Infospheres have won best paper awards both at the Fifth IEEE International Symposium on High Performance Distributed Computing in August 1996, and at the Thirtieth Hawaii International Conference on System Sciences in January 1997. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Reprise Records, Griffiss Air Force Base, and the NASA-Langley Research Center.