[This local archive copy mirrored from the canonical site: http://www.sun.com/sunworldonline/swol-02-1998/swol-02-xml.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

What's the point of XML?

Find out what this much-discussed but little-understood technology can do for you

By Ed Tittel

In some ways, XML is like object-oriented programming: Everybody's heard of it, but nobody really understands how it works. Some pundits are predicting that 1998 will be the year XML descends from the ivory tower of the W3C into the real world. Others see the Extensible Markup Language as too complex to serve as a widespread replacement for HTML. In this comprehensive overview of XML, we explain where it comes from, what it is, and how it works. (5,000 words)

Is the Web becoming the victim of its own success? Although the roots of the World Wide Web lie less than ten years in the past, they have nourished an incredibly diverse and complex ecosystem of technologies, particularly where HTML is concerned. From complex plug-ins, to extensive use of Java, JavaScript, and CGI programs, the bare framework of HTML and HTTP has become encrusted with additional layers of functionality and capability that often obscure the clean, basic framework that the HyperText Markup Language defines. Then, too, HTML has gone through almost yearly revisions arising from the efforts of browser vendors to extend the Web, followed by the valiant efforts of standards makers at the World Wide Web Consortium (W3C) to keep up with the ever-changing state of this particular art.

Introducing XML
Enter a new kind of markup language, called the Extensible Markup Language (XML), and the trend toward constant change and elaboration is altered dramatically. XML addresses many of HTML's limitations head-on, by creating a whole new way to approach how Web sites may be structured and designed. It also expands how relationships between content (the messages a Web site seeks to communicate) and form (the ways that content is formatted or presented, and the ways links between elements operate) may be expressed.

XML was designed with three primary goals in mind, largely to offset HTML's limitations:

  1. To be extensible, allowing users to specify their own tags, and define attributes and values that would allow such tags to carry associated parameters, or otherwise provide semantic qualifications to content.

    HTML consists of a frozen set of tags, as defined by a W3C HTML definition, or a given vendor's version of that definition that may also carry its own proprietary extensions. But the set of tags is fixed and limited to whatever is set by such a definition, and is basically closed. XML, on the other hand, supports arbitrary extensions. It can deal with special areas where unique representations are required (such as chemical or mathematical formulae) or where binary and text data must be pre-packaged (such as push technologies, which combine text content with software of some kind or another) for Web-based delivery.

  2. To support rich structure, so that complex data organizations like those found in databases or object hierarchies can be easily modeled within documents, and so that such structures can be approached and manipulated directly.

    HTML supports only a limited amount of structure, includes no real hierarchy or object-representation mechanisms, and requires additional plug-ins or extensions to permit structure to be perceived or manipulated. Such capability is part and parcel of XML's basic definition.

  3. To require validation, or at least well-formedness.

    HTML makes no provision for consuming applications to check data for syntactical and structural validity as part of the data delivery process. XML, on the other hand, holds every document to a rigorous standard. It recognizes two levels of conformance: a document may be valid, which means it declares or references a DTD and obeys every constraint that DTD imposes, or merely well-formed, which means it adheres strictly to XML's syntax rules (properly nested elements, matching start and end tags, quoted attribute values) even if the definition of its markup is not explicitly included in the document itself.

In short, XML is a type of formal language that might best be described as a metalanguage -- a language that can be used to describe other languages -- because it permits users to define and use their own terms and even shortcuts (called entities), making it easier to express unique content and the relationships that can exist among content elements.
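To make the idea of user-defined tags, attributes, and entities concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module. The element names (recipe, source, ingredient), the unit and amount attributes, and the kitchen entity are all invented for illustration; they belong to no published XML dialect.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: none of these names come from a published DTD.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE recipe [
  <!ENTITY kitchen "Test Kitchen No. 4">
]>
<recipe name="flatbread">
  <source>&kitchen;</source>
  <ingredient unit="g" amount="500">flour</ingredient>
  <ingredient unit="ml" amount="300">water</ingredient>
</recipe>"""

root = ET.fromstring(DOCUMENT)

# The entity shortcut is expanded by the parser before we see the text.
source = root.find("source").text

# Attributes carry the parameters attached to each user-defined tag.
ingredients = [
    (item.text, item.get("amount"), item.get("unit"))
    for item in root.findall("ingredient")
]

print(source)
print(ingredients)
```

Nothing in the parser knows what a recipe is; the meaning of the tags is left entirely to whoever defines and consumes them, which is the sense in which XML acts as a metalanguage.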

Early examples of this genre have already proven to be effective, especially within special-interest communities. In partnership with Marimba, Microsoft has already created an XML dialect called the Channel Definition Format (CDF) that's built into Internet Explorer 4.0. It has proved to be quite effective at distributing both text and software. Likewise, the Chemical Markup Language (CML) and the Mathematical Markup Language (MathML) have wowed savvy viewers with their abilities to represent the arcane symbols, complex characters, and demanding markup that chemists and mathematicians require to get their most important ideas across. As we'll discuss later, similar efforts are underway, in all kinds of fields, to better meet the needs of groups with particular interests and demanding requirements for presenting their data and content.

This helps to explain XML's strong appeal: It makes it possible for Web designers to accommodate richer sources for Web document content, permitting documents to be structured to do the best possible job of presenting complex data elements. And, what's more, it does all this without requiring special contortions (such as Java applets or CGI programs) to grab and massage data before writing to a page. Experts claim that XML's greatest strength is its ability to cleanly and effectively separate form (the way information looks on your screen) from content (what things on the screen actually mean).

Interactivity, the XML way
According to Jon Bosak, chairman of the XML Working Group at the W3C and the author of numerous seminal papers on XML, the most eloquent argument for XML's existence is that users need to interact with applications on the Web, and that this cannot be accomplished easily using HTML (see Resources below). According to Bosak's analysis, these applications fall into one or more of four categories:

  1. Applications where a Web client must interact with two or more sources of data at the same time: HTML is good at accessing single sources, but chokes when multiple sources are required, especially when they adhere to different formats or must be accessed through dissimilar interfaces.

  2. Applications that seek to move as much of the processing load as possible from the Web server to the Web client. (Dynamic HTML shows some promise in this area, but is no match for what XML can provide.)

  3. Applications where a Web client must interact with multiple views of the same data: In the HTML world, each different view requires a trip back to the server for re-presentation; in the XML world, the client can sort, filter, and manipulate a data set locally.

  4. Applications where intelligent Web agents seek to discover information to meet customized profiles defined by individual users: In HTML, all users see a single data set the same way; in XML, it is possible that each user might see a single data set in her own unique way.
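The third category above (multiple views of the same data) can be sketched in a few lines. The following assumes a small, invented parts list delivered to the client once; the element and attribute names are hypothetical, and Python's standard xml.etree.ElementTree stands in for whatever parser a real client would embed.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set, as a server might deliver it a single time.
DATA = """<parts>
  <part id="bolt" bin="A" price="0.25"/>
  <part id="bracket" bin="B" price="3.10"/>
  <part id="washer" bin="A" price="0.05"/>
</parts>"""

parts = ET.fromstring(DATA).findall("part")

def by_price(items):
    """View 1: every part, cheapest first."""
    ordered = sorted(items, key=lambda p: float(p.get("price")))
    return [p.get("id") for p in ordered]

def in_bin(items, bin_name):
    """View 2: only the parts stored in one bin."""
    return [p.get("id") for p in items if p.get("bin") == bin_name]

print(by_price(parts))
print(in_bin(parts, "A"))
```

Both views are computed from the same in-memory data, so switching between them needs no further trip to the server.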

Establishing the level playing field
XML is firmly founded on the belief that data should be managed by its creators and maintainers, and that those who seek to provide Web content will be served best by data formats that do not require use of a particular scripting language, authoring tool, or browser to display data. Rather, the guiding principle behind XML is to create a standard, vendor-neutral, data representation where different authoring tools and delivery mechanisms can freely interact.

By itself, plain-vanilla HTML is vendor-neutral; but when encumbered with the necessary plug-ins and interactivity mechanisms that most modern Web sites require, HTML falls down sadly at this job. The problem with HTML is that too much content winds up locked inside scripts or Java code, where it becomes difficult to find and maintain, and requires Web designers to learn the language and syntax upon which special-purpose extensions rest. XML's job, in a nutshell, is to reestablish a level playing field for all Web designers, one that supports richer, more interactive functionality than plain-old HTML ever could.

Document Type Definitions: SGML's legacy
The documents that govern any given version of HTML are known as document type definitions (DTDs). DTDs name the version -- and sometimes the language, or vendor affiliation -- associated with any specific implementation of HTML. This structure is based on a powerful and complex metalanguage called the Standard Generalized Markup Language, also known as SGML. In fact, any particular version of HTML is nothing more than an SGML-based markup language as defined by its related DTD. (For more details on SGML, please consult the sidebar, "All general solutions are quite complex").

This raises the inevitable question: Why not use SGML as a replacement for HTML? In fact, SGML is already an ISO standard (ISO 8879), offers incredibly rich representational abilities, and has proven itself equal to defining documents like the Military Specification (Milspec) required for weapons systems and aircraft for the U.S. Department of Defense.

The problem is that SGML is what we can call a "kitchen sink" document definition environment. That is, it includes everything conceivable that you might ever want to do or define about a document, including the proverbial kitchen sink. This works well for extremely large and complex documents, like the aforementioned Milspec, but is too much for the limited bandwidth of the Internet, and the often less-than-overwhelming computing capabilities of most desktop computers. In short, SGML may be a very good thing indeed, but it's far too much of a good thing for effective use over the Web.

XML's distinction lies in its careful redefinition of a strict subset of SGML, honing its capabilities in a way that leaves the kitchen sink out, and in a way that makes it easy to write software, documents, and DTDs that conform to XML.

XML's design desiderata
The XML Working Group at the W3C originated in an SGML working group in 1996, among a select group of highly-trained professionals who were as keenly aware of SGML's stumbling points and implementation problems as they were of its tremendous flexibility and representational power. This experience led to the formulation of a set of ten design goals for XML, which state this metalanguage's capabilities and expressions both directly and succinctly (each of the ten elements is quoted directly from the XML Specification, followed by a brief exegesis that we prepared to explain and explore each topic):

  1. "XML shall be straightforwardly usable over the Internet." This states the most important requirement for XML -- namely that it be well suited for Web-based access, given limited bandwidth and clients of variable capabilities.

  2. "XML shall support a wide variety of applications." This is already a given, based both on XML's innate extensibility, and on the more than 20 dialects of XML that have already been proposed, if not fully defined.

  3. "XML shall be compatible with SGML." Strictly speaking, XML is a subset of SGML; thus, any SGML tool can interpret XML. Also, when driven by a valid XML DTD, any of the hundreds of SGML authoring and document management tools can handle XML documents without alteration.

  4. "It shall be easy to write programs which process XML documents." This leads directly to features like closing tags for all tags that can enclose text, complete definitions in DTDs before markup or entities can be referenced, and requirements for validation and completeness that make it easy for programmers to write XML-compliant software. This element also stipulates that browser vendors add XML support to their programs.

  5. "The number of optional features in XML is to be kept to the absolute minimum, ideally zero." Optional features, especially seldom-used ones, are what makes SGML so difficult and expensive to implement. This declaration attempts to prevent the same kinds of issues from plaguing XML.

  6. "XML documents should be human-legible and reasonably clear." Human beings also need to be able to read and write XML, in addition to it being computer-friendly. By preserving much of the look and feel of HTML, XML succeeds admirably in this effort.

  7. "The XML design should be prepared quickly." Having seen HTML 3.0 (later 3.2) go through almost two years of bickering, revision, and rewrite, the XML Working Group wanted to be sure that XML would be subject to no such delays and arguments. So far, the Group has accomplished amazing results in a little over one year.

  8. "The design of XML shall be formal and concise." The formal specification for SGML requires 500 pages; the specification for XML, including appendices, requires just over 60 pages. Need we say more? In fact, the modified BNF grammar for XML by itself can be stated in just six or seven pages of text!

  9. "XML documents shall be easy to create." By creating a notation that is equally easy for humans and computers to read, the working group has done its best to guarantee that tool vendors will support this technology. In addition to the hundreds of SGML tools that are already capable of handling XML right now, over 20 XML-specific tools are already available (two short months after the specification reached "recommended" status), with many more on the way. Also, most major browsers (Internet Explorer, for one) and authoring tools (FrontPage and HomeSite) already support XML, with other tools playing rapid catch-up. Netscape is promising XML support in Communicator 5.0, which is due in the last half of this year.

  10. "Terseness in XML markup is of minimal importance." This requirement also stresses readability, and emphasizes that a clear, intelligible markup is preferable to a terse, cryptic markup. Old computer-language hands will recognize this as an "anti-APL (a programming language)" sentiment; one well worth contemplation and adoption by those interested in understanding what XML is about and how it works.

By stating its strategy and goals so clearly, the XML Working Group is attempting to avoid the constrictions of the fixed and immutable forms of HTML, while at the same time avoiding the trap of over-generalization so clearly implemented (or sometimes not, in many real versions of software) in SGML.

Building XML documents
An XML-capable browser can read the DTD associated with an XML document (or draw on its knowledge of the underlying DTDs when reading well-formed rather than valid XML documents) and use what it learns from that definition to display all kinds of markup and symbols (including complex graphics, if suitable definitions are supplied). Thus, it can render all kinds of complex markup and symbology that would otherwise be accessible to HTML only through the use of graphics, such as JPEGs or GIFs.

The process of browsing an XML document consists of:

  1. Parsing the DTD.
  2. Constructing a data model of the DTD to create a rendering context.
  3. Reading the document body, while applying the DTD to instantiate and realize its content.
  4. Rendering the document for viewing within the browser.

Not every XML document contains a DTD (this is the distinction between well-formed and valid, in fact), but a valid DTD for every XML document must exist somewhere for any such document to be successfully interpreted and rendered.
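The four browsing steps listed above can be sketched, very roughly, with the expat bindings in Python's standard library. This is an illustration under stated assumptions rather than how any browser works: the memo markup is invented, the "data model" is only a set of declared element names, the check covers names but not content models, and "rendering" just produces labeled lines of text.

```python
from xml.parsers import expat

# Hypothetical document with its DTD carried in the prolog.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<memo><to>Staff</to><body>Meeting at noon.</body></memo>"""

def browse(text):
    declared = set()      # step 2: a deliberately tiny model of the DTD
    undeclared = []       # elements used in the body but absent from the DTD
    rendered = []         # step 4: the "rendered" output
    open_elements = []    # names of the elements currently open

    def on_element_decl(name, model):   # step 1: parse the DTD
        declared.add(name)

    def on_start(name, attributes):     # step 3: read the body against the DTD
        if name not in declared:
            undeclared.append(name)
        open_elements.append(name)

    def on_end(name):
        open_elements.pop()

    def on_text(data):
        if data.strip():
            rendered.append(f"{open_elements[-1]}: {data.strip()}")

    parser = expat.ParserCreate()
    parser.buffer_text = True           # deliver adjacent text in one piece
    parser.ElementDeclHandler = on_element_decl
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(text, True)
    return rendered, undeclared

rendered, undeclared = browse(DOCUMENT)
print(rendered)
print(undeclared)
```

Because the declarations arrive before the body, the whole job is done in one pass over the text.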

In keeping with the above design goals, XML documents have a predictable, regular structure. Any collection of text qualifies as an XML document if it meets certain structural requirements -- in which case it may be said to be well-formed, if not valid.

XML documents take on both a physical and a logical structure. From the logical standpoint, an XML document consists of:

  1. Declarations: The definitions of object types and attributes, plus the association of values with named variables, called entities, that occur within an XML document's DTD.

  2. Elements: Specific instances of objects, attributes, and related values, as defined by the DTD, and instantiated in the body of an XML document.

  3. Comments: Explanatory text and information that is ignored by an XML parser, but intended for human consumption.

  4. Character references: Definition of the notation used to denote text and other objects within a document, according to any of a number of well-defined character sets (including ISO Latin-1 and the Unicode character sets, among others).

  5. Processing instructions: Instructions for specific, named applications, or special code elements associated with some XML document (or documents).

Each of these types of objects is clearly denoted by explicit markup, and occurs in predictable locations within a document. Furthermore, both logical and physical structures must nest properly, so that only legal child objects occur within legal parent objects, and legal sibling objects occur within a legal parent object.
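As a rough illustration of how explicit markup lets a parser tell these kinds of objects apart, the sketch below feeds a small invented document to the expat bindings in Python's standard library and records one event per object. The note element, the org entity, and the render processing instruction are hypothetical names chosen for the example.

```python
from xml.parsers import expat

# Hypothetical document containing one of each kind of object.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY org "Example Corp">
]>
<?render mode="draft"?>
<!-- reviewed by the editor -->
<note>&org; &#169;</note>"""

events = []

def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    events.append(("declaration", name, value))

def on_instruction(target, data):
    events.append(("instruction", target, data))

def on_comment(text):
    events.append(("comment", text))

def on_start(name, attributes):
    events.append(("element", name))

def on_text(data):
    # Entity and character references arrive here already expanded.
    events.append(("text", data))

parser = expat.ParserCreate()
parser.buffer_text = True   # deliver adjacent text as one chunk where possible
parser.EntityDeclHandler = on_entity_decl
parser.ProcessingInstructionHandler = on_instruction
parser.CommentHandler = on_comment
parser.StartElementHandler = on_start
parser.CharacterDataHandler = on_text
parser.Parse(DOCUMENT, True)

for event in events:
    print(event)
```

The comment and the processing instruction are reported separately from the element content, so an application is free to ignore them, which is exactly the behavior the list above describes.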

Part of the DTD's most important function is to specify what kinds of markup are legal within other kinds of markup, and to state contextual frames of reference for what sequences or occurrences are explicitly legal. This way, the XML parser can be a very simple Look-Ahead LR (LALR) parser, which can build a representation of the document in a single pass. This restriction eliminates most of the complexity inherent in full-blown SGML, and eliminates many of the special processing cases that cause SGML parsers to be large, complex, and difficult pieces of software to build and maintain.

From another perspective, any well-formed or valid XML document consists of three basic parts (with examples):

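  1. An XML declaration: essentially names the version of XML in use. For all but experimental purposes, the ellipsis is replaced by an encoding declaration (which names the character set) or a standalone document declaration (the successor to the required markup declaration of earlier drafts). In XML 1.0 as finally recommended, the declaration's name is written in lowercase:

    <?xml version="1.0" ... ?>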

  2. A DTD: appears between the XML declaration and the start of the document element, and usually begins with markup that reads:

    <!DOCTYPE ...[...]>

  3. The document element: brackets the entire contents of the XML document between a pair of properly defined markup tags.

Valid vs. well-formed
The first two parts (XML declaration and DTD declaration) are sometimes jointly called the prolog to the XML document, since they together define the complete context within which it must be interpreted.
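A complete, if tiny, document makes the three parts easy to see. The sketch below parses one with xml.dom.minidom from Python's standard library and reads back a value from each part; the greeting markup is an invented example, not part of any dialect discussed here.

```python
from xml.dom import minidom

# Prolog (XML declaration plus DTD) followed by the document element.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>"""

doc = minidom.parseString(DOCUMENT)

print(doc.version)                          # from the XML declaration
print(doc.doctype.name)                     # from the DTD
print(doc.documentElement.tagName)          # the document element
print(doc.documentElement.firstChild.data)  # its content
```

Note that the name given in the DOCTYPE line matches the name of the document element, which is what ties the DTD to the content it governs.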

The distinction between a well-formed XML document and a valid XML document is best explained by stating that a well-formed XML document is one whose syntax and logical structure are sound: every start tag has a matching end tag, and all internal elements are neatly nested within a single root or document element. But well-formedness doesn't require that the document's elements and attributes be checked against any and all referenced DTDs, nor is it necessary to check that each element in a well-formed document contains all the required subelements that its DTD says it should.

For a document to be valid, however, it must use only elements and attributes that are declared in a DTD that is either contained within or pointed to in the XML document's prolog, and it must meet the nesting and containment requirements established by the DTD that governs the document.
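Well-formedness is the easier of the two properties to test, because it needs no DTD at all. The following sketch assumes Python's standard xml.etree.ElementTree parser, which is non-validating: it rejects text that breaks XML's syntax rules but never checks a document against a DTD, so a full validity check would need a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested, single root element.
print(is_well_formed("<memo><to>Staff</to></memo>"))
# Overlapping tags: <to> is closed after its parent.
print(is_well_formed("<memo><to>Staff</memo></to>"))
# Two root elements instead of one document element.
print(is_well_formed("<memo><to>Staff</to></memo><extra/>"))
```

The memo markup here is invented; the point is only that the first document passes while the other two fail, whatever the tags happen to mean.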

Most XML authoring tools will permit only valid XML documents to be created, so this is typically not much of an issue. For large sites, or for more rapid delivery of data, Web designers may choose to deliver only well-formed documents to consumers, provided that the environment in which such documents are interpreted can be reasonably expected to have established the required context in advance.

How DTDs are digested by their documents
XML allows DTD declarations to be either explicitly included within an XML document's prolog, or linked by reference to some external DTD that may be stored in another file. In the case of larger, more complex sites, or documents that include standard XML dialects, it's entirely possible that only external DTD references will appear, or that a mixture of internal DTD declarations and external references will appear.

The important thing to remember is that internal declarations always take precedence over external ones, no matter what the order of declaration may be, so it's wise to think of external DTDs as defining the implicit base for a document's markup, attributes, and entities. Internal markup can be thought of as a way to extend, tweak, or customize whatever context the external DTDs referenced in the prolog establish. This, too, helps make XML powerful and flexible, since it lets page designers draw on bodies of existing, established markup, yet add to or modify that markup as their needs dictate.
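The precedence rule can be observed with the expat bindings in Python's standard library. This sketch makes several assumptions for the sake of a self-contained example: the external DTD lives in a string instead of a file, the name note.dtd is hypothetical and never actually fetched, and both entities are invented. The same greeting entity is declared internally and externally, while site is declared only externally.

```python
from xml.parsers import expat

# Hypothetical external DTD; held in a string so the sketch needs no files.
EXTERNAL_DTD = """<!ENTITY greeting "external greeting">
<!ENTITY site "external site name">"""

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd" [
  <!ENTITY greeting "internal greeting">
]>
<note>&greeting; / &site;</note>"""

def expand(document, external_dtd):
    """Parse the document and return its text with entities expanded."""
    pieces = []
    parser = expat.ParserCreate()
    # Ask expat to request the external DTD subset instead of skipping it.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def load_external(context, base, system_id, public_id):
        # A real application would fetch system_id; we hand back the string.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(external_dtd, True)
        return 1

    parser.ExternalEntityRefHandler = load_external
    parser.CharacterDataHandler = pieces.append
    parser.Parse(document, True)
    return "".join(pieces)

print(expand(DOCUMENT, EXTERNAL_DTD))
```

The internal declaration of greeting wins, while site falls through to the external DTD, which matches the picture of external DTDs as a base that internal declarations customize.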

Writing DTDs is not an exercise for the faint of heart. Those with a programming background will appreciate the description of DTDs as the combination of an include file and the data declaration section for a program -- on steroids. Those who lack such a background will have to be content to learn that writing DTDs involves mastering a complex declarative syntax, but one that allows objects, variables, and structures to be defined and laid out with great precision and control.

Building XML markup
Creating XML documents is almost anticlimactic, once the proper preamble (or processing context, if you prefer) has been established. It will invariably occur within the confines of some authoring application, be it a special-purpose XML authoring tool, such as docproc, an XML document processor developed by Sean Russell of the University of Oregon, or a content creation tool, like the extension to the SGML mode for emacs developed by David Megginson from Microstar Software.

It simply becomes a matter of arranging the tags that are defined in the preamble, with appropriate inclusion of content text and attribute references. This is very much like writing HTML markup. In fact, most XML tools support pull-down access to the markup that's legal at any given point within an XML document. They also generally enforce inclusion of required attributes and values, and provide ongoing validation of the documents while they're under construction. As long as your markup needs aren't so complex that you need to define an entirely custom-built markup language, once you master your XML tool's interface, the process of creating content should be both straightforward and speedy.

Should your needs go so far as to require custom-building your own DTD, some training in SGML is highly recommended, and it may even be worth contacting a qualified SGML consultant to help keep the learning curve (or development time, however you measure progress) to a manageable interval. We believe that most developers will find that working within the confines of an established XML dialect is sufficient. In fact, this approach represents the best way to familiarize yourself with the metalanguage and its capabilities.

For that reason, the next nine sections provide brief coverage of some of the best-known or most promising dialects of XML (for pointers to other dialects or work under way, consult any of the URLs mentioned in the Resources section).

Channel Definition Format (CDF)
The first of the commercial XML dialects, CDF represents Microsoft's effort to define a better mechanism to name and specify Web-based channels for data and text delivery. Since the delivery of Internet Explorer 4.0, which includes built-in CDF capabilities, CDF has been picked up by many content providers and software vendors, and CDF channels have been popping up like mushrooms after the rain. This technology appears both workable and well-accepted, and bodes well for XML in general.

Chemical Markup Language (CML)
Peter Murray-Rust created CML, the first XML dialect, to illustrate what XML can do for the chemical community as much as anything else. For most purposes, CML remains a tour-de-force illustration of XML's possibilities, and a vehicle that permits Murray-Rust to demonstrate his exceptional XML browser software.

CML is a tightly-focused form of XML notation that permits chemists to manipulate and model atoms and molecules directly, in the form of well-defined and articulated data elements; CML also supports all kinds of standard document elements found in scholarly papers, such as footnotes, citations, mathematical and chemical formulae, glossary terms, and so on. The specification, while less than ten pages long, can capture just about any kind of chemical data or formula.

Resource Description Framework (RDF)
This effort represents the W3C's attempt to bring three separate but related efforts together under one notation: the W3C's own work on content labels and ratings in PICS (the Platform for Internet Content Selection); Netscape's Meta Content Framework (MCF), which leveraged Apple's pioneering work to define a general-purpose metadata language; and Microsoft's XML-Data, which represents a similar but incompatible effort. The W3C is seeking to perpetuate the best of each of these attempts under the general heading of RDF, but the jury's still out on how well this work is going. Netscape has pledged to support RDF in Communicator 5.0, which is expected before the end of this year.

Extending your sense of XML style (XSL)
The Extensible Style Language represents a nascent attempt to create a more dynamic and powerful notation for defining document style, and to augment the capabilities of the Cascading Style Sheets work (CSS1 and CSS2) already in place at the W3C. Objectives here include a model that can dynamically resize itself completely around base font selections (which CSS cannot currently handle) and more powerful, interactive support for document styles and rendering. At present, this work is largely experimental, and most active development uses CSS1 or CSS2 style sheets for production. Still, just as XML represents a strict subset of SGML, the work on XSL derives in large part from DSSSL (the Document Style Semantics and Specification Language), the style sheet language developed in the SGML community.

Mathematical Markup Language (MathML)
MathML represents the resuscitation of mathematical markup originally planned for HTML 3.0, but later abandoned owing to complexity of implementation and lack of agreement among working group participants. Current efforts attempt to leverage Donald Knuth's work on TeX, as formulated by the American Mathematical Society (AMSTeX), and are well accepted in draft form in the mathematical community.

Extensible Linking Language (XLL)
XLL defines an incredibly rich syntax and semantics for hyperlinks, including abilities to traverse various kinds of named links, to instruct a single link to coordinate multiple updates to different regions on-screen (solving the single link limitation problem that users of HTML markup will no doubt recognize), and to invoke various types of special processing when certain links are selected. This promises to expand the capabilities inherent in hypertext well above and beyond current HTML implementations.

Synchronized Multimedia Integration Language (SMIL)
A brand-new effort, this dialect of XML is intended to help content developers better synchronize audio and graphics for Web delivery, without having to resort to specialized and expensive tools like Macromedia Director. It's still too early to tell if this effort will catch fire or not, but it certainly appears to be attracting interest in the development community.

Open Financial Exchange (OFX)
A collaboration among Microsoft, Intuit (the maker of Quicken), and CheckFree, this markup language works behind the scenes in applications like Microsoft Money and Quicken. Although the current version is SGML-based, efforts are underway to convert this to XML, making it more broadly accessible to modern Web browsers (or where most such browsers should be by the end of 1998).

OpenTag
OpenTag is an effort launched by the International Language Engineering Corporation (ILE), known for its work in automated translation software and natural language processing. This effort is intended to permit a single XML document to deliver content in different languages, which could then be filtered by settings in a user's browser to display the language (or languages) of interest for specific documents. This is of great interest outside the English-speaking world.

Ruminations on the future of XML
The range and power of XML permits it to be applied to many different problem areas. Its openness and extensibility appear to make XML a natural choice when developing complex content, working with multiple data sources, or when special presentation or data-handling must be part and parcel of the data delivery process.

XML will not necessarily replace HTML as the primary markup language for the Web, but it certainly offers to extend current capabilities well beyond their present limitations. While the learning curve will be steep, especially for those who seek to define their own complex markup, the rewards should be great, especially since all the major browser vendors either already support XML or promise to do so before the end of 1998.

Ultimately, XML will succeed or fail based on what it can do for content providers, and on how well it extends what content consumers (the users) can see and do. Because XML means better control over the information that gets to their desktops, and promises a single consistent interface to lots of different kinds of distributed functionality, we expect XML to be a big win for most users. Likewise, because XML delivers improved flexibility, reduces the maintenance requirements imposed by today's hodgepodge of intermixed data and code, and because it provides methods to extend markup as needed, we also believe XML can bring big wins to content developers. Our beliefs notwithstanding, it will be interesting to see what role XML will play in the ever-expanding future of the Web.


Resources
General XML Resources; XML Dialects; XML Software; SGML Information; Books and Journals

About the author
Ed Tittel is a principal at LANWrights, Inc., an Austin, TX-based consultancy. He is the author of numerous magazine articles and more than 25 computer-related books, most notably "HTML for Dummies," 2nd Ed., "The 60-Minute Guide to Java," 2nd Ed., and "Web Programming Secrets." Check out the LANWrights site at http://www.lanw.com. Reach Ed at ed.tittel@sunworld.com.

All general solutions are quite complex

The Standard Generalized Markup Language (SGML) originated at IBM in the late 1960s when the company realized that it needed some method to simplify moving documents across different hardware platforms and operating systems. The company's initial efforts were called GML, which stands for Generalized Markup Language, and were originally developed solely for internal use. In fact, GML is one of the first instances of what's come to be known as "publish-once, use many times" technology. This approach has become incredibly popular since IBM's first efforts.

GML's originators included Charles Goldfarb, Ed Mosher, and Ray Lorie (some wags have identified their initials as the real meaning of "GML"). By the 1970s, these farsighted gentlemen realized that a more general version of their markup language would make it possible to move documents from any one system to any other, even if the famous three-letter acronym might be missing on one (or both) such systems. This effort led directly to the definition and proliferation of SGML in the 1980s, and to the development and publication of the ISO 8879 standard that today governs SGML.

SGML is incredibly powerful, and can represent just about any conceivable kind of document, but it's also quite complex. It permits document specifications (DTDs) of almost any kind to be defined. Once defined, DTDs may be referenced within individual documents to create instances of the type that contain readable content.

Certain commercial and government organizations, including the Department of Defense and the IEEE (the Institute of Electrical and Electronics Engineers), have adopted SGML as a standard format for submission of certain kinds of documents, either for consideration as standards themselves, or simply to provide a platform-neutral way of delivering documentation on products or systems purchased by such entities. The Department of Defense requires its contractors to submit all documentation using SGML, and SGML has proven itself to be useful for documenting chip layouts, aircraft designs, and other complex collections of information in other industries.

A sizable industry has grown up around SGML, but the costs of entry are high. They include complex and expensive software and systems, and the need for a highly trained staff to work within this difficult and sometimes harrowing environment. In many ways, XML represents the first real mainstream implementation of SGML, and its success could mean that the rewards that have been slow in coming to early SGML pioneers might finally be forthcoming in this new application of the technology. For more information about SGML, please visit the Resources listed under the SGML heading in the Online Resources sidebar to the main XML story.