Extensible Markup Language (XML)

C. M. Sperberg-McQueen

University of Illinois at Chicago
u35395@uicvm.uic.edu

Tim Bray

Textuality
tbray@textuality.com
Keywords: text encoding, WWW

Extensible Markup Language (XML for short) is being designed under the auspices of the World-Wide-Web Consortium (W3C); the larger goal of this effort is "to enable future Web user agents to receive and process generic SGML in the way that they are now able to receive and process HTML. As in the case of HTML, the implementation of SGML on the Web will require attention not just to structure and content (the domain of SGML per se) but also to link semantics and display semantics." (See http://www.w3.org/pub/WWW/MarkUp/SGML/Activity for the W3C's description of this activity.) As a subgoal, we are creating an SGML application profile, XML, that is designed to provide many of the benefits of SGML in a lightweight, easy-to-use, easy-to-implement dialect that omits many of the difficult or problematic features of the full standard. This paper is a report on the XML specification; if time allows, some information will also be provided on the progress of the work toward a typology of links and link behaviors. At the time this abstract is prepared, the XML specification has been made public, but is still officially a working draft.

Motivation

The Standard Generalized Markup Language (SGML) is the most fully developed specification of the use of descriptive markup languages for electronic documents. The idea of descriptive markup is simple and powerful, and in fact has proved to be a basic requirement for many advanced information processing applications.

Unfortunately, the adoption of SGML has proved surprisingly difficult, expensive and slow, given that the underlying ideas are simple and self-evidently good. In particular, there is very little use of SGML on the World-Wide Web, which is the world's most popular electronic information delivery mechanism. Some of the perceived reasons have included:

The SGML standard itself is large, complex, and difficult to understand.
The standard specifies several optional and advanced markup features, some of which are almost never implemented.
Some of the features of SGML have proven counter-productive in practical use.
Practical use of SGML requires learning several other languages, including the language used to write DTDs, various stylesheeting and formatting languages, and the SGML/Open Entity Catalog language.
The design of SGML takes little account of the contemporary theory of formal languages and finite automata. One practical result is that SGML parsers are unable to make use of some advanced tools and techniques made possible by that theory. Consequently, they are large and complex pieces of computer software; as such they (a) suffer from reliability problems, (b) have in practice proven difficult to integrate into applications, and (c) change slowly in response to advances in software and document processing technology.

Nonetheless, there remains a consensus that SGML's basic design partition into entities, elements, and attributes is correct and useful. One result is a common tendency, in strategic projects involving SGML, to avoid using many advanced features and operate within the bounds of a highly restricted subset. This approach has generally met with success. However, this restricted subset has been re-invented by each successive group that has attacked the problem.

The SGML standard itself identifies two subsets of its features, intended to simplify implementation: Minimal SGML (defined in ISO 8879, clause 15.1.2) and Basic SGML (ISO 8879, 15.1.1). These have not had any practical significance, however, both because the choice of SGML features they include is not a happy one and because they have no free-standing definition, which means they cannot be implemented by anyone who has not first studied and understood the full text of ISO 8879.

There has been informal discussion for years on the subject of a further-simplified version of the standard. In recent times, there have been a substantial number of formal proposals for such a simplification. They include:

A lexical analyzer for HTML by Dan Connolly of W3C, as presented in "A Lexical Analyzer for HTML and Basic SGML: W3C Working Draft 15-Jun-96" (http://www.w3.org/pub/WWW/TR/WD-sgml-lex); this is a slight simplification of Basic SGML, with no entity handling.
the Minimal Generalized Markup Language (MGML) defined by Tim Bray in "MGML -- an SGML Application for Describing Document Markup Languages", unpublished draft paper for SGML '96 ( http://www.textuality.com/mgml/index.html). MGML is unique among the contributions in that it proposes using instance syntax for markup declarations.
Normalised SGML (NSGML), an invention of Henry Thompson, David McKelvie, and Steve Finch, presented in "The Normalised SGML Library (NSL)", NSL Version 1.4.4, Documentation version Fri Aug 2 14:13:40 BST 1996 ( http://www.ltg.ed.ac.uk/corpora/nsldoc/nsldoc.html). NSGML includes not just a language definition but a suite of software modules for parsing and handling documents in an efficient pipelined fashion.
Poor-Folks SGML (PSGML), defined by Michael Sperberg-McQueen in "PSGML: Poor-Folks SGML: A Subset of SGML for Use in Distributed Applications", Document UIC CC DB92-10, 8 October 1992 ( http://www.uic.edu/~cmsmcq/uic/db92-10.tei or http://www.uic.edu/~cmsmcq/uic/db92-10.html).
SGML-Lite, by Bert Bos, presented in "`SGML-Lite' -- an easy to parse subset of SGML", 4 July 1995 ( (http://grid.let.rug.nl/~bert/Stylesheets/SGML-Lite.html).
SGML Online, by Eliot Kimber, described in "SGML Revision, Proposal for Minimal SGML Feature Set" (unfinished, unpublished draft, distributed privately via email on 1996-06-03, now accessible at http://www.textuality.com/sgml-erb/kimber/index.html).
the TEI Interchange Format defined in Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC), Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by C. M. Sperberg-McQueen and Lou Burnard (Chicago, Oxford: Text Encoding Initiative, 1994). The portions of the TEI Guidelines relevant to the interchange format have been extracted and are available separately at (http://www-tei.uic.edu/orgs/tei/ml/tif.html).

These simplified application profiles of SGML all take advantage of the fact that SGML exhibits an extreme case of the `80-20 syndrome'; that is to say, 80% of the benefit is gained by applying only 20% of the machinery. The W3C SGML Activity has formalized the definition of a useful subset in the form of the Extensible Markup Language, or XML.

Structure, Membership, and Mechanisms

The current work was initiated by Jon Bosak of Sun Microsystems, who, in co-operation with the Tim Berners-Lee and Dan Connolly of the World-Wide Web Consortium, initiated the formation of the Consortium's SGML Editorial Review Board and Working Group, who labor under the unwieldy acronyms W3C SGML ERB and W3C SGML WG. The mandate for this effort may be found at http://www.w3.org/pub/WWW/MarkUp/SGML/Activity; it includes SGML simplification and work on hyperlink semantics and display processing (presumably via a DSSSL profile). This paper describes the SGML simplification work.

The work is co-ordinated by the Editorial Review Board. Its members are: Jon Bosak (Sun, Chair), Tim Bray (Textuality, XML Co-Editor), James Clark, (Independent, Technical Lead), Steve DeRose (EBT), Dave Hollander (HP), Eliot Kimber (Passage), Tom Magliery (NCSA), Eve Maler (ArborText), Jean Paoli (Microsoft), Peter Sharpe (SoftQuad), and Michael Sperberg-McQueen (University of Illinois at Chicago, XML Co-Editor); Dan Connolly serves as liaison with W3C. The main functions of the ERB are to steer the design and discussion activities, and to resolve issues by voting. There is a well-defined voting procedure designed to maximize the chances of reaching consensus and to exercise majority rule rapidly when consensus is not possible.

The main work is done in the Working Group; this has over 60 members, including those of the ERB. The Working Group provides technical input, design proposals, and design critiques. It includes many people who have published significant papers on SGML or played a visible role in the design, evolution, and implementation of SGML; in particular Charles Goldfarb and James Mason from WG8. As a result of this overlap, it is likely that XML will avoid taking any directions fundamentally incompatible with the future development of SGML; in fact, the debate on XML is apt to have some influence on the next SGML revision.

Design Goals

Prior to the commencement of discussion in the WG, the ERB developed a `strawman' set of design goals to guide this discussion. While these remain open for challenge and revision, they have been fairly stable and thus presumably represent a reasonably large-scale consensus among those involved in this work. The design goals are:

XML shall be straightforwardly usable over the Internet. This does not mean users can feed it to, for example, the Netscape of today, but that the design will have regard at all times to the needs of distributed applications working on large-scale networks.
XML shall support a wide variety of applications. No design elements shall be adopted which would impair the usability of XML documents in other contexts such as print or CD-ROM, nor in applications other than network browsing, such as validating editors, batch validators, simple filters which understand XML document structure, normalizers, formatting engines, translators to render XML documents into other lanuages, and specialized browsers for specialized markup.
XML shall be compatible with SGML. I.e. (1) Existing SGML tools will be able to read and write XML data. (2) XML document instances are SGML document instances as they are, without changes. (3) For any XML document, a DTD can be generated such that SGML will produce the same parse as would an XML processor. (4) XML should have essentially the same expressive power as SGML. Note: (1) and (2) describe our goal in its ideal form. If this goal is not achievable in its fullest form, then we may back out to a weaker form: it shall be simple to transform XML documents into equivalent SGML documents, and vice versa. Our intention, however, is to bite the bullet and ensure if we can that no transformation is needed to allow SGML tools to read and write XML document instances. (3) and (4) indicate our intentions accurately, but it is not yet clear how best to formalize and explain the phrase "the same parse", or the phrase "essentially the same expressive power". These remain open questions; see point 8 also.
It shall be easy to write programs which process XML documents. In particular, it shall be straightforwardly possible to construct useful XML applications which do not read, or need to read, the DTD of the XML document. Note: For this purpose, easy means that the holder of a bachelor's degree in computer science should be able to construct basic processing (parsing, if not validating) machinery in less than a week, and that the major difficulty in the application should be the application-specific functions; XML should not add to the inherent difficulties of writing such applications.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero. As a result of this, any XML document has a high probability of being handled successfully by any XML processor.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly. A first draft of the XML design should be ready for distribution and comment by end of 1996; a version should be ready for production use by the end of March 1997.
The design of XML shall be formal and concise. XML should be simple and easy for implementors to grasp; its reference documentation should not exceed 20 pages, which should contain mostly formal grammar and very little normative text, if any. Note: normative text is not the same as descriptive or explanatory text. XML shall specify clearly what characteristics of the input must be represented in the parse tree of an XML document, and what characteristics need not be captured by XML processors. This means the property sets `significant' in an XML application will be defined both formally and informally. Which properties are significant and which are insignificant remains an open question.
XML documents shall be easy to create. It should be a straightforward task (though possibly labor-intensive) to create valid XML documents by hand (i.e. without a validating authoring tool). It should be a straightforward task (though possibly labor-intensive) to create a validating XML authoring system.
Terseness is of minimal importance. Minimizing keystrokes is not deemed important in achieving any of the above goals, but other things being equal a concise notation should be preferred to a verbose.

A Snapshot of XML

At the time this paper is submitted, an initial public draft of the XML specification has been distributed, but like all working drafts it is subject to change. The broad outlines of XML, however, are clear enough to be summarized here.

XML omits a large number of SGML features often left unused in practice: DATATAG, OMITTAG, RANK, LINK, CONCUR, SHORTREF, SUBDOC, and FORMAL are all dropped. SHORTTAG, which defines several ways in which SGML documents may abbreviate their tags, is entirely disallowed except that attributes need not be specified if a default is specified for them when they are declared.

Most of these features are rarely used in any case; the most visible change is the absolute abandonment of SGML techniques of markup minimization. In XML documents, all tags are always present in full (except that attributes may be omitted if they have their default values). This will make no difference to those who use SGML or XML editors; others may choose to write their documents using standard SGML tag omission and then run the document through a normalizer like James Clark's spam.

In order to ensure that XML processors can, under certain circumstances, skip the document's DTD and still process the document correctly, empty elements (like the TEI's PTR element or like the HTML BR element) are required to be self-identifying: instead of the form <e>, they must take the form <e/>. This simple innovation radically reduces the complexity of parsing XML documents.

Comments and processing instructions are retained; XML uses a number of specialized processing instructions of its own as declarations. Comments are simplified, however, to try to minimize user errors.

In order to ensure the widest possible use, XML requires conforming processors to support usage of the characters from ISO 10646 (Unicode) in both markup and data. For the convenience of those still working without Unicode editors (currently the majority of users), processors are encouraged to accept other character-set encodings as well.

XML also restricts, in some ways, the normal SGML syntax for declaring elements and attributes. In particular, the AND connector is dropped, inclusion and exclusion exceptions are dropped, and the set of data types for attributes is simplified and rationalized (within the limits set by the design goal of compatibility with SGML).

Conditional marked sections are allowed in the DTD, but not in the document instance. In DTDs, conditional sections allow easy customization of the DTD; they appear unnecessary in document instances, since most practitioners agree that variant text is better handled with specialized elements and style-sheets. CDATA marked sections, in which markup characters need not be escaped, are allowed only in the document itself, and only in a restricted form.

In the interests of simplicity, XML abandons SGML's notion of abstract syntax and defines only a single concrete syntax, modeled on SGML's reference concrete syntax but extended to handle polyglot documents and large documents better. In XML, all tags will be enclosed in < and >, all entity references between & and &refc;, and all attribute values quoted. Unlike SGML, XML provides no mechanism for changing the default delimiters.

Future Plans

At the time this paper is submitted, it is the intention of the ERB to revise the draft XML spec and to turn, in early 1997, to the topic of hyperlink typology. In late 1997, the third phase of the project will see the specification of a subset of the Document Style Semantics and Specification Language (DSSSL) intended for use in network browsers (DSSSL-Online). The XML specification may change as a result of work in the two later phases; when it appears stable, steps will be taken to move it through the normal W3C processes to make it a technical report, then a proposed recommendation, and finally a specification of recommended practice.