[Archive copy mirrored from the URL: http://www.qucis.queensu.ca/achallc97/papers/p050.html; see this canonical version of the document.]
Paper
Keywords: text encoding, WWW
Extensible Markup Language (XML for short) is being designed under the auspices of the World-Wide-Web Consortium (W3C); the larger goal of this effort is "to enable future Web user agents to receive and process generic SGML in the way that they are now able to receive and process HTML. As in the case of HTML, the implementation of SGML on the Web will require attention not just to structure and content (the domain of SGML per se) but also to link semantics and display semantics." (See http://www.w3.org/pub/WWW/MarkUp/SGML/Activity for the W3C's description of this activity.) As a subgoal, we are creating an SGML application profile, XML, that is designed to provide many of the benefits of SGML in a lightweight, easy-to-use, easy-to-implement dialect that omits many of the difficult or problematic features of the full standard. This paper is a report on the XML specification; if time allows, some information will also be provided on the progress of the work toward a typology of links and link behaviors. At the time this abstract is prepared, the XML specification has been made public, but is still officially a working draft.
The Standard Generalized Markup Language (SGML) is the most fully developed specification of the use of descriptive markup languages for electronic documents. The idea of descriptive markup is simple and powerful, and in fact has proved to be a basic requirement for many advanced information processing applications.
Unfortunately, the adoption of SGML has proved surprisingly difficult, expensive and slow, given that the underlying ideas are simple and self-evidently good. In particular, there is very little use of SGML on the World-Wide Web, which is the world's most popular electronic information delivery mechanism. Some of the perceived reasons have included:
Nonetheless, there remains a consensus that SGML's basic design partition into entities, elements, and attributes is correct and useful. One result is a common tendency, in strategic projects involving SGML, to avoid using many advanced features and operate within the bounds of a highly restricted subset. This approach has generally met with success. However, this restricted subset has been re-invented by each successive group that has attacked the problem.
The SGML standard itself identifies two subsets of its features, intended to simplify implementation: Minimal SGML (defined in ISO 8879, clause 15.1.2) and Basic SGML (ISO 8879, 15.1.1). These have not had any practical significance, however, both because the choice of SGML features they include is not a happy one and because they have no free-standing definition, which means they cannot be implemented by anyone who has not first studied and understood the full text of ISO 8879.
There has been informal discussion for years on the subject of a further-simplified version of the standard. In recent times, there have been a substantial number of formal proposals for such a simplification. They include:
These simplified application profiles of SGML all take advantage of the fact that SGML exhibits an extreme case of the `80-20 syndrome'; that is to say, 80% of the benefit is gained by applying only 20% of the machinery. The W3C SGML Activity has formalized the definition of a useful subset in the form of the Extensible Markup Language, or XML.
The current work was initiated by Jon Bosak of Sun Microsystems, who, in co-operation with Tim Berners-Lee and Dan Connolly of the World-Wide Web Consortium, initiated the formation of the Consortium's SGML Editorial Review Board and Working Group, which labor under the unwieldy acronyms W3C SGML ERB and W3C SGML WG. The mandate for this effort may be found at http://www.w3.org/pub/WWW/MarkUp/SGML/Activity; it includes SGML simplification and work on hyperlink semantics and display processing (presumably via a DSSSL profile). This paper describes the SGML simplification work.
The work is co-ordinated by the Editorial Review Board. Its members are: Jon Bosak (Sun, Chair), Tim Bray (Textuality, XML Co-Editor), James Clark (Independent, Technical Lead), Steve DeRose (EBT), Dave Hollander (HP), Eliot Kimber (Passage), Tom Magliery (NCSA), Eve Maler (ArborText), Jean Paoli (Microsoft), Peter Sharpe (SoftQuad), and Michael Sperberg-McQueen (University of Illinois at Chicago, XML Co-Editor); Dan Connolly serves as liaison with W3C. The main functions of the ERB are to steer the design and discussion activities, and to resolve issues by voting. There is a well-defined voting procedure designed to maximize the chances of reaching consensus and to exercise majority rule rapidly when consensus is not possible.
The main work is done in the Working Group; this has over 60 members, including those of the ERB. The Working Group provides technical input, design proposals, and design critiques. It includes many people who have published significant papers on SGML or played a visible role in the design, evolution, and implementation of SGML; in particular Charles Goldfarb and James Mason from WG8. As a result of this overlap, it is likely that XML will avoid taking any directions fundamentally incompatible with the future development of SGML; in fact, the debate on XML is apt to have some influence on the next SGML revision.
Prior to the commencement of discussion in the WG, the ERB developed a `strawman' set of design goals to guide this discussion. While these remain open for challenge and revision, they have been fairly stable and thus presumably represent a reasonably large-scale consensus among those involved in this work. The design goals are:
At the time this paper is submitted, an initial public draft of the XML specification has been distributed, but like all working drafts it is subject to change. The broad outlines of XML, however, are clear enough to be summarized here.
XML omits a large number of SGML features often left unused in practice: DATATAG, OMITTAG, RANK, LINK, CONCUR, SHORTREF, SUBDOC, and FORMAL are all dropped. SHORTTAG, which defines several ways in which SGML documents may abbreviate their tags, is entirely disallowed except that attributes need not be specified if a default is specified for them when they are declared.
Most of these features are rarely used in any case; the most visible change is the absolute abandonment of SGML techniques of markup minimization. In XML documents, all tags are always present in full (except that attributes may be omitted if they have their default values). This will make no difference to those who use SGML or XML editors; others may choose to write their documents using standard SGML tag omission and then run the document through a normalizer like James Clark's spam.
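The effect of abandoning markup minimization can be sketched with a present-day parser. The example below is only an illustration: it assumes Python's standard-library xml.etree.ElementTree, which implements the later XML 1.0 Recommendation rather than the working draft described here, and the helper name is_well_formed and the sample documents are invented for the purpose.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every start-tag has its matching end-tag, as XML requires.
full = "<list><item>one</item><item>two</item></list>"

# SGML-style tag omission: the end-tags of <item> are left out.
minimized = "<list><item>one<item>two</list>"

print(is_well_formed(full))       # True
print(is_well_formed(minimized))  # False
```

The second document would be legal SGML under a DTD that permits end-tag omission for item; an XML processor, which does not infer missing tags, rejects it.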
In order to ensure that XML processors can, under certain circumstances, skip the document's DTD and still process the document correctly, empty elements (like the TEI's PTR element or like the HTML BR element) are required to be self-identifying: instead of the form <e>, they must take the form <e/>. This simple innovation radically reduces the complexity of parsing XML documents.
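As a rough illustration of why the self-identifying form matters, the sketch below parses a fragment containing an empty element with no DTD available. It assumes Python's standard-library xml.etree.ElementTree (an implementation of the later XML 1.0 Recommendation, not of this draft); the helper parses_without_dtd and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

def parses_without_dtd(text):
    """Return True if text can be parsed with no DTD available."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The trailing slash tells the parser that br has no content and no end-tag.
doc = "<p>first line<br/>second line</p>"
p = ET.fromstring(doc)
br = p[0]

# The empty element has no children and no text of its own;
# the text that follows it is stored as its "tail".
print(br.tag, len(br), br.text, repr(br.tail))  # br 0 None 'second line'

# Without the slash, only a DTD could say that br is empty,
# so a DTD-less parser treats it as an unclosed element.
sgml_style = "<p>first line<br>second line</p>"
print(parses_without_dtd(sgml_style))  # False
```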
Comments and processing instructions are retained; XML uses a number of specialized processing instructions of its own as declarations. Comments are simplified, however, to try to minimize user errors.
In order to ensure the widest possible use, XML requires conforming processors to support usage of the characters from ISO 10646 (Unicode) in both markup and data. For the convenience of those still working without Unicode editors (currently the majority of users), processors are encouraged to accept other character-set encodings as well.
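A small sketch of what this requirement means in practice, assuming Python's standard-library xml.etree.ElementTree and invented sample documents. Note that the encoding declaration shown follows the syntax of the later XML 1.0 Recommendation, which may differ from the form used in the working draft.

```python
import xml.etree.ElementTree as ET

# Non-ASCII characters appear in the markup (element and attribute names)
# as well as in the data. Bytes with no declaration are read as UTF-8.
source = '<résumé língua="pt">São Paulo, Ελλάδα, 東京</résumé>'
root = ET.fromstring(source.encode("utf-8"))

print(root.tag)            # résumé
print(root.get("língua"))  # pt
print(root.text)           # São Paulo, Ελλάδα, 東京

# A document in another encoding declares that encoding up front.
latin1 = '<?xml version="1.0" encoding="iso-8859-1"?><mot>naïve</mot>'
word = ET.fromstring(latin1.encode("iso-8859-1"))
print(word.text)           # naïve
```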
XML also restricts, in some ways, the normal SGML syntax for declaring elements and attributes. In particular, the AND connector is dropped, inclusion and exclusion exceptions are dropped, and the set of data types for attributes is simplified and rationalized (within the limits set by the design goal of compatibility with SGML).
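One consequence of dropping the AND connector is that a content model such as (title & author), meaning both elements in either order, has to be respelled with the sequence and choice connectors alone. The following Python sketch shows one naive way to do this; it is not a procedure given in the XML draft, and the function name expand_and_group is hypothetical.

```python
from itertools import permutations

def expand_and_group(names):
    """Rewrite the SGML group (a & b & ...) using only ',' and '|'.

    Each possible ordering of the names becomes one sequence group,
    and the sequence groups are joined as alternatives.
    """
    sequences = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "(" + " | ".join(sequences) + ")"

print(expand_and_group(["title", "author"]))
# ((title, author) | (author, title))
```

The number of alternatives grows factorially with the number of names, and for three or more names this flat expansion lets several alternatives begin with the same element, so it would need to be factored further to satisfy SGML's rule against ambiguous content models. In practice a DTD designer would more likely fix an order or use a repeatable choice group.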
Conditional marked sections are allowed in the DTD, but not in the document instance. In DTDs, conditional sections allow easy customization of the DTD; they appear unnecessary in document instances, since most practitioners agree that variant text is better handled with specialized elements and style-sheets. CDATA marked sections, in which markup characters need not be escaped, are allowed only in the document itself, and only in a restricted form.
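The use of a CDATA marked section in a document instance can be sketched as follows. The example assumes Python's standard-library xml.etree.ElementTree, which follows the later XML 1.0 Recommendation; the element name example and its content are invented.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, '<' and '&' are ordinary data characters.
with_cdata = '<example><![CDATA[if (a < b && b < c) emit("<ok>");]]></example>'

# The same content written without a marked section must escape them.
escaped = '<example>if (a &lt; b &amp;&amp; b &lt; c) emit("&lt;ok&gt;");</example>'

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escaped = ET.fromstring(escaped).text

print(text_from_cdata)                       # if (a < b && b < c) emit("<ok>");
print(text_from_cdata == text_from_escaped)  # True
```

Both spellings deliver the same character data to the application; the marked section only spares the author the escaping.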
In the interests of simplicity, XML abandons SGML's notion of abstract syntax and defines only a single concrete syntax, modeled on SGML's reference concrete syntax but extended to handle polyglot documents and large documents better. In XML, all tags will be enclosed in < and >, all entity references between & and ;, and all attribute values quoted. Unlike SGML, XML provides no mechanism for changing the default delimiters.
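The fixed concrete syntax can be illustrated with a few small cases. This is a sketch only, assuming Python's standard-library xml.etree.ElementTree (which implements the later XML 1.0 Recommendation); the helper accepts and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

def accepts(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Attribute values must always be quoted.
print(accepts('<ptr target="n1" type="note"/>'))  # True
print(accepts('<ptr target=n1 type=note/>'))      # False

# Entity references always run from '&' to ';'.
print(accepts('<p>AT&amp;T</p>'))                 # True
print(accepts('<p>AT&amp T</p>'))                 # False

# With the delimiters fixed, the quoted form yields the expected values.
ptr = ET.fromstring('<ptr target="n1" type="note"/>')
print(ptr.get("target"), ptr.get("type"))         # n1 note
```

SGML's SHORTTAG feature would allow the unquoted attribute values, and SGML permits the closing semicolon of an entity reference to be omitted in some contexts; XML allows neither.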
At the time this paper is submitted, it is the intention of the ERB to revise the draft XML spec and to turn, in early 1997, to the topic of hyperlink typology. In late 1997, the third phase of the project will see the specification of a subset of the Document Style Semantics and Specification Language (DSSSL) intended for use in network browsers (DSSSL-Online). The XML specification may change as a result of work in the two later phases; when it appears stable, steps will be taken to move it through the normal W3C processes to make it a technical report, then a proposed recommendation, and finally a specification of recommended practice.