[This local archive copy mirrored from the canonical site: http://www.solero.force9.co.uk/; see the official source if possible.]


Adrian Orlowski

[Author's note, 7 December 1997: The following article was first published in EXE magazine in June 1997 based on the first published draft of XML (Nov. 96) and has not been amended in the light of subsequent developments.]

User-definable markup for Web documents is on its way. In November 1996 the World Wide Web Consortium announced Extensible Markup Language as a successor to HTML. Already there are some big-name users.

Last year many software houses must have breathed sighs of relief that they finally got their products Web-enabled. Word processors sprouted Publish as HTML functions, Lotus Notes grew the ability to transform its text databases into Web format on the fly, and Adobe announced that henceforth Level 3 PostScript rasterizers would print HTML documents directly. What else could a user wish for?

Yet, like the arrival of Comet Hale-Bopp, what could prove to be the most significant Web event of 1996 crept into view unheralded and largely unobserved outside of a small band of initiates into the arcana of text processing. November's publication in draft form of the specification for XML (Extensible Markup Language) is a harbinger of change. Far from celebrating HTML's glories, it more evidently signifies a wish for the end of Web text as the world and its dog have come to know it.

XML might even inspire similarly post-millenniarist visions of a new Eden across the globe for computer text. But as the film title had it, heaven can wait, as can the visions of seers. What does XML offer, and what is that likely to cost?

Talking of the globe, XML finally mandates ISO 10646 (aka Unicode) as the underlying document character set. You can refer easily to every glyph and ideogram in use today for text by a decimal or hex bit-string. Normally, though, XML documents will have to specify that they use a particular encoding. All XML document processors have to be able to handle UTF-8 and UCS-2 encoding, and English language documents can default to UTF-8, but use of others such as Shift-JIS has to be announced in the document. A few HTML documents will grow in size when they become XML documents, but undoubtedly this part of the specification is a vote winner.
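As a rough illustration of character references and of why some documents grow, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree, which parses XML 1.0 as later finalized rather than the November 1996 draft. The word "café" is an arbitrary example, and UTF-16 big-endian stands in for UCS-2 (the two coincide for characters like these).

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal character references name the same ISO 10646 code point.
decimal = ET.fromstring("<word>caf&#233;</word>").text
hexadecimal = ET.fromstring("<word>caf&#xE9;</word>").text
same_text = decimal == hexadecimal == "caf\u00e9"

# The same four characters occupy different numbers of bytes in each encoding.
latin1_size = len(decimal.encode("latin-1"))   # one byte per character
utf8_size = len(decimal.encode("utf-8"))       # the accented letter needs two bytes
ucs2_size = len(decimal.encode("utf-16-be"))   # two bytes per character, no byte-order mark

print(same_text, latin1_size, utf8_size, ucs2_size)
```

Pure ASCII text is the same size in Latin-1 and UTF-8; only characters outside ASCII cost extra bytes.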

The way XML democratizes markup for publishers is likely to vex applications developers rather more. XML introduces the concept of user-defined markup. W3C, it seems, has had enough of trying to sanction lists of tag names for Web browser authors to aim for. It has been inspired, as it was the first time around with HTML, yet again by SGML, although in quite a different way from HTML. When it conforms (i.e. when it has a DTD) HTML at best is an application of (the rules of) SGML. XML, on the other hand, is near enough to SGML that it can correctly be said to be a subset of SGML. It has already garnered the moniker SGML-Lite.

XML also shares with SGML the general document processing model. The possibility that documents contain user-defined tag names effectively undermines the practice of Web browsers today of hard-wiring pre-defined processing semantics within software. What, then, is an XML document processor to do with tagged text? Like SGML, XML is purposely deaf to the question, but that question mark is not of course an EoF on the matter (merely entity-end, to adopt SGML-speak).
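One way an application might answer that question for itself is a table of handlers keyed by tag name. The following is a sketch only: the render function, the handler table and the sample tags are all hypothetical, and the parser is Python's xml.etree.ElementTree, which implements XML 1.0 as later finalized rather than the 1996 draft.

```python
import xml.etree.ElementTree as ET

def render(element, handlers):
    """Render an element using the handler registered for its tag name."""
    inner = (element.text or "") + "".join(
        render(child, handlers) + (child.tail or "") for child in element
    )
    # Tags the application has never seen fall through to their plain content.
    handler = handlers.get(element.tag, lambda text: text)
    return handler(inner)

# Hypothetical semantics chosen by one application for the tags of one letter.
letter_handlers = {
    "dear": lambda text: "Dear " + text.strip() + ",\n",
    "para": lambda text: text.strip() + "\n",
}

letter = ET.fromstring(
    "<note><dear>Peter</dear><para>Thank you.</para><ps>Bye</ps></note>"
)
output = render(letter, letter_handlers)
print(output)
```

The parser itself attaches no meaning to dear, para or ps. Everything the reader sees comes from the table the application supplies.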

Two additional proposals will issue from W3C. A draft proposal for Web (and other) hyper-linking, called Extensible Hyper Linkage (XHL), has already been published. Although somewhat drafty at present, it looks like it will include and extend the established <A>Click-me-I'm-coloured</A> HTML link semantics. And it promises to get the intellectual juices of red-blooded developers going with novelties such as multiple sources and/or multiple targets from one link, semantic content addressing, and hyper-link specifications which are extraneous to the documents they link, e.g. to link in read-only databases.

Then, towards the end of this year, W3C will publish the third leg of its strategy, and we will get to see how they envisage the new XML documents will be rendered by Web browsers of the future. Having given vent to the SGML tendency, it's likely W3C will espouse a solution on SGML lines. In the meantime you might get some clues by pointing your current Web browser at Alta Vista and searching for DSSSL, or more specifically a simpler variant called DSSSL-Online.

If W3C's overall strategy for XML is a three-legged pitch, those legs are by no means joined at the hip. There is no requirement for an XML document processor to support Extensible Hyper Linkage if the specific need is not there. (XHL, it's worth pointing out, is proffered up as much to the SGML community as to Web people.) Likewise XML document rendering looks as if it could be performed satisfactorily by a browser implementing Cascading Style Sheets (CSS1), independently of what W3C come up with in due course.

In fact, with just one leg of the XML stool more or less in place, Microsoft and Sun have already demonstrated it can be sat upon. In March Microsoft announced it had submitted to W3C a proposal for channel definition format (CDF) for broadcasting push content over the Web. More pointedly, they also announced Internet Explorer 4.0 would implement it. The CDF specification is for it to use XML, and an XML document type definition (DTD) is included. Hot on Microsoft's heels, and perhaps more impressively, Sun's John Bosak announced in comp.text.sgml the availability of the Solaris 2.5.1 manuals in XML form.

Although it is hoped that XML will shape Web browsers of the future, what seems to be exciting the cognoscenti is custom XML document processors. Publishers with specific types of binary as well as textual data can create software which uses an open document format but hard-codes the rendering semantics (for example, for commercial or proprietary reasons). That, of course, is where Java enters the picture. But while there are no particular issues about mixing Java in with XML documents, the availability of Java byte-code decompilers could be another matter. The real importance of XML to its progenitors, however, is not for better eye-catching graphics in the service of marketing, but for textual information interchange, with strong implications for client-side manipulation of the down-loaded information.

Unlike HTML, which has never found much purpose in life except as a presentation format for documents, XML documents are expected to be re-processible. In this they mimic SGML documents. But SGML, as probably all experienced practitioners would admit, is simply too darned hard for the mass market. Not the syntax, which can be hidden behind 'smart' SGML tools, but developing applications. There is an uncomfortably intimate connection between SGML the meta-language for designing specific document markup languages, and SGML the environment for parsing documents marked up in those languages. It means that a parser for even one specific type of SGML document (DTD) will end up knowing more SGML than should be necessary.

XML has been expressly shaped to make it easy to write parsers for the tag sets that users are expected to come up with. One estimate (from someone who has written a full scale SGML parser) of the time required is a couple of days. And that begs the question whether, affirmations in the proposal aside (methinks it doth protest too much), one is actually dealing with realistic SGML when engaged on XML. The XML draft does indeed list what has been dropped from SGML in its metamorphosis into XML (it is incomplete). On the other hand it's likely to make more sense to EXE readers to compare the broad syntactic differences between an HTML document and its XML equivalent.

One major difference from HTML is that, with one exception, XML documents have to have tags as matching start/end pairs. Eliding end-tags such as </P> from paragraph elements, as you can in HTML, is not permitted. However, where an element genuinely has no content, this can be indicated either by including an end-tag, or by signaling the fact on the (start) tag (or by using a DTD and forcing a processor to read it and remember elements declared as EMPTY). This means for example that HTML's horizontal rule element will usually end up being written in XML either as:

 <hr></hr>

or as:

 <hr/>

A second significant difference from HTML is that XML documents can designate full-blown subsidiary units of text or binary data as entities for in-line inclusion during processing. In HTML, entities exist only as an alternative encoding of specific characters (such as the ampersand or less-than character). But in XML, entities can be physically distinct text or binary objects. Binary entities are syntactically restricted to being attribute values for tags, and they are required to bear a user-defined typing (called a notation) so a program can invoke an appropriate subsidiary processor for them. On the other hand, text entities can include markup (in a different encoding if desired), and to keep things crystal clear for parsers, the <, > and & characters are allowed in document content only as XML entities, e.g. as &amp;.
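The behaviour of text entities can be tried out with a few lines of Python. This is a sketch under stated assumptions: it uses the standard xml.etree.ElementTree parser and xml.sax.saxutils.escape, both of which follow XML 1.0 as later finalized, not the 1996 draft, and the sample sentence is invented for the purpose.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# An internal text entity whose replacement text itself contains markup.
doc = """<!DOCTYPE para [
  <!ENTITY trade "<sup>TM</sup>">
]>
<para>Armaggedon&trade;. Fish &amp; chips cost &lt; 5.</para>"""

para = ET.fromstring(doc)
children = [child.tag for child in para]   # elements introduced by the entity
flat_text = "".join(para.itertext())       # character data with all entities expanded

# Going the other way: protect the special characters before writing content out.
escaped = escape("a < b & c > d")

print(children, flat_text, escaped)
```

The parser expands &trade; into a real sup element, whereas &amp; and &lt; come back as ordinary characters.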

However, the most fascinating aspect of having to deal with tags that you've never seen before is the prospect of comprehensively structured documents which don't allude to any prior formal specification of their structure. Like SGML, XML has document type definitions for pre-declaring the allowed arrangement and combination of tags in the logical structure. But whereas an SGML parser uses DTDs to infer tags where none exist in documents (it expedites entry of markup by hand), XML stands this idea on its head. If you don't need to be inferring tags from the context (because the tagging is de facto explicit and comprehensive), then you don't necessarily need a DTD around. Ergo your Web browser or full text indexing engine does not need to be built to understand DTDs. XML calls these documents 'well-formed' (subject to certain fairly easy to satisfy constraints).
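A well-formedness check needs no DTD at all, which a short Python sketch can show. The is_well_formed helper and the sample strings are hypothetical, and the parser is the standard xml.etree.ElementTree, which implements XML 1.0 as later finalized and may differ in detail from the draft's rules.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if a parser accepts the text without consulting any DTD."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

html_style = "<letter><p>Dear Peter,<p>All the best.</letter>"        # end-tags elided
xml_style = "<letter><p>Dear Peter,</p><p>All the best.</p></letter>"
interleaved = "<p><a><font>caption</a></font></p>"                    # tags overlap
empty_both_ways = "<page><hr></hr><hr/></page>"                       # both empty-element forms

results = [
    is_well_formed(text)
    for text in (html_style, xml_style, interleaved, empty_both_ways)
]
print(results)
```

The HTML habits of dropping end-tags and interleaving tags are both rejected, while either spelling of an empty element is accepted.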

So what's to stop you ignoring DTDs altogether and pushing about XML documents without them? For one thing, the document must announce whether it requires a processor to be cognizant of an accompanying DTD or not, and it is a reportable error if the processor discovers some monkey business going on. In other words, the processing results could well be (in the immortal epithet) undefined. For some applications, there will be XML DTDs, and some processors will have to parse them in detail. But the outlook is not as bleak as it might appear; as in the SGML sphere, it's likely that XML DTDs will emerge that will gather widespread support. And the declaration of subelement content in XML DTDs has been simplified compared to SGML--again, with the application developer in mind.

You might want to judge the situation for yourself. Fig 1 shows a trivial (and probably unrepresentative) HTML document, and Fig 2 my attempt at an XML version of this document. Picking off the tags visually should make it clear it is a different kettle of markup. For comparison, regular readers might also want to look up my February EXE article on SGML, where the original SGML version of the document is shown making use of SGML markup minimization. (The SGML DTD shown there will not unfortunately pass muster as an XML DTD for Fig 2 as it stands).

Fig. 1

 <title>Demo-ltr as HTML document </title>
 <body text="#000000" bgcolor="#c0c0c0">

 <p>7 February 1996

 <p>Dear Peter,

 <p>Thank you for your kind reply to my query about
Armaggedon <sup>TM </sup>. Here is my suggestion
for an enhancement you might consider for a future drop:

 <p> <img src="suggest.jpg" width="362" align="bottom" >
 <p> <a name="Flow1"> <font size="-2">
Suggested enhancement to Armaggedon </a> </font>
 <!-- Notice the interleaved tags just here -->

 <p>All the best for the forthcoming full product release.

 <p align="center">Yours sincerely,

 <p align="center"> <strong>Adrian </strong>

 <dt>Encl: </dt> <dd> <a href="#Flow1">Suggested enhancement to Armaggedon
</a> </dd>
 </body> </html>

Fig. 2

 <?XML VERSION="1.0" RMD="INTERNAL" encoding="UTF-8" ?>
 <!DOCTYPE fake SYSTEM "c:\nul\nul.dtd" [
 <!NOTATION jpg "JPG" "JPGsInAFlash.exe" >
 <!ENTITY trade   " <sup>TM </sup>" >
 <!ENTITY suggest SYSTEM "c:\letters\suggest.jpg" NDATA jpg >
 <!--* the DTD alluded to here does not exist and is not required to,
because of the RMD attribute value in the preceding processing
instruction. BTW, this is an XML comment. Notice its delimiters *-->
 ]>

 <demo-ltr from="Adrian">
 <date>7 February 1996 </date>

 <dear>Peter </dear>
 <para>Thank you for your kind reply to my query about
Armaggedon&trade;. Here is my suggestion for an
enhancement you might consider for a future drop: </para>

 <picture id='Flow1'> <graphic name="suggest"/>
 <title>Suggested enhancement to Armageddon. </title> </picture>

 <para>All the best for the forthcoming full product release. </para>
 <Yours>sincerely </Yours>

 <Encs> <item> <crossref refs="Flow1"/> </item> </Encs>
 </demo-ltr>

The first line in Fig 2 is crucial in announcing the nature of the beast which follows. It is an (XML) processing instruction which signals an XML document and gives version and encoding details. The RMD attribute stands for the Required Markup Declaration, and indicates whether or not DTD processing is necessary. The possible alternatives of None and All are self-explanatory; Fig 2, though, opts for Internal. What this means is that the XML processor can ignore the external subset of the DTD called fake, alleged to be in c:\nul\nul.dtd, and is required to process only the declarations between the [ and ] brackets which follow the DOCTYPE keyword. The three declarations in the internal DTD subset in Fig 2 define meanings for the &trade; entity in the first para element and for the JPG image referred to by the graphic element, and give the JPG entity a notation processor hint. The markup in the document is straightforward to parse by eye, but as an exercise you might see if you can spot the empty elements from their start-tags. (There are two of these.)
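Readers who would rather let a program do the spotting can try the following Python sketch. It is an adaptation, not Fig 2 itself: the draft-specific processing instruction, RMD attribute, comment delimiters and external DTD reference are dropped, the NOTATION declaration is rewritten in the SYSTEM form that XML 1.0 as later finalized requires, and the letter is shortened. The parser is the standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Fig 2 adapted (as an assumption) to the syntax of the later XML 1.0 recommendation.
letter = """<!DOCTYPE demo-ltr [
  <!NOTATION jpg SYSTEM "JPGsInAFlash.exe">
  <!ENTITY trade "<sup>TM</sup>">
  <!ENTITY suggest SYSTEM "suggest.jpg" NDATA jpg>
]>
<demo-ltr from="Adrian">
 <date>7 February 1996</date>
 <dear>Peter</dear>
 <para>Thank you for your kind reply to my query about Armaggedon&trade;.</para>
 <picture id="Flow1"><graphic name="suggest"/>
 <title>Suggested enhancement to Armaggedon.</title></picture>
 <Yours>sincerely</Yours>
 <Encs><item><crossref refs="Flow1"/></item></Encs>
</demo-ltr>"""

root = ET.fromstring(letter)

# An element counts as empty here if it has no child elements and no character data.
empty = [
    element.tag
    for element in root.iter()
    if len(element) == 0 and not (element.text or "").strip()
]
print(empty)
```

Note that the parser treats the graphic element's name attribute as an ordinary string. Tying it to the unparsed suggest entity is left to the application.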

This example may need correction when the final draft of XML is released. However it seems clear that both HTML and SGML documents will need amending to conform with XML (even its more liberal notion of being 'well-formed' rather than its stronger concept of validity in accordance with an XML DTD). HTML documents will need at the least end-tags adding. SGML DTDs may need more or less re-writing, since XML outlaws some commonly used syntax of SGML (the & connector in content models; element inclusions and exclusions; and parameter entities). I have a sneaking suspicion that many people will be looking long and hard at the compliance criteria for well-formed XML documents.

It's possible to look away from the small print of the XML proposal to the larger picture of real world documents and remain sceptical of the changes being asked for. Arguably, though, XML is the best attempt yet to move on from so-called plain text as the lowest common denominator for document interchange. To let loose with the millenniarist tendencies I mentioned earlier, I can imagine an XML document editor which lets you create your document's structure by connecting boxes in a design workspace, and then writes it out as an XML DTD along with your document and the rendering instructions as a CSS1 style sheet. And somewhere in my mind's dark recesses I recall that Microsoft Word is based on an implicit structured outline model of documents; what price Word 9 or 10 coming XML-enabled with a DTD to cover all documents ever produced by versions 1 through 8?


The XML situation is developing rapidly, particularly in the area of tools. The following URLs will help locate up-to-date materials:

http://www.ucc.ie/xml/#FAQ-BROWSER Commonly Asked Questions about the Extensible Markup Language

http://www.w3.org/pub/WWW/TR/WD-xml-961114.html Extensible Markup Language (XML) [W3C Working Draft 14-Nov-96]

http://www.microsoft.com/standards/cdf.htm Channel Definition Format (CDF)

http://www.sil.org/sgml/related.html SGML: Related Standards [many links to material on XML and DSSSL]

(C) 1997 Adrian Orlowski. All rights reserved.