November 11, 1997
XML: A <PRICE> for that <PRODUCT>,
an <ACCESSORY> for that <OUTFIT>
by Simon St. Laurent
XML promises to redefine the basic structure of the Web. HTML has been too limited to tackle large-scale document management and data interchange projects, while SGML has presented a steep learning curve, potentially endless complexity and a bureaucratic mindset. In bringing content-based structures to the Web, XML promises to create more flexible documents and a far more automated, more capable Web.
At the core of this potential is XML's dramatic expansion of the number of available tags. Elements are no longer limited to the HTML set, inviting developers to create their own elements, attributes and entities for use in their documents. Industries that want a common set of tags can use the XML syntax to create standards for document interchange. Search engines and document management systems can finally interpret the key parts of a document instead of shuffling through formatting codes and META tags to try to figure it out.
The return to content-based structures frees corporate data developers from the shackles imposed by the needs of designers, while still addressing designers' needs through cascading style sheets and other powerful emerging style standards, such as XSL. Returning markup language to its original focus on structure rather than formatting means that developers can create tags that convey meaning about the document instead of the appearance of a few sentences, characters, or paragraphs. XML also makes it much easier to apply boilerplate content to documents and, through the new XML-LINK standards, to reuse content from other documents once tags have neatly compartmentalized it.
The XML standard is still a work in progress, and probably will be until the end of this year. Although the standard is written in formal notation that is a challenge to decipher, the meanings behind the notation promise a new set of rules for creating and managing documents.
XML comes in two flavors: well-formed and valid. Well-formed is the easier standard to meet. It just requires that a document have an XML prolog, that all elements be nested cleanly, and that all start tags have matching end tags. "Empty" tags like IMG, which don't normally have closing tags, may end with a "/>" instead of receiving a full end tag. For instance, an HTML IMG tag that ends with ">" would end with "/>" in XML.
The XML prolog is the most obvious change from either SGML or HTML:
<?XML VERSION="1.0" RMD="NONE" ENCODING="UTF-8"?>
The VERSION attribute should always be included, to protect documents against changes in the standard. RMD is short for Required Markup Declaration and announces which, if any, document type definitions (DTDs) should be applied to the document. For well-formed documents this will be "NONE". Valid documents may use "INTERNAL" or "ALL". ENCODING tells the parser which character encoding the document uses. UTF-8, an 8-bit encoding of Unicode, is the default. (XML parsers must support the full 16-bit Unicode standard for international character encodings, however.)
These minimal changes to the world of markup make life much easier for parser developers, who no longer have to support poorly coded HTML missing half its end tags. A document has to meet all of the requirements above before it can call itself well-formed XML. This requires some extra effort from those creating documents, but makes it possible for programmers to build much more reliable systems with much less effort.
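To illustrate why strict well-formedness makes a programmer's life easier, here is a minimal Python sketch built on the standard library's xml.etree.ElementTree parser. It rests on some assumptions: that parser follows the final XML 1.0 syntax rather than the draft prolog shown above, so the sample strings leave the prolog out, and the helper name is_well_formed and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Any nesting or end-tag problem is reported as a ParseError.
        return False


# Cleanly nested, every start tag closed, empty element ends with "/>".
good = "<P>Some <B>bold</B> text and a picture <IMG/></P>"

# Typical sloppy HTML habits that XML rejects:
missing_end_tag = "<P>A picture <IMG></P>"      # IMG is never closed
bad_nesting = "<P><B><I>overlap</B></I></P>"    # B and I overlap

for sample in (good, missing_end_tag, bad_nesting):
    print(is_well_formed(sample), sample)
```

The parser either accepts the whole document or rejects it, so the calling program never has to guess where a missing end tag was supposed to go.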
A document has to pass a much more stringent set of tests to move from well-formed to valid. Valid documents must be accompanied by a document type definition (DTD) that defines their structure. The DTD may be included as part of the document itself, or it may be stored in a separate document. Most complex DTDs will probably be stored as separate documents. A DTD is basically a list of element, entity and attribute declarations in a simplified SGML declaration style. For instance:
<!ENTITY typo "teh">
<!ELEMENT SAMPLE (INFORMATION+, ADVERTISING*)>
<!ELEMENT INFORMATION (#PCDATA)>
<!ELEMENT ADVERTISING ((#PCDATA | MERCHANT)*)>
<!ELEMENT MERCHANT (#PCDATA)>
<!ATTLIST MERCHANT
PAYLINK CDATA #IMPLIED>
It's definitely not HTML anymore, but it's still not as complex as SGML. The first line defines the "typo" entity. Any time the parser encounters the sequence "&typo;" in the document, it will replace it with "teh". The definition for the SAMPLE element follows; SAMPLE elements must contain at least one INFORMATION element and as many ADVERTISING elements as the document author desires, including zero. Because the two element names are separated with a comma, they must appear in sequence: INFORMATION elements must appear before ADVERTISING elements. The contents of INFORMATION elements are defined as #PCDATA, or parsed character data. Elements defined as #PCDATA can contain any characters except for &, < and >, all of which are reserved for markup. The ADVERTISING element uses a mixed declaration: it can contain #PCDATA and MERCHANT elements in any combination. The MERCHANT element is just #PCDATA, but it has an optional attribute, PAYLINK. While the contents of the MERCHANT element describe the business to the reader, PAYLINK can lurk invisibly and carry key information for electronic transactions with this merchant.
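The two rules that do the most work here, the sequence (INFORMATION+, ADVERTISING*) and the entity replacement, can be sketched in Python. This is a hand-rolled check under stated assumptions, not a validating parser: the standard library's xml.etree.ElementTree does not validate against a DTD, so the content model is translated by hand into a regular expression. The names SAMPLE_MODEL, matches_sample_model, ENTITY_DOC and the test strings are hypothetical, and the entity is declared in an internal DOCTYPE subset using final XML 1.0 syntax.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written translation of the content model (INFORMATION+, ADVERTISING*).
# Each child name is followed by a comma so that names cannot run together.
SAMPLE_MODEL = re.compile(r"(INFORMATION,)+(ADVERTISING,)*")


def matches_sample_model(text):
    """Check the children of a SAMPLE root element against the model.

    Only the order and count of child elements are checked; stray text
    directly inside SAMPLE is not detected by this sketch.
    """
    root = ET.fromstring(text)
    if root.tag != "SAMPLE":
        return False
    child_names = "".join(child.tag + "," for child in root)
    return SAMPLE_MODEL.fullmatch(child_names) is not None


ok = ("<SAMPLE><INFORMATION>a</INFORMATION>"
      "<ADVERTISING>b</ADVERTISING></SAMPLE>")
no_information = "<SAMPLE><ADVERTISING>b</ADVERTISING></SAMPLE>"
wrong_order = ("<SAMPLE><ADVERTISING>b</ADVERTISING>"
               "<INFORMATION>a</INFORMATION></SAMPLE>")

# Entity replacement: the parser substitutes "teh" for "&typo;".
ENTITY_DOC = ('<!DOCTYPE INFORMATION [<!ENTITY typo "teh">]>'
              "<INFORMATION>Fix &typo; typo</INFORMATION>")

print(matches_sample_model(ok))
print(matches_sample_model(no_information))
print(matches_sample_model(wrong_order))
print(ET.fromstring(ENTITY_DOC).text)
```

A real validating parser would read the rules from the DTD itself rather than from a hand-written pattern, which is what makes a shared DTD useful.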
A valid document written to this (admittedly absurd) DTD might look like:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!DOCTYPE SAMPLE SYSTEM "sample.dtd">
<SAMPLE>
<INFORMATION>This is just information. It's full of typos, like &typo;.</INFORMATION>
<INFORMATION>This is more information. Doesn't it
make you gleeful?</INFORMATION>
<ADVERTISING>Advertising makes the world go round.
At least that's what
<MERCHANT PAYLINK="A8FD04">World-Turner Globe
and Advertising</MERCHANT> believes.</ADVERTISING>
</SAMPLE>
Even this extremely simple document has some significant seeds in it. The MERCHANT tag gives processing tools something to search for and use in later processing. INFORMATION and ADVERTISING are separated, placed in their own spaces. More sophisticated DTDs can create tags that break information into finer pieces, with more meaningful layers of elements. Making these DTDs work effectively requires standardization. A DTD can provide standardization within a document, but companies and industries will have to agree to use the same DTDs before this standardization can make it possible to share and search information easily.
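As a rough sketch of what "something to search for" can mean in practice, the following Python fragment pulls every MERCHANT element and its PAYLINK attribute out of a document. It assumes a modern XML 1.0 parser (the standard library's xml.etree.ElementTree) and a trimmed copy of the sample document without its prolog; the function name merchants and the variable DOCUMENT are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, without the draft-syntax prolog.
DOCUMENT = """<SAMPLE>
<INFORMATION>This is more information.</INFORMATION>
<ADVERTISING>Advertising makes the world go round.
At least that's what
<MERCHANT PAYLINK="A8FD04">World-Turner Globe
and Advertising</MERCHANT> believes.</ADVERTISING>
</SAMPLE>"""


def merchants(text):
    """Return (name, paylink) pairs for every MERCHANT element."""
    root = ET.fromstring(text)
    found = []
    for merchant in root.iter("MERCHANT"):
        # Collapse the line break inside the element's text.
        name = " ".join((merchant.text or "").split())
        # PAYLINK is declared #IMPLIED (optional), so it may be None.
        found.append((name, merchant.get("PAYLINK")))
    return found


print(merchants(DOCUMENT))
```

The same loop would work on any document written to the same DTD, which is the practical payoff of agreeing on tag names in advance.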
Beyond document management: XML on the desktop?
XML development so far has sought to continue the tasks addressed by SGML in a simpler, cleaner fashion. The W3C considers XML part of its Architecture domain, while HTML is maintained in the User Interface domain. Despite this separation, XML may well be a successor to HTML for many applications, potentially freeing Web developers from the capricious decisions of the browser developers. XML parsers at present are much simpler than browsers, but provide a foundation on which many kinds of data processing, including browsing, can take place.
Despite its lack of formatting information, XML could fit well in a browser environment. XML complements cascading style sheets and dynamic HTML very smoothly. Cascading style sheets (and succeeding style technologies) free tags from having to carry format information; styles do a much more effective and precise job of handling formatting than the tags ever did, and do it in a more structured way. Dynamic HTML can become dynamic XML without much significant change. Dynamic HTML rationalizes the way scripts address elements; XML rationalizes the way we create new elements. The W3C's Document Object Model working group is attempting to build a unified system that would allow developers to script both HTML and XML.
Netscape and Microsoft, the twin pillars of browser development, are both implementing XML; their latest proposals to the W3C no longer refer to HTML but to XML instead. Internet Explorer includes XML parsers written in both Java and C++, and Netscape has promised XML support in its next release. So far, both companies' announced support for XML is primarily aimed at metadata applications rather than document presentation, but hopefully their Web browsers will soon take advantage of XML's considerable power.
Simon St. Laurent is an experienced Web developer who has been fiddling with
hypertext since 1989 and working with the WWW since 1994. He has worked for a
number of multimedia and Web design firms on projects for clients from
small start-up businesses to Fortune 500 companies. He currently lives in
Greensboro, NC. His first book, Dynamic HTML: A Primer, was published by
MIS: Press in August, and he is currently hard at work on his next two books:
XML: A Primer and Cookies.
EarthWeb Inc. All rights reserved.