THE EXTENSIBLE MARKUP LANGUAGE (XML)

[This local archive copy mirrored from the canonical site: http://www.personal.u-net.com/~sgml/xml.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

ETHOS Technology Briefings Series 1: Developments Shaping Internets and Intranets

EXECUTIVE SUMMARY

The Extensible Markup Language (XML) is a W3C Recommendation for marking up data that cannot be marked up using the HyperText Markup Language (HTML). XML is an extremely simple dialect of the Standard Generalized Markup Language (SGML) defined in ISO Standard 8879. The goal of XML is to enable SGML-coded data to be served, received, and processed on the Web as easily as is currently possible with the fixed SGML tag set provided by HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. It is based on the ISO 10646 Universal Character Set (UCS, equivalent to Unicode), so it can be used in all major trading nations.

The existing XML standard will be complemented by two extensions. Part 2 of the standard will permit XML documents to be linked together. Part 3 will provide a method for specifying the presentation style of XML elements and for controlling their behaviour.

As well as providing simple, in-line, links of the type found in HTML documents, the XML Linking Language (XLL) will allow a single 'hot-spot' in the text to identify more than one location by definition of an extended link. Extended links can be stored outside the documents they link. Contents identified by XLL links can be embedded in referencing documents to create compound documents. XLL links can identify objects from their markup, position and/or contents, and do not need a previously named link point.

The proposed XML Style Language (XSL) will allow the ECMAScript derivative of JavaScript to be used to evaluate and process the contents of XML elements prior to displaying them either as HTML elements or using the display object types defined in ISO's Document Style Semantics and Specification Language (DSSSL).

1. TECHNICAL DESCRIPTION

The Extensible Markup Language (XML) is a W3C Recommendation for marking up screen presentable data that cannot be marked up using the simple set of document presentation elements defined in the HyperText Markup Language (HTML). Whilst designed initially for the display of documentation distributed via the World Wide Web (WWW), XML has been widely adopted as a means of interchanging information between computer programs. In particular it is widely seen as the best solution for the interchange of metadata about stored objects and programs (e.g. the Open Software Description) and for the interchange of commercial information (e.g. Open Financial Exchange).

XML is an extremely simple dialect of the Standard Generalized Markup Language (SGML) defined in ISO Standard 8879. SGML was designed in the 1980s as a tool to enable technical documentation and other forms of publishable data to be interchanged between authors, publishers and those responsible for the production of printed copies of data sets. By providing a formal definition of the component parts of a publishable information set, SGML made it possible to verify the correct transmission and receipt of interchanged data sets. It was soon found that these techniques are applicable in areas other than those directly related to publications. For example, SGML is often used as a neutral data format when moving data between databases as part of multinational projects.

The goal of XML is to enable SGML-coded data to be served, received, and processed on the Web as easily as is currently possible with the fixed SGML tag set provided by HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. Unlike early versions of SGML and HTML, XML has been based from the very start on the ISO 10646 Universal Character Set (UCS, equivalent to Unicode) so that it can be used in all major trading nations.

An XML document instance must be created and stored as a set of properly nested data storage entities, each of which is made up of a number of logical elements which contain data or define processes to be performed. The outermost storage entity is referred to as the document entity: it contains both the start and the end of the root or document element of the document instance. Elements can be nested to create hierarchies (information trees), and may contain references to embedded entities. Elements can be assigned attributes (properties) which indicate how the contents of the element should be interpreted.

Each XML element starts with a named start-tag and ends with an end-tag with a matching name. Outward pointing angle brackets are used to delimit these markup tags (e.g. <title>). An end-tag is distinguished from a start-tag by having a slash immediately preceding the name (e.g. </title>). Elements that have no contents are distinguished by having a slash immediately after the name in the start-tag to indicate that the end-tag has been omitted (e.g. <image/>). Because each element of an XML document has clearly marked limits it is easy to determine when its contents have been received over a network.
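
The tagging rules described above can be illustrated with a short sketch in Python, using the xml.etree.ElementTree module from the standard library. The sample document and its element names (report, title, image) are invented for illustration, and the helper is_well_formed is a hypothetical name rather than part of any XML tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a root element, a nested element and an empty element.
SAMPLE = "<report><title>Annual results</title><image/></report>"

root = ET.fromstring(SAMPLE)

# Each child is delimited by a start-tag and a matching end-tag,
# or is written as a single empty-element tag.
child_names = [child.tag for child in root]

# An empty element such as <image/> has neither text nor child elements.
image = root.find("image")
image_is_empty = image.text is None and len(image) == 0


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A start-tag whose end-tag is missing or mismatched is rejected.
mismatched_ok = is_well_formed("<report><title>Annual results</report>")
```

Because the parser can tell exactly where each element ends, it can reject the second sample as soon as the mismatched end-tag arrives.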

Attributes of an XML element are defined as part of its start-tag (e.g. <image title="Front view" source="entity21"/>). Each XML attribute must be fully defined, with the attribute name followed by a value indicator (=) and a quote delimited string containing the attribute value. Attributes can be assigned a default value if an attribute list declaration is associated with the formal declaration for the element in the document type declaration (see below).
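
As a minimal sketch of these attribute rules, the following Python fragment reads the attributes of the image element shown above and then shows a default value being supplied from an attribute list declaration. The align attribute and its default value are assumptions made for the example; the behaviour shown is that of the Expat-based parser behind Python's xml.etree.ElementTree, which applies defaults declared in the internal subset.

```python
import xml.etree.ElementTree as ET

# Attributes are read from the start-tag as name/value pairs.
element = ET.fromstring('<image title="Front view" source="entity21"/>')
attributes = dict(element.attrib)

# An attribute value that is not quote delimited is not well-formed.
try:
    ET.fromstring("<image title=Front/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

# Hypothetical attribute list declaration giving 'align' a default value.
WITH_DEFAULT = """<!DOCTYPE doc [
<!ATTLIST image align CDATA "left">
]>
<doc><image title="Front view"/></doc>"""

defaulted = ET.fromstring(WITH_DEFAULT).find("image")
align_value = defaulted.get("align")
```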

XML requires that data that is not coded in XML characters be stored in a named binary entity, which may have associated with it the name of the notation in which its contents have been encoded. The location and notation of each uniquely named binary entity must be declared in an entity declaration (e.g. <!ENTITY entity21 SYSTEM "http://www.myco.com/figs/figure1.gif" NDATA GIF>). The location of a processor for a notation can (optionally) be identified using a notation declaration (e.g. <!NOTATION GIF SYSTEM "show-gif.dll">). XML uses Internet Uniform Resource Locators (URLs) to identify locations of external entities and other types of files. Relative URLs can be used to identify locally stored information.
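
The two declarations quoted above can be inspected with Python's standard xml.parsers.expat module, which reports entity and notation declarations through callback handlers. This is only a sketch: the enclosing doc element is invented, and the parser records the declared locations without fetching the binary data.

```python
from xml.parsers import expat

# The declarations are those quoted in the text; the <doc> wrapper is hypothetical.
SAMPLE = b"""<!DOCTYPE doc [
<!NOTATION GIF SYSTEM "show-gif.dll">
<!ENTITY entity21 SYSTEM "http://www.myco.com/figs/figure1.gif" NDATA GIF>
]>
<doc><image title="Front view" source="entity21"/></doc>"""

notations = {}        # notation name -> location of its processor
binary_entities = {}  # entity name -> (location, notation name)


def on_notation(name, base, system_id, public_id):
    notations[name] = system_id


def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    # Only unparsed (binary) entities carry a notation name.
    if notation is not None:
        binary_entities[name] = (system_id, notation)


parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.EntityDeclHandler = on_entity
parser.Parse(SAMPLE, True)
```

An application could then look up the source attribute of the image element in binary_entities to find both the file to fetch and the notation needed to interpret it.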

Parts of an XML document instance can be stored in separate files that will be referenced as external text entities. Such entities are declared in the same way as binary entities, except that there is no associated notation name. Alternatively internal text entities can be used to define the replacement text for an entity reference. For example, addition of an entity declaration of the form <!ENTITY company "The SGML Centre"> to a document type declaration will allow an entity reference of the form &company; to be entered in the associated document instance. This reference will be replaced by the quoted replacement text defined in the entity declaration when the file is processed (parsed).
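
The replacement of an internal text entity can be seen in a short Python sketch. The entity declaration is the one quoted above; the memo element and its wording are invented for the example.

```python
import xml.etree.ElementTree as ET

# The entity declaration sits in the internal subset of the
# document type declaration; the <memo> element is hypothetical.
SAMPLE = """<!DOCTYPE memo [
<!ENTITY company "The SGML Centre">
]>
<memo>Prepared by &company; staff.</memo>"""

memo = ET.fromstring(SAMPLE)

# By the time the application sees the text, the reference
# has been replaced by the declared replacement text.
expanded_text = memo.text
```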

The set of elements, attributes, entities and notations that can be used within an XML document instance can (optionally) be formally defined in a document type definition (DTD) that is associated with the document instance through the addition of a document type declaration that forms part of the prolog of a document instance. The declarations that make up the document type definition can form part of a file referenced as the external subset of the document type declaration, or can be embedded, or referenced, within the internal subset of the declaration. Comment declarations must be used for any explanations required as part of the document type definition.
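
A document type definition embedded in the internal subset might look like the sample in the following Python sketch, which uses the standard xml.parsers.expat module to list what has been declared. The catalogue and item element types and the currency attribute are assumptions made for the example. Expat does not validate the instance against these declarations; it only reports them.

```python
from xml.parsers import expat

# Hypothetical DTD held in the internal subset, with a comment declaration.
SAMPLE = b"""<!DOCTYPE catalogue [
<!-- Each catalogue holds one or more priced items. -->
<!ELEMENT catalogue (item+)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item currency CDATA "GBP">
]>
<catalogue><item>Widget</item></catalogue>"""

doctype = []              # (document type name, has internal subset)
declared_elements = []    # element type names, in declaration order
declared_attributes = []  # (element, attribute, type, default value)
comments = []             # text of comment declarations


def on_doctype(name, system_id, public_id, has_internal_subset):
    doctype.append((name, bool(has_internal_subset)))


def on_element_decl(name, model):
    declared_elements.append(name)


def on_attlist_decl(element, attribute, attr_type, default, required):
    declared_attributes.append((element, attribute, attr_type, default))


parser = expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.ElementDeclHandler = on_element_decl
parser.AttlistDeclHandler = on_attlist_decl
parser.CommentHandler = comments.append
parser.Parse(SAMPLE, True)
```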

XML document type definitions inherit a set of predefined entity definitions that provide names for the characters used to delimit XML markup. Each of these names maps to a numeric character reference that defines the decimal value of the ISO 10646 code point for the associated delimiter character. For example, the predefined entity whose reference has the form &lt; maps to decimal character reference &#60;, which identifies the < symbol in the ISO 10646 code set.
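
The equivalence between the predefined entity names and their decimal character references can be checked with a small Python sketch. The table lists the five names predefined by XML 1.0 with their ISO 10646 code points; the wrapping p element and the helper parse_text are invented for the example.

```python
import xml.etree.ElementTree as ET

# The five predefined entities and the decimal code points they stand for.
PREDEFINED = {"lt": 60, "gt": 62, "amp": 38, "apos": 39, "quot": 34}


def parse_text(reference):
    """Parse a one-element document whose content is a single reference."""
    return ET.fromstring("<p>" + reference + "</p>").text


# For each name, compare the named reference with the numeric reference.
results = {
    name: (parse_text("&%s;" % name), parse_text("&#%d;" % code))
    for name, code in PREDEFINED.items()
}
```

In each pair the named form and the numeric form yield the same single character, so either can be used to enter a delimiter character as data.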

Where there is no formal declaration for an element, attribute or notation in the document type definition, an XML document processor (XML parser) will apply a set of default rules for determining how to process the markup. To indicate that documents are to be processed according to these default rules, using predefined entity sets, an XML document instance, and any associated document type declaration, must be preceded by a processing instruction indicating which version of XML the document conforms to, and what character encoding was used to create the storage entity. By default this XML declaration will take the form <?xml version="1.0" encoding="UTF-8"?>. This information can be supplemented by a statement indicating whether or not the document is self-contained (contains its own document type definition) by the addition of standalone="yes".
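
The three items of information carried by the XML declaration can be read back with Python's standard xml.parsers.expat module, as in the following sketch. The empty doc element is invented for the example. Expat reports the standalone statement as 1 for "yes", 0 for "no" and -1 when it is omitted.

```python
from xml.parsers import expat

SAMPLE = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'

declaration = {}


def on_xml_decl(version, encoding, standalone):
    # Record the version, the character encoding and the standalone flag.
    declaration.update(version=version, encoding=encoding, standalone=standalone)


parser = expat.ParserCreate()
parser.XmlDeclHandler = on_xml_decl
parser.Parse(SAMPLE, True)
```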

Because XML is a subset of SGML, parsed document instances can be referenced using groves of the form defined in the SGML Extended Facilities annex of ISO/IEC 10744. Alternatively the Document Object Model (DOM) defined by W3C for use by both HTML and XML can be used to identify the structure of an XML element or entity tree. Applications requiring a simpler application programming interface (API) can use the event-based Simple API for XML (SAX) developed by the XML Developers Group. Both DOM and SAX have IDL definitions that allow XML elements to be stored in CORBA compliant databases.
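
The difference between the tree-based DOM and the event-based SAX can be sketched in Python, whose standard library provides implementations of both (xml.dom.minidom and xml.sax). The sample document and the TitleCollector class are invented for the example. Both approaches extract the same chapter titles: the DOM version builds the whole tree before querying it, whilst the SAX version reacts to each tag as it is read.

```python
import xml.dom.minidom
import xml.sax

SAMPLE = (b"<book><chapter><title>Links</title></chapter>"
          b"<chapter><title>Styles</title></chapter></book>")

# DOM: build the complete element tree, then query it.
document = xml.dom.minidom.parseString(SAMPLE)
dom_titles = [node.firstChild.data
              for node in document.getElementsByTagName("title")]


# SAX: receive start-tag, character data and end-tag events in document order.
class TitleCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Character data may arrive in several pieces, so it is buffered.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


collector = TitleCollector()
xml.sax.parseString(SAMPLE, collector)
sax_titles = collector.titles
```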

The XML Linking Language (XLL)

The July 97 draft specification for the XML Linking Language defines two types of links, simple links and extended links. XML links differ fundamentally from HTML links in that they allow spans of elements to be identified as the referenced data using a concept known as extended pointers (Xpointers). For example, extended pointers can identify all the contents of a contiguous section of a document, not just its title or first element. In addition, XML allows the position of the referenced data to be identified by selecting specific elements in the XML element tree (e.g. the third <section> element within the fourth <chapter> element) or by selecting elements with a specific content (e.g. the occurrence of the word XML within the contents of a <title> element).
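
The two kinds of selection mentioned above (by position in the element tree and by content) can be imitated in Python using the path expressions supported by xml.etree.ElementTree. This is an analogy only: the path syntax shown is ElementTree's own and is not the extended pointer syntax of the XLL draft, and the sample book is invented.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical book with four chapters of three sections each.
book = ET.Element("book")
for c in range(1, 5):
    chapter = ET.SubElement(book, "chapter")
    title = ET.SubElement(chapter, "title")
    title.text = "Using XML" if c == 2 else "Chapter %d" % c
    for s in range(1, 4):
        section = ET.SubElement(chapter, "section")
        section.text = "Section %d.%d" % (c, s)

# Selection by position: the third <section> within the fourth <chapter>.
target = book.find("./chapter[4]/section[3]")

# Selection by content: <title> elements whose contents include the word XML.
xml_titles = [t.text for t in book.iter("title") if "XML" in (t.text or "")]
```

Neither selection relies on a named anchor having been placed in the target document in advance, which is the property the text attributes to XLL links.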

Simple XML links point to a single span, and are often defined in-line, with the contents of the linking element forming the "hot-spot" that users can click on to activate the link. An HTML anchor element (<A>) can be turned into an XML simple link by the simple procedure of assigning an additional attribute with a predefined default value to its attribute list declaration. Alternatively an anchor element can be extended by adding the following attribute definition: xml-link="simple".

Optional attributes associated with an XML simple link allow it to be assigned a displayable title, a named role for the element (which can affect the way in which the link element is displayed, perhaps as an icon in the margin), a content role (which can affect the way any content is displayed), a show attribute that indicates whether the linked data is to be shown in a new window, is to replace the contents of the current window, or is to be embedded as part of the current document (creating a compound document), and an actuate attribute to define whether the link is to be followed automatically or only when a user selects the link. A behaviour attribute allows special behaviour to be assigned to specific instances of a simple link.

Extended XML links can point to more than one span of data. They contain one or more locators that identify a possible link point. Each locator can have a title which can be used in place of any contents when a menu is displayed to ask users to select which of the permitted paths to choose from the linked document. Extended links can use the same optional attributes as simple links to control how links are to be displayed and activated.
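
The following Python sketch shows how an application might distinguish simple links from extended links and build a menu from the titles of an extended link's locators. The xml-link="simple" attribute is taken from the text above; the element names (a, seealso, loc), the href attribute, the values "extended" and "locator", and the show and actuate values are assumptions made for illustration and should be checked against the current XLL draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one in-line simple link and one extended link
# holding two locators, each with a title for use in a menu.
SAMPLE = """<doc>
  <a xml-link="simple" href="intro.xml" show="replace" actuate="user">Introduction</a>
  <seealso xml-link="extended">
    <loc xml-link="locator" href="spec.xml" title="The specification"/>
    <loc xml-link="locator" href="faq.xml" title="Frequently asked questions"/>
  </seealso>
</doc>"""

doc = ET.fromstring(SAMPLE)

# Simple links point to a single target.
simple_links = [(el.get("href"), el.get("show"), el.get("actuate"))
                for el in doc.iter() if el.get("xml-link") == "simple"]


def link_menu(extended):
    """Return (title, target) pairs for the locators of an extended link."""
    return [(loc.get("title"), loc.get("href"))
            for loc in extended if loc.get("xml-link") == "locator"]


# Extended links offer a choice of targets, presented here as menu entries.
menus = [link_menu(el) for el in doc.iter() if el.get("xml-link") == "extended"]
```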

XML extended link groups can be used to store a list of links in a document other than the ones that they reference. The extended link group identifies which documents a particular set of links applies to.

The XML Style Language (XSL)

A draft proposal for an XML Style Language known as XSL was co-submitted to the W3C by Microsoft, Inso Corporation (a supplier of HTML tools) and ArborText (a supplier of SGML tools) in August 1997. This submission was co-authored by members of these three companies, James Clark, and Henry Thompson of the Human Communication Research Centre of The University of Edinburgh. Freely downloadable software based on the submission has been made available by Microsoft and ArborText.

Whilst the draft specification is currently incomplete, it does exhibit features of the type that would be expected to occur in any eventual style language associated with XML. It is based on principles defined in the early days of XML, when a subset of ISO's Document Style Semantics and Specification Language (DSSSL) suitable for display of data on a computer screen was discussed and documented as the DSSSL Online (DSSSL-O) specification.

XSL provides for two forms of output flow objects. The first set is the set of displayable objects defined for HTML, which allows XML data to be mapped into HTML-aware browsers. The second set is based on the DSSSL-O specifications and allows XML data to be mapped to DSSSL-based text formatters, such as JADE. Both sets of flow objects are described using XML markup.

XSL defines a set of rules which define a set of actions that are to be associated with various patterns of target elements. The selection of target elements can be qualified in a number of ways. For example, XSL allows different rules to be applied to the same element type dependent on what its ancestors, siblings or contents are. In addition, processing rules can be specified for application when particular attribute values have been associated with an element, or when the element has specific contents. This means that specific rules can be applied to elements with unique identifiers or identified content types (classes).
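
The idea of qualifying rules by ancestor and attribute value can be sketched in Python as follows. This is not XSL syntax: the rule table, the action names and the scoring of "most specific rule wins" are all assumptions made to illustrate pattern-based selection, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

SAMPLE = ("<book><title>Guide</title><chapter><title>Start</title>"
          "<para class='warning'>Hot</para></chapter></book>")

# Hypothetical rules: (target element, required ancestor or None,
# required attribute values, name of the action to perform).
RULES = [
    ("title", None, {}, "heading"),
    ("title", "chapter", {}, "subheading"),
    ("para", None, {"class": "warning"}, "boxed-paragraph"),
]


def choose_action(element, ancestors):
    """Pick the matching rule with the most qualifiers, else a default."""
    best = None
    for target, ancestor, attributes, action in RULES:
        if element.tag != target:
            continue
        if ancestor is not None and ancestor not in ancestors:
            continue
        if any(element.get(k) != v for k, v in attributes.items()):
            continue
        score = (1 if ancestor is not None else 0) + len(attributes)
        if best is None or score > best[0]:
            best = (score, action)
    return best[1] if best else "default"


def apply_rules(element, ancestors=()):
    """Walk the tree in document order, pairing each element with an action."""
    results = [(element.tag, choose_action(element, ancestors))]
    for child in element:
        results.extend(apply_rules(child, ancestors + (element.tag,)))
    return results


actions = apply_rules(ET.fromstring(SAMPLE))
```

The same title element type receives a different action depending on whether it occurs inside a chapter, and elements with no matching rule fall back to a default action.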

XSL allows for the definition of sharable sets of style rules. A style rule applies a set of processing characteristics to a target element without creating a new flow object. Where the same style is to be applied to a number of elements a uniquely named style can be defined for future reference. This provides XSL with the facilities for creating cascading sets of style sheet specifications similar in effect to those defined in the more limited Cascading Style Sheet specification used to process HTML documents.

XSL provides facilities for the development of flow object macros that can be used to define the actions needed to process more than one type of target element. A default construction rule can be defined for those elements that are not specifically targeted. Where the contents of an element are to be displayed in more than one place (e.g. for titles that are to form part of a table of contents) separate modes can be used to distinguish which rule is to be applied in each context.

XSL style sheets can use the ECMAScript programming language to evaluate the contents of elements or attributes prior to or during the creation of flow objects. ECMAScript is a variant of JavaScript and JScript which has been formally defined by the European Computer Manufacturers Association. It allows tools containing a Java Virtual Machine to process data contained within an XML document. The language has been designed to support only a limited set of processing side-effects to ensure that evaluation cannot inhibit the progressive rendering of large documents.

ECMAScript can perform calculations based on measurements expressed in terms of centimetres, millimetres, inches, picas and/or points. A set of built-in functions based on the document processing functions in the DSSSL standard will be provided in addition to the default set of general purpose data processing functions defined by ECMA.

Part or all of an XSL style sheet can be imported from an external file. XSL style sheets can be associated with XML document instances by the addition of an XML style sheet processing instruction (e.g. <?xsl-stylesheet href="normal.xsl" type="text/xsl"?>) to the document's prolog.
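
A processor might locate the style sheet for a document along the lines of the following Python sketch, which uses the standard xml.dom.minidom module to find the processing instruction in the prolog. The processing instruction is the one quoted above; the doc element and the helper stylesheet_links are invented, and the regular expression assumes the double-quoted name="value" form used in the example.

```python
import re
import xml.dom.minidom

SAMPLE = ('<?xml version="1.0"?>'
          '<?xsl-stylesheet href="normal.xsl" type="text/xsl"?>'
          '<doc/>')

document = xml.dom.minidom.parseString(SAMPLE)


def stylesheet_links(document):
    """Return the name/value pairs of each style sheet processing instruction."""
    links = []
    for node in document.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xsl-stylesheet"):
            # The data of a processing instruction is plain text, so the
            # name="value" pairs have to be picked out by the application.
            links.append(dict(re.findall(r'([\w-]+)="([^"]*)"', node.data)))
    return links


links = stylesheet_links(document)
```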

An alternative to XSL is to use the Cascading Style Sheet (CSS) specification defined for the formatting of HTML documents. A new revision of this, CSS2, was proposed in November 1997, and is being considered as an interim formatting language by some XML tool developers.

2. TYPICAL APPLICATIONS

XML was originally developed to allow structured documents of the type typically encoded in SGML to be delivered over the Internet as an integrated part of the World Wide Web of documents. Typically these documents require the specification of element types over and above those permitted in HTML (e.g. specific elements for part numbers and other forms of article identification, prices and other forms of calculable measurements, and special classes of displayable text such as health warnings and controlled task lists). XML allows users to define their own sets of document elements and describe how each of these elements should be displayed on a screen in conformance with the supplier's house style.

Early in its development cycle XML was identified as a natural encoding format by those attempting to work out how to exchange metadata about stored objects. After a number of supplier-specific proposed solutions had been considered, one solution, the Resource Description Framework (RDF), became the clear favourite. RDF is used to provide an XML-coded definition of metadata languages which can be used to exchange sets of metadata.

XML will form the basis of more specialist data exchange languages, such as the Open Software Description (OSD) format proposed by Microsoft and Marimba. OSD will allow software companies to describe their software in a format that will allow it to be checked regularly, over the Internet, to ensure that the latest version is being used. When embedded within documents coded in the XML-based Channel Definition Format (CDF), OSD data can be used to provide information to channel-enabled desktops to allow automated "smart-pull" of software updates rather than waiting for user controlled "pull" or software vendor "push" of updates.

One area where XML is anticipated to be particularly important is electronic commerce. Traditional mechanisms for electronic data interchange (EDI) are based on the interchange of compactly coded messages between the computer systems of two or more businesses. Each message has to be decoded before its contents can be processed or presented to users. Web-based commerce has, by contrast, been based on the concept of completing an HTML form and then posting the results back to the initiating server for processing, without any details of the transaction being retained by the company completing the form. Neither of these approaches allows for a fully integrated approach, where exchanged data forms part of a business cycle that starts with product marketing and proceeds through contract negotiation, purchasing, delivery logistics management, payment, and tax and other administration data. It is anticipated that a combination of the behaviour control mechanisms of XLL links and the client-side evaluation functionality that can be provided by XSL may make it possible to develop message exchange systems where data can be captured on screen, processed locally as required, and transmitted to the relevant receiver(s) for further processing there, without having to reformat the data along the way. An early example of such a protocol, called the Internet Open Trading Protocol (OTP), was published in draft form in January 1998. This protocol is designed to handle messages between consumers and merchants trading over the Internet. It also allows for the use of third parties such as value acquirers, deliverers and customer care providers.

3. BARRIERS TO ADOPTION

Whilst XML has most of the commonly adopted features of SGML it lacks many of the optional features that can be used to make it more user-friendly. In particular, XML allows no use of the markup minimization features that allow an SGML document to omit those parts of the markup that can be implied from the document type definition. This means that XML messages are inherently longer than SGML ones. This factor is seen as particularly important to some communities, notably those involved in marking up complex data sets such as those required for the presentation of maths, chemical data and medical data.

The specification for the XML Linking Language is not expected to be finalized until April 1998. Doubts have been expressed about some aspects of the initial draft of the linking language specification. In particular, the problems associated with creating compound documents by embedding linked information need to be studied. Concern has also been expressed about what it means to automatically navigate links, especially when they are extended links to multiple locations.

Until an XML Style Language is finalized there will be no formally agreed mechanism for the presentation of XML elements on the screen. In addition, there are no fixed rules for converting XML documents into printed pages. In this respect XML is no different from HTML. Though a Cascading Style Sheet (CSS) specification has been available for HTML for some two years, very few systems supported it before the second half of 1997. By the time implementations started to appear a second version of the specification was deemed necessary to fill in some of the many gaps in the initial specification. Currently (early 1998) there is a debate about whether it is better to use CSS2 or XSL. There are many limitations in the current draft CSS2 (issued November 1997), especially in the handling of text whose writing direction is mixed. Another problem is that CSS2 lacks the ability to use ECMAScript, or any equivalent general-purpose scripting language, to control the presentation of data without embedding processing objects into the text.

One area where HTML and CSS2 currently have an advantage over XML and XSL is in the processing of forms. HTML has a well defined, if limited, set of form display elements. It also has a well defined mechanism, the Common Gateway Interface (CGI), for transmitting information entered by users to a well defined Internet address. Whilst XSL will allow the use of similar functionality by mapping XML elements to HTML presentation flow objects, there are currently no facilities for processing captured data at the client side as well as at the server side. Until XML users have access to facilities equivalent to HTML's onFocus and onBlur event controllers, it will not be possible to create XML-based electronic commerce systems.

The W3C Working Group on XML are well aware of these problems. The group is advised by an XML Special Interest Group made up of over 100 acknowledged experts in SGML, HTML and related subjects. Additional discussion groups have been developed to focus on the problems of specific user communities, such as those of XML Developers and users of XML for Electronic Business. These groups are encouraged to feed suggestions for improvements to the working group. One problem with this process, however, is that it is very difficult to please everyone.

One of the goals of the XML Working Group is to make it easier for new developers to create XML-based tools. An original goal was that an experienced computer graduate should be able to create an XML parser in a week or so. It turns out that it takes most computer graduates at least this long to work out how to receive and decode ISO 10646 strings when using an interactive language such as Java. In practice it tends to take around 4 weeks to code and debug XML tools. Many suggestions were made for the simplification of XML to make the life of developers easier, but in nearly every case these suggestions would either make the resulting tools harder to use for document creation or processing, or would make them incompatible with existing SGML tools, which was another goal of the working group.

4. EVOLUTION PATH/TIMEFRAME

The XML specification was accepted by W3C members as an approved recommendation in January 1998. An extension showing how to create namespaces for controlling the use of sets of XML documents is expected to be published in March 1998.

The XML Linking Language is due for submission to W3C for approval in April 1998. At a meeting in December 1997 it was decided that the Xpointers part of the specification should be published as a separate document. It is expected that both parts of the revised specification will be made publicly available in time for presentation at the second SGML/XML Europe conference to be held in Paris in May 1998.

A working group dedicated to the development of XSL is expected to start work at the end of January 1998.

5. SOME MAJOR COMPANIES AND SUPPLIERS

A range of XML document validators is currently (January 1998) available; many of these still need to be extended to take account of minor changes to the specification that were made during the period the recommendation was being balloted by W3C. Whilst most of the existing tools have been developed by small companies specializing in XML, it is noteworthy that one of the earliest publicly available processors came from Microsoft.

Details of the Microsoft range of XML tools can be found in a specialized directory at http://www.microsoft.com/xml. In January 1998 Microsoft released beta-software for an XSL display add-on that allows XSL specifications that create HTML flow objects to be displayed using Version 4.0 of their Internet Explorer. Details of this software, and other developments on the XSL front, can be found at http://www.microsoft.com/xml/xsl.

Many developers of SGML tools have already prepared options that allow their systems to read and generate XML compliant files. The most popular of these is the Arbortext Editor. Arbortext maintain an XML information site at http://www.arbortext.com/xmlresrc.html.

Many SGML tools are based on shareware developed in the UK by the principal developer of the DSSSL specification, James Clark. This SGML parser, SP, forms the basis of a DSSSL processor called JADE, which is also freely available to those wishing to evaluate DSSSL-O. Details of SP and JADE can be obtained from http://www.jclark.com/dsssl. The University of Edinburgh have developed a tool that allows XSL specifications to be converted into a form that can be processed by JADE (see http://www.ltg.ed.ac.uk/~ht/xslj.html). The output of JADE can either be an RTF file that can be read by most word processors or a TeX file that can be submitted for typesetting on the most commonly used academic typesetting systems (again freely available as shareware).

DataChannel, a major provider of push-technology, have announced that they will be supplying a complete suite of XML tools during 1998. There will be a stand-alone XML parser, based on Norbert Mikula's NXP parser (to be called the DataChannel XML Parser, DXP), an XML viewer based on John Tigue's viewer, a Java-based XML parser object, Pax Syntactica, based on the parser used in the viewer, and an XML tree viewer. Demonstrations of beta versions of these objects are currently available (http://www.datachannel.com/products/xml/index.html).

An early example of the use of XML in a niche market was the development in the UK of the Chemical Markup Language (CML). Designed by Peter Murray-Rust, Director of the Virtual School of Molecular Sciences at the University of Nottingham (http://www.nottingham.ac.uk/vsms), on behalf of the members of the Open Molecular Foundation (http://www.venus.co.uk/omf/), this system is in the process of being changed from a stand-alone SGML-based system to a web-based XML one.

It has also been proposed that the Maths Markup Language (MML) being developed on behalf of W3C should be an application of XML. MML will allow existing presentation-based mathematics to be integrated with a new form of processable, semantic-based, mathematics to provide a long-term upgrade path for mathematics publication.

APPENDICES

APPENDIX 1: REFERENCES FOR FURTHER INFORMATION

RELEVANT WWW SITES

http://www.w3.org/XML/ W3C Extensible Markup Language (XML) page

Page maintained by the World Wide Web Consortium (W3C) to keep the public informed of the latest state of XML development. Contains pointers to the latest published drafts of specifications, and pointers to other useful information relating to XML.

http://www.ucc.ie/xml/ The XML FAQ

A set of answers to frequently asked questions developed by members of the XML Special Interest Group.

http://www.sil.org/sgml/xml.html Robin Cover's XML page

Robin Cover maintains, at the Summer Institute of Linguistics, an extensive bibliography of data on SGML on behalf of the International SGML Users' Group (ISUG). He has set up a separate page to cover developments specific to XML. It includes a section providing details of most of the currently available XML-based software packages.

http://www.microsoft.com/xml/ Microsoft's XML page

Explains why Microsoft consider XML to be important, and provides pointers to various XML tools developed by Microsoft.

http://www.datachannel.com/products/xml/index.html DataChannel's XML page

Explains DataChannel's plans to release an XML toolset during 1998.

http://www.arbortext.com/xmlresrc.html ArborText's XML page

Explains the XML add-ons to ArborText's Adept 7 SGML editor, and points to useful sites on XML.

API	Application Programming Interface
CDF	Channel Definition Format: W3C proposed recommendation
CGI	Common Gateway Interface: IETF specification
CML	Chemical Markup Language
CSS	Cascading Style Sheets: W3C recommendation
DOM	Document Object Model: proposed W3C recommendation
DSSSL	Document Style Semantics and Specification Language defined in ISO/IEC 10179:1996
DSSSL-O	DSSSL Online subset proposed by W3C SGML on the Web working group
ECMA	European Computer Manufacturers Association
ECMAScript	ECMA 262: A general purpose scripting language
EDI	Electronic Data Interchange
GCA	Graphic Communications Association of America
HTML	HyperText Markup Language: W3C recommendation for the markup of text
IETF	Internet Engineering Task Force
JADE	James' Amazing DSSSL Engine: shareware DSSSL engine developed by James Clark, the author of the DSSSL standard
MML	Maths Markup Language: W3C recommendation for the markup of mathematics
ISUG	International SGML Users' Group
OFX	Open Financial Exchange: W3C proposed recommendation prepared by consortium led by Microsoft
OSD	Open Software Description: W3C proposed recommendation prepared by consortium led by Microsoft and Marimba
OTP	Open Trading Protocol: proposal from The OTP Forum
RDF	Resource Description Framework: W3C proposed recommendation
SAX	Simple API for XML
SGML	Standard Generalized Markup Language defined in ISO 8879:1986
SP	SGML Parser: shareware developed by James Clark
UCS	Universal Character Set: ISO/IEC 10646:1991 (aligned with Unicode)
URL	Uniform Resource Locator. A standard for writing a unique text reference to a piece of data in the World Wide Web
XLL	XML Linking Language: Part 2 of the XML recommendation
XML	The Extensible Markup Language: defined by W3C SGML Working Group
XSL	XML Style Language: Part 3 of the XML recommendation
WWW	World Wide Web
W3C	World Wide Web Consortium