Cover Pages: XML and Semantic Transparency

How does XML help with the encoding of information at the semantic level? Or does it? New users sometimes refer to XML as "semantic markup," and may be heard to praise XML for its ability to express semantic clarity through markup. We understand what gives rise to this sentiment: a work environment within which proprietary, procedural, and implicit markup has been the norm. Someone who uses a text editor to examine an XML document -- comparing it to an ancient WordStar file, to a comma-delimited text file, to PostScript, or to any document using a procedural or presentational markup language -- will readily judge the XML document more meaningful with respect to the information objects represented by text. The markup itself is a form of 'metadata', explaining to us what the constituent elements are (by name), and how these information objects are structured into larger coherent units.

We may rehearse this fundamental axiom of descriptive markup in terms of a classical SGML polemic: the doubly-delimited information objects in an SGML/XML document are described by markup in a meaningful, self-documenting way through the use of names which are carefully selected by domain experts for element type names, attribute names, and attribute values. This is true of XML in 1998, was true of SGML in 1986, and was true of Brian Reid's Scribe system in 1976. However, of itself, descriptive markup proves to be of limited relevance as a mechanism to enable information interchange at the level of the machine.

As enchanting as it is to contemplate the apparent 'semantic' clarity, flexibility, and extensibility of XML vis-à-vis HTML (e.g., how wonderfully perspicuous XML <bookTitle> seems when compared to HTML <i>), we must reckon with the cold fact that XML does not of itself enable blind interchange or information reuse. XML may help humans predict what information might lie "between the tags" in the case of <trunk></trunk>, but XML can only help. For an XML processor, <trunk> and <i> and <bookTitle> are all equally (and totally) meaningless. Yes, meaningless.
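The point can be sketched in a few lines of Python, assuming that the standard-library xml.etree.ElementTree parser is a fair stand-in for a generic XML processor. The helper name `shape` and the sample documents are hypothetical illustrations. The sketch reduces each parsed document to its structure with the element name set aside; what remains is identical for all three.

```python
import xml.etree.ElementTree as ET


def shape(element):
    """Describe an element without its name:
    (text content, sorted attributes, shapes of child elements)."""
    return (
        element.text or "",
        sorted(element.attrib.items()),
        [shape(child) for child in element],
    )


# Hypothetical sample documents: same content, three different element names.
docs = [
    "<trunk>Moby-Dick</trunk>",
    "<i>Moby-Dick</i>",
    "<bookTitle>Moby-Dick</bookTitle>",
]

roots = [ET.fromstring(d) for d in docs]
names = [r.tag for r in roots]
shapes = [shape(r) for r in roots]

# The parser reports three different names but attaches no meaning to any
# of them; everything else it knows about the documents is the same.
all_same_shape = all(s == shapes[0] for s in shapes)
print(names, all_same_shape)
```

Any difference in how a title, an italic span, or a trunk is treated has to come from application code that inspects `names`; the parser contributes none of it.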

Just like its parent metalanguage (SGML), XML has no formal mechanism to support the declaration of semantic integrity constraints, and XML processors have no means of validating object semantics even if these are declared informally in an XML DTD. XML processors will have no inherent understanding of document object semantics because XML (meta-)markup languages have no predefined application-level processing semantics. XML thus formally governs syntax only, not semantics.
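A small sketch of what a missing integrity constraint looks like in practice, again in Python with xml.etree.ElementTree. Note the assumptions: the loan vocabulary, the dates, and the function `loan_is_consistent` are all hypothetical, and ElementTree checks well-formedness only (it is not a validating parser). A DTD could additionally require that both child elements be present and in order, but it has no way to state that one date must not precede the other.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical vocabulary: a loan that is due a month before it was made.
doc = """<loan>
  <borrowed>1998-11-24</borrowed>
  <due>1998-10-23</due>
</loan>"""

# Parsing succeeds: the document is well-formed, and that is all the
# processor is in a position to check.
root = ET.fromstring(doc)


def loan_is_consistent(loan):
    """Application-level rule, unknown to the XML processor:
    a loan cannot be due before it was borrowed."""
    borrowed = date.fromisoformat(loan.findtext("borrowed"))
    due = date.fromisoformat(loan.findtext("due"))
    return due >= borrowed


print(loan_is_consistent(root))
```

The rule lives entirely outside the markup, so two applications reading the same document are free to enforce it differently or not at all.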

In fact, XML syntax is designed for representing an encoded serialization, and thus has a very limited range of expression for modeling complex object semantics, where "semantics" fundamentally means an intricate web of constrained relationships and properties. Otherwise stated: XML is a poor language for data modeling if the goal is to represent information objects in the problem domain such that they correspond transparently ("one-to-one") to the user's conceptual model of objects in this domain. The principal constructs available in XML for expressing relationships are "containment" (hierarchy), "adjacency" (A 'followed by' B), "co-occurrence" (if A then [also/not] B), "attribute", and "opaque reference". These constructs are indeed useful for serialization, but are not optimal for modeling objects of a problem domain in the way users typically conceive of the objects as core abstractions. All primitive relational semantics must be shoehorned into these crude syntactic structures, and even then, the XML processor will not be able to recognize their significance. The notion of "attribute" might have been more useful except that XML supports only a flat data model for the value of an attribute in a name-value pair (essentially a string). This flat model cannot easily capture complex attribute notions such as would be predicated of abstracted real world objects, where attribute values are themselves typically represented by complex objects, either owned or referenced.
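The flat attribute model and the opaque reference can both be seen in a short Python sketch. Everything in the sample is an assumption made for illustration: the catalog vocabulary, the attribute names, and the convention that `dimensions` holds two numbers followed by a unit. The parser hands back strings in every case; resolving the reference and unpacking the compound value are left to application code.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, for illustration only.
doc = """<catalog>
  <author id="a1"><name>Herman Melville</name></author>
  <book author="a1" dimensions="12.7 20.3 cm" year="1851">
    <title>Moby-Dick</title>
  </book>
</catalog>"""

root = ET.fromstring(doc)
book = root.find("book")

# Every attribute value arrives as a flat string, whether it was meant to
# model a reference, a measurement, or a year.
value_types = {name: type(value).__name__ for name, value in book.attrib.items()}

# The "opaque reference" is just a string; following it is application code.
authors_by_id = {a.get("id"): a for a in root.findall("author")}
author_name = authors_by_id[book.get("author")].findtext("name")

# A compound value must be unpacked by a convention the parser does not know.
width, height, unit = book.get("dimensions").split()
size = (float(width), float(height), unit)

print(value_types, author_name, size)
```

Containment (the title inside the book) is the only relationship here that the syntax itself expresses; the link from book to author exists only because the application chooses to treat two matching strings as a link.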

Interoperable computing solutions imply the existence of a sharable ontology, or common set of object semantics. Implementors will still be able to use localized and otherwise customized XML markup languages if they choose, but it should be possible to express and validate the semantics of the design as well as the raw XML syntax. Participants in a partner or network relationship need assurance that transactions will be negotiated meaningfully and correctly because all participants mean the same thing in the use of the interchange markup constructs. But shared ontologies are not the domain of XML.

If XML itself governs only syntax, and provides inadequate basis for specifying the primitive 'ontological' and relational semantics desirable for the support of interoperable processing solutions, where is semantic specification to be made? In the world of "documents destined for paper print or computer display," the notion of a "stylesheet" provides one key opportunity for specifying processing semantics. The W3C activity on style represents one arena in which methods for display semantics are being defined. But documents can perform in far more powerful ways, providing interactive interfaces for research-oriented and task-driven activities, incorporating expert system knowledge, electronic performance support, and other facilities. To the extent that documents can have advanced interfaces supporting transformation, navigation, querying, and computation, the term "stylesheet" seems too narrow a term to encompass specifications for document semantics.

Furthermore, the use of XML for "data" interchange may already outweigh the use of XML for "document" display. For messaging and other transaction data, specifications approaching the level of formal semantics are desirable (e.g., KIF or KQML), governing not just common (atomic) data types in business objects, but also the complex objects used by computer agents in large-scale business transactions. XML vocabularies supporting these applications will need to be defined in terms of precise object semantics.

Some support for generic XML semantics is being designed in ancillary W3C submissions/specifications like SOX, DCD, RDF syntax and schemas, and XML Data. The W3C Schema Working Group chaired by Dave Hollander and C. Michael Sperberg-McQueen has been chartered to address XML document semantics in the broader sense. ISO activities such as the SGML/STEP harmonization effort (involving the semantics of the EXPRESS data modeling language, and property sets) represent other interesting work. The XMI effort sponsored by OMG integrates XML with two other key industry standards (UML, the Unified Modeling Language; MOF, the Meta Object Facility), offering yet other avenues for modeling data object semantics in XML syntax. These specifications promise to make up for some of the recognized limitations of XML, but it appears that other industry initiatives might be required to coordinate the results of the several efforts -- to enable the meaningful sharing of XML-based schemas and information units at the semantic level.

The need for semantic transparency in XML-based applications has been recognized within several vertical industries where XML can be identified as playing a vital role in building interoperable systems for collaborative industry endeavor. Within the realm of electronic commerce, for example, Ontology.Org now plays a preeminent role in highlighting the critical need to provide unambiguous semantic specification for XML-encoded objects. In order for XML to achieve its full potential, this principle must be recognized and democratized across industry domains.

By Robin Cover. October 23, 1998. Revised November 24, 1998.

[Note: this represents reworked text from an article submitted in draft to Sun/OASIS. See now the reference collection on "XML Schemas." -rcc]

