|Last modified: September 27, 2000|
|The SGML/XML Aversion to Semantics|
In order to fully appreciate the significance of the new XML Schema work with its focus upon datatypes and other semantic constraints, one must understand the historical posture of SGML toward "datatypes" and, more generally, toward "semantics." For better or worse -- opinions differ -- SGML as a metalanguage had a strong aversion to semantics, and this feature has been carried over into the design of XML. We offer here three illustrative examples of SGML's disposition: the first is being addressed -- some would say redressed -- quite visibly in the current W3C draft specifications, while the latter two may be addressed less completely in the final W3C Recommendation.
What, then, is a "number" to SGML? The short answer: a "number" (numeric value) does not exist. While SGML offered no straightforward method of defining or validating a datatype in simple element content (#PCDATA), it did support a limited degree of freedom in assigning a "declared value" for an attribute, e.g., a 'number' (NUMBER) or 'number token' (NUTOKEN), where the characters "." and "-" are allowable in the latter. So, with the NUTOKEN, one might encode <walnut weightInGrams="12.4">, but the SGML parser would not understand this semantically as a decimal number; an encoding <walnut weightInGrams="2.4..089.-.--..23.3-8"> would be equally "valid" to SGML because there is no semantic notion of a number whatsoever. One might attempt to do type conversion on a NUTOKEN value, with variable success, but to the SGML parser the value is just a string. It was therefore probably a small loss that XML threw out the NUMBER and NUTOKEN types. XML is left even weaker than SGML in its rudimentary datatype system for attributes; the validating XML processor has no means of understanding a numeric value.
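To make the contrast concrete, here is a minimal sketch using the 'walnut' example above; the final declaration uses the W3C XML Schema datatype vocabulary, whose syntax was still in draft at the time of writing:

```xml
<!-- SGML DTD: NUTOKEN admits any token of digits plus "." and "-";
     "2.4..089.-.--..23.3-8" is exactly as valid as "12.4" -->
<!ATTLIST walnut weightInGrams NUTOKEN #REQUIRED>

<!-- XML DTD: NUTOKEN is gone; the closest remaining declared value is
     NMTOKEN, which is still just a string to the parser -->
<!ATTLIST walnut weightInGrams NMTOKEN #REQUIRED>

<!-- W3C XML Schema: xs:decimal names the numeric semantic, so a schema
     processor must reject any non-decimal lexical form -->
<xs:attribute name="weightInGrams" type="xs:decimal" use="required"/>
```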
In the six cases of SGML's "declared value" where semantic checking was indeed said to be the role of an SGML parser (ID, IDREF, IDREFS, ENTITY, ENTITIES, and NOTATION), the semantic aspect was exceedingly weak. In the case of ID and IDREF(S), the lack of support for relational semantics has carried over into XML. How so? The declared value "ID" involves the semantic of global uniqueness, while the related value "IDREF(S)" implies the nomination of (reference to) one or more of these global IDs. The weakness of this kind of semantics, if it can legitimately be called "semantics," is revealed in several ways.
First, SGML/XML have no mechanism for expressing constraints upon the type of document object that is referenced via the ID-IDREF mechanism. One might expect to be able to declare that a "bibliographic reference" should indeed point from an in-text citation to a bibliographic-reference object (e.g., to an item in a reading list), but the ID-IDREF mechanism cannot support this constraint. In modeling genealogical information, one might wish to declare an encoding for "sibling" which requires a reference to a "person" having a common parent, but the SGML ID-IDREF mechanism simply cannot be used to enforce the constraint upon the referents. Second, the IDREF scheme permits any number of (normally) bizarre situations which violate sensible expectations. Imagine that we have a "link" element with an "ID" attribute named "id," an "IDREFS" attribute named "targets," and an "IDREF" attribute named "mySource." Then we might find in our text a DTD-valid markup occurrence such as <link id="section2" targets="section2 section1 section2 section2" mySource="section2">, which almost certainly encodes nonsense vis-à-vis the intent of the markup language designer. We should be able to express, for example, constraints on (a) the exact number of tokens in "targets," (b) the significance of the encoded order of the tokens in "targets," (c) the appropriate class(es) of elements nominated in "mySource" and "targets," etc. While the semantic weakness of the ID-IDREF pointing mechanism in SGML/XML is consistent with the general posture of radical disinterest in semantics, we must conclude that the ID-IDREF semantic constructs are simply not enough. We require facilities to specify and "schema-process" the fundamental notions of relational semantics through references -- notions that cannot be adequately expressed in the SGML/XML DTD metalanguage syntax.
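The scenario can be sketched as follows; the "link" element and its attributes are the hypothetical ones just described, and the identity-constraint syntax (xs:key/xs:keyref) is taken from the draft W3C XML Schema language:

```xml
<!-- DTD: nothing constrains how many tokens "targets" may hold, whether
     their order matters, what class of element each token nominates, or
     whether self-reference makes any sense -->
<!ATTLIST link
  id        ID      #IMPLIED
  targets   IDREFS  #IMPLIED
  mySource  IDREF   #IMPLIED>

<!-- DTD-valid, yet almost certainly nonsense: -->
<link id="section2" targets="section2 section1 section2 section2"
      mySource="section2"/>

<!-- XML Schema: a key/keyref pair can at least insist that "mySource"
     nominate a "section" element rather than any ID-bearing element -->
<xs:key name="sectionIds">
  <xs:selector xpath=".//section"/>
  <xs:field xpath="@id"/>
</xs:key>
<xs:keyref name="mySourceTargets" refer="sectionIds">
  <xs:selector xpath=".//link"/>
  <xs:field xpath="@mySource"/>
</xs:keyref>
```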
The reference mechanism in SGML/XML may be considered a special case -- though an especially critical case -- in the much larger issue of semantic relations. It has sometimes been claimed that the goal of SGML was to enable the modeling of the "logical structure" of a document or database. But what is meant by logical structure? In reference to a simple document -- composed of chapters, sections, paragraphs, and characters -- the "logical structure" may sensibly be understood as a hierarchical structure in which one thing "contains" smaller things, recursively. Sort of... But what is the "logical structure" of an abstracted financial transaction, a trading protocol, a QA process, a workflow or timing diagram, a complex polymer, a highly "intelligent" graphic, a complex mathematical equation, a colony of penguins, a weather or astronomical observation?
Most of the real-world events and processes being modelled in XML today do not admit of a single or even obvious "logical structure" in the sense of having a natural beginning and end, or components "containing" smaller components. The models need principally to express semantic relationships between information objects, including rich webs of relational inter-dependencies.
Here, precisely, is one of the inherent limitations of SGML/XML: it has no formal mechanism to specify and constrain semantic relationships so as to permit declaring a thing's meaning. The principal constructs available in SGML/XML for expressing relationships are "containment" (hierarchy, A contains B), "adjacency" (A followed by B), "co-occurrence" (if A then [also/not] B), "attribute" (string descriptor), and "semantically opaque reference" (ID-IDREF). Furthermore, there is no way to indicate formally what containment and adjacency actually mean (or do not mean) in terms of element/object relationships. One might see the markup <book><chapter></chapter></book> and think that a real-world book object must "contain" a thing called chapter, but this inference cannot be licensed in any way from the SGML and XML standards. The strings 'book' and 'chapter' are totally meaningless to the parser/processor, and the markup nesting "means" nothing; it may be in fact that a 'chapter' "contains" a 'book' in the real world, or that a 'chapter' becomes a neutron star when coming within 2 light years' distance of a 'book'. Neither the labels nor the structural relationships "mean" anything.
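The point can be made with two content models. Both are equally acceptable to a validating parser, because neither direction of nesting "means" anything:

```xml
<!-- DTD 1: a "book" contains "chapters" -->
<!ELEMENT book    (chapter+)>
<!ELEMENT chapter (#PCDATA)>

<!-- DTD 2, equally legitimate to the parser: a "chapter" contains "books" -->
<!ELEMENT chapter (book+)>
<!ELEMENT book    (#PCDATA)>
```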
But it's worse than that. Objects in the XML problem space are not related fundamentally by notions of containment and adjacency anyway. These constructs are indeed useful for serialization, but they are not optimal for modeling object relationships in a problem domain in the way users typically conceive of the core abstractions. All primitive relational semantics important to the schema designer must be shoehorned into these crude syntactic structures, but the XML DTD processor will not be able to recognize the significance since no additional "meaning" can be attached to them. The notion of "attribute" might have been more useful except that SGML/XML supports only a flat data model for the value of an attribute in a name-value pair (essentially 'string'). This flat model cannot easily capture complex attribute notions such as would be predicated from abstracted real world objects, where attribute values are themselves typically represented by complex objects, either owned or referenced. [OO Model]
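A sketch of the shoehorning (all names hypothetical): an OO-style attribute whose value is itself a complex object must be recast in XML either as an opaque string or as a child element:

```xml
<!-- Desired (OO-style, not expressible in a DTD):
       walnut.weight = { magnitude: 12.4, unit: gram } -->

<!-- Option (a): flatten the complex value into an uninterpreted string -->
<walnut weight="12.4 gram"/>

<!-- Option (b): demote the "attribute" to a child element; the DTD can
     no longer say that weight is a property rather than content -->
<walnut>
  <weight magnitude="12.4" unit="gram"/>
</walnut>
```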
The new XML schema work provides a framework within which, at least philosophically, the concern for semantic relations can be addressed. Unambiguous meaning is the foundation of any agreement or contract. Humans collaborating on the design of large-scale transaction-based computing systems need to be able to unambiguously express, and thus agree upon, what they "mean" in the sphere of semantic relations between objects in their models. Objects in the problem space have behaviors, and these behaviors too must often be modelled. Machines, in order to truly support interoperable agent-based computing solutions, need to "understand" the logic implied by shared ontologies that use robust, standardized definitions of relational and behavioral object semantics. To meet these requirements, it will not be sufficient simply to design XML-based languages -- using XML DTDs -- to express relational and behavioral object modeling formalisms. It is desirable rather to provide core facilities within an XML schema definition language to express primitive semantic relations such that generalized XML schema processors can validate business logic expressed in XML schemas. The current goal of making "semantics" a core concern of XML schemas lays the groundwork for these new generalized metalanguage facilities.
The paragraphs above illustrated the SGML/XML posture toward semantics and contemplated the overall weakness of a markup-based modeling formalism which lacked interest in semantics. To many critics of SGML, it has never been obvious why a descriptive markup metalanguage should need to be innocent of, and resolutely disinterested in, primitive (ontologic, relational) semantics such as might be expressed through constraint-based sublanguages. Although a full critical appraisal of SGML's disinterest in semantics is beyond the scope of the present article, a few thoughts will prepare the way for discussion of the new XML schema work in terms of its full, enthusiastic "embrace of semantics." As both a participant in and observer of this world since 1986, I offer here two personal and incomplete accountings, from among many, to explain the historical situation.
First, we must remember that SGML was born in the context of "document publishing." Indeed, one of the chief concerns addressed by SGML was that it be able to "achieve typographic results comparable to procedural markup" for high-quality printing of marked-up documents [8879 Annex A]. In the early 1980s there was no Web: no notion of the "universal browser" that scholars, clerics, and business managers alike would use to "click their way through highly interactive and dynamic online documents," and no notion of high-bandwidth communications lines supporting global real-time e-commerce, reaching into millions of homes. What did exist? Multi-national corporations with document databases storing content that needed to be repurposed, and collaborative authoring environments in which textual information needed to be sent across many hostile barriers to be edited on machines having very particular notions of how to represent text. But the principal commodity was in fact information that existed largely in one single domain: character text. This character text would be rendered on a screen and "finally" printed on paper in various formats -- and, of course, archived. SGML served that publishing and archiving purpose tolerably well. Binary data in the form of images would be bunged in at print time, but the trading commodity was character text. Characters were for markup, characters were the information substrate represented usually in "native form," and most importantly, "the information in the leaf nodes of the tree all came from that same domain" of character text [Simons 1992]. In this world, given the requirements for print publishing as the chief concern, one can understand how the general modeling of semantic relations could be conceived of as a minor and potentially separate concern.
At the edges of this world, however, even in the 70s and 80s, were people who appreciated the power of the fundamental markup abstraction, but perceived a different application of the "descriptive" markup mechanism. They saw markup as a means of representing other arbitrarily complex kinds of data that were not fundamentally of and in the domain of character text but could nevertheless be represented using markup delimiters and character strings. They visualized, for example, complex databases of linguistic, literary, and text-critical annotation analyzing some text, but where the annotation itself represented 98% of the information in the database. The annotations were created to support elaborate queries (getting answers to research questions) and perhaps for constructing multiple views -- not to be used as something for printing "as character text" on paper or on the screen. To ensure integrity of the information in the database, researchers needed facilities that the SGML "publishing" system could not of itself offer: a sublanguage for expressing datatype constraints, a more powerful notion of an "attribute," and notational constructs that would allow the markup language designer as a domain expert to model the problem space straightforwardly -- exactly as the real-world phenomena are abstractly perceived. SGML's optimization for representing document (character) text through serialization did not readily lend itself to modeling a complex network or web of related data objects, where the relational semantics and other semantic integrity constraints are foundational to the modeling enterprise. Document "text" in its simplest form is serial; objects in the real world and information about them are not.
A second explanation for SGML's evident aversion to "semantics" can be offered in terms of the competing forces from which SGML found constant opposition. The commitment of SGML design to principles of "descriptive" markup (eschewing "procedural" and "presentational" markup) came to be expressed as a commitment to a radical and radically-needed separation of concerns. The dogma was formulated thus: one should delimit in markup exactly what "is there" (in the text) -- identifying it descriptively (viz., by name) in terms of "what it is" rather than "how it should be processed." Validation processing via the SGML parser was reckoned as a special case. So too, speaking prescriptively, a creedal statement became: "the specification for the logical structure of the document should be made independent of and uncontaminated by any specification for the processing of that structure and content." And so, "descriptive markup should be innocent of processing semantics."
But something of an equivocation crept into these apologetics: "processing semantics" was shortened and made equivalent to "semantics." We heard (and we said): "Pure SGML languages do not countenance (processing) semantics... they should not try to express semantics... so design your processing semantics elsewhere..." Catch the slip? Declaring what should be done (procedurally) as a matter of particular processing semantics is vastly different from expressing declaratively the primitive semantics of object ontology and object relationship. A notion even more frightening to traditional SGML dogma would be that some "behaviors" intrinsically bound up in object ontology and expressible declaratively could harmlessly be brought into the markup-language enterprise. In the rhetoric of such public debate, SGML advocates exhibited (by my reckoning) too little patience for the persistent question of database experts, who asked quizzically, "Well, why shouldn't a (meta)markup language be capable of modeling datatype (semantic) constraints, while steering clear of processing directives? Why shouldn't a schema processor be able to address itself to the semantic integrity of a database modelled in a markup notation, if the meta-model knows about semantic relations? Why not indeed?"
The formulation "SGML has no pre-defined application-level processing semantics" captured the historic concern for separation of "logical structure" representation from "processing functions," and it accurately described SGML's posture toward "processing semantics." At the same time, such language fundamentally begged the question of markup semantics by assuming (1) what should be meant by "processing," (2) what should be meant by an "application" as opposed to a validating parser/processor, and (3) that the expression and validation of datatype semantics would introduce "processing dependencies." The SGML processing model involved machine processing via the validating SGML parser as well as the entity manager. Both were reckoned as safe, special cases of "processing" within the SGML system, predicated upon an assumption that the system should not, however, embrace (processing) semantics other than the low-level lexical semantics necessary to bootstrap the system. The W3C Schema Definition Language now exposes these assumptions by asserting that processing dependencies are not necessarily involved in a descriptive markup system which elects to address primitive (ontologic, relational) semantics defined through standard declarative mechanisms. The W3C work further asserts that generalized schema-level validation of primitive semantics (including "business logic") is an appropriate goal, both pragmatically and philosophically -- pace the argument that all validation of semantics should be left to the application.
Note 2: The SGML facility for partitioning information content into "element" and "attribute" markup structures has been judged poorly designed by many who have attempted to use SGML outside the domain of document publishing. First, the SGML element "content model" notion presupposes that there is an obvious and optimal way to distinguish "content" information from "non-content" information. This distinction simply does not apply outside the realm of simple, static documents (where "content" represents characters-to-be-printed) -- if it applies even there. Second, SGML's element and attribute constructs do not easily map onto the two-level modeling constructs commonly used in other schema languages (e.g., EXPRESS schema language) and object-oriented systems, as well as in general linguistics and computer science. The typical two-level model used in schema languages and OO-systems provides for a notion of Object and Attribute (using various names) in which an Object is defined through its named Attributes, an Attribute is a name-value pair, and the "value" is itself an Object (complex or atomic). SGML's model for Attribute is profoundly broken at this point: it does not accept a complex "value." One may thus observe that while XML is not a programming language, it leans in the direction of a programmer's language by virtue of the artificial (skewed) modeling constraints it imposes. Under the influence of these constraints, sometimes offered as "implementation choices for encoding," an analyst as schema/DTD writer may be seduced or coerced into modeling parts of a problem domain in ways that are not natural or well-matched to the user's conceptual model of the problem space -- sometimes with a cascade of unforeseen consequences due to hidden assumptions in XML software implementations. 
These features, in the name of "optimizations", work to erode a foundational principle (viz., that XML should have no predefined application level processing semantics); software may privilege certain encoding models, for example, by treating attribute nodes ("pull") differently than element nodes ("push"). For references to generalized use of the typical two-level system for information representation across many disciplines (e.g., Roman Jakobson and Noam Chomsky in linguistics; Donald Knuth in computer science; L. Cardelli in data type systems; H. Ait-Kaci in knowledge representation), see Langendoen and Simons "A Rationale for the TEI Recommendations for Feature Structure Markup," CHUM 29 (1995) 191-209. The W3C XML Schema Requirements document would appear to encourage the design of a schema-level facility to work around the SGML attribute problem (viz., the schema language is to be "more expressive than XML DTDs"), but it remains uncertain whether the Schema working group will actually address this matter.
Note 3: See "Annex A. Introduction to Generalized Markup" in the ISO 8879-1986 specification.
Note 4: The phrase "all the data in the leaf nodes are from the same domain - the domain of character text" is a paraphrase of something I heard or read from Gary Simons, ca. 1992. This insight and formulation seems to capture an explanation for why SGML markup works well in some relatively simple scenarios (structuring serialized characters at surface-text level), but does not work so well in the realm of modeling "text" multidimensionally. SGML's inability to handle multiple concurrent hierarchies (or: multiple analytical models for encoded objects) in an elegant manner reveals in a different way the nature of the inline-markup problem.
Note 2001-01, approximate reference discovered: "A Conceptual Modeling Language For the Analysis and Interpretation of Text." By Gary F. Simons. Text Encoding Initiative Committee on Text Analysis and Interpretation. Document Number: TEI AIW12. January 16, 1990.
"... The basic model of text embodied in garden variety SGML applications (like the AAP tag sets for books, articles, and journals) is that all the bottom level data elements come from one domain (namely, the words and punctuation of a particular language), that these data elements are arranged in an inherently ordered sequence, and that the main problem of document markup is to encode the fact that the content of the text also involves an organization in which lower-level elements nest inside of higher-level ones to form a hierarchy of text elements.... The SGML approach of enforcing syntactic integrity while ignoring semantic integrity works well in the world of conventional documents. This is because the semantics of the markup bears an implicit one-to-one mapping to its syntax. The semantic domain of the words and punctuation of the author's language maps directly onto the syntactic domain of CDATA. The semantic concept of order in the flow of the conceptual text maps directly onto the linear sequence of elements and CDATA in the encoded text... The approach of explicit syntax with implicit semantics breaks down when there is not a one-to-one mapping between semantic and syntactic notions..."
[Extracted: The preceding [lightly revised] paragraphs are excerpted from a paper written (and submitted for publication) in July 1999.]