[This local archive copy mirrored from: http://www.ltg.ed.ac.uk/~ht/sgml97.html; see the canonical version of the document.]

Why I demand Schemata: Element Type Hierarchies for Transparent Document Structure Definition
Henry S. Thompson
Language Technology Group
University of Edinburgh

Oct 15 1997

1. Introduction

Two recent proposals for meta-applications of XML (XML-Data and MCF) have included DTD fragments for describing document structure, sometimes called 'schemata'. In this paper I describe the XML-Data schemata proposal, concentrating on the motivation for and nature of the provision of an element-type hierarchy, in which element types can inherit attribute declarations and positions in content models from ancestors in the hierarchy. I argue that this represents a major improvement over the use of parameter entities to structure and maintain DTDs.

Complex document types require rich and complex structural markup. SGML provides powerful mechanisms for defining the grammar of such markup, with element type and attribute declarations in the document type definition (DTD). The structure of the DTD itself, however, finds no explicit expression in SGML. The fact that element types are related in a structured fashion can only be represented implicitly, e.g. through the use of parameter entities. There is a real need, for ease of understanding and ease of maintenance, to address this issue.

There is an obvious solution, prefigured by the following, which appeared recently in a public XML-related newsgroup:

"We really need to build an object-oriented hierarchy, with classes that are extended by subclasses and so on...For example, a <restaurant> is a subclass of <location> and inherits the properties of <location> such as <address> and <street number>, but adds other properties, such as <menu>."

In this paper I outline a proposed XML application which provides exactly this facility.

2. Taking control of the D.S.D.

The watchword of SGML used to be "Taking control of your data". SGML gave you the means to express the grammar of your markup yourself, rather than being bound by the choices of word processor and document compiler manufacturers.

A side-effect of the XML initiative has been to open up the possibility of a similar move one level up, as it were. Just as SGML allowed us to experiment with different markup for document instances, so I think XML invites us to experiment with different markup for document structure definitions.

In a recent proposal my co-authors and I used the word 'schema' (plural 'schemata') for an XML document instance which itself describes the structure of a document type.

In our approach, we envisage

a) the schema DTD, a definition of an XML representation of document structure, that is, an old-style DTD for schemata;

b) a master XML application, the equivalent of the XML parser, which is capable of processing pairs of XML documents: the first, a schema, is valid in terms of the schema DTD; the second, an instance, has no old-style DTD, but is both well-formed in the XML sense and meta-valid in terms of the schema expressed by the first.

Meta-validity is, of course, validity with respect to the document structure constraints contained in the associated schema, which conforms to the schema DTD.

This "takes control of the D.S.D." in that experimenting with the grammar of schemata now involves changing the schema DTD (and the master application), not changing XML itself.

3. Document type hierarchies

The first move we make after introducing schemata which reproduce the expressive capabilities of existing XML DTDs is to add an explicit element type hierarchy.

[Out of time, and this is already late. The balance of the paper contains two worked examples: one simple and introductory, the other a detailed comparison of this approach with the traditional one based on parameter entities, using a TEI DTD fragment. The examples, without the prose, follow below:]

  <schema>
    <elementType id="animalFriends">
      <elt href="#pet" occurs="PLUS"/>
    </elementType>

    <elementType id="pet">
      <any/>
      <attribute id='name'/>
      <attribute id='owner'/>
    </elementType>

    <elementType id="cat" extends="#pet">
      <elt href='#kittens'/>
      <attribute id='lives' type='NMTOKEN'/>
    </elementType>

    <elementType id="dog" extends="#pet">
      <elt href='#puppies'/>
      <attribute id='breed'/>
    </elementType>
  </schema>

This schema says that the animalFriends element type can contain one or more pet elements. Because cat and dog are subtypes of pet, they can occur as well. So the following instance fragment is now meta-valid under this schema:

  <animalFriends>
    <cat name="Fluffy" lives='9'/>
    <pet name="Diego"/>
    <dog name="Gromit" owner='Wallace' breed='mutt'/>
  </animalFriends>
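
The subtype reasoning in this example can be made concrete with a small program. What follows is only a sketch, not the master application described in section 2. It assumes a well-formed rendering of the schema in which the cat and dog elementType elements enclose their children, it assumes that every declared attribute is optional, and it checks only the children of the root element and their attribute names. The function names (load_types, lineage, allowed_attributes, check) are hypothetical and are not part of the XML-Data proposal.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<schema>
  <elementType id="animalFriends">
    <elt href="#pet" occurs="PLUS"/>
  </elementType>
  <elementType id="pet">
    <any/>
    <attribute id="name"/>
    <attribute id="owner"/>
  </elementType>
  <elementType id="cat" extends="#pet">
    <elt href="#kittens"/>
    <attribute id="lives" type="NMTOKEN"/>
  </elementType>
  <elementType id="dog" extends="#pet">
    <elt href="#puppies"/>
    <attribute id="breed"/>
  </elementType>
</schema>
"""

INSTANCE = """
<animalFriends>
  <cat name="Fluffy" lives="9"/>
  <pet name="Diego"/>
  <dog name="Gromit" owner="Wallace" breed="mutt"/>
</animalFriends>
"""

def load_types(schema_text):
    """Map each elementType id to its parsed schema element."""
    root = ET.fromstring(schema_text)
    return {t.get("id"): t for t in root.findall("elementType")}

def lineage(types, name):
    """Return [name, parent, grandparent, ...] by following 'extends'."""
    chain = []
    while name is not None:
        chain.append(name)
        parent = types[name].get("extends")
        name = parent.lstrip("#") if parent else None
    return chain

def allowed_attributes(types, name):
    """Attributes declared on the type itself or on any ancestor."""
    return {a.get("id")
            for t in lineage(types, name)
            for a in types[t].findall("attribute")}

def check(types, instance_text):
    """Simplified check of the root's children; returns a list of problems."""
    root = ET.fromstring(instance_text)
    wanted = [e.get("href").lstrip("#") for e in types[root.tag].findall("elt")]
    found = []
    # Simplification: treat the content model as "one or more" (occurs="PLUS").
    if len(root) == 0:
        found.append(f"{root.tag}: needs at least one child")
    for child in root:
        # A child is acceptable if it is a wanted type or a subtype of one.
        if child.tag not in types or not any(
                w in lineage(types, child.tag) for w in wanted):
            found.append(f"{child.tag}: not allowed in {root.tag}")
            continue
        extra = set(child.attrib) - allowed_attributes(types, child.tag)
        if extra:
            found.append(f"{child.tag}: undeclared attributes {sorted(extra)}")
    return found

TYPES = load_types(SCHEMA)
print(check(TYPES, INSTANCE))  # prints []
```

Under these assumptions the instance fragment yields no problems, whereas a cat carrying a breed attribute would be reported, since breed is declared only on dog and is not inherited from pet.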

Compare

  <!ENTITY % paraContent '(#PCDATA | %m.phrase; | %m.inter;)*'    >
  <!ENTITY % m.phrase '%x.phrase; %m.data; . . .'>
  <!ENTITY % a.global '        id ID #IMPLIED
                               . . .'>
  <!ELEMENT p         - O  (%paraContent;)                    >
  <!ATTLIST p              %a.global;
            TEIform            CDATA               'p'            >

with

  <elementType id='p' extends='#global'>
    <mixed>
     <elt href='#phrase'/>
     <elt href='#inter'/>
    </mixed>
    <attribute id='TEIform' presence='fixed' default='p'/>
  </elementType>

  <elementType id='phrase'>
   . . .
  </elementType>

  <elementType id='global'>
   <attribute name='id' type='id'/>
   . . .
  </elementType>

4. Maintaining Order

In my view, the only coherent development policy is to introduce into the schema DTD only those things which we know how to translate into vanilla XML. Not only does this guarantee inter-operability in the limit, but the translation serves to define the semantics of each part of the schema DTD in a concrete and unequivocal way.
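
To illustrate what such a translation might look like for the element type hierarchy, here is a minimal sketch which flattens the pet example of section 3 into old-style declarations. It is not the translation defined by the proposal. The dictionary representation, the function names, the choice of CDATA for attributes without a stated type, and the use of #IMPLIED for every attribute are all assumptions made for illustration, and the content models of pet, cat and dog are left out.

```python
# Hypothetical in-memory form of the pet schema: each type lists its parent,
# its own attributes, and the (type, occurrence) pairs of its content model.
TYPES = {
    "animalFriends": {"extends": None, "attrs": {}, "children": [("pet", "+")]},
    "pet": {"extends": None, "attrs": {"name": "CDATA", "owner": "CDATA"},
            "children": []},
    "cat": {"extends": "pet", "attrs": {"lives": "NMTOKEN"}, "children": []},
    "dog": {"extends": "pet", "attrs": {"breed": "CDATA"}, "children": []},
}

def lineage(types, name):
    """Return [name, parent, grandparent, ...] by following 'extends'."""
    chain = []
    while name is not None:
        chain.append(name)
        name = types[name]["extends"]
    return chain

def subtypes(types, name):
    """The type itself plus every type that has it as an ancestor."""
    return [t for t in types if name in lineage(types, t)]

def attlist(types, name):
    """Emit one ATTLIST containing inherited and locally declared attributes."""
    merged = {}
    for t in reversed(lineage(types, name)):  # ancestors first, then own
        merged.update(types[t]["attrs"])
    if not merged:
        return ""
    body = "\n".join(f"    {a} {typ} #IMPLIED" for a, typ in merged.items())
    return f"<!ATTLIST {name}\n{body}>"

def content_model(types, name):
    """Expand each reference to a type into an alternation over its subtypes."""
    parts = ["(" + " | ".join(subtypes(types, ref)) + ")" + occurs
             for ref, occurs in types[name]["children"]]
    return f"<!ELEMENT {name} ({', '.join(parts)})>"

print(content_model(TYPES, "animalFriends"))
print(attlist(TYPES, "dog"))
```

The point of the sketch is that inheritance disappears in the output: a reference to pet becomes an explicit alternation over pet, cat and dog, and each subtype's attribute list repeats the declarations of its ancestors, which is the bookkeeping that parameter entities otherwise have to carry.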