[This archive copy from the posting to XML-DEV mailing list, June 22, 1997]


Position Paper from Microsoft
20 June 1997

XML-Data

Authors:
Andrew Layman, Microsoft Corporation
Jean Paoli, Microsoft Corporation
Steve De Rose, Inso Corporation
Henry S. Thompson, University of Edinburgh
Acknowledgements:
We thank Paul Grosso (Arbortext), Sharon Adler (Inso Corporation), Anders Berglund (Inso Corporation), François Chahuneau (AIS/Berger-Levrault), and Edward Jung (Microsoft) for their help and contributions to this proposal.

Copyright (c) 1997 Microsoft Corp.


Abstract

This document provides the specification for exchanging structured and networked data on the Web. This specification uses XML, the Extensible Markup Language, for describing data as well as data about data. We expect this specification to be useful for a wide range of applications, such as describing database transfers, digital signatures, or remotely located web resources.

1. Introduction

The Internet holds the potential to integrate all information in a global network (with many private but integrated domains). The Internet promises access to information any time and, with wireless technology, anywhere. Today, however, the Internet is merely an access medium to text and pictures. To actualize the Internet's potential, we need to add intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access standard, and must set an information understanding standard, which means: a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity.

XML is an important step in this direction. It offers a standard syntax for textual structure of tagged data, based on extensive industry and theoretical experience. Its lexical format easily depicts a tree structure. A tree is a natural format that is richer than a simple flat list, yet (compared to a generalized graph) also respectful of cognitive and data processing requirements for economy and simplicity.

Looking at this point in more detail, there are several ways of structuring data. One is a flat tagging system. In this system, sets of keywords are applied to data elements. This is a simple form of data structure, but it does not capture any relationships between the keywords.

A more advanced means of structuring information is a tree. A tree allows expression of subsumption, containment, or any other single (contextual) relationship such as "manages." Trees correspond to object-oriented class hierarchies, file system hierarchies, organizational hierarchies and so forth. Trees are relatively easy to understand and to construct. Trees are efficient to process, and there is a linear (e.g. textual) structure that a program can parse incrementally, and determine when it is finished. This makes trees particularly useful as a transmission format for asynchronous, distributed systems such as the Internet, and also for display purposes where the single relationship (usually visual containment) enables incremental display.

A still more elaborate structure is a directed graph. A graph allows expression of arbitrary binary relationships, that is, many relationships between two things. A graph can express subsumption, containment, and any number of other relationships simultaneously. It is therefore a superset of a tree. This makes graphs very expressive for real-world semantics, but it also makes them harder to understand, more difficult to construct, and less efficient to process than trees. There is no efficient linear (e.g. textual) structure of a graph that can be incrementally processed. Therefore, while they are particularly useful for representing (and instrumenting) the complete semantics of a system, they are typically not suitable for transmission, display, or immediate processing.

The tree structure has proved broadly implementable and easy to deploy, not just in theory but also widely in practice. Industrial implementations, in the SGML community and elsewhere, demonstrate its intrinsic quality and industrial strength, e.g. aircraft (ATA), automotive (J2008), banking (OFX), and semiconductors (Pinnacles PCIS).

This proposal shows how to add a single convention to XML so that graph arcs are easily added into a lexical tree structure, without requiring decomposition of tree format into a "lowest common denominator" nodes-and-arcs structure. (For a quick look at the difference, see the XML-Data versus MCF in XML comparison.)

XML-Data consists of a collection of related technologies. First, it unifies lexical trees with graph structures. Second, it builds on this to define a representation for schemata based on XML instance syntax. It offers a mechanism to organize element types into a hierarchy, and proposes a small set of basic types. Finally, it adds facilities for lexical typing and proposes a small collection of lexical types.

XML-Data can encode the content, semantics and schemata for a gamut of cases, from simple and prosaic to complex and sophisticated:

The resulting flexibility of a single homogeneous data representation system allows any reader to uniformly determine the structural semantics of a data element. Information can then be reused for new purposes and in novel contexts. For example, a record from a database of restaurants and a record from a client contact database might be reused in the context of an appointment, say in setting a lunch date with a client. The relationships between the restaurant and contact data do not reside in the schema data described by either database individually, but are extensions defined by the instance of the appointment.

This proposal, building on the earlier Web Collections in XML proposal, shows how to use a single syntax for a broad range of data, using that syntax for data and schemata, permitting the expressiveness of graph data when such power is required, but retaining the benefits of lexical trees.

2. Examples of XML-Data

Data

The following example shows a simple order from a bookstore for several books, a record, and a cup of coffee.

<ORDER>
  <SOLD-TO>
    <PERSON><LASTNAME>Layman</LASTNAME>
            <FIRSTNAME>Andrew</FIRSTNAME>
    </PERSON>
  </SOLD-TO>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD>
      <TITLE><COMPOSER>Tchaikovsky's</COMPOSER> First Piano Concerto</TITLE>
      <ARTIST>Janos</ARTIST>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>

XML-Data is flexible enough to encode heterogeneous structures, for example books, records and coffee all within one sales order. These different kinds of items do not need to all have the same internal parts. For example, books have titles, coffee generally doesn't. XML-Data allows values to be expressed as element content (for example the book titles shown) or with a value attribute (for example the author and artist elements). Properties of elements can be expressed as attributes (e.g. size and style of coffee) or as sub-elements (e.g. author, artist). XML-Data can appear in separate documents or within other documents (such as HTML pages).
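As a rough illustration of how software might consume such a heterogeneous order, the following Python sketch walks the ITEM elements without assuming that they share the same internal parts. It uses a shortened copy of the order and the standard xml.etree.ElementTree module; the helper name summarize_items is hypothetical and is not part of the proposal.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the bookstore order shown above.
ORDER_XML = """
<ORDER>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>
"""

def summarize_items(order):
    """Return (kind, price) for each ITEM.

    The kind is the tag of the first child that is not PRICE, so books,
    records and coffee are handled alike whatever their internal parts.
    """
    summary = []
    for item in order.findall("ITEM"):
        price = float(item.findtext("PRICE"))
        kinds = [child.tag for child in item if child.tag != "PRICE"]
        summary.append((kinds[0], price))
    return summary

order = ET.fromstring(ORDER_XML)
items = summarize_items(order)
for kind, price in items:
    print(kind, price)
```

The function never inspects the inside of BOOK or COFFEE, which is the point: the reader only needs the parts it understands.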

Data about Other Data

XML-Data is suitable for complex, self-contained data structures such as the book order, and also for information such as the Channel Definition Format, which describes remotely-located web resources, many of which are themselves data:

<CHANNEL>
  <ITEM HREF="http://www.zoosports.com/intro.htm" level="2" precache="NO">
    <A HREF="http://www.zoosports.com/page1.htm">This is a link to page 1.</A>
    <TITLE>Welcome to ZooSports!</TITLE>
    <ABSTRACT>ZooSports articles, news, and promotional offers</ABSTRACT>
  </ITEM>
  <SCHEDULE ENDDATE="1994-11-05">
    <INTERVALTIME DAY="1"/>
    <EARLIESTTIME HOUR="12"/>
    <LATESTTIME HOUR="18"/>
  </SCHEDULE>
</CHANNEL>

PICS-NG Labels

XML-Data can express PICS-NG Labels:

(This uses the Layman-Bray proposal for namespaces.)

<xml>
  <xml:schema>
    <namespaceDcl href="http://purl.org/Schemas" name="purl"/>
    <namespaceDcl href="http://www.foo.com" name="foo"/>
  </xml:schema>
  <xml:data>
    <purl:description1 href="http://purl.color.org/document.html">
      <title>Light and Dark: A study of color</title>
      <subject><LCSH>
          <for>Color and Color Palettes</for></LCSH></subject>
      <author>
        <foo:author>
          <name>John Smith</name>
          <affiliation>thedarkside</affiliation>
          <email>john@thedarkside</email>
        </foo:author>
        <foo:author>
          <name>Smith, Jane Q.</name>
          <affiliation>thelightregion</affiliation>
          <email>jane@thelightregion</email>
        </foo:author>
      </author>
    </purl:description1>
  </xml:data>
</xml>

Digital Signatures, Security & Authentication

Returning to the bookstore example, this is the same order with a digital signature added. The structured nature of XML-Data makes it easy to sign whole elements or parts of them.

<ORDER>
  <dsig:DSIG>
    <MANIFEST>80183589575795589189518915</MANIFEST>
    <SIG href="http://XYX/Joe@company.com"/>
  </dsig:DSIG>
  <SOLD-TO>
    <PERSON><LASTNAME>Layman</LASTNAME>
            <FIRSTNAME>Andrew</FIRSTNAME>
    </PERSON>
  </SOLD-TO>
  <SOLD-ON>19970317</SOLD-ON>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK>
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR>Dantzig, Tobias</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <BOOK>
      <TITLE>Introduction to Objectivist Epistemology</TITLE>
      <AUTHOR>Rand, Ayn</AUTHOR>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD>
      <TITLE><COMPOSER>Tchaikovsky's</COMPOSER> First Piano Concerto</TITLE>
      <ARTIST>Janos</ARTIST>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>

Database Information

While XML-Data can represent complex structures, it can also represent simple ones, for example a simple list of database records:

<BOOK-MASTER-LIST>
  <BOOK id="book1">
    <TITLE>Number, the Language of Science</TITLE>
    <AUTHOR>Dantzig, Tobias</AUTHOR>
  </BOOK>

  <BOOK id="book2">
    <TITLE>Introduction to Objectivist Epistemology</TITLE>
    <AUTHOR>Rand, Ayn</AUTHOR>
  </BOOK>

  <BOOK id="book3">
    <TITLE>I, The Jury</TITLE>
    <AUTHOR>Spillane, Mickey</AUTHOR>
  </BOOK>

  <BOOK id="book4">
    <TITLE>Half Magic</TITLE>
    <AUTHOR>Eager, Edward</AUTHOR>
  </BOOK>

  <BOOK id="book5">
    <TITLE>QED</TITLE>
    <AUTHOR>Feynman, Richard P.</AUTHOR>
  </BOOK>
</BOOK-MASTER-LIST>

Graph Structures

An XML-Data element may include links to resources outside the immediate tree. When it meets application needs, this href facility can be used to break up a single structure into multiple parts, with relations among them indicated by Universal Resource Identifier (URI) links. The references can be local or remote. In this example, they are inventory records from the database table we just looked at.

<ORDER id="order1">
    <dsig:DSIG>
      <MANIFEST>80183589575795589189518915</MANIFEST>
      <SIG href="http://XYX/Joe@company.com"/>
    </dsig:DSIG>
    <SOLD-TO>
      <PERSON><LASTNAME>Layman</LASTNAME>
              <FIRSTNAME>Andrew</FIRSTNAME>
      </PERSON>
    </SOLD-TO>
    <SOLD-ON>19970317</SOLD-ON>
    <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book1">
      <PRICE>5.95</PRICE>
    </ITEM>
    <ITEM href="http://bigbookstore.com/data/bookmaster?XML-XPTR=book2">
      <PRICE>12.95</PRICE>
    </ITEM>
    <ITEM href="http://bigbookstore.com/data/musicmaster?XML-XPTR=cd1">
      <PRICE>12.95</PRICE>
    </ITEM>
    <ITEM>
      <PRICE>1.50</PRICE>
      <COFFEE>
        <SIZE>small</SIZE>
        <STYLE>cafe macchiato</STYLE>
      </COFFEE>
    </ITEM>
</ORDER>

Notice that each of the ITEM elements establishes a relationship between the ORDER and a BOOK, and that the relationship itself has attributes, in this case the price at which the book was sold. Relations can have attributes, can contain elements and the process can be carried to any needed level of detail.

Discontiguous Information (propertyOf)

Information about an element can be contained in the element, but also can sit outside it. For example, the following applies a digital signature to a sales order without actually modifying the order:

<dsig:DSIG>
  <xml:propertyOf href="http://bigbookstore.com/data/orders?XML-XPTR=order1"/>
  <MANIFEST>80183589575795589189518915</MANIFEST>
  <SIG href="http://XYX/Joe@company.com"/>
</dsig:DSIG>
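One way a reader might gather such discontiguous information is sketched below in Python. It is only an illustration under simplifying assumptions: the namespace prefixes are dropped, the data element wrapper is hypothetical, the reference is a local "#id" rather than the remote URI with XML-XPTR shown above, and the function name external_properties is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified: prefixes dropped, both elements placed in one hypothetical wrapper.
DOC = ET.fromstring("""
<data>
  <ORDER id="order1"><SOLD-ON>19970317</SOLD-ON></ORDER>
  <DSIG>
    <propertyOf href="#order1"/>
    <MANIFEST>80183589575795589189518915</MANIFEST>
  </DSIG>
</data>
""")

def external_properties(root, target_id):
    """Return elements that declare themselves a property of target_id."""
    found = []
    for element in root.iter():
        ref = element.find("propertyOf")
        if ref is not None and ref.get("href") == "#" + target_id:
            found.append(element)
    return found

signatures = external_properties(DOC, "order1")
for signature in signatures:
    print(signature.tag, signature.findtext("MANIFEST"))
```

The ORDER element itself is left untouched; the association lives entirely in the element that carries propertyOf.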

Schema

Every data object, such as a purchase order, contains certain parts, such as sold-to, sold-on date, items, etc. We can write a formal description of what these parts are and which are allowed where. This is called a "schema" and is written using a form of XML-Data:

<xml:schema ID="BookOrderSchema">
  <!-- This schema is digitally signed. Schemas are a form of data,
       so they, too, can be signed. -->
  <dsig:DSIG>
    <MANIFEST>*(&#&$&@*$&%*&@*$&$*@</MANIFEST>
    <SIG href="http://XYX/Jane@company.com"/>
  </dsig:DSIG>

  <!-- Here are all the element types, their contents,
       attributes and relations. -->
  <elementType id="ORDER">
    <relation href="#SOLD-TO"/>
    <relation href="#SOLD-ON"/>
    <relation href="#ITEM" occurs="STAR"/>
  </elementType>
  <relationType id="SOLD-TO">
    <elt href="#PERSON"/>
  </relationType>
  <relationType id="SOLD-ON">  
    <pcdata/>
    <!-- Date is YYYYMMDD -->
    <attribute name="lextype" default="DATE.ISO8061" presence="fixed"/>
  </relationType>
  <elementType id="PERSON">
    <relation href="#LASTNAME"/>
    <relation href="#FIRSTNAME"/>
  </elementType>
  <elementType id="LASTNAME">
    <pcdata/>
  </elementType>
  <elementType id="FIRSTNAME">
    <pcdata/>
  </elementType>
  <relationType id="PRICE">
    <pcdata/>
  </relationType>
  <relationType id="ITEM">
    <any/>
    <relation href="#PRICE"/>
    <range href="#BOOK"/>
    <range href="#RECORD"/>
    <range href="#COFFEE"/>
  </relationType>
  <elementType id="BOOK">
    <relation href="#TITLE"/>
    <relation href="#AUTHOR"/>
  </elementType>
  <elementType id="RECORD">
    <relation href="#TITLE"/>
    <relation href="#ARTIST"/>
  </elementType>
  <relationType id="SIZE">
    <pcdata/>
  </relationType>
  <relationType id="STYLE">
    <pcdata/>
  </relationType>
  <elementType id="COFFEE">
    <relation href="#SIZE"/>
    <relation href="#STYLE"/>
  </elementType>
  <elementType id="TITLE">
    <mixed><elt href="#COMPOSER"/></mixed>
  </elementType>
  <relationType id="AUTHOR">
    <pcdata/>
  </relationType>
  <relationType id="ARTIST">
    <pcdata/>
  </relationType>
  <relationType id="COMPOSER">
    <pcdata/>
  </relationType>
</xml:schema>

Type Extension

Sometimes some elements are variants of others, in which case we can organize the element types into a genus-species hierarchy using the extends attribute:

<xml:schema ID="ArtSchema">
  <elementType id="artistic-work">
    <relation href="#TITLE"/>
  </elementType>
  <elementType id="BOOK" extends="#artistic-work">
    <relation href="#AUTHOR"/>
  </elementType>
  <elementType id="RECORD" extends="#artistic-work">
    <relation href="#ARTIST"/>
    <relation href="#COMPOSER" occurs="OPTIONAL"/>
  </elementType>
  <relationType id="AUTHOR">
    <pcdata/>
  </relationType>
  <relationType id="COMPOSER" extends="#AUTHOR"/>
  <relationType id="ARTIST">
    <pcdata/>
  </relationType>
</xml:schema>

Here we see that books and records are both types of artistic work, and that a composer is a type of author.

Schema Extension

We can also use this ability to customize a schema that has useful features but is too general. In this example, we show a general schema for orders, then another one that is customized for our bookstore:

<xml:schema ID="GenericOrderSchema">
  <elementType id="ORDER">
    <relation href="#SOLD-TO"/>
    <relation href="#SOLD-ON"/>
  </elementType>
  <relationType id="SOLD-TO">
    <elt href="#PERSON"/>
  </relationType>
  <elementType id="PERSON">
    <relation href="#LASTNAME"/>
    <relation href="#FIRSTNAME"/>
  </elementType>
  <relationType id="LASTNAME">
    <pcdata/>
  </relationType>
  <relationType id="FIRSTNAME">
    <pcdata/>
  </relationType>
</xml:schema>  


<xml:schema id="BookOrderSchema">
  <elementType id="ORDER" extends="http://generic.com/genericOrder?XML-XPTR=ID(ORDER)">
    <relation href="#ITEM" occurs="STAR"/>
  </elementType>

  <relationType id="ITEM">
    <any/>
    <relation href="http://generic.com/genericOrder?XML-XPTR=ID(ORDER)"/>
    <range href="http://art.com/schemata?XML-XPTR=ID(BOOK)"/>
    <range href="http://art.com/schemata?XML-XPTR=ID(RECORD)"/>
    <range href="#COFFEE"/>
  </relationType>

  <relationType id="SIZE">
    <pcdata/>
  </relationType>

  <relationType id="STYLE">
    <pcdata/>
  </relationType>

  <elementType id="COFFEE">
    <relation href="#SIZE"/>
    <relation href="#STYLE"/>
  </elementType>
</xml:schema>

3. XML-Data Schema

The XML-Data schema language defines element types, attributes, relations, and which of these can be used in which combinations with others. It also provides features for organizing element types into a genus-species hierarchy, a basic set of element types, and a small set of lexical types. The schema contains other features from the XML Document Type Definition (DTD) language, such as entity and notation declarations. The XML-Data schema language covers all the features of XML DTDs: it can express the same structural information and constraints, so an XML DTD can be mechanically converted to an XML-Data schema.

Schemata are composed principally of declarations for:

Comments can be interspersed as usual in XML, and there is provision for using references to external schemata or schema fragments.

3.1. The schema document element type: schema

All schema elements are contained within a schema element, like this:

<?XML version='1.0' rmd='all'?>
<!doctype schema SYSTEM "http://www.w3c.org/pub/sotr/schema.dtd">
<xml:schema id='ExampleSchema'>
  <!-- schema goes here. -->
</xml:schema>

3.2. The element type declaration element type: elementType

Key terms used here: element, elementType, empty, any, mixed, pcdata, content model.

The heart of an XML-Data schema is the elementType declaration which defines a class of elements, gives them attributes, establishes a grammar of which other element types and character data are allowed in their contents and defines their allowable relationships to elements of other classes. (The allowable content, including relations, is called "content model.")

<elementType id="example">  <!-- element example (p*) -->
    <elt href="#p" occurs="STAR"/>
</elementType>
<elementType id="p">       <!-- element p ((#PCDATA|p)*) -->
    <mixed><elt href="#p"/></mixed>
</elementType>

The name attribute is optional if id is present, in which case the id is used as the name.

Within an elementType, elt indicates that instances may contain elements of a single element type. The occurs attribute of elt specifies whether this content is optional and gives its cardinality.

Empty and any content are expressed using predefined elements empty and any. (Empty may be omitted. Any signals that any mixture of elements and parsed character data is legal.) Parsed character data content is similarly expressed with a pcdata item. Mixed content (a mixture of parsed character data and one or more element types), is identified by a mixed element, whose content identifies the element types allowed in addition to parsed character data (see below).

<elementType id="ARTIST">
  <pcdata/>
</elementType>

More complex content models are created using group:

<elementType id="animalFriends" >
  <group groupType="OR" occurs="STAR">
    <group groupType="OR" occurs="PLUS">
      <elt href="#cat"/>
      <elt href="#dog"/>
    </group>
    <elt href="#bird"/>
    <elt href="#rabbit"/>
    <elt href="#pig"/>
    <elt href="#fish"/>
  </group>
</elementType>
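To make the correspondence with DTD content models concrete, here is a small Python sketch that renders the group structure above as a DTD-style expression and uses it to check a sequence of child element names. It rests on assumptions the text does not settle: an absent occurs attribute is taken to mean "exactly once", only groupType="OR" is handled, and the function names model and accepts are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

SCHEMA = ET.fromstring("""
<elementType id="animalFriends">
  <group groupType="OR" occurs="STAR">
    <group groupType="OR" occurs="PLUS">
      <elt href="#cat"/>
      <elt href="#dog"/>
    </group>
    <elt href="#bird"/>
    <elt href="#rabbit"/>
    <elt href="#pig"/>
    <elt href="#fish"/>
  </group>
</elementType>
""")

# Assumption: an absent occurs attribute means "exactly once".
OCCURS = {None: "", "OPTIONAL": "?", "STAR": "*", "PLUS": "+"}

def model(node):
    """Render an elt or OR-group as a DTD-style content model string."""
    suffix = OCCURS[node.get("occurs")]
    if node.tag == "elt":
        return node.get("href").lstrip("#") + suffix
    if node.tag == "group" and node.get("groupType") == "OR":
        return "(" + "|".join(model(child) for child in node) + ")" + suffix
    raise ValueError("unsupported content-model node: " + node.tag)

def accepts(expression, child_tags):
    """Check a list of child element names against the expression."""
    # Turn each name into a token followed by a space, then match the
    # space-terminated sequence of child names against the whole pattern.
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group()) + " )",
                     expression)
    sequence = "".join(tag + " " for tag in child_tags)
    return re.fullmatch(pattern, sequence) is not None

expression = model(SCHEMA[0])
print(expression)
print(accepts(expression, ["cat", "dog", "bird"]))
```

A regular expression is enough here because the content model is a regular language over element names; a real validator would also need to handle sequence groups, mixed content and relations.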

3.3 Relations

Key terms used here: relationType, relation, XML-Link locator, href.

Relation element types express a relationship between one element (usually the relation's parent) and either another element or an atomic value (such as a simple number, string or date). Relations use the XML-Link locator without implying navigation. The target of a relation is the element referenced by the href attribute if one is present, else the element contents. This single convention unifies graphs and trees.
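The target rule can be sketched in a few lines of Python. This is an illustration only: the instance data is hypothetical, only local "#id" references are resolved (remote URIs and the XML-XPTR syntax are out of scope), and the function name relation_targets is not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance data: one relation points elsewhere (graph arc),
# the other holds its target as content (tree containment).
DATA = ET.fromstring("""
<pets>
  <cat id="tom" name="Tom"/>
  <dog name="Spike"><chases href="#tom"/></dog>
  <dog name="Rex"><chases><cat name="Felix"/></chases></dog>
</pets>
""")

def relation_targets(relation, root):
    """Return the target elements of one relation element."""
    href = relation.get("href")
    if href is None:
        # No href: the element contents are the target.
        return list(relation)
    if not href.startswith("#"):
        raise ValueError("only local '#id' references are handled in this sketch")
    wanted = href[1:]
    return [el for el in root.iter() if el.get("id") == wanted]

for dog in DATA.findall("dog"):
    for target in relation_targets(dog.find("chases"), DATA):
        print(dog.get("name"), "chases", target.get("name"))
```

Code that consumes the relation does not need to know which form was used, which is what lets one convention serve both trees and graphs.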

Including a relation in an elementType makes it an implicit part of that element's content model, with the default for occurs being OPTIONAL. Relations must occur (in a valid document instance) after any other content. RelationTypes are elements, and the full content model is as if there were a sequential group containing first the explicitly provided content model, then a starred OR group with all the relations as content.

Two element types are used in the schema to effect a relation: the relationType is a specialized kind of elementType, while relation has the same function as elt (but validates that it refers to a relationType).

If a default attribute is specified for a relation, it becomes the default of the value attribute of the relation element. The range element, if present, declares a restriction on the valid target of a relation. Each range element references one elementType; an element of any of the referenced types is a valid target.

 <relationType id="favoriteFood"><mixed/></relationType>
 <relationType id="chases"><any/></relationType>

 <elementType id="dog" >
   <any/>
   <attribute name="name"/>
   <relation href="#favoriteFood"/>
   <relation href="#chases"/>
 </elementType>

3.4 Attributes

Key terms used here: attribute, values, default.

After the content model, attribute declarations may occur, which are divided into attributes with enumerated or notation values, and all other kinds.

<elementType id="p1">       <!-- element p1 ((#PCDATA|p1)*) -->
    <mixed><elt href="#p"/></mixed>
    <!-- attlist p id ID=#IMPLIED
                   exm (a|b|c) 'c'
                   x CDATA FIXED 'y' -->
    <attribute name='id' type='ID'/>
    <attribute name='exm' type='ENUMERATION' values='a b c' default='c'/>
    <attribute name='x' defType='FIXED' default='y'/>
</elementType>

An attribute may be given a default value. Whether it is required or optional is signaled by presence. (Presence ordinarily defaults to IMPLIED, but if it is omitted and there is an explicit default, presence is set to SPECIFIED.)

Attributes with enumerated (and notation) values permit a values attribute, a space-separated list of legal values. The values attribute is required when the type is ENUMERATION or NOTATION; otherwise it is forbidden. In these cases, if a default is specified it must be one of the specified values.
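The three rules for the values attribute can be checked mechanically. The following Python sketch is one possible reading of them; the function name check_declaration and the wording of the messages are hypothetical.

```python
import xml.etree.ElementTree as ET

def check_declaration(decl):
    """Return a list of problems found in one attribute declaration."""
    problems = []
    enumerated = decl.get("type") in ("ENUMERATION", "NOTATION")
    values = decl.get("values")
    default = decl.get("default")
    if enumerated and values is None:
        problems.append("values is required")
    if not enumerated and values is not None:
        problems.append("values is forbidden")
    if enumerated and values is not None and default is not None:
        # values is a space-separated list of legal values.
        if default not in values.split():
            problems.append("default is not one of values")
    return problems

good = ET.fromstring("<attribute name='exm' type='ENUMERATION' values='a b c' default='c'/>")
bad = ET.fromstring("<attribute name='exm' type='ENUMERATION' values='a b c' default='d'/>")
print(check_declaration(good))
print(check_declaration(bad))
```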

Similar to the facility of multiple ATTLISTs, we sometimes need to have attributesDcls declared separately from the elementType they refer to. We can do this with the propertyOf element, discussed later.

3.5 The internal and external entity declaration element type: intEntityDcl and extEntityDcl

Key terms used here: entity, internal entity, external entity, notation.

This and the next two declarations cover entities in general. Entities are a powerful shorthand mechanism, similar to macros in a programming language.

<intEntityDcl name="LTG">
    <entityDef>Language Technology Group</entityDef>
</intEntityDcl>
<extEntityDcl name="dilbert">
    <notation href="#gif"/>
    <systemId href="http://www.ltg.ed.ac.uk/~ht/dilb.gif"/>
</extEntityDcl>

Here as elsewhere, following XML, systemId must be a URL, absolute or relative, and publicId, if present, must be a Public Identifier as defined in ISO/IEC 9070:1991, Information technology -- SGML support facilities -- Registration procedures for public text owner identifiers. If a notation is given, it must be declared (see below) and the entity will be treated as binary, i.e., not substituted directly in place of references.

<notationDcl name="gif">
    <systemId href='http://who.knows.where/'/>
</notationDcl>

3.6. The external declarations element type: extDcls

Key terms used here: external entity with declarations.

Although we allow an external entity with declarations to be included, we recommend a different declaration for schema modularization. The extDcls declaration gives a clean mechanism for importing (fragments of) other schemata. It replaces the common SGML idiom of declaring an external parameter entity and then immediately referring to it, and has the same import, namely, that the text referred to by the combination of systemId and publicId is included in the schema in place of the extDcls element, and that replacement text is then subject to the same validity constraints and interpretation as the rest of the schema.

3.7. Type Extension

Key terms used here: type (class), typeOf, extension (inheritance, subclassing), implements, extends, typeOf (genus).

Schemata of all kinds can benefit from a subtyping mechanism: indicating that one class of object is a specialization of another, more general class. For example, cat and dog both have the type pet as their more general category. To make more effective use of such classes, we introduce one new schema attribute, extends, which can be used to declare explicitly that an element type is a subclass of another:

<xml:schema>
  <elementType id="animalFriends" >
    <elt href="#pet" occurs="PLUS" />
  </elementType>

  <elementType id="pet" >
    <any/>
  </elementType>

  <elementType id="cat" extends="#pet"/>

  <elementType id="dog"  extends="#pet"/>

</xml:schema>

This schema says that the animalFriends element class can contain one or more elements from the pet class, such as a cat or a dog. It also says that each cat and dog instance is a pet (that is, any cat is semantically a pet, and any valid cat is also a valid pet). So the following data is now valid under this schema:

<animalFriends>
  <cat/>
  <dog/>
  <cat/>
</animalFriends>
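A validator could decide whether an element may stand where a pet is expected by following the extends chain. The Python sketch below assumes local "#id" references and a plain schema root element; the rock type is a hypothetical addition used only to show a failing case, and the names supertypes and is_a are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

SCHEMA = ET.fromstring("""
<schema>
  <elementType id="animalFriends"><elt href="#pet" occurs="PLUS"/></elementType>
  <elementType id="pet"><any/></elementType>
  <elementType id="cat" extends="#pet"/>
  <elementType id="dog" extends="#pet"/>
  <elementType id="rock"/>
</schema>
""")

def supertypes(schema, type_id):
    """Yield type_id followed by every type reachable through extends."""
    types = {t.get("id"): t for t in schema.iter("elementType")}
    seen = set()
    while type_id is not None and type_id not in seen:
        seen.add(type_id)
        yield type_id
        decl = types.get(type_id)
        ext = decl.get("extends") if decl is not None else None
        type_id = ext.lstrip("#") if ext else None

def is_a(schema, type_id, ancestor):
    """True if type_id is ancestor or a (transitive) subtype of it."""
    return ancestor in supertypes(schema, type_id)

instance = ET.fromstring("<animalFriends><cat/><dog/><cat/></animalFriends>")
all_pets = all(is_a(SCHEMA, child.tag, "pet") for child in instance)
print(all_pets)
```

The seen set guards against a schema whose extends references form a cycle, which the text does not explicitly rule out.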

Type Extension

It is frequently necessary to add new attributes to a subclass. This requires no extra machinery, because XML already permits multiple attribute list declarations, which cumulatively add attributes to element types. So each subclass may easily add any new attributes desired, as shown here:

<elementType id="dog" extends="#pet">
  <attribute name="age"/>
</elementType>

If the super type has a content model, attributes, etc., these are inherited; that is, they are also declared implicitly for the derived class. In the following example, we give an owner attribute to pet. It is inherited, so both cat and dog now also have an owner attribute.

<xml:schema>
  <elementType id="animalFriends" >
    <elt href="#pet" occurs="PLUS" />
  </elementType>

  <elementType id="pet">
    <any/>
    <attribute id='name'/>
    <attribute id='owner'/>
  </elementType>

  <elementType id="cat" extends="#pet">
    <elt href='#kittens'/>
    <attribute id='lives' type='NMTOKEN'/>
  </elementType>

  <elementType id="dog" extends="#pet">
    <elt href='#puppies'/>
    <attribute id='breed'/>
  </elementType>
</xml:schema>

This schema says that the animalFriends element class can contain one or more pet elements. Because cat and dog are subtypes of pet, they can occur as well. So the following instance fragment is now valid under this schema:

<animalFriends>
  <cat name="Fluffy" lives='9'/>
  <pet name="Diego"/>
  <dog name="Gromit" owner='Wallace' breed='mutt'/>
</animalFriends>
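The effective attribute list of a derived type can be computed by walking from the type up to its root and then collecting declarations from the most general type downward. The Python sketch below assumes local "#id" references, an acyclic extends chain, and a plain schema root element; it accepts either name or id on attribute declarations, and the function name declared_attributes is hypothetical.

```python
import xml.etree.ElementTree as ET

SCHEMA = ET.fromstring("""
<schema>
  <elementType id="pet">
    <any/>
    <attribute id="name"/>
    <attribute id="owner"/>
  </elementType>
  <elementType id="cat" extends="#pet">
    <attribute id="lives" type="NMTOKEN"/>
  </elementType>
  <elementType id="dog" extends="#pet">
    <attribute id="breed"/>
  </elementType>
</schema>
""")

def declared_attributes(schema, type_id):
    """Return attribute names for type_id, inherited ones first."""
    types = {t.get("id"): t for t in schema.iter("elementType")}
    chain = []
    while type_id is not None:          # assumes the extends chain is acyclic
        decl = types[type_id]
        chain.append(decl)
        ext = decl.get("extends")
        type_id = ext.lstrip("#") if ext else None
    names = []
    for decl in reversed(chain):        # most general type first
        for attr in decl.findall("attribute"):
            name = attr.get("name") or attr.get("id")
            if name not in names:
                names.append(name)
    return names

print(declared_attributes(SCHEMA, "cat"))
print(declared_attributes(SCHEMA, "dog"))
```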

Additional relations can also be added, but only if the content model of the super type consists of a single list of optional, repeatable element types.

When defining a derived element class, one can also override existing attributes and relations. The following example adds a height relation and overrides the favoriteFood relation, giving it a default value of "Fish." (We also do something fancy here. Making this overridden element itself have its super type favoriteFood ensures that the derived element is in all other respects identical.)

<relationType id="height">
  <any/>
</relationType>

<relationType id="favoriteCatFood" extends="#favoriteFood"/>

<elementType id="cat" extends="#pet">
  <relation href="#height"/>
  <relation href="#favoriteCatFood" default="Fish"/>
</elementType>

Schema Extension

We can also use subtyping to extend an existing schema without editing it. Suppose that we cannot edit the schema defining pet, cat or dog, but want to use elements with those names and semantics in our document. The following adds the "eyeColor" property to cat.

<relationType id="eyeColor" extends="http://whereever.org/#eyeColor">
    <pcdata/>
</relationType>

<elementType id="cat" extends="http://whereever.org/#cat">
  <relation href="#eyeColor"/>
</elementType>

The rules for allowable subtyping must enforce certain constraints. In principle, a subtype can have additional relations and attributes (but never fewer), provided this is consistent with the super type's content model, and can add restrictions (but never relax them). In practice, this principle leads to rules such as: a default value can be added if there is none, changed, or converted to FIXED if DEFAULT.

Implements

Subtyping as we have described it here is actually a combination of two effects: First, we assert that an element of one type is also of another (as in a cat is a pet).

Second, we achieve economies and maintainability in the declarations to make sure that the first is true. That is, the derived element class is automatically provided with all the properties of the super type. Sometimes it is valuable to have the first effect without the second. (This is equivalent to the Java implements facility.) We indicate this by using the implements element, as in:

<relationType id="favoriteFood" >
  <mixed/>
</relationType>

<relationType id="weight" >
  <mixed/>
</relationType>

<elementType id="cat" >
  <implements href="http://whereever.org/#pet" />
  <attribute name="name"/>
  <relation href="#favoriteFood" />
  <relation href="#weight" />
</elementType>

This has no effect on the attributes or relations of instances of cat, but asserts in the schema that every cat is also a pet (that is, any cat is semantically a pet, and any valid cat is also a valid pet).

Relation of Type Extension to Parameter Entities

Sophisticated DTDs often make complex use of parameter entities in an attempt to consolidate common structures in one, reusable place. Such parameter entities often represent implicit classes.

The need is real, but the approach often leads to obscurity, and reduced maintainability. Further, expansion of entities loses all connection with their source: once expanded, the fact that some set of element types was a co-declared set, re-used in multiple places, is lost.

3.8 Lexical Data Types

Information such as dates and numbers is often expressed in a format that requires some further parsing. For example, the same date can be written "October 22, 1954" or "19541022". (And from what I've seen, about 300 other ways.) The lextype attribute discriminates formats. Appearing on instance elements, it describes the format of the remainder of the element. The value of the lextype attribute is always by reference to a URI identifying the parsing rules. XML-Data should define a small number of these. We propose NUMBER, INTEGER, REAL and DATE.ISO8061.

<birthday lextype="DATE.ISO8061">19541022</birthday>

These are declared in the schema as follows:

<relationType id="birthday">
  <attribute name="lextype" default="DATE.ISO8061" presence="fixed"/>
</relationType>
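
As a rough illustration of how a consumer might use lextype to choose parsing rules, the following Python sketch dispatches on the lextype name. The table LEXTYPE_PARSERS and the functions parse_value and parse_iso_basic_date are hypothetical names introduced here. The sketch covers INTEGER, REAL and DATE.ISO8061 only, and it assumes that DATE.ISO8061 values use the eight-digit year-month-day form shown in the example.

```python
from datetime import date

def parse_iso_basic_date(text):
    """Parse an eight-digit YYYYMMDD string such as "19541022"."""
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"not a YYYYMMDD date: {text!r}")
    return date(int(text[0:4]), int(text[4:6]), int(text[6:8]))

# Hypothetical table mapping lextype names to parsing functions.
LEXTYPE_PARSERS = {
    "INTEGER": int,
    "REAL": float,
    "DATE.ISO8061": parse_iso_basic_date,
}

def parse_value(lextype, text):
    """Convert element content to a typed value according to its lextype."""
    try:
        parser = LEXTYPE_PARSERS[lextype]
    except KeyError:
        raise ValueError(f"unknown lextype: {lextype!r}")
    return parser(text.strip())

birthday = parse_value("DATE.ISO8061", "19541022")
print(birthday.isoformat())  # 1954-10-22
```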

When giving the lexical type of an attribute in the schema, lextypeIs is used, as in:

<attribute name="price" presence="required" lextypeIs="NUMBER"/>

Some patterns will indicate that several properties or attributes should be used in combination to arrive at a value. For example, a custom pattern could indicate a date expressed as the following:

<relationType id="birthday">
  <attribute name="lextype" default="DATE.ATTR-YMD" presence="specified"/>
</relationType>
...
<birthday year="1954" month="10" day="22"/>
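
A small Python sketch of the attribute-combination pattern: the value is assembled from the year, month and day attributes rather than from the element content. The function name date_from_ymd_attributes is an assumption for illustration, and the sketch takes for granted that the schema has already told the caller that the DATE.ATTR-YMD pattern applies.

```python
import xml.etree.ElementTree as ET
from datetime import date

def date_from_ymd_attributes(element):
    """Combine the year, month and day attributes of an element into one date."""
    return date(int(element.get("year")),
                int(element.get("month")),
                int(element.get("day")))

element = ET.fromstring('<birthday year="1954" month="10" day="22"/>')
print(date_from_ymd_attributes(element))  # 1954-10-22
```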

3.9 Basic Semantic Data Types

We need to define here a small number of basic types and their hierarchy, corresponding to simple data types such as Number and Date. (Dates are a subtype of numbers.)

We also need to define the expression of each of the basic Java and SQL data types in terms of these basic ones, plus additional properties giving units, precision, min, max, default pattern, and other properties. For example, an INTEGER typically is a number with certain min and max property values. Note that units should be an element type with possible structure, so that things like "miles/hours" or "feet/(sec*sec)" can be represented and used for automatic conversions.
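
To make the point about structured units concrete, here is one possible model, sketched in Python: a unit is a scale factor together with base dimensions raised to integer powers, so that compound units such as miles per hour can be built by division and converted automatically. The Unit class, the convert function and the choice of metres and seconds as base units are assumptions for illustration only; this proposal does not yet define a units vocabulary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    """A unit as a scale factor times base dimensions raised to integer powers."""
    factor: float      # size of one of this unit, expressed in base units
    dims: tuple = ()   # sorted (dimension, exponent) pairs

    def _combine(self, other, sign):
        powers = dict(self.dims)
        for name, exp in other.dims:
            powers[name] = powers.get(name, 0) + sign * exp
        dims = tuple(sorted((n, e) for n, e in powers.items() if e != 0))
        if sign > 0:
            factor = self.factor * other.factor
        else:
            factor = self.factor / other.factor
        return Unit(factor, dims)

    def __mul__(self, other):
        return self._combine(other, +1)

    def __truediv__(self, other):
        return self._combine(other, -1)

def convert(value, source, target):
    """Convert a value between two units that share the same dimensions."""
    if source.dims != target.dims:
        raise ValueError("incompatible units")
    return value * source.factor / target.factor

# Base units assumed here: metre for length, second for time.
FOOT = Unit(0.3048, (("length", 1),))
MILE = Unit(1609.344, (("length", 1),))
SECOND = Unit(1.0, (("time", 1),))
HOUR = Unit(3600.0, (("time", 1),))

miles_per_hour = MILE / HOUR
feet_per_second = FOOT / SECOND
print(round(convert(60, miles_per_hour, feet_per_second), 6))  # 88.0
```

Because the dimensions are carried with the unit, an attempt to convert "miles/hours" into "feet/(sec*sec)" is rejected instead of silently producing a number.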

4. Standard Vocabulary

We expect standard libraries of vocabulary to be developed to capture common semantics used in vertical applications and particularly in industry and application domains. Dublin Core and CDF are two examples of such standard libraries.

5. Relations to other proposed standards

The W3C site at http://www.w3.org/PICS/Member/NG contains links to several related papers, including Ora Lassila's PICS-NG document, Renato Iannella's small PICS extension proposal, CDF, MCF in XML, and the Web Collections using XML proposal. Specific notes on some of these follow:

5.1 XML-LINK

All relations use href in a manner consistent with the XML-LINK working draft dated April 6, 1997 (the most recent as of the time of this writing). XML-Links are a type of relation (with extra attributes, elements, and semantics indicating traversal).

5.2 PICS-NG

PICS-NG Metadata Model and Label Syntax describes a set of requirements for structured data to be used on the Internet. XML-Data is an application of XML concepts to those requirements.

5.3 CDF

The Channel Definition Format (CDF) is a natural application of XML-Data and is fully compatible with the syntax and the ideas presented in this document. Its format is a validatable grammar given a proper schema. The existing use of href in CDF is consistent with XML-LINK and XML-Data usage. CDF defines a number of basic element types that would be appropriate for a standard library.

5.4 MCF in XML

MCF in XML has two principal components: The ability to represent a "directed labeled graph" and also a set of predefined element types. The first of these is effected by a convention on use of the href attribute (the same convention used in XML-Data relations, with the same effect). Of the second, some element types are genuinely necessary to represent schemata and a type system (these are also present in XML-Data) while others would be appropriate for a standard library.

XML-Data has a number of features not in MCF:

This chart tabulates the MCF "bootstrap" element types and describes their equivalents in XML-Data.

Category
    "elementType" in XML-Data.
typeOf
    "typeOf" relation in XML-Data. Also, "extends" and "implements" in XML-Data assert the relationship in the schema.
Unit
    "href" in XML-Data.
domain
    "propertyOf" in XML-Data.
range
    "range" in XML-Data. This gives the allowed type of the target of a property.
superType
    This may correspond to "implements" in XML-Data. However, the MCF document is not clear on this point.
Property
    This corresponds to the abstract concept of a link class expressed in schemata by relation and relationType.
FunctionalProperty
    This appears to be a relation with occurs = OPTIONAL or REQUIRED (that is, occurs at most once).
mutuallyDisjoint
    This is a relationship asserted among the members of an enumeration. XML-Data does not contain a predefined propertyType for this. It could be added easily if this is useful.
parent
    A generic property, whose meaning appears to be contextual. XML-Data does not contain a predefined elementType for this. It is unneeded because parentage is expressed by containment, while when out-of-line, specific meanings are conveyed by more precise relationship types such as propertyOf.
name
    "name" in XML-Data. However, note that like parent, the interpretation of name in MCF seems to be contextual.
description
    XML-Data does not contain a predefined elementType for this. We think that this belongs to a standard library and not in this specification.
Sequence
    This is a special arc type in MCF that expresses the same fact as lexical order in XML.
ord
    This is an MCF helper element type for Sequence.

The following comparative examples represent the same order for several items, first in MCF in XML and then in XML-Data. (All persons in this example are assumed to be not in the document, but elsewhere.) The id attribute is on all elements representing real-world objects, in both models. In the MCF model id also appears on elements needed artificially for reference.

MCF in XML:

<ORDER id="order1">
  <SOLD-TO unit="http:/people#person1"/>
  <SOLD-ON value="19970317"/>
  <ITEMS unit="sequence1"/>
</ORDER>

<BOOK id="book1">
  <TITLE value="Number, the Language of Science"/>
  <AUTHOR unit="http:/people#person2"/>
</BOOK>

<SEQUENCE id="sequence1">
  <ORD unit="book1">
    <PRICE value="5.95"/>
  </ORD>
  <ORD unit="cd1">
    <PRICE value="12.95"/>
  </ORD>
  <ORD unit="book2">
    <PRICE value="6.95"/>
  </ORD>
  <ORD unit="food1">
    <PRICE value="1.50"/>
  </ORD>
</SEQUENCE>

<COFFEE id="food1">
  <SIZE value="small"/>
  <STYLE value="cafe macchiato"/>
</COFFEE>

<RECORD id="cd1">
  <TITLE value="Rachmaninoff's Second Piano Concerto"/>
  <ARTIST unit="http:/people#person3"/>
</RECORD>

<BOOK id="book2">
  <TITLE value="The Evolution of Complexity"/>
  <AUTHOR unit="http:/people#person4"/>
</BOOK>

XML-Data:

<ORDER id="order1">
  <SOLD-TO href="http:/people#person1"/>
  <SOLD-ON value="19970317"/>
  <ITEM>
    <PRICE>5.95</PRICE>
    <BOOK id="book1">
      <TITLE>Number, the Language of Science</TITLE>
      <AUTHOR href="http:/people#person2"/>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>12.95</PRICE>
    <RECORD id="cd1">
      <TITLE>Rachmaninoff's Second Piano Concerto</TITLE>
      <ARTIST href="http:/people#person3"/>
    </RECORD>
  </ITEM>
  <ITEM>
    <PRICE>6.95</PRICE>
    <BOOK id="book2">
      <TITLE>The Evolution of Complexity</TITLE>
      <AUTHOR href="http:/people#person4"/>
    </BOOK>
  </ITEM>
  <ITEM>
    <PRICE>1.50</PRICE>
    <COFFEE>
      <SIZE>small</SIZE>
      <STYLE>cafe macchiato</STYLE>
    </COFFEE>
  </ITEM>
</ORDER>
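
The practical difference between the two representations can be sketched in Python. With the MCF form, a reader must resolve id references through the artificial SEQUENCE and ORD elements; with the XML-Data form, the same information is read directly from containment and lexical order. The listings below are abbreviated versions of the examples above. The enclosing doc element, the lowercase unit attribute on ORD, and the function names are assumptions made so that the sketch is self-contained.

```python
import xml.etree.ElementTree as ET

MCF = """
<doc>
  <ORDER id="order1"><ITEMS unit="sequence1"/></ORDER>
  <SEQUENCE id="sequence1">
    <ORD unit="book1"><PRICE value="5.95"/></ORD>
    <ORD unit="food1"><PRICE value="1.50"/></ORD>
  </SEQUENCE>
  <BOOK id="book1"/>
  <COFFEE id="food1"/>
</doc>
"""

XML_DATA = """
<ORDER id="order1">
  <ITEM><PRICE>5.95</PRICE><BOOK id="book1"/></ITEM>
  <ITEM><PRICE>1.50</PRICE><COFFEE/></ITEM>
</ORDER>
"""

def items_from_mcf(text):
    """List (item tag, price) pairs by following unit references to ids."""
    root = ET.fromstring(text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    order = by_id["order1"]
    sequence = by_id[order.find("ITEMS").get("unit")]
    result = []
    for ord_el in sequence.findall("ORD"):
        target = by_id[ord_el.get("unit")]
        result.append((target.tag, ord_el.find("PRICE").get("value")))
    return result

def items_from_xml_data(text):
    """List (item tag, price) pairs by reading contained elements in order."""
    order = ET.fromstring(text)
    result = []
    for item in order.findall("ITEM"):
        price = item.find("PRICE").text
        thing = next(child for child in item if child.tag != "PRICE")
        result.append((thing.tag, price))
    return result

print(items_from_mcf(MCF) == items_from_xml_data(XML_DATA))  # True
```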

 

6. Conclusion

Future applications of the Internet will focus on adding user value to information through semantic annotation. Semantics will permit information to be discovered, targeted, reused, and integrated. Not only does this make the content more usable, but it opens up opportunities for software developers to build components that exploit these semantics. Such components could include applications as prosaic as application or user logging, or as futuristic as user agents that assist in finding or organizing content, World-Wide Web "surf buddies" that accompany a user's browsing and add valuable or entertaining comments, or natural language query systems. Semantic annotation turns the Internet into a platform for programming powerful and valuable applications.

This proposal lays the foundation for how applications can annotate their information content. The proposal adds powerful new constructs for representing semantics, sufficiently advanced for use in artificial intelligence and natural language systems, yet retains the architecture and investment of existing XML and the efficiency of its representation.


Appendix A - The XML DTD for a schema


<!ENTITY % nodeattrs 'id ID #IMPLIED'  >
<!-- href is as per XML-LINK, but is not required unless there is
      no content -->

<!ENTITY % exattrs   'extends CDATA #IMPLIED'  >

<!ENTITY % linkattrs 'id ID #IMPLIED
                      href CDATA #IMPLIED' >

<!-- The shared content model of elementType, linkType and
relationType -->
<!-- Omitted element type same as "empty." -->
<!ENTITY % extendedmodel 'implements*,
                          (elt|group|empty|any|pcdata|mixed)?,
                          (relation|attribute)*'>

<!-- The top-level container -->
<!element schema         ((elementType|propertyOf|linkType|
                          relationType|extendType|augmentElementType|
                          intEntityDcl|extEntityDcl|
                          notationDcl|extDcls|c)*)>
<!attlist schema %nodeattrs;>

<!-- Element Type Declarations -->
<!element elementType   (%extendedmodel;)>
<!-- Either name or id must be present - - absent name defaults to id
-->
<!attlist elementType %nodeattrs;
                      %exattrs;
                name    CDATA      #IMPLIED>

<!-- Element types allowed in content model -->
<!-- Note this is just short for a model group with only one elt in
it -->
<!element elt           EMPTY>
<!-- Elements can have exponents as well as groups -->
<!-- The href is required -->
<!attlist elt   %linkattrs;
                occurs     (required|optional|star|plus) 'required'>

<!-- A group in a content model, sequential or disjunctive -->
<!element group         ((group|elt)+)>
<!attlist group         %nodeattrs;
                groupType (seq|or) 'seq'
                occurs  (required|optional|plus) 'required'>

<!element any           EMPTY>
<!element empty         EMPTY>
<!element pcdata        EMPTY>

<!-- mixed content is just a flat, non-empty list of elts -->
<!-- We don't need to say anything about #pcdata, it's implied -->
<!element mixed         (elt+)>
<!attlist mixed         %nodeattrs;> 

<!-- Attributes -->
<!-- default value must be present iff presence is specified or fixed
-->
<!-- presence defaults to specified if default is present, else
implied -->
<!-- name attribute is locally unique, defaults to id if absent
-->
<!element attribute  EMPTY>
<!attlist attribute  %linkattrs;
                name    CDATA #IMPLIED
                type    (id|idref|idrefs|entity|entities|nmtoken|nmtokens|
                         enumeration|notation|cdata) 'cdata'
                default CDATA #IMPLIED
                values NMTOKENS #IMPLIED
                presence (implied|specified|required|fixed) #IMPLIED 
                lextypeIs CDATA #IMPLIED>

<!-- Relations - - relationTypes are pointed to from relations,
            just as elementTypes are pointed to from elts -->
<!element relationType  (%extendedmodel;,
                         range*)>
<!attlist relationType  %nodeattrs;
                        %exattrs;
                        name CDATA #IMPLIED >

<!element range EMPTY >
<!attlist range %linkattrs; >

<!element relation  EMPTY>
<!attlist relation  %linkattrs;
                    default CDATA #IMPLIED
                    occurs (required|optional|star|plus) 'optional'>

<!-- For adding attributes to existing element types -->
<!element propertyOf    EMPTY>
<!attlist propertyOf    href CDATA #REQUIRED>

<!element augmentElementType ((relation|attribute)*)>
<!attlist augmentElementType %linkattrs;
                             %exattrs;>

<!-- Shorthand for simple XML-LINKs -->
<!element linkType (%extendedmodel;)>
<!attlist linkType %nodeattrs;
                   %exattrs;
                   name CDATA #IMPLIED
                   role CDATA #IMPLIED
                   title CDATA #IMPLIED
                   show (embed|replace|new) #IMPLIED
                   actuate (auto|user) #IMPLIED
                   behaviour CDATA #IMPLIED >

<!element implements EMPTY>
<!attlist implements href CDATA #REQUIRED>

<!-- Entity Declarations -->
<!-- Note as this is written only external entities
      can have structure without escaping it -->
<!-- Name defaults to id if absent -->
<!element intEntityDcl     (#PCDATA)>
<!attlist intEntityDcl %nodeattrs;
                name    CDATA #IMPLIED>

<!-- The entity will be treated as binary if a notation is present
-->
<!-- systemID and publicId (if present) must have the required syntax
-->
<!element extEntityDcl    (systemID, publicID?)>
<!attlist extEntityDcl %nodeattrs;
                name    CDATA #IMPLIED
                notation CDATA #IMPLIED>

<!-- Pointers for above -->
<!element systemID      EMPTY>
<!attlist systemID      %linkattrs;>
<!-- Must be empty if href is used -->
<!element publicID      (#PCDATA) >
<!attlist publicID      %linkattrs;>

<!-- Notation Declarations -->
<!-- systemID and publicId (if present) must have the required syntax
-->
<!element notationDcl        (systemID, publicID?)>
<!attlist notationDcl   %linkattrs;
                name    CDATA #IMPLIED>

<!-- External entity with declarations to be included -->
<!-- systemID and publicId (if present) must have the required syntax
-->
<!element extDcls       EMPTY>
<!attlist extDcls
                systemId CDATA #REQUIRED
                publicId CDATA #IMPLIED>

<!-- Namespace Declarations -->
<!-- systemID and publicId (if present) must have the required syntax
-->
<!element namespaceDcl  EMPTY>
<!attlist namespaceDcl  %linkattrs;
                name    CDATA #IMPLIED>
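
The occurs attribute declared above for elt and relation takes the values required, optional, star and plus. A minimal Python sketch of how a validator might turn those values into occurrence bounds follows. The reading of required as exactly once and optional as at most once follows the FunctionalProperty note in section 5.4; the reading of star and plus as the usual DTD * and + operators is an assumption, as are the names OCCURS_BOUNDS and count_allowed.

```python
# Hypothetical mapping from occurs values to (minimum, maximum) counts.
# None as a maximum means "no upper bound".
OCCURS_BOUNDS = {
    "required": (1, 1),
    "optional": (0, 1),
    "star": (0, None),
    "plus": (1, None),
}

def count_allowed(occurs, count):
    """True if an element or relation may appear count times under occurs."""
    low, high = OCCURS_BOUNDS[occurs]
    return count >= low and (high is None or count <= high)

print(count_allowed("optional", 2))  # False
print(count_allowed("plus", 3))      # True
```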