Figure 1 – Course, Student, Teacher and Address Example

Consider a database or other graph (described, for instance, by the UML diagram of Figure 1 or by another notation) that contains the following data:

Courses, Students, Teachers and Addresses
Courses, Students, and Teachers each have a 'name' property, whose datatype is ‘String.’
Students have an 'attends' relation to 0 or more Courses; Courses have the inverse relation, 'attendedBy' to 1 or more Students.
Teachers have a 'teaches' relation to 1 or more Courses; Courses have the inverse relation, 'taughtBy' to exactly one Teacher.
Students have a 'parentaddr' and a 'dormaddr' relation to 0 or 1 Address (for purposes of illustration we make the unrealistic assumption that the parent addresses are not shared).
An Address consists of street, city and state properties that all are of type ‘String.’

A serialized instance produced according to the canonical form would look like the following (see Appendix A for the DTD and the XML schema):


  <School>
      <Address id="Address-3" street="28 Campus Drive #103" city="Unicity" state="CA"/>
      <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
      <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/> 
      <Student id="Student-30006" name="Raphael" parentaddr="Address-1" 
                dormaddr="Address-3" attends="Course-19">
          <Address id="Address-1" street="950 Greenhill Rd" city="Mill Valley" state="CA"/>
      </Student>
      <Student id="Student-2567" name="Michael" dormaddr="Address-3" 
                attends="Course-19 Course-253" />
      <Student id="Student-31415" name="Sandra" parentaddr="Address-4" attends="Course-253">
          <Address id="Address-4" street="14 16 Street" city="San Raphael" state="CA"/>
      </Student>
      <Teacher id="Teacher-83" name="Thorsten"/>
  </School>

In the example, the student with the name Raphael has both a parent and a dorm address and attends one course. The student with name Michael does not have a parent address and shares the dorm address with Raphael (i.e., they are roommates), whereas the Student with name Sandra only has a parent address. Michael attends two courses, Sandra attends one.
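The reading of the example given above can be checked mechanically. The following is a minimal sketch, not part of the original proposal: it assumes Python's standard `xml.etree.ElementTree` module, an abridged copy of the School instance, and two hypothetical helpers (`index_by_id`, `follow`) that resolve IDREF(S) values to elements.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the School instance above (Sandra omitted for brevity).
SCHOOL_XML = """
<School>
  <Address id="Address-3" street="28 Campus Drive #103" city="Unicity" state="CA"/>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/>
  <Student id="Student-30006" name="Raphael" parentaddr="Address-1"
           dormaddr="Address-3" attends="Course-19">
    <Address id="Address-1" street="950 Greenhill Rd" city="Mill Valley" state="CA"/>
  </Student>
  <Student id="Student-2567" name="Michael" dormaddr="Address-3"
           attends="Course-19 Course-253"/>
  <Teacher id="Teacher-83" name="Thorsten"/>
</School>
"""

def index_by_id(root):
    """Map every id attribute value to its element, at any nesting depth."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def follow(index, element, relation):
    """Resolve an IDREF or IDREFS attribute to the referenced elements."""
    return [index[ref] for ref in element.get(relation, "").split()]

root = ET.fromstring(SCHOOL_XML)
by_id = index_by_id(root)
michael = by_id["Student-2567"]

# Courses Michael attends, in the order listed in the IDREFS attribute.
courses = [c.get("name") for c in follow(by_id, michael, "attends")]

# Students who share Michael's dorm address (i.e., roommates, including himself).
roommates = sorted(s.get("name") for s in root.iter("Student")
                   if s.get("dormaddr") == michael.get("dormaddr"))
```

Because every entity carries an id, a single dictionary lookup suffices to follow any relation, whether the target is a top-level or a nested element.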

If we compare the rules and the canonical form they produce with our stated design goals, we can say that we have achieved our goals. The typical syntax generated is easily readable by humans. The rules certainly conform to XML 1.0 and are easy to teach. By using ID/IDREF and URIs, the serialization is able to express graphs and directed relationships (see below). The query model is simple, and it is certainly possible to embed data in the canonical form within web pages. As we will see in sections 3 and 4 below, it is easy to define rules on how to map other syntax families to and from the canonical form. The only problem that may arise is in the context of very long text properties, where a different encoding might have been a bit more readable. However, any encoding that maps properties to content model has the problems of added complexity (relations still need to be mapped to attributes, because elements cannot express references) and of XML 1.0 conformity (local property names become global).

2.3 Serializing Relationships

As rule 3 already states, relationships between entities are represented using ID/IDREF(S) attributes. In the following, we look a bit closer at each of the binary relationships normally found in data models and show with some examples how they are serialized.

1:1 Relationship

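A 1:1 relationship is expressed by an attribute of type IDREF. For example, if a student attends a course, the canonical XML gives the Student element an 'attends' attribute whose value is the id of the Course element.

If we want to be able to query the relationship in both directions, we add explicit attributes in both directions: 'attends' on the Student and 'attendedBy' on the Course.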

1:Many Relationship

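A 1:many relationship is expressed by an attribute of type IDREFS for the 1:many direction or an attribute of type IDREF for the many:1 direction. For example, if a teacher teaches two courses and courses are taught by only one teacher, the canonical solution in XML gives the Teacher element a 'teaches' attribute listing the ids of both courses, and gives each Course element a 'taughtBy' attribute holding the id of the teacher.

This works even if the taught events are not all courses but some heterogeneous mixture of types, since an IDREFS attribute may list the ids of elements of any type.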
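As a hedged illustration of the 1:many case, the sketch below builds a small instance and reads the relationship in both directions. The 'Seminar' element type, its name, and its id value are assumptions introduced only to show a heterogeneous mixture of target types; they do not appear in the School example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "Seminar" and its id/name are illustrative assumptions.
TEACHING_XML = """
<School>
  <Teacher id="Teacher-83" name="Thorsten" teaches="Course-19 Seminar-7"/>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Seminar id="Seminar-7" name="Thesis Writing" taughtBy="Teacher-83"/>
</School>
"""

root = ET.fromstring(TEACHING_XML)
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

teacher = by_id["Teacher-83"]

# 1:many direction: one IDREFS attribute; targets may be of different types.
taught = [(by_id[ref].tag, by_id[ref].get("name"))
          for ref in teacher.get("teaches").split()]

# many:1 direction: each taught element names exactly one teacher via an IDREF.
back_refs = {el.get("id"): el.get("taughtBy")
             for el in root if el.get("taughtBy")}
```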

Many:Many Relationship

Many:many relationships that have no additional properties attached to them are expressed by attributes of type IDREFS. If additional properties are attached, the relationship is serialized as an explicit join element that provides two IDREF attributes, one for each entity involved, and attributes for the attached properties.

If there are no properties (e.g. grades) attached to the student-course relationships 'attends' and 'attendedBy', we just show Student and Course elements with IDREFS attributes of those names.

However, if there are properties of the many:many relation, we need to show these in the serialization, using a join element that carries one IDREF attribute for each entity involved plus the attached properties.

Note that there is a pretty smooth evolution from a single-valued IDREF attribute all the way through a many-to-many relation. We never actually reify a relation (e.g. "enrollment") unless the relation has properties.
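A reified many:many relation might look like the instance embedded in the sketch below. The 'Enrollment' element, its 'student', 'course' and 'grade' attributes, and the grade values are all assumptions made for illustration; only the general shape (a join element with two IDREF attributes plus property attributes) follows from the rule stated above.

```python
import xml.etree.ElementTree as ET

# Hypothetical join element: "Enrollment", its attribute names, and the grades
# are illustrative assumptions, not taken from the School example.
ENROLLMENT_XML = """
<School>
  <Student id="Student-30006" name="Raphael"/>
  <Student id="Student-2567" name="Michael"/>
  <Course id="Course-19" name="Western Civilization"/>
  <Course id="Course-253" name="English Literature"/>
  <Enrollment id="Enrollment-1" student="Student-30006" course="Course-19" grade="A"/>
  <Enrollment id="Enrollment-2" student="Student-2567" course="Course-19" grade="B"/>
  <Enrollment id="Enrollment-3" student="Student-2567" course="Course-253" grade="A"/>
</School>
"""

root = ET.fromstring(ENROLLMENT_XML)
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

def grades_of(student_name):
    """Return {course name: grade} by following both IDREFs on each join element."""
    result = {}
    for enr in root.iter("Enrollment"):
        if by_id[enr.get("student")].get("name") == student_name:
            result[by_id[enr.get("course")].get("name")] = enr.get("grade")
    return result
```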

Hierarchical Owner Relationships

Hierarchical (1:many) relationships that express ownership can take advantage of the nesting available in XML by making the owned entity a sub-entity of the owner entity. For example, the relationship that a school owns several classrooms can be serialized by nesting each Classroom element inside the owning School element.

Note that such owned entities are normally also known as "weak entities".
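The nesting described for owner relationships can be sketched as follows. The 'Classroom' element name comes from the example in the text; the 'room' attribute and all id values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the "room" attribute and the id values are assumptions.
OWNED_XML = """
<School id="School-1">
  <Classroom id="Classroom-1" room="101"/>
  <Classroom id="Classroom-2" room="102"/>
</School>
"""

school = ET.fromstring(OWNED_XML)

# Ownership is read from containment: the children of School are its classrooms.
owned_rooms = [c.get("room") for c in school.findall("Classroom")]

# Each owned element still has its own id, so other elements can refer to it.
owned_ids = [c.get("id") for c in school.findall("Classroom")]
```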

Inverse Relationships

Relationships between entities are bi-directional. For example, a course is taughtBy a person; that person teaches the course: ‘teaches’ and ‘taughtBy’ are inverses. Their inverse relationship can be declared in the schema. For maximum convenience when querying, we would like to write instance documents that contain both directions; that means writing each relation once on each element, for example a 'teaches' attribute on the Teacher element and a 'taughtBy' attribute on each Course element it refers to.

But often, database developers and programmers take a shortcut and provide only one side of the relationship, knowing that the other can be inferred (ideally from the schema). A database developer will typically write the many-to-one direction, since that is where the foreign key is found, and it avoids scanning rows twice.

Programmers will also often write only one direction, but unfortunately this may be the direction opposite to that chosen by databases. They will often write the one-to-many direction, since this corresponds to an array of references.

Thus when both directions are materialized in the canonical serialization, the resulting XML can be easily transferred from one world to the other, provided that the inverse relationship of the two attributes is preserved in the schema and the schema of the canonical representation is passed with the data.
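When only one direction arrives, the other can be materialized from the declared inverse. The sketch below is one possible way to do so, assuming the inverse declaration is available as a simple mapping (`INVERSES`, a hypothetical stand-in for schema information) and that the input carries only the many-to-one 'taughtBy' side.

```python
import xml.etree.ElementTree as ET

# Assumption: the schema tells us that 'taughtBy' and 'teaches' are inverses.
INVERSES = {"taughtBy": "teaches"}

ONE_SIDED_XML = """
<School>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/>
  <Teacher id="Teacher-83" name="Thorsten"/>
</School>
"""

def materialize_inverses(root, inverses):
    """Add the inverse attribute to each referenced element, in document order."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    for el in root.iter():
        for relation, inverse in inverses.items():
            for ref in el.get(relation, "").split():
                target = by_id[ref]
                existing = target.get(inverse, "").split()
                if el.get("id") not in existing:
                    target.set(inverse, " ".join(existing + [el.get("id")]))
    return root

root = materialize_inverses(ET.fromstring(ONE_SIDED_XML), INVERSES)
teaches = root.find("Teacher").get("teaches")
```

Running the function a second time leaves the document unchanged, since ids already present in the inverse attribute are not appended again.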

2.4 External References, Arrays and Order

Entities may have relations to entities not in the serialized graph using the same general mechanism, but where the attribute's datatype is URI (Uniform Resource Identifier) or URIS. By that, we mean that the attribute is not an IDREF, but rather will be interpreted by a processing application as a URI. We anticipate that future schema mechanisms will provide the means to declare such attributes in a standard way. For example, webPage in the following element would be referencing another entity via its URI:

<Student id="Student-31415" name="Linda" webPage="http://www.lindamann.com"/>

As mentioned earlier, if the same relation type relates several entities, they are expressed as a single attribute with datatype IDREFS. The order in which the ids are listed is presumed significant, and expresses the ordering (if any) of the collection of related entities (e.g. chapters in a book). When significant, it is fundamentally an aspect of the relations between the elements (e.g. between the chapters, such that chapter 1 precedes chapter 2, and so on).

This does not preclude application domains designing vocabulary for collections with more specialized semantics, for example Arrays, Sets, Bags, etc. In these cases, the semantics would be indicated by explicit collection elements, or by information in the schema for the relation attribute, as appropriate. For example, if the courses that the Student with id "Student-2567" attends are grouped in an array, then the canonical form can express this explicit array as follows:

  <Student id="Student-2567" name="Michael" dormaddr="Address-3" attends="Array-1"/>
  <dt:Array id="Array-1">
    <dt:Idref>Course-19</dt:Idref>
    <dt:Idref>Course-253</dt:Idref>
  </dt:Array>

Here the datatype element "Array" is referenced via an IDREF and the array elements are available as sub datatype elements "Idref".

Similarly, while these rules permit the serialization of any graph, they neither require nor preclude elements or attributes with specific semantics, including elements or attributes designed to layer-on additional graph facilities such as reference, attribution or subsumption. Appropriate vocabularies and namespaces can effect all of these facilities.

3. Mappings to the Canonical Form

The following sections give some cookbook recipes for converting some commonly encountered data descriptions and formats into the canonical form and back.

3.1 UML to XML Schema Conversion

Given an arbitrary UML diagram, we can mechanically produce a canonical grammar.

Objects are expressed as elements. They always have id attributes.
Properties are expressed as attributes.
Relations are expressed as attributes. The value of the attribute is an IDREF (or space-separated list of IDREFS) to the related element. (Relations to objects potentially not in the serialized instance are of datatype URI or URIS.)
The top-level element is the name of the package or message.
The top-level element has a content model that allows any other element type in any order.
If an object can only be referenced once, and its existence is dependent on the existence of another element, it may also appear in the content model of the referencing element (which continues to have an attribute making the relation explicit).
Regarding ordering, content models use a group that allows sub-elements to appear in any order.
Associations have two roles (each role is the 'inverse' of the other). While the graph could be serialized by representing only one role, processing by the reader is eased if both roles are explicitly represented. This is a trade-off between expense for the writer versus expense for the reader. Either form is acceptable as 'canonical model.' If one must represent only one side, we recommend the following process for choosing sides:

1. If only one role is named, use that, else
2. Pick the role with the smallest maximum cardinality, else
3. Pick the role with the largest minimum cardinality, else
4. Pick the role with the shortest name, else
5. Pick the role whose name appears first in the alphabet.
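The five tie-breaking steps above can be sketched as a small function. The `Role` record and its field names are hypothetical; 'many' is modeled as infinity, and the alphabetical comparison is assumed to be case-insensitive, which the rule itself does not specify.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Role:
    name: Optional[str]   # None if the role is unnamed
    min_card: int         # minimum cardinality
    max_card: float       # maximum cardinality; float("inf") stands for "many"

def choose_role(a: Role, b: Role) -> Role:
    """Pick the single role to serialize, following steps 1 to 5 in order."""
    # Step 1: if only one role is named, use that one.
    if (a.name is None) != (b.name is None):
        return a if a.name is not None else b
    # Step 2: smallest maximum cardinality.
    if a.max_card != b.max_card:
        return a if a.max_card < b.max_card else b
    # Step 3: largest minimum cardinality.
    if a.min_card != b.min_card:
        return a if a.min_card > b.min_card else b
    # Step 4: shortest name.
    a_name, b_name = a.name or "", b.name or ""
    if len(a_name) != len(b_name):
        return a if len(a_name) < len(b_name) else b
    # Step 5: name that comes first in the alphabet (case-insensitive assumption).
    return a if a_name.lower() <= b_name.lower() else b

MANY = float("inf")
teaches = Role("teaches", 1, MANY)
taught_by = Role("taughtBy", 1, 1)
attends = Role("attends", 0, MANY)
attended_by = Role("attendedBy", 1, MANY)
```

For the School example this selects 'taughtBy' (step 2) and 'attendedBy' (step 3).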

3.2 Graph to XML Instance Conversion

Emit the top-level element tag corresponding to the package or message. Within this, walk the graph using any of the well-known techniques. For each node,

1. Emit an element corresponding to the node, with the generic identifier indicating the node's type and with a unique id attribute/value. Emit attributes corresponding to each property and relation, where the value of a relation is expressed as an IDREF if the object of the relation is in the graph, else as a full URI. If order of related nodes is significant, emit relations in that order.
2. Optionally, if the object of a relation is known to be referenced by at most a single node, emit that object node as a child element, following these rules recursively. Otherwise, defer it to later in the graph walk.
3. When a graph edge has an inverse edge (e.g. 'teaches' and 'taughtBy') emit both.
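A minimal sketch of this walk follows, under stated assumptions: the graph is held in a hypothetical dictionary mapping node ids to a (type, properties, relations) triple, both directions of each relation are already present in the data, and the optional nesting of step 2 is skipped so that every node is emitted at the top level.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory graph: node id -> (type, properties, relations).
# A relation value is a list of target node ids (or URIs for outside targets).
GRAPH = {
    "Teacher-83": ("Teacher", {"name": "Thorsten"},
                   {"teaches": ["Course-19", "Course-253"]}),
    "Course-19": ("Course", {"name": "Western Civilization"},
                  {"taughtBy": ["Teacher-83"]}),
    "Course-253": ("Course", {"name": "English Literature"},
                   {"taughtBy": ["Teacher-83"]}),
}

def graph_to_xml(graph, top_level="School"):
    """Emit one element per node, with properties and relations as attributes."""
    root = ET.Element(top_level)
    for node_id, (node_type, props, relations) in graph.items():
        el = ET.SubElement(root, node_type, {"id": node_id})
        for name, value in props.items():
            el.set(name, str(value))
        for name, targets in relations.items():
            # Targets are joined in list order, which preserves any ordering.
            el.set(name, " ".join(targets))
    return root

root = graph_to_xml(GRAPH)
serialized = ET.tostring(root, encoding="unicode")
```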

3.3 XML Instance to Graph Conversion

For each element in the document, create a node identified by the id attribute of the element, and with a node type given by the type of the element.
For each relation (IDREF or IDREFS) attribute of each element create an edge whose role name is identified by the attribute's name and whose value is the node identified by the attribute's value.
For each property attribute of each element create a property whose role name is identified by the attribute's name and whose value is the attribute's value. The type of the value is identified by the datatype of the attribute.
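The three steps above might be implemented along the following lines. The sketch assumes that the set of relation attributes (`RELATION_ATTRS`) is known from a schema that is not shown, and it returns nodes and edges as plain Python structures rather than any particular graph library's types.

```python
import xml.etree.ElementTree as ET

# Assumption: a schema (not shown) tells us which attributes are IDREF/IDREFS.
RELATION_ATTRS = {"taughtBy", "teaches", "attends", "parentaddr", "dormaddr"}

def xml_to_graph(root):
    """Return (nodes, edges): nodes maps id -> (type, properties);
    edges is a list of (source id, role name, target id) triples."""
    nodes, edges = {}, []
    for el in root.iter():
        node_id = el.get("id")
        if node_id is None:
            continue  # e.g. a top-level element that carries no id
        props = {k: v for k, v in el.attrib.items()
                 if k != "id" and k not in RELATION_ATTRS}
        nodes[node_id] = (el.tag, props)
        for role in sorted(RELATION_ATTRS.intersection(el.attrib)):
            edges.extend((node_id, role, ref) for ref in el.get(role).split())
    return nodes, edges

DOC = """
<School>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Student id="Student-2567" name="Michael" attends="Course-19"/>
  <Teacher id="Teacher-83" name="Thorsten"/>
</School>
"""

nodes, edges = xml_to_graph(ET.fromstring(DOC))
```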

3.4 Converting a set of Database Tables to XML Instance

Emit the top-level element tag corresponding to the package or message. Within this, for each row in each table (except for tables that express many:many relationships without additional properties):
1. Emit an element corresponding to the row. Unless the database provides globally unique identifiers (guids), the ID attribute has a value formed by concatenating the element type name with the value of the primary key column, separating with a dash (minus sign). Multi-column keys should be concatenated into a single key value, using a separator that allows the parts to be separated again. If guids are used to identify the relational data, these guids can directly be used in an ID attribute. Emit attributes corresponding to each property, formatting as necessary according to the column's datatype. Emit attributes corresponding to each relation, where the value of a relation is expressed as an IDREF if the object of the relation type is always in the instance, else as a URI. If order of related rows is significant, emit relations in that order.
2. Optionally, if the object of a relation is known to be referenced by at most this single row (e.g., based on an "on delete cascade" foreign key constraint), emit that object as a child element, following these rules recursively. Otherwise, defer it to later in the output.
3. Emit inverse relations according to the rules given under "UML to XML Schema Conversion," above.
For each row in each table that expresses a many:many relationship without additional properties: Add IDREFS attributes to each element instance involved in the relationship.
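As a rough sketch of this procedure, the code below converts a few in-memory rows rather than a live database. The table contents, column names (`pk`, `teacher_fk`) and the `make_id` helper are hypothetical; ids are formed by joining the element type name and the primary key with a dash, as described in step 1. The optional nesting of step 2 and the inverse relations of step 3 are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical tables; column names and key values are illustrative assumptions.
TEACHERS = [{"pk": 83, "name": "Thorsten"}]
COURSES = [{"pk": 19, "name": "Western Civilization", "teacher_fk": 83},
           {"pk": 253, "name": "English Literature", "teacher_fk": 83}]
STUDENTS = [{"pk": 2567, "name": "Michael"}]
ATTENDS = [(2567, 19), (2567, 253)]  # pure many:many table: (student, course)

def make_id(type_name, pk):
    """Concatenate element type name and primary key, separated by a dash."""
    return f"{type_name}-{pk}"

def tables_to_xml():
    root = ET.Element("School")
    for row in TEACHERS:
        ET.SubElement(root, "Teacher",
                      id=make_id("Teacher", row["pk"]), name=row["name"])
    for row in COURSES:
        # The foreign key column becomes an IDREF attribute.
        ET.SubElement(root, "Course",
                      id=make_id("Course", row["pk"]), name=row["name"],
                      taughtBy=make_id("Teacher", row["teacher_fk"]))
    for row in STUDENTS:
        # Rows of the many:many table become one IDREFS attribute.
        refs = [make_id("Course", c) for s, c in ATTENDS if s == row["pk"]]
        ET.SubElement(root, "Student",
                      id=make_id("Student", row["pk"]), name=row["name"],
                      attends=" ".join(refs))
    return root

root = tables_to_xml()
```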

3.5 XML Instance to set of Database Tables Conversion

For each element type, create a table with columns corresponding to the attributes of the element type.
For each element in the document, create a row in the table corresponding to the element's name, with the row identified by its id value.
For each property attribute of each element, set the corresponding column to have a value equal to the attribute's value, decoding it if necessary according to the datatype.
For each relation (IDREF or IDREFS) attribute of each element, set the corresponding column to have a (foreign key) value equal to the primary key attribute of the referenced element. Multi-column keys are more complicated. If the element contains the attributes corresponding to the foreign key columns, the row will reference the right other row. But if it does not, extra information will be needed to know which primary key column corresponds to which foreign key column. If we have information about the inverse relationship among two IDREFS attributes, we generate a table for this many:many relationship and issue a row for every valid combination of the two IDREFS attributes.
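The reverse direction can be sketched in the same spirit. This is a simplification under stated assumptions: tables are plain lists of dictionaries rather than database tables, the id string itself serves as the key, and the set of IDREFS attributes (`IDREFS_ATTRS`) is assumed to come from a schema. Each IDREFS attribute yields a separate join table with one row per referenced id.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: a schema (not shown) identifies the IDREFS attributes.
IDREFS_ATTRS = {"attends"}

def xml_to_tables(root):
    """Return (tables, join_tables).
    tables: element type -> list of rows (dicts of column -> value).
    join_tables: IDREFS attribute name -> list of (source id, target id) rows."""
    tables = defaultdict(list)
    join_tables = defaultdict(list)
    for el in root.iter():
        if el.get("id") is None:
            continue  # elements without an id produce no row
        row = {}
        for name, value in el.attrib.items():
            if name in IDREFS_ATTRS:
                join_tables[name].extend((el.get("id"), ref) for ref in value.split())
            else:
                row[name] = value  # properties and single IDREFs become columns
        tables[el.tag].append(row)
    return tables, join_tables

DOC = """
<School>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/>
  <Student id="Student-2567" name="Michael" attends="Course-19 Course-253"/>
  <Teacher id="Teacher-83" name="Thorsten"/>
</School>
"""

tables, join_tables = xml_to_tables(ET.fromstring(DOC))
```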

4. Mapping other XML Syntax Families

A fully explicit, canonical syntax makes it easy to convert from syntax to a graph of objects. Provided one has a schema telling which attributes are IDREFs, one merely interprets all attributes as either properties or relations via IDREF. However, the canonical syntax is not the only syntax that could be used to serialize a graph. In many cases, alternative syntaxes may be used, due to historical or political factors, or to take advantage of compressions that are available if one has domain knowledge. We call all of these "abbreviated syntaxes." These are not canonical syntax, though they may be mapped to it. For example, we might find an instance such as this:

<Course>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Course>

Here, the course's name was expressed by a sub-element, and teachers and students were identified only by their name.

We need a means to convert such abbreviated syntax to a fully explicit (canonical) syntax. There are two basic approaches possible. One is to have some declarative information in the schema that restores the missing elements. The other is to use a transform language such as XSL to convert the abbreviated to an explicit syntax.

The declarative approach is initially simpler. Each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate declarative mappings. The drawback to this is that it requires additions to the schema vocabulary, and can only handle a limited number of simple cases. In the real world, judging by the experience with Architectural Forms and our own attempts to design general mapping declarations, especially given the deployment of systems that evolve over several years, declarative mappings eventually either fail or become very complex.

If we take the transform language approach, then each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate transforms to and from the canonical form.

We currently favor a composite approach. For a small number of very common and simple cases we can annotate schemas with declarative mapping information in the form of attributes of the element types. The exact details of what constitutes "common and simple" should be determined, but candidates appear to be (a) simple renaming of elements or attributes, (b) conversion of a sub-element to an attribute, (c) inference of a relation based on element containment, (d) reference by a "foreign key" converted to reference by IDREF or URI. For more complex cases we should look to a transform language such as XSL.
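Cases (b) and (d) can be illustrated on the abbreviated Course instance shown earlier. The sketch below uses Python in place of XSL, purely for illustration. The name-to-id lookup (`IDS_BY_NAME`) stands in for whatever "foreign key" resolution a real mapping would use; the id assigned to "Smith" and the course id passed in are invented for the example.

```python
import xml.etree.ElementTree as ET

ABBREVIATED = """
<Course>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Course>
"""

# Assumptions: which sub-elements are relations, and a lookup from the names
# used as "foreign keys" to ids. The id for "Smith" is invented for illustration.
RELATIONS = {"taughtBy", "attendedBy"}
IDS_BY_NAME = {"Thorsten": "Teacher-83",
               "Raphael": "Student-30006",
               "Smith": "Student-4711"}

def to_canonical(course, course_id):
    """Turn sub-elements into attributes; turn name references into IDREF(S)."""
    out = ET.Element("Course", id=course_id)
    for child in course:
        text = (child.text or "").strip()
        if child.tag in RELATIONS:
            # (d) reference by name becomes reference by id; repeats accumulate.
            refs = out.get(child.tag, "").split() + [IDS_BY_NAME[text]]
            out.set(child.tag, " ".join(refs))
        else:
            # (b) a property sub-element becomes an attribute.
            out.set(child.tag, text)
    return out

canonical = to_canonical(ET.fromstring(ABBREVIATED), "Course-19")
```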

Finally, one might reasonably ask why we have a canonical syntax at all. Why not provide mappings directly to the graph's schema? But if we ask that, we need to also ask what those mappings would look like. In effect, they would map elements and attributes to objects and properties, much as XSL maps things today, but using new keywords to signal the difference in result types. Having done all that – introducing a new vocabulary for syntax to graph mapping – we would not have any greater functionality than provided by the canonical syntax approach, but we would have doubled the vocabulary needed. Further, we would require that all clients of XML implement mapping machinery (while with the canonical syntax approach a server could choose to emit canonical syntax, thereby avoiding any need for a special mapper). We would not be able to leverage future developments in XSL. Finally, we would not be providing any clear suggestions for syntax that people should use, and would therefore greatly increase the actual amount of mapping that would need to occur.

5. Conclusion

The proposed canonical form for serializing graph-structured data gives XML a powerful tool to transfer complex-structured data between different systems. These systems may have heterogeneous mechanisms for implementing data structures, yet the canonical form faithfully preserves the data’s graph structure during the transmission. The proposed format achieves the preservation solely by using techniques conforming to XML 1.0 and produces XML fragments that are both highly readable and easy to query. The rules are simple and can be taught easily.

This model provides a way to encode any graph of instances as XML unambiguously. It is important to note that the use of this goes considerably beyond allowing XML messages to include graphs of data. It allows us to model existing sets of data such as relational databases or applications with complex graphs of objects as though they are XML. In other words, datastores and networks of programmatic objects can be treated as though they are XML documents. This in turn will allow standard ways for databases and applications to interoperate without requiring that each party of the interaction have custom knowledge of the underlying implementation. So long as the transmitted data is mapped to an XML canonical format, the message can be decoded by any receiver.

Viewing datastores as XML structures is an important advance in data integration. It means that XML facilities such as XPointers and any upcoming XML query language can – in principle – ask for information from any store. Thereby, they can provide a consistent model across facilities ranging from documents to databases to application servers. One can imagine the day in which one query language is used across the entire net to access information. No new infrastructure is required in XML to enable this. As we have shown, the existing capabilities inherent in XML, ranging from the nested scoping of attributes against elements to the ID/IDREF(S) capability, are sufficient for the job. All that is missing is a richer schema model that enables us to know about data types, inverses and, potentially, constraints on the element types to which references refer. This is exciting, because it means that we are already well along the road to what had once been a futuristic possibility: a world in which any consumer of data has the tools to talk to any data producer, wherever located, and communication is based on the meaning of the data, not the accidents of its representation.

Appendix A: DTD and Schema for Courses, Students, Teachers Example

A.1 DTD

This sample schema uses DTD notation to describe a vocabulary and syntax for serializing the example of Courses, Students and Teachers given in Figure 1.

    <!ELEMENT Address EMPTY>
    <!ATTLIST Address
              id ID #IMPLIED
              street CDATA #IMPLIED
              city CDATA #IMPLIED
              state CDATA #IMPLIED >

    <!ELEMENT Course EMPTY>
    <!ATTLIST Course
              id ID #IMPLIED
              name CDATA #IMPLIED
              taughtBy IDREF #IMPLIED >

    <!ELEMENT Student (Address)? >
    <!ATTLIST Student
              id ID #IMPLIED
              name CDATA #IMPLIED
              attends IDREFS #IMPLIED
              parentaddr IDREF #IMPLIED
              dormaddr IDREF #IMPLIED >

    <!ELEMENT Teacher EMPTY>
    <!ATTLIST Teacher
              id ID #IMPLIED
              name CDATA #IMPLIED >

    <!ELEMENT School (Student|Course|Teacher|Address)* >
    <!ATTLIST School
              id ID #IMPLIED
              students IDREFS #IMPLIED 
              courses IDREFS #IMPLIED 
              teachers IDREFS #IMPLIED 
              addresses IDREFS #IMPLIED >

A.2 XML-Data Schema

This sample schema uses XML-Data notation to describe a vocabulary and syntax for serializing the example of Courses, Students and Teachers given in Figure 1.

<?xml version="1.0" encoding="windows-1252" ?>
<!-- Schema for package CoursesStudentsTeachers  -->
<Schema   xmlns="urn:schemas-microsoft-com:xml-data"
            xmlns:dt="urn:schemas-microsoft-com:datatypes"
            xmlns:x="urn:schemas-microsoft-com:xml-data-ex">
  
    <!-- *****  TYPE Address ***** -->
 
    <ElementType name="Address">
 
        <AttributeType name="id" dt:type="id"/>
            <attribute type="id" />
        <AttributeType name="street" dt:type="string"/>
            <attribute type="street" />         
        <AttributeType name="city" dt:type="string"/>
            <attribute type="city" />         
        <AttributeType name="state" dt:type="string"/>
            <attribute type="state" />         

    </ElementType>
  
    <!-- *****  TYPE Course ***** -->
 
    <ElementType name="Course">
 
        <AttributeType name="id" dt:type="id"/>
            <attribute type="id" />
  
        <AttributeType name="name" dt:type="string"/>
            <attribute type="name" required="yes" />         
        <AttributeType name="taughtBy" dt:type="idref" />
            <attribute type="taughtBy" required='yes' x:range="Teacher"/>
 
    </ElementType>
 
    <!-- *****  TYPE Student ***** -->
 
    <ElementType name="Student">
 
        <AttributeType name="id" dt:type="id"/>
            <attribute type="id" />
         
        <AttributeType name="attends" dt:type="idrefs" />
            <attribute type="attends" x:range="Course"/>         
        <AttributeType name="name" dt:type="string"/>
            <attribute type="name" required='yes' />
        <AttributeType name="parentaddr" dt:type="idref" />
            <attribute type="parentaddr" x:range="Address"/>
        <AttributeType name="dormaddr" dt:type="idref" />
            <attribute type="dormaddr" x:range="Address"/>

        <group seq='many'>
            <element  type="Address" minOccurs="0" maxOccurs="1" />
        </group>
 
    </ElementType>
 
    <!-- *****  TYPE Teacher ***** -->
 
    <ElementType name="Teacher">
 
        <AttributeType name="id" dt:type="id"/>
            <attribute type="id" />
 
        <AttributeType name="name" dt:type="string"/>
            <attribute type="name" required='yes' /> 
 
    </ElementType>
  
  <!-- The PACKAGE -->

    <!-- *****  TYPE School ***** -->
 
    <ElementType name="School">
 
        <AttributeType name="id" dt:type="id"/>
            <attribute type="id" />
         
        <AttributeType name="courses" dt:type="idrefs" />
            <attribute type="courses" x:range="Course"/>         
        <AttributeType name="students" dt:type="idrefs" />
            <attribute type="students" x:range="Student"/>         
        <AttributeType name="teachers" dt:type="idrefs" />
            <attribute type="teachers" x:range="Teacher"/>
        <AttributeType name="addresses" dt:type="idrefs" />
            <attribute type="addresses" x:range="Address"/>
 
        <group seq='many'>
            <element  type="Student" minOccurs="0" maxOccurs="*" />
            <element  type="Course" minOccurs="0" maxOccurs="*" />
            <element  type="Teacher" minOccurs="0" maxOccurs="*" />
            <element  type="Address" minOccurs="0" maxOccurs="*" />
        </group>
 
    </ElementType>
</Schema>
