[From: http://www.biztalk.org/Resources/canonical.asp; see also http://msdn.microsoft.com/xml/archive/WWW8DevDay.ppt.]
Figure 1 – Course, Student, Teacher and Address Example

For example, consider a database or other graph (for example described by the UML diagram of Figure 1 or another notation) that contains the following data:
A serialized instance produced according to the canonical form would look like this (see Appendix A for the DTD and the XML schema):

    <School>
      <Address id="Address-3" street="28 Campus Drive #103" city="Unicity" state="CA"/>
      <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
      <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/>
      <Student id="Student-30006" name="Raphael" parentaddr="Address-1" dormaddr="Address-3" attends="Course-19">
        <Address id="Address-1" street="950 Greenhill Rd" city="Mill Valley" state="CA"/>
      </Student>
      <Student id="Student-2567" name="Michael" dormaddr="Address-3" attends="Course-19 Course-253" />
      <Student id="Student-31415" name="Sandra" parentaddr="Address-4" attends="Course-253">
        <Address id="Address-4" street="14 16 Street" city="San Raphael" state="CA"/>
      </Student>
      <Teacher id="Teacher-83" name="Thorsten"/>
    </School>

In the example, the student named Raphael has both a parent address and a dorm address and attends one course. The student named Michael does not have a parent address and shares the dorm address with Raphael (i.e., they are roommates), whereas the student named Sandra has only a parent address. Michael attends two courses; Sandra attends one.

If we compare the rules and the canonical form they produce with our stated design goals, we can say that we have achieved our goals. The typical syntax generated is easily readable by humans. The rules conform to XML 1.0 and are easy to teach. By using ID/IDREF and URIs, the serialization is able to express graphs and directed relationships (see below). The query model is simple, and it is possible to embed data in the canonical form within web pages. As we will see in sections 3 and 4 below, it is easy to define rules on how to map other syntax families to and from the canonical form.
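As an illustration of the claim that the query model is simple, the following minimal Python sketch follows ID/IDREF(S) references in a trimmed copy of the School instance. The helper names (`index_by_id`, `roommates`, `courses_of`) are invented for this illustration, and Sandra's parent address is left out for brevity; nothing here is prescribed by the canonical form itself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the School instance (Sandra's parent address omitted).
SCHOOL_XML = """
<School>
  <Address id="Address-3" street="28 Campus Drive #103" city="Unicity" state="CA"/>
  <Course id="Course-19" name="Western Civilization" taughtBy="Teacher-83"/>
  <Course id="Course-253" name="English Literature" taughtBy="Teacher-83"/>
  <Student id="Student-30006" name="Raphael" parentaddr="Address-1"
           dormaddr="Address-3" attends="Course-19">
    <Address id="Address-1" street="950 Greenhill Rd" city="Mill Valley" state="CA"/>
  </Student>
  <Student id="Student-2567" name="Michael" dormaddr="Address-3"
           attends="Course-19 Course-253"/>
  <Student id="Student-31415" name="Sandra" attends="Course-253"/>
  <Teacher id="Teacher-83" name="Thorsten"/>
</School>
"""


def index_by_id(root):
    """Map every id value to its element (the lookup table behind IDREF)."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def roommates(root):
    """Group student names by shared dormaddr; keep only shared addresses."""
    groups = defaultdict(list)
    for student in root.iter("Student"):
        dorm = student.get("dormaddr")
        if dorm is not None:
            groups[dorm].append(student.get("name"))
    return {addr: names for addr, names in groups.items() if len(names) > 1}


def courses_of(root, student_name):
    """Resolve a student's attends IDREFS to course names, in listed order."""
    ids = index_by_id(root)
    for student in root.iter("Student"):
        if student.get("name") == student_name:
            return [ids[ref].get("name") for ref in student.get("attends", "").split()]
    return []


root = ET.fromstring(SCHOOL_XML)
print(roommates(root))
print(courses_of(root, "Michael"))
```

Both queries need only attribute lookups and one id index, which is the practical benefit of keeping properties and relations in attributes.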
The only problem that may arise is in the context of very long text properties, where a different encoding might have been a bit more readable. However, any encoding that maps properties to the content model has the problems of added complexity (relations still need to be mapped to attributes, because elements cannot express references) and of XML 1.0 conformity (local property names become global).

2.3 Serializing Relationships

As rule 3 already states, relationships between entities are represented using ID/IDREF(S) attributes. In the following, we look more closely at each of the binary relationships normally found in data models and show with some examples how they are serialized.

1:1 Relationship

A 1:1 relationship is expressed by an attribute of type IDREF. For example, if a student attends a course, the canonical XML looks like

    <Student id="Student-1" name="Alice" attends="Course-1" />
    <Course id="Course-1" name="Greek" />

If we want to be able to query the relationship in both directions, we add explicit attributes in both directions:

    <Student id="Student-1" name="Alice" attends="Course-1" />
    <Course id="Course-1" name="Greek" attendees="Student-1"/>

1:Many Relationship

A 1:many relationship is expressed by an attribute of type IDREFS for the 1:many direction or an attribute of type IDREF for the many:1 direction.
For example, if a teacher teaches two courses and courses are taught by only one teacher, the canonical solution in XML is

    <Teacher id="Teacher-1" name="Alice" teaches="Course-1 Course-2" />
    <Course id="Course-1" name="Greek" taughtBy="Teacher-1"/>
    <Course id="Course-2" name="English History" taughtBy="Teacher-1"/>

This works even if the taught events are not all courses, but some heterogeneous mixture of types, as in:

    <Teacher id="Teacher-1" name="Alice" teaches="Course-1 Dance-2" />
    <Course id="Course-1" name="Greek" taughtBy="Teacher-1"/>
    <Dance id="Dance-2" taughtBy="Teacher-1"/>

Many:Many Relationship

Many:many relationships that have no additional properties attached to them are expressed by attributes of type IDREFS. If additional properties are attached, the relationship is serialized as an explicit join element that carries two IDREF attributes, one for each entity involved, plus attributes for the attached properties.

If there are no properties (e.g. grades) attached to the student-course relationships attends and attendees, we just show student and course elements as

    <Student id="Student-1" name="Alice" attends="Course-1 Course-2" />
    <Student id="Student-2" name="Bob" attends="Course-2 Course-3" />
    <Course id="Course-1" name="Greek" attendees="Student-1"/>
    <Course id="Course-2" name="English History" attendees="Student-1 Student-2"/>
    <Course id="Course-3" name="Physics" attendees="Student-2"/>

However, if there are properties of the many:many relation, we need to show these in the serialization, as in:

    <Student id="Student-1" name="Alice" enrollment="e1 e2" />
    <Student id="Student-2" name="Bob" enrollment="e3 e4" />
    <Course id="Course-1" name="Greek" enrollment="e1"/>
    <Course id="Course-2" name="English History" enrollment="e2 e3"/>
    <Course id="Course-3" name="Physics" enrollment="e4"/>
    <Enrollment id="e1" student="Student-1" course="Course-1" grade="A" />
    <Enrollment id="e2" student="Student-1" course="Course-2" grade="B" />
    <Enrollment id="e3" student="Student-2" course="Course-2" grade="C" />
    <Enrollment id="e4" student="Student-2" course="Course-3" grade="D" />

Note that there is a pretty smooth evolution from a single-valued IDREF attribute all the way through a many-to-many relation. We never actually reify a relation (e.g. "enrollment") unless the relation has properties.

Hierarchical Owner Relationships

Hierarchical (1:many) relationships that express ownership can take advantage of the nesting available in XML by making the owned entity a sub-entity of the owner entity. For example, the relationship that a school owns several classrooms can be serialized as

    <School id="School-1" name="Goethe High" classrooms="Room-1-1 Room-1-2">
      <Classroom id="Room-1-1" name="A1" />
      <Classroom id="Room-1-2" name="A2" />
    </School>

Note that such owned entities are normally also known as "weak entities".

Inverse Relationships

Relationships between entities are bi-directional. For example, a course is taughtBy a person; that person teaches the course: ‘teaches’ and ‘taughtBy’ are inverses. Their inverse relationship can be declared in the schema. For maximum convenience when querying, we would like to write instance documents that contain both directions; that means writing each relation once on each element, for example

    <Course id="Course-1" name="English" taughtBy="Teacher-1" />
    <Course id="Course-2" name="French" taughtBy="Teacher-1" />
    <Teacher id="Teacher-1" name="Mitchell" teaches="Course-1 Course-2" />

But often, database developers and programmers take a shortcut and provide only one side of the relationship, knowing that the other can be inferred (ideally from the schema). A database developer will typically write the many-to-one direction, since that is where the foreign key is found, and it avoids scanning rows twice.
    <Course id="Course-1" name="English" taughtBy="Teacher-1" />
    <Course id="Course-2" name="French" taughtBy="Teacher-1" />
    <Teacher id="Teacher-1" name="Mitchell" />

Programmers will also often write only one direction, but unfortunately this may be the direction opposite to that chosen by databases. They will often write the one-to-many direction, since this corresponds to an array of references.

    <Course id="Course-1" name="English" />
    <Course id="Course-2" name="French" />
    <Teacher id="Teacher-1" name="Mitchell" teaches="Course-1 Course-2" />

Thus when both directions are materialized in the canonical serialization, the resulting XML can be easily transferred from one world to the other, provided that the inverse relationship of the two attributes is preserved in the schema and the schema of the canonical representation is passed with the data.

2.4 External References, Arrays and Order

Entities may have relations to entities not in the serialized graph using the same general mechanism, but where the attribute's datatype is URI (Universal Resource Identifier) or URIS. By that, we mean that the attribute is not an IDREF, but rather will be interpreted by a processing application as a URI. We anticipate that future schema mechanisms will provide the means to declare such attributes in a standard way. For example, webPage in the following element references another entity via its URI:

    <Student id="Student-31415" name="Linda" webPage="http://www.lindamann.com"/>

As mentioned earlier, if the same relation type relates several entities, they are expressed as a single attribute with datatype IDREFS. The order in which the ids are listed is presumed significant, and expresses the ordering (if any) of the collection of related entities (e.g. chapters in a book). When significant, it is fundamentally an aspect of the relations between the elements (e.g. between the chapters, such that chapter 1 precedes chapter 2, and so on).
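A consumer that wants to honor this ordering has to resolve the ids in the order they are listed, not in document order. The following Python sketch shows one way to do that under simple assumptions: the `Library`, `Book` and `Chapter` vocabulary and the `example.org` address are hypothetical, and `resolve_idrefs` is a name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: chapters appear in the document out of reading order,
# and the Book's IDREFS attribute carries the intended order.
BOOK_XML = """
<Library>
  <Chapter id="Chapter-2" title="Relationships"/>
  <Chapter id="Chapter-1" title="Design Goals"/>
  <Chapter id="Chapter-3" title="Mappings"/>
  <Book id="Book-1" name="Canonical XML"
        chapters="Chapter-1 Chapter-2 Chapter-3"
        webPage="http://www.example.org/book"/>
</Library>
"""


def resolve_idrefs(root, element, attr):
    """Return the elements named by an IDREFS attribute, in the listed order."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    return [by_id[token] for token in element.get(attr, "").split()]


root = ET.fromstring(BOOK_XML)
book = root.find("Book")
titles = [ch.get("title") for ch in resolve_idrefs(root, book, "chapters")]
print(titles)

# A URI-typed attribute is not resolved against the id index; it is simply
# handed to the application as a string.
print(book.get("webPage"))
```

The sketch treats `webPage` as an opaque string because, as noted above, only a schema can say which attributes are IDREFS and which are URIs.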
This does not preclude application domains designing vocabulary for collections with more specialized semantics, for example Arrays, Sets, Bags, etc. In these cases, the semantics would be indicated by explicit collection elements, or by information in the schema for the relation attribute, as appropriate. For example, if the courses that the Student with id "Student-2567" attends are grouped in an array, then the canonical form can express this explicit array as follows:

    <Student id="Student-2567" name="Michael" home="Address-3" attends="Array-1"/>
    <dt:Array id="Array-1">
      <dt:Idref>Course-19</dt:Idref>
      <dt:Idref>Course-253</dt:Idref>
    </dt:Array>

Here the datatype element "Array" is referenced via an IDREF, and the array members are available as "Idref" sub-elements of the datatype element.

Similarly, while these rules permit the serialization of any graph, they neither require nor preclude elements or attributes with specific semantics, including elements or attributes designed to layer on additional graph facilities such as reference, attribution or subsumption. Appropriate vocabularies and namespaces can effect all of these facilities.

3. Mappings to the Canonical Form

The following sections give some cookbook recipes for converting some commonly encountered data descriptions and formats into the canonical form and back.

3.1 UML to XML Schema Conversion

Given an arbitrary UML diagram, we can mechanically produce a canonical grammar.
3.2 Graph to XML Instance Conversion
3.3 XML Instance to Graph Conversion
3.4 Converting a set of Database Tables to XML Instance
3.5 XML Instance to set of Database Tables Conversion
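No conversion rules are reproduced under this heading, so the following Python sketch is only one plausible reading, not the authors' recipe. It assumes that each element type becomes a table, each attribute becomes a column, and a single-valued IDREF such as taughtBy serves as the foreign key (the many-to-one direction described in section 2.3). The function name `instance_to_tables` is invented here, and IDREFS attributes and nested owned elements would need extra handling that is not shown.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Small canonical instance that materializes only the many-to-one direction.
INSTANCE = """
<School>
  <Course id="Course-1" name="Greek" taughtBy="Teacher-1"/>
  <Course id="Course-2" name="English History" taughtBy="Teacher-1"/>
  <Teacher id="Teacher-1" name="Alice"/>
</School>
"""


def instance_to_tables(root):
    """Return {table_name: [row_dict, ...]} with one table per element type."""
    tables = defaultdict(list)
    for el in root.iter():
        if el is root:
            continue  # assumption: the root only acts as the enclosing package
        tables[el.tag].append(dict(el.attrib))  # one row per element
    return dict(tables)


tables = instance_to_tables(ET.fromstring(INSTANCE))
for name, rows in tables.items():
    print(name, rows)
```

Each Course row keeps its taughtBy value, which is exactly the column a relational schema would declare as a foreign key into the Teacher table.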
4. Mapping other XML Syntax Families

A fully explicit, canonical syntax makes it easy to convert from syntax to a graph of objects. Provided one has a schema telling which attributes are IDREFs, one merely interprets all attributes as either properties or relations via IDREF.

However, the canonical syntax is not the only syntax that could be used to serialize a graph. In many cases, alternative syntaxes may be used, due to historical or political factors, or to take advantage of compressions that are available if one has domain knowledge. We call all of these "abbreviated syntaxes." These are not canonical syntax, though they may be mapped to it. For example, we might find an instance such as this:

    <Course>
      <name>Western Civilization</name>
      <taughtBy>Thorsten</taughtBy>
      <attendedBy>Raphael</attendedBy>
      <attendedBy>Smith</attendedBy>
    </Course>

Here, the course's name was expressed by a sub-element, and teachers and students were identified only by their names.

We need a means to convert such abbreviated syntax to a fully explicit (canonical) syntax. There are two basic approaches possible. One is to have some declarative information in the schema that restores the missing elements. The other is to use a transform language such as XSL to convert the abbreviated syntax to an explicit one.

The declarative approach is initially simpler. Each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate declarative mappings. The drawback is that it requires additions to the schema vocabulary and can handle only a limited number of simple cases. In the real world, judging by the experience with Architectural Forms and our own attempts to design general mapping declarations, and especially given the deployment of systems that evolve over several years, declarative mappings eventually either fail or become very complex.
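To make the conversion problem concrete, here is a minimal sketch of a mapping for the abbreviated Course instance above. It is written in Python as a stand-in for a transform language, not as XSL. The name-to-id table, the id "Student-99" for Smith, and the choice to keep the attribute name attendedBy are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

ABBREVIATED = """
<Course>
  <name>Western Civilization</name>
  <taughtBy>Thorsten</taughtBy>
  <attendedBy>Raphael</attendedBy>
  <attendedBy>Smith</attendedBy>
</Course>
"""

# Hypothetical "foreign key" lookup from a person's name to an id in the graph.
# "Student-99" is invented; the abbreviated instance gives no id for Smith.
NAME_TO_ID = {
    "Thorsten": "Teacher-83",
    "Raphael": "Student-30006",
    "Smith": "Student-99",
}

PROPERTY_ELEMENTS = {"name"}                    # sub-element -> attribute
RELATION_ELEMENTS = {"taughtBy", "attendedBy"}  # name reference -> IDREF(S)


def to_canonical(abbrev, element_id):
    """Build a canonical element: properties and relations become attributes."""
    canonical = ET.Element(abbrev.tag, {"id": element_id})
    relations = {}
    for child in abbrev:
        text = (child.text or "").strip()
        if child.tag in PROPERTY_ELEMENTS:
            canonical.set(child.tag, text)
        elif child.tag in RELATION_ELEMENTS:
            # Repeated sub-elements collapse into one space-separated IDREFS value.
            relations.setdefault(child.tag, []).append(NAME_TO_ID[text])
    for attr, ids in relations.items():
        canonical.set(attr, " ".join(ids))
    return canonical


course = to_canonical(ET.fromstring(ABBREVIATED), "Course-19")
print(ET.tostring(course, encoding="unicode"))
```

Even this small case needs outside knowledge (which sub-elements are properties, which are relations, and how names map to ids), which suggests why purely declarative mappings tend to grow complicated.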
If we take the transform language approach, then each abbreviated syntactic schema declares its relation to a canonical schema and provides appropriate transforms to and from the canonical form.

We currently favor a composite approach. For a small number of very common and simple cases we can annotate schemas with declarative mapping information in the form of attributes of the element types. The exact details of what constitutes "common and simple" should be determined, but candidates appear to be (a) simple renaming of elements or attributes, (b) conversion of a sub-element to an attribute, (c) inference of a relation based on element containment, (d) reference by a "foreign key" converted to reference by IDREF or URI. For more complex cases we should look to a transform language such as XSL.

Finally, one might reasonably ask why we have a canonical syntax at all. Why not provide mappings directly to the graph's schema? But if we ask that, we need to also ask what those mappings would look like. In effect, they would map elements and attributes to objects and properties, much as XSL maps things today, but using new keywords to signal the difference in result types. Having done all that (introducing a new vocabulary for syntax-to-graph mapping), we would not have any greater functionality than that provided by the canonical syntax approach, but we would have doubled the vocabulary needed. Further, we would require that all clients of XML implement mapping machinery (while with the canonical syntax approach a server could choose to emit canonical syntax, thereby avoiding any need for a special mapper). We would not be able to leverage future developments in XSL. Finally, we would not be providing any clear suggestions for syntax that people should use, and would therefore greatly increase the actual amount of mapping that would need to occur.

5. Conclusion

The proposed canonical form for serializing graph-structured data gives XML a powerful tool to transfer complex-structured data between different systems. These systems may have heterogeneous mechanisms for implementing data structures, yet the canonical form faithfully preserves the data’s graph structure during transmission. The proposed format achieves this solely by using techniques conforming to XML 1.0, and it produces XML fragments that are both highly readable and easy to query. The rules are simple and can be taught easily.

This model provides a way to encode any graph of instances as XML unambiguously. It is important to note that its use goes considerably beyond allowing XML messages to include graphs of data. It allows us to model existing sets of data, such as relational databases or applications with complex graphs of objects, as though they were XML. In other words, datastores and networks of programmatic objects can be treated as though they were XML documents. This in turn will allow standard ways for databases and applications to interoperate without requiring that each party to the interaction have custom knowledge of the underlying implementation. So long as the transmitted data is mapped to an XML canonical format, the message can be decoded by any receiver.

Viewing datastores as XML structures is an important advance in data integration. It means that XML facilities such as XPointers and any upcoming XML query language can, in principle, ask for information from any store. Thereby, they can provide a consistent model across facilities ranging from documents to databases to application servers. One can imagine the day in which one query language is used across the entire net to access information. No new infrastructure is required in XML to enable this.
As we have shown, the existing capabilities inherent in XML, ranging from the nested scoping of attributes against elements to the ID/IDREF(S) capability, are sufficient for the job. All that is missing is a richer schema model that enables us to know about data types, inverses and, potentially, constraints on the element types to which references refer.

This is exciting, because it means that we are already well along the road to what had once been a futuristic possibility: a world in which any consumer of data has the tools to talk to any data producer, wherever located, and communication is based on the meaning of the data, not the accidents of its representation.

Appendix A: DTD and Schema for Courses, Students, Teachers Example

A.1 DTD

This sample schema uses DTD notation to describe a vocabulary and syntax for serializing the example of Courses, Students and Teachers given in Figure 1.

    <!ELEMENT Address EMPTY>
    <!ATTLIST Address
        id     ID    #IMPLIED
        street CDATA #IMPLIED
        city   CDATA #IMPLIED
        state  CDATA #IMPLIED >
    <!ELEMENT Course EMPTY>
    <!ATTLIST Course
        id       ID    #IMPLIED
        name     CDATA #IMPLIED
        taughtBy IDREF #IMPLIED >
    <!ELEMENT Student (Address)? >
    <!ATTLIST Student
        id         ID     #IMPLIED
        name       CDATA  #IMPLIED
        attends    IDREFS #IMPLIED
        parentaddr IDREF  #IMPLIED
        dormaddr   IDREF  #IMPLIED >
    <!ELEMENT Teacher EMPTY>
    <!ATTLIST Teacher
        id   ID    #IMPLIED
        name CDATA #IMPLIED >
    <!ELEMENT School (Student|Course|Teacher|Address)* >
    <!ATTLIST School
        id        ID     #IMPLIED
        students  IDREFS #IMPLIED
        courses   IDREFS #IMPLIED
        teachers  IDREFS #IMPLIED
        addresses IDREFS #IMPLIED >

A.2 XML-Data Schema

This sample schema uses XML-Data notation to describe a vocabulary and syntax for serializing the example of Courses, Students and Teachers given in Figure 1.
    <?xml version="1.0" encoding="windows-1252" ?>
    <!-- Schema for package CoursesStudentsTeachers -->
    <Schema xmlns="urn:schemas-microsoft-com:xml-data"
            xmlns:dt="urn:schemas-microsoft-com:datatypes"
            xmlns:x="urn:schemas-microsoft-com:xml-data-ex">

      <!-- ***** TYPE Address ***** -->
      <ElementType name="Address">
        <AttributeType name="id" dt:type="id"/>
        <attribute type="id" />
        <AttributeType name="street" dt:type="string"/>
        <attribute type="street" />
        <AttributeType name="city" dt:type="string"/>
        <attribute type="city" />
        <AttributeType name="state" dt:type="string"/>
        <attribute type="state" />
      </ElementType>

      <!-- ***** TYPE Course ***** -->
      <ElementType name="Course">
        <AttributeType name="id" dt:type="id"/>
        <attribute type="id" />
        <AttributeType name="name" dt:type="string"/>
        <attribute type="name" required="yes" />
        <AttributeType name="taughtBy" dt:type="idref" />
        <attribute type="taughtBy" required='yes' x:range="Teacher"/>
      </ElementType>

      <!-- ***** TYPE Student ***** -->
      <ElementType name="Student">
        <AttributeType name="id" dt:type="id"/>
        <attribute type="id" />
        <AttributeType name="attends" dt:type="idrefs" />
        <attribute type="attends" x:range="Course"/>
        <AttributeType name="name" dt:type="string"/>
        <attribute type="name" required='yes' />
        <AttributeType name="parentaddr" dt:type="idref" />
        <attribute type="parentaddr" x:range="Address"/>
        <AttributeType name="dormaddr" dt:type="idref" />
        <attribute type="dormaddr" x:range="Address"/>
        <group seq='many'>
          <element type="Address" minOccurs="0" maxOccurs="1" />
        </group>
      </ElementType>

      <!-- ***** TYPE Teacher ***** -->
      <ElementType name="Teacher">
        <AttributeType name="id" dt:type="id"/>
        <attribute type="id" />
        <AttributeType name="name" dt:type="string"/>
        <attribute type="name" required='yes' />
      </ElementType>

      <!-- The PACKAGE -->
      <!-- ***** TYPE School ***** -->
      <ElementType name="School">
        <AttributeType name="id" dt:type="id"/>
        <attribute type="id" />
        <AttributeType name="courses" dt:type="idrefs" />
        <attribute type="courses" x:range="Course"/>
        <AttributeType name="students" dt:type="idrefs" />
        <attribute type="students" x:range="Student"/>
        <AttributeType name="teachers" dt:type="idrefs" />
        <attribute type="teachers" x:range="Teacher"/>
        <AttributeType name="addresses" dt:type="idrefs" />
        <attribute type="addresses" x:range="Address"/>
        <group seq='many'>
          <element type="Student" minOccurs="0" maxOccurs="*" />
          <element type="Course" minOccurs="0" maxOccurs="*" />
          <element type="Teacher" minOccurs="0" maxOccurs="*" />
          <element type="Address" minOccurs="0" maxOccurs="*" />
        </group>
      </ElementType>
    </Schema>
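The x:range annotations in the schema above appear to name the element type that each reference attribute may point to; that reading is an assumption, since the schema does not define x:range here. Under that assumption, a receiver could check an instance roughly as in the following Python sketch. The `RANGES` table is copied by hand from the Course and Student declarations, and `check_ranges` is a name invented for this illustration.

```python
import xml.etree.ElementTree as ET

# (element type, attribute) -> element type the reference must point to.
# Copied by hand from the x:range annotations; interpretation is assumed.
RANGES = {
    ("Course", "taughtBy"): "Teacher",
    ("Student", "attends"): "Course",
    ("Student", "parentaddr"): "Address",
    ("Student", "dormaddr"): "Address",
}


def check_ranges(root):
    """Return a list of messages for dangling or wrongly typed references."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    problems = []
    for el in root.iter():
        for (tag, attr), expected in RANGES.items():
            if el.tag != tag or attr not in el.attrib:
                continue
            for ref in el.get(attr).split():  # handles IDREF and IDREFS alike
                target = by_id.get(ref)
                if target is None:
                    problems.append(f"{el.get('id')}.{attr}: no element with id {ref}")
                elif target.tag != expected:
                    problems.append(
                        f"{el.get('id')}.{attr}: {ref} is a {target.tag}, expected {expected}"
                    )
    return problems


GOOD = ET.fromstring(
    '<School><Teacher id="Teacher-1" name="Alice"/>'
    '<Course id="Course-1" name="Greek" taughtBy="Teacher-1"/></School>'
)
# Deliberately broken: a misspelled teacher id and a student "attending" a teacher.
BAD = ET.fromstring(
    '<School><Teacher id="Teacher-1" name="Alice"/>'
    '<Course id="Course-1" name="Greek" taughtBy="Teaacher-1"/>'
    '<Student id="Student-1" name="Bob" attends="Teacher-1"/></School>'
)

print(check_ranges(GOOD))
for message in check_ranges(BAD):
    print(message)
```

A DTD-validating parser would already catch the dangling IDREF, but not the wrongly typed one, which is the kind of constraint the conclusion describes as still missing from the schema model.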