The agenda of this tutorial follows the classification of XML schema languages proposed by the ISO DSDL Working Group (http://dsdl.org).
Since I will give more background about the history and context of the different XML schema languages on Wednesday morning, I have decided to remove those items from the agenda of this tutorial to focus on the technical features (3 hours will not be too much to cover the technical ground needed to have a good understanding of these technologies!).
Thanks for your attention.
Eric van der Vlist
I will insist more on this point during my comparison of XML schema languages on Wednesday morning, but one thing is sure: an XML schema language is probably not what you're expecting. Its main feature is not (or not always) to describe a class of XML documents but rather to act as a filter or firewall that protects applications from the wide diversity of well-formed XML documents.
Throughout this tutorial we will use the following example:
<?xml version="1.0"?>
<library>
<book id="_0836217462">
<isbn>
0836217462
</isbn>
<title>
Being a Dog Is a Full-Time Job
</title>
<author-ref id="Charles-M.-Schulz"/>
<character-ref id="Peppermint-Patty"/>
<character-ref id="Snoopy"/>
<character-ref id="Schroeder"/>
<character-ref id="Lucy"/>
</book>
<book id="_0805033106">
<isbn>
0805033106
</isbn>
<title>
Peanuts Every Sunday
</title>
<author-ref id="Charles-M.-Schulz"/>
<character-ref id="Sally-Brown"/>
<character-ref id="Snoopy"/>
<character-ref id="Linus"/>
<character-ref id="Snoopy"/>
</book>
<author id="Charles-M.-Schulz">
<name>
Charles M. Schulz
</name>
<nickName>
SPARKY
</nickName>
<born>
1922-11-26
</born>
<dead>
2000-02-12
</dead>
</author>
<character id="Peppermint-Patty">
<name>
Peppermint Patty
</name>
<since>
1966-08-22
</since>
<qualification>
bold, brash and tomboyish
</qualification>
</character>
<character id="Snoopy">
<name>
Snoopy
</name>
<since>
1950-10-04
</since>
<qualification>
extroverted beagle
</qualification>
</character>
<character id="Schroeder">
<name>
Schroeder
</name>
<since>
1951-05-30
</since>
<qualification>
brought classical music to the Peanuts strip
</qualification>
</character>
<character id="Lucy">
<name>
Lucy
</name>
<since>
1952-03-03
</since>
<qualification>
bossy, crabby and selfish
</qualification>
</character>
<character id="Sally-Brown">
<name>
Sally Brown
</name>
<since>
1960-08-22
</since>
<qualification>
always looks for the easy way out
</qualification>
</character>
<character id="Linus">
<name>
Linus
</name>
<since>
1952-09-19
</since>
<qualification>
the intellectual of the gang
</qualification>
</character>
</library>
An application managing the library described by this document, or even an XSLT stylesheet designed to display it, would probably be very confused if the names or content of the elements were not what it expects. The main feature of an XML schema language is to provide a formal way to describe what is expected, in order to protect applications from these risks of error.
The most basic way to implement this firewall is to give a set of rules which need to be followed by the instance documents.
This is the approach followed by rule-based XML schema languages, whose main representative is Schematron. Before presenting Schematron itself, we will look at how XSLT may be used as an XML schema language, since this is a good exercise to understand the basics of those schema languages.
We can use "classical" programming languages to write a rule-based XML schema: either general-purpose languages using an XML API, or XML-specific languages such as XSLT or XQuery.
To illustrate this point, let's take the following very simple snippet of our example:
<?xml version="1.0"?>
<library>
<book id="_0836217462"/>
<book id="_0805033106"/>
</library>
Why so simple? Because we will see that, even though we can use XSLT as a rule-based XML schema language, doing so is quite verbose, and I don't want to spend all the time allocated to this tutorial developing our schema!
To write this schema, we basically have two options, the same ones we have when we configure a firewall: the closed one, where everything that is not allowed is forbidden, and the open one, where everything that is not forbidden is allowed. We will implement both schemas.
The first conclusion from this simple example is that XML applications tend to forbid much more than they allow: closed schemas are often easier to write than open schemas.
On the other hand, it's easier to define user-friendly error messages in an open schema, since the context in which something is forbidden is always determined.
To implement an open schema with XSLT, we will start by defining a default template which accepts anything:
<xsl:template match="*|@*|text()">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
With this single template, our "schema" would accept any well-formed XML document and never raise any error, so we need to add templates to define what's forbidden.
As with the design of any XSLT transformation, we can choose to implement the tests as conditions in the "match" attribute of templates or within the templates using if or choose statements. When we use if or choose statements, we can also choose where to perform the test.
To check that the document element is "library", we have several options, which the individual templates below illustrate.
Once this background is in place, we can generalize it: a fairly complete "schema", including a test for the uniqueness of the identifiers, is the full stylesheet given after these individual templates.
Note that this full stylesheet leaves a degree of openness: arbitrary element and text nodes can still be added to the book element.
We can write a template to allow library as document element:
<xsl:template match="/library">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
But we also need to forbid other document elements:
<xsl:template match="/*">
<xsl:message terminate="no">
The document element should be "library".
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
Or, alternatively, we can rely on the default template and replace both templates by a slightly more complex match expression:
<xsl:template match="/*[not(self::library)]">
<xsl:message terminate="no">
The document element should be "library".
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
We can also perform the test in a template for the root of the document:
<xsl:template match="/">
<xsl:if test="not(library)">
<xsl:message terminate="no">
The document element should be "library".
</xsl:message>
</xsl:if>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
or do the same test in a template for document element:
<xsl:template match="/*">
<xsl:if test="not(self::library)">
<xsl:message terminate="no">
The document element should be "library".
</xsl:message>
</xsl:if>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
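The text above mentions choose statements as an alternative to if. As a minimal sketch, the same document-element test could be written with xsl:choose as shown below. It is meant as a variant of the previous template rather than an addition to it, since two templates matching "/*" with the same priority would conflict.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Default template: accept anything (same as in the open schema). -->
  <xsl:template match="*|@*|text()">
    <xsl:apply-templates select="*|@*|text()"/>
  </xsl:template>
  <!-- Test on the document element, expressed with choose instead of if. -->
  <xsl:template match="/*">
    <xsl:choose>
      <xsl:when test="self::library">
        <!-- Expected document element: nothing to report. -->
      </xsl:when>
      <xsl:otherwise>
        <xsl:message terminate="no">
          <xsl:text>The document element should be "library".</xsl:text>
        </xsl:message>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:apply-templates select="*|@*|text()"/>
  </xsl:template>
</xsl:stylesheet>
```

The choose form becomes more interesting than if once there are several mutually exclusive cases to report with different messages.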
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:template match="*|@*|text()">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="/*[not(self::library)]">
<xsl:message terminate="no">
<xsl:text>The document element should be "library", not "</xsl:text>
<xsl:value-of select="name()"/>
<xsl:text>"!</xsl:text>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="/*/*[not(self::book)]">
<xsl:message terminate="no">
<xsl:text>The children elements of library should be "book", not "</xsl:text>
<xsl:value-of select="name()"/>
<xsl:text>"!</xsl:text>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="library/@*">
<xsl:message terminate="no">
<xsl:text>The "library" element should have no attribute, </xsl:text>
<xsl:value-of select="name()"/>
<xsl:text> shouldn't appear!</xsl:text>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="library/text()[normalize-space()]">
<xsl:message terminate="no">
<xsl:text>The "library" element should have no text, "</xsl:text>
<xsl:value-of select="normalize-space()"/>
<xsl:text>" shouldn't appear!</xsl:text>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="book/@*[name() != 'id']">
<xsl:message terminate="no">
<xsl:text>The "book" element should have no other attribute than "id", </xsl:text>
<xsl:value-of select="name()"/>
<xsl:text> shouldn't appear!</xsl:text>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="book/@id">
<xsl:if test=". = ../preceding-sibling::book/@id">
<xsl:message terminate="no">
<xsl:text>The "book" id should be unique, </xsl:text>
<xsl:value-of select="."/>
<xsl:text> is duplicated.</xsl:text>
</xsl:message>
</xsl:if>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
</xsl:stylesheet>
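To see what this open stylesheet does in practice, here is a hypothetical instance document. The "note" element, the "lang" attribute and the "magazine" element are invented for this illustration and are not part of the tutorial's example. The comments describe which templates should react, assuming that "id" attributes are handled by the book/@id template.

```xml
<?xml version="1.0"?>
<library>
  <!-- Accepted: no template forbids extra children or text inside book. -->
  <book id="_0836217462">
    <note>Signed copy</note>
  </book>
  <!-- Two messages expected: the id duplicates the one of the previous book,
       and "lang" is an attribute other than "id". -->
  <book id="_0836217462" lang="en"/>
  <!-- One message expected: the children of library should be "book". -->
  <magazine/>
</library>
```

Since every message uses terminate="no", the transformation carries on after each problem and all three messages are reported in a single run.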
A closed schema works the other way round: it defines default templates which forbid everything (except possibly "empty" text nodes):
<xsl:template match="*">
<xsl:message terminate="no">
<xsl:text>Forbidden element:</xsl:text>
<xsl:value-of select="name()"/>
</xsl:message>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="@*">
<xsl:message terminate="no">
<xsl:text>Forbidden attribute:</xsl:text>
<xsl:value-of select="name()"/>
</xsl:message>
</xsl:template>
<xsl:template match="text()"/>
<xsl:template match="text()[normalize-space()]">
<xsl:message terminate="no">
<xsl:text>Forbidden text:</xsl:text>
<xsl:value-of select="."/>
</xsl:message>
</xsl:template>
We then define everything which is allowed, which is in fact very few things:
<xsl:template match="/library">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="library/book">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
<xsl:template match="book/@id[not (.=../preceding-sibling::book/@id)]">
<xsl:apply-templates select="*|@*|text()"/>
</xsl:template>
Technically speaking, Schematron is a concise formalization of one of the examples we have just seen: it generates an XSLT transformation which is an open schema (everything which has not been forbidden is allowed) with the tests inside the templates.
That being said, XSLT is totally hidden from the Schematron user, who only needs to know the Schematron syntax and XPath, which is used to express the rules.
A Schematron schema is composed of a set of patterns, each pattern including one or more rules and each rule being composed of asserts and reports. To present the syntax used by Schematron, however, we'll take it "bottom-up" and start with asserts and reports before seeing how they are combined into rules, patterns and schemas.
The "assert" and "report" elements are where the rules are defined in a Schematron schema. Both carry a "test" attribute which is an XPath expression; they differ in when their message is raised: an assert raises it when the test is false, a report when the test is true.
There are some goodies which we will not cover in this tutorial, but the basic syntax is:
<sch:assert test="library">
The document element should be 'library'.
</sch:assert>
Which raises an error with the corresponding message if there is no "library" element under the context node, or:
<sch:report test="@*">
The library element should not contain attributes.
</sch:report>
Which raises an error if there is any attribute under the context node.
In both cases, the context node is set by the "rule" parent element of the report or assert node.
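Since an assert fires when its test is false and a report fires when its test is true, one can be turned into the other by negating the test. The sketch below states the same constraint twice to show the correspondence; a real schema would keep only one of the two. The xmlns:sch declaration on the rule is there only to make the fragment well-formed on its own.

```xml
<sch:rule context="/" xmlns:sch="http://www.ascc.net/xml/schematron">
  <!-- assert: the message is raised when the test is false. -->
  <sch:assert test="library">
    The document element should be 'library'.
  </sch:assert>
  <!-- report: the message is raised when the test is true, hence the not(). -->
  <sch:report test="not(library)">
    The document element should be 'library'.
  </sch:report>
</sch:rule>
```

Which form to use is mostly a matter of readability: assert states what is expected, report states what is forbidden.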
Schematron rule elements are roughly equivalent to XSLT templates and are used to define the context under which a set of assert and report elements will be performed.
An example of a rule (without bells and whistles) performing the tests done in our open schema on the book element could be:
<sch:rule context="book">
<sch:report test="@*[name() != 'id']">
The book element should not include any attribute other than "id".
</sch:report>
<sch:assert test="@*[namespace-uri() = '']">
The book element should not include any attribute other than "id" (namespace).
</sch:assert>
<sch:report test="@id = preceding-sibling::book/@id">
The book id should be unique.
</sch:report>
</sch:rule>
Some notes about rules:
Pattern elements are sets of rules which are evaluated independently (technically using different modes in the XSLT stylesheet generated out of the Schematron schema).
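To make the parallel with XSLT concrete, here is a hand-written approximation of what a pattern containing a single rule on the book element could look like once translated. This is only a sketch of the idea, not the actual output of any Schematron implementation, and the mode name "pattern1" is invented for the illustration.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- One mode per pattern: run the (hypothetical) "pattern1" pass from the root. -->
  <xsl:template match="/">
    <xsl:apply-templates select="/" mode="pattern1"/>
  </xsl:template>
  <!-- A rule with context "book" becomes a template matching "book" in that mode. -->
  <xsl:template match="book" mode="pattern1">
    <!-- report: message when the test is true. -->
    <xsl:if test="@*[name() != 'id']">
      <xsl:message>The book element should not include any attribute other than "id".</xsl:message>
    </xsl:if>
    <!-- assert: message when the test is false, hence the not(). -->
    <xsl:if test="not(@*[namespace-uri() = ''])">
      <xsl:message>The book element should not include any attribute other than "id" (namespace).</xsl:message>
    </xsl:if>
    <xsl:if test="@id = preceding-sibling::book/@id">
      <xsl:message>The book id should be unique.</xsl:message>
    </xsl:if>
    <xsl:apply-templates select="*" mode="pattern1"/>
  </xsl:template>
  <!-- Nodes without a rule are silently accepted: the schema is open. -->
  <xsl:template match="*|/" mode="pattern1">
    <xsl:apply-templates select="*" mode="pattern1"/>
  </xsl:template>
</xsl:stylesheet>
```

A second pattern would get its own mode and its own pass over the document, which is what makes patterns independent from each other.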
An example of a pattern roughly equivalent to our open XSLT schema could be:
<sch:pattern>
<sch:rule context="/">
<sch:assert test="library">
The document element should be 'library'.
</sch:assert>
</sch:rule>
<sch:rule context="library">
<sch:report test="*[not(self::book)]">
The library element should contain only book elements.
</sch:report>
<sch:report test="@*">
The library element should not contain attributes.
</sch:report>
<sch:report test="text()[normalize-space()]">
The library element should not contain text.
</sch:report>
</sch:rule>
<sch:rule context="book">
<sch:report test="@*[name() != 'id']">
The book element should not include any attribute other than "id".
</sch:report>
<sch:assert test="@*[namespace-uri() = '']">
The book element should not include any attribute other than "id" (namespace).
</sch:assert>
<sch:report test="@id = preceding-sibling::book/@id">
The book id should be unique.
</sch:report>
</sch:rule>
</sch:pattern>
One of the differences from what we implemented is that Schematron will stop the evaluation of a pattern after the first error found (following the order of the source tree). If we want to be able to raise several errors, we have to spread our rules across several different patterns.
Finally, the schema element is the document element of a Schematron schema and basically includes a title and one or more patterns. To implement our rules within separate patterns, we could write:
<?xml version="1.0" encoding="utf-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
<sch:title>Example Schematron Schema</sch:title>
<sch:pattern>
<sch:rule context="/">
<sch:assert test="library">
The document element should be 'library'.
</sch:assert>
</sch:rule>
</sch:pattern>
<sch:pattern>
<sch:rule context="library">
<sch:report test="*[not(self::book)]">
The library element should contain only book elements.
</sch:report>
<sch:report test="@*">
The library element should not contain attributes.
</sch:report>
<sch:report test="text()[normalize-space()]">
The library element should not contain text.
</sch:report>
</sch:rule>
</sch:pattern>
<sch:pattern>
<sch:rule context="book">
<sch:report test="@*[name() != 'id']">
The book element should not include any attribute other than "id".
</sch:report>
<sch:assert test="@*[namespace-uri() = '']">
The book element should not include any attribute other than "id" (namespace).
</sch:assert>
<sch:report test="@id = preceding-sibling::book/@id">
The book id should be unique.
</sch:report>
</sch:rule>
</sch:pattern>
</sch:schema>
Solving the same problem with Schematron and XSLT shows the nature of Schematron, which is a subset of XSLT tailored to XML validation through open rule-based schemas.
Why do I insist so much on the openness of Schematron schemas?
Because the default behavior of Schematron is to be open, but it is still possible to write closed (or semi-closed) schemas with Schematron, even though this isn't common practice.
The main trick when doing so is to note that the rules within a Schematron pattern are evaluated in lexical order rather than following the priority rules defined by XSLT. The default rules which forbid content not described in any other rule therefore need to be located after all the other rules, as in:
<?xml version="1.0" encoding="utf-8"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
<sch:title>Example Schematron Schema</sch:title>
<sch:pattern>
<sch:rule context="/library">
<sch:report test="@*">
The library element should not contain attributes.
</sch:report>
</sch:rule>
<sch:rule context="library/book">
<sch:report test="@*[name() != 'id']">
The book element should not include any attribute other than "id".
</sch:report>
<sch:assert test="@*[namespace-uri() = '']">
The book element should not include any attribute other than "id" (namespace).
</sch:assert>
<sch:report test="@id = preceding-sibling::book/@id">
The book id should be unique.
</sch:report>
</sch:rule>
<sch:rule context="*">
<sch:report test="1">
Element "<sch:name/>" forbidden under "<sch:name path=".."/>".
</sch:report>
</sch:rule>
<sch:rule context="text()[normalize-space()]">
<sch:report test="1">
Text forbidden in "<sch:name path=".."/>" element.
</sch:report>
</sch:rule>
</sch:pattern>
</sch:schema>
Note also the use of the "name" element and the fact that rules can't be defined for attributes.
We have seen that a schema can be described as a set of rules formalized using a language such as Schematron or XSLT (other languages such as Prolog are probably good candidates too). The fact that this is possible doesn't prove that it's easy, and people have developed other classes of more specific schema languages which describe the structure of the documents rather than the rules to apply to validate them.
RELAX NG is the main example of such languages, which are called "grammar based" since they describe documents in the manner of a BNF adapted to describing XML trees.
Although its syntax is very different from XPath, RELAX NG is all about named patterns allowed in the structure.
The description of our simplified library could be:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
<start>
<element name="library">
<zeroOrMore>
<element name="book">
<attribute name="id"/>
</element>
</zeroOrMore>
</element>
</start>
</grammar>
This schema reads almost like plain English and describes 'a grammar starting with a document element named "library" containing zero or more elements named "book" with an attribute named "id"'. It is equivalent to our XSLT closed schema accepting only "/library", "/library/book" and "/library/book/@id", except that the restriction on ids being unique is not captured (yet) in our RELAX NG schema.
The XML syntax of this schema is still quite verbose, and James Clark has proposed an equivalent yet more concise non-XML syntax. Using this syntax, our schema would become:
grammar {
start = element library{
element book {attribute id {text}} *
}
}
This syntax has roughly the same meaning, except that a) it's non-XML and b) some DTD goodies are used: here the "*" means "zero or more", and we will see more of these goodies in more complete examples later on.
We are still behind what we implemented with our XSLT and Schematron schemas, which tested the uniqueness of the book identifiers. Although it is generally impossible to implement with a grammar-based XML schema language all the constraints which can be expressed as rules, this example has been chosen so that we can find a way to define a nearly equivalent constraint with RELAX NG.
This is possible through a set of features designed to achieve a certain level of compatibility with XML DTDs, combined with the ability of RELAX NG to interface with datatype systems.
The datatype library to use in this case is "http://relaxng.org/ns/compatibility/datatypes/1.0" and the datatype to use is "ID", since our "id" attributes can be considered as DTD ID attributes (they are globally unique within a document and they match the XML "Name" production, which is why the book identifiers start with an underscore rather than a digit).
The amended schema to express this new constraint becomes:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0">
<start>
<element name="library">
<zeroOrMore>
<element name="book">
<attribute name="id">
<data type="ID"/>
</attribute>
</element>
</zeroOrMore>
</element>
</start>
</grammar>
The syntax is still straightforward: the attribute is now specified as holding data of type "ID" per the datatype library "http://relaxng.org/ns/compatibility/datatypes/1.0" defined through the datatypeLibrary attribute of an ancestor of the "data" element.
The non XML syntax uses a namespace prefix declaration (also available in the XML syntax) and becomes:
datatypes dtd = "http://relaxng.org/ns/compatibility/datatypes/1.0"
grammar {
start = element library{
element book {attribute id {dtd:ID}} *
}
}
We will see later on that these datatypes are not without side effects: they exist to provide compatibility with DTDs, and they emulate DTDs to the point of affecting the flexibility of RELAX NG.
Throughout our brief experience with RELAX NG, we've been manipulating patterns, and it's worth coming back to this concept, which is really fundamental.
The basic thing to note is that when we write something such as "element library{element book {attribute id {dtd:ID}} *}", we are not giving definitions of what the elements "library" and "book" and the attribute "id" are, but defining a pattern of nodes which may appear in the documents.
In this respect, we are much closer here to the schemas which we have written with XSLT or Schematron than to the schemas we will write later on with W3C XML Schema. The meaning of the pattern defined above is "accept here an element node library with children element nodes book having an id attribute holding data of type ID".
The nodes manipulated in this pattern are always anonymous, which means that we cannot refer to these nodes elsewhere in the schema. What is possible, though, is to define global named patterns (comparable to named templates in an XSLT transformation) and to refer to these patterns in other patterns.
The syntax to define a named pattern holding the set of book elements would be:
<define name="bookElements">
<zeroOrMore>
<element name="book">
<attribute name="id">
<data type="ID"/>
</attribute>
</element>
</zeroOrMore>
</define>
or (non XML):
bookElements = element book {attribute id {dtd:ID}} *
And a reference to this pattern would be:
<start>
<element name="library">
<ref name="bookElements"/>
</element>
</start>
or (non XML):
start = element library{ bookElements }
Note that there is no restriction on the "content" located in named patterns. We have chosen here to include a set of zero or more book elements, but we could also have created patterns to include a single book element or the id attribute. In every case, named patterns are containers: even when a named pattern contains a single element, it's a pattern containing a single element rather than a definition of this element.
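As an illustration of this freedom, here is a sketch of the same simplified schema cut at a different granularity: one named pattern holds a single book element, another holds the id attribute alone, and the repetition is expressed where the pattern is referenced. The pattern names "bookElement" and "idAttribute" are arbitrary choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0">
  <start>
    <element name="library">
      <!-- The repetition now lives at the point of reference. -->
      <zeroOrMore>
        <ref name="bookElement"/>
      </zeroOrMore>
    </element>
  </start>
  <!-- A named pattern containing a single element. -->
  <define name="bookElement">
    <element name="book">
      <ref name="idAttribute"/>
    </element>
  </define>
  <!-- A named pattern containing a single attribute. -->
  <define name="idAttribute">
    <attribute name="id">
      <data type="ID"/>
    </attribute>
  </define>
</grammar>
```

This grammar accepts the same documents as the version using "bookElements"; only the way the schema is split into reusable pieces differs.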
It's now time to add some more elements to explore more features from RELAX NG... let's describe the "author" element:
<author id="Charles-M.-Schulz">
<name>
Charles M. Schulz
</name>
<nickName>
SPARKY
</nickName>
<born>
1922-11-26
</born>
<dead>
2000-02-12
</dead>
</author>
Since the definition of the id attribute is common to several elements, we can isolate it in a pattern:
<define name="idAttribute">
<attribute name="id">
<data type="ID" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
</attribute>
</define>
or:
idAttribute = attribute id {dtd:ID}
This description of the author element is straightforward using the few features which we've already seen:
<element name="author">
<ref name="idAttribute"/>
<element name="name">
<text/>
</element>
<element name="nickName">
<text/>
</element>
<element name="born">
<text/>
</element>
<element name="dead">
<text/>
</element>
</element>
or:
element author {
idAttribute,
element name {text},
element nickName {text},
element born {text},
element dead {text}
}
Note that we have defined all the sub-elements as "text", meaning that they can hold any text node. We could also use a datatype library such as the W3C XML Schema datatype library, which we can declare as the default datatype library since we've specified the datatype library used for the id attribute in its own definition.
The definition then involves choosing the right type for each of the elements. Here, for instance, we've been lucky enough to have dates expressed in the ISO 8601 date format supported by W3C XML Schema and can use this type in our schema. For string types, we need to choose between "token" and "string" depending on how we want whitespace to be normalized (token applies full whitespace normalization and trimming while string applies none). Depending on these choices, our definition might become:
<element name="author">
<ref name="idAttribute"/>
<element name="name">
<data type="token"/>
</element>
<element name="nickName">
<data type="token"/>
</element>
<element name="born">
<data type="date"/>
</element>
<element name="dead">
<data type="date"/>
</element>
</element>
or:
element author {
idAttribute,
element name {xs:token},
element nickName {xs:token},
element born {xs:date},
element dead {xs:date}
}
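The difference between "token" and "string" is easiest to see with a fixed value. The sketch below uses the RELAX NG "value" pattern, which is not otherwise covered in this tutorial, purely as an illustration; the fixed values are taken from our example document and would of course make no sense in a real schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="author">
      <!-- token: leading, trailing and repeated whitespace is ignored,
           so "SPARKY" surrounded by line breaks still matches. -->
      <element name="nickName">
        <value type="token">SPARKY</value>
      </element>
      <!-- string: no normalization, so only the exact character
           sequence "Charles M. Schulz" matches. -->
      <element name="name">
        <value type="string">Charles M. Schulz</value>
      </element>
    </element>
  </start>
</grammar>
```

With this grammar, a nickName written on its own line between the tags is accepted, while a name written the same way is rejected because the surrounding line breaks are part of the string being compared.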
Writing the full schema for the complete example is pretty much repeating the same process:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
<start>
<element name="library">
<oneOrMore>
<choice>
<ref name="bookElement"/>
<ref name="authorElement"/>
<ref name="characterElement"/>
</choice>
</oneOrMore>
</element>
</start>
<define name="idAttribute">
<attribute name="id">
<data type="ID" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
</attribute>
</define>
<define name="idrefAttribute">
<attribute name="id">
<data type="IDREF" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
</attribute>
</define>
<define name="bookElement">
<element name="book">
<ref name="idAttribute"/>
<element name="isbn">
<data type="token"/>
</element>
<element name="title">
<data type="token"/>
</element>
<zeroOrMore>
<element name="author-ref">
<ref name="idrefAttribute"/>
</element>
</zeroOrMore>
<zeroOrMore>
<element name="character-ref">
<ref name="idrefAttribute"/>
</element>
</zeroOrMore>
</element>
</define>
<define name="authorElement">
<element name="author">
<ref name="idAttribute"/>
<element name="name">
<data type="token"/>
</element>
<element name="nickName">
<data type="token"/>
</element>
<element name="born">
<data type="date"/>
</element>
<element name="dead">
<data type="date"/>
</element>
</element>
</define>
<define name="characterElement">
<element name="character">
<ref name="idAttribute"/>
<element name="name">
<data type="token"/>
</element>
<element name="since">
<data type="date"/>
</element>
<element name="qualification">
<data type="string"/>
</element>
</element>
</define>
</grammar>
or:
datatypes dtd = "http://relaxng.org/ns/compatibility/datatypes/1.0"
datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
grammar {
start = element library{ (bookElement|authorElement|characterElement)+ }
idAttribute = attribute id {dtd:ID}
idrefAttribute = attribute id {dtd:IDREF}
bookElement = element book {
idAttribute,
element isbn {xs:token},
element title {xs:token},
element author-ref{idrefAttribute} *,
element character-ref{idrefAttribute} *
}
authorElement = element author {
idAttribute,
element name {xs:token},
element nickName {xs:token},
element born {xs:date},
element dead {xs:date}
}
characterElement = element character {
idAttribute,
element name {xs:token},
element since {xs:date},
element qualification {xs:string}
}
}
Note the use, in the definition of the "library" element, of the "choice" element (XML syntax), represented in the non-XML syntax by the "|" operator. The meaning of this compositor is to allow exactly one of the listed alternatives. Here, the choice is wrapped in "oneOrMore" ("+" in the non-XML syntax), which means that the choice may be repeated indefinitely but must occur at least once.
There are cases where the relative order between elements doesn't matter to the application. For instance, one may wonder what the point is of constraining the order of the sub-elements of "author" and requiring document writers to write:
<author id="Charles-M.-Schulz">
<name>
Charles M. Schulz
</name>
<nickName>
SPARKY
</nickName>
<born>
1922-11-26
</born>
<dead>
2000-02-12
</dead>
</author>
rather than
<author id="Charles-M.-Schulz">
<name>
Charles M. Schulz
</name>
<dead>
2000-02-12
</dead>
<born>
1922-11-26
</born>
<nickName>
SPARKY
</nickName>
</author>
After all, the elements have names and it's not much more complex to write applications which retrieve the information they need whatever the order of the sub-elements. So why should we bother document writers with respecting a fixed order?
RELAX NG allows such definitions without any restriction through the use of "interleave" elements (XML syntax) or the "&" operator (non-XML syntax). The updated definition of the author element, removing the restriction on the order of the sub-elements, would be:
<element name="author">
<ref name="idAttribute"/>
<interleave>
<element name="name">
<data type="token"/>
</element>
<element name="nickName">
<data type="token"/>
</element>
<element name="born">
<data type="date"/>
</element>
<element name="dead">
<data type="date"/>
</element>
</interleave>
</element>
or:
element author {
idAttribute&
element name {xs:token}&
element nickName {xs:token}&
element born {xs:date}&
element dead {xs:date}
}
Note that this applies even when the number of occurrences of some of the sub-elements is greater than one, as for our "book" element:
bookElement = element book {
idAttribute&
element isbn {xs:token}&
element title {xs:token}&
element author-ref{idrefAttribute} *&
element character-ref{idrefAttribute} *
}
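For reference, the same interleaved definition of the book element in the XML syntax could look like the fragment below. It is a sketch meant to replace the "bookElement" definition in the full grammar shown earlier, so it relies on the "idAttribute" and "idrefAttribute" patterns and on the datatype library declared there.

```xml
<define name="bookElement" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="book">
    <interleave>
      <ref name="idAttribute"/>
      <element name="isbn">
        <data type="token"/>
      </element>
      <element name="title">
        <data type="token"/>
      </element>
      <!-- The repeated elements may be freely mixed with the others. -->
      <zeroOrMore>
        <element name="author-ref">
          <ref name="idrefAttribute"/>
        </element>
      </zeroOrMore>
      <zeroOrMore>
        <element name="character-ref">
          <ref name="idrefAttribute"/>
        </element>
      </zeroOrMore>
    </interleave>
  </element>
</define>
```

With this definition, a book may for instance list a character-ref before its title, or alternate author-ref and character-ref elements.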
If we come back to our highly simplified example with only "library" and "book" elements, we have achieved a pretty good equivalence with the closed schemas previously developed with XSLT and Schematron, and you may wonder whether we can open our schema to allow arbitrary text and element nodes within our book element, as we were able to do before.
The first step is to define an open pattern accepting any element. There is no predefined pattern for this in RELAX NG, but it is not a big deal with everything we've seen so far plus one new goody: the "anyName" element, which implements name wildcards ("*" in the non-XML syntax):
<define name="anyElement">
<element>
<anyName/>
<zeroOrMore>
<choice>
<attribute>
<anyName/>
</attribute>
<text/>
<ref name="anyElement"/>
</choice>
</zeroOrMore>
</element>
</define>
or:
anyElement = element * {(attribute * {text}|text|anyElement)*}
The other thing to note is that recursive patterns are allowed when the recursion happens within an element, as is the case here.
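By contrast, a recursion which does not pass through an element is not allowed. The sketch below shows the kind of BNF-style definition one might be tempted to write for a list of books; the pattern name "books" is invented for the illustration, and a RELAX NG validator is expected to reject this schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="library">
      <ref name="books"/>
    </element>
  </start>
  <!-- Not allowed: "books" refers to itself without going through an element. -->
  <define name="books">
    <choice>
      <empty/>
      <group>
        <element name="book">
          <attribute name="id"/>
        </element>
        <ref name="books"/>
      </group>
    </choice>
  </define>
</grammar>
```

The intent of such a definition is expressed in RELAX NG with "zeroOrMore" rather than with recursion.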
The surprise comes when we try to use this named pattern in our book element:
<element name="book">
<attribute name="id">
<data type="ID"/>
</attribute>
<zeroOrMore>
<choice>
<ref name="anyElement"/>
<text/>
</choice>
</zeroOrMore>
</element>
or:
element book {attribute id {dtd:ID}, (anyElement|text)*}
The schema is then detected as invalid with the following error:
Error at URL ...
line number 5, column number 22:
conflicting ID-types for attribute "id" of element "book"
We've been hit by a side effect of the DTD compatibility library used for our id attribute. To make sure that this is not a limitation of the RELAX NG language itself, we can just change the definition of this attribute to be plain text:
<element name="book">
<attribute name="id">
<text/>
</attribute>
<zeroOrMore>
<choice>
<ref name="anyElement"/>
<text/>
</choice>
</zeroOrMore>
</element>
or:
element book {attribute id {text}, (anyElement|text)*}
And our schema becomes valid.
What's happening here is that, to emulate the behavior of a DTD, RELAX NG requires that if an ID attribute is defined somewhere in an element, the same ID attribute must be defined in all the other definitions of this element. This is not the case in the definition of our "anyElement" pattern which may, through the wildcard, include a "book" element which does not carry a mandatory id attribute of type dtd:ID.
To work around this issue, we may either avoid using the ID type as shown above or, if we want to use this type, exclude or handle separately the case of a book element included as a sub-element of the top-level book. This exclusion can be done through the "except" and "name" elements (or the "-" operator in the non-XML syntax):
<define name="anyElement">
<element>
<anyName>
<except>
<name>book</name>
</except>
</anyName>
<zeroOrMore>
<choice>
<attribute>
<anyName/>
</attribute>
<text/>
<ref name="anyElement"/>
</choice>
</zeroOrMore>
</element>
</define>
or:
anyElement = element * - book {(attribute * {text}|text|anyElement)*}
RELAX NG has some other nice features which we will not cover here and which are detailed in the very good tutorial available on its web site (http://relaxng.org), such as:
There are two ways, or two axes, along which an XML schema language can be object oriented, and W3C XML Schema, which we will be covering now, is object oriented along both axes.
The first way to be object oriented is to specialize the patterns described by grammar based schema languages depending on the node set being described.
XML DTDs are object oriented per this definition since the definition of elements and attributes is clearly differentiated.
W3C XML Schema is also object oriented per this definition, since different mechanisms with different syntax, features and semantics are available to define patterns depending on whether the node set being described is a single element, a single attribute, a text node, the content of an element, a set of elements or a set of attributes.
The second way for an XML schema language to be object oriented is to facilitate the mapping between XML structures and programming objects by adding type (ie class) information to the instance documents and formalizing the hierarchy between the classes. We will see that W3C XML Schema is strongly object oriented per this second definition.
While the basic item of RELAX NG was the pattern, the basic item of W3C XML Schema is the component, ie a specialized pattern which can take six different forms to describe:
These components have different syntaxes and features. They can all be defined "globally" and then reused by reference like RELAX NG named patterns, and all of them (except element and attribute groups) can also be used anonymously in the context where they are defined.
They also have different semantics. Elements and attributes clearly match real XML productions and are there to describe what's in the document. Groups of elements and attributes are containers pretty close to generic RELAX NG patterns (except that they can't mix element, text and attribute nodes).
Simple and complex types are more central to the architecture of W3C XML Schema based systems: they describe the content of the elements and attributes and, since they can be derived, they are very similar to the classes of object oriented systems and play the same role.
If for instance, our author element is known as being (or being derived from) a generic type "person", we can assume that a generic application designed to do something useful with this type will be able to process the content of the "author" element.
Let's get started by defining a first schema for our very simple example:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library">
<xs:complexType>
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
This is probably the easiest schema which we can write to describe our document. It's defining a "library" element having an anonymous complex type composed of a sequence of "book" elements.
The book element is defined locally and has itself an anonymous complex type composed of an "id" attribute.
This is fairly similar to our first RELAX NG schema. The differences are the anonymous complex types, which need to be added with W3C XML Schema; the sequence, which was implicit in RELAX NG; and the way of expressing the number of occurrences, which is more precise with W3C XML Schema than with RELAX NG.
The last difference between the two schemas is the absence of a start element (or equivalent) to define the elements which may be used as document elements in W3C XML Schema. This is a frequently asked question: all the elements globally defined in a schema can be used as document elements.
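For illustration, here is a minimal instance document which this first schema would be expected to accept. It is a cut-down version of our example, keeping only the "library" and "book" elements and the "id" attributes that the schema describes:

```xml
<?xml version="1.0"?>
<library>
  <!-- "book" has a complex type declaring only the "id" attribute,
       so it must be empty in this simplified schema -->
  <book id="_0836217462"/>
  <book id="_0805033106"/>
</library>
```

Since minOccurs is 0, an empty "library" element would be accepted as well.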
A first variation over this first schema can be to use global types.
It's a "building blocks" process and we need to start at the lowest level, which in our case is the "id" attribute. We will take the opportunity to add a new constraint to our "id", which as a first approximation can be considered to be an underscore followed by 10 digits:
<xs:simpleType name="idType">
<xs:restriction base="xs:ID">
<xs:pattern value="_[0-9]{10}"/>
</xs:restriction>
</xs:simpleType>
This defines a new simple type as a restriction of the xs:ID predefined type (W3C XML Schema comes with the library of predefined types which we've already used with RELAX NG and we can use these types as starting points to derive new types) and this restriction is done through applying a "pattern" to the values.
People familiar with regular expressions will have recognized the syntax used in the value attribute of the pattern element. These regular expressions support Unicode, which is why we've preferred the syntax "[0-9]" to a class such as "\d", which would have included the decimal digits of any script supported by Unicode.
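As a sketch of the difference, the following variant (the name "looseIdType" is hypothetical) uses "\d" instead of "[0-9]":

```xml
<xs:simpleType name="looseIdType">
  <xs:restriction base="xs:ID">
    <!-- \d matches any Unicode decimal digit, not only 0 to 9: an underscore
         followed by ten ARABIC-INDIC DIGIT characters (U+0660 to U+0669)
         would match this pattern but not "_[0-9]{10}" -->
    <xs:pattern value="_\d{10}"/>
  </xs:restriction>
</xs:simpleType>
```

Both patterns accept "_0836217462"; only the stricter "[0-9]" form guarantees ASCII digits.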
Now that we have our "idType", we can use it to define the complex type of the "book" element:
<xs:complexType name="bookType">
<xs:attribute name="id" type="idType" use="required"/>
</xs:complexType>
It's a copy of the anonymous type definition previously used except that it needs now to be located at the top level (just under the schema document element), that we've given it a name and that we're using our new simple type.
And we can play the same game again and define a type for the "library" element:
<xs:complexType name="libraryType">
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded" type="bookType"/>
</xs:sequence>
</xs:complexType>
Note that both simple and complex types are referenced through the "type" attribute of elements and attribute definitions.
The full schema being:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library" type="libraryType"/>
<xs:simpleType name="idType">
<xs:restriction base="xs:ID">
<xs:pattern value="_[0-9]{10}"/>
</xs:restriction>
</xs:simpleType>
<xs:complexType name="bookType">
<xs:attribute name="id" type="idType" use="required"/>
</xs:complexType>
<xs:complexType name="libraryType">
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded" type="bookType"/>
</xs:sequence>
</xs:complexType>
</xs:schema>
The point to note here is that (except for the new constraint added on the "id" attribute) this schema validates exactly the same set of documents as the previous one, but might have a different effect on applications if the global types were generic and used in different contexts.
The type information (and more generally the information gathered by W3C XML Schema validators) is merged with the information (or infoset) of the source document, packaged in a "Post Schema Validation Infoset" (PSVI) and potentially accessible to the applications.
These applications may then rely on the datatypes instead of (or in addition to) relying on the element and attribute names to choose the treatment to apply.
A second and independent variation over our first schema is to use global elements and attributes (ie elements and attributes defined at top level which can be reused at different locations and -for elements- used as a document element).
To do so, we define the elements at top level, out of the context in which they will be used:
<xs:element name="library" type="libraryType"/>
<xs:element name="book" type="bookType"/>
<xs:attribute name="id" type="idType"/>
Note that we have removed the declarations about the number of occurrences which are context related.
And we now make a reference to these elements and attributes when we need them:
<xs:complexType name="bookType">
<xs:attribute ref="id" use="required"/>
</xs:complexType>
<xs:complexType name="libraryType">
<xs:sequence>
<xs:element ref="book" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
The "name" attribute has been replaced by a "ref", the type declaration which belongs to the definition is forbidden here and we provide information about the occurrences.
What we see here is that we have been able to use the "element" and "attribute" components directly as patterns.
The same could have been done with our first schema (with local anonymous types), since these features are independent. The full schema would then have been:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library">
<xs:complexType>
<xs:sequence>
<xs:element ref="book" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="book">
<xs:complexType>
<xs:attribute ref="id" use="required"/>
</xs:complexType>
</xs:element>
<xs:attribute name="id" type="xs:ID"/>
</xs:schema>
A point to note here is that using global elements instead of local ones has an impact on the set of XML documents which can be validated by the schema: all the global elements can be used as document element and the following instance document would be valid per our new schemas:
<?xml version="1.0" encoding="utf-8"?>
<book id="_0805033106"/>
We have seen that we can use elements and attributes as patterns, but this doesn't mean that we have to do so: we can also embed them into containers, ie element and attribute groups.
Again, this feature is independent of the previous two features which we have seen, and we can use groups with local elements and attributes or with references to global ones, which can themselves use anonymous or global types.
We won't show the eight different combinations of these "styles" but rather come back to our first schema and implement it using element and attribute groups. Although groups generally have more than one element or attribute, we will see that even with our very limited example, there might still be some added value in doing so!
Again, the first thing to do is to define the attribute group:
<xs:attributeGroup name="idAttribute">
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:attributeGroup>
What we have now is extremely similar to RELAX NG patterns (except, again, that we can't mix different types of nodes in our groups). The advantage of doing so is that the usage of the attribute is now defined in the group and that each element using this group will have the same usage for the attribute. If we had more attributes, the other benefit would be to manipulate all of them in a single container.
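As a sketch of this reuse, a hypothetical "author" element (not yet part of our simplified schema) could reference the same group and would automatically get the same required "id" attribute:

```xml
<xs:element name="author">
  <xs:complexType>
    <!-- use="required" comes from the group definition, not from this reference -->
    <xs:attributeGroup ref="idAttribute"/>
  </xs:complexType>
</xs:element>
```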
Now that we have this group, we can define a group containing our books:
<xs:group name="bookElements">
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attributeGroup ref="idAttribute"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:group>
And here again, we should note that the number of occurrences is now defined in the group. The other effect of having used a group is that the book element is now defined through a local definition and, although the group may be reused, the "book" element cannot be used as a document element as was the case when it was globally defined.
This group can be used to define the library element and the full schema is:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library">
<xs:complexType>
<xs:sequence>
<xs:group ref="bookElements"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:group name="bookElements">
<xs:sequence>
<xs:element name="book" minOccurs="0" maxOccurs="unbounded">
<xs:complexType>
<xs:attributeGroup ref="idAttribute"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:group>
<xs:attributeGroup name="idAttribute">
<xs:attribute name="id" type="xs:ID" use="required"/>
</xs:attributeGroup>
</xs:schema>
Again, we've seen enough features to have pretty much exhausted our simple example and it's time to switch to the complete one. We've learned that there are many more ways than one to write such a schema, since for each attribute and element we have the choice between no fewer than eight different combinations of local vs global types, elements/attributes and groups, and we will certainly not enumerate each of them here! One of them could be:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="book" type="bookType"/>
<xs:element name="author" type="authorType"/>
<xs:element name="character" type="characterType"/>
</xs:choice>
</xs:complexType>
</xs:element>
<xs:complexType name="bookType">
<xs:sequence>
<xs:element name="isbn" type="xs:token"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="author-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="character-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute ref="id"/>
</xs:complexType>
<xs:complexType name="refType">
<xs:attributeGroup ref="idref"/>
</xs:complexType>
<xs:complexType name="authorType">
<xs:sequence>
<xs:element name="name" type="xs:token"/>
<xs:element name="nickName" type="xs:token"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
</xs:sequence>
<xs:attribute ref="id"/>
</xs:complexType>
<xs:complexType name="characterType">
<xs:sequence>
<xs:element name="name" type="xs:token"/>
<xs:element name="since" type="xs:date"/>
<xs:element name="qualification" type="xs:string"/>
</xs:sequence>
<xs:attribute ref="id"/>
</xs:complexType>
<xs:attributeGroup name="idref">
<xs:attribute name="id" type="xs:IDREF" use="required"/>
</xs:attributeGroup>
<xs:attribute name="id" type="xs:ID"/>
</xs:schema>
The rule we've followed here is to give our preference to datatypes except for the attributes (why not?), which gave us the opportunity to show a last advantage of groups: the ability to use a name which is not the name of the elements or attributes they contain.
Here, we have used "id" attributes both for the IDs (in "book", "author" and "character" elements) and for the IDREFs (in "author-ref" and "character-ref"), and we could not define two global attributes with the name "id" and different types. To work around this, we have defined one of them as a global attribute and the second one in a global attribute group.
The two other things to note are the "xs:choice" compositor (with the same meaning as in RELAX NG) and the fact that, in a complex type, the definition of the attributes must always come at the end, after the compositor (xs:sequence, xs:choice or xs:all, which we will see later on) defining the content of the element.
When order doesn't matter, ... in the general case it still matters for W3C XML Schema!
The two cases where W3C XML Schema can handle content models where the order doesn't matter are when all the children elements can be included any number of times and when they can all be included at most once.
In fact we've already seen the first case in the previous schema when we wrote:
<xs:element name="library">
<xs:complexType>
<xs:choice minOccurs="0" maxOccurs="unbounded">
<xs:element name="book" type="bookType"/>
<xs:element name="author" type="authorType"/>
<xs:element name="character" type="characterType"/>
</xs:choice>
</xs:complexType>
</xs:element>
As with RELAX NG, defining a choice with no limit on the number of occurrences does not constrain the order of the items of the choice.
The second case involves a third compositor (in addition to xs:sequence and xs:choice) called xs:all, which is somewhat similar to RELAX NG's "interleave" although much more restrictive. Those restrictions can be summarized by saying that it can only be used when each child element may appear at most once and the order is significant for none of them, and that the recommendation has been careful to explicitly forbid any workaround you can imagine.
In our example, it would be valid for the author and character elements:
<xs:complexType name="authorType">
<xs:all>
<xs:element name="name" type="xs:token"/>
<xs:element name="nickName" type="xs:token"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
</xs:all>
<xs:attribute ref="id"/>
</xs:complexType>
<xs:complexType name="characterType">
<xs:all>
<xs:element name="name" type="xs:token"/>
<xs:element name="since" type="xs:date"/>
<xs:element name="qualification" type="xs:string"/>
</xs:all>
<xs:attribute ref="id"/>
</xs:complexType>
but could not be used for the book element:
<xs:complexType name="bookType">
<xs:sequence>
<xs:element name="isbn" type="xs:token"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="author-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="character-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute ref="id"/>
</xs:complexType>
for which multiple occurrences of author-ref and character-ref are allowed.
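To illustrate what xs:all buys us, here is the "author" element of our example with its children shuffled (values taken from the sample document). It would be rejected by the xs:sequence version of "authorType" but should be accepted by the xs:all version:

```xml
<author id="Charles-M.-Schulz">
  <born>1992-11-26</born>
  <name>Charles M. Schulz</name>
  <dead>2000-02-12</dead>
  <nickName>SPARKY</nickName>
</author>
```

Each child still has to appear exactly once, since minOccurs and maxOccurs default to 1.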
To be honest, I should add that in this specific case there is a workaround using a xs:choice compositor and uniqueness constraints on the elements occurring only once, but this is a rather dirty hack which is not always possible and the error message when the constraint is not met is rather cryptic:
[Error] library-absolutely-noorderNOK.xml:14:14: Duplicate key value [ID Value:
Being a Dog Is a Full-Time Job
] declared for identity constraint of element "book".
This workaround is beyond the scope of this tutorial, but for the record, it involves using xs:choice in the complex type:
<xs:complexType name="bookType">
<xs:choice maxOccurs="unbounded">
<xs:element name="isbn" type="xs:token"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="author-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="character-ref" type="refType" minOccurs="0" maxOccurs="unbounded"/>
</xs:choice>
<xs:attribute ref="id"/>
</xs:complexType>
and adding xs:unique or xs:key constraints to the element:
<xs:element name="book" type="bookType">
<xs:key name="hackOneIsbn">
<xs:selector xpath="isbn"/>
<xs:field xpath="."/>
</xs:key>
<xs:key name="hackOneTitle">
<xs:selector xpath="title"/>
<xs:field xpath="."/>
</xs:key>
</xs:element>
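As a sketch of what this hack catches, the following fragment repeats the "title" element with the same value and would be expected to trigger a duplicate key error such as the one quoted above:

```xml
<book id="_0836217462">
  <title>Being a Dog Is a Full-Time Job</title>
  <isbn>0836217462</isbn>
  <!-- same value as the first title: violates the "hackOneTitle" key -->
  <title>Being a Dog Is a Full-Time Job</title>
</book>
```

Note that a key only checks the uniqueness of the selected values, so two "title" elements with different text would not be caught by this constraint, which is one more reason to consider it a hack.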
Opening our W3C XML Schema is done through wildcards, in a similar fashion to what we've seen with RELAX NG. The main difference is that while RELAX NG gives you the basic building blocks and lets you build your own wildcard, W3C XML Schema gives you ready-made wildcards, which are easier to use when they match what you want to do but also less flexible and thus less powerful.
Here it is just a matter of using one of these wildcards after defining the complex type of the "book" element as "mixed":
<xs:element name="book" minOccurs="0" maxOccurs="unbounded">
<xs:complexType mixed="true">
<xs:sequence>
<xs:any namespace="##any" processContents="skip"
minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attributeGroup ref="idAttribute"/>
</xs:complexType>
</xs:element>
The purpose of this mixed attribute is to allow text nodes within the element in addition to the children elements defined in its model (note that unlike with RELAX NG, we cannot constrain the location or the values of the text nodes in this case).
The "xs:any" wildcard (a similar "xs:anyAttribute" wildcard is also available for attributes) is equivalent to the pattern "anyElement" which we had defined in our RELAX NG schema. Note that we do not have the same problem with the xs:ID type here as we had with the dtd:ID type in RELAX NG: W3C XML Schema does not try to strictly emulate DTDs and does not impose the strict rules imposed by the RELAX NG DTD compatibility library.
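For illustration, a "book" element such as the following should be accepted by this open definition. The "review", "rating" and "em" names are made up for the example; with processContents="skip" the wildcard content only needs to be well formed:

```xml
<book id="_0836217462">
  Free text is allowed between the children because the type is mixed.
  <isbn>0836217462</isbn>
  <review rating="5">Any <em>well formed</em> markup passes.</review>
</book>
```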
We've just mentioned "Schema composition and pattern redefinitions" in the list of RELAX NG features which we do not have time to cover in this tutorial, but we can't afford to skip them for W3C XML Schema, for which these mechanisms are core features and the center of its "object oriented" nature.
The derivation methods and features are quite different depending on the component: elements and attributes cannot be directly derived, element and attribute groups can be derived through schema "redefinition" à la RELAX NG, complex types can be derived by restriction and extension of their content models, and simple types can be derived by restriction of the values which are allowed, by list and by union.
These mechanisms are rather complex to use and we will not cover them in great detail in this tutorial but rather give some simple (though incomplete) examples of simple and complex type derivations.
Although possible per W3C XML Schema, we will not cover here the derivation of complex content models from simple types nor the redefinition of element and attribute groups.
As mentioned the derivation of simple types can be performed in three different fashions:
We have already seen an example of such a derivation when we wrote:
<xs:simpleType name="idType">
<xs:restriction base="xs:ID">
<xs:pattern value="_[0-9]{10}"/>
</xs:restriction>
</xs:simpleType>
The new type "idType" is created through a derivation by restriction of the predefined type "xs:ID" called the "base type" (the "xs" prefix assigned to the W3C XML Schema by the namespace declaration on the "schema" document element indicates that it's a predefined type).
The "restriction" element is used as a container for a set of "facets" restricting the values accepted by the datatype. Here we have defined a single facet which is a "pattern" expressing a regular expression.
The list of facets depends on the predefined type used as a base type (not all the facets are meaningful for all the datatypes) and includes facets to constrain the length of the value, the minimum and maximum values, the number of digits, and so on.
Another example could be a date type for "modern" dates defined as:
<xs:simpleType name="modernDate">
<xs:restriction base="xs:date">
<xs:minInclusive value="1492-01-01"/>
</xs:restriction>
</xs:simpleType>
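A length facet works the same way. As a sketch, a hypothetical "isbnType" (assuming 10 character ISBNs, as in our sample document) could be written:

```xml
<xs:simpleType name="isbnType">
  <xs:restriction base="xs:token">
    <!-- the length is counted after whitespace normalization of the token -->
    <xs:length value="10"/>
  </xs:restriction>
</xs:simpleType>
```

Several facets can be combined in the same restriction, for instance a length and a pattern.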
Derivation by list is the process of creating a list type from an "atomic" datatype, ie a type accepting whitespace separated values of the atomic type.
An example of such a derivation would be to create a type accepting a list of our "idType":
<xs:simpleType name="idTypes">
<xs:list itemType="idType"/>
</xs:simpleType>
Such a type would let us replace our multiple character-ref elements:
<character-ref id="Peppermint-Patty"/>
<character-ref id="Snoopy"/>
<character-ref id="Schroeder"/>
<character-ref id="Lucy"/>
by a single element:
<character-refs ids="Peppermint-Patty Snoopy Schroeder Lucy"/>
Note that RELAX NG has a similar feature which is more flexible since list types can be defined with several different atomic types.
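Here is a minimal sketch of how the "character-refs" element might be declared. The declaration and the "refListType" name are assumptions: since the values of "ids" are references to characters and do not follow the "underscore followed by 10 digits" pattern of "idType", this sketch uses a list of xs:IDREF instead:

```xml
<xs:simpleType name="refListType">
  <!-- whitespace separated list of references -->
  <xs:list itemType="xs:IDREF"/>
</xs:simpleType>
<xs:element name="character-refs">
  <xs:complexType>
    <xs:attribute name="ids" type="refListType" use="required"/>
  </xs:complexType>
</xs:element>
```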
Derivation by union is the process of performing the union of the values accepted by two simple types.
If we want to add to the list of values previously defined a special value "##none" to use instead of an empty string when there are no character references, we could write:
<xs:simpleType name="idTypes">
<xs:union>
<xs:simpleType>
<xs:list itemType="idType"/>
</xs:simpleType>
<xs:simpleType>
<xs:restriction base="xs:token">
<xs:enumeration value="##none"/>
</xs:restriction>
</xs:simpleType>
</xs:union>
</xs:simpleType>
The "idTypes" now validates values valid per one of the two anonymous simple types "list of 'idType'" and "xs:token restricted to an enumeration of the unique value '##none'".
This type can be used for an "ids" attribute accepting values such as:
<character-refs ids="Peppermint-Patty Snoopy Schroeder Lucy"/>
or
<character-refs ids="##none"/>
Note that the same feature can be implemented as a "choice" between patterns in RELAX NG.
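A possible RELAX NG equivalent, given here as a sketch in the XML syntax, combines a "list" pattern and a "value" pattern in a "choice". The use of NCName as the item type is an assumption made to keep the example independent of the DTD compatibility rules for ID types:

```xml
<attribute name="ids"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <choice>
    <!-- either a whitespace separated list of one or more names... -->
    <list>
      <oneOrMore>
        <data type="NCName"/>
      </oneOrMore>
    </list>
    <!-- ...or the special value -->
    <value>##none</value>
  </choice>
</attribute>
```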
Complex types can be derived either by restriction or by extension:
The purpose of the derivation by restriction of a complex type is to reduce the set of structures valid per the datatype and any structure valid per the derived type must also be valid per the base type.
The aim behind this is that if an application knows how to process the base type, it can be sure to be able to process the derived type.
Let's say for instance that we have in our "types library" a "personType" defined as:
<xs:complexType name="personType">
<xs:sequence>
<xs:element name="name" type="xs:token"/>
<xs:element name="nickName" type="xs:token" minOccurs="0" maxOccurs="unbounded"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
<xs:element name="email" type="xs:token" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute ref="id"/>
<xs:attribute name="password" type="xs:token" use="optional"/>
</xs:complexType>
This type is different in many ways from our "authorType": its "nickName" may have an arbitrary number of occurrences, and there are an additional "email" element and an additional "password" attribute. However, any structure valid per our "authorType" is also valid per the "personType": the single occurrence required for the "nickName" element in "authorType" is in the range "[0-unbounded]" defined in the "personType", and the attribute and element from the "personType" which are not allowed in the "authorType" are both optional.
The "authorType" can thus be derived by restriction from the "personType":
<xs:complexType name="authorType">
<xs:complexContent>
<xs:restriction base="personType">
<xs:sequence>
<xs:element name="name" type="xs:token"/>
<xs:element name="nickName" type="xs:token"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
</xs:sequence>
<xs:attribute ref="id"/>
<xs:attribute name="password" type="xs:token" use="prohibited"/>
</xs:restriction>
</xs:complexContent>
</xs:complexType>
Yes, the derivation is more verbose than the direct definition! That may be surprising for object oriented developers, but the purpose of such a derivation is not the modularity of the schema itself (if the "personType" is modified, I will probably have to modify most of the derivations too) but to provide a declaration that those types are derived, which may be used to increase the modularity of the applications processing these documents.
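To see the restriction at work, here is a sketch of content which is valid per "personType" but not per the restricted "authorType". The second "nickName" and the "email" value are made up for the example:

```xml
<author id="Charles-M.-Schulz">
  <name>Charles M. Schulz</name>
  <nickName>SPARKY</nickName>
  <!-- a second nickName: allowed by personType, not by authorType -->
  <nickName>Sparky</nickName>
  <born>1992-11-26</born>
  <dead>2000-02-12</dead>
  <!-- email: allowed by personType, removed by the restriction -->
  <email>sparky@example.org</email>
</author>
```

Removing the two commented elements gives content which is valid per both types, as the derivation by restriction requires.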
The first thing to note is that derivation by extension is not the inverse operation of derivation by restriction.
The purpose of a derivation by extension is to add attributes and elements after the content model of the base type. The aim being that if an application has been designed to process structures valid per a complex type and to ignore the attributes which are not defined in the complex type and the elements included after those which are defined in the complex type, this application can also process complex types derived by extension from this base type.
The elements and attributes added do not need to be optional and documents valid per the base type will not necessarily be valid per the derived type.
Let's say that we have defined in our "types library" an "objectType" as:
<xs:complexType name="objectType">
<xs:sequence>
<xs:element name="name" type="xs:token"/>
</xs:sequence>
<xs:attribute ref="id"/>
</xs:complexType>
That's a rather basic definition for an element including an id attribute and a name element and both our "authorType" and "characterType" can be defined by extension from this "objectType":
<xs:complexType name="authorType">
<xs:complexContent>
<xs:extension base="objectType">
<xs:sequence>
<xs:element name="nickName" type="xs:token"/>
<xs:element name="born" type="xs:date"/>
<xs:element name="dead" type="xs:date"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
<xs:complexType name="characterType">
<xs:complexContent>
<xs:extension base="objectType">
<xs:sequence>
<xs:element name="since" type="xs:date"/>
<xs:element name="qualification" type="xs:string"/>
</xs:sequence>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Note that here, we are just defining the elements (and possibly attributes) which need to be added to the content model of the base type.
Again, the benefit is not so much the modularity of the schema itself (even though this time a modification of the base type will be automatically taken into account in the derived types) but more the modularity of the applications using these schemas. In our case, an application designed to sort and display "objectType" structures by "name" and "id" will know how to sort and display the names and ids of "authorType" and "characterType" structures.
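As a sketch of the remark that documents valid per the base type are not necessarily valid per the derived type, consider this fragment:

```xml
<!-- Valid per "objectType", but not per the derived "authorType",
     which also requires nickName, born and dead after name -->
<author id="Charles-M.-Schulz">
  <name>Charles M. Schulz</name>
</author>
```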
We have not covered namespace support in RELAX NG, which does not bring any restriction (a RELAX NG schema does not describe a namespace but a class of documents which can include elements and attributes from different namespaces) and is well described in the RELAX NG tutorial. We need to say more about namespace support in W3C XML Schema, which is more restrictive and a frequent source of questions.
The basic principle of W3C XML Schema is that a schema can describe by itself one and only one namespace (or lack of namespace) and that one schema per namespace must be used when different namespaces cohabit in a document.
This is true even for attributes from the "xml" namespace such as xml:space, xml:base and xml:lang and a schema for a document using these attributes needs to include a schema for the "xml" namespace.
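A minimal sketch of what this looks like in practice: to allow xml:lang on a "title" element, the schema imports the W3C schema for the "xml" namespace and references its attribute. The "title" declaration is an assumption made for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- the "xml" prefix is predeclared and needs no xmlns declaration -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:element name="title">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute ref="xml:lang"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```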
We've now covered the main features of the W3C XML Schema language and will just give a brief list of what we've left out.
We have already mentioned the restrictions on unordered content models, but we must note that these restrictions are the consequence of two main rules which reduce the scope of XML structures which can be described by W3C XML Schema. The first rule can be summarized as the fact that two different types cannot be defined for an element at the same "context" in a document (I can't say that an author has either a type "authorType" or a type "anotherAuthorType").
The second rule is to forbid what was called "non deterministic" content models by XML DTDs and formalizes the fact that W3C XML Schema validators must be implementable by finite state machines without "look ahead" (at any time during the validation process, the validator must be able to determine where it is located in the alternatives expressed in the schema without having to make any assumption). Violations of this second rule are often due to an unfortunate way of combining the "sequence" and "choice" compositors and can most of the time be solved by rewriting the schema (which is unfortunately not always trivial), but there are content models which are basically non deterministic.
Those two rules are not restrictions for RELAX NG.
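As a sketch of the second rule, the first hypothetical type below is non deterministic: after reading a "name" element, a validator cannot tell which branch of the choice it is in. The second type accepts the same content and is deterministic (both type names are made up for the example):

```xml
<!-- Rejected: both branches of the choice start with "name" -->
<xs:complexType name="ambiguousType">
  <xs:choice>
    <xs:sequence>
      <xs:element name="name" type="xs:token"/>
      <xs:element name="born" type="xs:date"/>
    </xs:sequence>
    <xs:sequence>
      <xs:element name="name" type="xs:token"/>
      <xs:element name="since" type="xs:date"/>
    </xs:sequence>
  </xs:choice>
</xs:complexType>

<!-- Accepted: the common "name" is factored out of the choice -->
<xs:complexType name="deterministicType">
  <xs:sequence>
    <xs:element name="name" type="xs:token"/>
    <xs:choice>
      <xs:element name="born" type="xs:date"/>
      <xs:element name="since" type="xs:date"/>
    </xs:choice>
  </xs:sequence>
</xs:complexType>
```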
The main features which we have skipped are: