James Clark and Rick Jelliffe on RELAX NG and W3C XML Schema

Re: http://www.imc.org/ietf-xml-use/draft-hollenbeck-ietf-xml-guidelines-04.html
Re: http://lists.xml.org/archives/xml-dev/200206/threads.html#00039

RELAX NG and W3C XML Schema

--------------------------------------------------------------------------------
URL: http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html

Subject: RELAX NG and W3C XML Schema 
From: "James Clark" <jjc@jclark.com> 
Date: Tue, 4 Jun 2002 16:18:09 +0700 
List-archive: <http://www.imc.org/ietf-xml-use/mail-archive/> 
List-id: <ietf-xml-use.imc.org> 

--------------------------------------------------------------------------------

I just had a look at draft-hollenbeck-ietf-xml-guidelines-04.  Section
4.6 says "XML Schema should be used as the formalism in the absence of
clearly stated reasons to choose another."  I strongly disagree with
this recommendation.

I believe RELAX NG is preferable in many situations to XML Schema and
should receive at least equal billing.  Concretely, I propose in the
sentence above changing "XML Schema" to "XML Schema or RELAX NG".

Currently, 4.6 mentions RELAX NG in the following terms: "There are
also a number of other mechanisms for describing XML instance
validity; these include, for example, Schematron [48], RELAX NG [47],
and the Document Schema Definition Language [34]."  Firstly, this
mentions RELAX NG and DSDL as if they were separate things.  This is
incorrect.  RELAX NG is in fact Part 2 of DSDL (which now stands for
Document Schema Definition Language*s*).  I don't think RELAX NG is
just another mechanism.  It is a solid, mature and stable
specification.  It has been developed in an open standards process (in
OASIS).  It has multiple, independent and interoperable
implementations.  It is based on a solid body of CS theory (tree
automata). It is on track to become a fully-fledged International
Standard: it recently went out as a Draft International Standard [1].

Certainly no one can deny that at this point W3C XML Schema enjoys
much greater acceptance in the marketplace.  However, I would argue
this should not be the key criterion to use to select which schema
languages to recommend for use in IETF specifications.  I believe the
key function of a schema language in a specification of an XML
application is to communicate unambiguously and precisely to a human
reader what XML documents are legal for that application; it serves a
similar role for XML that ABNF does for text.  Thus, the
key criterion should be how well the schema language performs this
function.

On this criterion, there are many reasons to prefer RELAX NG.

1. RELAX NG was designed to be simple and easy to understand.  RELAX
NG is simple enough that without even reading the RELAX NG spec,
somebody familiar with XML can read a RELAX NG grammar and understand
what it means.  You can learn to write RELAX NG in 30 minutes by
reading the tutorial [2].  RELAX NG is fairly free of surprises.
Constructs mean what you would guess they mean.

This is not the case with W3C XML Schema.  It requires considerable
expertise to be able to understand a W3C XML Schema correctly.  There
are many cases where you cannot guess what a construct means or where
you might guess wrong.  For example, if you derive a complex type by
restriction you have to specify the new restricted content model
explicitly.  However, attributes are treated in the opposite way: by
default you get all the attributes and you have to explicitly rule out
the ones you do not want.  This may be more convenient but it makes for schemas
that can be easily misunderstood by the uninitiated: somebody who is
not an expert, seeing a restriction with a content model but no
attributes, might well assume that no attributes were allowed.  This
is not an isolated example.
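
The asymmetry described above can be illustrated with a minimal sketch.
The type, element and attribute names here (baseList, singleItem,
singleItemNoId, item, id) are hypothetical and chosen only for
illustration; they do not come from the draft or from any real schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical base type: any number of items, plus an optional
       id attribute. -->
  <xs:complexType name="baseList">
    <xs:sequence>
      <xs:element name="item" type="xs:string"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string"/>
  </xs:complexType>

  <!-- The restricted content model is restated in full, but nothing is
       said about attributes, so the id attribute is still allowed. -->
  <xs:complexType name="singleItem">
    <xs:complexContent>
      <xs:restriction base="baseList">
        <xs:sequence>
          <xs:element name="item" type="xs:string"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>

  <!-- To rule the attribute out, it has to be prohibited explicitly. -->
  <xs:complexType name="singleItemNoId">
    <xs:complexContent>
      <xs:restriction base="baseList">
        <xs:sequence>
          <xs:element name="item" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="id" use="prohibited"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
```

A reader who sees only the singleItem definition has no local hint that
an id attribute is permitted; that information lives in the base type.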

There are many things about XML Schema that are just plain bizarre.
Here's a random example I ran across yesterday.  Suppose you have two
attribute groups g1 and g2, containing sets of attributes a1 and a2
and attribute wildcards w1 and w2.  Now suppose you have a complex
type t that references g1 and g2.  The effective attributes of t will,
as you would expect, be the union of a1 and a2, but the attribute
wildcards will be the *intersection* of w1 and w2. For example, given

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
  xmlns="http://eg.com"
  targetNamespace="http://eg.com">

<xs:attributeGroup name="g1">
  <xs:attribute name="a1" type="xs:string"/>
  <xs:anyAttribute namespace="http://eg.com/1 http://eg.com/2"
       processContents="skip"/>
</xs:attributeGroup>

<xs:attributeGroup name="g2">
  <xs:attribute name="a2" type="xs:string"/>
  <xs:anyAttribute namespace="http://eg.com/2" processContents="skip"/>
</xs:attributeGroup>

<xs:element name="foo">
  <xs:complexType>
    <xs:attributeGroup ref="g1"/>
    <xs:attributeGroup ref="g2"/>
  </xs:complexType>
</xs:element>

</xs:schema>

the foo element could have an a1 attribute or an a2 attribute or any
attribute from the http://eg.com/2 namespace, but could not have
attributes from the http://eg.com/1 namespace.

Maybe there's some good reason behind this, but I believe this sort of
design decision makes W3C XML Schema a very poor choice as a formalism
for communicating an XML grammar to a human reader.

2. The problem described in 1 above might be tolerable if the W3C XML
Schema Recommendation [3] were easy to understand. However, it is
without doubt the hardest to understand specification that I have ever
read.  In order to be able to understand the precise meaning of a
schema in an IETF specification, readers would have to consult the
W3C XML Schema Recommendation.  But it is extraordinarily hard for a
reader to determine from the Recommendation the meaning of some
particular construct they are not sure of.

I often hear people say: "It doesn't really matter that the W3C
XML Schema Rec is so hard to understand; only W3C XML Schema
implementors need to do this". I think this is misguided.  People who
want to be sure they have understood exactly what a particular W3C XML
schema means also have to understand the W3C XML Schema Rec.

3. The RELAX NG specification includes a normative, formal description
of the semantics of a RELAX NG schema. This was not developed as an
afterthought but was a guide throughout the design of the semantics.
More than a year after the publication of the W3C XML Schema
Recommendation, "XML Schema: Formal Description" [4] is still a work
in progress and is still far from being a complete and correct
description of the semantics of XML Schema; moreover, it cannot be
relied on as it has no normative force.

The RELAX NG formalism has a solid basis in tree automata theory.  W3C
XML Schema has no such basis.

The role of a schema in a specification is to serve as a formalism.
How good is a formalism if that formalism itself lacks a proper formal
basis?

4. W3C XML Schema's support for attributes is totally inadequate and
provides no advance over DTDs.  As with DTDs, W3C XML Schema only
allows the specification of whether attributes are required or
optional.  There is no way to specify more complex constraints between
attributes or between attributes and elements.  There is no way to say
that either attribute X or attribute Y is allowed or that either
attribute X or element Y is allowed.  In my experience, this sort of
constraint is extremely common in XML grammars.

RELAX NG integrates attributes into content models.  Exactly the same
mechanism that is used to constrain the cooccurrence of child elements
can be used to constrain the cooccurrence of attributes and the
cooccurrence of attributes and child elements.
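
As a minimal sketch of what this looks like in the RELAX NG XML syntax,
the following pattern allows exactly one of an attribute, another
attribute, or a child element.  The names link, href, idref and target
are hypothetical and used only for illustration.

```xml
<element name="link" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- Exactly one of the three alternatives must be present. -->
  <choice>
    <attribute name="href"><text/></attribute>
    <attribute name="idref"><text/></attribute>
    <element name="target"><text/></element>
  </choice>
</element>
```

The same choice element that would normally select between child
elements is used here to select between attributes and an element.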

5. W3C XML Schema provides very weak support for unordered content.
When the designer of an XML vocabulary does not wish to force child
elements to occur in a particular order, it can be impractical to
describe the XML vocabulary using XML Schema, because XML Schema
imposes such limitations on its "all" element as to make it virtually
useless.

RELAX NG provides an "interleave" element, which is restricted
enough to be efficiently implementable but provides adequate support
for designers who do wish to allow flexibility in the ordering of
child elements.
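
A minimal sketch of an unordered content model using interleave might
look like the following.  The element names card, name, email and note
are hypothetical and used only for illustration.

```xml
<element name="card" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- The children may appear in any order; note is optional. -->
  <interleave>
    <element name="name"><text/></element>
    <element name="email"><text/></element>
    <optional>
      <element name="note"><text/></element>
    </optional>
  </interleave>
</element>
```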

6. The approach to handling datatypes in W3C XML Schema is totally
lacking in modularity.  W3C XML Schema is tied to the single
collection of datatypes defined in Part 2 of W3C XML Schema. Yet this
collection of datatypes is a very ad-hoc collection. It includes
datatypes of highly debatable utility (gYearMonth, gDay etc).  Yet it
lacks many datatypes that are important for many applications.

I would argue that no one single collection of datatypes can be
adequate for all applications across the diverse range of domains
supported by XML.  What's needed is a modular approach where a schema
language for specifying structure can be combined with one or more
standard collections of datatypes, some general-purpose and some
domain-specific.  RELAX NG adopts this approach.  You can use the
datatypes defined by W3C XML Schema if you choose, but it is also
possible to use other systems of datatypes instead of or in addition
to these.

With RELAX NG, an IETF specification could define a collection of
datatypes that are useful for IETF applications.  For example, might
it not be useful to have a datatype for an IP address or a domain
name? Such datatypes could be used with RELAX NG with no change to
RELAX NG itself.
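
As a sketch of how this might look, the pattern below mixes a datatype
from a domain-specific library with one from the W3C XML Schema
datatypes.  The library URI http://example.org/ns/ietf-datatypes and the
type name ipAddress are assumptions made up for illustration; no such
IETF library is defined here, and a validator would need an
implementation of that library to check the address attribute.  The
W3C XML Schema datatype library URI is the one RELAX NG processors
conventionally recognize.

```xml
<element name="server" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- Hypothetical library URI and type name, for illustration only. -->
  <attribute name="address">
    <data type="ipAddress"
          datatypeLibrary="http://example.org/ns/ietf-datatypes"/>
  </attribute>
  <!-- A datatype borrowed from W3C XML Schema Part 2. -->
  <attribute name="port">
    <data type="unsignedShort"
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"/>
  </attribute>
</element>
```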

7. In W3C XML Schema there is no way to specify what is allowed as the
root element.  W3C XML Schema does not define a single notion of
validity of a document with respect to a schema.  There are different
varieties of validation (lax and strict) and many different ways to
validate a document against a schema.  From a W3C XML Schema alone, it
is not possible to know what is a valid document.

For example, consider a totally trivial schema like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified"
  xmlns="http://www.example.com"
  targetNamespace="http://www.example.com">

<xs:element name="foo">
  <xs:complexType/>
</xs:element>

</xs:schema>

Now consider a totally bogus document like this:

<bar/>

Believe it or not, the W3C XML Schema processors that I have tried
report this as valid!  The definition of validity is so flexible in
W3C XML Schema as to seriously impact interoperability.  If an
application was relying on the W3C XML Schema validation to screen
out incorrect input, it would be in serious trouble.

With RELAX NG, this sort of bogosity does not arise: there is a clear,
unambiguous notion of validity.  If you have a RELAX NG schema, there
is no doubt about what instances are valid.
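
For comparison, a rough RELAX NG counterpart of the trivial schema above
can be sketched as follows.  This is an illustrative rewrite, not part
of the original draft or of any published schema.

```xml
<element name="foo" ns="http://www.example.com"
         xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- foo has no attributes and no content. -->
  <empty/>
</element>
```

Here the pattern itself says what the root element must be: a document
is valid only if its root is an empty foo element in the
http://www.example.com namespace, so the bogus document consisting of
<bar/> is rejected.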

8. W3C XML Schema provides the xsi:schemaLocation attribute, which
allows an XML document instance to indicate the schema that should be
used to validate the document.  I think this is a serious problem for
a couple of reasons.

One reason is that this is a potential security problem.  One
important use of schemas is to protect an application against invalid
data.  This use of schemas is easily undermined by documents that use
xsi:schemaLocation.

Another reason is that this leads to interoperability problems.  Its
use is not mandated by the XML Rec: it's just a hint.  Yet, in some
implementations, this is the only way to specify the schema to use to
validate the document.

In RELAX NG, validation is treated as a process with two independent
inputs, a schema and an instance to be validated with respect to the
schema.

There is no way in a W3C XML Schema to prohibit the instance from
containing xsi:schemaLocation attributes.  Indeed, this is also the
case for other xsi attributes: there is no way to prevent the document
containing xsi:type attributes.  The use of W3C XML Schema infects the
grammar you are defining. If you want a closed grammar that only
allows specific attributes not including the xsi attributes, you
cannot express that in W3C XML Schema.  RELAX NG has no such magic
attributes.

9. Another problematic area in W3C XML Schema is the support for
infoset augmentation, such as default attributes.  Experience with XML
1.0 has, I believe, shown that this is not a good feature to include
in a schema language.  Apart from being a violation of modularity, it
tends to cause interoperability problems, because it leads to the
possibility of the application getting different information depending
on whether or not validation has been performed.  RELAX NG, by
contrast, never changes the information that an application receives.
It specifies purely what is valid and what is invalid.

I've looked through the archives and I haven't seen any technical
justification for the recommendation of W3C XML Schema as the default
choice of schema language.

Section 1.2 of RFC 2026 lists as two of the goals of the Internet
Standards Process:

- technical excellence
- clear, concise and easily understood documentation

I believe these should be considered in selecting a schema
language. On both of these, I believe RELAX NG is far superior to W3C
XML Schema.  I invite anybody who disagrees to go off and read the two
specifications [3], [5].

I am sorry to have gone on at such length, but I think this is an
important issue.  There seems to be a tendency for people to suspend
their technical judgment when it comes to W3C XML Schema. The attitude
seems to be "It's a W3C Recommendation; everybody is using it, so we
should too, regardless of its technical merits."  I don't think this
attitude serves the best long-term interests of the Internet.  I and
others have sacrificed a huge amount of time and effort to try and
provide the community with a solid, technically credible alternative
and I think it deserves to be considered seriously on its technical
merits and not dismissed on the basis of its current level of market
acceptance.

James

[1] http://www.y12.doe.gov/sgml/sc34/document/0320.htm
[2] http://www.oasis-open.org/committees/relax-ng/tutorial-20011203.html
[3] http://www.w3.org/TR/xmlschema-1
[4] http://www.w3.org/TR/xmlschema-formal/
[5] http://www.oasis-open.org/committees/relax-ng/spec-20011203.html

URL: http://lists.xml.org/archives/xml-dev/200206/msg00059.html

Date:    Wed, 5 Jun 2002 16:24:39 +1000
From:    Rick Jelliffe <ricko@allette.com.au>
To:      xml-dev@lists.xml.org
Subject: Re: [xml-dev] XML Schema considered harmful?

From: "Michael Leditschke" <mike@ammd.com.au>

> In terms of the complexities of specs, yes XML Schema Part 1 makes
> difficult reading but the Primer, Part 0, is quite readable and, to
> their credit, was updated with each release of the spec. It covers the
> ground and I have only occasionally had to refer to Part 1, despite
> designing schemas using a large percentage of the supported constructs.

I had the experience of being *very* familiar with the XML Schema specs,
then going away for a few months.  When I returned, I found them quite
difficult to fathom.  There have been several times when I have not been
able to answer questions from users of our validator and have had to rely on
another Schema expert here.

The issue is not whether it is possible to become a fulltime expert in XML Schemas;
the issue is how much protocol designers should be required to cope with,
and whether IETF should support plurality or be exclusive.

IETF has so far been built on making layers to support plurality, allowing
protocols to thrive on their own.  XML Schemas is monolithic and badly
architected: it will be very difficult to upgrade the bits that are incomplete
(keys and datatypes) because of this. 

> I may have missed it, but the support in RELAX NG seems, by the nature of
> RELAX NG, purely structural. I assume I will need to add Schematron to the
> mix, which is the same situation as with XML Schema currently. 

Thanks for the plug! However there is (at least) one significant difference:
Schematron has not been designed with streaming implementations in mind
(and I am not aware of any streaming implementations):  a schema language 
that requires a DOM be built is not suitable for high-speed transaction validation 
over the Web, which is what we are talking about.  

Now, I am aware of people who have used Schematron for testing incoming
pages and generating custom pages to return to the user to ask them
for missing or incorrect information. But that is a different area. 

> I've probably completely missed the point here, but doesn't an XML Schema
> that only has one global element achieve the above? Maybe it's a matter of
> semantics but that's how it's panned out in practice for me thus far.

But then you cannot use substitution groups:  this is the kind of complexity
that James is talking about I think--the complexity when using one
feature makes another disappear arbitrarily. 

> Don't get me wrong - I don't receive regular brown paper envelopes with
> W3C in the return address, and I'm not saying XML Schema hasn't got warts,
> but it's there and supported and to me, it's not the **HUGE** conceptual
> and learning leap it seems to be painted as in this newsgroup. It achieved
> my 80% and got the project in on time. In the process a number of other
> organisations had to climb the same learning curve and got there.

Were these projects IETF protocols? And are you an XML or schema
expert or, as we can expect IETF people to be, are you only using XML
because it will be more convenient than rolling your own syntax and you
are not an expert?  If I were developing a protocol, I would
take some convincing that XML Schemas was not overkill for my
requirements.

> James is emphatic, and that is only natural, but his arguments paint
> issues as black and white (XML Schema = bad, RELAX NG = good) and my
> experience with XML Schema suggests shades of grey.

But it is not James who is being black and white: it is the draft RFC wanting to 
ban the use of RELAX NG! (And Schematron or the DSDL effort, for that matter!)

> To my mind, the bigger issue to decide is how many schema languages
> the IETF want appearing in RFCs. Simply allowing both means that RFC
> readers have to learn both. And since RELAX NG focusses on structure,
> what will be used to express content based co-constraints? Perhaps it
> would be better to be arguing for DSDL.

DSDL is an ISO standard in several parts, and I think the ISO WG involved
is very keen to not repeat the mistakes of XML Schemas w.r.t premature
standardization.  So the technologies that are mature (now RELAX NG,
shortly Schematron) are being standardized. 

In any case, it seems that many people who are cowed by XML Schemas 
actually write their Schemas as DTDs then convert them using an automated
tool.  I used James' dtdinst program last night for the first time 
(to convert the EAD DTD into RELAX NG) and I found it was
excellent. If there is a large class of users who just learn XML and
are content to automatically convert, they have no requirement that
a single schema language be mandated. I don't think the argument
that people will be confused by multiple schema languages holds water: 
some people will be confused by XML Schemas anyway and 
turn to simplifying tools (e.g. writing in DTDs) or different interfaces.

The best way is to try both schema languages and to get a feel for their
different capabilities.

Clearly XML Schemas has innumerable nice features for transferring data
between backend database systems by big business.  Clearly RELAX NG 
has nice features for multimedia languages and documents.  But are
IETF protocols more like big-business data transfers or like multimedia
languages?  

It would make more sense for the RFC to merely say something like
this:

"Standard schema languages (e.g. ISO RELAX NG or W3C XML Schemas)
should be used in preference to proprietary or non-standard languages. 
Schema languages should be used conservatively: exotic or difficult or 
badly-described features may be badly implemented or used incorrectly 
or be difficult to diagnose."

Cheers
Rick Jelliffe

Prepared by Robin Cover for The XML Cover Pages archive. For schema description and references, see "XML Schemas."