James Clark and Rick Jelliffe on RELAX NG and W3C XML Schema
Re: http://www.imc.org/ietf-xml-use/draft-hollenbeck-ietf-xml-guidelines-04.html
Re: http://lists.xml.org/archives/xml-dev/200206/threads.html#00039
RELAX NG and W3C XML Schema
--------------------------------------------------------------------------------
URL: http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html
Subject: RELAX NG and W3C XML Schema
From: "James Clark" <jjc@jclark.com>
Date: Tue, 4 Jun 2002 16:18:09 +0700
List-archive: <http://www.imc.org/ietf-xml-use/mail-archive/>
List-id: <ietf-xml-use.imc.org>
--------------------------------------------------------------------------------

I just had a look at draft-hollenbeck-ietf-xml-guidelines-04. Section 4.6 says

"XML Schema should be used as the formalism in the absence of clearly stated reasons to choose another."

I strongly disagree with this recommendation. I believe RELAX NG is preferable in many situations to XML Schema and should receive at least equal billing. Concretely, I propose in the sentence above changing "XML Schema" to "XML Schema or RELAX NG".

Currently, 4.6 mentions RELAX NG in the following terms:

"There are also a number of other mechanisms for describing XML instance validity; these include, for example, Schematron [48], RELAX NG [47], and the Document Schema Definition Language [34]."

Firstly, this mentions RELAX NG and DSDL as if they were separate things. This is incorrect. RELAX NG is in fact Part 2 of DSDL (which now stands for Document Schema Definition Language*s*).

I don't think RELAX NG is just another mechanism. It is a solid, mature and stable specification. It has been developed in an open standards process (in OASIS). It has multiple, independent and interoperable implementations. It is based on a solid body of CS theory (tree automata). It is on track to become a fully-fledged International Standard: it recently went out as a Draft International Standard [1].

Certainly no one can deny that at this point W3C XML Schema enjoys much greater acceptance in the marketplace. However, I would argue this should not be the key criterion to use to select which schema languages to recommend for use in IETF specifications.
I believe the key function of a schema language in a specification of an XML application is to communicate unambiguously and precisely to a human reader what XML documents are legal for that application; it serves a similar role for XML that ABNF does for text. Thus, the key criterion should be how well the schema language performs this function. On this criterion, there are many reasons to prefer RELAX NG.

1. RELAX NG was designed to be simple and easy to understand. RELAX NG is simple enough that without even reading the RELAX NG spec, somebody familiar with XML can read a RELAX NG grammar and understand what it means. You can learn to write RELAX NG in 30 minutes by reading the tutorial [2]. RELAX NG is fairly free of surprises. Constructs mean what you would guess they mean.

This is not the case with W3C XML Schema. It requires considerable expertise to be able to understand a W3C XML Schema correctly. There are many cases where you cannot guess what a construct means or where you might guess wrong. For example, if you derive a complex type by restriction you have to specify the new restricted content model explicitly. However, attributes are treated in the opposite way: by default you get all the attributes and you have to explicitly rule out the ones you get. This may be more convenient but it makes for schemas that can be easily misunderstood by the uninitiated: somebody who is not an expert, seeing a restriction with a content model but no attributes, might well assume that no attributes were allowed.

This is not an isolated example. There are many things about XML Schema that are just plain bizarre. Here's a random example I ran across yesterday. Suppose you have two attribute groups g1 and g2, containing sets of attributes a1 and a2 and attribute wildcards w1 and w2. Now suppose you have a complex type t that references g1 and g2.
The effective attributes of t will, as you would expect, be the union of a1 and a2, but the attribute wildcards will be the *intersection* of w1 and w2. For example, given

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified"
               xmlns="http://eg.com"
               targetNamespace="http://eg.com">
      <xs:attributeGroup name="g1">
        <xs:attribute name="a1" type="xs:string"/>
        <xs:anyAttribute namespace="http://eg.com/1 http://eg.com/2"
                         processContents="skip"/>
      </xs:attributeGroup>
      <xs:attributeGroup name="g2">
        <xs:attribute name="a2" type="xs:string"/>
        <xs:anyAttribute namespace="http://eg.com/2"
                         processContents="skip"/>
      </xs:attributeGroup>
      <xs:element name="foo">
        <xs:complexType>
          <xs:attributeGroup ref="g1"/>
          <xs:attributeGroup ref="g2"/>
        </xs:complexType>
      </xs:element>
    </xs:schema>

the foo element could have an a1 attribute or an a2 attribute or any attribute from the http://eg.com/2 namespace, but could not have attributes from the http://eg.com/1 namespace. Maybe there's some good reason behind this, but I believe this sort of design decision makes W3C XML Schema a very poor choice as a formalism for communicating an XML grammar to a human reader.

2. The problem described in 1 above might be tolerable if the W3C XML Schema Recommendation [3] were easy to understand. However, it is without doubt the hardest to understand specification that I have ever read. In order to be able to understand the precise meaning of a schema in an IETF specification, readers would have to consult the W3C XML Schema Recommendation. But it is extraordinarily hard for a reader to determine from the Recommendation the meaning of some particular construct they are not sure of.

I often hear people say: "It doesn't really matter that the W3C XML Schema Rec is so hard to understand; only W3C XML Schema implementors need to do this". I think this is misguided.
People who want to be sure they have understood exactly what a particular W3C XML schema means also have to understand the W3C XML Schema Rec.

3. The RELAX NG specification includes a normative, formal description of the semantics of a RELAX NG schema. This was not developed as an afterthought but was a guide throughout the design of the semantics. More than a year after the publication of the W3C XML Schema Recommendation, "XML Schema: Formal Description" [4] is still a work in progress and is still far from being a complete and correct description of the semantics of XML Schema; moreover, it cannot be relied on as it has no normative force.

The RELAX NG formalism has a solid basis in tree automata theory. W3C XML Schema has no such basis. The role of a schema in a specification is to serve as a formalism. How good is a formalism if that formalism itself lacks a proper formal basis?

4. W3C XML Schema's support for attributes is totally inadequate and provides no advance over DTDs. As with DTDs, W3C XML Schema only allows the specification of whether attributes are required or optional. There is no way to specify more complex constraints between attributes or between attributes and elements. There is no way to say that either attribute X or attribute Y is allowed or that either attribute X or element Y is allowed. In my experience, this sort of constraint is extremely common in XML grammars.

RELAX NG integrates attributes into content models. Exactly the same mechanism that is used to constrain the cooccurrence of child elements can be used to constrain the cooccurrence of attributes and the cooccurrence of attributes and child elements.

5. W3C XML Schema provides very weak support for unordered content.
When the designer of an XML vocabulary does not wish to force child elements to occur in a particular order, it can be impractical to describe the XML vocabulary using XML Schema, because XML Schema imposes such limitations on its "all" element as to make it virtually useless. RELAX NG provides an "interleave" element, which is restricted enough to be efficiently implementable but provides adequate support for designers who wish to allow flexibility in the ordering of child elements.

6. The approach to handling datatypes in W3C XML Schema is totally lacking in modularity. W3C XML Schema is tied to the single collection of datatypes defined in Part 2 of W3C XML Schema. Yet this collection of datatypes is a very ad-hoc collection. It includes datatypes of highly debatable utility (gYearMonth, gDay etc). Yet it lacks many datatypes that are important for many applications.

I would argue that no one single collection of datatypes can be adequate for all applications across the diverse range of domains supported by XML. What's needed is a modular approach where a schema language for specifying structure can be combined with one or more standard collections of datatypes, some general-purpose and some domain-specific. RELAX NG adopts this approach. You can use the datatypes defined by W3C XML Schema if you choose, but it is also possible to use other systems of datatypes instead of or in addition to these.

With RELAX NG, an IETF specification could define a collection of datatypes that are useful for IETF applications. For example, might it not be useful to have a datatype for an IP address or a domain name? Such datatypes could be used with RELAX NG with no change to RELAX NG itself.

7. In W3C XML Schema there is no way to specify what is allowed as the root element. W3C XML Schema does not define a single notion of validity of a document with respect to a schema.
There are different varieties of validation (lax and strict) and many different ways to validate a document against a schema. From a W3C XML Schema alone, it is not possible to know what is a valid document. For example, consider a totally trivial schema like this:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified"
               xmlns="http://www.example.com"
               targetNamespace="http://www.example.com">
      <xs:element name="foo">
        <xs:complexType/>
      </xs:element>
    </xs:schema>

Now consider a totally bogus document like this:

    <bar/>

Believe it or not, the W3C XML Schema processors that I have tried report this as valid! The definition of validity is so flexible in W3C XML Schema as to seriously impact interoperability. If an application was relying on the W3C XML Schema validation to screen out incorrect input, it would be in serious trouble.

With RELAX NG, this sort of bogosity does not arise: there is a clear, unambiguous notion of validity. If you have a RELAX NG schema, there is no doubt about what instances are valid.

8. W3C XML Schema provides the xsi:schemaLocation attribute, which allows an XML document instance to indicate the schema that should be used to validate the document. I think this is a serious problem for a couple of reasons.

One reason is that this is a potential security problem. One important use of schemas is to protect an application against invalid data. This use of schemas is easily undermined by documents that use xsi:schemaLocation.

Another reason is that this leads to interoperability problems. Its use is not mandated by the XML Rec: it's just a hint. Yet, in some implementations, this is the only way to specify the schema to use to validate the document.

In RELAX NG, validation is treated as a process with two independent inputs, a schema and an instance to be validated with respect to the schema.

There is no way in a W3C XML Schema to prohibit the instance from containing xsi:schemaLocation attributes.
Indeed, this is also the case for other xsi attributes: there is no way to prevent the document containing xsi:type attributes. The use of W3C XML Schema infects the grammar you are defining. If you want a closed grammar that only allows specific attributes not including the xsi attributes, you cannot express that in W3C XML Schema. RELAX NG has no such magic attributes.

9. Another problematic area in W3C XML Schema is the support for infoset augmentation, such as default attributes. Experience with XML 1.0 has, I believe, shown that this is not a good feature to include in a schema language. Apart from being a violation of modularity, it tends to cause interoperability problems, because it leads to the possibility of the application getting different information depending on whether or not validation has been performed. RELAX NG, by contrast, never changes the information that an application receives. It specifies purely what is valid and what is invalid.

I've looked through the archives and I haven't seen any technical justification for the recommendation of W3C XML Schema as the default choice of schema language.

Section 1.2 of RFC 2026 lists as two of the goals of the Internet Standards Process:

- technical excellence
- clear, concise and easily understood documentation

I believe these should be considered in selecting a schema language. On both of these, I believe RELAX NG is far superior to W3C XML Schema. I invite anybody who disagrees to go off and read the two specifications [3], [5].

I am sorry to have gone on at such length, but I think this is an important issue. There seems to be a tendency for people to suspend their technical judgment when it comes to W3C XML Schema. The attitude seems to be "It's a W3C Recommendation; everybody is using it, so we should too, regardless of its technical merits." I don't think this attitude serves the best long-term interests of the Internet.
I and others have sacrificed a huge amount of time and effort to try and provide the community with a solid, technically credible alternative and I think it deserves to be considered seriously on its technical merits and not dismissed on the basis of its current level of market acceptance.

James

[1] http://www.y12.doe.gov/sgml/sc34/document/0320.htm
[2] http://www.oasis-open.org/committees/relax-ng/tutorial-20011203.html
[3] http://www.w3.org/TR/xmlschema-1
[4] http://www.w3.org/TR/xmlschema-formal/
[5] http://www.oasis-open.org/committees/relax-ng/spec-20011203.html
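
As an illustration of points 4 and 5 in the message above (this sketch is not part of James Clark's original message), the following RELAX NG fragment in the XML syntax shows how a choice between an attribute and a child element, and an unordered pair of child elements, might be written. The names item, x, y, a and b are hypothetical, chosen only for this example.

```xml
<!-- Illustrative sketch only: item, x, y, a and b are hypothetical names. -->
<element name="item" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- Point 4: either attribute x or child element y, but not both. -->
  <choice>
    <attribute name="x"/>
    <element name="y"><text/></element>
  </choice>
  <!-- Point 5: children a and b are both required, in either order. -->
  <interleave>
    <element name="a"><text/></element>
    <element name="b"><text/></element>
  </interleave>
</element>
```

Under this sketch, an item carrying attribute x followed by b then a would be accepted, while an item with both an x attribute and a y child would be rejected, since the choice pattern admits only one of its alternatives.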
--------------------------------------------------------------------------------
URL: http://lists.xml.org/archives/xml-dev/200206/msg00059.html
Date: Wed, 5 Jun 2002 16:24:39 +1000
From: Rick Jelliffe <ricko@allette.com.au>
To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] XML Schema considered harmful?
--------------------------------------------------------------------------------

From: "Michael Leditschke" <mike@ammd.com.au>

> In terms of the complexities of specs, yes XML Schema Part 1 makes
> difficult reading but the Primer, Part 0, is quite readable and, to
> their credit, was updated with each release of the spec. It covers the
> ground and I have only occasionally had to refer to Part 1, despite
> designing schemas using a large percentage of the supported constructs.

I had the experience of being *very* familiar with the XML Schema specs, then going away for a few months. When I returned, I found them quite difficult to fathom. There have been several times when I have not been able to answer questions from users of our validator and have had to rely on another Schema expert here.

The issue is not whether it is possible to become a fulltime expert in XML Schemas; the issue is how much protocol designers should be required to cope with, and whether IETF should support plurality or be exclusive. IETF has so far been built on making layers to support plurality, allowing protocols to thrive on their own. XML Schemas is monolithic and badly architected: it will be very difficult to upgrade the bits that are incomplete (keys and datatypes) because of this.

> I may have missed it, but the support in RELAX NG seems, by the nature of
> RELAX NG, purely structural. I assume I will need to add Schematron to the
> mix, which is the same situation as with XML Schema currently.

Thanks for the plug! However there is (at least) one significant difference: Schematron has not been designed with streaming implementations in mind (and I am not aware of any streaming implementations): a schema language that requires a DOM be built is not suitable for high-speed transaction validation over the Web, which is what we are talking about.
Now, I am aware of people who have used Schematron for testing incoming pages and generating custom pages to return to the user to ask them for missing or incorrect information. But that is a different area.

> I've probably completely missed the point here, but doesn't an XML Schema
> that only has one global element achieve the above? Maybe it's a matter of
> semantics but that's how it's panned out in practice for me thus far.

But then you cannot use substitution groups: this is the kind of complexity that James is talking about I think--the complexity when using one feature makes another disappear arbitrarily.

> Don't get me wrong - I don't receive regular brown paper envelopes with
> W3C in the return address, and I'm not saying XML Schema hasn't got warts,
> but it's there and supported and to me, it's not the **HUGE** conceptual
> and learning leap it seems to be painted as in this newsgroup. It achieved
> my 80% and got the project in on time. In the process a number of other
> organisations had to climb the same learning curve and got there.

Were these projects IETF protocols? And are you an XML or schema expert or, as we can expect IETF people to be, are you only using XML because it will be more convenient than rolling your own syntax and you are not an expert? If I were developing a protocol, I would take some convincing that XML Schemas was not overkill for my requirements.

> James is emphatic, and that is only natural, but his arguments paint
> issues as black and white (XML Schema = bad, RELAX NG = good) and my
> experience with XML Schema suggests shades of grey.

But it is not James who is being black and white: it is the draft RFC wanting to ban the use of RELAX NG! (And Schematron or the DSDL effort, for that matter!)

> To my mind, the bigger issue to decide is how many schema languages
> the IETF want appearing in RFCs. Simply allowing both means that RFC
> readers have to learn both.
> And since RELAX NG focusses on structure,
> what will be used to express content based co-constraints? Perhaps it
> would be better to be arguing for DSDL.

DSDL is an ISO standard in several parts, and I think the ISO WG involved is very keen to not repeat the mistakes of XML Schemas w.r.t. premature standardization. So the technologies that are mature (now RELAX NG, shortly Schematron) are being standardized.

In any case, it seems that many people who are cowed by XML Schemas actually write their Schemas as DTDs then convert them using an automated tool. I used James' dtdinst program last night for the first time (to convert the EAD DTD into RELAX NG) and I found it was excellent. If there is a large class of users who just learn XML and are content to automatically convert, they have no requirement that a single schema language be mandated.

I don't think the argument that people will be confused by multiple schema languages holds water: some people will be confused by XML Schemas anyway and turn to simplifying tools (e.g. writing in DTDs) or different interfaces.

The best way is to try both schema languages and to get a feel for their different capabilities. Clearly XML Schemas has innumerable nice features for transferring data between backend database systems by big business. Clearly RELAX NG has nice features for multimedia languages and documents. But are IETF protocols more like big-business data transfers or like multimedia languages?

It would make more sense for the RFC to merely say something like this:

"Standard schema languages (e.g. ISO RELAX NG or W3C XML Schemas) should be used in preference to proprietary or non-standard languages. Schema languages should be used conservatively: exotic or difficult or badly-described features may be badly implemented or used incorrectly or be difficult to diagnose."

Cheers
Rick Jelliffe
Prepared by Robin Cover for The XML Cover Pages archive. For schema description and references, see "XML Schemas."