[This local archive copy is from the official and canonical URL, http://www.ascc.net/~ricko/notation.htm; please refer to the canonical source document if possible.]
Rick Jelliffe
1999-05-12
This note is for discussion by the W3C Schema Working Group. It provides an alternative characterization of the schema problem, and with it a framework for addressing many issues not handled in the first working draft of the XML Schema specification. It builds on
XML Schemas should be able to provide enough information to allow:
As for general test cases for XML Schema:
Furthermore, XML Schemas must provide a framework for maximum extensibility, to allow vendors to compete on depth and quality of coverage.
The particular functional shortfalls in the current draft of XML Schema: Structure that should be addressed are:
There are two severe non-conformances in the XML Schema draft which I avoid in this proposal:
There are two integration problems with the XML Schema draft:
Finally, there is a scenario issue:
An XML Notation Schema
In other words, XML Notation Schemas build on the strengths of XML markup declarations:
An XML Notation Schema is a set of multi-layered grammars. The grammars may be
The grammars can apply against any kind of document object.
A grammar is known by a notation identifier. It is the fundamental declaration. I follow a revised form from XML Schema.
<!ELEMENT notation ( html:body?, ( %grammar; )*, handler*, parameter* ) >
<!ATTLIST notation
    name    ID     #REQUIRED
    system  CDATA  #IMPLIED
    public  CDATA  #IMPLIED
    mime    CDATA  #IMPLIED >
A notation identifier has several possible names according to different schemas.
The following notations are built in:
A handler is some downloadable program which can deal with the notation. The html:body element is provided for documentation.
A grammar defines a notation layer. (Note: For simpler explanation I have not included archetype features here: it should be easy to see how they can be added.)
<!ENTITY % grammar " regularExpression | lexicalModel | contentModel | contextModel | BNF | userDefined ">
<!ELEMENT regularExpression ( #PCDATA )>
<!ATTLIST regularExpression
    classes    NOTATION  "xml:POSIX.regex"
    tokenizer  IDREF     "xml:xml.character" >
<!ELEMENT lexicalModel ( lexical* )>
<!ATTLIST lexicalModel
    classes    NOTATION  "xml:schema.lexical-types"
    tokenizer  IDREF     "xml:xml.character" >
<!ELEMENT lexical ( #PCDATA )>
<!ELEMENT contentModel ( #PCDATA )>
<!ATTLIST contentModel
    classes    NOTATION  #IMPLIED
    tokenizer  IDREF     "xml:dom.element-nodes" >
<!ELEMENT contextModel ( #PCDATA )>
<!ATTLIST contextModel
    classes    NOTATION  "xml:xsl.patterns"
    tokenizer  IDREF     "xml:dom.element-nodes" >
<!ELEMENT BNF ( production )*>
<!ATTLIST BNF
    terminals  NOTATION  "xml:rfc????.ebnf"
    tokenizer  IDREF     "xml:xml.character" >
<!ELEMENT production ( #PCDATA )>
<!ATTLIST production
    nonterminal  NMTOKEN  #REQUIRED >
<!ELEMENT userDefined ANY >
<!ATTLIST userDefined
    name       ID        #REQUIRED
    notation   NOTATION  #REQUIRED
    tokenizer  IDREF     #IMPLIED >
Each of these forms allows variations on the same function: parsing or validating against a grammar:
Each of these grammars represents a slightly different way of expressing statements about some data. Set operations or logical operations could also be provided.
What is notable about each of these grammars is that they are parameterized by tokenizer and notation. (Tokenizing is the process of determining the data tokens to be presented to the grammar.) The parameterization of tokenization means that you may define a regular expression over characters or over elements. In fact, a contentModel is a regular expression tokenized over sibling elements and data (with a slightly different syntax)! And, in fact, a lexicalModel is a kind of BNF grammar, in that it associates terminals with non-terminals.
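The point about tokenizers can be sketched in Python. This is only an illustration, not part of the proposal: the function names (`element_tokens`, `character_tokens`, `model_to_regex`, `conforms`) are hypothetical, and the content-model translator handles only names, `,`, `|`, `?`, `*`, `+` and parentheses.

```python
import re
import xml.etree.ElementTree as ET


def element_tokens(parent):
    """Tokenizer over sibling elements: one token per child element name."""
    return [child.tag for child in parent]


def character_tokens(text):
    """Tokenizer over characters: one token per character."""
    return list(text)


def model_to_regex(model):
    """Translate a small DTD-style content model into a Python regex.

    Each name matches one token, written as the name followed by a space.
    """
    parts = []
    for piece in re.findall(r"[A-Za-z_][\w.:-]*|[()|?*+,]", model):
        if piece == ",":
            continue  # a sequence is plain concatenation in a regex
        if piece in ("(", ")", "|", "?", "*", "+"):
            parts.append(piece)
        else:
            parts.append("(?:" + re.escape(piece) + " )")
    return "".join(parts)


def conforms(model, tokens):
    """True if the whole token sequence matches the content model."""
    joined = "".join(token + " " for token in tokens)
    return re.fullmatch(model_to_regex(model), joined) is not None


doc = ET.fromstring("<doc><title/><para/><para/></doc>")
print(element_tokens(doc))                              # ['title', 'para', 'para']
print(conforms("(title, para*)", element_tokens(doc)))  # True
print(conforms("(title, para+)", ["title"]))            # False
# The same regular-expression engine, tokenized over characters instead:
print(re.fullmatch(r"[0-9]{4}", "".join(character_tokens("1999"))) is not None)  # True
```

The grammar engine is the same in both cases; only the tokenizer that feeds it differs.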
If the tokenization were selected so that an IDREF was followed rather than the child elements, then it would be possible to make schemas of graph structures: to validate links and provide strong type checking. It also then becomes possible to select the content model based on what element has linked to an element, rather than according to its type.
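A rough Python sketch of a tokenizer that follows IDREFs instead of children, so that link targets can be type-checked. The attribute names `id` and `ref`, the `#DANGLING` token and the sample document are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def idref_tokens(root, element, attr="ref"):
    """Tokenizer that follows IDREFs: one token per link target.

    Each token is the element type name of the target, or '#DANGLING'
    when no element carries the referenced id.
    """
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    tokens = []
    for ref in element.get(attr, "").split():
        target = by_id.get(ref)
        tokens.append(target.tag if target is not None else "#DANGLING")
    return tokens


def links_are_typed(root, element, allowed, attr="ref"):
    """True if every link from the element reaches an allowed element type."""
    return all(token in allowed for token in idref_tokens(root, element, attr))


library = ET.fromstring(
    "<library>"
    '<book id="b1"/><person id="p1"/>'
    '<cite ref="b1"/><cite ref="p1"/><cite ref="missing"/>'
    "</library>"
)
results = [links_are_typed(library, cite, {"book"}) for cite in library.findall("cite")]
print(results)  # [True, False, False]
```

The first citation links to a book and passes; the second links to the wrong element type, and the third links to nothing.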
A further implication of allowing specifiable tokenization is that it becomes possible to specify, for example, that the grammar used should be some set operation on some other grammars. This has major impact on construction, and ties in with Murata Makoto's work.
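As a minimal sketch of what "a set operation on other grammars" could mean, the following Python treats each grammar as a predicate over strings and combines predicates. The helper names are hypothetical, and this is only an illustration of the idea, not a rendering of any published algorithm.

```python
import re


def grammar(pattern):
    """Wrap a regular expression as a predicate: does the whole string match?"""
    return lambda text: re.fullmatch(pattern, text) is not None


def intersection(*grammars):
    """Accept only what every grammar accepts."""
    return lambda text: all(g(text) for g in grammars)


def union(*grammars):
    """Accept what at least one grammar accepts."""
    return lambda text: any(g(text) for g in grammars)


def difference(first, second):
    """Accept what the first grammar accepts and the second rejects."""
    return lambda text: first(text) and not second(text)


digits = grammar(r"[0-9]+")
four_chars = grammar(r".{4}")
year = intersection(digits, four_chars)
not_a_year = difference(digits, four_chars)

print(year("1999"), year("99"), year("abcd"))  # True False False
print(not_a_year("99"))                        # True
```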
The classes attribute (which is called "terminals" in the BNF grammar) specifies the particular notation used in the grammar. It allows, for example, a particular schema maker to use different letters for the lexicalModels. The different classes allow slightly different conventions for regular expressions: POSIX, XML content models, PERL, etc.
One particular use of these would be to allow whitespace found between certain elements to be labelled as non-significant, to overcome the bad mixed-content restriction. Paul Prescod has made a suggestion about this: to allow, in effect, content models of (#WS, x, #WS, y, #PCDATA, z), where #WS means ignorable whitespace.
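A small Python sketch of a tokenizer that labels whitespace-only text as #WS and other text as #PCDATA. The function name `mixed_tokens` and the sample element are assumptions for illustration; the token names follow the content model quoted above.

```python
import xml.etree.ElementTree as ET


def mixed_tokens(parent):
    """Tokenize mixed content into element names, '#WS' and '#PCDATA'."""
    tokens = []

    def add_text(text):
        # Empty or absent text produces no token at all.
        if text:
            tokens.append("#WS" if text.strip() == "" else "#PCDATA")

    add_text(parent.text)
    for child in parent:
        tokens.append(child.tag)
        add_text(child.tail)
    return tokens


p = ET.fromstring("<p>\n  <x/>\n  <y/>some text<z/></p>")
tokens = mixed_tokens(p)
print(tokens)  # ['#WS', 'x', '#WS', 'y', '#PCDATA', 'z']

# An application that treats #WS as ignorable sees only the significant tokens.
significant = [token for token in tokens if token != "#WS"]
print(significant)  # ['x', 'y', '#PCDATA', 'z']
```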
The grammars could be augmented to allow any of the W3C-specified languages (XPointers, Fragment Context Specifications, XQL, etc.) to serve as the grammar.
The next stage is to associate a notation with a context. Note that multiple notations may apply to a context, in layers or overlapping.
The simplest way to associate a notation with a context is to use a NOTATION attribute on an element.
The next simplest way is to associate a document object type with a notation.
<!ENTITY % simple-associations " elements | PI | comment | data | elementType | attribute ">
<!ELEMENT elements ( html:body?, handler*, parameter* )>
<!ATTLIST elements
    id         ID        #REQUIRED
    namespace  CDATA     #IMPLIED
    notation   NOTATION  #REQUIRED >
<!-- applies to all elements, perhaps qualified by namespace alone -->
<!ELEMENT PI ( html:body?, handler*, parameter* )>
<!ATTLIST PI
    id  ID  #REQUIRED >
<!-- notation determined by the PI target -->
<!ELEMENT comment ( html:body?, handler*, parameter* )>
<!ATTLIST comment
    id        ID        #IMPLIED
    notation  NOTATION  #REQUIRED >
<!-- applies to all comments -->
<!ELEMENT data ( html:body?, handler*, parameter* )>
<!ATTLIST data
    id        ID        #IMPLIED
    notation  NOTATION  #REQUIRED >
<!-- applies to all data -->
<!ELEMENT elementType ( html:body?, handler*, parameter* )>
<!ATTLIST elementType
    name       ID        #REQUIRED
    namespace  CDATA     #IMPLIED
    notation   NOTATION  #REQUIRED >
<!-- applies to an element type of this name, in any context.
     If the namespace is provided, it overrides the namespace prefix
     on the element type name, if present. -->
<!ELEMENT attribute ( html:body?, handler*, parameter* )>
<!ATTLIST attribute
    name         ID        #IMPLIED
    elementType  IDREF     #IMPLIED
    namespace    CDATA     #IMPLIED
    notation     NOTATION  #REQUIRED >
<!-- applies to an attribute of this name, in any context.
     If the namespace is provided, it overrides the namespace prefix
     on the element type name, if present. -->
<!ELEMENT attributeValue ( html:body?, handler*, parameter* )>
<!ATTLIST attributeValue
    name         IDREF     #IMPLIED
    elementType  IDREF     #IMPLIED
    namespace    CDATA     #IMPLIED
    notation     NOTATION  #REQUIRED >
The html:body element provides documentation. A handler is a downloadable application which can deal with a simple association: to display it, for example.
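The simple associations can be pictured as lookups that each contribute a notation layer to a document object. The Python below is a sketch under stated assumptions: the attribute name `notation`, the function `notations_for`, the sample notation names and the order of the layers are all hypothetical, since this note does not fix a precedence.

```python
import xml.etree.ElementTree as ET


def notations_for(element, by_element_type, for_all_elements):
    """Collect every notation that applies to an element, as layers.

    by_element_type  maps an element type name to a notation name
                     (the role played by elementType declarations).
    for_all_elements lists notations applying to all elements
                     (the role played by elements declarations).
    """
    layers = []
    own = element.get("notation")  # a NOTATION attribute on the element itself
    if own is not None:
        layers.append(own)
    if element.tag in by_element_type:
        layers.append(by_element_type[element.tag])
    layers.extend(for_all_elements)
    return layers


by_type = {"date": "date.model"}
for_all = ["house-style.model"]

date = ET.fromstring('<date notation="iso8601.model">1999-05-12</date>')
para = ET.fromstring("<para>text</para>")

print(notations_for(date, by_type, for_all))
# ['iso8601.model', 'date.model', 'house-style.model']
print(notations_for(para, by_type, for_all))
# ['house-style.model']
```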
In extended associations, a context must be satisfied to associate the notation. This directly mirrors XSL, where a pattern is used to select document objects.
<!ELEMENT extendedAssociation ( html:body?, ( %grammar; ), handler*, parameter* )>
<!ATTLIST extendedAssociation
    name     ID     #REQUIRED
    grammar  IDREF  #REQUIRED >
A context is found using the grammar mechanism itself! So, for example, the context could be an XSL pattern, or it could be a content model, or it could use XPointers to test the value of an attribute.
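A Python sketch of an extended association, using the limited XPath subset of the standard `xml.etree.ElementTree` module as a stand-in for an XSL pattern. The function name `apply_extended`, the paths and the notation names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def apply_extended(root, associations):
    """Pair each element selected by a context with the associated notation.

    associations is a list of (path, notation) pairs; each path is an
    ElementTree path expression evaluated relative to the root element.
    """
    matched = []
    for path, notation in associations:
        for element in root.findall(path):
            matched.append((element, notation))
    return matched


book = ET.fromstring(
    "<book><title>Book title</title>"
    "<chapter><title>Chapter title</title></chapter></book>"
)
associations = [("./chapter/title", "chapter-title.model")]

for element, notation in apply_extended(book, associations):
    print(element.text, "->", notation)
# Chapter title -> chapter-title.model
```

Both `title` elements have the same element type, but only the one inside a `chapter` satisfies the context, so only it receives the notation.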
As with simple associations, the html:body element provides documentation, and a handler is a downloadable application which can deal with the association: to display it, for example.
Handlers are XLinks. They locate applications (e.g. Java applets) which provide some kind of useful service for an application. Examples are viewers, tokenizers and external validators. Parameters are XLinks which point to generic information that can be loaded by applications to make use of the data.
<!ELEMENT handler EMPTY>
<!ATTLIST handler %xlink; >
<!ELEMENT parameters EMPTY>
<!ATTLIST parameters %xlink; >
One immediate use for this mechanism is to provide a solution for Private-Use-Area (PUA) characters. The PUA is an area reserved in Unicode; characters of this kind are widely used in East Asia, though less so under Unicode. To be able to transport non-standard characters in XML, one has to attach to the document information which describes the properties of the PUA characters, which the receiving end can make use of. This is a schema issue, because it involves the notation of characters. Some proposals, notably those of Prof. C.C. Hsieh, allow Chinese characters to be formed from parts using placement operators.
Handlers could be defined to render the character in that notation. A parameter could be provided that gave the collation sequence, to be loaded into the sort routines.
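As a small sketch of the receiving end, the Python below finds the private-use characters in a piece of text (Unicode general category `Co`) and reports those for which no description was supplied. The `descriptions` table stands in for information loaded through a parameter; its layout and the function names are hypothetical.

```python
import unicodedata


def private_use_characters(text):
    """Return the distinct private-use characters in the text, sorted."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


def undescribed(text, descriptions):
    """Private-use characters in the text that have no description."""
    return [ch for ch in private_use_characters(text) if ch not in descriptions]


# Hypothetical description table, as it might be loaded from a parameter.
descriptions = {
    "\ue000": {"name": "hypothetical user-defined character", "collation_key": 1},
}

text = "A\ue000B\ue001"
print([hex(ord(ch)) for ch in private_use_characters(text)])     # ['0xe000', '0xe001']
print([hex(ord(ch)) for ch in undescribed(text, descriptions)])  # ['0xe001']
```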
Here is the top-level declaration.
<!ELEMENT notationSchema ( extendedAssociation | %simple-associations; | notation )*>
As is traditional, this section gives the Notations Schema DTD both as XML markup declarations and using Notations Schemas, for comparison. (I haven't put in the attribute defaulting yet.)
<schema>
<elementType name="notationSchema" notation="notationSchema.model" /> <notation name="notationSchema.model"> <contentModel>( extendedAssociation | elements | PI | comment | elementType | data | attribute | notation )* </contentModel> </notation>
<elementType name="handler" notation="handler.model" /> <notation name="handler.model"> <contentModel>EMPTY</contentModel> </notation> <attribute elementType="handler" notation="" />
<elementType name="parameters" notation="parameters.model" />
<notation name="parameters.model">
  <contentModel>EMPTY</contentModel>
</notation>
<attribute elementType="parameters" notation="" />
<elementType name="extendedAssociation" notation="extendedAssociation.model" />
<notation name="extendedAssociation.model">
  <contentModel>( html:body?, ( regularExpression | lexicalModel | contentModel | contextModel | BNF | userDefined ), handler*, parameter* )</contentModel>
</notation>
<attribute elementType="extendedAssociation" notation="extendedAssociation.attribute.model" />
<notation name="extendedAssociation.attribute.model">
  <contentModel>( name & grammar )</contentModel>
</notation>
<attributeValue model="extendedAssociation.attribute.model" name="name" notation="xml:ID" />
<attributeValue model="extendedAssociation.attribute.model" name="grammar" notation="xml:IDREF" />
<elementType name="elements" notation="elements.model" />
<notation name="elements.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="elements" notation="elements.attribute.model" />
<notation name="elements.attribute.model">
  <contentModel>( id & namespace? & notation )</contentModel>
</notation>
<attributeValue model="elements.attribute.model" name="id" notation="xml:ID" />
<attributeValue model="elements.attribute.model" name="namespace" notation="xml:CDATA" />
<attributeValue model="elements.attribute.model" name="notation" notation="xml:NOTATION" />
<elementType name="PI" notation="PI.model" />
<notation name="PI.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="PI" notation="PI.attribute.model" />
<notation name="PI.attribute.model">
  <contentModel>( id )</contentModel>
</notation>
<attributeValue model="PI.attribute.model" name="id" notation="xml:ID" />
<elementType name="comment" notation="comment.model" />
<notation name="comment.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="comment" notation="comment.attribute.model" />
<notation name="comment.attribute.model">
  <contentModel>( id? & notation )</contentModel>
</notation>
<attributeValue model="comment.attribute.model" name="id" notation="xml:ID" />
<attributeValue model="comment.attribute.model" name="notation" notation="xml:NOTATION" />
<elementType name="data" notation="data.model" />
<notation name="data.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="data" notation="data.attribute.model" />
<notation name="data.attribute.model">
  <contentModel>( id? & notation )</contentModel>
</notation>
<attributeValue model="data.attribute.model" name="id" notation="xml:ID" />
<attributeValue model="data.attribute.model" name="notation" notation="xml:NOTATION" />
<elementType name="elementType" notation="elementType.model" />
<notation name="elementType.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="elementType" notation="elementType.attribute.model" />
<notation name="elementType.attribute.model">
  <contentModel>( name & notation & namespace? )</contentModel>
</notation>
<attributeValue model="elementType.attribute.model" name="name" notation="xml:ID" />
<attributeValue model="elementType.attribute.model" name="namespace" notation="xml:CDATA" />
<attributeValue model="elementType.attribute.model" name="notation" notation="xml:NOTATION" />
<elementType name="attribute" notation="attribute.model" />
<notation name="attribute.model">
  <contentModel>( html:body?, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="attribute" notation="attribute.attribute.model" />
<notation name="attribute.attribute.model">
  <contentModel>( name? & elementType? & namespace? & notation )</contentModel>
</notation>
<attributeValue model="attribute.attribute.model" name="name" notation="xml:ID" />
<attributeValue model="attribute.attribute.model" name="elementType" notation="xml:IDREF" />
<attributeValue model="attribute.attribute.model" name="namespace" notation="xml:CDATA" />
<attributeValue model="attribute.attribute.model" name="notation" notation="xml:NOTATION" />
<!-- to be done:
<!ELEMENT attributeValue (html:body?, handler*, parameter*)> <!ATTLIST attributeValue name IDREF #IMPLIED elementType IDREF #IMPLIED namespace CDATA #IMPLIED notation NOTATION #REQUIRED >
-->
<elementType name="regularExpression" notation="regularExpression.model" />
<notation name="regularExpression.model">
  <contentModel>( #PCDATA )</contentModel>
</notation>
<attribute elementType="regularExpression" notation="regularExpression.attribute.model" />
<notation name="regularExpression.attribute.model">
  <contentModel>( classes? & tokenizer? )</contentModel>
</notation>
<attributeValue model="regularExpression.attribute.model" name="classes" notation="xml:NOTATION" />
<attributeValue model="regularExpression.attribute.model" name="tokenizer" notation="xml:IDREF" />
<elementType name="lexicalModel" notation="lexicalModel.model" />
<notation name="lexicalModel.model">
  <contentModel>( lexical )*</contentModel>
</notation>
<attribute elementType="lexicalModel" notation="lexicalModel.attribute.model" />
<notation name="lexicalModel.attribute.model">
  <contentModel>( classes? & tokenizer? )</contentModel>
</notation>
<attributeValue model="lexicalModel.attribute.model" name="classes" notation="xml:NOTATION" />
<attributeValue model="lexicalModel.attribute.model" name="tokenizer" notation="xml:IDREF" />
<!--ATTLIST lexicalModel classes NOTATION "xml:schema.lexical-types" tokenizer IDREF "xml:xml.character" -->
<!-- to do <!ELEMENT lexical ( #PCDATA )> -->
<elementType name="contentModel" notation="contentModel.model" />
<notation name="contentModel.model">
  <contentModel>( #PCDATA )</contentModel>
</notation>
<attribute elementType="contentModel" notation="contentModel.attribute.model" />
<notation name="contentModel.attribute.model">
  <contentModel>( classes? & tokenizer? )</contentModel>
</notation>
<attributeValue model="contentModel.attribute.model" name="classes" notation="xml:NOTATION" />
<attributeValue model="contentModel.attribute.model" name="tokenizer" notation="xml:IDREF" />
<!--ATTLIST contentModel classes NOTATION #IMPLIED tokenizer IDREF "xml:dom.element-nodes" -->
<!--ATTLIST contextModel classes NOTATION "xml:xsl.patterns" tokenizer IDREF "xml:dom.element-nodes" -->
<elementType name="contextModel" notation="contextModel.model" />
<notation name="contextModel.model">
  <contentModel>( #PCDATA )</contentModel>
</notation>
<attribute elementType="contextModel" notation="contextModel.attribute.model" />
<notation name="contextModel.attribute.model">
  <contentModel>( classes? & tokenizer? )</contentModel>
</notation>
<attributeValue model="contextModel.attribute.model" name="classes" notation="xml:NOTATION" />
<attributeValue model="contextModel.attribute.model" name="tokenizer" notation="xml:IDREF" />
<!--ATTLIST BNF terminals NOTATION "xml:rfc????.ebnf" tokenizer IDREF "xml:xml.character" -->
<elementType name="BNF" notation="BNF.model" />
<notation name="BNF.model">
  <contentModel>( production )*</contentModel>
</notation>
<attribute elementType="BNF" notation="BNF.attribute.model" />
<notation name="BNF.attribute.model">
  <contentModel>( terminals? & tokenizer? )</contentModel>
</notation>
<attributeValue model="BNF.attribute.model" name="terminals" notation="xml:NOTATION" />
<attributeValue model="BNF.attribute.model" name="tokenizer" notation="xml:IDREF" />
<elementType name="production" notation="production.model" />
<notation name="production.model">
  <contentModel>( #PCDATA )</contentModel>
</notation>
<attribute elementType="production" notation="production.attribute.model" />
<notation name="production.attribute.model">
  <contentModel>( nonterminal )</contentModel>
</notation>
<attributeValue model="production.attribute.model" name="nonterminal" notation="xml:NMTOKEN" />
<elementType name="userDefined" notation="userDefined.model" />
<notation name="userDefined.model">
  <contentModel>ANY</contentModel>
</notation>
<attribute elementType="userDefined" notation="userDefined.attribute.model" />
<notation name="userDefined.attribute.model">
  <contentModel>( name & notation & tokenizer? )</contentModel>
</notation>
<attributeValue model="userDefined.attribute.model" name="name" notation="xml:ID" />
<attributeValue model="userDefined.attribute.model" name="notation" notation="xml:NOTATION" />
<attributeValue model="userDefined.attribute.model" name="tokenizer" notation="xml:IDREF" />
<elementType name="notation" notation="notation.model" />
<notation name="notation.model">
  <contentModel>( html:body?, ( regularExpression | lexicalModel | contentModel | contextModel | BNF | userDefined )*, handler*, parameter* )</contentModel>
</notation>
<attribute elementType="notation" notation="notation.attribute.model" />
<notation name="notation.attribute.model">
  <contentModel>( name & system? & public? & mime? )</contentModel>
</notation>
<attributeValue model="notation.attribute.model" name="name" notation="xml:ID" />
<attributeValue model="notation.attribute.model" name="system" notation="xml:SYSTEM" />
<attributeValue model="notation.attribute.model" name="public" notation="xml:PUBLIC" />
<attributeValue model="notation.attribute.model" name="mime" notation="MIME:mediaType" />
</schema>