Resource Directory for Hook 0.2

This document is an RDDL Resource Directory Description for the Hook 0.2 validation language. It is an XHTML document with special XLinks that locate various resources useful for Hook.

The Hook validation language is a thought experiment in minimalism in XML schema languages. The purpose of such a minimal language would be to provide useful but ultra-terse success/fail validation for basic incoming QA, especially of datagrams. It is like a checksum for a schema.

The validation it performs can be characterized as "Does this element have a feasible name, ancestry, previous siblings and contents?", with some tradeoff in how fully the later criteria are tested.

Let us start with the following technical criteria:

Smaller than a DTD: if it is downloaded from a server as a separate file, it should be downloadable in the first packet group, so less than 512 (the minimum MTU) - 100 (for the MIME header) = 412 bytes.
Implementable by a streaming processor
No forward references
No pathological schemas as far as blowouts are concerned
An efficient implementation should be possible
Suitable for coarse validation of documents for some significant issues
The schema should be namespace-aware
The minimal schema should only require 1 element or perhaps fit in a PI
The datatype should be expressible using XML Schemas regular expressions or simple space-separated tokens.
The schema paradigm is the (partial) ordering of elements against the information kept during stream processing
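
The size criterion can be checked mechanically. The following is a minimal sketch, assuming the schema is served as UTF-8 and taking the 512-byte packet size and 100-byte header allowance from the first criterion above. The helper name fits_first_packet is hypothetical and not part of Hook.

```python
# Byte budget from the first criterion: 512 (packet) - 100 (MIME header allowance).
PACKET_BYTES = 512
HEADER_ALLOWANCE = 100
BUDGET = PACKET_BYTES - HEADER_ALLOWANCE  # 412


def fits_first_packet(schema_text: str, encoding: str = "utf-8") -> bool:
    """Return True if the serialized schema is strictly smaller than the budget.

    Hypothetical helper; the encoding is an assumption, not part of Hook.
    """
    return len(schema_text.encode(encoding)) < BUDGET


schema = "<hook:order xmlns:hook='http://www.ascc.net/xml/hook'>A B. C</hook:order>"
print(BUDGET, fits_first_packet(schema))
```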

The Language

A Hook schema is an element containing a list of element names, some of which may be grouped by square brackets. This list represents a certain ordering of the names and validation consists of checking conformity to this ordering.

The DTD for the language is

  <!ELEMENT hook:order ( #PCDATA)>
  <!ATTLIST  hook:order
	xmlns:hook  CDATA #FIXED "http://www.ascc.net/xml/hook"
	targetNamespace CDATA #IMPLIED
	friendly ( true | false ) "true"
	short (true | false) "false"
	top (true| false) "true"
 >

The targetNamespace attribute gives the namespace to be validated.
The friendly attribute states whether elements from other namespaces are allowed.
The short attribute states whether all the elements in the namespace have been mentioned or not; if not, then unmentioned elements are allowed as if specified in a group at the end of the schema.
The top attribute specifies whether the first element in the schema must be the document element (or the local root of a branch starting this namespace).
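
As an illustration of the attribute defaults, here is a small sketch that reads a hook:order element with Python's standard xml.etree.ElementTree. That parser does not apply DTD defaults, so the sketch applies the defaults from the ATTLIST itself. The function name read_options and the namespace http://example.org/ns are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

HOOK_NS = "http://www.ascc.net/xml/hook"

# Defaults copied from the ATTLIST declaration; targetNamespace is #IMPLIED.
DEFAULTS = {"friendly": "true", "short": "false", "top": "true"}


def read_options(xml_text: str) -> dict:
    """Parse a hook:order element and return its options with defaults applied."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}order" % HOOK_NS:
        raise ValueError("not a hook:order element")
    options = {"targetNamespace": root.get("targetNamespace")}
    for name, default in DEFAULTS.items():
        value = root.get(name, default)
        if value not in ("true", "false"):
            raise ValueError("%s must be 'true' or 'false'" % name)
        options[name] = (value == "true")
    options["order"] = root.text or ""
    return options


# http://example.org/ns is a placeholder namespace for illustration only.
example = (
    '<hook:order xmlns:hook="http://www.ascc.net/xml/hook" '
    'targetNamespace="http://example.org/ns" short="true">A B. C</hook:order>'
)
print(read_options(example))
```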

The order element has the following grammar, where s is one or more whitespace characters (or string-start or string-end) and NCName is an XML name with no colons.

  s ( (NCName "."? s )| 
	( "[" s (NCName ("."|";")? s)+ "]" s ) 
         )+

The order element specifies an ordering of elements; elements grouped by square brackets are at the same level of the order.
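
The grammar can be read with a short tokenizer. The sketch below is one possible reading, not a reference implementation: it assigns each name a level, with names in one bracketed group sharing a level. It assumes ASCII names without "." inside them (real NCNames allow more), and it is lenient about whitespace next to brackets. Particle and parse_order are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Particle(NamedTuple):
    name: str
    level: int        # names inside one [ ] group share a level
    empty: bool       # "." suffix
    breaks_out: bool  # ";" suffix


# Simplified ASCII names; a token must be followed by whitespace, "]" or the end.
TOKEN = re.compile(r"\s*(?:(\[)|(\])|([A-Za-z_][A-Za-z0-9_-]*)([.;]?)(?=[\s\]]|\Z))")


def parse_order(text: str) -> List[Particle]:
    """Turn the content of an order element into a list of levelled particles."""
    particles: List[Particle] = []
    level, in_group, group_start, pos = 0, False, 0, 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected input at offset %d" % pos)
        pos = m.end()
        opener, closer, name, suffix = m.groups()
        if opener:
            if in_group:
                raise ValueError("groups cannot be nested")
            in_group, group_start = True, len(particles)
        elif closer:
            if not in_group or len(particles) == group_start:
                raise ValueError("unexpected or empty group")
            in_group = False
            level += 1
        else:
            if suffix == ";" and not in_group:
                raise ValueError('";" is only allowed inside a group')
            particles.append(Particle(name, level, suffix == ".", suffix == ";"))
            if not in_group:
                level += 1
    if in_group:
        raise ValueError("unclosed group")
    if not particles:
        raise ValueError("empty schema")
    return particles


for p in parse_order("[ x y; ] x"):
    print(p)
```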

Validation occurs by, for each element in the document proceeding in document (streaming) order, checking that every previous-sibling element at the same level and then each ancestor element are ordered according to the list order (ignoring intermediate list items, but failing if any element has no corresponding item in the schema). A name may appear more than once. (Actually, an implementation only needs to look at the first child and/or next sibling to perform validation, but explaining it this way around may make the syntax easier to understand.)

A full stop (period) on an element indicates that the element may have no contents (no subelements, and the space-normalized value of the contents is empty): this is almost the same as EMPTY. A semi-colon indicates that the current group will be broken out of: the named element cannot contain elements in the same group. (It can still be followed by elements of the same group.) A semi-colon in a group at the end of a schema thus indicates that only simple content is possible.

Normally [ x y ] allows

   <x><y/><x/><y/></x>
   <y><x/><y/><x/></y>

but [ x y; ] allows

   <x><y/><x/><y/></x>

but not

   <y><x/><y/><x/></y>

while [ x y. ] allows

   <x><y/><x/><y/></x>

but not

   <y><x/></y>

So [ x y ] means

an x can contain any number of nested x and y before any other element
an x can be followed by any number of x and y before any other element
a y can contain any number of nested x and y before any other element
a y can be followed by any number of x and y before any other element

but [ x y; ] adds the constraint

a y cannot contain a y next (unless the next particle in the hook schema happens to be a y e.g. [ x y; ] x )
a y cannot contain an x next (unless the next particle in the hook schema happens to be an x, e.g. [ x y; ] x )

So ";" is used to break out of the recursion allowed in a [ ] group.

Intuitively, this is like first making a big list of every element allowed, putting them all in a choice group. This gives us a complete definition of every allowed element: it defines the namespace and catches spelling errors. Next, if there are some elements that can start, move them out to the front (or copy them if they can reappear). Now the schema validates the top-level elements too. Next, if there are some elements that can only appear as the last elements in a content model (e.g. the z in (x, y, z) or the b and c in (a, (b | c)*)) then move these out to a group at the end. Now we have validation for elements in simple mixed content. Continue factoring until done.

So given the following schema:

   <hook:order>A B. C</hook:order>

then the following documents are valid


  <A/>

  <A><B/></A>

  <A><C/></A>

  <A><B/><C/></A>

  <A><C><C/></C></A>

  <A><A/></A>

But not

 <B/>

  <C><B/></C>

  <B><A/></B>

  <A><C/><B/></A>

  <A><C/><C/><B/></A>

 <A><B><B/></B></A>
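
The examples above can be reproduced with a small validator. The sketch below is only one reading of the rules, under several assumptions: names are matched by local name (namespaces, friendly and short are ignored), top is taken to be true, names are simplified ASCII, and a repeated name is matched greedily to its earliest usable position without backtracking. All function and class names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional


class Particle(NamedTuple):
    name: str
    level: int        # names inside one [ ] group share a level
    empty: bool       # "." suffix: no content allowed
    breaks_out: bool  # ";" suffix: children must come from a later level


TOKEN = re.compile(r"\s*(?:(\[)|(\])|([A-Za-z_][A-Za-z0-9_-]*)([.;]?)(?=[\s\]]|\Z))")


def parse_order(text: str) -> List[Particle]:
    """Lenient reader for the order string (simplified ASCII names)."""
    particles, level, in_group, pos = [], 0, False, 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected input at offset %d" % pos)
        pos = m.end()
        opener, closer, name, suffix = m.groups()
        if opener:
            in_group = True
        elif closer:
            in_group = False
            level += 1
        else:
            particles.append(Particle(name, level, suffix == ".", suffix == ";"))
            if not in_group:
                level += 1
    return particles


def local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def find(particles: List[Particle], name: str, min_level: int) -> Optional[Particle]:
    """Earliest particle with this name at or after min_level (greedy choice)."""
    for p in particles:
        if p.level >= min_level and p.name == name:
            return p
    return None


def check(elem: ET.Element, particle: Particle, particles: List[Particle]) -> bool:
    children = list(elem)
    if particle.empty and (children or (elem.text or "").strip()):
        return False
    # The first child may reuse the parent's level unless the parent is marked ";".
    min_level = particle.level + 1 if particle.breaks_out else particle.level
    for child in children:
        match = find(particles, local_name(child.tag), min_level)
        if match is None or not check(child, match, particles):
            return False
        min_level = match.level  # the next sibling may not go back in the order
    return True


def is_valid(order: str, document: str) -> bool:
    particles = parse_order(order)
    root = ET.fromstring(document)
    top = find(particles, local_name(root.tag), 0)
    if top is None or top.level != 0:  # assumes top="true"
        return False
    return check(root, top, particles)


for doc in ("<A><B/><C/></A>", "<A><C/><B/></A>", "<A><B><B/></B></A>"):
    print(doc, is_valid("A B. C", doc))
```

Under these assumptions the sketch accepts the six valid documents listed above and rejects the six invalid ones. The greedy choice means a schema in which the same name appears with different suffixes at different positions may be judged too strictly.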

It is quite possible that there are languages which exhibit orders that cannot be usefully captured. In those cases, a hook schema still can show the top element, all names in the namespace, and which elements must be empty.

Example

The following example is a hook schema for XHTML Basic

 <hook:order targetNamespace="http://www.w3.org/1999/xhtml" >
  html head  [ title; meta. link. base. ]   body
  [ a br. blockquote caption; div  dl; h1; h2; h3; h4; h5; h6;  
	img. ol; p; pre; table; ul; ]  
  [ tr;  dt; dd; li; ]  td 
  [ a br. blockquote div  form img. ol; ul; li; ]  
  [ input; label; select; textarea; ]  [ option. ]
  [ abbr acronym address cite code dfn em kbd q samp span strong var object; ] 
  param 
 </hook:order>

This schema captures a lot of the containment relationships, I think, though it probably contains some mistakes. But it will not detect what may be a common XHTML problem, where omit-end-tag HTML elements like <body> are converted to <body />. However, it will detect problems like <meta> not being converted to an empty tag and so spuriously including other head elements.

The next example is RSS.

 <hook:order  targetNamespace="http://purl.org/rss/1.0/" >
  channel   title link image items item
  title link url description textinput.
 </hook:order>

A Hook schema for the well-known Purchase Order example would be:

 <hook:order  targetNamespace="..." >
  PurchaseOrder  [comment; ShipTo; ]  Name Street City State Zip  
  ShipDate [ comment; Items; ] Item productName quantity price comment
 </hook:order>

This is a much more successful example! Note that every valid PO document will also be valid against this schema and that the schema validates all sequence requirements. What it won't catch is an end-tag in the wrong place w.r.t. what should be a sibling. So it seems that Hook may be good for validating datagrams of this kind.

Following is a schema for Schematron 1.5

 <hook:order  targetNamespace="http://www.ascc.net/xml/schematron" >
  schema  ns title p phase active p pattern rule   [ assert; report; key.] 
  diagnostics  diagnostic [ name. dir; emph; value-of. ]
 </hook:order>

Again, this is pretty good: there is a good amount of order to capture. The "diagnostics diagnostic" could also come before or after rule.

In all four cases above the character count is less than 400 characters, so it looks like they would be retrieved in the first packet group from a server.

Why Hook?

The name Hook comes from the supposed hook shape made when drawing this on a parse tree, tracing the previous siblings and then going up through the ancestors.

Resource Directory (RDDL) for Hook 0.2

A One-Element Language for Validation of XML Documents based on Partial Order

Related Resources for Hook 0.2

Well known URI

Root namespace URI