The Schematron Assertion Language

Rick Jelliffe
Academia Sinica Computing Centre
2000-05-05

Abstract

The Schematron is a language system for specifying and declaring assertions about arbitrary patterns in XML documents, based on the presence or absence, names and values of elements and attributes along paths. Its target uses are for software engineering requiring document validation, for scholarly research over patterns in graph-structured data, for automatic creation of external markup, and to aid accessibility of documents for people with disabilities.

Schemas

For markup languages, a schema is a specification of interlocking constraints between information items in a document. These specifications are made formally using a schema language, which implements some schematic paradigm.

This paradigm will be more or less satisfactory in expressing the constraints depending on the particular case; indeed, when the schema was written before the document, the schematic paradigm will influence the structures allowed and used to express interlocking constraints. Thus the schema paradigm acts to extend the effective vocabulary of markup types beyond those provided by the particular markup language. For example, [SGML] DTDs allow 'global inclusion exceptions'; [XML] DTDs do not allow them, so XML DTDs cannot express that particular constraint.

In recognition that any schematic paradigm imposes limitations on the constraints expressible by a schema, SGML refers to its grammar declarations as 'content models'. Thus one can say there is a difference in expectations between a schema and a model: the former defines canonically or exhaustively, whilst the latter describes as best it can according to its schematic paradigm.

Grammars

Conventionally, starting with SGML DTDs, schemas for markup languages are defined in terms of grammars to regulate element containment and lists to regulate attribute containment, augmented by datatype constraints on various information elements. In XML, these additional constraints are concerned with enabling graph structures to be represented, rather than describing the semantics or types of information elements. In the late 1990s, many schema languages were developed for XML in anticipation of the development of the World Wide Web Consortium's [XML Schema] schema language. All these schema languages used the grammar-founded approach mentioned above, elaborating on it using objects [SOX], modules [RELAX], production selectors [Assertion Grammars], etc.

But these approaches have certain deficiencies. For a start, SGML used grammars as its schematic paradigm because one does indeed define grammars with it, down to lexical level. The grammar paradigm is not necessary for schemas for XML. Secondly, the grammar approach is not sufficient to express any constraints between information items in different branches of the attribute-value tree which forms the primary view of an XML document. The mechanisms for declaring unique identifiers and references do not alter this. Mechanisms such as that proposed for [XML Schema] to introduce grammar non-terminals (termed the tag/type distinction) allow an element name to have a different datatype or content model depending on its parent; however, this merely allows the type of an element to be constrained by its parent's type as well as by its generic identifier (i.e., by its tag).

These deficiencies are nothing more than schema paradigm limitations. However, there are other pragmatic and policy considerations that may make a grammar-based schema paradigm unattractive.

Considerations

The first is a cultural, or perhaps educational and linguistic, one. According to [deFrancis], Western written languages can be thought of as having the broad hierarchy 'character', 'word', 'phrase', 'sentence', while Chinese can be thought of as having the broad hierarchy 'character', 'idea', 'phrase', 'sentence'. A Chinese character does more than a Western character, and represents ideas which are again at a slightly higher level than Western words (i.e., in non-agglutinating languages.) Thus in Chinese, it is possible that one moves quickly from a non-grammatical level to a semantic level; this may be why Taiwanese students are anecdotally reported[1] to find the idea of an abstract grammar to be theoretical rather than practical. At the least, we should not dismiss the possibility that using formal grammars as a schematic paradigm may be more easily acceptable to members of one language group or culture than another.

Secondly, and perhaps related to the previous point, if a schematic paradigm has language or cultural affinities, it may be even more likely that different schema languages are more or less useful to people with different cognitive impairments. I only need to take this point as far as saying the obvious: a complex schema paradigm may be more difficult than a simple one. And the difficulty of a schematic paradigm may lie not just in its complexity to explain but also in its complexity to use and implement.

Thirdly, when considering the needs of schemas for constraints on documents which are needed to support accessibility by disabled people, we come up against what I regard as a fundamental shortfall in existing schema languages: they are designed to support definitional schemas which intend to specify exhaustively or canonically the required constraints on a document. However, accessibility constraints are policy constraints imposed on a document in addition to those constraints required to define that document.

Fourthly, after admitting that there can be important non-definitional constraints on a document, the question arises of what other non-definitional constraints there can be. The main one I identify is the requirements of workflow: that some constraints may only come into existence during some phase of a document's life cycle. Without some notion of constraints that come into play during a phase, one must either weaken constraints on a schema or arbitrarily switch schemas during the document's life cycle.

Fifthly, building on the notion that it is useful to be able to switch constraints in and out during formally defined phases of a document's life cycle, we can see that the ability to group constraints and switch them in and out on an ad hoc basis during editing of a document would be useful. It is a common difficulty with validating editors of structured documents that otiose errors are reported for documents that are under construction and incomplete.

Thus one consideration leads to the next, and the result can be considerable doubt that grammars form an adequate schematic paradigm for documents for the World Wide Web. These are some of the needs and considerations underpinning development of the Schematron assertion language.

Uses

The Schematron language has been developed with four main use-areas in mind:

For software engineering, to allow the expression of interlocking constraints on standalone XML documents, and for checking pre- and post-conditions of XML documents used as arguments or returned by functions in programming languages.
For scholarly use, where the dataset is a graph expressible by an XML document, and where the constraints are hypotheses about the dataset that can be tested.
For use in automated markup systems, where one wishes to detect patterns in data and produce an external or inline document which links
For accessibility, to aid the accessibility of documents for people with disabilities.

As part of the Schematron project, exemplar software to do these has been produced and is available on the WWW[2].

Assertions

A Schematron schema is made by specifying assertions, which are simple declarative sentences in natural language.

The <assert> element is used to tag positive assertions about a document. For example,

<assert>A 'dog' element should contain two 'ear' elements.</assert>

This assertion is something that is expected to be true of the document. If a document is validated against the schema, and the test for this assertion fails, an application can take some action. Schematron does not specify any actions: it only allows assertions to be tested, for the parts of assertions to be given roles, for the assertions to be grouped into rules, for the rules to be grouped into patterns, and for the patterns to be activated in various phases.

The <report> element is used to tag negative assertions about a document. For example,

<report>This dog has a bone.</report>

Within these two elements, it is possible to use a <name> element, which gives the specific name of the context element for which the assert statement failed or the report statement succeeded. The <name> element can also have a path attribute in which an [XPath] expression can be given; this allows the name of an element or attribute different from the context element to be specified. Because some implementations of Schematron may format these names differently from the surrounding text, an element <emph> is also allowed; its only use is to allow names of elements or attributes mentioned in assertions to have the same format as those provided by evaluating the <name> element.
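
As a hedged illustration, the two assertions above could use these elements as follows, in the same abbreviated form (test attributes omitted). The path attribute is named as in the DTD in the appendix; the wording of the sentences is only an example.

<!-- illustrative wording only; <name/> stands for the name of the context element -->
<assert>A <name/> element should contain two <emph>ear</emph> elements.</assert>
<report>This <name/> has a bone, and its parent is a <name path=".."/> element.</report>

With a 'dog' element as the context, an implementation would be expected to substitute 'dog' for each empty <name/> element, and the name of the parent element for the one with a path attribute.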

For internationalization, the element <dir> can be used inside these two elements to support bidirectional written languages; the semantics are those of the dir element of [HTML]. The elements may also have an xml:lang attribute for tagging the written language of the contents of the element; the xml:lang attribute does not express the language of the target document.

For better formatting of assertion reports, these two elements may also have an icon attribute, which is the [URL] of a small image that may provide visual clues to a user.

These two elements can also have a subject attribute. This is an [XPath] path which allows very direct specification of the subject of the assertion: this may be useful information for automatically generating [RDF] documents.

Rules

<assert> and <report> elements are grouped inside <rule> elements. The <rule> element has a context attribute which contains an [XPath] path. Every element in the document that this path matches is then used as the context to test the assertions. An assertion is tested by evaluating an [XPath] expression declared in the test attribute of the <assert> or <report> element.

The full declarations for the assertions above are

<rule context="dog">
   <assert test="count(ear) = 2"
   >A 'dog' element should contain two 'ear' elements.</assert>
   <report test="bone"
   >This dog has a bone.</report>
</rule>

These three elements are the operational core of Schematron. [XPath] expressions allow a very wide range of constraints to be expressed: based on element and attribute names, based on their position and occurrence, based on text values, and based on counts. In the example, the context is every element with a generic identifier 'dog': the test in the <assert> element counts the number of child elements with the generic identifier 'ear'. Neither assertion in this rule will fail for the following XML document:

<dog><ear/><ear/></dog>
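
By contrast, here is a sketch of a document for which both tests produce a result (the document is a made-up example, not taken from the Schematron distribution):

<!-- hypothetical instance: one ear and one bone -->
<dog><ear/><bone/></dog>

Here count(ear) is 1, so the test of the <assert> element fails and its assertion would be flagged; the path 'bone' selects an element, so the test of the <report> element succeeds and its statement would be reported.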

The context attribute is an [XPath] pattern as defined by [XSLT], allowing alternatives separated by '|', for example. The test attributes are [XPath] expressions, which allow logical operators such as 'and' and 'or' as well as the union operator '|'.
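
A minimal sketch of these operators in use follows. The 'cat' and 'ball' elements and the 'name' attribute are hypothetical additions to the running example.

<!-- hypothetical vocabulary: cat, ball, @name -->
<rule context="dog | cat">
   <assert test="count(ear) = 2 and @name"
   >A 'dog' or 'cat' element should contain two 'ear' elements and have a 'name' attribute.</assert>
   <report test="bone or ball"
   >This animal has a toy.</report>
</rule>

The '|' in the context makes the rule apply to elements of either name; the 'and' and 'or' in the tests combine conditions on the one context element.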

The <rule>, <assert> and <report> elements can each have a role attribute. This is an identifier within the schema to identify the role that is played. These elements can also have id attributes.
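
As an illustration only, the optional attributes described so far might be combined on the running example as follows. All of the attribute values (the identifiers, the role token and the icon location) are hypothetical.

<!-- all attribute values here are made up for illustration -->
<rule context="dog" id="dog-rule">
   <assert test="count(ear) = 2" id="dog-ears" role="anatomy"
      subject="ear" icon="icons/warning.gif"
   >A 'dog' element should contain two 'ear' elements.</assert>
</rule>

None of these attributes changes whether the test succeeds; they only give an application more information to use when it presents or processes the result.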

This double path system is reminiscent of SQL queries: one could consider a query SELECT x FROM y WHERE z IS a to be a context statement (i.e., 'WHERE z IS a') and a test (i.e., 'x FROM y').

A <rule> element can also contain <key> elements, which allow [XSLT]'s key mechanism to be used. This allows various kinds of reference constraints to be tested; it is more powerful than the [XML] ID/IDREF mechanism. The path attribute is an [XPath] path; the name attribute is a token naming the key. The icon attribute allows specification of an icon.
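
The following is a sketch of a uniqueness constraint written with a key. It assumes that an implementation indexes the elements matched by the rule context under the value selected by the path attribute, and that the key() function of [XSLT] is available in tests; both are assumptions about the implementation, and the 'tag' attribute is hypothetical.

<!-- assumption: dog elements are indexed by the value of their tag attribute -->
<rule context="dog">
   <key name="dogs-by-tag" path="@tag"/>
   <assert test="count(key('dogs-by-tag', @tag)) = 1"
   >A 'dog' element should have a 'tag' attribute whose value is shared with no other dog.</assert>
</rule>

Under those assumptions the key lookup returns every 'dog' element with the same tag value as the context element, so the count is 1 only when the value is unique.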

An important feature to note is that, because of [XSLT]'s document() function, a Schematron assertion test can refer to data in a different document from the context document. This allows Schematron schemas to be used for two important purposes: to validate against a controlled vocabulary located externally to the schema (indeed, this can be in any XML document type, not just a Schematron schema), and to validate the output of some program's function against data found in its input (or vice versa) as a form of black-box testing.
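
A minimal sketch of the controlled-vocabulary case follows. The file name breeds.xml, its structure (a 'breeds' element containing 'breed' elements with 'name' attributes) and the 'breed' attribute on 'dog' are all hypothetical.

<!-- assumption: breeds.xml looks like <breeds><breed name="collie"/>...</breeds> -->
<rule context="dog">
   <assert test="@breed = document('breeds.xml')/breeds/breed/@name"
   >The 'breed' attribute of a 'dog' element should be one of the breeds in the controlled vocabulary.</assert>
</rule>

The comparison is true when the breed value equals at least one of the names found in the external document. How the relative location is resolved depends on the implementation.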

A simple macro mechanism is allowed on rules. A <rule> element can have one or more <extends> elements. These have a rule attribute, which is the identifier of another rule. This allows you to bring in the assertions of an abstract rule, that is, one specified with an abstract attribute with the value "yes". An abstract rule element cannot have a context attribute. As an example, this constraint can be specified as follows (in [XPath] paths):

<rule context="sch:rule">
   <assert test="not(attribute::abstract='yes') or not(attribute::context)"
   >An abstract rule cannot have a context attribute.</assert>
   <assert test="not(attribute::abstract='no') or attribute::context"
   >A rule should have a context attribute (except for abstract rules.)</assert>
   <assert test="attribute::abstract or attribute::context"
   >A rule should have a context attribute (except for abstract rules.)</assert>
   <report test="attribute::abstract and not(attribute::abstract='yes') and not(attribute::abstract='no')"
   >In a rule, the abstract attribute is optional, and can have values 'yes' or 'no'</report>
</rule>

Note in this example that Schematron schemas are very specific. It is quite probable that a simpler schema would be just as effective, or the various assertions could be combined into a larger test with a more general statement.

One abstract rule can extend another. XML Entities can also be used for various macro effects, as desired.
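
As a hedged sketch of the mechanism, an abstract rule and a rule that extends it might look as follows. The identifier 'animal-basics' and the 'name' attribute are hypothetical; the rule attribute of <extends> is named as in the DTD in the appendix.

<!-- hypothetical abstract rule: no context attribute, referenced by its id -->
<rule abstract="yes" id="animal-basics">
   <assert test="@name"
   >An animal should have a 'name' attribute.</assert>
</rule>
<rule context="dog">
   <extends rule="animal-basics"/>
   <assert test="count(ear) = 2"
   >A 'dog' element should contain two 'ear' elements.</assert>
</rule>

The second rule then tests both assertions with each 'dog' element as the context, as if the assertion of the abstract rule had been written inside it.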

Patterns

Rules are grouped into <pattern> elements. An element will only be used as the context of one rule per pattern: the first rule in lexical order whose context matches it will be used.

Pattern elements have various attributes. The name attribute allows specification of a simple human-readable string to identify the pattern. The id attribute allows a unique identifier to be assigned for reference purposes. The fpi attribute allows an [SGML] Formal Public Identifier to be attached. The see attribute allows a [URL] to be specified which gives some human-readable documentation for the pattern; a hypertext presentation of the schema results can link to that resource.

A <pattern> element can have an icon attribute.
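
The effect of rule order within a pattern can be sketched as follows. The pattern name and identifier are hypothetical.

<!-- hypothetical pattern: the order of the two rules matters -->
<pattern name="Animal anatomy" id="anatomy">
   <rule context="dog">
      <assert test="count(ear) = 2"
      >A 'dog' element should contain two 'ear' elements.</assert>
   </rule>
   <rule context="*">
      <report test="ear"
      >An element that is not a 'dog' contains an 'ear' element.</report>
   </rule>
</pattern>

A 'dog' element matches both contexts, but only the first rule is used for it, so the second rule in effect applies to every other element. Reversing the two rules would leave the 'dog' rule unused.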

Schema

The top-level element of a Schematron schema is <schema>. A schema element should have a <title> sub-element.

Typically the schema will be declared using XML [Namespace] conventions. The preferred prefix is sch and the appropriate namespace URI is

http://www.ascc.net/xml/schematron

Thus a complete Schematron schema document is as follows:

<?xml version="1.0" encoding="US-ASCII"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
 <sch:title>Example Schematron Schema</sch:title>
 <sch:pattern>
  <sch:rule context="dog">
   <sch:assert test="count(ear) = 2"
   >A 'dog' element should contain two 'ear' elements.</sch:assert>
   <sch:report test="bone"
   >This dog has a bone.</sch:report>
  </sch:rule>
 </sch:pattern>
</sch:schema>

The <schema> element can have a ns attribute which gives the namespace URI that role attributes will have, if the role is used to externally mark up the target document.

The <schema> element also allows explicit declaration of the namespace prefixes and URIs that are used in the schema, using <ns> subelements. The usual XML [Namespace] mechanism can be used instead; however, the prefix and URI data is then not available for diagnostic reporting or application processing. Furthermore, some implementations may require that the information is made available in <ns> form.

A <schema> can have an icon attribute. It can also contain <p> elements, allowing some modest end-user-oriented documentation to be given: this allows the user to know what kind of validation or constraints the schema specifies, to aid them in interpreting any results usefully. The <p> element can have an icon attribute.
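
A sketch of a schema that declares a namespace for the target document and carries a line of documentation follows. The 'pet' prefix and its namespace URI are hypothetical, and the namespace is declared both ways here because implementations may differ in which form they need.

<?xml version="1.0" encoding="US-ASCII"?>
<!-- the pet prefix and the example.com URI are made up for illustration -->
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron"
            xmlns:pet="http://www.example.com/pets">
 <sch:title>Example Schematron Schema</sch:title>
 <sch:ns prefix="pet" uri="http://www.example.com/pets"/>
 <sch:p>This schema checks the anatomy of pets.</sch:p>
 <sch:pattern name="Anatomy">
  <sch:rule context="pet:dog">
   <sch:assert test="count(pet:ear) = 2"
   >A 'dog' element should contain two 'ear' elements.</sch:assert>
  </sch:rule>
 </sch:pattern>
</sch:schema>

The order of the sub-elements follows the content model for <schema> given in the appendix: the title, then <ns> elements, then <p> elements, then the patterns.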

Phases

Workflow and dynamic schemas are supported through the phase mechanism. The <schema> element can contain <phase> elements. Each must have an id attribute for a unique identifier; it can have an icon attribute; and it can have an fpi attribute to give a persistent identifier for the phase, because a phase may correspond to a DTD which had an FPI (note that the FPI is for the phase, not for the current schema per se.)

The <phase> element has an attribute activePatterns which is a list of the identifiers of patterns.

By default, all patterns in a document are active. However, an application may provide a way to allow the user to select the phase to be used: for example, a command line option when invoked from the command line, a preferences dialog box in a GUI, or a parameter on the function invocation when called as a precondition checker in a programming language.
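
As a hedged sketch, a schema with two phases might be written as follows. The phase and pattern identifiers and the 'img' rule are hypothetical; how a phase is selected is left to the application, as described above.

<?xml version="1.0" encoding="US-ASCII"?>
<!-- hypothetical phases: accessibility is only checked when publishing -->
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
 <sch:title>Example Schema With Phases</sch:title>
 <sch:phase id="drafting" activePatterns="structure">Drafting</sch:phase>
 <sch:phase id="publishing" activePatterns="structure accessibility">Publishing</sch:phase>
 <sch:pattern name="Structure" id="structure">
  <sch:rule context="dog">
   <sch:assert test="count(ear) = 2"
   >A 'dog' element should contain two 'ear' elements.</sch:assert>
  </sch:rule>
 </sch:pattern>
 <sch:pattern name="Accessibility" id="accessibility">
  <sch:rule context="img">
   <sch:assert test="@alt"
   >An 'img' element should have an 'alt' attribute.</sch:assert>
  </sch:rule>
 </sch:pattern>
</sch:schema>

With the 'drafting' phase selected, only the first pattern would be tested, so an incomplete document is not reported for accessibility constraints that only apply at publication.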

Hints

Users have reported that a common use of Schematron schemas is to allow specific diagnostics to be given. However, it is desirable to keep <assert> and <report> statements as general assertions rather than diagnostic messages. To support this, the <assert> and <report> elements have a hints attribute which is a reference to a <hint> element. These are allowed in a <hints> section at the end of the document. The value of the hints attribute can be a list of references to <hint> elements.

A <hint> element is general text. It can contain <dir> and <emph> subelements. It must have an id attribute to allow references to it. The <hint> element can have <value-of> sub-elements, which have the same semantics as in [XSLT]. These allow insertion of value information as well as name details. A <hint> element can have an icon attribute.
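
A minimal sketch of a hint attached to the running example follows. The identifier 'ear-count' is hypothetical, and the sketch assumes that the select path of <value-of> is evaluated with the same context element as the assertion.

<?xml version="1.0" encoding="US-ASCII"?>
<!-- assumption: value-of is evaluated against the context element of the rule -->
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
 <sch:title>Example Schema With Hints</sch:title>
 <sch:pattern name="Anatomy">
  <sch:rule context="dog">
   <sch:assert test="count(ear) = 2" hints="ear-count"
   >A 'dog' element should contain two 'ear' elements.</sch:assert>
  </sch:rule>
 </sch:pattern>
 <sch:hints>
  <sch:hint id="ear-count">This <sch:emph>dog</sch:emph> element contains <sch:value-of select="count(ear)"/> 'ear' elements.</sch:hint>
 </sch:hints>
</sch:schema>

The assertion stays a general statement about all 'dog' elements, while the hint carries the specific diagnostic for the element that failed.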

Appendix: XML DTD For Schematron 1.4

The following are markup declarations for the Schematron assertion language. For clarity, this version uses the default namespace; it is inadvisable to use the default namespace in practice, because some Schematron implementations may apply that default namespace to unqualified names in the target document. Note that, providing the defaulting noted is followed and except for ID purposes, the Schematron DTD does not make infoset contributions and should not be required.

<!-- +//IDN sinica.edu.tw//DTD Schematron 1.4//EN -->
     <!-- Data types -->
      <!ENTITY % URI  "CDATA" >
      <!ENTITY % PATH "CDATA" >
      <!ENTITY % EXPR "CDATA" >
      <!ENTITY % FPI  "CDATA" >
      <!-- Element declarations -->

       <!ELEMENT schema  ( title?, ns*, phase*, p*, pattern+ , p*, hints? )>

       <!ELEMENT assert   ( #PCDATA | name | emph | dir )*>
       <!ELEMENT dir        ( #PCDATA )>
       <!ELEMENT emph     ( #PCDATA )>
       <!ELEMENT extends EMPTY >
       <!ELEMENT hint       (#PCDATA | value-of | emph | dir)* >
       <!ELEMENT hints     ( hint+ )>
       <!ELEMENT key       EMPTY >
       <!ELEMENT name     EMPTY >
       <!ELEMENT ns        EMPTY >
       <!ELEMENT p       ( #PCDATA | dir | emph)* >
       <!ELEMENT pattern ( p*, rule+ )>
       <!ELEMENT phase   ( #PCDATA | dir | emph)* >
       <!ELEMENT report  ( #PCDATA | name | emph | dir )*>
       <!ELEMENT rule    ( assert | report | key | extends )+>
       <!ELEMENT title   ( #PCDATA | dir )* >
       <!ELEMENT value-of EMPTY >

<!-- Attribute declarations -->
       <!ATTLIST schema  xmlns %URI; #FIXED "http://www.ascc.net/xml/schematron"
                         fpi     %FPI; #IMPLIED
                         defaultPhase IDREF #IMPLIED
                         icon    %URI; #IMPLIED
                         xml:lang NMTOKEN #IMPLIED >

       <!ATTLIST assert  test    %EXPR; #REQUIRED
                         role    NMTOKEN    #IMPLIED
                         id        ID  #IMPLIED 
                         hints   IDREFS #IMPLIED
                         icon    %URI; #IMPLIED
                         subject %PATH; #IMPLIED 
                         xml:lang NMTOKEN #IMPLIED >
       <!ATTLIST dir  value ( ltr | rtl ) #IMPLIED >
       <!ATTLIST extends rule IDREF #REQUIRED >
       <!ATTLIST hint    id      ID    #REQUIRED
                         icon   %URI; #IMPLIED
                         xml:lang NMTOKEN #IMPLIED >
       <!ATTLIST key     name    NMTOKEN #REQUIRED
                         path    %PATH; #REQUIRED 
                         icon    %URI;  #IMPLIED >
       <!ATTLIST name    path    %PATH; #IMPLIED >
               <!-- Schematrons should implement '.' 
               as the default value for path -->
       <!ATTLIST p          xml:lang NMTOKEN #IMPLIED >
       <!ATTLIST pattern name    CDATA #REQUIRED
                         see     %URI; #IMPLIED
                         id      ID    #IMPLIED 
                         icon    %URI; #IMPLIED>
       <!ATTLIST ns      uri     %URI; #REQUIRED
                         prefix  NMTOKEN #IMPLIED >
       <!ATTLIST phase   id      ID    #REQUIRED
                         fpi     %FPI; #IMPLIED
                         activePatterns IDREFS #REQUIRED 
                         icon    %URI; #IMPLIED >
       <!ATTLIST report  test    %EXPR; #REQUIRED
                         role    NMTOKEN   #IMPLIED
                         id        ID             #IMPLIED 
                         hints    IDREFS    #IMPLIED
                         icon    %URI; #IMPLIED
                         subject %PATH; #IMPLIED >
       <!ATTLIST rule    context %PATH; #IMPLIED
                         abstract  (yes | no ) "no"
                         role    NMTOKEN    #IMPLIED 
                         id        ID  #IMPLIED >
                  <!-- Schematrons should implement 'no' as the default
                  value of abstract -->
       <!ATTLIST value-of  select %PATH; #REQUIRED >

Acknowledgments

The Schematron was developed as a free software project at the Academia Sinica Computing Centre in 1999 and 2000 by the author. I thank the Director, Dr. Simon Lin, for his encouragement and support. Also I owe thanks for the support and contributions of Professor C.C. Hsieh, Dr Makoto Murata, Dr Oliver Becker (architecture), Dr Miloslav Nic (tutorials), Dr David Carlisle, Mr James Clark, Mr Adrian Edwards, Mr Uche Ogbuji, Mr David Pawson, and Mr Ludvig Svenovius (extends).

References

[1] Private conversation by the author with a Taiwanese MIS professor.

[2] The Schematron project website is at

http://www.ascc.net/xml/resource/schematron/schematron.html

[deFrancis] The Chinese Language

[HTML]

[Namespace]

[SGML]

[SOX]

[RDF]

[RELAX] Murata M.

[URL]

[XML]

[XML Schema]

[XSLT]