Last modified: 25 August 2000

XML rule language: a brief technical overview

This document describes an element set consistency language used by the consistency check engine developed for consistencycheck.com. The following sections will explain the basic ideas behind the language and the approach taken to the expression of consistency relationships, present the new language and a few introductory examples and finally give the DTD of the new language.

Note: This document is a preliminary release draft Before reading the overview you should already be familiar with the broad concept of checking consistency between XML documents. You will need to understand XML, DTDs, XPath and a bit of set theory (negligible). Please contact us if you want more information.

Language basics

Before introducing the language, let us pick a running example which will serve to illustrate the ideas. For this, we will use the scheme for managing syllabus information which can be seen on the example pages. Our first example consistency rule will be: Every course module in the curriculum must have a syllabus.

This rule is to ensure that students and staff can get access to information, e.g. prerequisites, reading lists, about each possible course. We will leave aside the DTD for the syllabus information, let's say it is very simple: The courses are kept in one XML file and there is one XML file for each course syllabus. A connection between the two can be made via the course code.

Here is an example syllabus for a course (with a lot of information removed):

complexity.xml

<syllabus>
        <identity>
            <title>Computational Complexity</title>
            <code>3C04</code>
            <summary>Theory of computation and complexity.</summary>
        </identity>
        <workload>
            <workload_element workload_hours="30" workload_mode="Lectures"/>
            <workload_element workload_hours="65" workload_mode="Private_Reading"/>
            <workload_element workload_hours="12" workload_mode="Required_Written_Work"/>
            <workload_element workload_hours="33" workload_mode="Revision"/>
            <workload_total>140</workload_total>
        </workload>
</syllabus>

And here is a sample degree course (again, with a lot of information removed).

samplecourse.xml

<StudyPlan>
        <StudentIdentity>
            <FirstName>Homer</FirstName>    
            <FamilyName>Finkelstein</FamilyName>
            <Reg_Number>3645</Reg_Number>
        </StudentIdentity>

        <Programme> 
            <Title>CS</Title>
            <Award>BSc</Award>
        </Programme>

        <Year number="1">
            <Course>   
                <Name>Computer Architecture and Networks</Name>          
                <Code>1B10</Code>
            </Course>
            <Course>
                <Name>Consistency Checking of Distributed Documents</Name>
                <Code>0xffff</Code>
            </Course>
        </Year>
</StudyPlan>

Let us formulate the example consistency rule in a more precise way, but still in English: For every course module in the curriculum, a syllabus entry exists with the same code.

To illustrate the thinking behind the consistency rule language, we will now parse the english text of the sentence to point out the important bits:

For every course module in the curriculum
a syllabus entry exists
with the same code

We can interpret the colours as follows:

The blue items specify the containment context for the red items.
The red items refer to sets of elements. "course module" refers to the set of <Course> elements in study plans, "syllabus entry" refers to the set of <identity> elements in the syllabus information file. (Note that the english text is deliberately imprecise, it does not mention the identity elements because it serves as a description for people who might not be familiar with the rules).
The green item provides a relative set. The meaning of this is slightly obscure. What it means is that for every entry in the set, we generate another set relative to the entry. In the case of course and identity, we will select the "code" attribute. In this case, the relative set for each entry will be of size one (just the code attribute). In other cases not documented here, it may be necessary to select a set of attributes or a set of subelements.
The purple item is an operator, in this case we want to compare the "code" attribute of the course element and the text contained in the code element of the syllabus identity element for equality.

1. Set generation

The first step in enforcing the example consistency rule is to generate sets of elements to compare. The source and destination sets for our consistency rule should look similar to this (contained in parenthesis is the set relative to the element, according to our rule, i.e. the Code attribute of the course entry in the study plan and the contents of the "code" elements contained in the syllabus identity element):

Set 1	Set 2
Course (1B10)	code (1B10)
Course (1B12)	code (1B12)
Course (1BGOOD)

Table 1

To summarize, we will generate sets of elements (e.g. Course and code elements) and for each element, we will generate a relative set (e.g. the code attribute) for comparison with the relative set of another element.

2. Set processing and filtering

Looking back to the rule, it starts with for every course module in the curriculum.... In this approach, we will assume that every consistency rule will be of the form "for every" or "for each", i.e. a universal quantification on the source elements. What we need to do then is to step through the source set (set 1) entry by entry and somehow work on the destination set.

Our approach is to execute a filter query (more or less equivalent to the idea of "restriction" in the relational algebra, but filter sounds more user-friendly) on the destination set, predicated by the current source node's relative set. For example, suppose we have just started our traversal of the source set and are at entry 1, "Course (1B10)". We now execute a filter roughly of the sort "Code = text()" (the text() function is xpath notation and refers to the text contained in the code element of the syllabus identity), which keeps all nodes in the destination set which have the same code as the current source node. The resulting table is:

Set 1	Filtered Set 2 (*Code = text())
* Course (1B10)	code (1B10)
Course (1B12)
Course (1BGOOD)

Table 2 - * marks current source node

The observation that the filtered set is not empty tells us that our existential quantifier ("there exists a syllabus entry..") is satisfied. The current source element is thus consistent with our rule. Now, the processing will continue for each node in the source set. Suppose we arrive at the last entry, "Course (1BGOOD)". The destination set will be filtered using "Code = text()", giving

Set 1	Filtered Set 2 (*Code = text())
Course (1B10)
Course (1B12)
* Course (1BGOOD)

Table 3 - * marks current source node

Unfortunately, the filtered destination set is empty. We conclude that there is no syllabus information for the course with code "1BGOOD". The current source element is inconsistent with our constraint.

3. Consistency links

In Table 2 and Table 3, we have seen a consistent and an inconsistent source element, respectively. We now have to decide how to generate our consistency links. Fortunately, we get all the information we need for free. All the user has to specify in the rule is whether consistent or inconsistent links should be generated at all and where inconsistent links should be pointing (explained below). In Table 2, we have found a syllabus entry consistent with our course. If the user specifies "generate consistent links", we will simply link the current element in the source set to every entry in the filtered destination set. Looking back at table 2, this gives an XLink linking the Course (1B10) element to the code (1B10) element (actually, we would link to the parent of code (1B10), the syllabus element, but it doesn't make much of a difference).

The user could also specify "generate inconsistent links". In this case, every element in the source set will get an incosistent link to the elements in the filtered destination set. Imagine a rule of the form For every course in the degree, there must not be a syllabus with the same name, however silly this example may sound.

Finally, the user could specify "generate inconsistent links to nowhere". This would apply in our example. In Table 3, the destination set is empty. The current source element is therefore inconsistent by the definition of our rule. However, where do we link if there is nothing to link to? The source element is not inconsistent with anything else, it is inconsistent with itself, or the system it resides in. We produce an inconsistent link that doesn't point anywhere (or to the element itself).

At the end of this example, we should have the following links:

Course (1B10) <-C-> code (1B10)
object (1B12) <-C-> class (1B12)
object (1BGOOD) <-I-> nowhere

Language formalism

We will now loosely specify the language we are working with. A grammar for the language exists but is too long and boring to include here. Having done that, we can then find a suitable and user-friendly encoding in XML. We will use the following notation:

Bold uppercase letters (S, D) refer to element sets (e.g. the set of all courses, etc).
Lowercase letters are elements of sets with the same letter in uppercase (e.g. s is an element of S).
The dot-notation refers to relative context, e.g. s.A is attribute A of element s (e.g. s.A could be the code of the current course).
|S| denotes the cardinality of number of elements of a set S.
S/(formula) denotes the restriction of set S according to a formula.

Our basic rules will then be of the form:

FORALL s in S, |D/(formula)| = size

A formula is an operator expression, e.g. equality, or a formula connected by the boolean connectives "and" or "or". The formula may test properties on two things:

A relative attribute of the current source element of the forall expression
A relative attribute of the destination element being compared

For example, we can express our running example as follows:

Let S be the set of Course elements in StudyPlan/Year elements
Let D be the set of code elements in syllabus/identity elements
Let s.A be the Code attribute of a Course element
Let d.A be the text contained in the code element of the syllabuses
FORALL s in S, |D/(s.A=d.A)| > 0

One thing to notice immediately is that the 'size' operand can be different from zero. This is a powerful tools because we can specify rules like "for every year in the curriculum, there have to be at least four courses which are classified as mandatory" and similar examples.

XML encoding and examples

We now present the XML encoding of our language by giving several examples. The DTD is presented at the bottom of this page but is not nearly as informative as the examples. We will start with our running example, the existence of a syllabus for each course.

Example 1

Looking back to the previous section, our XML encoding works exactly like the notation used above, we first state which sets we are working on and then write the consistency rule:

<ConsistencyRule id="r1">

    <Description>Each course in the studyplan has a syllabus
    </Description>

    <SetDefinition id="s1">
        /StudyPlan/Year/Course
    </SetDefinition>
    <SetDefinition id="d1">
        /syllabus/identity
    </SetDefinition>

    <Forall setid="s1">
        <SizeNotEqual cmode="C" imode="IX">
            <Filter setid="d1">
                <Equal>
                    <XPathSource value="Code"/>
                    <XPathFilter value="code"/>          
                </Equal>
            </Filter>
            <Integer value="0"/>
        </SizeNotEqual>
    </Forall>
</ConsistencyRule>

Now, looking at the rule, a destination element goes into the filtered set if the equal expression, which works on the relative xpath expressions inside, returns true. The XPathSource element will generate an attribute set relative to the nearest enclosing forall expression (there can only be one forall anyway, but these are the semantics). The XPathFilter expression works on the set in the nearest enclosing Filter element. Again, there can only be one filter element containing an XPathFilter, but this is the way this constraint is enforced in our DTD.

The Filter element will thus filter those nodes out of the destination set for which the filter expression returns false. The second parameter to the SizeNotEqual element is an integer with the value zero. The size of the filtered set will be compared to the integer. If they are not equal, consistent links are generated into the filtered set from the current source node, otherwise an inconsistent link to "missing" is generated from the current source node.

Example 2

This example expresses the rule that all first year courses in the CS department at UCL must contain exactly 8 compulsory course modules. The use of the word "course" for both the degree programme and the modules may be irritating, but that's the way it goes. The features of the rule are explained below.

<ConsistencyRule>

    <SetDefinition id="source">
        /Year[@number="1"]/Course
    </SetDefinition>

    <SetDefinition id="destination">
        /Curricula/Curriculum/Year/Course/Type[@requirement="C"]/..
    </SetDefinition>

    <Forall setid="source">
        <SizeEqual cmode="CX" imode="IX">
            <Filter setid="destination">
                <And>
                    <Equal>
                         <XPathSource value="Code"/>
                         <XPathFilter value="Code"/>
                    </Equal>
                    <Equal>
                         <XPathFilter value="Type[@requirement]"/>
                         <Constant value="C"/>
                    </Equal>
                </And>
            </Filter>            
            <Integer value="8"/>
        </SizeEqual>
    </Forall>

</ConsistencyRule>

First of all, observe that using the XPath expression /Year[@number="1"]/Course we can already pre-filter the source set to only include first year courses. In addition, we only want the Course elements which contain a Type subelement with the requirements attribute set to "C", again this is done by XPath.

Now we filter the set of course modules to include only the courses with the same code as the degree programme. If the final, filtered set is of size 8, we generate a consistent link from the current source node to the rule (specified by "cmode=CX" in the SizeEqual element). This way of handling the consistency links seems to make more sense than linking to all destination courses (the source node is consistent towards the rule, but it cannot be said that it is "consistent with its modules"). Similarly, if the size of the destination set is above or below eight, we generate an inconsistent link to the rule ("imode=IX").

Example 3

This example expresses the rule that a course cannot be a prerequisite of itself. For any given course, we now want to retrieve all the courses that are its prerequisites (at any level of nesting) and then check whether or not this set contains the course itself. This can be done using the transitive closure operator of our language, as shown below.

<ConsistencyRule id="r6">

    <SetDefinition id="s1">
        /syllabus/identity
    </SetDefinition>
    <SetDefinition id="d1">
        /syllabus/identity
    </SetDefinition>

    <Forall setid="s1" mode="oneSet">
        <SizeNotEqual cmode="CN" imode="IX">
                <Filter setid="d1" mode="oneSet">
                    <And>
                        <Equal>
                                <XPathSource value="code"/>
                                <XPathFilter value="code"/>
                        </Equal>
                        <IsIntersect>
                                <XPathSource value="code"/>
                                <XPathFilter value="closure(code,../../subjet/prerequisites/pre_code)"/>
                        </IsIntersect>
                    </And>
                </Filter>
                <Integer value="0"/>
        </SizeNotEqual>
    </Forall>
</ConsistencyRule>

The way the transitive closure operator works is as follows: given a course element, it first retrieves its direct prerequisites and then it repeats this process recursively, until all the prerequisites have been found. We then only have to check whether or not the course is listed (IsInteresect operator) among them.

Example 4

This example shows how to use variables to pre-filter a destination set so as to improve the performance of the check. Let us consider the following rule: each course in the studyplan is actually identified in the curricula.

<ConsistencyRule id="r6">

    <SetDefinition id="s1">
        /StudyPlan/Year/Course
    </SetDefinition>
    <SetDefinition id="d1">
        /Curricula/Curriculum/Year/Course
    </SetDefinition>

    <Forall setid="s1" mode="multipleSet">
        <SizeNotEqual cmode="C" imode="IX">
                <Filter setid="d1" mode="oneSet"/>
                    <Equal>
                        <XPathSource value"Code"/>
                        <XPathFilter value="Code">
                    </Equal>
                <Integer value="0"/>
        </SizeNotEqual>
    </Forall>
</ConsistencyRule>

The number of courses retrieved in the destination set "d1" is quite high (about 500). During the filtering operation we have to evaluate the xpath expression "Code" on each of these courses, and this operation is time-consuming. What we would like to do instead is to work on a destination set containing only those courses relevant to this studyplan, that is only the courses that belong to the same programme undertaken by the student. Using variables we are able to cut down the number the courses that belong to the destination set, as shown below:

    <SetDefinition id="d1">
       /Curricula/Curriculum/Programme/Title[text()=$s1/../../Programme/Title]/../Award[text()=$s1/../../Programme/Award]/../../Year/Course
    </SetDefinition>

The variable "$s1" refers to the definition set having the same id. In principle this variable could be bound to any set defined inside the rule; at the moment we are using a simplified version of this approach so that this variable is always bound to the set that is referred to by the "Forall" statement.
This pre-filtering operation drastically reduces the cardinality of the destination set so that the filtering operation can be performed much quicker. In some circumstances, the pre-filtering operation actually does all the filtering necessary, and all we have to do is to check the cardinality of the destination set. This is the case with the rule above that can be rewritten in the following way.

<ConsistencyRule>

    <SetDefinition id="s1">
        /StudyPlan/Year/Course
    </SetDefinition>
    <SetDefinition id="d1">
        /Curricula/Curriculum/Programme/Title[text()=$s1/../../Programme/Title]/../Award[text()=$s1/../../Programme/Award]/../../
        Year[@number=$s1/../@number]/Course/Code[text()=$s1/Code]/..
   </SetDefinition>

    <Forall setid="s1" mode="multipleSet">
        <SizeNotEqual cmode="C" imode="IX">
                <Filtered setid="d1" mode="oneSet"/>
                <Integer value="0"/>
        </SizeNotEqual>
    </Forall>
</ConsistencyRule>

Consistencyrule DTD

Here is the current complete DTD for the consistency rule language.

<?xml version="1.0" encoding="ISO-8859-1"?>

<!ENTITY % Boolean "And,Or">
<!ENTITY % MultiOperator "AndOperator,OrOperator">

<!ELEMENT XPathSource EMPTY>
<!ATTLIST XPathSource
value CDATA #REQUIRED>

<!ELEMENT XPathFilter EMPTY>
<!ATTLIST XPathFilter
value CDATA #REQUIRED>

<!ELEMENT Constant EMPTY>
<!ATTLIST Constant
value CDATA #REQUIRED>

<!ELEMENT Integer EMPTY>
<!ATTLIST Integer
value CDATA "0">

<!ELEMENT ConsistencyRuleSet (ConsistencyRule)*>
<!ELEMENT ConsistencyRule (Description,SetDefinition*,Forall)>
<!ATTLIST ConsistencyRule
id ID #REQUIRED>

<!ELEMENT Description (#PCDATA)>

<!ELEMENT SetDefinition (#PCDATA)>
<!ATTLIST SetDefinition
id ID #REQUIRED>

<!ELEMENT Forall (%MultiOperator;,SizeEqual,SizeNotEqual)>
<!ATTLIST Forall
setid IDREF #REQUIRED
mode (oneSet|multipleSet) "multipleSet">

<!ELEMENT AndOperator (SizeEqual*,SizeNotEqual*,AndOperator*,OrOperator*)>
<!ATTLIST AndOperator
cmode (C|CX) "C"
imode (I|IX) "">

<!ELEMENT OrOperator (SizeEqual*,SizeNotEqual*,AndOperator*,OrOperator*)>
<!ATTLIST OrOperator
cmode (C|CX) "C"
imode (I|IX) "">

<!ELEMENT SizeEqual (Filter*,Filtered*,Integer*)>
<!ATTLIST SizeEqual
cmode (C|CX) "C"
imode (I|IX) "">

<!ELEMENT SizeNotEqual (Filter*,Filtered*,Integer*)>
<!ATTLIST SizeNotEqual
cmode (C|CX) "C"
imode (I|IX) "">

<!ELEMENT Filter (%Boolean;,Equal,NotEqual,IsIntersect,Subset)>
<!ATTLIST Filter
setid IDREF #REQUIRED
mode (oneSet|multipleSet) "multipleSet">

<!ELEMENT Filtered (%Boolean;,Equal,NotEqual,IsIntersect,Subset)?>
<!ATTLIST Filtered
setid IDREF #REQUIRED
mode (oneSet|multipleSet) "multipleSet">

<!ELEMENT Equal (XPathSource,XPathFilter,Constant*)>
<!ELEMENT NotEqual (XPathSource,XPathFilter,Constant*)>
<!ELEMENT IsIntersect (XPathSource,XPathFilter,Constant*)>
<!ATTLIST IsIntersect size CDATA #REQUIRED>
<!ELEMENT Subset (XPathSource,XPathFilter,Constant*)>
<!ATTLIST Subset size CDATA #REQUIRED>

<!ELEMENT And (Equal*,NotEqual*,IsIntersect*,Subset*,And*,Or*)>
<!ELEMENT Or (Equal*,NotEqual*,IsIntersect*,Subset*,And*,Or*)>