Syntax for Regular-but-non-local Schemata for Structured Documents


From     murata@apsdc.ksp.fujixerox.co.jp Thu Feb 11 11:14:42 1999
Date:    Wed, 10 Feb 1999 16:48:47 +0900
From:    MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
To:      forest-automaton@fxis.fujixerox.co.jp
Subject: Fwd: DRAFT: Syntax for regular-but-non-local schemata
         for structured documents

[Note: Draft version only. Updated versionin preparation.]


DRAFT Version: Syntax for regular-but-non-local schemata for structured documents

MURATA Makoto


1. Introduction

In this note, I show practical requirements for regular-but-non-local schemata for structured documents, and then suggest a syntax for writing regular-but-non-local schemata.

Thoughout this memo, a schema is said to be local if it describes a local tree set. A local schema language is a schema langauge that can express local schemata only. A schema is said to be regular if it describes a regular tree set. A regular schema language is a schema language that can express regular schemata only.

2. Requirements

There are three types of practical requirements for regular-but-non-local schemata for structured documents. They are as below:

2.1 Support of schema maintenance by boolean operations

(1) Union

Suppose that technical reports in my company use an in-house schema based on HTML DTD 3.2 and technical reports in your company use another in-house schema based on the strict HTML DTD 4.0. Also supposet that I have to develop stylesheets and transformation programs for handling technical reports from both companies. To make this development easier, I would like to create a schema that permits any document valid against the first or the second schema. That is, I would like to automatically create the union of two schemata.

(2) Intersection

Suppose that some person belongs to both your company and my company, and he would like to create documents valid against both schemata. Although it is possible to create well-formed documents first and validate them against the two schemata, it would be much nicer if we can create a single schema which simulates both, and use an XML-native editor against this single DTD. Thus, I would like to automatically create the intersection of two schemata

(3) Difference

Suppose your company and my company are merged and that your in-house schema becomes the authoritative schema. Some of my documents probably do not validate against your schema. To know the characteristics of these documents, I would like to automatically create the difference between my schema and yours (to be precise, my schema minus yours).

(4) Upper-compatibility

In the above scenario, it would be nice if my documents are already permitted by your schema. So, I would like to automatically examine if your schemata is upper-compatible with mine (in other words, my schema minus yours is empty). If it is not upper-compatible, you might want to repeatedly revise your schema until it is upper-compatible with mine.

Another example is schemata evolution. It would be very nice if we can ensure that every existing document conforms to a new schema. Thus, it would be very nice if we cant test the new version is upper-compatible with the older schema.

Note: I owe Paul Prescod for these motivations.

2.2 Ancestor-sensitive content models

(1) Paragraphs with or without paragraphs

In my schema, paragraphs appear below sections and appendixes. I would like to allow paragraphs to have footnotes only when they appear below sections. Nevertheless, paragraphs always have the same attributes. Similar requirements (e.g., titles having different permissible subordinates) are very common and have been very frequently observed by SGML practioners.

(2) Stopping recursion without introducing superfluous element types

In my schema, I would like to allow segments to have subordinate segments. However, I would like to impose an upper limit on the nesting level of segments. (It is my understanding that HTML also has similar restrcition. Did I get this info from you, Michael?)

Note: The requirement (2) was originally mentioned by Eve Maler. (1) has been mentioned by many SGML practioners I have met.

2.3 Schema Transformation

Consider a schema and documents valid against this schema. For some reason, we might want to revise these documents as well as the *schema*. A very simple example is renaming of element types, which creates documents containint new element types and schemas declaring new element types. Ideally, a single transformation program should transform both documents and the schema.

A more complicated example is tranformation between different schemata. For example, suppose that I have to write a tranfroamtion program from a DOCBOOK DTD to HTML. It would be very nice if my program can also transform the DOCBOOK DTD to a target DTD, against which any transformed document is guaranteed to valid. Then, I can examine if the HTML DTD is upper-compatible with the target DTD. If this examination succeeds, my transformation program is guaranteed to create valid HTML documents only.

3. Problems of local schema languages

In this section, we show that the DTD syntax and any other *local* schema language cannot satisfy these requirements.

3.1 Inability to do union

Given two schemata (DTD's) of the same root elememt type, the union schema of the two can not be captured by DTD's. For example, consider two schemata as below:


Schema A 

<!ELEMENT A (B*)>
<!ELEMENT B (Para*)>
<!ELEMENT Fnote (#PCDATA)>
<!ELEMENT Para (#PCDATA|FNote)*> 

Schema B 

<!ELEMENT A (C*)>
<!ELEMENT C (Para*)>
<!ELEMENT Para (#PCDATA)> 

The union of A and B cannot be captured by any DTD.  The below DTD is
an approximation, but it is not quite.  It allows instances (e.g.,
<A><C><Para></Fnote></Para></C></A>) that are not permitted by A or B.

<!ELEMENT A (B*|C*)>
<!ELEMENT B (Para*)>
<!ELEMENT C (Para*)>
<!ELEMENT Para (#PCDATA|FNote)*>
<!ELEMENT Fnote (#PCDATA)> 

To remedy this problem, I have to allow a Para to have FNote's only when the Para is subordinate to B. In other words, I would like to use (#PCDATA) as the content model of Para's below Sec and use (#PCDATA|FNote)* as the content model of Para's below App. This is impossible in XML DTD's and any local schema languages.

Note: The other operations, namely intersection, difference, empty-testing, are possible even when we allow local schemata only. However, implementation becomes easier if we consider regular schemata.

3.2 Inability to capture ancestor-sensitivity

Recall the example in (1) of 1.2.   A DTD (shown below) 
allows paragraphs (Para's) to have footnotes (FNote's) always.  

<!ELEMENT Doc (Sec*,App*)
<!ELEMENT Sec (Para*)>
<!ELEMENT App (Para*)>
<!ELEMENT Para (#PCDATA|FNote)*> 
<!ELEMENT Fnote (#PCDATA)> 

However, we would like to allow a Para to have FNote's only when the
Para is below sections (Sec's).  That is, I would like to use
(#PCDATA) as the content model of Para's below Sec and use
(#PCDATA|FNote)* as the content model of Para's below App.  This
situation is very similar to the previous case and any local schema
languages fail to work here.

3.3 Inability to stop recursion

It is easy to write a DTD (see below) that allows recursion. But this is certainly not what I want.

<!ELEMENT Segment (Para*, Segment*)>
<!ELEMENT Para (#PCDATA)> 

The next DTD limits the structural depth but introduces superfuous
element types.   This is not what I want either.

<!ELEMENT Segment0 (Para*, Segment1*)>
<!ELEMENT Segment1 (Para*, Segment2*)>
<!ELEMENT Segment2 (Para*)>
<!ELEMENT Para (#PCDATA)> 

Ideally, we need a schema as below: 

<!ELEMENT Segment ???> 
<!ELEMENT Para (#PCDATA)> , 

where ??? is 

  (Para*, Segment*)   when Segment is at the top-level or second level, and

  (Para*)             when Segment is at the third level.

3.4 Inability to transform schemata

Consider a DTD as below:

<!ELEMENT Doc    (Sec*, App*)>
<!ELEMENT Sec    (Subsec*)>
<!ELEMENT App    (Subsec*)>
<!ELEMENT Subsec (Para*)>
<!ELEMENT Para   (#PCDATA)>

Suppose that we would like to rename paragraphs below sections as SPara's, and rename subsecs below sections as SSubsec's. Any transformed document is guaranteed to conform a DTD as below:


<!ELEMENT Doc     (Sec*, App*)>
<!ELEMENT Sec     (SSubsec*)>
<!ELEMENT App     (Subsec*)>
<!ELEMENT SSubsec (SPara*)>
<!ELEMENT Subsec  (Para*)>
<!ELEMENT SPara   (#PCDATA)>
<!ELEMENT Para   (#PCDATA)>

However, it is impossible to automatically construct this DTD since an interim schema is non-local. The interim schema after the first renaming (i.e., renaming paragraphs below sections as SPara's) must allow a Para to have SPara's only when the Para is below sections. This is impossible in local schema languages.

4. Possible syntax for regular schema languages

Now, we consider a syntax for regular schema languages. Although it is possible to invent a syntax expressed in XML, I use a SGML-like syntax in this proposal. The reason is that everybody knows how to read DTD's. This syntax can be easily changed to some XML syntax, if we choose one (or some) of the schema proposals as a basis.

Here I consider declarations of elements, attributes, and content models only without excluding further declarations such as entity declarations.

4.1 Element type declarations

<!ELEMENT element-type-name>

This construct declares an element type of the specified name. No content models are described by this construct. Thus, a content model is completely detached from the definition of an element type.

4.2 Attribute-list declarations

<!ATTLIST ...>

This construct is not changed from XML 1.0. It attaches some attributes to one element type, and specify permissible values.

(Side note: I would like to omit default values so as to make schema-aware processors and other processors behave the same.)

4.3 Production rules


<!RULE non-terminal [element-type-name] regexp>,

This construct declares an (anonymous) production rule. It attaches a content model to a pair of a non-terminal and element type. If an element type name is omitted, a content model is attached to a non-terminal. (This provides an elegant alternative for parameter entities for describing content models.)

A non-terminal is a name. It does not have be declared by other constructs. regexp is either a regular expression over non-terminals or (#PCDATA | e11 | e12 | ...)*, where e11, e12, ... are element type names.

The non-terminal is called the head of this rule, and the element type (if any) and the regexp are called the body of this rule

Well-formedness constraint (element type declared): If an element type appears in the body of some production rule, that element type must be declared.

Well-formedness constraint (uniqueness 1): No two rules share the same non-terminal and element type pair.

Well-formedness constraint (uniqueness 2): If an element type is omitted in some rule, the head (non-terminal) of this rule does not appear as the head of any other rule.

5. Examples

5.1 The union schema (see 3.1)

<!ELEMENT Doc>
<!ELEMENT Sec>
<!ELEMENT App>
<!ELEMENT Para>
<!ELEMENT Fnote>

<!RULE N-Doc     Doc   (N-Sec*, N-App*)>
<!RULE N-Sec     Sec   (N-Para-A*)>
<!RULE N-App     App   (N-Para-B*)>
<!RULE N-Para-A  Para  (#PCDATA|N-FNote)*>
<!RULE N-Para-B  Para  (#PCDATA)>
<!RULE N-Fnote   Fnote (#PCDATA)>

5.2 An ancestor-sensitive schema (see 3.2)

<!ELEMENT A>
<!ELEMENT B>
<!ELEMENT C>
<!ELEMENT Para>
<!ELEMENT Fnote>

<!RULE N-A      A     (N-B* | NC-*)>
<!RULE N-B      B     (N-Para-A*)>
<!RULE N-C      C     (N-Para-B*)>
<!RULE N-Para-A Para  (#PCDATA|N-FNote)*>
<!RULE N-Para-B Para  (#PCDATA|N-FNote)*>
<!RULE N-Fnote  Fnote (#PCDATA)>

5.3 Stopping recursion (see 3.3)

<!ELEMENT Segment>
<!ELEMENT Para>

<!RULE Seg-0  Segment  (N-Para*, Seg-1*)>
<!RULE Seg-1  Segment  (N-Para*, Seg-2*)>
<!RULE Seg-2  Segment  (N-Para*)>
<!RULE N-Para Para     (#PCDATA)>

5.4 Schema transformation (see 3.4)

Here is the interim non-local schema after the first renaming operation.

<!ELEMENT Doc>
<!ELEMENT Sec>
<!ELEMENT App>
<!ELEMENT Subsec>
<!ELEMENT SPara>
<!ELEMENT Para>

<!RULE N-Doc     Doc    (N-Sec*, N-App*)>
<!RULE N-Sec     Sec    (N-Subsec1*)>
<!RULE N-App     APP    (N-Subsec2*)>
<!RULE N-Subsec1 Subsec (N-SPara*)>
<!RULE N-Subsec2 Subsec (N-Para*)>
<!RULE N-SPara   SPara  (#PCDATA)>
<!RULE N-Para    Para   (#PCDATA)>

5.5 Non-terminals as a replacement of parameter entities

Traditionally, parameter entitis have been heavily used for information factorization in content models. We no longer need them, since non-terminals and rules without element types provide a better solution. For example:


<!ELEMENT P>

<!RULE N-P     P    (#PCDATA | N-phrase)*>
<!RULE N-phrase      (N-EM | N-STRONG | N-DFN | 
                        N-CODE | N-SAMP | N-KBD | N-VAR | N-CITE)>

6. Misc.

6.1 Determinism

To make validation easier, we might want to introduce another well-formedness constraint, say determinism. This constraint should guarantee that given an element type and a sequence of non-terminals, at most one non-terminal can be reached. This constraint makes validation significantly easier since neither determinization of hedge automata nor simulation of non-deterministic hedge automata are requied. However, this constraint complicates schema authoring slightly.

6.2 Confusing non-terminals and element types

Although an element type can be referenced by more than one rule, it is often referenced by a single rule. Thus, only one non-terminal and element type pair exists usually. To make schemata more readable, it is probably wise to use the same name for such a non-terminal and element type, and even omit non-terminals from rules. Then, the schema in 5.4 becomes simpler as below:

<!ELEMENT Doc>
<!ELEMENT Sec>
<!ELEMENT App>
<!ELEMENT Subsec>
<!ELEMENT SPara>
<!ELEMENT Para>

<!RULE           Doc    (Sec*, App*)>
<!RULE           Sec    (N-Subsec1*)>
<!RULE           APP    (N-Subsec2*)>
<!RULE N-Subsec1 Subsec (SPara*)>
<!RULE N-Subsec2 Subsec (Para*)>
<!RULE           SPara  (#PCDATA)>
<!RULE           Para   (#PCDATA)>

6.3 Inheritance

In regular schema languages for structured documents, inheritance should only be used for element type names and attribute declarations. It should not be used for content model inheritance.

References

Special topic "SGML/XML and Forest Automata Theory" at Robin Cover's site. Its URL is: http://www.oasis-open.org/cover/topics.html#forestAutomata.


Makoto
Fuji Xerox Information Systems
Tel: +81-44-812-7230
Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp