Syntax for regular-but-non-local schemata for structured documents
Revised on April 27, 1999.
By: MURATA Makoto
Fuji Xerox Information Systems
Contents
- 1. Introduction
- 2. Requirements
- 3. Problems of local schema languages
- 4. Possible syntax for regular schema languages
- 5. Examples
- 6. Miscellaneous
- References
1. Introduction
In my previous note entitiled "Regularity and locality of string languages and tree languages" [1], I introduced regularity and locality of tree languages.
In this note, I show practical requirements for regular-but-non-local schemata for structured documents, and then suggest a syntax for writing regular-but-non-local schemata.
Thoughout this memo, a schema is said to be local if it describes a local tree set. A local schema language is a schema langauge that can express local schemata only. A schema is said to be regular if it describes a regular tree set. A regular schema language is a schema language that can express regular schemata only.
2. Requirements
There are three types of practical requirements for regular-but-non-local schemata for structured documents. They are as below:
- Support of schema maintenance by boolean operations
- Ancestor-sensitive content models
- Schema transformation
2.1 Support of schema maintenance by boolean operations
(1) Union
Suppose that technical reports in my company use an in-house schema based on HTML DTD 3.2 and technical reports in your company use another in-house schema based on the strict HTML DTD 4.0. Also supposet that I have to develop stylesheets and transformation programs for handling technical reports from both companies. To make this development easier, I would like to create a schema that permits any document valid against the first or the second schema. That is, I would like to automatically create the union of two schemata.
(2) Intersection
Suppose that some person belongs to both your company and my company, and he would like to create documents valid against both schemata. Although it is possible to create well-formed documents first and validate them against the two schemata, it would be much nicer if we can create a single schema which simulates both, and use an XML-native editor against this single DTD. Thus, I would like to automatically create the intersection of two schemata
(3) Difference
Suppose your company and my company are merged and that your in-house schema becomes the authoritative schema. Some of my documents probably do not validate against your schema. To know the characteristics of these documents, I would like to automatically create the difference between my schema and yours (to be precise, my schema minus yours).
(4) Upper-compatibility
In the above scenario, it would be nice if my documents are already permitted by your schema. So, I would like to automatically examine if your schemata is upper-compatible with mine (in other words, my schema minus yours is empty). If it is not upper-compatible, you might want to repeatedly revise your schema until it is upper-compatible with mine.
Another example is schemata evolution. It would be very nice if we can ensure that every existing document conforms to a new schema. Thus, it would be very nice if we cant test the new version is upper-compatible with the older schema.
Note: I owe Paul Prescod for these motivations.
2.2 Ancestor-sensitive content models
(1) Paragraphs with or without paragraphs
In my schema, paragraphs appear below sections and appendixes. I would like to allow paragraphs to have footnotes only when they appear below sections. Nevertheless, paragraphs always have the same attributes. Similar requirements (e.g., titles having different permissible subordinates) are very common and have been very frequently observed by SGML practioners.
(2) Stopping recursion without introducing superfluous element types
In my schema, I would like to allow segments to have subordinate segments. However, I would like to impose an upper limit on the nesting level of segments.
This is useful for many practical document editing tasks; e.g., if I cut some data out of one document and paste it into another, I don't have to change all start/end tags of segemnts to reflect the level in the new document. It also simplifies queries to find the segments in a document.
Note: The requirement (2) was originally mentioned by Eve Maler, and then extended by Jonathan Robie. (1) has been mentioned by many SGML practioners I have met.
2.3 Schema Transformation
Consider a schema and documents valid against this schema. For some reason, we might want to revise these documents as well as the *schema*. A very simple example is renaming of element types, which creates documents containint new element types and schemas declaring new element types. Ideally, a single transformation program should transform both documents and the schema.
A more complicated example is tranformation between different schemata. For example, suppose that I have to write a tranfroamtion program from a DOCBOOK DTD to HTML. It would be very nice if my program can also transform the DOCBOOK DTD to a target DTD, against which any transformed document is guaranteed to be valid. I can then examine if the HTML DTD is upper-compatible with the target DTD. If this examination succeeds, my transformation program is guaranteed to create valid HTML documents only.
3. Problems of local schema languages
In this section, we show that the DTD syntax and any other *local* schema language cannot satisfy these requirements.
3.1 Inability to do union
Given two schemata (DTDs) of the same root elememt type, the union schema of the two can not be captured by DTDs. For example, consider two schemata as below:
Schema A
<!ELEMENT A (B*)> <!ELEMENT B (Para*)> <!ELEMENT Fnote (#PCDATA)> <!ELEMENT Para (#PCDATA|FNote)*>
Schema B
<!ELEMENT A (C*)> <!ELEMENT C (Para*)> <!ELEMENT Para (#PCDATA)>
The union of A and B cannot be captured by any DTD. The below DTD is an approximation, but it is not quite. It allows instances (e.g., <A><C><Para></Fnote></Para></C></A>) that are not permitted by A or B.
<!ELEMENT A (B*|C*)> <!ELEMENT B (Para*)> <!ELEMENT C (Para*)> <!ELEMENT Para (#PCDATA|FNote)*> <!ELEMENT Fnote (#PCDATA)>
To remedy this problem, I have to allow <Para> to have <FNote> only when the <Para> is subordinate to <B>. In other words, I would like to use (#PCDATA) as the content model of <Para> below <Sec> and use (#PCDATA|FNote)* as the content model of <Para> below <App>. This is impossible in XML DTDs and any local schema languages.
Note: Some operations such as intersection and empty-testing are possible even when we allow local schemata only. However, implementation becomes easier if we consider regular schemata.
3.2 Inability to capture ancestor-sensitivity
Recall the example in (1) of 1.2. A DTD (shown below) allows paragraphs (<Para>) to have footnotes (<FNote>) always.
<!ELEMENT Doc (Sec*,App*) <!ELEMENT Sec (Para*)> <!ELEMENT App (Para*)> <!ELEMENT Para (#PCDATA|FNote)*> <!ELEMENT Fnote (#PCDATA)>
However, we would like to allow <Para> to have <FNote> only when the <Para> is below sections (<Sec>). That is, I would like to use (#PCDATA) as the content model of <Para> below <Sec> and use (#PCDATA|FNote)* as the content model of <Para> below <App>. This situation is very similar to the previous case and any local schema language fails to work here.
3.3 Inability to stop recursion
It is easy to write a DTD (see below) that allows recursion. But this is certainly not what I want.
<!ELEMENT Segment (Para*, Segment*)> <!ELEMENT Para (#PCDATA)>
The next DTD limits the structural depth but introduces superfuous element types. This is not what I want either.
<!ELEMENT Segment0 (Para*, Segment1*)> <!ELEMENT Segment1 (Para*, Segment2*)> <!ELEMENT Segment2 (Para*)> <!ELEMENT Para (#PCDATA)>
Ideally, we need a schema as below:
<!ELEMENT Segment ???> <!ELEMENT Para (#PCDATA)> ,
where ??? is
(Para*, Segment*) when Segment is at the top-level or second level, and
(Para*) when Segment is at the third level.
3.4 Inability to transform schemata
Consider a DTD as below:
<!ELEMENT Doc (head, body)> <!ELEMENT body (segment*)> <!ELEMENT segment (title, p*, segment*)> <!ELEMENT title (#PCDATA)> <!ELEMENT p (#PCDATA)>
Note: the definition of head and its subordinates are omitted.
Suppose that we would like to convert this document to HTML. This is possible by (1) renaming <Doc> as <HTML>, (2) renaming <Segement> as <div>, and (3) renaming <title> as <H1>, <H2>, <H3>, <H4>, <H5>, <H6>, or simple <p> depending on the nest level.
Certainly, it is possible to transform document instances. However, we cannnot transform this DTD exactly. An approximation is as below:
<!ELEMENT HTML (head, body)> <!ELEMENT body (segment*)> <!ELEMENT div ((H1|H2|H3|H4|H5|H6|p), p*, div*)> <!ELEMENT H1 (#PCDATA)> <!ELEMENT H2 (#PCDATA)> <!ELEMENT H3 (#PCDATA)> <!ELEMENT H4 (#PCDATA)> <!ELEMENT H5 (#PCDATA)> <!ELEMENT H6 (#PCDATA)> <!ELEMENT p (#PCDATA)>
However, this is not exactly what we want: Although transformed documents have <H1> only at the top-level <div>, this DTD allows <H1> at every level. This situation is again very similar to the previous case and any local schema language fails to work here.
4. Possible syntax for regular schema languages
Now, we consider a syntax for regular schema languages. Although it is possible to invent a syntax expressed in XML, I use a SGML-like syntax in this proposal. The reason is that everybody knows how to read DTDs. This syntax can be easily changed to some XML syntax, if we choose one (or some) of the schema proposals as a basis.
Here I consider declarations of elements, attributes, and content models only without excluding further declarations such as entity declarations.
4.1 Element type declarations
<!ELEMENT element-type-name>
This construct declares an element type of the specified name. No content models are described by this construct. Thus, a content model is completely detached from the definition of an element type.
4.2 Attribute-list declarations
<!ATTLIST ...>
This construct is not changed from XML 1.0. It attaches some attributes to one element type, and specify permissible values.
(Side note: I would like to omit default values so as to make schema-aware processors and other processors behave the same.)
4.3 Production rules
<!RULE "[" non-terminal "]" element-type? regexp>,
This construct declares an (anonymous) production rule. It has three constituens: a non-terminal, an (optional) element type, and a content model. We say that this content model to this non-terminal and element type pair. If an element type name is omitted, a content model is attached to a non-terminal directly. (This provides an elegant alternative for parameter entities for describing content models.)
A non-terminal is a name. It does not have be declared by other constructs. A content model is either a regular expression over non-terminals or (#PCDATA | e11 | e12 | ...)*, where e11, e12, ... are non-terminals.
The non-terminal is called the head of this rule, and the element type (if any) and the content model are called the body of this rule
5. Examples
5.1 The union schema (see 3.1)
Observe that two non-terminals, N-Para-A and N-Para-B share the same element type but have different content models.
<!ELEMENT A> <!ELEMENT B> <!ELEMENT C <!ELEMENT Para> <!ELEMENT Fnote> <!RULE [N-A] A (N-B* | N-C*)> <!RULE [N-B] B (N-Para-A*)> <!RULE [N-C] C (N-Para-B*)> <!RULE [N-Para-A] Para (#PCDATA|N-FNote)*> <!RULE [N-Para-B] Para (#PCDATA)> <!RULE [N-Fnote] Fnote (#PCDATA)>
5.2 An ancestor-sensitive schema (see 3.2)
Again, non-terminals N-Para-A and N-Para-B share the same element type but have different content models.
<!ELEMENT Doc> <!ELEMENT Sec> <!ELEMENT App> <!ELEMENT Para> <!ELEMENT Fnote> <!RULE [N-A] Doc (N-Sec*, N-App*)> <!RULE [N-Sec] Sec (N-Para-A*)> <!RULE [N-App] App (N-Para-B*)> <!RULE [N-Para-A] Para (#PCDATA|N-FNote)*> <!RULE [N-Para-B] Para (#PCDATA)> <!RULE [N-Fnote] Fnote (#PCDATA)>
5.3 Stopping recursion (see 3.3)
In this example, non-terminals Seg-0, Seg-1, and Seg-2 share the same element type but have different content models.
<!ELEMENT Segment> <!ELEMENT Para> <!RULE [N-Seg-0] Segment (N-Para*, N-Seg-1*)> <!RULE [N-Seg-1] Segment (N-Para*, N-Seg-2*)> <!RULE [N-Seg-2] Segment (N-Para*)> <!RULE [N-Para] Para (#PCDATA)>
5.4 Schema transformation (see 3.4)
Transformation of schemata is outside the scope of this note and is described in [2]. Here I only show the output schema.
Non-terminals N-div1 through N-div-7 share the same element type but have different content models.
<!ELEMENT HTML> <!ELEMENT body> <!ELEMENT div> <!ELEMENT H1> <!ELEMENT H2> <!ELEMENT H3> <!ELEMENT H4> <!ELEMENT H5> <!ELEMENT H6> <!ELEMENT p> <!RULE [N-HTML] HTML (N-head, N-body)> <!RULE [N-body] body (N-div-1*)> <!RULE [N-div-1] div (N-H1, N-p*, N-div-2*)> <!RULE [N-div-2] div (N-H2, N-p*, N-div-3*)> <!RULE [N-div-3] div (N-H3, N-p*, N-div-4*)> <!RULE [N-div-4] div (N-H4, N-p*, N-div-5*)> <!RULE [N-div-5] div (N-H5, N-p*, N-div-6*)> <!RULE [N-div-6] div (N-H6, N-p*, N-div-7*)> <!RULE [N-div-7] div (N-p, N-p*, N-div-7*)> <!RULE [N-H1] H1 (#PCDATA)> <!RULE [N-H2] H2 (#PCDATA)> <!RULE [N-H3] H3 (#PCDATA)> <!RULE [N-H4] H4 (#PCDATA)> <!RULE [N-H5] H5 (#PCDATA)> <!RULE [N-H6] H6 (#PCDATA)> <!RULE [N-p] p (#PCDATA)>
5.5 Non-terminals as a replacement of parameter entities
Traditionally, parameter entitis have been heavily used for describing common idioms in content models. We no longer need them, since non-terminals and rules without element types provide a better solution. For example:
<!ELEMENT P> <!RULE [N-P] P (#PCDATA | N-phrase)*> <!RULE [N-phrase] (N-EM | N-STRONG | N-DFN | N-CODE | N-SAMP | N-KBD | N-VAR | N-CITE)>
6. Miscellaneous
6.1 Confusing non-terminals and element types
Although an element type can be referenced by more than one rule, it is usually referenced by a single rule. Thus, only one non-terminal and element type pair exists usually. To make schemata more readable, it is probably wise to use the same name for such a non-terminal and element type, and even omit non-terminals from rules. Then, the schema in 5.2 becomes simpler as below:
<!ELEMENT Doc> <!ELEMENT Sec> <!ELEMENT App> <!ELEMENT Para> <!ELEMENT Fnote>
<!RULE Doc (Sec*, App*)> <!RULE Sec (N-Para-A*)> <!RULE App (N-Para-B*)> <!RULE [N-Para-A] Para (#PCDATA|FNote)*> <!RULE [N-Para-B] Para (#PCDATA)> <!RULE Fnote (#PCDATA)>
6.2 Inheritance
In a regular schema language, a content model is NOT a property of an element type. Rather, a content model is a property of production rules. Thus, if element type inheritance implies content models inheritance, regular schema languages are endangered.
References
[1] Murata, "Regularity and locality of string languages and tree languages", http://www.oasis-open.org/cove r/murataRegularity.html.
[2] Murata, "DTD Transformation by Patterns and Contextual Conditions", SGML/XML '97, available at http://www.fxis.co.jp/DMS/sgml/xml /sgmlxml97.html (slides are also available at: http://www.fxis. co.jp/DMS/sgml/xml/dtd_transformation/index.html).
Makoto
Fuji Xerox Information Systems
Tel: +81-44-812-7230
Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Prepared by Robin Cover [with permission] for the The SGML/XML Web Page archive.