Paul Prescod (paul@isogen.com)
The notions of schemata and validation are widely deemed to be crucial to the success of XML and SGML. This paper surveys a particular formalization for schemata that is sufficiently powerful to implement recursive SGML/XML content model validation and also many features that SGML/XML content models do not support. The same formalism can be used to describe queries and transformations over SGML and XML documents.
Recent events have pushed document schemata into the spotlight. SGML is currently under revision, and many of the proposed changes are extensions of the document type definition syntax and semantics. XML is not up for revision, but many members of the XML vendor and user community have expressed frustration over the limitations of SGML-style DTDs. They cite problems with scalability, granularity and syntax. They have created proposals for defining document schemata that would replace DTDs.
This is a good time, then, to ask what we mean by document, document type and schema. We would like to make definitions that are more general than a particular notation such as XML and SGML and yet mathematically precise. This process is similar to that used by Noam Chomsky to define "language" and "grammar" in the fifties. My definitions will not necessarily capture every nuance of the plain language definitions of "document" and "schema" any more than Chomsky's definition of language did. But they will provide a formal framework that we can use to describe, compare and evaluate document representation and schema languages. This paper is meant to be useful to the SGML, XML and Web community, so it does not presume knowledge beyond a CS course in formal language theory. For those without this background, one standard reference is "Principles of Compiler Design" by Aho and Ullman.
My definitions will be based on a generalization of some of Chomsky's ideas. Where he described "regular expressions", "regular grammars", "regular languages" and so forth, I will describe "forest-regular expressions", "forest-regular grammars", "forest-regular languages" and so forth. These theories are not new, but they have only recently been applied in a text encoding context. Murata Makoto has been the most prominent proponent of this new document formalism, and this paper is in part a summary of his work. It is primarily Murata-san who has adapted the formalism for use with documents and schemata.
Formalisms help us to reason about computational complexity, interactions and relationships between language features and appropriate data models. Forest automata theory represents a mathematical formalism with known properties that can be used to understand at least some aspects of document schemata, just as context free grammars and regular expressions allow us to understand the syntactic features of markup languages.
For our purposes, a document is a hierarchical arrangement of elements. Elements can contain data characters and other elements. XML and SGML users know that there are documents that do not fit this pattern, but we also know that the hierarchical tree model has proven to be the most productive for real-world document processing. Nobody has yet described a non-hierarchical model with the same scope, descriptive power and relatively simple processing model. Where necessary, other structures can be modelled through hyperlinks. So the hierarchical tree model is adequate but must be supported with a rich linking model.
A document language is a set (usually infinite) of documents. The set of all documents that conform to the DocBook DTD is a document language. The set of all documents on the Web is another document language. The set of all documents conforming either to the DocBook DTD or to TEI is a language. This last language would be called the union of the two other languages, which makes sense if you think of languages in terms of sets. This definition of language is analogous to the Chomskian definition of language. In SGML terms, a document type is a language.
A document schema is a document that describes a language. A schema can be used in two ways. You can test things for adherence to the schema. Things which adhere to the schema are in the language it describes. For instance you can check if SGML documents conform to their document type definition. You can also use the schema as a set of constraints that a document must live within. For instance, a user can interactively build documents and an editor can stop them from violating the schema. Ideally a schema language would support both use models.
Most computer science graduates are familiar with string automata theory from their first formal languages course. This section is a review for those who have been away from it for a while. As you will recall, an alphabet is a set of symbols. Examples include the characters in Unicode, the digits from 0 to 9, or the set {a,b}. The set of element type names in an XML document also constitutes an alphabet. A string is a sequence of symbols from an alphabet. A language is a set (finite or infinite) of strings. So English is a language because there are certain strings that are English and certain strings that are not.
A regular expression is a way of describing a language. Regular expressions can only define relatively simple languages. The set of languages that can be described this way is called the regular languages. Regular expressions are widely used in computer programs and in descriptions of languages. For instance, an SGML/XML content model is a regular expression. The set of all sequences of elements and data characters that conform to the model forms a regular language.
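To make the correspondence concrete, here is a small illustrative sketch (Python is used here and in later sketches; none of this code is part of the formalism). It treats a hypothetical content model, (title, abstract?, section+), as a string regular expression over element type names and tests sequences of child elements against it; the element names and the ";" separator are invented for the example.

import re

# A hypothetical content model: (title, abstract?, section+).
# Child element names are joined with ";" so that names cannot run together.
content_model = re.compile(r"title;(abstract;)?(section;)+$")

def matches(children):
    """Check a sequence of child element names against the content model."""
    return content_model.match(";".join(children) + ";") is not None

print(matches(["title", "section"]))              # True
print(matches(["title", "abstract", "section"]))  # True
print(matches(["abstract", "section"]))           # False: title is required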
A finite state machine is a logical machine (an "automaton") that can be used to check whether strings conform to a regular expression. The correspondence between finite state machines and regular expressions is important for several reasons. First, we can compile regular expressions to finite state machines using a fairly straightforward algorithm. This means that regular expressions can serve as a textual user interface to finite state machines just as SGML document strings serve as a user interface to the hierarchical model that underlies them. Second, we can optimize finite state machines using well-known algorithms. This means that users can write the expressions according to their understanding of the problem domain, instead of trying to optimize for the application. The application can then optimize the query itself. For instance, ambiguous XML content models can be rewritten as unambiguous SGML ones, and can thus be simplified and implemented more efficiently.
In addition to optimizing regular languages, we can combine them in various ways. We can determine the intersection, union and difference of regular languages. For example, if a DTD designer were considering merging two elements in a DTD, the designer might want to determine which sequences of elements and data characters would conform to both of the original content models. This is the intersection. The designer might also want to make a content model that is flexible enough to support sequences that conform to either of the original models. This is the union. Finally, the designer might want to know what content sequences would conform only to one model or the other. This would be the difference between the two models. There is a freely available tool called Grail that performs these sorts of operations.
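To show what the intersection operation means mechanically, here is a self-contained sketch of the classical product construction on two toy deterministic automata; the automata are invented and stand in for compiled content models.

from itertools import product

class DFA:
    """A deterministic finite automaton over a fixed alphabet."""
    def __init__(self, states, alphabet, delta, start, accepting):
        self.states, self.alphabet = states, alphabet
        self.delta, self.start, self.accepting = delta, start, accepting

    def accepts(self, symbols):
        state = self.start
        for s in symbols:
            state = self.delta[(state, s)]
        return state in self.accepting

def intersect(a, b):
    """Product construction: accepts exactly the strings both DFAs accept."""
    states = set(product(a.states, b.states))
    delta = {((p, q), s): (a.delta[(p, s)], b.delta[(q, s)])
             for (p, q) in states for s in a.alphabet}
    accepting = {(p, q) for (p, q) in states
                 if p in a.accepting and q in b.accepting}
    return DFA(states, a.alphabet, delta, (a.start, b.start), accepting)

# Two toy languages over {h, p}: A accepts strings with an even number
# of h's; B accepts strings that end in p.
A = DFA({0, 1}, {"h", "p"},
        {(0, "h"): 1, (1, "h"): 0, (0, "p"): 0, (1, "p"): 1}, 0, {0})
B = DFA({0, 1}, {"h", "p"},
        {(0, "h"): 0, (1, "h"): 0, (0, "p"): 1, (1, "p"): 1}, 0, {1})

both = intersect(A, B)
print(both.accepts("hhp"))  # True: two h's, ends in p
print(both.accepts("hp"))   # False: odd number of h's

The union and difference constructions differ only in which product states are marked accepting.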
Languages that cannot be defined by a regular expression are not regular languages. Most interesting languages are not regular. For instance, the language defined by an SGML DTD is not regular, though the content models of individual elements are regular. The problem is that regular expressions cannot describe recursive structures, such as arbitrarily nested parentheses or tags.
Any realistic programming language is more complex than a regular expression could define. Such languages can often be described by a more powerful formalism: context free languages, which are described by context free grammars and recognized by push down automata. But the leap in power and complexity between finite string automata and push down automata is huge. A program that can parse any context free language must be quite complicated and slow, whereas programs for parsing and validating XML documents can be relatively simple and fast. Any XML or SGML parser that treated DTDs as a form of context free grammar would be very slow. Luckily, real parsers do not. The fundamental problem is that context free grammars describe languages much more rich, powerful and complex than SGML can represent. It makes no sense to pay the price for that complexity without getting any benefit.
For instance, we know that SGML trees are easy to parse compared to the general class of context free languages because SGML documents always have logical start and end tags. The OMITTAG feature in full SGML allows certain tags to be omitted, but the conditions of their omission are carefully chosen to allow the tags to be inferred. This is different from the more general class of context free grammars. Given a context free grammar, it might take arbitrary lookahead to determine the correct parse for a particular part of the string. It is this need for lookahead that makes the class of context free grammars hard to parse in general.
In addition, there are many important things that cannot be done with context-free grammars because they are so general. For instance, the class of context-free languages is not closed under intersection, so we cannot necessarily construct a context free grammar for the set of strings recognized by both of two different context free grammars. We also cannot (in general) optimize context free grammars.
Since we already know how to parse SGML and XML documents relatively easily, we can make progress on the schema front by divorcing our notions of schema validation and parsing. We can apply theories similar to those used for verifying the conformance of strings to a language, but apply them to the result of the parse: the document tree. When we work at this logical level, instead of at the string level, we can preserve most of the virtues of the simple regular language model, but also add the ability to handle recursive structures.
SGML-knowledgeable people have an intuitive notion of a document as a tree. We will use the word forest to mean a list of zero or more trees. Just as an SGML document is a tree, the content of any element is a forest. We will consider a single character to be a simple tree. So an SGML document is a string representation of a forest. Parsing the string into the forest is exactly what a typical SGML parser and grove builder already does.
Note that this definition of forest is not universal. Some use the word forest to mean an unordered set of trees. Some people would like to refer to these constructs as "hedges" instead. Hopefully the terminology will shake out in the coming months. For the rest of this paper, I will use the term "forest".
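As a minimal sketch of the data model (an invented representation, not from any standard), a forest can be held in Python as a list of (label, children) pairs, where each children list is itself a forest:

# A tree is a (label, children) pair; children is itself a forest,
# i.e. a list of zero or more trees. A data character can be a leaf tree.
document = [("doc", [("head", []),
                     ("body", [("p", [])])])]

def outline(forest, depth=0):
    """Print a forest as an indented outline of its labels."""
    for label, children in forest:
        print("  " * depth + label)
        outline(children, depth + 1)

outline(document)  # prints doc, then head and body indented, then p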
A forest regular language can be described with a document called a forest regular grammar.
Forest regular grammars are similar to extended context-free grammars. They can be used to define forest languages instead of string languages. Formally, they have four parts. The first is a set of non-terminals. The second is the alphabet. The third is a set of production rules, which play the role of a transition table. Finally, there is a string regular expression over the non-terminals that serves as the "final state" expression.
Every production is of the form:
A -> <x>r</x>
where r is a string-regular expression over the non-terminals. Every forest-regular language has a forest-regular grammar. We can translate DTDs into this syntax easily.
DOC -> <doc>HEAD,BODY</doc>
HEAD -> <head>TITLE,ABSTRACT</head>
BODY -> <body>(H1|H2|H3|P)*</body>
ABSTRACT -> <abstract>#PCDATA</abstract>
H1 -> <h1>#PCDATA</h1>
...
Note that the lowercase letters in angle brackets represent tags that would actually be typed into a document. Upper case names on the left hand side and in the regular expressions represent symbolic replacement variables, the non-terminals. Valid replacements are given on the right hand side.
Forest regular grammars are more powerful than DTDs. The tag on the right-hand side does not necessarily have to be the same as the non-terminal name on the left-hand side. This is what distinguishes non-terminals from ordinary SGML element types. We can also allow non-terminals to have several replacements. Any one is valid anywhere that the non-terminal is referenced, just as with context free grammars. Consider the following:
DOC -> <doc>HEAD,BODY</doc>
HEAD -> <head>TITLE,ABSTRACT</head>
BODY -> <body>(HEADING|P)*</body>
HEADING -> <h1>#PCDATA</h1>
HEADING -> <h2>#PCDATA</h2>
HEADING -> <h3>#PCDATA</h3>
Essentially, the BODY rule says that a BODY must have body start and end tags and then contain any sequence of headings and paragraphs. "Any kind of heading" is defined by the three HEADING rules: wherever HEADING is referenced, a match for any one of those rules is acceptable.
This feature could be used to replace the relatively unstructured parameter entity mechanism in SGML in some contexts. We can also use the feature to allow elements to have content models based on their contexts:
DOC -> <doc>HEAD,BODY</doc>
HEAD -> <head>TITLE,ABSTRACT</head>
BODY -> <body>(HEADING|P-WITH-FOOTNOTE|P-WITHOUT-FOOTNOTE)*</body>
ABSTRACT -> <abstract>P-WITHOUT-FOOTNOTE</abstract>
P-WITH-FOOTNOTE -> <p>#PCDATA,(FOOTNOTE,#PCDATA)+</p>
P-WITHOUT-FOOTNOTE -> <p>#PCDATA</p>
FOOTNOTE -> <footnote>P-WITHOUT-FOOTNOTE</footnote>
One way to look at it is to say that there are two different paragraph element types, and they just happen to both use the "p" tag as their delimiter. Reusing the tag reduces the number of tags that a user must memorize. Another way to look at it is that the "p" element type has different content models in different contexts.
This accomplishes the same thing as SGML's exclusions, but enforces the context-sensitivity from the bottom, instead of the top. This means that given the list of element declarations that include a particular element type name, you can see all possible variations of its content model based on context. For instance, a tool could generate a list of contexts that a particular fragment is valid in, or not valid in.
With SGML exclusions, the tool would have to manually trace a path from every container node with an exclusion to the node you are interested in. Exclusions can also be combined. In the worst case, there could be an exponential number of implicit variations on the content model!
The automaton that recognizes this forest language is called a frontier-to-root or bottom-up forest automaton. That means that it works from the bottom of a forest (such as an SGML document) to the top. This makes sense, because you cannot know whether the top element of an SGML document is valid until you have looked at each of its children, and you cannot know whether those children are valid until you have looked at each of their children, and so forth. So validity is a property that flows naturally from the bottom of a document to the top.
The machine recognizes paragraphs with footnotes by doing a pattern match with the right-hand side of each rule. This is relatively easy, because only two rules have the p tag, and only one of those allows footnotes. The automaton labels these paragraphs with the P-WITH-FOOTNOTE label. It labels the other paragraphs with the P-WITHOUT-FOOTNOTE label. Once it has all of the labels for the content of a particular element, it labels the element by pattern matching against the right-hand sides.
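The following naive sketch (invented for illustration) shows that labeling process on the footnote grammar: it labels each tree bottom-up by matching its tag and the sequence of its children's labels against the rules. A real validator would compile the rules into a deterministic automaton rather than trying them in order; trying them in order works here only because the grammar was written so that at most one rule can match. #PCDATA is simplified to a single text leaf with the pseudo-tag "#text".

import re

# Rules from the grammar above: (nonterminal, tag, regex over child labels).
# Child labels are joined with ";" tokens.
RULES = [
    ("P-WITH-FOOTNOTE",    "p",        r"TEXT;(FOOTNOTE;TEXT;)+"),
    ("P-WITHOUT-FOOTNOTE", "p",        r"(TEXT;)*"),
    ("FOOTNOTE",           "footnote", r"P-WITHOUT-FOOTNOTE;"),
    ("TEXT",               "#text",    r""),
]

def label(tree):
    """Return the nonterminal label for a (tag, children) tree, or None."""
    tag, children = tree
    child_labels = [label(c) for c in children]
    if None in child_labels:
        return None  # some child failed to validate
    content = "".join(l + ";" for l in child_labels)
    for nonterminal, rule_tag, child_re in RULES:
        if rule_tag == tag and re.fullmatch(child_re, content):
            return nonterminal
    return None

text = ("#text", [])
footnote = ("footnote", [("p", [text])])
print(label(("p", [text, footnote, text])))  # P-WITH-FOOTNOTE
print(label(("p", [text])))                  # P-WITHOUT-FOOTNOTE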
Note that the example above explicitly made certain that no element could be considered both a P-WITH-FOOTNOTE and a P-WITHOUT-FOOTNOTE. This made the content models more complex and harder to read. It should be possible to write software that reads grammars that are not so careful about deterministic recognizability and compiles them into grammars that are more tailored for simple recognizability, such as the one above. For instance, a processor could recognize that a rule matching paragraphs with footnotes is more specific than one matching any paragraph, and should thus take precedence. Overlapping rules can also be differentiated based on context. When rules cannot be differentiated, a processor could report that with an error message, just as modern SGML parsers can report ambiguous content models. This is another benefit of the fact that forest regular grammars are constrained in many of the same ways that string regular expressions are.
When we are describing forest regular languages, we have the choice of using either forest regular grammars or another notation called forest regular expressions. Just as string regular expressions define constraints on strings, forest regular expressions define constraints on forests. It may come as a surprise to learn that all of the element type declarations and content models in an SGML document can be expressed as a single regular forest expression. The full syntax for these expressions is beyond the scope of this paper.
Forest automata provide a wealth of interesting possibilities for SGML implementation, standardization and research.
Forest automata provide a very simple, rigorous model for SGML validation. A deterministic frontier-to-root forest automaton can validate an SGML parse tree or parse event stream in time roughly linear in the size of the input document.
SGML validation does not use all of the power of forest automata. As described above, forest grammars can express context-sensitive content models. Earlier, I showed how an exclusion can be replaced with two rules in a forest grammar. We can also replace inclusions. For instance, in HTML it is legal to put input fields within paragraphs only when those paragraphs are within a form. This can be accomplished through an inclusion exception. But inclusion exceptions have many side effects, such as making an element available in many contexts where it should not be (such as input fields within hyperlink hotspots). HTML 4.0 did away with the inclusion exceptions used in earlier versions of HTML for this reason. The same task can be accomplished easily with forest automata, without side effects:
<!ELEMENT FORM <form>(P-INFORM|H1-INFORM|H2-INFORM|...)*</form>>
<!ELEMENT P <p>(#PCDATA|A)*</p>>
<!ELEMENT P-INFORM <p>(#PCDATA|A|INPUT)*</p>>
<!ELEMENT H1 <h1>(#PCDATA)</h1>>
<!ELEMENT H1-INFORM <h1>(#PCDATA|INPUT)*</h1>>
<!ELEMENT A <a>(#PCDATA)</a>>
Note that INPUT elements cannot appear in A elements, even in forms.
One important question is how this context sensitivity should be presented to the document type designer. I have suggested that it might be more appropriate to use some form of forest regular expression based notation. Just as users of text editors and search engines do not want to think in terms of deterministic string automata states, users of structured documents and schemata would prefer not to have to track the flow of state around a machine.
I would suggest a syntax like this:
<!ELEMENT FORM (P|H1|H2|...)*>
<!ELEMENT P (#PCDATA|A)*>
<!ELEMENT (P in FORM) (#PCDATA|A|INPUT)*>
<!ELEMENT H1 (#PCDATA)>
<!ELEMENT (H1 in FORM) (#PCDATA|INPUT)*>
<!ELEMENT A (#PCDATA)>
A validation engine could build a forest regular grammar that implemented the author's intention. It could implicitly define P_in_FORM, H1_in_FORM and other_P grammar rules.
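The lookup such an engine performs might be sketched as follows (the table and names are hypothetical, derived from the declarations above; a real implementation would compile the overrides into the distinct nonterminals just named):

# Content models keyed by element name, with per-context overrides.
BASE = {
    "form": "(p|h1|h2)*",
    "p":    "(#PCDATA|a)*",
    "h1":   "(#PCDATA)",
    "a":    "(#PCDATA)",
}
IN_CONTEXT = {
    ("p", "form"):  "(#PCDATA|a|input)*",
    ("h1", "form"): "(#PCDATA|input)*",
}

def content_model(element, context):
    """Use the context-specific model if one was declared, else the base one."""
    return IN_CONTEXT.get((element, context), BASE[element])

print(content_model("p", "form"))     # (#PCDATA|a|input)*
print(content_model("p", "article"))  # (#PCDATA|a)*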
Forest automata theory allows us to create formal definitions of what it would mean to combine schemata. This is something that SGML is lacking. One such mechanism is the forest x-product. Murata defines this as an operation that takes schema1 and schema2 and creates a schema that will accept any forest that can be created by replacing all occurrences of the element x in schema1 with a forest that conforms to schema2.
For example, consider the following DTD fragment:
<!ELEMENT HTML (H1|P|MATH)*>
<!ELEMENT MATH EMPTY>
Now imagine another schema existed for defining mathematical formulae:
<!ELEMENT FORMULA (...)>
The MATH-product of these two DTDs would allow FORMULA elements from the second DTD to replace MATH elements in the first DTD. Not every MATH element would need to be replaced, however. You could have two math models in the same DTD merely by applying the MATH-product operator twice, once with each math schema (say, one defining FORMULA and another defining FORMULA2). The new schema would require every occurrence of the FORMULA and FORMULA2 elements to occur where a MATH element would be valid.
In the interest of information system rigour, I would also define a stricter form of product that I call schema parameterization. This would be equivalent to a product, but it would require that every single occurrence of an element be replaced. In the MATH example, this probably makes the most sense. You do not want empty MATH elements to exist. The MATH element was only supposed to be a placeholder for a supplied FORMULA "parameter."
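A rough sketch of this stricter parameterization (the schema representation, a dict from nonterminal to tag and content expression, is invented for illustration): rewrite schema1's content expressions to reference schema2's start nonterminal and drop the placeholder rule entirely. The looser x-product would instead keep the MATH rule so that unreplaced MATH elements remain valid.

def parameterize(schema1, x, schema2, start2):
    """Replace every reference to placeholder nonterminal x in schema1
    with schema2's start nonterminal, then merge in schema2's rules.
    (Naive: str.replace could clobber names containing x as a substring.)"""
    merged = {nt: (tag, expr.replace(x, start2))
              for nt, (tag, expr) in schema1.items() if nt != x}
    merged.update(schema2)
    return merged

html = {"HTML": ("html", "(H1|P|MATH)*"), "MATH": ("math", "EMPTY")}
math = {"FORMULA": ("formula", "(...)")}  # content elided, as in the fragment
print(parameterize(html, "MATH", math, "FORMULA"))
# {'HTML': ('html', '(H1|P|FORMULA)*'), 'FORMULA': ('formula', '(...)')}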
If you wanted to allow two different types of formulae in the same document, but still strictly apply the product, you could combine schema parameterization with another operator: the union. A schema that accepts the union of two other schemata accepts any document that either of them accepts. Thus the union of two math models would accept any formula that either accepts. Similarly, the intersection and difference of forest-based schemata can be computed using subset constructions similar to those used for string regular expressions.
As you can see, forest automata offer us simple, powerful formalisms for expressing combinations of schemata. Although structured, schema-based constructions can never be as completely general as SGML's ad hoc parameter entities (text substitutions), they are nevertheless quite powerful and they lend themselves much more readily to human understanding and machine processing. For instance, graphical tools cannot be used to combine SGML DTDs in interoperable ways, because each tool will expand parameter entities before attempting to understand the structure of the DTD. Under our system, the result of schema combination would be cleanly and portably specified.
The ability to describe the union, concatenation, intersection and complement of forest regular languages can have a large effect on the practice of DTD modification. The formalism provides a way of defining what a backwards-compatible document type is, and describing exactly what sort of documents will not be compatible with the new form of the schema.
Murata has shown that in many cases it is possible to describe the evolution of a DTD as a transformation. This transformation can be automatically applied to legacy instances so that they automatically conform to the new DTD. If it is possible that there are some legacy instances that cannot be converted, the document type designer can be warned of that possibility when the DTD is modified, instead of later, when the first incompatible document is encountered.
SGML transformations are an extremely important part of document processing. In fact, as more and more data formats migrate to SGML, more of what document processing professionals do will be transformations from one DTD to another. This is usually done through computer programs in completely general (Turing complete) programming languages such as Python, DSSSL, OmniMark or C++. The problem with these transformations is that they cannot be statically analyzed. Determining whether a DSSSL program creates HTML conforming to the HTML specification is equivalent to the halting problem. It is just not possible in general.
Transformations based on forest automata do not have this problem. Murata has shown that when we define a transformation based on forest automata theory, we can automatically define a new schema that accepts all forests created by the transformation. In many cases, this schema will be minimally sufficient, which means that it is no more flexible than it needs to be. If we combine minimal sufficiency with the ability of forest automata based schemata to be compared, then we can often make transformations that are guaranteed to create valid instances of some output schema. Even the limited transformation described by architectural forms cannot make this guarantee in most cases.
This is a huge step forward for document processing. Not only does it allow robust transformations, it also provides a basis for rigorous document databases. Just as you would not expect an SQL query to be able to produce results that would crash an application (because the SQL query defines a schema as well as the query itself), we would not want a query to a document database to produce a result that could crash the formatter.
Although Murata uses a slightly different system, I propose that regular tree languages should extend even into #PCDATA text. Instead of treating text as a more or less opaque blob as SGML does, the language could treat each character as a tree node that must always be a leaf. Then the text contents of elements could be expressed in essentially the same regular expression syntax that everything else is:
<!ELEMENT NUM (HEX-NUM|DEC-NUM)>
<!ELEMENT DEC-NUM ([0-9]+)>
<!ELEMENT HEX-NUM ([0-9,"A"-"F"]+)>
As you can see, the example mixes character-level restrictions and element type restrictions freely.
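A minimal sketch of the idea (element names from the fragment above; the lookup API is invented): because text is just a sequence of character leaves, the same regular expression machinery that validates element content can validate it.

import re

# Character-level content models for the declarations above.
MODELS = {
    "DEC-NUM": re.compile(r"[0-9]+"),
    "HEX-NUM": re.compile(r"[0-9A-F]+"),
}

def valid_text(element, text):
    """Check an element's character children against its content model."""
    return MODELS[element].fullmatch(text) is not None

print(valid_text("HEX-NUM", "C0FFEE"))  # True
print(valid_text("DEC-NUM", "C0FFEE"))  # False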
The forest automata theory offers many exciting opportunities for the advancement of our understanding of document structures. This understanding will allow us to build systems that are more robust, efficient and intuitive. The primary limitation of the theory at this point is that the people who most need to know about it do not (software vendors), and the people who are most capable of extending it are not knowledgeable about the problems of our industry (formal languages academics). That leaves those of us "in the trenches" with a responsibility to learn, share and do our own research.
Please contact me with questions. I will respond as quickly as I can.
[Note: This is a preliminary version of the paper. See also "SGML/XML and Forest Automata Theory."]