|HOME | COMPANY | SOFTWARE | DOCUMENTATION | EDUCATION & TRAINING | SALES & SERVICE|
"The Official Guide to Programming with OmniMark"
Content Model Algebra
By Sam Wilmott
By Sam Wilmott
Anyone desiring to have a full understanding of SGML content models and the theory behind the construction of text markup languages should have some familiarity with the basic concepts of Automata Theory and Set Theory on which content models are based.3. Transforming Regular Expressions
3.2. An Axiomatic System For Regular Expressions
3.3. Proving Theorems
3.4. Using Theorems To Help Create Content Models
3.5. Factoring In Optional Expressions
3.6. Being Careful
6. Finite Automata
6.2. Characteristic Derivatives
6.3. Vice-Versa - Translating DFAs into Regular Expressions
6.4. Irreducible Regular Expressions
Anyone desiring to have a full understanding of SGML content models and the theory behind the construction of text markup languages should have some familiarity with the basic concepts of Automata Theory and Set Theory on which content models are based. Automata Theory is a branch of Computer Science to which some, but not all, Computer Science students are exposed. Set Theory is a branch of Mathematics to which grade-school students were exposed, at least in the days of "New Math".
This report provides an outline of some of the concepts of Automata Theory and Set Theory relevant to creating text markup languages. A background in formal Computer Science theory and Mathematics is required for an easy comprehension of the material in the paper. Other readers may find it interesting that there is a strong theoretical basis for what they are doing when they write content models.
SGML is a "meta-language": a language in which one writes grammars describing the syntax of other languages. This chapter discusses grammars, and "regular expressions", the type of grammars used in content models.
The most important function of an SGML DTD is to define a set of elements, and to associate with each element declared content or a content model that defines what text and other elements are allowed within the element. A set of element declarations defines a grammar. A grammar is simply a definition of what elements and combinations of elements are allowed in a document. Everything else in a DTD, and in other parts of the prolog of an SGML document, are secondary to the definition of a grammar. Other information in the prolog deals with the details of how elements and text are entered, or (in the case of processing instructions and link process definitions) provide information that may aid in processing a document.
In theoretical Computer Science a grammar is said to "recognize" a string of symbols (for our purposes the characters that make up the text and markup of a document) that conform to the grammar. The set of strings recognized by a grammar is called a "language". In an SGML application, the language recognized by the grammar defined by the element declarations is the markup language used to mark up the document instance.
Both the part of Computer Science known as Formal Language Theory or Automata Theory, and modern theoretical Linguistics, define a large number of kinds of grammars. All grammars fall into one of the following four basic types:
In Automata Theory, each type of grammar corresponds to a characteristic type of machine that can recognize strings conforming to the grammar. These machines accept a string of symbols as input, one symbol at a time, and report only whether or not the string is recognized. A machine consists of a finite set of states and some form of data storage. The data storage contains symbols. For combinations of state, input symbol and data storage symbols, the machine chooses a next state, and describes a modification to the data storage. In addition, some states are defined as "final" states. If the machine reaches a final state at the end of the input string, the string is recognized. If the machine reaches a non-final state at the end of the string, or if for some combination of state, input symbol and data storage the machine does not choose a next state and data storage modification, then the string is not recognized.
The types of machines that correspond to types of grammars differ in the kinds of modification that can be done to data storage. For each of the four types of machines the allowed modification is:
SGML-definable grammars are Type 2 or Context-Free, and fall into the class known as LL(1). In (unambiguous) LL(1) grammars, each symbol in the input string uniquely determines the production to be followed. For example, in SGML, each tag or data uniquely determines what can be recognized following it.
The notation used to describe regular expressions in Computer Science texts is less verbose than that for content models, and more readily lends itself to mathematical manipulation. A regular expression consists of symbols, usually single lowercase Latin letters, and the following operations:
Lowercase Latin letters are also used to represent arbitrary regular expressions. Because the equivalence between two regular expressions remains valid when symbols in the equivalent expressions are uniformly replaced by other regular expressions, using letters in this way produces no confusion. (Exceptions to this convention, where letters represent exactly one symbol, are always noted in the following sections.)
Two Greek letters represent particular regular expressions:
Properly speaking, both and represent sets of one and zero members respectively. There is no difficulty in using the representation of a set of one member interchangeably with the representation of that one member. Because of this, is used both to represent the zero-length string and the set containing only that string.
Parentheses can be used to disambiguate expressions. Such disambiguation is not required in the following cases:
As an example of equivalent expressions, the content model (a,(b|c)+,d?) is equivalent to the regular expression a(b+c)(b+c)*(d+). The content model "plus" occurrence indicator is not required, as it can be expressed in terms of *. For example, (a+) is equivalent to the regular expression aa*, for any a. Likewise the "opt" occurrence indicated can be expressed as a union of the optional expression with the zero-length expression, . For example, (a?) is equivalent to the regular expression a+, for any a.
Both and are useful when performing complex transformations on regular expressions.
Content models and regular expressions differ in that regular expressions lack the + and & operations, regular expressions lack any distinction between "elements" and "data", and content models lack any representation for or . These differences can be accommodated in the following ways:
The most important use of regular expressions in the design of text markup languages is in constructing content models that do exactly what the designer wants them to do. The right content model can be found by writing down what is desired using a (possibly ambiguous) content model and then transforming it into a form that is not ambiguous. Knowing how to transform regular expressions also gives insight into how content models "work", and makes it easier to write complex content models.
Transforming regular expressions becomes interesting when, in the design of a markup language, an ambiguous content model occurs and an equivalent unambiguous content model is required. For example ((a,b)*,a) is ambiguous, but (a,(b,a)*), which is equivalent, is not. The question is, given the first one, how does one find the second one? In addition, given both, how does one show that they are equivalent? They both represent a sequence of alternating a and b elements, with a elements both starting and ending, and this is all a markup language designer needs to know. For more complex expressions a more convincing demonstration of equivalence is required.
Another useful example of equivalence that often occurs in practice is that (a*,(b,a*)*) is equivalent to (a|b)*. In other words, both content models simply describe a sequence of zero or more elements, made up entirely of a and b elements. Although both are unambiguous content models, it is easier to see what is going on using the second form, especially when a and b are complex models themselves, and it is easier to modify the second form.
First of all, we must be clear on what "equivalent" means: two content models or regular expressions are equivalent when they recognize exactly the same set of strings, i.e. when they describe the same language.
Secondly we need a way of finding and showing that expressions are equivalent. For this we need to formalize the definition of equivalence.
Following is a set of eleven axioms and two rules that provide a formalization of the equivalence of regular expressions, and more generally describe what can be done with regular expressions. Each axiom and rule is labelled for later use in the proof of theorems of equivalence.
If union and addition are thought of as analogous, and conjunction and multiplication are likewise, many of the axioms, such as a+b = b+a, are familiar from arithmetic, although the analogy cannot be taken too far (e.g. ab ba). Note that functions as the identity for union (analogous to zero in arithmetic), and that functions as the identity for conjunction (analogous to one).
The two rules provide ways of manipulating axioms, rather than being axioms themselves. They are called "Rules of Inference" and are used in creating proofs of theorems.
Theorems are equations which can be proven using axioms and other theorems. Following are a few theorems that solve a problem with the above set of axioms: they are not all symmetric. For example, A7 states that a = a but there is nothing that says a = a.
The following two theorems are proofs that the symmetric forms of axioms A8 and A9 are true:
Theorem A8': a =
Theorem A9': +a = a
As can be seen in each of these cases, a proof consists of applying one axiom or previously proven theorem at a time, and in a step-by-step manner, showing the equivalence of two regular expressions.
Some theorems use sub-proofs in their derivation, as in theorem A8' above, which needs to prove that a = (a)+ to produce the desired result. Sometimes these sub-proofs are of wide enough utility to be proven as theorems, as in this interesting demonstration of the relationship between and :
Theorem T1: = *
(Some text books actually use * instead of .)
The theorems already proven can be used to prove the symmetric form of axiom A7:
Theorem a7': a = a
Another simple theorem shows how symbols can be moved around repetitions. This theorem can be used to prove the symmetric form of axiom A10.
Theorem T2: aa* = a*a
Theorem A10': a* = a*a+
The following theorem is similar to T2. T2 and T3 are useful in transforming ambiguous SGML content models into unambiguous models:
Theorem T3: a(ba)* = (ab)*a
Theorems can also help in providing simplified forms of content models. The following slightly more complex proof illustrates a simplification that is often required:
Theorem T4: (a*b)*a* = a*(ba*)* = (a+b)*
This theorem proves that the content models (a,(b,a)*), ((a,b)*,a) and (a|b)* are all equivalent.
Regular expressions can be easily transliterated into content models, as long as the following items are addressed:
Most transformations of regular expressions involve "factoring" a sub-expression into a larger expression. Following are some theorems that factor in (or out) sub-expressions:
Theorem T5: (a+b)(c+d) = ac+ad+bc+bd
Theorem T6: a* = a*+
Theorem T7: a*b* = a*b*+
Theorem T8: (a+)a* = a*
Theorem T9: a* = a*a*
Consider the following "theorem":
Theorem NT1: a*b* = a*
Clearly there must be an error in the proof, but where is it? The error is in applying rule R2 when the first regular expression describes a language that contains . (In applying R2, the components were a = a*b*, b = a+, and c = . R2 states that b must not contain .)
Examples of the power of regular expressions, and important in discovering equivalences between regular expressions, are the "delta" and "derivative" operators defined in this chapter. These operators turn regular expressions from an interesting theory into a tool that can be used for practical purposes.
The key part of the proof that a*(ba*)* = (a+b)* involved factoring out an a and a b from the expression and observing that the original expression was left. This is a very powerful and general method of analysing regular expressions. Proofs using this technique are made easier by defining "operators" on regular expressions, and writing proofs in terms of the operators. The simplest operator is , which indicates whether the language defined by a regular expression contains the zero-length string, or less verbosely, whether a regular expression contains . The expression a means:
'a can also be defined as being the regular expression which recognizes all the strings that a does, except for . Note that:
can more formally be defined as follows:
Note that this definition provides a definition of for any regular expression. Note also that (ab) = (a)(b) is equivalent to if both a and b are , and is equivalent to if either or both are .
' can be formally defined similarly:
The equivalence '(ab) = ('a)b+(a)('b) provides a hint as to the use that can be made of both and '.
and ' are both used primarily in defining other operators, but one simple use of ' is in eliminating the "empty" case from content models. For example, a typical definition of a "chapter" is a "title" followed by zero or more "para" elements (paragraphs) followed by zero or more "section" elements, except that there must be at least one paragraph or section. Neither of the content models (title, para*, section*) or (title, (para|section)+) do the job. The trick is to get a model equivalent to (para*, section*) with the empty case not allowed. If the content model is written as the regular expression p*s*, the desired expression is '(p*s*). Using the definitions above, the desired expression is:
'(p*s*) = ('(p*))s*+((p*))('(s*)) = (pp*)s*+(ss*) = pp*s*+ss*
The content model required is therefore (title, ((para+, section*)|section+)).
The most powerful operator is D. Db, where a is a single symbol, is a regular expression that recognizes the set of all strings recognized by b that start with a but with the a stripped off the front. Db is called "the derivative of b with respect to a". D is formally defined as follows:
For any regular expression R, 'R is the sum of every aDR such that a is a symbol that appears in R. This can be written as 'R = aDR. In addition, R = aDR+R. Note that any symbol a that cannot start a string in R has DR = , so only symbols that start recognized strings need be considered.
a*(ba*)* = (a+b)* can be proved using derivatives as follows
Theorem T4': a*(ba*)* = (a+b)*
This proof is a bit longer than T4, but is also more mechanical: the derivatives "automate" the proof to a large extent.
Some further useful theorems are:
Theorem T10: (a*)* = a*
Theorem T11: (a*b*)* = (a+b)*
The following sequence derives a content model for a sequence of a and b symbols that contains at least two a symbols in succession. The easiest regular expression describing this language is (a+b)*aa(a+b)*, which produces an ambiguous SGML content model. The following sequence of equivalences shows that it is, however, equivalent to b*a(bb*a)*a(a+b)*, which is not ambiguous. Note that the intermediate steps in each sub-proof are not given.
Theorem T12: (a+b)*aa(a+b)* = b*a(bb*a)*a(a+b)*
Note that the order in which the derivatives were expanded in the final steps is significant. For example, if the last three steps above were replaced by the following two, the result would have been another ambiguous content model:
Ambiguity in SGML content models is defined in ISO 8879-1986. Unlike LL(1) grammars, regular expressions are not intrinsically ambiguous. Content model ambiguity can be defined formally for regular expressions as follows:
Because each regular expression describes a set of strings, simple set algebra can be applied to them. + corresponds to set union, and to the empty set. Set intersection and set difference are sometimes useful operations. The regular expression corresponding to a set intersection or difference can be found using the derivatives of the intersection or difference. For example, to find the regular expression for any string of a and b symbols that does not consist entirely of one or more a symbols:
The result is (a+b)* - aa* = a*b(a+b)*+.
Note that the derivative of an expression using set difference is simply the derivative of the two sets. The rules that can be applied for set operations are:
Finite Automata can be used as tools for the manipulation of content models. A Finite State Machine provides a graphical representation of a regular expression that, for some purposes, is easier to work with than the equivalent content model.
An interesting point about the proof of T11 and of the proof that (a+b)*aa(a+b)* = b*a(bb*a)*a(a+b)* is that the equivalence of the derivatives of two expressions is used as part of each proof. Two expressions are always equivalent if their derivatives (and the value of applied to each expression) are all equivalent.
If all the derivatives of a regular expression are taken, and then all the derivatives of those derivatives, and so on, only a finite number of distinct (i.e. non-equivalent) derivatives will be found. The original expression can be considered a derivative for this purpose (one can say R = DR). Note that in T11, there is only one distinct derivative: D((a*b*)*) = (a*b*)* for any x. This fact often leads to a solution when it is unclear how to reduce a regular expression.
As described earlier a Finite State Machine is a machine with a finite number of states, with a set of rules of the form "if, when in state X, the next symbol is Y, then go to state Z" or "if, when in state X, there are no more symbols, the string is recognized". A state with a rule of the latter form is called a "final state". One of the states is the "starting state". Any combination of state and symbol that does not have a rule indicates that a string is not in the language recognized by the machine. A Deterministic Finite Automaton, or DFA, is a Finite State Machine that for each combination of state X and symbol Y has only one rule of the first form. A Finite State Machine that allows a symbol to put it into two or more states is called a Nondeterministic Finite Automaton, or NFA.
A DFA can be constructed from a regular expression (and therefore from a content model) as follows:
If all the derivatives of a regular expression (including the original expression) are distinct (i.e. no two are equivalent), then they are called the "characteristic derivatives" of the regular expression. A DFA constructed using a set of characteristic derivatives is "minimal": it contains the minimum possible number of states for the regular expression.
A DFA is translated into a regular expressions simply by treating each state of the DFA as a derivative or the regular expression, and by setting S = for any state S that is a final state.
As an example, consider the following DFA:
Using the states as derivatives results in the following derivatives:
A little manipulation shows that S = (a+b)*, and as a consequence S = a(a+b)*. The regular expression corresponding to the DFA is therefore a(a+b)*.
Some regular expressions cannot be converted to unambiguous content models. An example is a(ba)*(b+), which is a language consisting of alternating a and b symbols, starting with an a and ending with either an a or a b. It is the (ba)*(b+) part of the expression that causes trouble; the leading a is no trouble. The following are the derivatives of (ba)*(b+):
Which is right back where we started from.
Irreducible regular expressions can be detected by examining their characteristic derivatives for "loops" that have more than one exit. In the example above, there is a "loop" of successive b and a symbols that can exit on either a b or an a.