"The Official Guide to Programming with OmniMark"

Magazine | Public forum

  International Edition   

White Papers

Content Model Algebra

By Sam Wilmott

Anyone desiring to have a full understanding of SGML content models and the theory behind the construction of text markup languages should have some familiarity with the basic concepts of Automata Theory and Set Theory on which content models are based.

1. Introduction
2. Languages

3. Transforming Regular Expressions 4. A Regular Expression Calculus 5. Set Operations
6. Finite Automata

1. Introduction

Anyone desiring to have a full understanding of SGML content models and the theory behind the construction of text markup languages should have some familiarity with the basic concepts of Automata Theory and Set Theory on which content models are based. Automata Theory is a branch of Computer Science to which some, but not all, Computer Science students are exposed. Set Theory is a branch of Mathematics to which grade-school students were exposed, at least in the days of "New Math".

This report provides an outline of some of the concepts of Automata Theory and Set Theory relevant to creating text markup languages. A background in formal Computer Science theory and Mathematics is required for an easy comprehension of the material in the paper. Other readers may find it interesting that there is a strong theoretical basis for what they are doing when they write content models.

2. Languages

SGML is a "meta-language": a language in which one writes grammars describing the syntax of other languages. This chapter discusses grammars, and "regular expressions", the type of grammars used in content models.

2.1. Grammars

The most important function of an SGML DTD is to define a set of elements, and to associate with each element declared content or a content model that defines what text and other elements are allowed within the element. A set of element declarations defines a grammar. A grammar is simply a definition of what elements and combinations of elements are allowed in a document. Everything else in a DTD, and in other parts of the prolog of an SGML document, are secondary to the definition of a grammar. Other information in the prolog deals with the details of how elements and text are entered, or (in the case of processing instructions and link process definitions) provide information that may aid in processing a document.

In theoretical Computer Science a grammar is said to "recognize" a string of symbols (for our purposes the characters that make up the text and markup of a document) that conform to the grammar. The set of strings recognized by a grammar is called a "language". In an SGML application, the language recognized by the grammar defined by the element declarations is the markup language used to mark up the document instance.

Both the part of Computer Science known as Formal Language Theory or Automata Theory, and modern theoretical Linguistics, define a large number of kinds of grammars. All grammars fall into one of the following four basic types:

  1. Type 3 or Regular Expression Grammars,

  2. Type 2 or Context-Free Grammars,

  3. Type 1 or Context-Sensitive Grammars, and

  4. Type 0 or Turing Machines.

In Automata Theory, each type of grammar corresponds to a characteristic type of machine that can recognize strings conforming to the grammar. These machines accept a string of symbols as input, one symbol at a time, and report only whether or not the string is recognized. A machine consists of a finite set of states and some form of data storage. The data storage contains symbols. For combinations of state, input symbol and data storage symbols, the machine chooses a next state, and describes a modification to the data storage. In addition, some states are defined as "final" states. If the machine reaches a final state at the end of the input string, the string is recognized. If the machine reaches a non-final state at the end of the string, or if for some combination of state, input symbol and data storage the machine does not choose a next state and data storage modification, then the string is not recognized.

The types of machines that correspond to types of grammars differ in the kinds of modification that can be done to data storage. For each of the four types of machines the allowed modification is:

  1. Type 3 machines have no data storage. They are consequently called "Finite State Machines".

  2. Type 2 machines have a single push-down stack, of which the machine can only examine the top-most symbol.

  3. Type 1 machines can examine any finite part of the data storage.

  4. Type 0 machines have no restriction on the way in which data storage can be modified or examined.

SGML-definable grammars are Type 2 or Context-Free, and fall into the class known as LL(1). In (unambiguous) LL(1) grammars, each symbol in the input string uniquely determines the production to be followed. For example, in SGML, each tag or data uniquely determines what can be recognized following it.

2.2. Regular Expression Notation

The notation used to describe regular expressions in Computer Science texts is less verbose than that for content models, and more readily lends itself to mathematical manipulation. A regular expression consists of symbols, usually single lowercase Latin letters, and the following operations:

  1. "or" (union) is represented by +, for example a+b.

  2. "seq" (conjunction) is represented simple adjacency or (rarely) by ^, for example ab or a^b.

  3. "rep" (repetition) is represented by *, for example a*.

Lowercase Latin letters are also used to represent arbitrary regular expressions. Because the equivalence between two regular expressions remains valid when symbols in the equivalent expressions are uniformly replaced by other regular expressions, using letters in this way produces no confusion. (Exceptions to this convention, where letters represent exactly one symbol, are always noted in the following sections.)

Two Greek letters represent particular regular expressions:

  • λ is a regular expression that recognizes zero-length strings.

  • φ is a regular expression that recognizes no strings.

Properly speaking, both λ and φ represent sets of one and zero members respectively. There is no difficulty in using the representation of a set of one member interchangeably with the representation of that one member. Because of this, λ is used both to represent the zero-length string and the set containing only that string.

Parentheses can be used to disambiguate expressions. Such disambiguation is not required in the following cases:

  1. where an asterisk follows a symbol or an expression in parentheses - for example a+b*+c means a+(b*)+c;

  2. where conjoined sub-expressions are united - for example abc+de means (abc)+(de);

  3. where a sequence of three or more sub-expressions are joined by union or conjunction - for example abc means both (ab)c and a(bc), and a+b+c means both (a+b)+c and a+(b+c).

As an example of equivalent expressions, the content model (a,(b|c)+,d?) is equivalent to the regular expression a(b+c)(b+c)*(d+λ). The content model "plus" occurrence indicator is not required, as it can be expressed in terms of *. For example, (a+) is equivalent to the regular expression aa*, for any a. Likewise the "opt" occurrence indicated can be expressed as a union of the optional expression with the zero-length expression, λ. For example, (a?) is equivalent to the regular expression a+λ, for any a.

Both λ and φ are useful when performing complex transformations on regular expressions.

2.3. Content Models and Regular Expressions

Content models and regular expressions differ in that regular expressions lack the + and & operations, regular expressions lack any distinction between "elements" and "data", and content models lack any representation for λ or φ. These differences can be accommodated in the following ways:

  1. The content model (a+) can be written as the regular expression aa*.

  2. The content model (a&b) can be rewritten as the content model ((a,b)|(b,a)) and then converted.

  3. Data can be represented in a regular expression by assigning a symbol to it.

  4. λ can be eliminated by using the equivalence aλ = a or by using the SGML "opt" occurrence indicator (e.g. (a+λ) = (a?)).

  5. φ can be eliminated by using the equivalences aφ = φ and a+φ = a.

3. Transforming Regular Expressions

The most important use of regular expressions in the design of text markup languages is in constructing content models that do exactly what the designer wants them to do. The right content model can be found by writing down what is desired using a (possibly ambiguous) content model and then transforming it into a form that is not ambiguous. Knowing how to transform regular expressions also gives insight into how content models "work", and makes it easier to write complex content models.

3.1. Equivalence

Transforming regular expressions becomes interesting when, in the design of a markup language, an ambiguous content model occurs and an equivalent unambiguous content model is required. For example ((a,b)*,a) is ambiguous, but (a,(b,a)*), which is equivalent, is not. The question is, given the first one, how does one find the second one? In addition, given both, how does one show that they are equivalent? They both represent a sequence of alternating a and b elements, with a elements both starting and ending, and this is all a markup language designer needs to know. For more complex expressions a more convincing demonstration of equivalence is required.

Another useful example of equivalence that often occurs in practice is that (a*,(b,a*)*) is equivalent to (a|b)*. In other words, both content models simply describe a sequence of zero or more elements, made up entirely of a and b elements. Although both are unambiguous content models, it is easier to see what is going on using the second form, especially when a and b are complex models themselves, and it is easier to modify the second form.

First of all, we must be clear on what "equivalent" means: two content models or regular expressions are equivalent when they recognize exactly the same set of strings, i.e. when they describe the same language.

Secondly we need a way of finding and showing that expressions are equivalent. For this we need to formalize the definition of equivalence.

3.2. An Axiomatic System For Regular Expressions

Following is a set of eleven axioms and two rules that provide a formalization of the equivalence of regular expressions, and more generally describe what can be done with regular expressions. Each axiom and rule is labelled for later use in the proof of theorems of equivalence.

  1. A1: (a+b)+c = a+(b+c)

  2. A2: (ab)c = a(bc)

  3. A3: a+b = b+a

  4. A4: a(b+c) = ab+ac

  5. A5: (a+b)c = ac+bc

  6. A6: a+a = a

  7. A7: aλ = a

  8. A8: aφ = φ

  9. A9: a+φ = a

  10. A10: a* = aa*+λ

  11. A11: a* = (a+λ)*

  12. R1: If:

    1. a, b, c, a' and b' are regular expressions,

    2. a = b,

    3. a is a symbol that occurs in both a and b,

    4. a' is identical to a except with each occurrence of a in a replaced by c, and

    5. b' is identical to b except with each occurrence of a in b replaced by c,

    then a' = b' (the Rule of Substitution).

  13. R2: If:

    1. a, b and c are regular expressions,

    2. a = ba+c, and

    3. the set of strings recognized by b does not include λ,

    then a = b*c (the Rule of Solution of Equations).

If union and addition are thought of as analogous, and conjunction and multiplication are likewise, many of the axioms, such as a+b = b+a, are familiar from arithmetic, although the analogy cannot be taken too far (e.g. ab (neq) ba). Note that φ functions as the identity for union (analogous to zero in arithmetic), and that λ functions as the identity for conjunction (analogous to one).

The two rules provide ways of manipulating axioms, rather than being axioms themselves. They are called "Rules of Inference" and are used in creating proofs of theorems.

3.3. Proving Theorems

Theorems are equations which can be proven using axioms and other theorems. Following are a few theorems that solve a problem with the above set of axioms: they are not all symmetric. For example, A7 states that aλ = a but there is nothing that says λa = a.

The following two theorems are proofs that the symmetric forms of axioms A8 and A9 are true:

Theorem A8': φa = φ
Lemma L1: φa = (φφ)a: A8 = φ(φa): A2 = φ(φa)+φ: A9
∴ φa = φ*φ: L1, R2 = φ: A8

Theorem A9': φ+a = a
∴ φ+a = a+φ: A3 = a: A9

As can be seen in each of these cases, a proof consists of applying one axiom or previously proven theorem at a time, and in a step-by-step manner, showing the equivalence of two regular expressions.

Some theorems use sub-proofs in their derivation, as in theorem A8' above, which needs to prove that φa = φ(φa)+φ to produce the desired result. Sometimes these sub-proofs are of wide enough utility to be proven as theorems, as in this interesting demonstration of the relationship between φ and λ:

Theorem T1: λ = φ*
∴ φ* = φφ*+λ: A10 = φ+λ: A8' = λ: A9'

(Some text books actually use φ* instead of λ.)

The theorems already proven can be used to prove the symmetric form of axiom A7:

Theorem a7': a = λa
Lemma L1: a = φ+a: A9' = φa+a: A8'
∴ a = φ*a: L1, R2 = λa: T1

Another simple theorem shows how symbols can be moved around repetitions. This theorem can be used to prove the symmetric form of axiom A10.

Theorem T2: aa* = a*a
Lemma L1: aa* = a(aa*+λ): A10 = a(aa*)+aλ: A4 = a(aa*)+a: A7
∴ aa* = a*a: L1, R2

Theorem A10': a* = a*a+λ
∴ a* = aa*+λ: A10 = a*a+λ: T2

3.4. Using Theorems To Help Create Content Models

The following theorem is similar to T2. T2 and T3 are useful in transforming ambiguous SGML content models into unambiguous models:

Theorem T3: a(ba)* = (ab)*a
Lemma L1: a(ba)* = a((ba)(ba)*+λ): A10 = a(ba)(ba)*+aλ = a(ba)(ba)*+a: A7 = (ab)(a(ba)*)+a: A2
∴ a(ba)* = (ab)*a: L1, R2

Theorems can also help in providing simplified forms of content models. The following slightly more complex proof illustrates a simplification that is often required:

Theorem T4: (a*b)*a* = a*(ba*)* = (a+b)*
Lemma L1: a*(ba*)* = (aa*+λ)(ba*)*: A10 = (aa*)(ba*)*+λ(ba*)*: A4 = a(a*(ba*)*)+λ(ba*)*: A2 = a(a*(ba*)*)+(ba*)*: A7' = a(a*(ba*)*)+((ba*)(ba*)*+λ): A10 = a(a*(ba*)*)+(b(a*(ba*)*)+λ): A2 = (a(a*(ba*)*)+b(a*(ba*)*))+λ: A1 = (a+b)(a*(ba*)*))+λ: A4
Lemma L2: a*(ba*)* = (a+b)*λ: L1, R2 = (a+b)*: A7
∴ (a*b)*a* = a*(ba*)*: T3 = (a+b)*: L2

This theorem proves that the content models (a,(b,a)*), ((a,b)*,a) and (a|b)* are all equivalent.

Regular expressions can be easily transliterated into content models, as long as the following items are addressed:

  1. The regular expression does not correspond to an ambiguous content model. Detecting and eliminating this condition is discussed later.

  2. The regular expression does not contain any use of λ other than in the context of a+λ. Any other use of λ can be eliminated by using aλ = a and λa = a.

  3. The regular expression does not contain any use of φ. All uses of φ can be eliminated by using aφ = φ and a+φ = a.

3.5. Factoring In Optional Expressions

Most transformations of regular expressions involve "factoring" a sub-expression into a larger expression. Following are some theorems that factor in (or out) sub-expressions:

Theorem T5: (a+b)(c+d) = ac+ad+bc+bd
∴ (a+b)(c+d) = a(c+d)+b(c+d): A5 = (ac+ad)+(bc+bd): A4

Theorem T6: a* = a*+λ
∴ a* = aa*+λ: A10 = aa*+(λ+λ): A6 = (aa*+λ)+λ: A1 = a*+λ: A10

Theorem T7: a*b* = a*b*+λ
Lemma L1: a*b* = (a*+λ)(b*+λ): T6 = a*b*+a*λ+λb*+λλ: T5 = a*b*+a*+b*+λ: A7, A7'
∴ a*b* = a*b*+a*+b*+λ: L1 = a*b*+a*+b*+(λ+λ): A6 = (a*b*+a*+b*+λ)+λ: A1 = a*b*+λ: L1

Theorem T8: (a+λ)a* = a*
∴ (a+λ)a* = aa*+λa*: A5 = aa*+a*: A7' = aa*+(a*+λ): T6 = aa*+(λ+a*): A3 = (aa*+λ)+a*: A1 = a*+a*: A10 = a*: A6

Theorem T9: a* = a*a*
Lemma L1: a* = a*+a*:A6 = (aa*+λ)+a*: A10 = aa*+(λ+a*): A1 = aa*+(a*+λ): A3 = aa*+a*: T6
∴ a* = a*a*: L1, R2

3.6. Being Careful

Consider the following "theorem":

Theorem NT1: a*b* = a*
Lemma L1: a*b* = a*b*+λ: T7 = ((a+λ)a*)b*+λ: T8 = (a+λ)(a*b*)+λ: A2
∴ a*b* = (a+λ)*λ: L1, R2 = (a+λ)*: A7 = a*: A11

Clearly there must be an error in the proof, but where is it? The error is in applying rule R2 when the first regular expression describes a language that contains λ. (In applying R2, the components were a = a*b*, b = a+λ, and c = λ. R2 states that b must not contain λ.)

4. A Regular Expression Calculus

Examples of the power of regular expressions, and important in discovering equivalences between regular expressions, are the "delta" and "derivative" operators defined in this chapter. These operators turn regular expressions from an interesting theory into a tool that can be used for practical purposes.

4.1. The "delta" Operator

The key part of the proof that a*(ba*)* = (a+b)* involved factoring out an a and a b from the expression and observing that the original expression was left. This is a very powerful and general method of analysing regular expressions. Proofs using this technique are made easier by defining "operators" on regular expressions, and writing proofs in terms of the operators. The simplest operator is (delta), which indicates whether the language defined by a regular expression contains the zero-length string, or less verbosely, whether a regular expression contains λ. The expression (delta)a means:

  • λ if a contains λ, or

  • φ if a does not contain λ.

(delta)'a can also be defined as being the regular expression which recognizes all the strings that a does, except for λ. Note that:

  • if (delta)a = φ, then (delta)'a = a, and

  • (delta)'a+(delta)a = a.

(delta) can more formally be defined as follows:

  1. (delta)a = φ when a is a single symbol

  2. (delta)λ = λ

  3. (delta)φ = φ

  4. (delta)(a+b) = ((delta)a)+((delta)b) (which can be more conveniently written (delta)a+(delta)b)

  5. (delta)(ab) = ((delta)a)((delta)b)

  6. (delta)(a*) = λ.

Note that this definition provides a definition of (delta) for any regular expression. Note also that (delta)(ab) = ((delta)a)((delta)b) is equivalent to λ if both (delta)a and (delta)b are λ, and is equivalent to φ if either or both are φ.

(delta)' can be formally defined similarly:

  1. (delta)'a = a when a is a single symbol

  2. (delta)'λ = φ

  3. (delta)'φ = φ

  4. (delta)'(a+b) = (delta)'a+(delta)'b

  5. (delta)'(ab) = ((delta)'a)b+((delta)a)((delta)'b)

  6. (delta)'(a*) = ((delta)'a)a*.

The equivalence (delta)'(ab) = ((delta)'a)b+((delta)a)((delta)'b) provides a hint as to the use that can be made of both (delta) and (delta)'.

(delta) and (delta)' are both used primarily in defining other operators, but one simple use of (delta)' is in eliminating the "empty" case from content models. For example, a typical definition of a "chapter" is a "title" followed by zero or more "para" elements (paragraphs) followed by zero or more "section" elements, except that there must be at least one paragraph or section. Neither of the content models (title, para*, section*) or (title, (para|section)+) do the job. The trick is to get a model equivalent to (para*, section*) with the empty case not allowed. If the content model is written as the regular expression p*s*, the desired expression is (delta)'(p*s*). Using the definitions above, the desired expression is:

(delta)'(p*s*) = ((delta)'(p*))s*+((delta)(p*))((delta)'(s*)) = (pp*)s*+λ(ss*) = pp*s*+ss*

The content model required is therefore (title, ((para+, section*)|section+)).

4.2. Derivatives

The most powerful operator is D. D(a)b, where a is a single symbol, is a regular expression that recognizes the set of all strings recognized by b that start with a but with the a stripped off the front. D(a)b is called "the derivative of b with respect to a". D(a) is formally defined as follows:

  1. D(a)a = λ when a is a single symbol

  2. D(a)b = φ when a and b are single symbols and a (neq) b

  3. D(a)λ = φ

  4. D(a)φ = φ

  5. D(a)(b+c) = (D(a)b)+(D(a)c)

  6. D(a)(bc) = (D(a)b)c+((delta)b)(D(a)c)

  7. D(a)(b*) = (D(a)b)b*.

For any regular expression R, (delta)'R is the sum of every aD(a)R such that a is a symbol that appears in R. This can be written as (delta)'R = (Sigma)aD(a)R. In addition, R = (Sigma)aD(a)R+(delta)R. Note that any symbol a that cannot start a string in R has D(a)R = φ, so only symbols that start recognized strings need be considered.

a*(ba*)* = (a+b)* can be proved using derivatives as follows

Theorem T4': a*(ba*)* = (a+b)*
Lemma L1: D(a)(a*(ba*)*) = (D(a)(a*))(ba*)*+(delta)(a*)(D(a)((ba*)*)) = (D(a)a)a*(ba*)*+λ(D(a)(ba*)(ba*)*) = λa*(ba*)*+λ((D(a)b)a*(ba*)*) = a*(ba*)*+λ(φa*(ba*)*) = a*(ba*)*+φ = a*(ba*)*
Lemma L2: D(b)(a*(ba*)*) = (D(b)(a*))(ba*)*+(delta)(a*)(D(b)((ba*)*)) = (D(b)a)a*(ba*)*+λ(D(b)(ba*)(ba*)*) = φa*(ba*)*+λ((D(b)b)a*(ba*)*) = φ+λ(λa*(ba*)*) = λλa*(ba*)* = a*(ba*)*
Lemma L3: (delta)(a*(ba*)*) = ((delta)(a*))((delta)((ba*)*)) = λλ = λ
Lemma L4: a*(ba*)* = aD(a)(a*(ba*)*)+bD(b)(a*(ba*)*)+(delta)(a*(ba*)*) = a(a*(ba*)*)+b(a*(ba*)*)+λ: L1, L2, L3 = (a+b)(a*(ba*)*)+λ
∴ a*(ba*)* = (a+b)*λ: L4, R2 = (a+b)*: A7

This proof is a bit longer than T4, but is also more mechanical: the derivatives "automate" the proof to a large extent.

Some further useful theorems are:

Theorem T10: (a*)* = a*
Lemma L1: D(a)((a*)*) = (D(a)(a*))(a*)* = (D(a)a)a*(a*)* = λa*(a*)* = a*(a*)* = a*(a*)*+λ: T7 = (a*)*: A10
Lemma L2: (delta)((a*)*) = λ
Lemma L3: (a*)* = a(D(a)((a*)*))+(delta)((a*)*) = a(a*)*+λ: L1, L2
∴ (a*)* = a*λ: L3, R2 = a*

Theorem T11: (a*b*)* = (a+b)*
Lemma L1: D(a)(a*b*)* = D(a)(a*b*)(a*b*)* = ((D(a)(a*))b*+((delta)(a*))(D(a)(b*)))(a*b*)* = ((D(a)a)a*b*+λ(D(a)b)b*)(a*b*)* = (λa*b*+λφb*)(a*b*)* = (a*b*+φ)(a*b*)* = (a*b*)(a*b*)* = (a*b*)(a*b*)*+λ: T7 = (a*b*)*: A10
Lemma L2: D(b)(a*b*)* = D(b)(a*b*)(a*b*)* = ((D(b)(a*))b*+((delta)(a*))(D(b)(b*)))(a*b*)* = ((D(b)a)a*b*+λ(D(b)b)b*)(a*b*)* = (φa*b*+λλb*)(a*b*)* = (φ+b*)(a*b*)* = b*(a*b*)*
Lemma L3: (delta)(a*b*)* = λ
Lemma L4: D(a)(b*(a*b*)*) = (D(a)(b*))(a*b*)*+((delta)b*)(D(a)(a*b*)*) = (D(a)b)b*(a*b*)*+λ((a*b*)*): L1 = φb*(a*b*)*+(a*b*)* = φ+(a*b*)* = (a*b*)*
Lemma L5: D(b)(b*(a*b*)*) = (D(b)(b*))(a*b*)*+((delta)b*)(D(b)(a*b*)*) = (D(b)b)b*(a*b*)*+λ(b*(a*b*)*): L2 = λb*(a*b*)*+λb*(a*b*)* = λb*(a*b*)* = b*(a*b*)*
Lemma L6: (delta)(b*(a*b*)*) = ((delta)(b*))((delta)((a*b*)*) = λλ = λ
Lemma L7: D(b)((a*b*)*) = b*(a*b*)* = aD(a)(b*(a*b*)*)+bD(b)(b*(a*b*)*)+(delta)(b*(a*b*)*) = a((a*b*)*)+b(b*(a*b*)*)+λ: L4, L5, L6 = aD(a)((a*b*)*)+bD(b)((a*b*)*)+(delta)((a*b*)*): L1, L2, L3 = (a*b*)*
Lemma L8: (a*b*)* = aD(a)((a*b*)*)+bD(b)((a*b*)*)+(delta)((a*b*)*) = a((a*b*)*)+b((a*b*)*)+λ: L1, L7, L3 = (a+b)(a*b*)*+λ: A5
∴ (a*b*)* = (a+b)*λ: L8, R2 = (a+b)*

4.3. A Long Proof

The following sequence derives a content model for a sequence of a and b symbols that contains at least two a symbols in succession. The easiest regular expression describing this language is (a+b)*aa(a+b)*, which produces an ambiguous SGML content model. The following sequence of equivalences shows that it is, however, equivalent to b*a(bb*a)*a(a+b)*, which is not ambiguous. Note that the intermediate steps in each sub-proof are not given.

Theorem T12: (a+b)*aa(a+b)* = b*a(bb*a)*a(a+b)*

  • R = (a+b)*aa(a+b)*
  • D(a)R = (a+b*)aa(a+b)*+a(a+b)* = D(1)
  • D(b)R = (a+b*)aa(a+b)* = R
  • (delta)R = φ
  • D(a)D(1) = (a+b*)aa(a+b)*+a(a+b)*+(a+b)* = D(2)
  • D(b)D(1) = (a+b*)aa(a+b)* = R
  • (delta)D(1) = φ
  • D(a)D(2) = (a+b*)aa(a+b)*+a(a+b)*+(a+b)* = D(2)
  • D(b)D(2) = (a+b*)aa(a+b)*+(a+b)* = D(3)
  • (delta)D(2) = λ
  • D(a)D(3) = (a+b*)aa(a+b)*+a(a+b)*+(a+b)* = D(2)
  • D(b)D(3) = (a+b*)aa(a+b)*+(a+b)* = D(3)
  • (delta)D(3) = λ
  • R = aD(a)R+bD(b)R+(delta)R = aD(1)+bR+φ
  • D(1) = aD(a)D(1)+bD(b)D(1)+(delta)D(1) = aD(2)+bR+φ
  • D(2) = aD(a)D(2)+bD(b)D(2)+(delta)D(2) = aD(2)+bD(3)+λ
  • D(3) = aD(a)D(3)+bD(b)D(3)+(delta)D(3) = aD(2)+bD(3)+λ
  • D(2) = D(3)
  • D(2) = aD(2)+bD(3)+λ = aD(2)+bD(2)+λ = (a+b)D(2)+λ = (a+b)*
  • R = aD(1)+bR+φ = bR+aD(1) = b*aD(1)
  • D(1) = aD(2)+bR+φ = a(a+b)*+bb*aD(1) = (bb*a)D(1)+a(a+b)* = (bb*a)*a(a+b)*
    ∴ R = b*aD(1) = b*a(bb*a)*a(a+b)*

Note that the order in which the derivatives were expanded in the final steps is significant. For example, if the last three steps above were replaced by the following two, the result would have been another ambiguous content model:

  • D(1) = aD(2)+bR+φ = a(a+b)*+bR

  • R = aD(1)+bR+φ = a(a(a+b)*+bR)+bR = (ab+b)R+aa(a+b)* = (ab+b)*aa(a+b)*

4.4. Ambiguity

Ambiguity in SGML content models is defined in ISO 8879-1986. Unlike LL(1) grammars, regular expressions are not intrinsically ambiguous. Content model ambiguity can be defined formally for regular expressions as follows:

  1. a is not ambiguous if a is a single symbol.

  2. λ is not ambiguous.

  3. φ is not ambiguous.

  4. a+b is ambiguous if and only if:

    1. a is ambiguous,

    2. b is ambiguous, or

    3. for some symbol x, D(x)a (neq) φ and D(x)b (neq) φ.

  5. ab is ambiguous if and only if:

    1. a is ambiguous,

    2. b is ambiguous,

    3. (delta)a = λ and for some symbol x, D(x)a (neq) φ and D(x)b (neq) φ, or

    4. for some c and d, where c (neq) a and d (neq) b, ab = cd, and cd is ambiguous.

      (For example, (x(y+λ))y is ambiguous because the preceding rule says that the (y+λ)y in x((y+λ)y) is ambiguous.

  6. a* is ambiguous if and only if aa is ambiguous.

5. Set Operations

Because each regular expression describes a set of strings, simple set algebra can be applied to them. + corresponds to set union, and φ to the empty set. Set intersection and set difference are sometimes useful operations. The regular expression corresponding to a set intersection or difference can be found using the derivatives of the intersection or difference. For example, to find the regular expression for any string of a and b symbols that does not consist entirely of one or more a symbols:

  1. R = (a+b)* - aa*

  2. D(a)R = (D(a)((a+b)*) - (D(a)(aa*)) = (a+b)* - a* = D(1)

  3. D(b)R = (D(b)((a+b)*) - (D(b)(aa*)) = (a+b)* - φ = (a+b)*

  4. (delta)R = (delta)((a+b)*) - (delta)(aa*) = λ - φ = λ

  5. D(a)D(1) = (D(a)((a+b)*) - (D(a)a*) = (a+b)* - a* = D(1)

  6. D(b)D(1) = (D(b)((a+b)*) - (D(b)a*) = (a+b)* - φ = (a+b)*

  7. (delta)D(1) = (delta)((a+b)*) - (delta)a* = λ - λ = φ

  8. D(1) = aD(a)D(1)+bD(b)D(1)+(delta)D(1) = aD(1)+b(a+b)*+φ = aD(1)+b(a+b)* = a*b(a+b)*

  9. R = aD(a)R+bD(b)R+(delta)R = aD(1)+b(a+b)*+λ = aa*b(a+b)*+b(a+b)*+λ = (aa*b+b)(a+b)*+λ = (aa*+λ)b(a+b)*+λ = a*b(a+b)*+λ.

The result is (a+b)* - aa* = a*b(a+b)*+λ.

Note that the derivative of an expression using set difference is simply the derivative of the two sets. The rules that can be applied for set operations are:

  1. D(a)(R(1) - R(2)) = (D(a)R(1)) - (D(a)R(2))

  2. D(a)(R(1) ∪ R(2)) = (D(a)R(1)) ∪ (D(a)R(2))

  3. D(a)(R(1) ∩ R(2)) = (D(a)R(1)) ∩ (D(a)R(2))

  4. (delta)(R(1) - R(2)) = ((delta)R(1)) - ((delta)R(2))

  5. (delta)(R(1) ∪ R(2)) = ((delta)R(1)) ∪ ((delta)R(2))

  6. (delta)(R(1) ∩ R(2)) = ((delta)R(1)) ∩ ((delta)R(2))

6. Finite Automata

Finite Automata can be used as tools for the manipulation of content models. A Finite State Machine provides a graphical representation of a regular expression that, for some purposes, is easier to work with than the equivalent content model.

6.1. Derivatives and Finite Automata

An interesting point about the proof of T11 and of the proof that (a+b)*aa(a+b)* = b*a(bb*a)*a(a+b)* is that the equivalence of the derivatives of two expressions is used as part of each proof. Two expressions are always equivalent if their derivatives (and the value of (delta) applied to each expression) are all equivalent.

If all the derivatives of a regular expression are taken, and then all the derivatives of those derivatives, and so on, only a finite number of distinct (i.e. non-equivalent) derivatives will be found. The original expression can be considered a derivative for this purpose (one can say R = DλR). Note that in T11, there is only one distinct derivative: D(x)((a*b*)*) = (a*b*)* for any x. This fact often leads to a solution when it is unclear how to reduce a regular expression.

As described earlier a Finite State Machine is a machine with a finite number of states, with a set of rules of the form "if, when in state X, the next symbol is Y, then go to state Z" or "if, when in state X, there are no more symbols, the string is recognized". A state with a rule of the latter form is called a "final state". One of the states is the "starting state". Any combination of state and symbol that does not have a rule indicates that a string is not in the language recognized by the machine. A Deterministic Finite Automaton, or DFA, is a Finite State Machine that for each combination of state X and symbol Y has only one rule of the first form. A Finite State Machine that allows a symbol to put it into two or more states is called a Nondeterministic Finite Automaton, or NFA.

A DFA can be constructed from a regular expression (and therefore from a content model) as follows:

  1. Find all the distinct derivatives of the regular expression. For each distinct derivative, produce an equation of the form R = (Sigma)aD(a)R+(delta)R, where R is the derivative, and where the equation defines the derivative in terms of distinct derivatives of the original regular expression.

  2. Create a machine with a state for each of the derivatives.

  3. Create a rule for each state and for each symbol a for which, where R is the derivative corresponding to that state, D(a)R (neq) φ. The rule states that for that state and the symbol a, the next state is the one corresponding to D(a)R.

  4. The state corresponding to the original expression is the "starting state".

  5. Each state which corresponds to a derivative R and for which (delta)R = λ is a "final state".

6.2. Characteristic Derivatives

If all the derivatives of a regular expression (including the original expression) are distinct (i.e. no two are equivalent), then they are called the "characteristic derivatives" of the regular expression. A DFA constructed using a set of characteristic derivatives is "minimal": it contains the minimum possible number of states for the regular expression.

6.3. Vice-Versa - Translating DFAs into Regular Expressions

A DFA is translated into a regular expressions simply by treating each state of the DFA as a derivative or the regular expression, and by setting (delta)S = λ for any state S that is a final state.

As an example, consider the following DFA:

  • The initial state is S1. In state S1, a takes one to state S2 and (delta)S(1) = φ.

  • In state S2, both a and b take one back to state S2, and (delta)S(2) = λ.

Using the states as derivatives results in the following derivatives:

  • S(1) = aS(2)

  • S(2) = aS(2)+bS(2)+λ

A little manipulation shows that S(2) = (a+b)*, and as a consequence S(1) = a(a+b)*. The regular expression corresponding to the DFA is therefore a(a+b)*.

6.4. Irreducible Regular Expressions

Some regular expressions cannot be converted to unambiguous content models. An example is a(ba)*(b+λ), which is a language consisting of alternating a and b symbols, starting with an a and ending with either an a or a b. It is the (ba)*(b+λ) part of the expression that causes trouble; the leading a is no trouble. The following are the derivatives of (ba)*(b+λ):

  1. R = (ba)*(b+λ)

  2. D(a)R = φ

  3. D(b)R = a(ba)*(b+λ)+λ = D(1)

  4. (delta)R = λ

  5. D(a)D(1) = (ba)*(b+λ) = R

  6. D(b)D(1) = φ

  7. (delta)D(1) = λ

  8. R = aD(a)R+bD(b)R+(delta)R = φ+bD(1)+λ = bD(1)+λ

  9. D(1) = aD(a)D(1)+bD(b)D(1)+(delta)D(1) = aR+φ+λ = aR+λ

  10. R = b(aR+λ)+λ = (ba)R+(b+λ) = (ba)*(b+λ)

Which is right back where we started from.

Irreducible regular expressions can be detected by examining their characteristic derivatives for "loops" that have more than one exit. In the example above, there is a "loop" of successive b and a symbols that can exit on either a b or an a.


Home Copyright Information Website Feedback Site Map Search