SGML: Simplification & disambiguation revisited

SGML: Simplification & disambiguation revisited

Article: 6816 of comp.text.sgml
Newsgroups: comp.text.sgml
Date: 29 Nov 1994 21:48:14 UT
From: Arjan Loeffen <>
Message-ID: <>
Subject: Simplification & disambiguation revisited

Dear reader,

On comp.text.sgml 1994-04-07, Hans Holger Rath tried to start off a
discussion on disambiguation and simplification of content models.  The
rules are valuable if you are designing an SGML based system (e.g., an
editor, or a database engine), and if you want to be sure that any unit
(data, or element) in content may be connected to a single, immutable token
in the content model.  This is relevant if you want the system to be able
to tell you what part of the content model allowed the unit to appear at
its position.  It is also relevant if you want the system to offer a
"prediction set", i.e., a pick-list of possible units to precede, follow or
start the current one.  If the content model is not simplified or
ambiguous, changes in the vicinity of a unit will corrupt the reference to
the content model; it is claimed that if the content model is simplified,
no such corruption will take place.

In this submission I've extended these rules.  As I'm bound to close this
section of my PhD thesis, I feel I should show them to you first, asking
for comments and/or corrections.  If you have better names for the
simplification rules, let me know.  I've heard of no other attempts to
augment the list of Rath; if you have such a reference, please let me know.

Thanks in advance,

(optimize model automatically, half of it by H. H. Rath)

  Nr:   Token:                           Reversable: Rewrites to:

bracketing rules
  1     (A)                              -           A
  2     (A)?                             -           A?
  3     (A)*                             -           A*
  4     (A)+                             -           A+
  5     (A?)                             -           A?
  6     (A*)                             -           A*
  7     (A+)                             -           A+

paired predicate rules
  1     (A*)?                            -           A*
  2     (A*)*                            -           A*
  3     (A*)+                            -           A*
  4     (A?)?                            -           A?
  5     (A?)*                            -           A*
  6     (A?)+                            -           A*
  7     (A+)?                            -           A*
  8     (A+)*                            -           A*
  9     (A+)+                            -           A+

group reduction rules
  1     (A, (B, C))                      y           (A, B, C)
  2     (A | (B | C))                    y           (A | B | C)
  3     (A | B | A)                      -           (A | B)

factorization rules
  1     (A? | B)                         y           (A | B)?
  2     (A* | B)                         y           (A+ | B)?
  3     (A? | B)?                        y           (A | B)?
  4     (A* | B)?                        y           (A+ | B)?
  5     (A? | B)+                        y           (A | B)*
  6     (A+ | B)+                        y           (A | B)+
  7     (A+ | B)*                        y           (A | B)*
  8     (A* | B)+                        y           (A | B)*
  9     (A? | B)*                        y           (A | B)*
  10    (A* | B)*                        y           (A | B)*

seq-reduction rules
  1     A? , A*                          y           A*
  2     A+ , A?                          -           A+
  3     A? , A+                          -           A+
  4     A* , A*                          -           A*
  5     A+ , A+                          -           A, A+
  6     A+ , A*                          -           A+

or-reduction rules
  1     A | A                            -           A
  2     A | A?                           y           A?
  3     A | A*                           y           A*
  4     A | A+                           y           A+
  5     A? | A?                          -           A?
  6     A? | A*                          y           A*
  7     A? | A+                          y           A*
  8     A* | A*                          -           A*
  9     A* | A+                          y           A*
  10    A+ | A+                          -           A+

and-reduction rules
  1     A & A                            -           (A , A)
  2     A? & A                           y           (A , A?)
  3     A* & A                           y           (A , A*)
  4     A+ & A                           y           (A , A+)
  5     A+ & A+                          -           (A , A+)

pcdata reduction rules (p = pcdata)
  1     p , p                            -           p
  2     p | p                            -           p
  3     (p & p)                          -           p
  4     p*                               -           p
  5     p+                               -           p
  6     p?                               -           p

("signal a problem", mostly from H. H. Rath)

"a" stands for a primitive content token or a unit that represents
"declared" content (DATA, EMPTY, ANY), while A and B stand for any content
token.  (A first) and (A last) stand for services "first" and "last" that
return the first/last content token in the definition of A (i.e., in the
model group A, see [Aho & Corasick '86, page 45 ff]).  The tokens between
square brackets are examples of ambiguous content.  Because the recognition
should take place without a lookahead the last character in the content
shown between square brackets cannot be linked to a unique location in the
content model.

1                  a is unambiguous

2                  A? is unambiguous iff A is unambiguous

3                  A+ and A* are unambiguous iff
3.1                A is unambiguous
3.2                for all a in ((A first) intersect (A last)) holds
                   that a is required in (A last) (i.e. a is
                   not opt-group, plus-group or rep-group)

                   Examples: (P, Q, P?)+ [PQP], and (P, Q,
                   P+)+ [PQPP]

4                  (A | B) is unambiguous iff
4.1                A and B are unambiguous
4.2                ((A first) intersect (B first)) == {}

                   Example: ((P, Q) | (P, R)) [P]

5 seq-group        (A, B) is unambiguous iff
5.1                A and B are unambiguous
5.2                for all a in ((A last) intersect (B first)) hold
                   that a is required in (A last)

                   Example: ((P, Q?), (Q, R)) [PQ]

5.3                if A is plus-group or rep-group then ((A
                   first) intersect (B first)) == {}

                   Example: (P+, P) [PP]

6                  (A & B) is unambiguous iff
6.1                A and B are unambiguous
6.2                ((A first) intersect (B first)) == {}

                   Example: ((P, Q) & (P, R)) [P]

6.3                for all a in ((A last) intersect (B first)) holds
                   that a is required in (A last)

                   Example: ((P, Q?) & (Q, R)) [PQ]

6.4                for all a in ((A first) intersect (B last)) holds
                   that a is required in (B last)

                   Example: ((P, Q) & (R, P?)) [RP]

7                  A?, A? is ambiguous

8                  A*, A+ is ambiguous

9                  An and-group is ambiguous when it
                   describes a sequence of two or more equal
                   content tokens.

                   Example: (A? & B & A+) [A]

Arjan Loeffen           Achter de Dom 22-24  ++31+30536417  voice work
Faculty of Arts         3512JP Utrecht       ++31+206656463 voice home
University of Utrecht   The Netherlands      ++31+30539221  fax work