Subject: An ambiguity test with correctness proof (long)


Keywords: Ambiguity, Content models Organization: Rechenzentrum der Universitaet Freiburg, Deutschland Newsgroups: comp.text.sgml

CTS archive link: here




Below is an algorithm for unambiguity testing,
together with a correctness proof. The test is based
on rewriting of and groups, along the lines proposed by Erik Naggum.
It is an extension of work described in my technical report
"Regular Expressions into Finite Automata".

I refer to my earlier posting, which contained a rigorous definition
of unambiguity. Below is a summary with small changes in terminology,
with thanks to James Clark, who suggested some of them.


Terminology/Definition of unambiguity
=====================================

We start with an alphabet S of symbols (corresponding to primitive
content tokens) and an alphabet of subscripted symbols P.
An expression is built from unary operators ?*+# and binary
operators ,|& (I'd dropped & in the earlier message by mistake).

# is a new unary operator. The language of E# is per definitionem
the language of E without the empty word. We will rewrite & groups
using ,|# while preserving (un)ambiguity, and we will then
test expressions built from ?*+#,| for (un)ambiguity as described
in the technical report cited above.

Definition:
  A marked expression is an expression in which each symbol
  occurs at most once.

Definition:
  Let E be an expression over S. A marked expression E' over P is called
  a marking of E if and only if dropping all subscripts from E' results in E.

Thus, one expression can have several markings.
For the expression ((a&b),a*,a,b), both ((a_1&b_2),a_3*,a_4,b_5)
and ((a_3&b_3),a_5*,a_9,b_2) are markings, whereas ((a_1&b_2),a_1*,a_2,b_1)
is not a marking.
(Note that for E' over P to be marked means that each subscripted symbol
occurs at most once.)

Definition:
  Let E' be a marking of E. Then, E is unambiguous if and only if
  for all words u,v,v' over P and all symbols x,y in P holds:
    (*) If uxv, uyv' are in L(E'), and if x and y differ only in their
        subscripts, then x=y.

As a warm-up, we prove the following little lemma on #:

Lemma:
  E is unambiguous if and only if E# is unambiguous.

Proof:
  Let E' be a marking of E. Then, E'# is a marking of E#.
  Since uxv, uyv' are not empty, we have
  uxv, uyv' in L(E') if and only if uxv, uyv' in L(E'#).
  Thus, condition (*) in the definition of unambiguity
  is the same for E' and E'#.


Rewriting & groups
====================

Now we eliminate & from expressions, at the expense of introducing
the new operator #. This takes care of the nullability problem
of & groups mentioned in earlier posts by James Clark and Erik Naggum.

Definition:
  We construct E^ from E by substituting each subexpression
  (F&G) of E by

    ((F#,G)|(G#,F))?       if e is in L(F) and in L(G),
    ((F#,G)|(G#,F))        otherwise.

This rewriting preserves language.

Lemma:
  L(E^)=L(E),
  i.e. we may use E^ instead of E for testing whether
  a document instance conformes to E.

Proof:
  It suffices to show that

               | L(((F#,G)|(G#,F))?)       if e is in L(F) and in L(G),
    L(F&G) =  <
               | L(((F#,G)|(G#,F)))        otherwise,

  which is obvious.

So far, we have eliminated & at the cost of introducing #.

Definition:
  We call an expression simple if it does not contain any & groups.

We proceed now along the following scheme:

  1. We give an algorithms that tests simple expressions
     for unambiguity and proof its correctness.
  2. We investigate how markings of E and of E^ are related.
  3. We use 2. to prove that E is unambiguous
     if and only if E^ is unambiguous.

This gives a provably correct algorithm that tests general expressions
for unambiguity: Given E, transform E into E^ and, instead of E,
test E^ for unambiguity using 1. above.


An unambiguity test for simple expressions
==========================================

(People who already know the technical report cited above
may savely skip this section.)

Why is it easier to test simple expressions for unambiguity?
The reason is that for markings of simple expressions,
the continuations of an initial string ux only depend on x,
but not on u. This condition is captured in the following definition:

Definition:
  An expression E is called modular if and only if
  uxv, u'xv' in L(E) imply u'xv in L(E).

Consider

  E=((a&b),a) and E'=((a_1&b_2),a_3).

We have a_1b_2a_3, b_2a_1a_3 in L(E'), but b_2a_3 not in L(E').
Thus, with u=a_1, x=b_2, v=a_3, u'=empty, and v'=a_1a_3,
we can see that E' is not modular.

However, we will prove that each simple marked expression is modular
and that, in particular, markings of simple expressions are modular.
Now, the point is that for a modular marked expression H,
unambiguity can be tested by looking at pairs of symbols
that can follow each other in a word of L(H), and these pairs
can be computed quite easily.
This gives us the outline for the remainder of this section.

Lemma:
  Each simple marked expression is modular.

Proof:
  By straightforward induction on simple marked expressions, H.
  I'll show the case that H=(I,J).

  Let uxv, u'xv be in L((I,J)).
  Since H is marked, x can either occur in I or in J, but not
  in both of them, and I and J are marked, too.
  The interesting case is that x occurs in I, but not in J.

  Then, there are v_1, v_2, w_1, w_2 with

    1. v=v_1v_2, w=w_1w_2.
    2. v_2, w_2 are in L(J).
    3. uxv_1, u'xw_2 are in L(I).

  By induction hypothesis, 3. implies u'xv_1 in L(I).
  Applying 2. results in u'xv_1v_2 in L((I,J)),
  and finally u'xv in L(I,J) because of 1.

Definition:
  For an expression H, we define:

       first(H) := { x | x a symbol, there is a word w with xw in L(H). }
        last(H) := { x | x a symbol, there is a word w with wx in L(H). }
    follow(H,x) := { y | y a symbol, there are words v,w with vxyw in L(H). }

Lemma:
  Let H be a modular expression, let x_1,...,x_n be symbols, n>=1, with
  x_1 in first(H), x_(i+1) in follow(H,x_i) for 1<=i<n, x_n is in last(H).
  Then, x_1...x_n in L(H).

Proof:
  We will show by induction on i
  that for each i with 1<=i<=n there is a word w with x_1...x_iw in L(H).
  Applying this to i=n, we get a word w with x_1...x_nw in L(H).
  On the other hand, there is a word v with vx_n in L(H), because x_n in last(H).
  Since H is modular, we can conclude that x_1...x_n in L(H).

  As to the induction:

  i=1:     The claim holds, because x_1 in first(H), n>=1.
  i-->i+1: Let x_1...x_iw be in L(H), i<n.
           Because x_(i+1) is in follow(H,x_i), there are
           u,v with ux_ix_(i+1)v in L(H).
           The modularity of H implies that x_1...x_ix_(i+1)v is in L(H).

The next lemma shows how unambiguity of E can be tested
by looking at pairs of symbols that can follow each other
in a marking of E.

Lemma "AmbTest":
  Let E be a simple expression, and let E' be a marking of E.
  Then, E is unambiguous if and only if the following holds:
  If x,y are in first(E') or if x,y are in follow(E',z), and if the
  symbols of S corresponding to x and y are identical, then x=y.

Proof:
  "==>"
    Let E be unambiguous.
    If x,y are in first(E'), there are v,v' over P with xv,yv' in L(E').
    thus, uxv, uyv' in L(E') with u=empty.

    If, on the other hand, x,y are in follow(E',z), there are u,v,u',v' with
    uzxw, u'zyv' in L(E').
    Because E is simple, E' is simple and marked and, thus, modular.
    Therefore, uzyv' is in L(E'), too.

    In both cases, the claim now follows from the unambiguity of E.

  "<=="
    If uxv, uyv' are in L(E'), then either u=empty and x,y in first(E')
    or u=u'z and x,y in follow(E',z).

    Thus, this direction holds even for non-simple expressions E.

The next lemma shows how the first/last/follow sets can be computed
recursively for simple marked expressions H.

Lemma "Induction":
  For a simple marked expression H holds:

  H=a:     first(H) = last(H) = {a},
           follow(H,a) = empty set.

  H=(I,J): first(H) = first(I) u first(J),        [u stands for union here]
           last(H) = last(I) u last(J),
                         | follow(I,x)            if x occurs in I,
           follow(H,x) = <
                         | follow(J,x)            if x occurs in J.

  H=(I,J):            | first(I) u first(J)       if e in L(I) [e is the empty word},
           first(H) = <
                      | first(I)                  otherwise,
                     | last(I) u last(J)          if e in L(J) [e is the empty word},
           last(H) = <
                     | last(J)                    otherwise,
                         | follow(I,x)            if x occurs in I, but not in last(I),
           follow(H,x) = < follow(I,x) u first(J) if x in last(I),
                         | follow(J,x)            if x occurs in J.

  H=I#:    first(H) = first(I),
           last(H) = last(I),
           follow(H,x) = follow(I,x).

  H=I*:    first(H) = first(I),
           last(H) = last(I),
                         | follow(I,x)            if x not in last(I),
           follow(H,x) = <
                         | follow(I,x) u first(I) if x in last(I).

  H=I+:    first(H) = first(I),
           last(H) = last(I),
                         | follow(I,x)            if x not in last(I),
           follow(H,x) = <
                         | follow(I,x) u first(I) if x in last(I).

Proof:
  Straightforward, but long.

To summarize, we have proved the following theorem:

Theorem "AlgoSimple":
  We can test a simple expression E for unambiguity in the following way:
    1. Construct a marking E' for E.
    2. Compute first(E'), last(E'), and follow(E',x) for any subscripted
       symbol x in E' using the equations in Lemma "Induction".
       This can be done because E' is simple and marked.
    3. Test E for unambiguity by testing the (finite) conditions
       of Lemma "AmbTest".


An unambiguity test for general expressions
===========================================

Now we go back to the rewriting of & groups, leading from E to E^.
We want to show that this rewriting preserves (un)ambiguity.
In E^, subgroups of E may occur several times in E^, in their rewritten
form. This means, that symbols occuring in E are duplicated in E^.
However, words in L(E^) whose symbols satisfy duplicates of symbols in E
correspond one-to-one to words in L(E) whose symbols satisfy the original
symbols in E. To formalize this in the next lemma, we have to use markings again.

Definition:
  Let E' and F' be markings of E and F.
  A mapping from E' to F' is a function h that maps symbols in E'
  to symbols in F' such that x and h(x) correspond to the same symbol in S.

  We extend a mapping h to words over P by defining
    h(empty)=empty,
    h(x_1...x_n)=h(x_1)...h(x_n).

Lemma "Mapping":
  For an expression E, let E^' be a marking for E^,
  and let E' be a marking for E. Then, there is a mapping h
  from E^' to E' with the following properties:

    1. If w is in L(E^'), then h(w) is in L(E').
    2. For each w in L(E'), there is a v in L(E^') with h(v)=w.
    3. If uv, u'v' are in L(E^') with h(u)=h(u'), then u=u'.

The following theorem follows directly from Lemma "Mapping".

Theorem "Reduction":
  Let E be an expression. Then, E is unambiguous if and only if
  E^ is unambigous.

Now, our two theorems combine to an unambiguity test for general expressions:

Theorem:
  We can test an arbitrary expression E for unambiguity in the following way:
  Construct E^ from E.
  Test E^ for unambiguity by using the algorithm in Theorem "AlgoSimple".
    {Note that E^ is simple.}
  Report E as unambiguous if and only if E^ is unambigous.
    {Theorem "Reduction".}

So, our only task left is the proof of Lemma "Mapping".

Proof:
  By induction on E. First, I give the definition of h.

  E=a:        In this case, E^=a, E'=a_i, E^'=a_j.
              Define h(a_j):=a_i.

  E=(F|G) or  Let / be | resp. ,. Then, E^=(F^/G^).
  E=(F,G):    A marking E' of E implies markings F' of F and G' of G.
              A marking E^' of E^ implies markings F^' of F^ and G^' of G^.
              The induction hypothesis gives mappings f from F^' to F'
              and g from G^' to G'. Define h(x) as
                f(x)      if x occurs in F^',
                g(x)      if x occurs in G^'
              (because E^' is marked, the subscripted symbols in F^' and G^'
              are disjoint).
  E=(F&G):    In this case, E^=((F^#,G^)|(G^#,F^))[?].
              A marking E^' of E^ implies two markings F^' and F^"
              for F^ and two markings G^' and G^" for G^, for the two
              occurrences of F^ resp. G^ in E^.
              Furthermore, a marking of E' implies markings F' of F and G' of G.
              The induction hypothesis gives mappings f' from F^' to F',
              f" from F^" to F', g' from G^' to G', and g" from G^" to G'.
              Define h(x) as
                f'(x)        if x occurs in the first copy of F^' in E^',
                f"(x)        if x occurs in the second copy of F^' in E^',
                g'(x)        if x occurs in the first copy of G^' in E^',
                g"(x)        if x occurs in the second copy of G^' in E^'.
  E=F?,       Let / be ?, #, *, resp. +. We have E^=F^/.
  E=F#,       A marking E' of E implies a marking F' of F,
  E=F*, or    and a marking E^' of E^ implies a marking F^' of F^.
  E=F+:       The induction hypothesis gives a mapping f from F^' to F'.
              Define h(x)=f(x).

  Next, I show that conditions 1.--3. hold for the case E=(F&G).
  We use the terminology introduced for the definition of h in this case.

    1. Let w be in L(E^'), w.l.o.g. let w=uv, u in L(F^'#), v in L(G^').
       By induction hypothesis, h(u) is in L(F'), h(v) is in L(G'),
       i.e. h(w)=h(u)h(v) is in L(F'&G')=L(E').

    2. Let w be in L(E'), w.l.o.g. let w=uv, u in L(F'), v in L(G'), u<>empty.
       By induction hypothesis, there are u' in L(F^'), v' in L(G^')
       with h(u')=u and h(v')=v.
       Then, u'v' is in L(E^'), h(u'v')=h(u')h(v')=uv=w.

    3. Let uv, u'v' be in L(E^') with h(u)=h(u').
       Because of E^'=(F^'#,G^')|(G^"#,F^")[?], either both u and u'
       start with a symbol in F^' or both start with a symbol in G^"
       or both are empty.
       Let's assume w.l.o.g. that they both start with a symbol in F^'.
       Furthermore, let's only consider the case that all symbols in u
       belong to F^'. Then, all symbols of u' belong to F^' as well,
       because h(u)=h(u') and by definition of h.
       thus, v=rs, v'=r's' such that ur, u'r' are in L(F^') and
       s, s' are in L(G^'), f(u)=h(u)=h(u')=f(u').
       By induction hypothesis, this implies u=u'.


Discussion
==========

We have given a method to transform expressions with & groups
into simple expressions. The transformation preserves language and
(un)ambiguity. Thus, the unambiguity test for simple expressions
can be extended to general expressions.

The ambiguity test for simple expressions obviously takes
time polynomial in the size of the expressions.
In the technical report cited above, I have given a linear-time
algorithm for expressions using ,|* that can easily be generalized to
arbitrary simple expressions.

However, transforming expressions with & groups into simple expressions
may lead to an exponential blow-up in size. This is an inherent property
of the transformation approach, since the smallest simple and unambiguous
expression equivalent to (a_1&...&a_n) must be of size n^n
(n to the nth).

I have made some progress on a correctness proof for James Clark's
method implemented in sgmls, which only preserves (un)ambiguity
but not languages and which takes time polynomial in the size
of the expression. I hope to have time for this again over the weekend...

Is someone still out there to check this all?

Cheers,
Anne
--
Anne Brueggemann-Klein,203-3897 <abk@sun1.ruf.uni-freiburg.de>



Archive last updated dd. 02/04/96

Suggestions to the compiler.

Prepared by Robin Cover for The XML Cover Pages archive.