CTS archive link: here
Below is an algorithm for unambiguity testing, together with a correctness proof. The test is based on rewriting of and groups, along the lines proposed by Erik Naggum. It is an extension of work described in my technical report "Regular Expressions into Finite Automata". I refer to my earlier posting, which contained a rigorous definition of unambiguity. Below is a summary with small changes in terminology, with thanks to James Clark, who suggested some of them. Terminology/Definition of unambiguity ===================================== We start with an alphabet S of symbols (corresponding to primitive content tokens) and an alphabet of subscripted symbols P. An expression is built from unary operators ?*+# and binary operators ,|& (I'd dropped & in the earlier message by mistake). # is a new unary operator. The language of E# is per definitionem the language of E without the empty word. We will rewrite & groups using ,|# while preserving (un)ambiguity, and we will then test expressions built from ?*+#,| for (un)ambiguity as described in the technical report cited above. Definition: A marked expression is an expression in which each symbol occurs at most once. Definition: Let E be an expression over S. A marked expression E' over P is called a marking of E if and only if dropping all subscripts from E' results in E. Thus, one expression can have several markings. For the expression ((a&b),a*,a,b), both ((a_1&b_2),a_3*,a_4,b_5) and ((a_3&b_3),a_5*,a_9,b_2) are markings, whereas ((a_1&b_2),a_1*,a_2,b_1) is not a marking. (Note that for E' over P to be marked means that each subscripted symbol occurs at most once.) Definition: Let E' be a marking of E. Then, E is unambiguous if and only if for all words u,v,v' over P and all symbols x,y in P holds: (*) If uxv, uyv' are in L(E'), and if x and y differ only in their subscripts, then x=y. As a warm-up, we prove the following little lemma on #: Lemma: E is unambiguous if and only if E# is unambiguous. Proof: Let E' be a marking of E. Then, E'# is a marking of E#. Since uxv, uyv' are not empty, we have uxv, uyv' in L(E') if and only if uxv, uyv' in L(E'#). Thus, condition (*) in the definition of unambiguity is the same for E' and E'#. Rewriting & groups ==================== Now we eliminate & from expressions, at the expense of introducing the new operator #. This takes care of the nullability problem of & groups mentioned in earlier posts by James Clark and Erik Naggum. Definition: We construct E^ from E by substituting each subexpression (F&G) of E by ((F#,G)|(G#,F))? if e is in L(F) and in L(G), ((F#,G)|(G#,F)) otherwise. This rewriting preserves language. Lemma: L(E^)=L(E), i.e. we may use E^ instead of E for testing whether a document instance conformes to E. Proof: It suffices to show that | L(((F#,G)|(G#,F))?) if e is in L(F) and in L(G), L(F&G) = < | L(((F#,G)|(G#,F))) otherwise, which is obvious. So far, we have eliminated & at the cost of introducing #. Definition: We call an expression simple if it does not contain any & groups. We proceed now along the following scheme: 1. We give an algorithms that tests simple expressions for unambiguity and proof its correctness. 2. We investigate how markings of E and of E^ are related. 3. We use 2. to prove that E is unambiguous if and only if E^ is unambiguous. This gives a provably correct algorithm that tests general expressions for unambiguity: Given E, transform E into E^ and, instead of E, test E^ for unambiguity using 1. above. An unambiguity test for simple expressions ========================================== (People who already know the technical report cited above may savely skip this section.) Why is it easier to test simple expressions for unambiguity? The reason is that for markings of simple expressions, the continuations of an initial string ux only depend on x, but not on u. This condition is captured in the following definition: Definition: An expression E is called modular if and only if uxv, u'xv' in L(E) imply u'xv in L(E). Consider E=((a&b),a) and E'=((a_1&b_2),a_3). We have a_1b_2a_3, b_2a_1a_3 in L(E'), but b_2a_3 not in L(E'). Thus, with u=a_1, x=b_2, v=a_3, u'=empty, and v'=a_1a_3, we can see that E' is not modular. However, we will prove that each simple marked expression is modular and that, in particular, markings of simple expressions are modular. Now, the point is that for a modular marked expression H, unambiguity can be tested by looking at pairs of symbols that can follow each other in a word of L(H), and these pairs can be computed quite easily. This gives us the outline for the remainder of this section. Lemma: Each simple marked expression is modular. Proof: By straightforward induction on simple marked expressions, H. I'll show the case that H=(I,J). Let uxv, u'xv be in L((I,J)). Since H is marked, x can either occur in I or in J, but not in both of them, and I and J are marked, too. The interesting case is that x occurs in I, but not in J. Then, there are v_1, v_2, w_1, w_2 with 1. v=v_1v_2, w=w_1w_2. 2. v_2, w_2 are in L(J). 3. uxv_1, u'xw_2 are in L(I). By induction hypothesis, 3. implies u'xv_1 in L(I). Applying 2. results in u'xv_1v_2 in L((I,J)), and finally u'xv in L(I,J) because of 1. Definition: For an expression H, we define: first(H) := { x | x a symbol, there is a word w with xw in L(H). } last(H) := { x | x a symbol, there is a word w with wx in L(H). } follow(H,x) := { y | y a symbol, there are words v,w with vxyw in L(H). } Lemma: Let H be a modular expression, let x_1,...,x_n be symbols, n>=1, with x_1 in first(H), x_(i+1) in follow(H,x_i) for 1<=i<n, x_n is in last(H). Then, x_1...x_n in L(H). Proof: We will show by induction on i that for each i with 1<=i<=n there is a word w with x_1...x_iw in L(H). Applying this to i=n, we get a word w with x_1...x_nw in L(H). On the other hand, there is a word v with vx_n in L(H), because x_n in last(H). Since H is modular, we can conclude that x_1...x_n in L(H). As to the induction: i=1: The claim holds, because x_1 in first(H), n>=1. i-->i+1: Let x_1...x_iw be in L(H), i<n. Because x_(i+1) is in follow(H,x_i), there are u,v with ux_ix_(i+1)v in L(H). The modularity of H implies that x_1...x_ix_(i+1)v is in L(H). The next lemma shows how unambiguity of E can be tested by looking at pairs of symbols that can follow each other in a marking of E. Lemma "AmbTest": Let E be a simple expression, and let E' be a marking of E. Then, E is unambiguous if and only if the following holds: If x,y are in first(E') or if x,y are in follow(E',z), and if the symbols of S corresponding to x and y are identical, then x=y. Proof: "==>" Let E be unambiguous. If x,y are in first(E'), there are v,v' over P with xv,yv' in L(E'). thus, uxv, uyv' in L(E') with u=empty. If, on the other hand, x,y are in follow(E',z), there are u,v,u',v' with uzxw, u'zyv' in L(E'). Because E is simple, E' is simple and marked and, thus, modular. Therefore, uzyv' is in L(E'), too. In both cases, the claim now follows from the unambiguity of E. "<==" If uxv, uyv' are in L(E'), then either u=empty and x,y in first(E') or u=u'z and x,y in follow(E',z). Thus, this direction holds even for non-simple expressions E. The next lemma shows how the first/last/follow sets can be computed recursively for simple marked expressions H. Lemma "Induction": For a simple marked expression H holds: H=a: first(H) = last(H) = {a}, follow(H,a) = empty set. H=(I,J): first(H) = first(I) u first(J), [u stands for union here] last(H) = last(I) u last(J), | follow(I,x) if x occurs in I, follow(H,x) = < | follow(J,x) if x occurs in J. H=(I,J): | first(I) u first(J) if e in L(I) [e is the empty word}, first(H) = < | first(I) otherwise, | last(I) u last(J) if e in L(J) [e is the empty word}, last(H) = < | last(J) otherwise, | follow(I,x) if x occurs in I, but not in last(I), follow(H,x) = < follow(I,x) u first(J) if x in last(I), | follow(J,x) if x occurs in J. H=I#: first(H) = first(I), last(H) = last(I), follow(H,x) = follow(I,x). H=I*: first(H) = first(I), last(H) = last(I), | follow(I,x) if x not in last(I), follow(H,x) = < | follow(I,x) u first(I) if x in last(I). H=I+: first(H) = first(I), last(H) = last(I), | follow(I,x) if x not in last(I), follow(H,x) = < | follow(I,x) u first(I) if x in last(I). Proof: Straightforward, but long. To summarize, we have proved the following theorem: Theorem "AlgoSimple": We can test a simple expression E for unambiguity in the following way: 1. Construct a marking E' for E. 2. Compute first(E'), last(E'), and follow(E',x) for any subscripted symbol x in E' using the equations in Lemma "Induction". This can be done because E' is simple and marked. 3. Test E for unambiguity by testing the (finite) conditions of Lemma "AmbTest". An unambiguity test for general expressions =========================================== Now we go back to the rewriting of & groups, leading from E to E^. We want to show that this rewriting preserves (un)ambiguity. In E^, subgroups of E may occur several times in E^, in their rewritten form. This means, that symbols occuring in E are duplicated in E^. However, words in L(E^) whose symbols satisfy duplicates of symbols in E correspond one-to-one to words in L(E) whose symbols satisfy the original symbols in E. To formalize this in the next lemma, we have to use markings again. Definition: Let E' and F' be markings of E and F. A mapping from E' to F' is a function h that maps symbols in E' to symbols in F' such that x and h(x) correspond to the same symbol in S. We extend a mapping h to words over P by defining h(empty)=empty, h(x_1...x_n)=h(x_1)...h(x_n). Lemma "Mapping": For an expression E, let E^' be a marking for E^, and let E' be a marking for E. Then, there is a mapping h from E^' to E' with the following properties: 1. If w is in L(E^'), then h(w) is in L(E'). 2. For each w in L(E'), there is a v in L(E^') with h(v)=w. 3. If uv, u'v' are in L(E^') with h(u)=h(u'), then u=u'. The following theorem follows directly from Lemma "Mapping". Theorem "Reduction": Let E be an expression. Then, E is unambiguous if and only if E^ is unambigous. Now, our two theorems combine to an unambiguity test for general expressions: Theorem: We can test an arbitrary expression E for unambiguity in the following way: Construct E^ from E. Test E^ for unambiguity by using the algorithm in Theorem "AlgoSimple". {Note that E^ is simple.} Report E as unambiguous if and only if E^ is unambigous. {Theorem "Reduction".} So, our only task left is the proof of Lemma "Mapping". Proof: By induction on E. First, I give the definition of h. E=a: In this case, E^=a, E'=a_i, E^'=a_j. Define h(a_j):=a_i. E=(F|G) or Let / be | resp. ,. Then, E^=(F^/G^). E=(F,G): A marking E' of E implies markings F' of F and G' of G. A marking E^' of E^ implies markings F^' of F^ and G^' of G^. The induction hypothesis gives mappings f from F^' to F' and g from G^' to G'. Define h(x) as f(x) if x occurs in F^', g(x) if x occurs in G^' (because E^' is marked, the subscripted symbols in F^' and G^' are disjoint). E=(F&G): In this case, E^=((F^#,G^)|(G^#,F^))[?]. A marking E^' of E^ implies two markings F^' and F^" for F^ and two markings G^' and G^" for G^, for the two occurrences of F^ resp. G^ in E^. Furthermore, a marking of E' implies markings F' of F and G' of G. The induction hypothesis gives mappings f' from F^' to F', f" from F^" to F', g' from G^' to G', and g" from G^" to G'. Define h(x) as f'(x) if x occurs in the first copy of F^' in E^', f"(x) if x occurs in the second copy of F^' in E^', g'(x) if x occurs in the first copy of G^' in E^', g"(x) if x occurs in the second copy of G^' in E^'. E=F?, Let / be ?, #, *, resp. +. We have E^=F^/. E=F#, A marking E' of E implies a marking F' of F, E=F*, or and a marking E^' of E^ implies a marking F^' of F^. E=F+: The induction hypothesis gives a mapping f from F^' to F'. Define h(x)=f(x). Next, I show that conditions 1.--3. hold for the case E=(F&G). We use the terminology introduced for the definition of h in this case. 1. Let w be in L(E^'), w.l.o.g. let w=uv, u in L(F^'#), v in L(G^'). By induction hypothesis, h(u) is in L(F'), h(v) is in L(G'), i.e. h(w)=h(u)h(v) is in L(F'&G')=L(E'). 2. Let w be in L(E'), w.l.o.g. let w=uv, u in L(F'), v in L(G'), u<>empty. By induction hypothesis, there are u' in L(F^'), v' in L(G^') with h(u')=u and h(v')=v. Then, u'v' is in L(E^'), h(u'v')=h(u')h(v')=uv=w. 3. Let uv, u'v' be in L(E^') with h(u)=h(u'). Because of E^'=(F^'#,G^')|(G^"#,F^")[?], either both u and u' start with a symbol in F^' or both start with a symbol in G^" or both are empty. Let's assume w.l.o.g. that they both start with a symbol in F^'. Furthermore, let's only consider the case that all symbols in u belong to F^'. Then, all symbols of u' belong to F^' as well, because h(u)=h(u') and by definition of h. thus, v=rs, v'=r's' such that ur, u'r' are in L(F^') and s, s' are in L(G^'), f(u)=h(u)=h(u')=f(u'). By induction hypothesis, this implies u=u'. Discussion ========== We have given a method to transform expressions with & groups into simple expressions. The transformation preserves language and (un)ambiguity. Thus, the unambiguity test for simple expressions can be extended to general expressions. The ambiguity test for simple expressions obviously takes time polynomial in the size of the expressions. In the technical report cited above, I have given a linear-time algorithm for expressions using ,|* that can easily be generalized to arbitrary simple expressions. However, transforming expressions with & groups into simple expressions may lead to an exponential blow-up in size. This is an inherent property of the transformation approach, since the smallest simple and unambiguous expression equivalent to (a_1&...&a_n) must be of size n^n (n to the nth). I have made some progress on a correctness proof for James Clark's method implemented in sgmls, which only preserves (un)ambiguity but not languages and which takes time polynomial in the size of the expression. I hope to have time for this again over the weekend... Is someone still out there to check this all? Cheers, Anne -- Anne Brueggemann-Klein,203-3897 <abk@sun1.ruf.uni-freiburg.de>
Archive last updated dd. 02/04/96
Suggestions to the compiler.
Prepared by Robin Cover for The XML Cover Pages archive.