Article: 6816 of comp.text.sgml Newsgroups: comp.text.sgml Date: 29 Nov 1994 21:48:14 UT From: Arjan Loeffen <Arjan.Loeffen@let.ruu.nl> Message-ID: <56898.loeffen@ruulet.let.ruu.nl> Subject: Simplification & disambiguation revisited Dear reader, On comp.text.sgml 1994-04-07, Hans Holger Rath tried to start off a discussion on disambiguation and simplification of content models. The rules are valuable if you are designing an SGML based system (e.g., an editor, or a database engine), and if you want to be sure that any unit (data, or element) in content may be connected to a single, immutable token in the content model. This is relevant if you want the system to be able to tell you what part of the content model allowed the unit to appear at its position. It is also relevant if you want the system to offer a "prediction set", i.e., a pick-list of possible units to precede, follow or start the current one. If the content model is not simplified or ambiguous, changes in the vicinity of a unit will corrupt the reference to the content model; it is claimed that if the content model is simplified, no such corruption will take place. In this submission I've extended these rules. As I'm bound to close this section of my PhD thesis, I feel I should show them to you first, asking for comments and/or corrections. If you have better names for the simplification rules, let me know. I've heard of no other attempts to augment the list of Rath; if you have such a reference, please let me know. Thanks in advance, Arjan. SIMPLIFICATION (optimize model automatically, half of it by H. H. Rath) Nr: Token: Reversable: Rewrites to: bracketing rules 1 (A) - A 2 (A)? - A? 3 (A)* - A* 4 (A)+ - A+ 5 (A?) - A? 6 (A*) - A* 7 (A+) - A+ paired predicate rules 1 (A*)? - A* 2 (A*)* - A* 3 (A*)+ - A* 4 (A?)? - A? 5 (A?)* - A* 6 (A?)+ - A* 7 (A+)? - A* 8 (A+)* - A* 9 (A+)+ - A+ group reduction rules 1 (A, (B, C)) y (A, B, C) 2 (A | (B | C)) y (A | B | C) 3 (A | B | A) - (A | B) factorization rules 1 (A? | B) y (A | B)? 2 (A* | B) y (A+ | B)? 3 (A? | B)? y (A | B)? 4 (A* | B)? y (A+ | B)? 5 (A? | B)+ y (A | B)* 6 (A+ | B)+ y (A | B)+ 7 (A+ | B)* y (A | B)* 8 (A* | B)+ y (A | B)* 9 (A? | B)* y (A | B)* 10 (A* | B)* y (A | B)* seq-reduction rules 1 A? , A* y A* 2 A+ , A? - A+ 3 A? , A+ - A+ 4 A* , A* - A* 5 A+ , A+ - A, A+ 6 A+ , A* - A+ or-reduction rules 1 A | A - A 2 A | A? y A? 3 A | A* y A* 4 A | A+ y A+ 5 A? | A? - A? 6 A? | A* y A* 7 A? | A+ y A* 8 A* | A* - A* 9 A* | A+ y A* 10 A+ | A+ - A+ and-reduction rules 1 A & A - (A , A) 2 A? & A y (A , A?) 3 A* & A y (A , A*) 4 A+ & A y (A , A+) 5 A+ & A+ - (A , A+) pcdata reduction rules (p = pcdata) 1 p , p - p 2 p | p - p 3 (p & p) - p 4 p* - p 5 p+ - p 6 p? - p DISAMBIGUATION ("signal a problem", mostly from H. H. Rath) "a" stands for a primitive content token or a unit that represents "declared" content (DATA, EMPTY, ANY), while A and B stand for any content token. (A first) and (A last) stand for services "first" and "last" that return the first/last content token in the definition of A (i.e., in the model group A, see [Aho & Corasick '86, page 45 ff]). The tokens between square brackets are examples of ambiguous content. Because the recognition should take place without a lookahead the last character in the content shown between square brackets cannot be linked to a unique location in the content model. 1 a is unambiguous 2 A? is unambiguous iff A is unambiguous 3 A+ and A* are unambiguous iff 3.1 A is unambiguous 3.2 for all a in ((A first) intersect (A last)) holds that a is required in (A last) (i.e. a is not opt-group, plus-group or rep-group) Examples: (P, Q, P?)+ [PQP], and (P, Q, P+)+ [PQPP] 4 (A | B) is unambiguous iff 4.1 A and B are unambiguous 4.2 ((A first) intersect (B first)) == {} Example: ((P, Q) | (P, R)) [P] 5 seq-group (A, B) is unambiguous iff 5.1 A and B are unambiguous 5.2 for all a in ((A last) intersect (B first)) hold that a is required in (A last) Example: ((P, Q?), (Q, R)) [PQ] 5.3 if A is plus-group or rep-group then ((A first) intersect (B first)) == {} Example: (P+, P) [PP] 6 (A & B) is unambiguous iff 6.1 A and B are unambiguous 6.2 ((A first) intersect (B first)) == {} Example: ((P, Q) & (P, R)) [P] 6.3 for all a in ((A last) intersect (B first)) holds that a is required in (A last) Example: ((P, Q?) & (Q, R)) [PQ] 6.4 for all a in ((A first) intersect (B last)) holds that a is required in (B last) Example: ((P, Q) & (R, P?)) [RP] 7 A?, A? is ambiguous 8 A*, A+ is ambiguous 9 An and-group is ambiguous when it describes a sequence of two or more equal content tokens. Example: (A? & B & A+) [A] -- Arjan Loeffen Achter de Dom 22-24 ++31+30536417 voice work Faculty of Arts 3512JP Utrecht ++31+206656463 voice home University of Utrecht The Netherlands ++31+30539221 fax work ------------------------------