This is a collection of postings from Fall 1997, augmented with other references in 1998. More recent references are provided in the database section "SGML/XML and non-deterministic ('ambiguous') content models."
See also the dissertation of Anne Brüggemann-Klein, Formal Models in Document Processing (1993). Several technical papers have been written by Anne Brüggemann-Klein [and Derick Wood] on ambiguity in SGML. For example: "One-Unambiguous Regular Languages" (1998). Similarly: Helena Ahonen, "Disambiguation of SGML Content Models," also published in the PODP '96 Proceedings. Also Andreas Neumann, "Unambiguity of SGML Content Models - Pushdown Automata Revisited," presented at the Third International Conference on the Developments in Language Theory (DLT'97) [biblio entry.]
Title: Re: Semantic of a Content Model Author: davep@acm.org (David Peterson) Date: Sat, 09 Aug 1997 21:06:56 -0400 In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann <neumann@PSI.Uni-Trier.DE> wrote: > Suppose we have > > <!ELEMENT (a|b) - o EMPTY> > <!ELEMENT x - - (a & b+)*> > > in the DTD, and > > <x><a><b><b><a></x> > > in the document instance. > Is this supposed to be a valid instance of element x? > Intuitively, it should be valid, but a precise look at section 11.2.4 > (Content models) shows: > > "The elements ... of the content ... must conform to the content > model ... in the following order of priority: > a) a repetition of the most recent satisfied token, if it has a > rep or plus occurrence indicator; or > b) some other token in a model group ..." > > Thus, after <a><b>, the second <b> can satisfy a new instance of b > which has a plus occurrence indicator, but it could also start a new > repetition of (a & b+). Since alternative a) has higher priority, > only the b is repeated, and the following <a> starts a new instance > of the whole &-group which cannot be fulfilled any more. But if > alternative b) is chosen, the <a> completes the new instance of the > &-group and everything is fine. However, for <x><a><b><b></x>, a) > is the correct alternative. And that is the correct answer. The second <a> starts the outer repetition and a following <b> is required. > Annex H.1 says that "and groups reduce to an or group of seq group > permutations, for example (a & b) is equivalent to the regular > expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent > to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as > <a><b><b>. Thus, both alternatives a) and b) must be considered > (though the "annex does not form an integral part of" the SGML > standard, it does express the intentions of the comittee and should > help understanding the standard). So when there's a conflict, Annex H loses. Dave Peterson SGMLWorks! davep@acm.org ------------------------------ Title: Re: Semantic of a Content Model Author: Lennart Staflin <lenst@lysator.liu.se> Date: Sun, 10 Aug 1997 08:29:10 GMT > In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann > <neumann@PSI.Uni-Trier.DE> wrote: > > Annex H.1 says that "and groups reduce to an or group of seq group > > permutations, for example (a & b) is equivalent to the regular > > expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent > > to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as > > <a><b><b>. Thus, both alternatives a) and b) must be considered > > (though the "annex does not form an integral part of" the SGML > > standard, it does express the intentions of the comittee and should > > help understanding the standard). davep@acm.org (David Peterson) writes: > So when there's a conflict, Annex H loses. In this case it is not the annex that loses, but the interpretation that the model ((a, b+) | (b+, a))* allows <a><b><b><a>. nsgmls thinks the model is ambiguous. I would have thought that the repetion of b takes priority over the repetion of the whole model group (The SGML Handbook, 11.2.4, p412, lines 14--19). -- Lennart Staflin <lenst@lysator.liu.se> /*/ (:ABSOLUTE :WILD) -- Lennart Staflin <lenst@lysator.liu.se> /*/ (:ABSOLUTE :WILD) ---------- Title: Re: Semantic of a Content Model Author: davep@acm.org (David Peterson) Date: Sun, 10 Aug 1997 11:35:07 -0400 In article <m2afiqigq1.fsf@triton.lstaflin.pp.se>, Lennart Staflin <lenst@lysator.liu.se> wrote: > > In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann > > <neumann@PSI.Uni-Trier.DE> wrote: > > > Annex H.1 says that "and groups reduce to an or group of seq group > > > permutations, for example (a & b) is equivalent to the regular > > > expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent > > > to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as > > > <a><b><b>. Thus, both alternatives a) and b) must be considered > > > (though the "annex does not form an integral part of" the SGML > > > standard, it does express the intentions of the comittee and should > > > help understanding the standard). > > davep@acm.org (David Peterson) writes: > > So when there's a conflict, Annex H loses. My point was that the normative parts of the standard always win over the non-normative parts. > In this case it is not the annex that loses, but the interpretation > that the model ((a, b+) | (b+, a))* allows <a><b><b><a>. nsgmls thinks > the model is ambiguous. I would have thought that the repetion of b > takes priority over the repetion of the whole model group (The SGML > Handbook, 11.2.4, p412, lines 14--19). I can't figure out why nsgmls would think the model ambiguous. As far as I can tell, you "would have thought" correctly. As I see it, if you start out in the <b>-first alternative, you can wind up in the <a>-first alternative by having two <a>s in succession. Once you are in the <a>-first alternative there is no way to get back to the <b>-first alternative and two <a>s in succession will be an error, as will ending with an <a>. Dave Peterson SGMLWorks! davep@acm.org -------------- Title: Re: Semantic of a Content Model Author: roconnor@undergrad.math.uwaterloo.ca (Russell Steven Shawn O'Connor) Date: Sun, 10 Aug 1997 19:47:00 GMT In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann <neumann@PSI.Uni-Trier.DE> wrote: >Annex H.1 says that "and groups reduce to an or group of seq group >permutations, for example (a & b) is equivalent to the regular >expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent >to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as Okay, after thinking about this for a bit, I did run the above ((a, b+) | (b+, a))* through nsgmls to see what it did. To my surprise it told me that this is an ambiguous content model</a>. I understand why it thinks it is ambiguous. Given <a><b><b>...., after reading the second <b> it can't tell if that <b> belongs to the first part of the or connection, or the second part. But what I don't understand is, why does it care? What is listed above is a regular expression. All regular expressions can be parsed by a finite state machine. There is no look ahead required when parsing with a finite state machine. Anything that can be parsed by a finite state machine is about as simple as it gets. In fact, I already drawn the finite state machine required to determine if a given series of <a> and <b>'s are satisfied by that regular expression or not. So again, why does nsgmls care that it can't determine which part of the or connection it is in? -- Russell O'Connor roconnor@uwaterloo.ca <URL:http://www.undergrad.math.uwaterloo.ca/%7Eroconnor/> "And truth irreversibly destroys the meaning of its own message" -- Anindita Dutta, "The Paradox of Truth, the Truth of Entropy" ----------- Title: Re: Semantic of a Content Model Author: xeborduas@microstar.com (Eric Borduas) Date: Mon, 11 Aug 1997 14:37:04 GMT Andreas Neumann <neumann@PSI.Uni-Trier.DE> wrote: >Dear SGML'ers, >Suppose we have > <!ELEMENT (a|b) - o EMPTY> > <!ELEMENT x - - (a & b+)*> >in the DTD, and > <x><a><b><b><a></x> >Annex H.1 says that "and groups reduce to an or group of seq group >permutations, for example (a & b) is equivalent to the regular >expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent >to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as ><a><b><b>. Thus, both alternatives a) and b) must be considered >(though the "annex does not form an integral part of" the SGML >standard, it does express the intentions of the comittee and should >help understanding the standard). J K <!ELEMENT x - -( (a, b+) | (b+, a) )* > > <x><a><b><b><a></x> -------J------- --K-- The sequence is: J(a) J(b) J(b) J(a) & -> Both a and b must be present, only the order is not important. In the example, <a><b><b> satisfies J and the last <a> partially satisfies J, since both <a> and <b> must be present. Also, SGML content models are not regular expressions. Although there may be a regular expression that describes a content model, the rules that govern their interpretation are not identical. Klein and Klein & Wood have written papers on the validation of SGML content models and their relationship to regular expressions. In particular, "The Validation of SGML Content Models", Ann Bruggemann-Klein and Derick Wood, March 23, 1993. Eric Borduas Microstar Software Ltd. -------------------- Title: Re: Semantic of a Content Model Author: papresco@calum.csclub.uwaterloo.ca (Paul Prescod) Date: Mon, 11 Aug 1997 15:47:32 GMT In article <EEpqyC.A9s@undergrad.math.uwaterloo.ca>, Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote: >So again, why does nsgmls care that it can't determine which part of the >or connection it is in? nsgmls cares because SGML cares. Annex H says: "It can be shown (by Kleene's theorem) that a regular expression is equivalent to a DFA. A parser could in theory, therefore, handle any model group by reducing it to a regular expression and constructing a DFA with ... Practice, however presents some difficulties. One problem arises in the reduction of and groups. Since the number of required seq group permutations is a factorial function, even small and groups can lead to an intractably large number of corresponding seq groups. An and group with six members for example would require 6! or 720 seq groups. Another problem lies with the construction of the DFA. One method is first to constract an NFA directly from the regular expression, then to derive a deterministic equivalent. Although this and other algorithms for DFA construction are non-polynomial and hardly intractable for the human-readable models envisaged by SGML, they may be too resource-intensive for some implementations. This International Standard avoids these problems by eliminating the need to construct a DFA. This it does by prohibiting models that are ambiguous or require "look-ahead".... As a result context checking can be done by simplified algorithms that use only NFAs." This restriction is hotly disputed and there is definately a camp that feels that it should be removed. See the SGML Web Page under "special topics" and "ambiguity" for more information. Paul Prescod ------------------- Title: Re: Semantic of a Content Model Author: roconnor@undergrad.math.uwaterloo.ca (Russell Steven Shawn O'Connor) Date: Mon, 11 Aug 1997 19:30:32 GMT In article <EErAJ8.2p8@undergrad.math.uwaterloo.ca>, Paul Prescod <papresco@calum.csclub.uwaterloo.ca> wrote: > >One problem arises in the reduction of and groups. Since the number of required >seq group permutations is a factorial function, even small and groups can lead >to an intractably large number of corresponding seq groups. An and group with >six members for example would require 6! or 720 seq groups. I thought NFA to DFA conversions were exponential function not factorial. (That is to say if you have n states in an NFA, the DFA may have up to 2^n states.) Although this wouldn't change the argument very much. Or is it saying that it requires a factorial algorithm to reduced a DFA to a simpler form. -- Russell O'Connor roconnor@uwaterloo.ca <URL:http://www.undergrad.math.uwaterloo.ca/%7Eroconnor/> "And truth irreversibly destroys the meaning of its own message" -- Anindita Dutta, "The Paradox of Truth, the Truth of Entropy" ----------- Title: Re: Semantic of a Content Model Author: jenglish@crl2.crl.com (Joe English) Date: 12 Aug 1997 11:50:22 -0700 Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote: > >Okay, after thinking about this for a bit, I did run the above ((a, b+) | >(b+, a))* through nsgmls to see what it did. > >To my surprise it told me that this is an ambiguous content model. I >understand why it thinks it is ambiguous. Given <a><b><b>...., after >reading the second <b> it can't tell if that <b> belongs to the first >part of the or connection, or the second part. > >But what I don't understand is, why does it care? Because them's the rules: "A _content model_ cannot be ambiguous; that is, an element or character string that appears in the document instance must be able to satisfy only one _primitive content token_ without looking ahead in the document instance" [ 11.2.4, "Ambiguous Content Model"; pp 414-415 ] > What is listed above is >a regular expression. All regular expressions can be parsed by a finite >state machine. There is no look ahead required when parsing with a >finite state machine. Anything that can be parsed by a finite state >machine is about as simple as it gets. > >In fact, I already drawn the finite state machine required to determine if >a given series of <a> and <b>'s are satisfied by that regular expression >or not. > >So again, why does nsgmls care that it can't determine which part of the >or connection it is in? There are a couple of reasons. The principle reason cited in Annex H is that the canonical algorithm for processing regular expressions -- converting the R.E. to an NFA, then converting that to a DFA -- is fairly expensive (it's worst-case exponential). So, the reasoning goes, if content models are required to be unambiguous (in the 8879 sense), a parser can convert the content model to a deterministic finite automaton in one step (with one state for each primitive content token and transitions determined by the occurrence indicators and connectors). This rationale isn't entirely compelling; there are after all algorithms to match an input sequence against an NFA, or directly against a regular expression. However, when you toss AND groups into the mix these get expensive too (try building an NFA for (a & b & c? & d? & e*) -- this is worse than exponential). With the ambiguity restriction, AND groups are tractable; without it they're much more difficult to deal with (but probably not impossible). The real show-stopper though is probably start-tag omission. Without the ambiguity restriction, the formal definition of "contextually required element" and related terms are utterly meaningless. Specifically, the parts about "satisfied tokens": in terms of general regular expressions it doesn't make any sense to talk about _which_ subexpression an input token matches, only whether the input sequence as a whole matches the regular expression as a whole. This only makes sense for deterministic regular expressions. Consider: <!ELEMENT foo - - (A, ((B,C)| B), D) -- this is ambiguous --> ... <foo> <a>...</a> <b>...</b> <!-- Here, is 'C', 'D', or neither of them contextually required? Try to decide based on the text of the Standard. --> </foo> --Joe English jenglish@crl.com -------------------- Title: Re: Semantic of a Content Model Author: jenglish@crl2.crl.com (Joe English) Date: 13 Aug 1997 10:58:35 -0700 Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote: >Paul Prescod <papresco@calum.csclub.uwaterloo.ca> wrote: >> >>One problem arises in the reduction of and groups. Since the number of >> required seq group permutations is a factorial function, even small and >> groups can lead to an intractably large number of corresponding seq groups. >> An and group with six members for example would require 6! or 720 seq >> groups. > >I thought NFA to DFA conversions were exponential function not factorial. >(That is to say if you have n states in an NFA, the DFA may have up to >2^n states.) That's correct. It's the conversion of an AND group with N nodes to an NFA that produces N factorial states. (OR groups and SEQ groups produce NFAs with N states). James Clark has a nifty (though somewhat difficult to follow) algorithm in SP; it treats AND groups just like repeatable OR groups when constructing the automaton, then at parse time it keeps track of an auxilliary "and state" which is a vector of flags indicating which submodels of the AND group have been satisfied. This takes O(N) space and (I think) O(1) time per transition. --Joe English jenglish@crl.com