Cover Pages: Ambiguity in SGML

This is a collection of postings from Fall 1997, augmented with other references in 1998. More recent references are provided in the database section "SGML/XML and non-deterministic ('ambiguous') content models."
See also the dissertation of Anne Brüggemann-Klein, Formal Models in Document Processing (1993). Several technical papers have been written by Anne Brüggemann-Klein [and Derick Wood] on ambiguity in SGML. For example: "One-Unambiguous Regular Languages" (1998). Similarly: Helena Ahonen, "Disambiguation of SGML Content Models," also published in the PODP '96 Proceedings. Also Andreas Neumann, "Unambiguity of SGML Content Models - Pushdown Automata Revisited," presented at the Third International Conference on the Developments in Language Theory (DLT'97) [biblio entry.]
Title: Re: Semantic of a Content Model
Author: davep@acm.org (David Peterson)
Date: Sat, 09 Aug 1997 21:06:56 -0400

In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann
<neumann@PSI.Uni-Trier.DE> wrote:

> Suppose we have 
> 
>   <!ELEMENT (a|b) - o EMPTY>
>   <!ELEMENT x - - (a & b+)*>
> 
> in the DTD, and 
> 
>   <x><a><b><b><a></x>
> 
> in the document instance. 
> Is this supposed to be a valid instance of element x? 

> Intuitively, it should be valid, but a precise look at section 11.2.4
> (Content models) shows:
> 
>   "The elements ... of the content ... must conform to the content
>   model ... in the following order of priority:
>   a) a repetition of the most recent satisfied token, if it has a
>      rep or plus occurrence indicator; or
>   b) some other token in a model group ..."
> 
> Thus, after <a><b>, the second <b> can satisfy a new instance of b 
> which has a plus occurrence indicator, but it could also start a new 
> repetition of (a & b+). Since alternative a) has higher priority,
> only the b is repeated, and the following <a> starts a new instance
> of the whole &-group which cannot be fulfilled any more. But if 
> alternative b) is chosen, the <a> completes the new instance of the 
> &-group and everything is fine. However, for <x><a><b><b></x>, a) 
> is the correct alternative. 

And that is the correct answer.  The second <a> starts the outer
repetition and a following <b> is required.

> Annex H.1 says that "and groups reduce to an or group of seq group 
> permutations, for example (a & b) is equivalent to the regular
> expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent 
> to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as
> <a><b><b>. Thus, both alternatives a) and b) must be considered
> (though the "annex does not form an integral part of" the SGML 
> standard, it does express the intentions of the comittee and should
> help understanding the standard).

So when there's a conflict, Annex H loses.

Dave Peterson
SGMLWorks!

davep@acm.org


------------------------------

Title: Re: Semantic of a Content Model
Author: Lennart Staflin <lenst@lysator.liu.se>
Date: Sun, 10 Aug 1997 08:29:10 GMT




> In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann
> <neumann@PSI.Uni-Trier.DE> wrote:
> > Annex H.1 says that "and groups reduce to an or group of seq group 
> > permutations, for example (a & b) is equivalent to the regular
> > expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent 
> > to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as
> > <a><b><b>. Thus, both alternatives a) and b) must be considered
> > (though the "annex does not form an integral part of" the SGML 
> > standard, it does express the intentions of the comittee and should
> > help understanding the standard).

davep@acm.org (David Peterson) writes:
> So when there's a conflict, Annex H loses.

In this case it is not the annex that loses, but the interpretation
that the model ((a, b+) | (b+, a))* allows <a><b><b><a>.  nsgmls thinks
the model is ambiguous.  I would have thought that the repetion of b
takes priority over the repetion of the whole model group (The SGML
Handbook, 11.2.4, p412, lines 14--19).


-- 
Lennart Staflin <lenst@lysator.liu.se>        /*/         (:ABSOLUTE :WILD)
-- 
Lennart Staflin <lenst@lysator.liu.se>        /*/         (:ABSOLUTE :WILD)

          
----------
Title: Re: Semantic of a Content Model
Author: davep@acm.org (David Peterson)
Date: Sun, 10 Aug 1997 11:35:07 -0400



In article <m2afiqigq1.fsf@triton.lstaflin.pp.se>, Lennart Staflin
<lenst@lysator.liu.se> wrote:

> > In article <33E867CC.3235@PSI.Uni-Trier.DE>, Andreas Neumann
> > <neumann@PSI.Uni-Trier.DE> wrote:
> > > Annex H.1 says that "and groups reduce to an or group of seq group 
> > > permutations, for example (a & b) is equivalent to the regular
> > > expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent 
> > > to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as
> > > <a><b><b>. Thus, both alternatives a) and b) must be considered
> > > (though the "annex does not form an integral part of" the SGML 
> > > standard, it does express the intentions of the comittee and should
> > > help understanding the standard).
> 
> davep@acm.org (David Peterson) writes:
> > So when there's a conflict, Annex H loses.

My point was that the normative parts of the standard always win over
the non-normative parts.

> In this case it is not the annex that loses, but the interpretation
> that the model ((a, b+) | (b+, a))* allows <a><b><b><a>.  nsgmls thinks
> the model is ambiguous.  I would have thought that the repetion of b
> takes priority over the repetion of the whole model group (The SGML
> Handbook, 11.2.4, p412, lines 14--19).

I can't figure out why nsgmls would think the model ambiguous.  As far
as I can tell, you "would have thought" correctly.

As I see it, if you start out in the <b>-first alternative, you can
wind up in the <a>-first alternative by having two <a>s in succession.
Once you are in the <a>-first alternative there is no way to get back
to the <b>-first alternative and two <a>s in succession will be an
error, as will ending with an <a>.

Dave Peterson
SGMLWorks!

davep@acm.org

--------------


Title: Re: Semantic of a Content Model
Author: roconnor@undergrad.math.uwaterloo.ca (Russell Steven Shawn O'Connor)
Date: Sun, 10 Aug 1997 19:47:00 GMT



In article <33E867CC.3235@PSI.Uni-Trier.DE>,
Andreas Neumann  <neumann@PSI.Uni-Trier.DE> wrote:

>Annex H.1 says that "and groups reduce to an or group of seq group 
>permutations, for example (a & b) is equivalent to the regular
>expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent 
>to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as

Okay, after thinking about this for a bit, I did run the above ((a, b+) |
(b+, a))* through nsgmls to see what it did.

To my surprise it told me that this is an ambiguous content model</a>.  I
understand why it thinks it is ambiguous.  Given <a><b><b>...., after
reading the second <b> it can't tell if that <b> belongs to the first
part of the or connection, or the second part.

But what I don't understand is, why does it care?  What is listed above is
a regular expression.  All regular expressions can be parsed by a finite
state machine.  There is no look ahead required when parsing with a
finite state machine.  Anything that can be parsed by a finite state
machine is about as simple as it gets.

In fact, I already drawn the finite state machine required to determine if
a given series of <a> and <b>'s are satisfied by that regular expression
or not.

So again, why does nsgmls care that it can't determine which part of the
or connection it is in?

-- 
Russell O'Connor                           roconnor@uwaterloo.ca
    <URL:http://www.undergrad.math.uwaterloo.ca/%7Eroconnor/>
"And truth irreversibly destroys the meaning of its own message"
-- Anindita Dutta, "The Paradox of Truth, the Truth of Entropy"

-----------



Title: Re: Semantic of a Content Model
Author: xeborduas@microstar.com (Eric Borduas)
Date: Mon, 11 Aug 1997 14:37:04 GMT

Andreas Neumann <neumann@PSI.Uni-Trier.DE> wrote:

>Dear SGML'ers,


>Suppose we have 

>  <!ELEMENT (a|b) - o EMPTY>
>  <!ELEMENT x - - (a & b+)*>

>in the DTD, and 

>  <x><a><b><b><a></x>


>Annex H.1 says that "and groups reduce to an or group of seq group 
>permutations, for example (a & b) is equivalent to the regular
>expression ... ((a, b) | (b, a))", i.e. (a & b+)* is equivalent 
>to ((a, b+) | (b+, a))*, which allows for <a><b><b><a> as well as
><a><b><b>. Thus, both alternatives a) and b) must be considered
>(though the "annex does not form an integral part of" the SGML 
>standard, it does express the intentions of the comittee and should
>help understanding the standard).  

                                J             K
<!ELEMENT x - -( (a, b+) | (b+, a) )* >

>  <x><a><b><b><a></x>
          -------J-------  --K--
The sequence is: J(a) J(b) J(b) J(a)

& -> Both a and b must be present, only the order is not important.

In the example, <a><b><b> satisfies J and the last <a> partially
satisfies J, since both <a> and <b> must be present.

Also, SGML content models are not regular expressions. Although there
may be a regular expression that describes a content model, the rules
that govern their interpretation are not identical.

Klein and Klein & Wood have written papers on the validation of SGML
content models and their relationship to regular expressions. In
particular, "The Validation of SGML Content Models", Ann
Bruggemann-Klein and Derick Wood, March 23, 1993.

Eric Borduas
Microstar Software Ltd.

--------------------

Title: Re: Semantic of a Content Model
Author: papresco@calum.csclub.uwaterloo.ca (Paul Prescod)
Date: Mon, 11 Aug 1997 15:47:32 GMT



In article <EEpqyC.A9s@undergrad.math.uwaterloo.ca>,
Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote:
>So again, why does nsgmls care that it can't determine which part of the
>or connection it is in?

nsgmls cares because SGML cares.  Annex H says:

"It can be shown (by Kleene's theorem) that a regular expression is equivalent
to a DFA. A parser could in theory, therefore, handle any model group by 
reducing it to a regular expression and constructing a DFA with ... Practice,
however presents some difficulties.

One problem arises in the reduction of and groups. Since the number of required
seq group permutations is a factorial function, even small and groups can lead 
to an intractably large number of corresponding seq groups. An and group with
six members for example would require 6! or 720 seq groups.

Another problem lies with the construction of the DFA. One method is first to
constract an NFA directly from the regular expression, then to derive a 
deterministic equivalent. Although this and other algorithms for DFA 
construction are non-polynomial and hardly intractable for the human-readable
models envisaged by SGML, they may be too resource-intensive for some 
implementations.

This International Standard avoids these problems by eliminating the need to 
construct a DFA. This it does by prohibiting models that are ambiguous or
require "look-ahead"....

As a result context checking can be done by simplified algorithms that use
only NFAs."

This restriction is hotly disputed and there is definately a camp that feels
that it should be removed.  See the SGML Web Page under "special topics" and
"ambiguity" for more information.

 Paul Prescod

-------------------
Title: Re: Semantic of a Content Model
Author: roconnor@undergrad.math.uwaterloo.ca (Russell Steven Shawn O'Connor)
Date: Mon, 11 Aug 1997 19:30:32 GMT



In article <EErAJ8.2p8@undergrad.math.uwaterloo.ca>,
Paul Prescod <papresco@calum.csclub.uwaterloo.ca> wrote:
>
>One problem arises in the reduction of and groups. Since the number of required
>seq group permutations is a factorial function, even small and groups can lead 
>to an intractably large number of corresponding seq groups. An and group with
>six members for example would require 6! or 720 seq groups.

I thought NFA to DFA conversions were exponential function not factorial.
(That is to say if you have n states in an NFA, the DFA may have up to
2^n states.) Although this wouldn't change the argument very much.

Or is it saying that it requires a factorial algorithm to reduced a
DFA to a simpler form.

-- 
Russell O'Connor                           roconnor@uwaterloo.ca
    <URL:http://www.undergrad.math.uwaterloo.ca/%7Eroconnor/>
"And truth irreversibly destroys the meaning of its own message"
-- Anindita Dutta, "The Paradox of Truth, the Truth of Entropy"



-----------


Title: Re: Semantic of a Content Model
Author: jenglish@crl2.crl.com (Joe English)
Date: 12 Aug 1997 11:50:22 -0700

Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote:
>
>Okay, after thinking about this for a bit, I did run the above ((a, b+) |
>(b+, a))* through nsgmls to see what it did.
>
>To my surprise it told me that this is an ambiguous content model.  I
>understand why it thinks it is ambiguous.  Given <a><b><b>...., after
>reading the second <b> it can't tell if that <b> belongs to the first
>part of the or connection, or the second part.
>
>But what I don't understand is, why does it care?

Because them's the rules:

    "A _content model_ cannot be ambiguous; that is,
    an element or character string that appears in the document
    instance must be able to satisfy only one _primitive content
    token_ without looking ahead in the document instance"

        [ 11.2.4, "Ambiguous Content Model"; pp 414-415 ]


> What is listed above is
>a regular expression.  All regular expressions can be parsed by a finite
>state machine.  There is no look ahead required when parsing with a
>finite state machine.  Anything that can be parsed by a finite state
>machine is about as simple as it gets.
>
>In fact, I already drawn the finite state machine required to determine if
>a given series of <a> and <b>'s are satisfied by that regular expression
>or not.
>
>So again, why does nsgmls care that it can't determine which part of the
>or connection it is in?

There are a couple of reasons.  The principle reason cited
in Annex H is that the canonical algorithm for processing
regular expressions -- converting the R.E. to an NFA, then
converting that to a DFA -- is fairly expensive (it's worst-case
exponential).  So, the reasoning goes, if content models
are required to be unambiguous (in the 8879 sense), a parser
can convert the content model to a deterministic finite automaton
in one step (with one state for each primitive content token and
transitions determined by the occurrence indicators and connectors).

This rationale isn't entirely compelling; there are after all
algorithms to match an input sequence against an NFA, or directly
against a regular expression.  However, when you toss AND groups
into the mix these get expensive too (try building an NFA
for (a & b & c? & d? & e*) -- this is worse than exponential).
With the ambiguity restriction, AND groups are tractable;
without it they're much more difficult to deal with (but
probably not impossible).

The real show-stopper though is probably start-tag omission.
Without the ambiguity restriction, the formal definition of
"contextually required element" and related terms are utterly
meaningless.  Specifically, the parts about "satisfied tokens":
in terms of general regular expressions it doesn't make any sense
to talk about _which_ subexpression an input token matches, only whether
the input sequence as a whole matches the regular expression
as a whole.  This only makes sense for deterministic regular
expressions.

Consider:

        <!ELEMENT foo - - (A, ((B,C)| B), D)    -- this is ambiguous -->
        ...
        <foo>
            <a>...</a>
            <b>...</b>
            <!-- Here, is 'C', 'D', or neither of them contextually required?
                 Try to decide based on the text of the Standard.
            -->
        </foo>



--Joe English

  jenglish@crl.com


--------------------

Title: Re: Semantic of a Content Model
Author: jenglish@crl2.crl.com (Joe English)
Date: 13 Aug 1997 10:58:35 -0700



Russell Steven Shawn O'Connor <roconnor@undergrad.math.uwaterloo.ca> wrote:
>Paul Prescod <papresco@calum.csclub.uwaterloo.ca> wrote:
>>
>>One problem arises in the reduction of and groups. Since the number of
>> required seq group permutations is a factorial function, even small and
>> groups can lead to an intractably large number of corresponding seq groups.
>> An and group with six members for example would require 6! or 720 seq
>> groups.
>
>I thought NFA to DFA conversions were exponential function not factorial.
>(That is to say if you have n states in an NFA, the DFA may have up to
>2^n states.)

That's correct.  It's the conversion of an AND group with N nodes
to an NFA that produces N factorial states.  (OR groups and SEQ groups
produce NFAs with N states).

James Clark has a nifty (though somewhat difficult to follow)
algorithm in SP; it treats AND groups just like repeatable
OR groups when constructing the automaton, then at parse time
it keeps track of an auxilliary "and state" which is a
vector of flags indicating which submodels of the AND
group have been satisfied.  This takes O(N) space and 
(I think) O(1) time per transition.


--Joe English

  jenglish@crl.com

SEARCH \| ABOUT \| INDEX \| NEWS \| CORE STANDARDS \| TECHNOLOGY REPORTS \| EVENTS \| LIBRARY