XML and non-deterministic content models
Here are postings from Joe English and James Clark on XML and non-deterministic content models. Answers to a FAQ that goes back to "SGML" before it became ISO 8879:1986. See SGML/XML Notion of Ambiguity (non-deterministic content models)
Date: Sun, 14 Jan 2001 09:58:16 -0800 From: Joe English <jenglish@flightlab.com> To: xml-dev@lists.xml.org Subject: Re: should all XML parsers reject non-deterministic content models?
TAKAHASHI Hideo wrote:
> I understand that the XML 1.0 spec prohibits non-deterministic (or, > ambiguous) content models (for compatibility, to be precise). > Are all xml 1.0 compliant xml processing software required to reject > DTDs with such content models?
No: a processor can ignore the DTD entirely and still be compliant. And since the prohibition against non-deterministic content models appears in a non-normative appendix, I would presume that conforming DTD-aware processors are not required to detect this condition either. Even in full SGML, ambiguous content models are a "non-reportable markup error", i.e., parser don't need to detect this condition.
> Ambiguous content models doesn't cause any problems when you construct a > DFA via an NFA. I have heard that there is a way to construct DFAs > directly from regexps without making an NFA, but that method can't > handle non-deterministic regular expressions.
There are many, many other ways to validate documents against content models though. Take a look at James Clark's TREX implementation, which has no problem with ambiguity, and also efficiently handles intersection, negation, and interleaving of content models (the first two of which are *very* expensive in a DFA-based approach).
> If you choose that method > to construct your DFA, you will surely benefit from the rule in XML 1.0 > . But if you choose not, detecting non-deterministic content models > become an extra job.
But note that detecting ambiguity in XML content models is considerably simpler than in SGML -- the really difficult part involves '&' groups which aren't present in XML.
--Joe English
Date: Sun, 14 Jan 2001 17:42:16 +0700 From: James Clark <jjc@jclark.com> To: Daniel.Veillard@imag.fr Cc: "TAKAHASHI Hideo(BSD-13G)" <hideo-t@bisd.hitachi.co.jp>, xml-editor@w3.org, xml-dev@lists.xml.org Subject: Re: should all XML parsers reject non-deterministic content models?
Daniel Veillard wrote:
> On Sun, Jan 14, 2001 at 04:42:55PM +0900, TAKAHASHI Hideo(BSD-13G) wrote: > > Hello. > > > > I understand that the XML 1.0 spec prohibits non-deterministic (or, > > ambiguous) content models (for compatibility, to be precise). > > Note also that this is stated in a non-normative appendix.
It is also stated normatively in the body of the spec (http://www.w3.org/TR/REC-xml#sec-element-content): "For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model."
> > Are all xml 1.0 compliant xml processing software required to reject > > DTDs with such content models? > > Since it is stated as non-normatively only I don't think this is the > case in theory.
It states normatively that it is an error. However, when the XML Rec says merely that something is an error, a conforming XML processor is not required to report it. If it said "it is a fatal error", then a conforming XML processor would be required to report it.
> In prectice this can be a problem. I recently faced a problem with > a DtD developped at the IETF which was clearly non-determinist. This > also means that this introduce new classes of XML parser among the > validating ones: > - those who detect and report non-determinist content model > - those who validate (correctly) or not using non-determinist > content model
Yes.
> > Ambiguous content models doesn't cause any problems when you construct a > > DFA via an NFA. I have heard that there is a way to construct DFAs > > directly from regexps without making an NFA, but that method can't > > handle non-deterministic regular expressions. If you choose that method > > to construct your DFA, you will surely benefit from the rule in XML 1.0 > > . But if you choose not, detecting non-deterministic content models > > become an extra job. > > I tried to read the Brüggemann-Klein thesis listed in reference and > found it a bit frightening, though very informative. The beginning > of the Part I on Document Grammar for example makes clear that SGML > view of unambiguity of the content model is really a 1 token lookahead > determinism. > In practice this is a very good rule because it allows to simplify > the validation of a content model a lot.
Complicating things for the user to make things simpler for the parser writer seems in general a bad trade-off to me. Also this decreases interoperability (as you've observed). The only justification is compatibility with SGML.
In fact, there is a very simple algorithm available that handles non-determinism just fine (it doesn't require you to construct a NFA and then do the subset construction). See http://www.flightlab.com/~joe/sgml/validate.html (TREX uses a variation on this).
> Problem is that grammars > need to be rewritten to conform to it (the thesis proves it's always > possible at lest).
The thesis proves the opposite: that there are some regular expressions that do not denote 1-unambiguous languages (see p52). She gives the example of
(a|b)*,a,(a|b)
James
Prepared by Robin Cover for The XML Cover Pages archive.