SGML: Ambiguity
SGML has some unusual definitions for ambiguity. See below, but this file is dated: use the indexed searching for the entire SGML database to find ambig* or similar. For even better results, consult an indexed database for the COMP.TEXT.SGML (Usenet News) archives)
Article: 579 of comp.text.sgml
Path: txsil!robin
From: robin@txsil.lonestar.org (Robin Cover)
Newsgroups: comp.text.sgml
Subject: More on Ambiguity (and FSAs)
Keywords: SGML, Unambiguity, Finite (State) Automata
Message-ID: <607@txsil.lonestar.org>
Date: 9 Jan 92 03:32:22 GMT
Organization: Summer Inst. of Linguistics, Dallas TX
Lines: 112
A previous posting citing discussions of "unambiguous" as defined in
ISO 8879 neglected a couple recent TRs by Brueggemann-Klein and Wood
(abstracts below), a couple others I have not seen, an early complaint
by Heath & Welsch, and a note by Price on the usefulness of FSAs in
revealing or clarifying ambiguity germane to SGML content models. Here
they are. Other studies could be included (some originating in the
early phases of the Chameleon project, some perhaps from Waterloo),
but I have not the heart. . . I'm thankful others have to worry about
parsers anyway. The current mood seems to be (?) reconciliation to the
standard as adopted, however justifiably it may be (arguably) criticized
for some of the tough formal-language-theory choices it made. Oh, I also
think I heard Erik Naggum mention an article in progress on this general
topic of parsing (ambiguous DTDs and/or instances). . .stay tuned.
(1) Brueggemann-Klein, Anne; Wood, Derek. Deterministic Regular
Languages. Bericht 38. Universitaet Freiburg, Institut fuer
Informatik. Oktober 1991. 18 pages.
Abstract: The ISO Standard for Standard Generalized Markup
Language (SGML) provides a syntactic meta-language for the
definition of textual markup systems. In the standard the right
hand sides of productions are called <ital>content models</> and
they are based on regular expressions. The allowable regular
expressions are those that are "unambiguous" as defined by the
standard. Unfortunately, the standard's use of the term
"unambiguous" does not correspond to the two well known notions,
since not all regular languages are denoted by "unambiguous"
expressions. Furthermore, the standard's definition of
"unambiguous" is somewhat vague. Therefore, we provide a
precise definition of "unambiguous expressions" and rename them
deterministic regular expressions to avoid any confusion. A
regular expression <ital>E</> is <ital>deterministic</> if the
canonical ε-free finite atomaton <ital>M<sub>E</sub></>
recognizing L(<ital>E</>) is deterministic. A regular language
is <ital>deterministic</> if there is a deterministic expression
that denotes it. We give a Kleene-like theorem for deterministic
regular languages and we characterize them in terms of the
structural properties of the minimal deterministic automata
recognizing them. The latter result enables us to decide if a
given regular expression denotes a deterministic regular
language and, if so, to construct an equivalent deterministic
expression.
(2) Brueggemann-Klein, Anne. Regular Expressions into Finite
Atomata. Bericht 33. Universitaet Freiburg, Institut fuer
Informatik. Juli 1991. 22 pages.
Abstract: It is a well-established fact that each regular
expression can be transformed into a non-deterministic automaton
(NFA) with or without ε-transitions, and all authors seem
to provide their own variant of the construction. Of these,
Berry and Sethi <cit>BS86</> have shown that the construction of
an ε-free NFA due to Glushkov <cit>Glu61</> is a
<ital>natural</> representation of the regular expression,
because it can be described in terms of the Brzozowski
derivatives <cit>Brz64</> of the expression. Moreover, the
Glushkov construction also plays a significant role in the
document processing area: The SGML standard <cit>ISO86</>, now
widely adopted by publishing houses and government agencies for
the syntactic specification of textual markup systems, uses
<ital>deterministic</> regular expressions, i.e., expressions
whose Glushkov automaton is deterministic, as a description
language for document types. In this paper, we first show that
the Glushkov automaton can be constructed in time quadratic in
the size of the expression, and that this is worst case optimal.
For deterministic expressions, our algorithm has even linear run
time. This improves on the cubic time methods suggested in the
literature <cit>BEGO71</><cit>ASU86</><cit>BS86</>. A major
step of the algorithm consists in bringing the expression into
what we call <ital>star normal form</>. This concept is also
useful for characterizing the relationship between two types of
unambiguity that have been studied in the literature. Namely,
we show that, modulo a technical condition, an expression is
strongly unambiguous <cit>SS88</> if and only if it is weakly
unambiguous <cit>BEGO71</> and in star normal form. This leads
to our third result, a quadratic time decision algorithm for
weak unambiguity, that improves on the bi-quadratic method
introduced by Book et al. <cit>BEGO71</>. (A version of this TR
is also to appear in the conference proceedings of Latin '92.)
(3) Brueggemann-Klein, Anne; Wood, Derek. On the Expressive Power
of SGML Document Grammars. In preparation, 1991. Cited in (1)
above.
(4) Brueggemann-Klein, Anne; Wood, Derek. Parser Generators for
Document Grammars. Submitted for publication, 1991. Cited in
(2) above.
(5) Heath, Jim; Welsch, Larry. "Difficulties in Parsing SGML." In
Proceedings of the ACM Conference on Document Processing
Systems, Santa Fe (5-9 December 1988). Pages 71-77. New York:
Association for Computing Machinery, 1988. See similarly, by
the same authors, "Difficulties in Parsing: Suggestions to
Improve SGML," <TAG> 10 (July 1989) 8-10.
(6) Price, Lynne A. "Graphic Representation of Content Models."
<TAG> 10 (July 1989) 12-16. The article demonstrates the use of
tree structures and (more extensively) FSAs to represent SGML
content models. FSAs are useful in revealing ambiguity
(seemingly equivalent models). The article is derived from the
author's tutorial session at the ACM Conference on Document
Processing Systems, Santa Fe, New Mexico (5-9 December 1988).
------------------------------------------------------------------------
Robin Cover BITNET: zrcc1001@smuvm1 ("one-zero-zero-one")
6634 Sarah Drive Internet: zrcc1001@vm.cis.smu.edu
Dallas, TX 75236 USA Internet: robin@utafll.uta.edu ("uta-ef-el-el")
Tel: (1 214) 296-1783 Internet: robin@ling.uta.edu
FAX: (1 214) 709-3387 Internet: robin@txsil.lonestar.org
-------------------------------------------------------------------------
Article: 575 of comp.text.sgml
Path: txsil!robin
From: robin@txsil.lonestar.org (Robin Cover)
Newsgroups: comp.text.sgml
Subject: James Clark on Ambiguity
Keywords: SGML, Ambiguity
Message-ID: <605@txsil.lonestar.org>
Date: 7 Jan 92 16:12:37 GMT
Organization: Summer Inst. of Linguistics, Dallas TX
Lines: 82
A bibliographic reminder that the worry about "ambiguity" in SGML is
an historic as well as current issue:
(1) Graf, John M. "Ambiguity in the Instance." <TAG> 7 (1988) 6-9.
Observes that "a document created under the exact rules of a valid DTD
may very well be invalid when passed through an instance parser." See
the response of John McFadden and Sam Wilmott, "Ambiguity in the
Instance: An Analysis" in <TAG> 9 (March/April 1989) 3-5.
(2) Grootenhuis, Jan. "Disambiguation of SGML Document Models."
<TAG> 12 (December 1989) 11-12. Contact: Jan Grootenhuis, c/o CIRCE,
Kralenbeek 1873, 1104 KJ Amsterdam, THE NETHERLANDS; tel: +31 20
998966.
(3) Hayter, Ron. "Comments on 'On Improving SGML'." Technical
Bulletin 4. Software Exoterica Corporation, 1988. Hayter argues that
Kaelbling's "improvements" to SGML (4) are based upon a
misunderstanding of the intent of the standard. Kaelbling's original
draft known to Hayter was apparently 16-March-1988; Kaelbling's
revised draft of 18-October-1988 responds to Hayter's comments.
(4) Kaelbling, Michael J. "On Improving SGML." OSU-CIRSC-7/88-TR22.
The Ohio State University, 1988. Department of Computer and
Information Science; The Ohio State University; 2036 Neil Avenue Mall;
Columbus, OH 43210. Draft 18-October-1988, accepted for publication
in Electronic Publishing: Origination, Dissemination and Design.
Summary: Several improvements are suggested to the syntax of SGML, the
recent international stnadard for the description of electronic
document types. These improvements ease processing by existing tools,
remove ambiguity cleanly, and increase human usability. They also
indicate some guidelines that should be followed in the design and
specification of computer-software standards. By following accepted
computer-science conventions for the description of languages the
design of a standard may be improved, and the subsequent
implementation task simplified.
(5) McFadden, John R.; Wilmott, Sam. "Ambiguity in the Instance: An
Analysis." <TAG> 9 (March/April 1989) 3-5. A response to the article
of John M. Graf in <TAG> 7 (1988) 6-9. The authors argue that
conforming parsers and a validating parsers must be are
differentiated. On this distinction, cf. further the article of
McFadden and Wilmott "The SGML Conformance Testing Initiative." <TAG>
9 (March/April 1989) 1-3. The writers conclude that Graf's examples
represent a misunderstanding of the use of SGML parsers, and that tag
minimization is "safe" with proper DTD design.
(6) Price, Lynne A. "The Problem with Ambiguous Content Models."
SGML Users' Group Bulletin 3/1 (1988) 25-26. Published abstract: "IS
8879 defines an ambiguous content model -- one for which an element
or character string occuring in a document instance can satisfy more
than one primitive content token without look-ahead -- to be a
nonreportable markup error. In other words, it is an error to use an
ambiguous content model, but SGML software will not necessarily detect
the condition. SGML could have been defined to give a unique
interpretation to each possible content model. For example, an element
or character string occuring in a document instance could have been
interpreted as matching the leftmost possible primitive content token.
Such a definition, however, would not have given consistently
intuitive results. Several illustrations of this point are given
below."
(7) Warmer, Jos; Egmond, Sylvia van. "The Implementation of the
Amsterdam SGML Parser." Electronic Publishing: Origination,
Dissemination and Design (EPOdd) 2/2 (July 1989) 65-90. ISSN: 0894-
3982. Abstract: The Standard Generalized Markup Language (SGML) is an
ISO Standard that specifies a language for document representation.
This paper gives a short introduction to SGML and describes the (Vrije
Universiteit) Amsterdam SGML Parser and the problems we encountered in
implementing the Standard. These problems include the interpretation
of the Standard in places where it is ambiguous and the technical
problems in parsing SGML documents. (See especially "Ambiguities in
SGML," pp. 80-83.)
-------------------------------------------------------------------------
Robin Cover BITNET: zrcc1001@smuvm1 ("one-zero-zero-one")
6634 Sarah Drive Internet: robin@utafll.uta.edu ("uta-ef-el-el")
Dallas, TX 75236 USA Internet: zrcc1001@vm.cis.smu.edu
Tel: (1 214) 296-1783 Internet: robin@ling.uta.edu
FAX: (1 214) 709-3387 Internet: robin@txsil.sil.org
=========================================================================
Prepared by Robin Cover for The XML Cover Pages archive. See "SGML/XML and non-deterministic ('ambiguous') content models."

