Cover Pages Logo SEARCH
Advanced Search
Site Map
CP RSS Channel
Contact Us
Sponsoring CP
About Our Sponsors

Cover Stories
Articles & Papers
Press Releases

XML Query

XML Applications
General Apps
Government Apps
Academic Apps

Technology and Society
Tech Topics
Related Standards

SGML: Ambiguity

SGML has some unusual definitions for ambiguity. See below, but this file is dated: use the indexed searching for the entire SGML database to find ambig* or similar. For even better results, consult an indexed database for the COMP.TEXT.SGML (Usenet News) archives)

Article: 579 of comp.text.sgml
Path: txsil!robin
From: (Robin Cover)
Newsgroups: comp.text.sgml
Subject: More on Ambiguity (and FSAs)
Keywords: SGML, Unambiguity, Finite (State) Automata
Message-ID: <>
Date: 9 Jan 92 03:32:22 GMT
Organization: Summer Inst. of Linguistics, Dallas TX
Lines: 112

A previous posting citing discussions of "unambiguous" as defined in
ISO 8879 neglected a couple recent TRs by Brueggemann-Klein and Wood
(abstracts below), a couple others I have not seen, an early complaint
by Heath & Welsch, and a note by Price on the usefulness of FSAs in
revealing or clarifying ambiguity germane to SGML content models.  Here
they are.  Other studies could be included (some originating in the
early phases of the Chameleon project, some perhaps from Waterloo),
but I have not the heart. . .  I'm thankful others have to worry about
parsers anyway.  The current mood seems to be (?) reconciliation to the
standard as adopted, however justifiably it may be (arguably) criticized
for some of the tough formal-language-theory choices it made.  Oh, I also
think I heard Erik Naggum mention an article in progress on this general
topic of parsing (ambiguous DTDs and/or instances). . .stay tuned.

(1)   Brueggemann-Klein, Anne; Wood, Derek.  Deterministic Regular
      Languages.  Bericht 38.  Universitaet Freiburg, Institut fuer
      Informatik.  Oktober 1991.  18 pages.

      Abstract:  The ISO Standard for Standard Generalized Markup
      Language (SGML) provides a syntactic meta-language for the
      definition of textual markup systems. In the standard the right
      hand sides of productions are called <ital>content models</> and
      they are based on regular expressions.  The allowable regular
      expressions are those that are "unambiguous" as defined by the
      standard.  Unfortunately, the standard's use of the term
      "unambiguous" does not correspond to the two well known notions,
      since not all regular languages are denoted by "unambiguous"
      expressions.  Furthermore, the standard's definition of
      "unambiguous" is somewhat vague.  Therefore, we provide a
      precise definition of "unambiguous expressions" and rename them
      deterministic regular expressions to avoid any confusion.  A
      regular expression <ital>E</> is <ital>deterministic</> if the
      canonical &epsi;-free finite atomaton <ital>M<sub>E</sub></>
      recognizing L(<ital>E</>) is deterministic.  A regular language
      is <ital>deterministic</> if there is a deterministic expression
      that denotes it. We give a Kleene-like theorem for deterministic
      regular languages and we characterize them in terms of the
      structural properties of the minimal deterministic automata
      recognizing them.  The latter result enables us to decide if a
      given regular expression denotes a deterministic regular
      language and, if so, to construct an equivalent deterministic

(2)   Brueggemann-Klein, Anne.  Regular Expressions into Finite
      Atomata.  Bericht 33.  Universitaet Freiburg, Institut fuer
      Informatik.  Juli 1991.  22 pages.

      Abstract:  It is a well-established fact that each regular
      expression can be transformed into a non-deterministic automaton
      (NFA) with or without &epsi;-transitions, and all authors seem
      to provide their own variant of the construction.  Of these,
      Berry and Sethi <cit>BS86</> have shown that the construction of
      an &epsi;-free NFA due to Glushkov <cit>Glu61</> is a
      <ital>natural</> representation of the regular expression,
      because it can be described in terms of the Brzozowski
      derivatives <cit>Brz64</> of the expression.  Moreover, the
      Glushkov construction also plays a significant role in the
      document processing area: The SGML standard <cit>ISO86</>, now
      widely adopted by publishing houses and government agencies for
      the syntactic specification of textual markup systems, uses
      <ital>deterministic</> regular expressions, i.e., expressions
      whose Glushkov automaton is deterministic, as a description
      language for document types.  In this paper, we first show that
      the Glushkov automaton can be constructed in time quadratic in
      the size of the expression, and that this is worst case optimal.
      For deterministic expressions, our algorithm has even linear run
      time.  This improves on the cubic time methods suggested in the
      literature <cit>BEGO71</><cit>ASU86</><cit>BS86</>.  A major
      step of the algorithm consists in bringing the expression into
      what we call <ital>star normal form</>.  This concept is also
      useful for characterizing the relationship between two types of
      unambiguity that have been studied in the literature.  Namely,
      we show that, modulo a technical condition, an expression is
      strongly unambiguous <cit>SS88</> if and only if it is weakly
      unambiguous <cit>BEGO71</> and in star normal form.  This leads
      to our third result, a quadratic time decision algorithm for
      weak unambiguity, that improves on the bi-quadratic method
      introduced by Book et al. <cit>BEGO71</>.  (A version of this TR
      is also to appear in the conference proceedings of Latin '92.)

(3)   Brueggemann-Klein, Anne; Wood, Derek.  On the Expressive Power
      of SGML Document Grammars.  In preparation, 1991.  Cited in (1)

(4)   Brueggemann-Klein, Anne; Wood, Derek.  Parser Generators for
      Document Grammars.  Submitted for publication, 1991. Cited in
      (2) above.

(5)  Heath, Jim; Welsch, Larry.  "Difficulties in Parsing SGML."  In
      Proceedings of the ACM Conference on Document Processing
      Systems, Santa Fe (5-9 December 1988).  Pages 71-77.  New York:
      Association for Computing Machinery, 1988.  See similarly, by
      the same authors, "Difficulties in Parsing: Suggestions to
      Improve SGML," <TAG> 10 (July 1989) 8-10.

(6)  Price, Lynne A.  "Graphic Representation of Content Models."
      <TAG> 10 (July 1989) 12-16.  The article demonstrates the use of
      tree structures and (more extensively) FSAs to represent SGML
      content models.  FSAs are useful in revealing ambiguity
      (seemingly equivalent models).  The article is derived from the
      author's tutorial session at the ACM Conference on Document
      Processing Systems, Santa Fe, New Mexico (5-9 December 1988).

Robin Cover              BITNET: zrcc1001@smuvm1    ("one-zero-zero-one")
6634 Sarah Drive         Internet:
Dallas, TX  75236  USA   Internet: ("uta-ef-el-el")
Tel: (1 214) 296-1783    Internet:
FAX: (1 214) 709-3387    Internet:
Article: 575 of comp.text.sgml
Path: txsil!robin
From: (Robin Cover)
Newsgroups: comp.text.sgml
Subject: James Clark on Ambiguity
Keywords: SGML, Ambiguity
Message-ID: <>
Date: 7 Jan 92 16:12:37 GMT
Organization: Summer Inst. of Linguistics, Dallas TX
Lines: 82

A bibliographic reminder that the worry about "ambiguity" in SGML is
an historic as well as current issue:

(1) Graf, John M.  "Ambiguity in the Instance."  <TAG> 7 (1988) 6-9.
Observes that "a document created under the exact rules of a valid DTD
may very well be invalid when passed through an instance parser."  See
the response of John McFadden and Sam Wilmott, "Ambiguity in the
Instance: An Analysis" in <TAG> 9 (March/April 1989) 3-5.

(2) Grootenhuis, Jan.  "Disambiguation of SGML Document Models."
<TAG> 12 (December 1989) 11-12.  Contact: Jan Grootenhuis, c/o CIRCE,
Kralenbeek 1873, 1104 KJ Amsterdam, THE NETHERLANDS; tel: +31 20

(3) Hayter, Ron.  "Comments on 'On Improving SGML'."  Technical
Bulletin 4.  Software Exoterica Corporation, 1988.  Hayter argues that
Kaelbling's "improvements" to SGML (4) are based upon a
misunderstanding of the intent of the standard. Kaelbling's original
draft known to Hayter was apparently 16-March-1988; Kaelbling's
revised draft of 18-October-1988 responds to Hayter's comments.

(4) Kaelbling, Michael J.  "On Improving SGML."  OSU-CIRSC-7/88-TR22.
The Ohio State University, 1988.  Department of Computer and
Information Science; The Ohio State University; 2036 Neil Avenue Mall;
Columbus, OH 43210.  Draft 18-October-1988, accepted for publication
in Electronic Publishing: Origination, Dissemination and Design.
Summary: Several improvements are suggested to the syntax of SGML, the
recent international stnadard for the description of electronic
document types.  These improvements ease processing by existing tools,
remove ambiguity cleanly, and increase human usability.  They also
indicate some guidelines that should be followed in the design and
specification of computer-software standards.  By following accepted
computer-science conventions for the description of languages the
design of a standard may be improved, and the subsequent
implementation task simplified.

(5) McFadden, John R.; Wilmott, Sam.  "Ambiguity in the Instance: An
Analysis." <TAG> 9 (March/April 1989) 3-5.  A response to the article
of John M. Graf in <TAG>  7 (1988) 6-9.  The authors argue that
conforming parsers and a validating parsers must be are
differentiated.  On this distinction, cf. further the article of
McFadden and Wilmott "The SGML Conformance Testing Initiative." <TAG>
9 (March/April 1989) 1-3. The writers conclude that Graf's examples
represent a misunderstanding of the use of SGML parsers, and that tag
minimization is "safe" with proper DTD design.

(6) Price, Lynne A.  "The Problem with Ambiguous Content Models."
SGML Users' Group Bulletin 3/1 (1988) 25-26.  Published abstract: "IS
8879 defines an  ambiguous content model -- one for which an element
or character string occuring in a document instance can satisfy more
than one primitive content token without look-ahead -- to be a
nonreportable markup error.  In other words, it is an error to use an
ambiguous content model, but SGML software will not necessarily detect
the condition.  SGML could have been defined to give a unique
interpretation to each possible content model. For example, an element
or character string occuring in a document instance could have been
interpreted as matching the leftmost possible primitive content token.
Such a definition, however, would not have given consistently
intuitive results. Several illustrations of this point are given

(7) Warmer, Jos; Egmond, Sylvia van.  "The Implementation of the
Amsterdam SGML Parser."  Electronic Publishing: Origination,
Dissemination and Design (EPOdd) 2/2 (July 1989) 65-90. ISSN: 0894-
3982.  Abstract: The Standard Generalized Markup Language (SGML) is an
ISO Standard that specifies a language for document representation.
This paper gives a short introduction to SGML and describes the (Vrije 
Universiteit) Amsterdam SGML Parser and the problems we encountered in
implementing the Standard. These problems include the interpretation
of the Standard in places where it is ambiguous and the technical
problems in parsing SGML documents.  (See especially "Ambiguities in
SGML," pp. 80-83.)

Robin Cover              BITNET:   zrcc1001@smuvm1   ("one-zero-zero-one")
6634 Sarah Drive         Internet:   ("uta-ef-el-el")
Dallas, TX 75236 USA     Internet:
Tel: (1 214) 296-1783    Internet:
FAX: (1 214) 709-3387    Internet:

Prepared by Robin Cover for The XML Cover Pages archive. See "SGML/XML and non-deterministic ('ambiguous') content models."

Globe Image

Document URL: