SGML: Ambiguity
SGML has some unusual definitions for ambiguity. See below, but this file is dated: use the indexed searching for the entire SGML database to find ambig* or similar. For even better results, consult an indexed database for the COMP.TEXT.SGML (Usenet News) archives)
Article: 579 of comp.text.sgml Path: txsil!robin From: robin@txsil.lonestar.org (Robin Cover) Newsgroups: comp.text.sgml Subject: More on Ambiguity (and FSAs) Keywords: SGML, Unambiguity, Finite (State) Automata Message-ID: <607@txsil.lonestar.org> Date: 9 Jan 92 03:32:22 GMT Organization: Summer Inst. of Linguistics, Dallas TX Lines: 112 A previous posting citing discussions of "unambiguous" as defined in ISO 8879 neglected a couple recent TRs by Brueggemann-Klein and Wood (abstracts below), a couple others I have not seen, an early complaint by Heath & Welsch, and a note by Price on the usefulness of FSAs in revealing or clarifying ambiguity germane to SGML content models. Here they are. Other studies could be included (some originating in the early phases of the Chameleon project, some perhaps from Waterloo), but I have not the heart. . . I'm thankful others have to worry about parsers anyway. The current mood seems to be (?) reconciliation to the standard as adopted, however justifiably it may be (arguably) criticized for some of the tough formal-language-theory choices it made. Oh, I also think I heard Erik Naggum mention an article in progress on this general topic of parsing (ambiguous DTDs and/or instances). . .stay tuned. (1) Brueggemann-Klein, Anne; Wood, Derek. Deterministic Regular Languages. Bericht 38. Universitaet Freiburg, Institut fuer Informatik. Oktober 1991. 18 pages. Abstract: The ISO Standard for Standard Generalized Markup Language (SGML) provides a syntactic meta-language for the definition of textual markup systems. In the standard the right hand sides of productions are called <ital>content models</> and they are based on regular expressions. The allowable regular expressions are those that are "unambiguous" as defined by the standard. Unfortunately, the standard's use of the term "unambiguous" does not correspond to the two well known notions, since not all regular languages are denoted by "unambiguous" expressions. Furthermore, the standard's definition of "unambiguous" is somewhat vague. Therefore, we provide a precise definition of "unambiguous expressions" and rename them deterministic regular expressions to avoid any confusion. A regular expression <ital>E</> is <ital>deterministic</> if the canonical ε-free finite atomaton <ital>M<sub>E</sub></> recognizing L(<ital>E</>) is deterministic. A regular language is <ital>deterministic</> if there is a deterministic expression that denotes it. We give a Kleene-like theorem for deterministic regular languages and we characterize them in terms of the structural properties of the minimal deterministic automata recognizing them. The latter result enables us to decide if a given regular expression denotes a deterministic regular language and, if so, to construct an equivalent deterministic expression. (2) Brueggemann-Klein, Anne. Regular Expressions into Finite Atomata. Bericht 33. Universitaet Freiburg, Institut fuer Informatik. Juli 1991. 22 pages. Abstract: It is a well-established fact that each regular expression can be transformed into a non-deterministic automaton (NFA) with or without ε-transitions, and all authors seem to provide their own variant of the construction. Of these, Berry and Sethi <cit>BS86</> have shown that the construction of an ε-free NFA due to Glushkov <cit>Glu61</> is a <ital>natural</> representation of the regular expression, because it can be described in terms of the Brzozowski derivatives <cit>Brz64</> of the expression. Moreover, the Glushkov construction also plays a significant role in the document processing area: The SGML standard <cit>ISO86</>, now widely adopted by publishing houses and government agencies for the syntactic specification of textual markup systems, uses <ital>deterministic</> regular expressions, i.e., expressions whose Glushkov automaton is deterministic, as a description language for document types. In this paper, we first show that the Glushkov automaton can be constructed in time quadratic in the size of the expression, and that this is worst case optimal. For deterministic expressions, our algorithm has even linear run time. This improves on the cubic time methods suggested in the literature <cit>BEGO71</><cit>ASU86</><cit>BS86</>. A major step of the algorithm consists in bringing the expression into what we call <ital>star normal form</>. This concept is also useful for characterizing the relationship between two types of unambiguity that have been studied in the literature. Namely, we show that, modulo a technical condition, an expression is strongly unambiguous <cit>SS88</> if and only if it is weakly unambiguous <cit>BEGO71</> and in star normal form. This leads to our third result, a quadratic time decision algorithm for weak unambiguity, that improves on the bi-quadratic method introduced by Book et al. <cit>BEGO71</>. (A version of this TR is also to appear in the conference proceedings of Latin '92.) (3) Brueggemann-Klein, Anne; Wood, Derek. On the Expressive Power of SGML Document Grammars. In preparation, 1991. Cited in (1) above. (4) Brueggemann-Klein, Anne; Wood, Derek. Parser Generators for Document Grammars. Submitted for publication, 1991. Cited in (2) above. (5) Heath, Jim; Welsch, Larry. "Difficulties in Parsing SGML." In Proceedings of the ACM Conference on Document Processing Systems, Santa Fe (5-9 December 1988). Pages 71-77. New York: Association for Computing Machinery, 1988. See similarly, by the same authors, "Difficulties in Parsing: Suggestions to Improve SGML," <TAG> 10 (July 1989) 8-10. (6) Price, Lynne A. "Graphic Representation of Content Models." <TAG> 10 (July 1989) 12-16. The article demonstrates the use of tree structures and (more extensively) FSAs to represent SGML content models. FSAs are useful in revealing ambiguity (seemingly equivalent models). The article is derived from the author's tutorial session at the ACM Conference on Document Processing Systems, Santa Fe, New Mexico (5-9 December 1988). ------------------------------------------------------------------------ Robin Cover BITNET: zrcc1001@smuvm1 ("one-zero-zero-one") 6634 Sarah Drive Internet: zrcc1001@vm.cis.smu.edu Dallas, TX 75236 USA Internet: robin@utafll.uta.edu ("uta-ef-el-el") Tel: (1 214) 296-1783 Internet: robin@ling.uta.edu FAX: (1 214) 709-3387 Internet: robin@txsil.lonestar.org ------------------------------------------------------------------------- Article: 575 of comp.text.sgml Path: txsil!robin From: robin@txsil.lonestar.org (Robin Cover) Newsgroups: comp.text.sgml Subject: James Clark on Ambiguity Keywords: SGML, Ambiguity Message-ID: <605@txsil.lonestar.org> Date: 7 Jan 92 16:12:37 GMT Organization: Summer Inst. of Linguistics, Dallas TX Lines: 82 A bibliographic reminder that the worry about "ambiguity" in SGML is an historic as well as current issue: (1) Graf, John M. "Ambiguity in the Instance." <TAG> 7 (1988) 6-9. Observes that "a document created under the exact rules of a valid DTD may very well be invalid when passed through an instance parser." See the response of John McFadden and Sam Wilmott, "Ambiguity in the Instance: An Analysis" in <TAG> 9 (March/April 1989) 3-5. (2) Grootenhuis, Jan. "Disambiguation of SGML Document Models." <TAG> 12 (December 1989) 11-12. Contact: Jan Grootenhuis, c/o CIRCE, Kralenbeek 1873, 1104 KJ Amsterdam, THE NETHERLANDS; tel: +31 20 998966. (3) Hayter, Ron. "Comments on 'On Improving SGML'." Technical Bulletin 4. Software Exoterica Corporation, 1988. Hayter argues that Kaelbling's "improvements" to SGML (4) are based upon a misunderstanding of the intent of the standard. Kaelbling's original draft known to Hayter was apparently 16-March-1988; Kaelbling's revised draft of 18-October-1988 responds to Hayter's comments. (4) Kaelbling, Michael J. "On Improving SGML." OSU-CIRSC-7/88-TR22. The Ohio State University, 1988. Department of Computer and Information Science; The Ohio State University; 2036 Neil Avenue Mall; Columbus, OH 43210. Draft 18-October-1988, accepted for publication in Electronic Publishing: Origination, Dissemination and Design. Summary: Several improvements are suggested to the syntax of SGML, the recent international stnadard for the description of electronic document types. These improvements ease processing by existing tools, remove ambiguity cleanly, and increase human usability. They also indicate some guidelines that should be followed in the design and specification of computer-software standards. By following accepted computer-science conventions for the description of languages the design of a standard may be improved, and the subsequent implementation task simplified. (5) McFadden, John R.; Wilmott, Sam. "Ambiguity in the Instance: An Analysis." <TAG> 9 (March/April 1989) 3-5. A response to the article of John M. Graf in <TAG> 7 (1988) 6-9. The authors argue that conforming parsers and a validating parsers must be are differentiated. On this distinction, cf. further the article of McFadden and Wilmott "The SGML Conformance Testing Initiative." <TAG> 9 (March/April 1989) 1-3. The writers conclude that Graf's examples represent a misunderstanding of the use of SGML parsers, and that tag minimization is "safe" with proper DTD design. (6) Price, Lynne A. "The Problem with Ambiguous Content Models." SGML Users' Group Bulletin 3/1 (1988) 25-26. Published abstract: "IS 8879 defines an ambiguous content model -- one for which an element or character string occuring in a document instance can satisfy more than one primitive content token without look-ahead -- to be a nonreportable markup error. In other words, it is an error to use an ambiguous content model, but SGML software will not necessarily detect the condition. SGML could have been defined to give a unique interpretation to each possible content model. For example, an element or character string occuring in a document instance could have been interpreted as matching the leftmost possible primitive content token. Such a definition, however, would not have given consistently intuitive results. Several illustrations of this point are given below." (7) Warmer, Jos; Egmond, Sylvia van. "The Implementation of the Amsterdam SGML Parser." Electronic Publishing: Origination, Dissemination and Design (EPOdd) 2/2 (July 1989) 65-90. ISSN: 0894- 3982. Abstract: The Standard Generalized Markup Language (SGML) is an ISO Standard that specifies a language for document representation. This paper gives a short introduction to SGML and describes the (Vrije Universiteit) Amsterdam SGML Parser and the problems we encountered in implementing the Standard. These problems include the interpretation of the Standard in places where it is ambiguous and the technical problems in parsing SGML documents. (See especially "Ambiguities in SGML," pp. 80-83.) ------------------------------------------------------------------------- Robin Cover BITNET: zrcc1001@smuvm1 ("one-zero-zero-one") 6634 Sarah Drive Internet: robin@utafll.uta.edu ("uta-ef-el-el") Dallas, TX 75236 USA Internet: zrcc1001@vm.cis.smu.edu Tel: (1 214) 296-1783 Internet: robin@ling.uta.edu FAX: (1 214) 709-3387 Internet: robin@txsil.sil.org =========================================================================
Prepared by Robin Cover for The XML Cover Pages archive. See "SGML/XML and non-deterministic ('ambiguous') content models."