A collection of postings to CTS, more or less germane to CDATA and RCDATA as declared content, in light of SGML's recognition modes under the influence of active FEATURES set in the SGML declaration. "To him that hath ears. . ." Erik Naggum warns against this use of CDATA and RCDATA (see below). Other experts, as far as I know, (rather fully) agree with Erik's appraisal; (e.g., Steve DeRose/David Durand, Joe English). Please send me updates/corrections/additions/demurrals. -- Robin Cover
Newsgroups: comp.text.sgml Date: 01 Apr 1994 15:57:58 UT From: Erik Naggum <erik@naggum.no> Organization: Naggum Software; +47 2295 0313 Message-ID: <19940401.1664.erik@naggum.no> References: <JCOLLIE.94Mar30102934@blue.weeg.uiowa.edu> Subject: Re: Question about content models [Jeffrey C. Ollie] | What is the difference between the content models "(#PCDATA)" and | "CDATA"? It is my understanding that both define a string of | characters. Are they synonyms, or are there other differences? actually, they both allow data characters, with a minor twist: PCDATA (parsed character data) actually means that if the characters cannot be interpreted as markup, they are valid in that context as data characters. CDATA means that characters in that element are declared to be data characters and is only terminated by an end-tag open in context (a </ followed by (, >, or a letter; or / if the start-tag was "net-enabling".) this may change as the FEATURES clause in the SGML declaration changes. if prematurely ended, the error is the same as if these characters occurred in a non-CDATA element, so there's not really any surprises lurking here. after some study of the impact on CDATA and RCDATA, I have found that they serve no useful purpose, give a false sense of security that text will not be parsed, and confuse users. therefore, I recommend that people not use CDATA or RCDATA at all. if the functionality of CDATA and RCDATA is needed, it would be much better for all parties involved if this is controlled in the document instance rather than in the DTD. not only is parsing made simpler, unexpected and undesirable things happen in CDATA and RCDATA elements: users can no longer use comments, references to a character they need appear verbatim, omitted or minimized tags no longer work, etc. the functionality of CDATA and RCDATA is available with marked sections. marked sections stand out, are visible to the user at the point where they occur, and won't screw things up unexpectedly. 
furthermore, if the element's content is (#PCDATA), the user is free to use a marked section or entity references, other content may occur in the same element, and things obtain a much more elegant look when you don't have to "know" whether an element has CDATA or RCDATA contents. the application software will not know whether the content was declared CDATA or allowed to contain PCDATA, so the only party you constrain is the user. the specific thing that caused me to reject CDATA and RCDATA completely was that changing an element's content from CDATA to (#PCDATA) is not possible without also checking all instances of that element. this means that if you need more information in an element, or want a smaller granularity to what was previously just "character data", you lose, or, more precisely, your information receives a serious blow, and may not recover. that is, you may intentionally avoid fixing all the "broken" documents, and thus you pay a hefty fee for the initial short-cut of using CDATA if that data once again becomes useful. CDATA and RCDATA are dangerous and best avoided. what is gained by CDATA and RCDATA? that a few characters will not be treated as markup, that's all. if these characters occur elsewhere in the document, they will have to be replaced by entity references, or other strange ways to "hide" them from the parser. training users to escape their special characters costs enough already if they should not also have to remember that these rules have exceptions that apply only to certain elements. training them to use marked sections may take additional cost, but at least it has no impact on them if they don't use it, and if they use it, it's just an added benefit. there are several ways to hide characters from the parser. 
suppose we want to say "AT&T", here are some ways to do it:

through a character entity:
    AT&amp;T
through a numeric character reference:
    AT&#38;T
through an empty comment declaration:
    AT&<!>T
in a marked section:
    <![ CDATA [AT&T]]>
using the markup-scan-suppress code (where \ is the MSSCHAR):
    AT\&T
in an entity
    (declared):   <!ENTITY ATT CDATA "AT&T">
    (referenced): &ATT;
in a clever entity
    (declared):   <!ENTITY T CDATA "&T">
    (referenced): AT&T
by making \& a short ref delimiter:
    AT\&T

this exact same issue was discussed 1993-03-11 through --14 between John Bowe, Eric Skinner and myself. the URL wais://ifi.uio.no/comp.text.sgml?MSSCHAR will give you 5 documents all pertinent to this topic. (lynx supports WAIS searches, if your gee-whiz graphical WWW browser doesn't.)

| Is there a way to define a content model such that only certain
| characters are permitted, and a certain number of them? For example, I
| would like to define an element that contains a ZIP code (5 digits).

no.

best regards,
</erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no> | memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS | memento, vita brevis.

for information on SGML and HyTime, try ftp.ifi.uio.no:/pub/SGML first.

--------------------------------------------------------------------------

Newsgroups: comp.text.sgml
Date: 19 Jun 1994 18:39:37 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940619.3062@naggum.no>
References: <1994Jun17.174907.14944@chemabs.uucp> <19940619.3045@naggum.no> <2u1utg$fmc@news.delphi.com>
Subject: Re: SGML DTD help needed for binary data.

[Jeffrey McArthur]

| Erik Naggum writes:
| > CDATA and RCDATA should in fact not be used at all.
|
| That is a little strong.

it was reserved. closer to what I actually think, and which _is_ a little strong, is that CDATA and RCDATA mar the otherwise elegant picture of a language by providing the means to destroy central principles to the design of that language.
CDATA and RCDATA have absolutely no utility that cannot be accomplished _easier_ without them (when considering the whole picture). redundancy is bad enough, but when it becomes destructive, it is time to speak up and make people aware of the evil potential of these features. CDATA and RCDATA must not be used if you want your documents to outlive your system or software, or the current version of either. they are _bad_ for you and for your information.

now, that's a _little_ strong. further escalation available by mail.

| I would like to hear a good argument why a sort key should not be
| CDATA.

and _I_ would like to hear a good argument for why there isn't a God. how about some arguments in _favor_ of CDATA, since this is your position? the onus of proof is on he who asserts the positive.

| They should be used with care. Consider the problem of a sort key.
| Sorting can become a complex task. For example companies with numbers
| in their names may be sorted based on the spelling of the number. For
| example, 3M may sort in the T's (Three M), and 16 Magazine may sort in
| the S's (Sixteen). Other publishers may want 3M to sort before 16
| Magazine that precedes the companies that start with the letter A.

this is irrelevant to a discussion of CDATA. however, as an example, consider a name containing a small y with diaeresis, the famous ISO 8859-1 character that the reference concrete syntax of SGML in its limited wisdom decided to SHUN. it needs to be in the content of an element declared CDATA (because the DTD designer thought "we'll never need an entity reference in _this_ baby"), and, voila!, you can't use &yuml;, you can't represent your data, and you can't change the DTD because somewhere else someone used the ampersand as data where it looks like an entity reference if you allow RCDATA or PCDATA. this is destructive. this is bad karma. you should become an ex-DTD designer if you use CDATA.

RCDATA is only a little bit better.
everywhere in the document, an author can insert <!-- comments --> to leave information to others in the collaborative editing team, _except_ in RCDATA elements. the elements can't be changed to PCDATA because the elements frequently include start-tags, and changing them would have to be done manually since the information about the special parsing conditions is not localized with the affected information. this is also destructive. this is also bad karma. next life, when you're digging up mushrooms with your snout, you will regret that RCDATA element. instead, use marked sections, teach users how to avoid markup recognition, and make it work the same _everywhere_. consistency is a major win in this game. CDATA and RCDATA screw up the consistency, confuse users, and make funny remnants of markup appear in the final document unless you're _very_ careful, and no parser on the market today will tell you about this. (CDATA and RCDATA marked sections localize the special parsing information with the data, and are therefore perfectly OK, indeed essential. CDATA attributes are of a different kind of CDATA, and do in fact accept entity references, as do all attribute literals.) </Erik> -- Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no> | memento, terrigena ISO 8652 Ada/ISO 8879 SGML/ISO 9899 C/ISO 10646 UCS | memento, vita brevis ftp://ftp.ifi.uio.no/pub/SGML wais://ftp.ifi.uio.no/comp.text.sgml ----------------------------------------------------------------------- Newsgroups: comp.text.sgml Date: 19 Jun 1994 19:36:22 UT From: David Megginson <dmeggins@aix1.uottawa.ca> Organization: Department of English, University of Ottawa Message-ID: <DMEGGINS.94Jun19153622@aix1.uottawa.ca> References: <1994Jun17.174907.14944@chemabs.uucp> <19940619.3045@naggum.no> <2u1utg$fmc@news.delphi.com> <19940619.3062@naggum.no> Subject: Re: SGML DTD help needed for binary data. 
I support Erik on this point -- I do not think that CDATA or RCDATA should _ever_ be used as a declared content model for an element.

First, there are the practical problems. Consider this DTD fragment from a document about SGML (and thus, which will quote a lot of SGML literally):

    <!ELEMENT example - - CDATA>

The following document fragment would _not_ work!

    <example>
    Here is some sample SGML text with an <tag>embedded tag</tag>
    and an &entity;
    </example>

The problem is that you cannot use any end tag in a CDATA content model anyway. You _could_ use RCDATA with the appropriate entities:

    <!ELEMENT example - - RCDATA>

    <example>
    Here is some sample SGML text with a &lt;tag>embedded tag&lt;/tag>
    and an &amp;entity;
    </example>

but if your goal was to save typing, you're not gaining much, especially if the SGML sample is long. You also have to worry about entity references as well as end tags, so your text will be full of &amp; entities. The best way to do it is just

    <example>
    <![CDATA[
    Here is some sample SGML text with a <tag>embedded tag</tag>
    and an &entity;
    ]]>
    </example>

It is especially nice because it allows you to cut and paste directly from the SGML document without worrying about converting some characters to entities.

The second reason is that the DTD should not impose the choice of CDATA or RCDATA on the user. If you are writing an example using a marked section, you will have to use some entities to avoid trouble, and RCDATA will be appropriate; if you are writing an example with a lot of entities, however, RCDATA will force you to use &amp; too often. Let the DTD decide the document's _structure_ and the user decide the document's _content_.

I wonder whether it would be possible to remove the CDATA and RCDATA content models from the standard altogether, or at least, to mark them as counter-indicated.
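As a rough illustration of the two options compared above, the following Python sketch prepares a literal SGML sample either for an RCDATA element or for a CDATA marked section. The function names are invented for this example, and it assumes that the DTD in use declares the general entities `amp` and `lt`; neither assumption comes from the original post.

```python
def escape_for_rcdata(text: str) -> str:
    """Hide what is still recognized inside RCDATA content.

    Assumes (hypothetically) that the entities 'amp' and 'lt' are declared.
    """
    # Escape '&' first so the '&' of a later '&lt;' is not escaped again.
    text = text.replace("&", "&amp;")
    # Start-tags are not recognized in RCDATA, so only '</' needs hiding.
    return text.replace("</", "&lt;/")


def wrap_in_cdata_section(text: str) -> str:
    """Wrap text in a CDATA marked section, left otherwise untouched."""
    if "]]>" in text:
        # The marked section would end at the first ']]>' in the data.
        raise ValueError("text contains ']]>' and would close the section early")
    return "<![CDATA[" + text + "]]>"


sample = "some text with an <tag>embedded tag</tag> and an &entity;"
print(escape_for_rcdata(sample))
print(wrap_in_cdata_section(sample))
```

The first function has to rewrite the sample, while the second only has to check for one terminator, which is the cut-and-paste advantage the post describes.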
David

--
David Megginson
Department of English, University of Ottawa,    dmeggins@aix1.uottawa.ca
Ottawa, Ontario, CANADA  K1N 6N5                dmeggins@acadvm1.uottawa.ca
Phone: +1 613 564 6850 (Office)                 ak117@freenet.carleton.ca
       +1 613 564 9175 (FAX)

----------------------------------------------------------------------

Joe English on CDATA as declared content

Message-id: <9607240020.AA11739@trystero.art.com>
From: Joe English <joe@trystero.art.com>
To: www-html@w3.org
Date: Tue, 23 Jul 1996 17:20:33 PDT
Subject: Cougar DTD: Do not use CDATA declared content for SCRIPT

The 12-July-1996 draft of the "Cougar" HTML DTD [1] declares:

    <!ELEMENT SCRIPT - - CDATA -- script statements -->

This will not work. In particular, the use of CDATA declared content is incompatible with JavaScript (which, I presume, will be one of the primary scripting languages used in HTML documents).

The main reason for this is that the arguments to JavaScript's 'document.write()' method [2], which inserts text and HTML markup into a document, may contain end-tags, e.g.:

    <SCRIPT>
    document.write("<H1>", "Foo", "</H1>")
    </SCRIPT>

Elements with CDATA declared content cannot contain any sequence of characters that "looks like" an end-tag -- ETAGO (</) followed by a letter -- since that will prematurely terminate the element. There is no way around this; it is a fundamental problem with CDATA declared content.

Here are a few alternatives:

1) Use <!ELEMENT SCRIPT - - (#PCDATA)>, and require all occurrences of '<', '&', and '>' in the content to be replaced with '&lt;', '&amp;', and '&gt;'. This is more consistent with the rest of HTML.

2) Use <!ELEMENT SCRIPT - - (#PCDATA)> and add browser support for CDATA marked sections:

    <SCRIPT><![ CDATA [
    document.write("<H1>", "Foo", "</H1>")
    ]]></SCRIPT>

This is the approach favored by most other SGML applications.
3) Allow scripts to be included by external reference: <SCRIPT SRC="http://www.foo.com/myscript.js"></SCRIPT> This approach may increase network latency, but has the advantage of better backward-compatibility with SCRIPT-unaware user agents. * * * CDATA declared content is in general a bad idea (it should not be used for STYLE either, and IMO the XMP and LISTING elements should be removed entirely.) Of all of SGML's broken features, CDATA declared content is among the worst. For more details, please refer to the relevant entries on Robin Cover's SGML Web Page [3] under "Other Grammar/Parsing Issues and FEATURES" [4]. Many of the issues brought up there are not particularly relevant to the Web, though there are other problems with CDATA declared content that make it especially dangerous for HTML. [I've expounded on this before on html-wg, but because of the current lack of a working archive I can't cite references :-(] Two things that come to mind are that the presence of *any* element with CDATA or RCDATA declared content in the HTML DTD makes it much more difficult to write a Web search engine -- it becomes necessary to parse against the DTD instead of simple lexical scanning, e.g., with tools like Dan Connolly's lexical analyzer [5] -- and that it greatly increases the amount of SGML knowledge necessary for authors to construct a valid document including such elements. 
[1] <URL: http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd >
[2] <URL: http://www.netscape.com/eng/mozilla/2.0/handbook/javascript/ref_t-z.html#write_method >
[3] <URL: http://www.sil.org/sgml/ >
[4] <URL: http://www.sil.org/sgml/topics.html#miscGrammar >
[5] <URL: http://www.w3.org/pub/WWW/TR/WD-sgml-lex >

--Joe English

  joe@art.com

-----------------------------------------------------------------------

From: Stephen Dixon <stephen@gp.co.nz>
Newsgroups: comp.text.sgml
Subject: CDATA and RCDATA understand confused
Message-ID: <1993Jun23.093319.1796@gp.co.nz>
Date: 22 Jun 1993 21:33:19 UT
Summary: END-TAG error in CDATA, RCDATA declared content.
Keywords: ETAGO, CDATA, RCDATA, GI's
Distribution: world
Organization: GP Print Ltd, Wellington, New Zealand
Lines: 39

Greetings All.

I seem to have got myself confused somewhere with regard to CDATA and RCDATA. I had thought that it was possible to have an element "x" whose content was declared CDATA or RCDATA in which could occur start-tags (and associated end-tags) that may or may not be in the DTD. The parser would look at these and only recognise that end-tag which started it.

Example text:

    <x>
    <s><st>Title type text</st>
    <ss>text a plenty ...</ss></s>
    ...
    </x>

Errors are produced for the end-tag </st>, </ss> and </s>. Correct? (Both parsers I have access to agree by producing errors for those three end-tags (and other end-tags within the <x> element).)

Using an entity like &etago; for the end-tag open is not an option as the end-tag needs to be seen (like the start-tag) so that the following processing application can use it. In this case the elements <s>, <st> and <ss> are defined, but they may represent GI's that are not defined as elements in the DTD.

How can this be achieved? Have I gone wrong? Where?

Apologies if this has been discussed before.

--
  _____   Regards: Stephen J. Dixon
 /    /   Thoughts etc. my own.
/ GP /    email: stephen@gp.co.nz
/____/    SGML: Seeking to understand it better.
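The behaviour reported in the two postings above can be approximated in a few lines of Python. This is only a sketch of the recognition rule as the posters describe it: inside CDATA declared content, the first ETAGO (</) followed by a letter is taken as an end-tag, whatever name follows it. The function name is invented here, and the sketch deliberately ignores the empty end-tag and null end-tag forms that Erik Naggum mentions.

```python
import re

# ETAGO followed by a name: a letter, then letters, digits, '.' or '-'
# (the name characters of the reference concrete syntax).
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.\-]*)")


def first_end_tag(content: str):
    """Return (offset, name) of the first thing that looks like an end-tag.

    In CDATA declared content this is where the element would be ended,
    regardless of whether the name matches the open element.
    Returns None if nothing in the text looks like an end-tag.
    """
    match = END_TAG.search(content)
    if match is None:
        return None
    return match.start(), match.group(1)


script = 'document.write("<H1>", "Foo", "</H1>") </SCRIPT>'
dixon = "<s><st>Title type text</st> <ss>text a plenty ...</ss></s> ... </x>"

print(first_end_tag(script))  # the name found is H1, not SCRIPT
print(first_end_tag(dixon))   # the name found is st, not x
```

In both samples the first match is an inner end-tag rather than the one that was meant to close the element, which is why the parsers complain.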
------------------------------------------------------------- Newsgroups: comp.text.sgml From: ers@xgml.com (Eric R. Skinner) Subject: Re: 7.6.1 Record Boundaries revisited Message-ID: <1993Feb6.183053.6512@xgml.com> Organization: Exoterica Corporation References: <19930201.007@erik.naggum.no> Date: 06 Feb 1993 18:30:53 UT Lines: 441 In article <19930201.007@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes: >A request for interpretation of ISO 8879 7.6.1 (Record Boundaries) has been >submitted to ISO/IEC JTC 1/SC 18/WG 8 from the CALS Industry Standards >Groups for the revision of SGML. I thought this might be interesting for >readers of this newsgroup, and have also included my comments at the end. [This article was previously posted with editing errors, cancelled and re-posted. Apologies to any who received both posts.] The following is Exoterica's commentary on record boundary processing, as stated in Exoterica document ETR13-0792. I am providing it here for information purposes, as it describes Exoterica's interpretation of clause 7.6.1. This document was made available to the participants of the Techdoc discussion, and has been submitted to ISO WG8, although I do not know the document number assigned to it at this time. If you redistribute this article, please retain all copyright notices. Hope this helps... --------- RECORD BOUNDARY PROCESSING IN SGML ETR1307-92 Exoterica Corporation 1545 Carling Ave. Suite 404 Ottawa, Ontario Canada K1Z 8P9 Tel 613-722-1700 Fax 613-722-5706 Information in this document is subject to change without notice and does not represent a commitment on the part of Exoterica Corporation. Change History: Release 1 23 July 1992 PROPRIETARY RIGHTS NOTICE All rights reserved by Exoterica Corporation. No part of this material may be reproduced, translated or transmitted in any form or by any means, electronic, mechanical, or otherwise, including photocopying and recording, WITHOUT INCLUDING ALL COPYRIGHT NOTICES AND THIS NOTICE. 
DISCLAIMER Exoterica Corporation makes no representation or warranty, express or implied with respect to this publication or the programs or information described therein. In no event shall Exoterica Corporation, its employees or contractors be liable for specific, indirect, or consequential damages. (c) Exoterica Corporation, 1992 All rights reserved. 1 Record Boundaries 2 Record End Handling 3 ESIS 4 Record Start Handling 5 CDATA and RCDATA 6 Character References 7 Conclusion RECORD BOUNDARY PROCESSING IN SGML This paper explains Exoterica's interpretation of ISO 8879 with respect to record boundary handling and why we decided to do things the way we do. Clause 7.6.1 of ISO 8879, which describes the handling of record boundaries, contains some difficulties in the way it is written. We have tried to interpret it and implement our SGML parser in a manner consistent with the wording of both this clause and others in ISO 8879, as well as in a manner that provides the maximum benefits to our users. 1 Record Boundaries The SGML standard describes text files in terms of "records", which are bounded by record-start (RS) and record-end (RE) characters. This is not the only way of looking at text files, and is not the way text files are stored by most computer systems. In fact, ISO 8879 does not require a document to be organized in records. SGML's record-oriented view does, however, have the advantage of providing a model for describing precisely what a "blank record" is as well as providing a basis for supporting the short reference minimization feature of SGML. Exoterica has traditionally embedded its SGML parser in its products in a way that supports the standard's description of text records. No matter what computer system you are running on, the Exoterica software gives the SGML parser each text record terminated by an RE and with the following text record started by an RS. 
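The normalization described above might look roughly like the following Python sketch. It is not Exoterica's code; the function name is made up for illustration, and the character numbers for RS and RE are those of the reference concrete syntax (10 and 13).

```python
RS = "\n"  # record start: character 10 in the reference concrete syntax
RE = "\r"  # record end:   character 13 in the reference concrete syntax


def to_records(text: str) -> str:
    """Present every line to the parser as RS + line + RE.

    str.splitlines() accepts LF, CR LF and CR line endings (and a few
    rarer separators), so the result is the same on every platform.
    """
    return "".join(RS + line + RE for line in text.splitlines())


unix_text = "first line\nsecond line\n"
dos_text = "first line\r\nsecond line\r\n"

# Both platform conventions yield the same stream of records.
print(repr(to_records(unix_text)))
print(to_records(unix_text) == to_records(dos_text))
```

Because each record carries both boundary characters, short references that contain RS or RE behave the same regardless of how the host system stores lines.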
We do this for a number of reasons: - Short reference minimization is an important part of many SGML applications and short references that include RE and RS need to be available on all systems. - Our software must exhibit consistent behavior on all of the large variety of platforms on which it runs. Consistent behavior is required if data interchange is to be practical. - On computer systems that use the ASCII line-feed character to delimit text records (i.e. unix), the literal interpretation of this character as an RS would cause its being discarded and would result in lines of text that were run together (i.e. if line-feeds in unix files were discarded by the SGML parser, a word at the end of a line would be run together with the word at the start of the following line). - The task of an SGML system is to process the text submitted to it, not just to copy or discard the text. This processing includes recognizing and processing the starts and ends of lines of text. Exoterica's new products, being released this autumn, support user control over this behavior, while providing the traditional behavior in the default case. We expect this control to be used only in the support of wildly variant syntaxes (unusual uses of the SGML Declaration) and in supporting consistency with the behavior of other vendors' products. In particular, we expect that most users will continue to want consistency across platforms, and full access to SGML's minimization features, and will therefore want to take advantage of the traditional behavior. 2 Record End Handling The poor wording of Clause 7.6.1 in ISO 8879 results in an ambiguity as to when REs are to be "ignored". In the present absence of a revision to ISO 8879 that resolves this difficulty, we, and other implementors, have had to come up with our own resolutions based on trying to maintain consistency with other parts of ISO 8879 and providing the kind of behavior that users expect. 
The ambiguity revolves around the inadequately defined term "ignored". Ignored presumably means that a character is discarded at some point in processing. The ambiguity is in determining when a character (an RE or RS) is ignored: - REs and RSes are clearly not ignored a priori (prior to the SGML parser seeing them) because they can be recognized as short reference delimiters or parts thereof. So the discarding has to be done at some later point in processing. - REs "attributed to markup" and RSes are ignored prior to the data in the parsed data in the SGML document being returned to the application, because the ignoring has to be done at some time - The critical issue is whether REs attributed to markup and RSes are ignored instead of, or in addition to their being recognized as data. If they are recognized as "parsed character data", then they are available for matching with #PCDATA content tokens in the content models of mixed content elements (elements that have content models that contain one or more #PCDATA content tokens), and can cause errors if parsed character data is not allowed by a content model (i.e. a #PCDATA content token is not available to be matched with). There are a number of considerations that lead us to believe that ISO 8879 says that all REs in the content of mixed content elements are recognized as parsed character data, independent of whether they are also "ignored" (RSes are discussed in more detail later on in this paper): - REs, RSes, spaces and other white-space characters are recognized as "s separators" in element content (production [26]) and are ignored by the provisions of Clause 6.2.1. The recognition of white-space characters in element content is done as part of the recognition of data and markup (it is provided for in the grammar of SGML). The provisions of Clause 7.6.1 are only required, therefore, to deal with white-space characters that are not distinguished from data characters by the grammar of SGML. 
- The text of Clauses 7.6 and 7.6.1 never describe the REs to be ignored as "markup" (it refers to REs "attributed" to markup). The last sentence of the note in Clause 7.6 describes any SGML character in mixed content that is not recognized as markup to be data. Therefore, characters in mixed content (but not in element content), that are not recognized as markup, are data. Similarly, production [25] does not allow for an RE or RS separately, but only as a "data character". - In the first note in Clause 7.6.1, it is recognized that the rules preceding the note only produce the "intuitive results" when "data can occur anywhere in the content of an element". An example is provided of a content model that would produce unintuitive results: (x, #PCDATA) It is stated that the rules for "ignoring" REs notwithstanding, an RE prior to the element "x" would produce an error. This could only be so if the RE were treated as parsed character data and an attempt were made to match it with a #PCDATA content model token prior to its being ignored. The note in Clause 11.2.4 expands on the explanation of the potential difficulty caused by this and similar models, and recommends against their use. - The note in Clause 11.2.4 states that "separator characters, which are recognized as separators in element content, are treated as data in mixed content." This statement would seem to be quite explicit, but there is, however, a difficulty: the definition of "separator characters" (definition 4.277) is written so that it is not clear whether SPACE or RE characters are intended to be separator characters, and the proposed revision (WG 8 paper N 1035) makes it explicit that "separator characters" are members of the SEPCHAR class (i.e. TAB in the Reference Concrete Syntax), and do not include RE, RS or SPACE. On the other hand, the statement quoted clearly applies to SPACE characters equally well as to SEPCHAR character. 
Furthermore, the example that immediately follows the statement describes the difficulty caused by RE characters occurring in the locations described in the statement. It would therefore seem that the statement is intended to apply to characters that can occur in "s separators" (RE, RS, SPACE and SEPCHAR class characters) rather than just SEPCHAR characters. - A similar problem to that of leading REs occurs within elements. An example is the following declarations: <!ELEMENT a - O (b, c?, #PCDATA)> <!ELEMENT (b, c) - - (#PCDATA)> and the following use of these elements: <a><b>data</b> <c>more data ... As there is no provision for discarding the RE following the </b> in ISO 8879, it is therefore matched with the #PCDATA content model token in the element "a" and the following start tag for "c" is in error. Note that this is basically the same problem as occurs with REs at the start of mixed content elements, but that it has no simple resolution of the "always ignore leading white-space" variety. Resolution of the difficulty introduced by the RE in the example would require looking ahead to the <c> start-tag. Look-ahead over a large number of characters could be required if there were a large amount of white space or the presence of comments intervening between the white-space and the <c>. ISO 8879 is quite strict in requiring delimiters and other constructs to be recognized with a minimum of look-ahead. Such a resolution would require breaking this provision of the standard. In the absence of an alternative straight-forward resolution, treating the case in the example as an error seems the only reasonable action. As a consequence of this consideration and of the desire to treat all REs in mixed content in as consistent as possible a fashion, REs cannot be allowed in all contexts in mixed content. The fact that this resolution of the ambiguity produces results that are not "intuitive", is recognized in the notes in Clauses 7.6.1 and 11.2.4. 
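The failure in the (b, c?, #PCDATA) example can be reproduced with a toy matcher. The Python sketch below is an assumption-laden simplification, not a real SGML content model engine: it handles only a flat sequence group, does no look-ahead, and represents any run of data (including the RE after </b>) as the item "#PCDATA". All names in it are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    name: str
    optional: bool
    repeatable: bool


# (b, c?, #PCDATA): #PCDATA may match zero or more runs of data.
MODEL_A = [
    Token("b", False, False),
    Token("c", True, False),
    Token("#PCDATA", True, True),
]


def first_error(model: List[Token], items: List[str]) -> int:
    """Return the index of the first unmatched item, or -1 if all is well.

    If the input ends while a required token is still unmatched,
    len(items) is returned.
    """
    pos = 0
    satisfied = False  # has model[pos] already matched at least once?
    for i, item in enumerate(items):
        # Move past tokens that may be skipped until one matches the item.
        while pos < len(model) and model[pos].name != item:
            if not (model[pos].optional or satisfied):
                return i
            pos += 1
            satisfied = False
        if pos == len(model):
            return i
        if model[pos].repeatable:
            satisfied = True
        else:
            pos += 1
            satisfied = False
    # At end of input, every remaining token must be skippable.
    while pos < len(model):
        if not (model[pos].optional or satisfied):
            return len(items)
        pos += 1
        satisfied = False
    return -1


# <a><b>data</b> RE <c>...  : the RE is matched to #PCDATA, so <c> fails.
print(first_error(MODEL_A, ["b", "#PCDATA", "c"]))
# <a><b>data</b><c>...</c> text : no RE between </b> and <c>.
print(first_error(MODEL_A, ["b", "c", "#PCDATA"]))
```

Once the RE has been matched against the #PCDATA token there is nothing left in the model for c, so the error lands on the start-tag for c, as the paper says.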
Clause 11.2.4 further strongly recommends against the use of mixed content models that, in particular, do not allow #PCDATA at their start. The very existence of these notes indicates that thought was given to the mechanism and consequences of RE processing. What does not seem to have been realized is that more than clarification of the issues was required: the normative text (i.e. not the notes) should have been written to be clear and unambiguous as to the required processing. Although the behavior of Exoterica's SGML parser with respect to REs is not always what the user would desire, it does seem to be the approach most consistent with the wording of ISO 8879. Any other approach would have to ignore ISO 8879 in a number of clauses. Implementors of standards are not free to ignore parts of a standard and still claim conformance for their products, so we feel obliged to take this approach. The text that most strongly supports the approach we have taken is in notes in ISO 8879. Notes are not normative text (i.e. the text in notes and non-normative annexes is not officially part of an ISO standard, and should not contain any of the requirements of a standard). However, in the absence of any other source of resolution of an ambiguity in the normative text, the non-normative text of the standard is the best place to look for a resolution that can be accepted by most users of the standard. 3 ESIS The preceding discussion of RE handling seems to conclude that REs are both ignored and not ignored. This is not as bizarre a conclusion as it may at first seem as there are many examples in SGML of text that affects parsing but is not part of the document as seen by the application receiving the results of SGML parsing. For example, an RE on a line of text that contains a comment declaration is ignored if the line only contains the comment, but is not ignored if either the line also contains other data characters or if the line does not contain the comment declaration. 
In effect, the presence or absence of the comment declaration, that is not part of the tagging or text of the SGML document, influences how the SGML parser processes parsed character data. Characters can be both ignored and not ignored because the view of an SGML document as seen by an SGML parser is different from the view of an SGML document as seen by an application that receives a parsed document from an SGML parser. An SGML parser sees all the markup, REs and RSes and other text of the document. An application sees the effects of the tags in the returned document structure and the data that is not "ignored". This two-fold view of parsed documents is fundamental to all data description languages. It has been formalized for SGML in the definition of ESIS (Element Structure Information Set), which has been developed by ISO/IEC JTC 1 SC 18 WG 8, the ISO group responsible for the maintenance of ISO 8879. In the same way that ISO 8879 now defines the form of an SGML document, ESIS defines "the set of information that is acted upon by implementations of structure-controlled applications", an application that "only operates on the element structure that is described by SGML markup, never on the markup itself." A definition of ESIS is contained in (a normative annex) in the newly approved (July 1992) ANSI standard, "Conformance Testing for Standard Generalized Markup Language (SGML) Systems". REs attributed to markup are "ignored" by the ESIS view of a document, but not in the document as viewed by an SGML parser. 4 Record Start Handling Determining how RSes should be processed is subject to the same difficulties as for REs, except that in the case of RSes, there are not the helpful notes that there are for REs. On the one hand, there is the simple statement in Clause 7.6.1 that "if an RS in content is not interpreted as markup, it is ignored." 
On the other hand:

- The provisions of the grammar (productions [25] and [26]) and the last sentence of the note in Clause 7.6 apply equally well to RSes as to REs: RSes are parsed data characters. In addition, the first sentence of Clause 7.6.1 (quoted in the paragraph above) refers to RSes "not interpreted as markup", strengthening the applicability of the note in Clause 7.6 to RSes.

- The wording used to describe what happens to an RS is the same as that for a leading or trailing RE: "it is ignored". There is no indication that RSes should be ignored at a different phase or in a different manner than REs attributed to markup.

- The note in Clause 11.2.4 also applies equally to RSes as to REs. In particular: "separator characters, which are recognized as separators in element content, are treated as data in mixed content."

The Exoterica SGML parser matches all RSes in mixed content to #PCDATA tokens prior to (or as well as) discarding them simply because ISO 8879 seems to indicate that RSes are handled in the same manner as REs attributable to markup. If ISO 8879 were more straightforwardly written, the first sentence in Clause 7.6.1 could be taken out of context and at face value, but the interpretation of most other parts of the standard requires substantial reading in context, and an exception cannot be justified in this one case.

The argument in favor of the current behavior of the Exoterica SGML parser with respect to RSes is not as strong as that for REs. It is more difficult to make a categorical statement as to which way to go with RSes. However, the balance seems to be on the side of treating RSes in mixed content as parsed character data, matching them to #PCDATA content tokens, and reporting any errors that may result.
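Editorial sketch (not part of the original paper): the position argued above, that record boundaries are matched against #PCDATA content tokens during parsing before any decision to discard them, can be illustrated with a toy matcher. Everything here is an assumption made for illustration: the function name first_error, the token spelling ("RS", "RE", "text:..." and bare element names standing for whole subelements), and the restriction to models that use only the SEQ connector. It is not how the Exoterica parser is implemented.

```python
PCDATA = "#PCDATA"
RECORD_BOUNDARIES = {"RS", "RE"}


def first_error(model, tokens):
    """Return the index of the first token that violates `model`, else None.

    `model` is a list of content tokens joined by the SEQ connector, for
    example ["b", "#PCDATA"].  Each element token must occur exactly once;
    "#PCDATA" matches any run of data, including none.  Whether all required
    elements were eventually supplied is not checked.
    """
    pos = 0
    for index, tok in enumerate(tokens):
        # Record boundaries are classified as data for matching purposes.
        is_data = tok in RECORD_BOUNDARIES or tok.startswith("text:")
        want = PCDATA if is_data else tok
        # #PCDATA may match nothing, so step past it when an element arrives.
        while pos < len(model) and model[pos] == PCDATA and want != PCDATA:
            pos += 1
        if pos == len(model) or model[pos] != want:
            return index
        if want != PCDATA:
            pos += 1  # an element token is used up; #PCDATA may repeat
    return None


# Content of <a> when each tag sits on its own line: RE RS <b>...</b> RE RS
print(first_error(["b", PCDATA], ["RE", "RS", "b", "RE", "RS"]))
print(first_error(["b", PCDATA], ["b", "RE", "RS", "text:more", "RE"]))
```

The first call reports index 0: the RE directly after the start-tag is data, and the hypothetical model (b, #PCDATA) does not allow data before b. The second call returns None because every record boundary falls where #PCDATA is allowed. Whether those boundaries later reach the application is a separate question that this sketch does not model.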
Because the Exoterica products "normalize" text prior to its being submitted to the SGML parser (all record boundaries are converted to include both RE and RS), any difficulties caused by the type of content model recommended against in the notes in Clauses 7.6.1 and 11.2.4 of ISO 8879 have been primarily caused by out-of-place REs. The difficulty is only ever ascribable to RSes when explicit &#RS; character references are used.

5 CDATA and RCDATA

The arguments that apply to mixed content apply equally well, one way or the other, to the content of CDATA (character data) and RCDATA (replaceable character data) elements, and the content of such elements should be processed in the same manner as that of mixed content elements. The difficulties encountered in mixed content do not occur in these cases, however, as they cannot contain any markup other than entity and character references (in RCDATA).

6 Character References

Clause 9.5 states that the replacement character for a character reference is treated as though it were entered directly. Therefore, &#RE; and &#RS; must be ignored in exactly the same manner (in particular, at the same phase of processing) as if the corresponding characters were entered in the document. In particular, they must be discarded if RE or RS would be discarded in the same context.

7 Conclusion

Implementors of computer software based on national or international standards must implement the standard as written, where at all possible. ISO 8879 fails to be clear and precise in its description of record boundary processing, but it does seem to have a distinct intent, as follows:

- The term "ignored" in Clause 7.6.1 means ignored (or discarded) with respect to ESIS information only. The parsing process cannot ignore the presence of an RE or RS.

- An RE or RS in the content of a mixed content element (i.e.
one that allows #PCDATA anywhere) is treated as is any other parsed data content for the purpose of SGML parsing, even though it is discarded for ESIS purposes.

There are two classes of difficulties associated with record boundary processing in SGML: a standard that is unnecessarily difficult to read, and different text line delimiting conventions in different computer systems. These difficulties can be reduced in the following ways:

- The text of ISO 8879 should be revised so as to be clear and precise as to the required processing of record boundaries. The current WG 8 project to revise ISO 8879 provides an opportunity to do this, although it will probably take two to four years to complete the revision cycle. SGML system implementors, SGML system users and other standards incorporating SGML have to decide what to do in the meantime.

- The CALS standards describing how documents are transmitted between computer systems, particularly the latest members of the MIL-M-28001 and MIL-M-1840 series, should specify how text records are to be represented in explicit terms. The best approach would be to specify that text records are explicitly bounded by record-start and record-end characters when transmitted, no matter what the conventions of the transmitting and receiving systems are (i.e. placing characters for both RE and RS between text records), although any specification in this regard would be better than none. Requiring both REs and RSes at text record boundaries would have the additional advantage that an RE character preceding each RS would simplify the issues involved in discussing record boundary handling.

Using the CALS standards as a vehicle to resolve record boundary handling difficulties is practical, because the CALS standard revision cycle is usually considerably shorter than the ISO cycle.

---------
-- Eric R.
Skinner ers@xgml.com Exoterica Corporation Tel +1 613 722 1700 Ottawa, Canada Fax +1 613 722 5706 ----------------------------------------------------------------------- Newsgroups: comp.text.sgml From: Erik Naggum <SGML@ifi.uio.no> Message-ID: <19930314.002@erik.naggum.no> Date: 14 Mar 1993 02:42:37 UT Editing-Time: 115 min References: <1993Mar11.010814.6386@osf.org> <1993Mar12.223624.26543@xgml.com> Subject: Re: CDATA content and end-tags as data Lines: 152 Some additional comments to a generally sound reply from Eric R. Skinner. [John Bowe] : | What are the rules for including tags in CDATA and RCDATA content? Briefly stated, you cannot include a valid end-tag in character data and replaceable character data content because doing so would terminate the content (prematurely). | I could not find an explicit statement about this in The SGML Handbook. See B.8.3 Unparsable Sections [50:15], which discusses marked sections with the "CDATA" or "RCDATA" status keyword. | What I did find was something about the character data ending when the | parser finds an ETAGO followed by a valid name (does that mean *any* | valid name?). This is actually a complication of the rules, which does not include the "valid name" part, so your parenthetical question is not relevant. | I also found something about the char data ending when the *matching* | end tag is found. B.13.1.1 [59:1] discusses this: "Only the correct end-tag (or that of an element in which this element is nested) will be recognized." However generally useful, this is in conflict with the requirements in 7.6. ISO 8879, clause 7.6, paragraph 2: The |content| of an element declared to be |character data| or |replaceable character data| is terminated only by an ETAGO delimiter-in-context (which need not open a valid |end-tag|) or a valid NET. Such termination is an error if it would have been an error had the |content| been |mixed content|. [321:6-11] [Eric R. 
Skinner] :

| In other words, anything that starts to look like an end tag (ie. an
| ETAGO followed by a name start character) will cause the end of the
| CDATA element.  If the end tag is invalid an error will also be
| generated.

An ETAGO delimiter-in-context is not only an ETAGO followed by a name start character, though. If SHORTTAG YES is specified, a TAGC satisfies the contextual constraints (Clause 9.6.2), and if CONCUR YES is specified in the SGML declaration, a GRPO is enough. In the reference concrete syntax, this means that "</>" and "</(" will also terminate such an element when the respective features are in use. (Note that there is no contextual constraint on the GRPO, which it seems that there should have been.)

| Partly, this is to allow the end tag of an enclosing element to end the
| CDATA element;

Right, but note that this is true even if the end-tag for the element is not declared minimizable (something that tends to confuse people).

| it's also necessary owing to SGML's token lookahead restrictions.

Hmmm.

[John Bowe] :

| So, besides using (#PCDATA) as the content and
|
| <![ CDATA [ <p>This is an example paragraph.</p> ]]>
|
| how does one enter markup itself (ie. end tags) as data?  Should I be
| able to put anything I want (besides the obvious end-tag) as the
| content of a CDATA element?

I would avoid (replaceable) character data content for examples of SGML markup, though, and instead use it for content within which I do not want any other elements or markup recognition, such as a "trademark" or "abbreviation" element within which "&" is not recognized: <tm>AT&T</>.

| What are your favorite ways for doing this?

[Eric R. Skinner] :

| The only two characters that cause problems are < and &.

In CDATA content, "<" and "&" do not cause problems, but the sequence "</" might. In RCDATA content, "&" might additionally cause problems.

| You can "protect" these characters in a number of ways.
|
| 1. The clumsy entity reference expansion way, ie.
|    AT&amp;T
| 2. With a null declaration, ie. AT&<!>T.  This is better in
|    that no change to the DTD is required.
| 3. Using a short reference string, to allow this: AT\&T.
|    To do this you must allow "\&" as a short reference delimiter,
|    then map "\&" to "&" in the necessary contexts.  Tricky on the
|    implementation side but the hands-down winner for clarity of
|    markup.

Hmmm, if making no changes to the DTD is beneficial, adding a short reference string and declaring it in all short reference maps will be a major set of modifications. I think I'd go for the "clumsy entity reference expansion way" over this, but that may also be because I consider the backslash an annoyance, and because it might lead people to think that a backslash is somehow an escape character with general applicability. Getting the short reference maps right may also be tricky, and they do not apply in RCDATA content or in attribute values. But see below.

Here are my favorite ways, in no particular order:

o Always use a space or other punctuation before and after a literal "&" that could be misinterpreted.

o Define CDATA entities for trademarks, so that if you later need to make a list of them, they're in one place. You can also define a "trademark" element, and change all the trademark entities at once.

o Use a multicode syntax and use an MSSCHAR (markup-scan-suppress character) before literal characters. (If the backslash is not used for data, declaring it as a function character would give you the benefit of the escape character without the implementation costs of more short reference strings. Note that both schemes require changes to the concrete syntax. The standard does not say whether two MSSCHARs following each other will result in the latter being treated as a data character, but I have implemented it that way.)

o Use CDATA marked sections (not a good idea for small parts, though).

o Use entity references or numeric character entity references.
A good DTD already includes an entity declaration for these characters. "amp" and "lt" are the usual entity names. | Of course, if you are writing a document with many examples of | SGML, you should consider using the SGML declaration to change | various delimiters to non-reference values, to allow you to | type SGML examples as you please. Hmmm. I consider this not so sound advice. A CDATA marked section is much simpler to deal with than changing the delimiters, but it is of course possible to change them. I have only changed the delimiters (to control characters) for processing mail and news articles, where this and extensive use of short references made it unnecessary to modify the input files in any way. Needless to say, "<" and "&" cannot be "magic" in existing material that was not intended for SGML processing. Best regards, </Erik> -- Erik Naggum ISO 8879 SGML +47 2295 0313 Oslo, Norway ISO 10744 HyTime <erik@naggum.no> ISO 9899 C Memento, terrigena <SGML@ifi.uio.no> ISO 10646 UCS Memento, vita brevis ----------------------------------------------------------------------- Newsgroups: comp.text.sgml From: Erik Naggum <SGML@ifi.uio.no> Message-ID: <19930201.007@erik.naggum.no> Date: 01 Feb 1993 07:22:18 UT Subject: 7.6.1 Record Boundaries revisited Lines: 345 A request for interpretation of ISO 8879 7.6.1 (Record Boundaries) has been submitted to ISO/IEC JTC 1/SC 18/WG 8 from the CALS Industry Standards Groups for the revision of SGML. I thought this might be interesting for readers of this newsgroup, and have also included my comments at the end. The text of the letter reads as follows (lines having a | in column 1 have been very slightly edited, but no content has been changed): 1992 September 15 | To: ISO/IEC JTC 1/SC 18/WG 8 During the evaluation of an SGML document (including both DTD and document instance) submitted to the Air Force CALS Test Bed, an error was reported during the parsing operation. 
The DTD was parsed without error using the Exoterica, Datalogics and SoftQuad parsers. When the SGML document instance was parsed, all three parsers reported errors on the same lines. During the detailed evaluation of the DTD and the document instance by the Air Force CALS Test Bed and Exoterica, the source of the error was found to be invalid placement of RE and RS characters in the document instance. The document instance had the end of one tag on one line and the start of another on the next line. The construction of the DTD did not permit this as the end of line character was seen as data. Discussions with the company submitting the file indicated an interpretation problem in ISO 8879, clause 7.6.1.

A discussion by interested people was held at TechDoc in San Francisco on 27 August 1992. A general consensus on the meaning of the clause was reached, but the feeling was expressed that a written clarification is necessary to ISO 8879, clause 7.6.1. Two other recommendations were made as well.

The data in question involved a file in which all desired characters, including record-start and record-end characters, were transmitted as variable-length records. The file resulting from the dechunking used at CTN equated each chunk with an input line, and added the character pair ASCII 13 ASCII 10 which the CALS 28001A SGML declaration identifies as RE and RS characters. Neither of those characters was contained in the document instance before chunking. Furthermore, the chunk-to-line mapping convention was not stated anywhere in 1840A, nor in the ANSI tape standard it referenced for variable-length records, nor in 28001A.

The three recommendations of the TechDoc discussion group are:

1. Clarify the wording of ISO 8879, clause 7.6.1. When ISO 8879 is revised, consider modifying its intent, as discussed below. (ISO 8879 Action)

2.
Since 1840B recognizes the difficulties introduced by variable-length chunking and eliminates such chunking, it should include a comment on the use of RS and RE characters. In particular, it should clearly state that the system preparing document instances for transmission is responsible for including all desired RS and RE characters, and that the receiving system should not add such characters as the result of processing tape records. (CALS Digital Standards Office (CDSO) Action)

3. The DTD should be modified to avoid mixed content models where data cannot occur anywhere in the content of the relevant elements.

The issue needing clarification in Clause 7.6.1 of ISO 8879 is whether an SGML parser must determine that an RE is data before deciding to ignore it under the conditions of this clause. While the consensus of the meeting was that this is indeed the case, there was concern that the wording of the Standard is difficult to interpret. Furthermore, this interpretation leads to unfortunate cases in which a parser must report an error due to the presence of an RE which, if allowed, would be ignored. When ISO 8879 is revised, the group recommends that this possibility be eliminated.

The problems arise in mixed content models where data cannot occur
everywhere (e.g., if the model is not repeatable or uses a connector
| other than OR).  For example, consider the declarations:

<!ELEMENT a - - (b , #PCDATA)>
<!ELEMENT c - - (d | #PCDATA)>

and the document instance segments

<a>
<b>contents</b>
</a>

and

<c>
<d>contents</d>
| </c>

Assume that each of these lines begins with an RS and ends with an RE. In the first case, data is not permitted immediately following the <A> start-tag. Thus, the RE after that tag is invalid, even though it would be ignored if data were permitted in that context. In the second case, data is allowed after the <C> start-tag.
However, the presence of data means that the #PCDATA branch of the OR-group has been selected, and that the C element may not therefore also contain a D. Thus, the RE after the <C> start-tag indicates that the #PCDATA branch has been selected. This RE is ignored because it is the first one in the element and is not preceded by an RS, data, or proper subelement. The <D> start-tag is invalid, because the C cannot contain a D.

Although the situation did not arise in the particular document that triggered this discussion, a question was raised about RS and RE characters in CDATA and RCDATA elements. Should the first RE in such an element be ignored if no RS, data, or proper subelement preceded it? Should the last RE in such an element be ignored if no data or proper subelement follows it? The TechDoc meeting concluded that these REs should be ignored according to the above mentioned interpretation of 7.6.1; that is, there should be no difference for CDATA and RCDATA content versus other kinds of content with respect to 7.6.1.

George Elwood
Senior Systems Engineer
Air Force CALS Test Bed

For reference, the clause in question:

7.6.1 Record Boundaries

If an RS in |content| is not interpreted as markup, it is ignored.

Within |content|, an RE remaining after replacement of all references and recognition of markup is treated as data unless its presence can be attributed solely to markup. That is:

a) The first RE in an element is ignored if no RS, data, or proper subelement preceded it.

b) The last RE in an element is ignored if no data or proper subelement follows it.

c) An RE that does not immediately follow an RS or RE is ignored if no data or proper subelement intervened.

In applying these rules to an element, subelement content is ignored; that is, a proper or included subelement is treated as an atom that ends in the same record in which it begins.
An RE is deemed to occur immediately prior to the first data or proper subelement that follows it (that is, after any intervening markup declaration, processing instruction, or included subelement). (Notes have been elided. Please see the [Goldfarb], p 321ff for details and Goldfarb's annotations.) I have several comments to the recommendations. First off, recommendation number 2 is in conflict with what I interpret to be the function of the entity manager, namely to ensure that the parser sees record start and end codes at the start and end, respectively, of a "record". When the input file is organized in variable-length records, an entity manager can legitimately (indeed, must) insert RS and RE codes to inform the parser of the record structure of the file. If RS and RE codes are not intended, a different record structure must be used. Therefore, the problem lies with the "chunking" process, not with the "dechunking" process. Also, if there are characters 10 and 13 in the source file prior to "dechunking", these must be treated as non-SGML characters, and should elicit an error. However, if the "chunking" and "dechunking" are considered to be a function of a "transport layer" (e.g., rolling data onto and off of tapes), there is a conflict between the specifications of the chunking and the dechunking as those can be inferred from the above description. In the presence of such an error, however, the problem reported has no impact on or from clause 7.6.1. Recommendation number 1 is much easier to deal with. It is actually two recommendations, 1a: "Clarify the wording ..." and 1b: "... consider modifying its intent ...". 1a can be accomplished by the rewriting that I did to understand this (I fully agree that the clause is difficult to interpret): Short reference delimiter strings are resolved. (RS and RE remain after resolution of such references (see [297:20]).) An RS is never sent to the application (but is not ignored by the parser). 
In element content, an RE is treated as an |s| and is not sent to the application. In mixed content, the following rules apply: a) If the first RS, RE, data, or proper subelement in an element is an RE, it is not sent to the application. b) If the last RE, data, or proper subelement in an element is an RE, it is not sent to the application. c) If a record is not empty, and the first RE, data, or proper subelement following the RS is an RE, it is not sent to the application. In elements with |replaceable character data| or |character data|, only items a) and b) apply. RS and RE in a subelement are part of that subelement's content. I have made several interpretations of the text, and the most important is to define "ignore" as "not send to the application". The text as it stands makes it difficult to understand how an RS can affect the interpretation of a following RE if it is truly "ignored". The next most important rewriting is to "tokenize" the input file into function characters, data characters, and interpreted markup constructs. I.e., I read a "token", and see if that "token" is an RE function character. "Tokens" can be markup declarations, processing instructions, included subelements, and proper subelements, in addition to individual function characters and lastly, data characters. The subelements are completely parsed before they are returned as "tokens". I can read the above and understand it without effort, but I still get confused if I try to read the original text; too many negations for my internal stack. I may be biased because I wrote it :-), but I think my rewriting helps explain things. Before we consider how we should "modify the intent" of the above clause, let's see what the intent is, as things are today: 1) A record end following a start-tag is ignored. 2) A record end preceding an end-tag is ignored. 3) If we use omitted end-tag minimization, a record which contains only a start-tag is ignored. 
(In mixed content otherwise, the end-tag and following start-tag would need to be juxtaposed in the same record to cause the RE to be ignored. In element content, an RE between an end-tag and a start-tag would be recognized as an |s| delimiter, and ignored.) 4) A record that contains only markup (not start- or end-tags, but markup declarations and processing instructions) is ignored. 5) A record that contains only included subelement(s) is ignored. 6) The RE after an empty record is _not_ ignored. (Item 6 is perhaps controversial. It is not clear to me whether item c in the list above takes precedence over item b if the empty record is the last record before the end-tag. ARC SGML sends this RE to the application.) It appears to me that the intent of this clause is that a record which contains only markup (tags included) is itself considered markup, and both RS and RE are to be ignored, but that the requirements in the standard tend to exclude certain REs from this general intent, and the order in which markup and data are recognized makes the whole thing overly complicated. The TechDoc discussion group recommends that the possibility be eliminated that an RE that would be ignored if it were allowed as data causes an error. I'm not thrilled with this desired modification, and instead propose that the intent to ignore records containing markup be straightened out. My suggested wording is as follows: || 7.6.1 Record Boundaries || || The RS and RE of a non-empty record that does not contain data are || treated as |s| separators. || || If the record contains data or subelements, an RE that immediately || follows a start-tag is ignored, and the RE (and RS) that immediately || precede an end-tag are ignored. || || An RS or RE function character ignored by the above requirements is not || subjected to further markup recognition. || || NOTE: I.e., short reference delimiter recognition occurs after record || boundary processing. 
This has the impact that the difference between an included subelement and a proper subelement is reduced. (The difference has caused much confusion previously, and Goldfarb even warns against it: "This implication of designating an element to either be an inclusion or a proper subelement should be kept in mind when designing document types. It can have a significant impact on the likelihood that users will create record boundary errors." Amen.)

The full impact of the proposed change has yet to be assessed (the number of combinations is large), but preliminary work shows that, except for records containing only included subelements, no RE that is ignored under the current requirements is treated as data under the proposed new requirements, although a number of REs that are treated as data under the current requirements are ignored under the new requirements, as well as not causing short reference delimiter recognition (which today has a number of unwanted side effects).

The example in the second note in 7.6.1 would cause the RE to never be ignored, as if the subelement were always proper (explicit RS and RE):

&#RS;data<outer><sub>&#RE;
&#RS;data</sub>&#RE;
&#RS;data</outer>&#RE;

The note states that the first RE in "outer" is that following the "sub" element and that it is data if "sub" is a proper subelement, but ignored if "sub" is an included subelement. I think this is really artificial and counter-intuitive. Under my proposed new requirements, only the first RE would be ignored.

Although the above proposal simplifies things greatly, the TechDoc discussion group's recommendation number 3 is still advice.

The question whether record boundary processing applies to CDATA and RCDATA elements should be answered by pointing out that 7.6.1 makes no claims about the type of the content of the element to which it applies. Indeed, that is part of the problem of reading this paragraph.
Finally, how do the proposed new requirements fit the bill from the CALS Industry Standards Groups? To review, given the declarations

<!ELEMENT A (B , #PCDATA)>
<!ELEMENT C (D | #PCDATA)>

the fragment

<A>
<B>contents</B>
</A>

will be parsed as

<A><B>contents</B></A>

and the fragment

<C>
<D>contents</D>
</C>

will be parsed as

<C><D>contents</D></C>

which seems to do the job.

To summarize the effect: the record boundaries of the record containing <A> (and <C>) are ignored because the record contains no data. The RE and RS before </A> (and </C>) are ignored because the record contains data and they occur immediately before an end-tag.

Comments invited, especially from all those who have complained about the current requirements.

Best regards,
</Erik>
--
Erik Naggum             ISO 8879 SGML       +47 2295 0313
Oslo, Norway            ISO 10744 HyTime
<erik@naggum.no>        ISO 9899 C          Memento, terrigena
<SGML@ifi.uio.no>       ISO 10646 UCS       Memento, vita brevis

-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
Subject: DTD DTD Summary
Message-ID: <GNAT.93Jul21122937@kauri.kauri.vuw.ac.nz>
From: gnat@kauri.vuw.ac.nz (Nathan Torkington)
Date: 21 Jul 1993 12:29:37 UT
Organization: CSC, Victoria University Of Wellington, New Zealand
Lines: 159

Thanks to everyone for their responses. These came under the headings of:
-- ``ooh ooh, me too!'' :-)
-- ``there's probably one in SGMLS somewhere'' :-)
and
-- definite assistance.

In the latter category comes this, the definitive answer from Donald Gignac.

| This is in response to your posted request.  Appended is some work I did
| along these lines earlier with respect to publishing the various CALS
| DTDs in the governing DoD specifications.  The SGML is CALS SGML
| (MIL-M-28001B).  You can tag a DTD three ways.
| Enter the whole thing as CDATA.
| Devise some ridiculous SGML Declaration to redefine all your SGML
| markup.
| Do something like the following and try to format it for publishing
| somehow.
| We planned on the CALS OS and FOSI approach.
| Also see appendix B "Writing a book on SGML using SGML" in Eric van
| Herwijnen's "Practical SGML".
|
| Use the following at your own risk.  I have no idea if it's feasible.
|
| [then, when I asked if I could share his stuff]
|
| I do not feel the material I sent you yesterday should be available
| from one of the official SGML archive sites since it has not been
| thoroughly tested.  On the other hand, it could be useful to people
| working in this area.  You are free to distribute this material as you
| see fit provided the following statement is included:
|
|
|
| The following SGML declarations and data dictionary were developed by
|
| Donald Gignac "gignac@oasys.dt.navy.mil" (301) 227-3348
| Advanced Information Systems Branch (Code 183)
| David Taylor Model Basin
| Headquarters, Carderock Division
| Naval Surface Warfare Center
| Bethesda, Maryland USA 20084-5000
|
| Any information regarding the use of this material, mistakes or
| inadequacies found therein, suggestions for improvement, etc. would
| be greatly appreciated.
|
| This material is based on similar material provided to the US Navy by
| the Datalogics Corporation.  The United States Government provides no
| guarantees regarding its completeness, correctness, or usefulness.  The
| United States Government is not responsible for any losses whatsoever
| resulting from the use of this material.
|
|
| <!-- MARY: The "yesorno" ENTITY declaration below was added for parsing
| purposes.  Remove it when the "dtd" declarations are inserted in the DTDs.
| -->
| <!ENTITY % yesorno "NUMBER">
|
|
| <!-- START OF DECLARATIONS FOR MARKING UP A DTD -->
|
| <!-- BOILERPLATE ENTITY DECLARATIONS -->
|
| <!ENTITY docname "User supplies DTD's document (root element) name here.">
|
| <!ENTITY dtdident "User supplies DTD identifier from DTD's formal public
| identifier here.">
|
| <!ENTITY DOCTYPE 'The following set of declarations may be referred
| to using a public entity as follows:
|
| <!DOCTYPE &docname; PUB "&dtdident;">' >
|
| <!ENTITY NOTE1 'NOTE: In order to parse the following Document Type
| Declaration Subset alone, append the Document Type Declaration below
| to the beginning of the file:
|
| <!DOCTYPE &docname; [
|
| and the associated "]>" to the end of the file.'>
|
| <!-- WARNING: The "etago" entity (or the equivalent "&#60;/") must be
| used to provide the end tag open "</" in the RCDATA declared content
| of "comment".  If "</" is entered from the keyboard, the parser considers
| it to be a delimiter and prematurely terminates the "comment" markup.
| -->
| <!ENTITY etago "</">
|
| <!-- ELEMENT AND ATTLIST DECLARATIONS -->
|
| <!ELEMENT dtd - - (entset?, elemset?) +(comment)>
| <!ATTLIST dtd docname CDATA #REQUIRED -- use &docname; for tag value --
|               dtdident CDATA #REQUIRED -- use &dtdident; on tag -- >
|
| <!-- MARY: Are the above "docname" and "dtdident" attributes redundant?
| Even if they are, shouldn't we still have them?
| -->
|
| <!ELEMENT comment - - RCDATA>
| <!ATTLIST comment type (declaration | embedded) "declaration">
|
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE ENTITY SET -->
|
| <!ELEMENT entset - - (entdec | pentref | marksectdec)* >
|
| <!ELEMENT entdec - - (paramlit | datatext | bracktext | extentspec)>
| <!ATTLIST entdec entname CDATA #REQUIRED>
|
| <!ELEMENT (paramlit | datatext | bracktext) - - CDATA>
|
| <!ATTLIST datatext type (CDATA | SDATA | PI) #REQUIRED>
|
| <!ATTLIST bracktext type (STARTTAG | ENDTAG | MS | MD) #REQUIRED>
|
| <!ELEMENT extentspec - - (enttype?)>
| <!ATTLIST extentspec type (PUBLIC | SYSTEM) "PUBLIC"
|                      entident CDATA #REQUIRED>
|
| <!ELEMENT enttype - - (datattspec?)>
| <!ATTLIST enttype notname CDATA #REQUIRED
|                   type (CDATA | NDATA | SDATA) #REQUIRED>
|
| <!ELEMENT datattspec - - (attname, valspec)*>
|
| <!ELEMENT valspec - - CDATA>
|
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE DECLARATION SEPARATORS
| (PARAMETER ENTITY REFERENCES AND MARKED SECTIONS) -->
|
| <!ELEMENT pentref - - CDATA>
|
| <!ELEMENT marksectdec - - (entdec | elemdec | attlistdec | pentref |
|                            marksectdec)* >
| <!ATTLIST marksectdec statkeyw CDATA #REQUIRED>
|
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE ELEMENT SET -->
|
| <!ELEMENT elemset - - ((elemdec, attlistdec?) | (notdec, attlistdec?) |
|                        pentref | marksectdec)*>
|
| <!ELEMENT elemdec - o (modgrp, excepts?)>
|
| <!-- WARNING: When a tag corresponding to the "elemdec" element has a
| content model (i.e., the "declcont" attribute is not specified), then
| the CDATA contents of the "modgrp", "exclusions" (if present), and
| "inclusions" (if present) tags MUST be terminated by their end tags.  When
| the "declcont" attribute is specified on an "elemdec" tag, then an
| "elemdec" end tag MUST NOT be present NOR shall that "elemdec" tag
| have content.
--> | | <!ATTLIST elemdec elemname CDATA #REQUIRED | startmin %yesorno; "0" | endmin %yesorno; "1" | declcont (CDATA | RCDATA | EMPTY) #CONREF> I hope this is of some use, Cheers; Nat ----------------------------------------------------------------------- Newsgroups: comp.text.sgml Date: 19 Jun 1994 02:44:01 UT From: Erik Naggum <erik@naggum.no> Organization: Naggum Software; +47 2295 0313 Message-ID: <19940619.3045@naggum.no> References: <1994Jun17.174907.14944@chemabs.uucp> Subject: Re: SGML DTD help needed for binary data. [Wayne Mills] | I'm working on writing a DTD to describe a data file. The fly in the | ointment is that there is some binary data in the mix. I have a tag | called GDSMM that will mark this data for me. binary data does not in general mix well with text. the cleanest solution is to use an external entity, which you may also give a NOTATION and notation attributes. another possibility is to accept the suggestion from SGMLS, and make all characters into (numeric) character references. | This is based on info I found in the book __Practical SGML__ by | Herwijnen. I really hope the book does not suggest using CDATA for binary data! CDATA and RCDATA should in fact not be used at all. they run counter to several SGML principles, the most important breach of which is that elements that have declared content (CDATA and RCDATA) are locked to the character set used in the document, since there is no way to turn them into entity references when (not if) needed. thus, the document is system-dependent despite every intention to make documents less so. the effect of CDATA and RCDATA can be obtained with marked section, which is under user control, where such decisions belong. this doesn't help you with binary data. | <GDSMM>^0^^&J6Yy</GDSMM> this does not appear to be consistent with the error messages you got, so I assume you meant ^O (not ^0) as control-O and ^^ as control-^. 
under this assumption, a valid instance could look like this, and you can use #PCDATA for the element.

    <GDSMM>&#15;&#30;&#38;J6Yy</GDSMM>

note that you obtain 200-400% overhead in this representation, but it's the only safe way to do it. binary data is not text; whereas the characters in the text may undergo translation, the binary data should remain the same. using this notation, they will.

unless you have a lot of small snippets of binary data like this, the clean solution is an entity, with the option to inline the data as above:

    <!NOTATION BINARY-CGM PUBLIC "whatever">
    <!ENTITY GDSMM001 SYSTEM "whatever" NDATA BINARY-CGM>
    <!ELEMENT GDSMM (#PCDATA)>
    <!ATTLIST GDSMM external ENTITY #CONREF>

an instance can then look like this:

    <GDSMM external=GDSMM001>

or

    <GDSMM>&GDSMM001</GDSMM>

or

    <GDSMM>&#15;&#30;&#38;J6Yy</GDSMM>

depending on your needs. I favor the attribute solution, as there is no need to have the parser scan over the data. your application software will have to talk to the entity manager to obtain the data.

| So, can anyone tell me the proper DTD approach and syntax for
| specifying that a certain GI tags data which is binary and should be
| "ignored" by sgmls?

the short answer is that you can't make an SGML parser "ignore" data if it is also parsed. (the exception is NDATA entities, where the parser only looks for the entity end, but then, what's the point in having it go through the parser in the first place?)

hope this helps.
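[Editorial sketch, not part of the original posting.] The numeric character reference approach described above can be illustrated in a few lines of Python. The function names `to_char_refs` and `from_char_refs` are made up for this illustration, and the sketch assumes that every byte is written as a decimal reference of the form `&#N;`, which is one conservative reading of the advice rather than the only possible encoding.

```python
import re


def to_char_refs(data: bytes) -> str:
    """Encode every byte as a decimal numeric character reference (&#N;)."""
    return "".join(f"&#{byte};" for byte in data)


_CHAR_REF = re.compile(r"&#(\d+);")


def from_char_refs(text: str) -> bytes:
    """Decode a string made only of &#N; references back into bytes."""
    return bytes(int(number) for number in _CHAR_REF.findall(text))


# control-O, control-^, then the literal characters "&J6Yy"
raw = b"\x0f\x1e&J6Yy"
encoded = to_char_refs(raw)
print(f"<GDSMM>{encoded}</GDSMM>")
print(f"{len(raw)} bytes became {len(encoded)} characters")
```

Encoding every byte, printable or not, keeps the sketch simple and makes the size penalty easy to see; an encoder that leaves harmless printable characters alone would be smaller but needs a rule for which characters count as harmless.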
best regards, </Erik> -- Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no> | memento, terrigena ISO 8652 Ada/ISO 8879 SGML/ISO 9899 C/ISO 10646 UCS | memento, vita brevis ftp://ftp.ifi.uio.no/pub/SGML wais://ftp.ifi.uio.no/comp.text.sgml ----------------------------------------------------------------------- Newsgroups: comp.text.sgml From: Erik Naggum <SGML@ifi.uio.no> Message-ID: <19930623.009@erik.naggum.no> Date: 23 Jun 1993 01:26:24 UT References: <1993Jun23.093319.1796@gp.co.nz> Subject: Re: CDATA and RCDATA understand confused Lines: 42 [Stephen Dixon] : | I had thought that it was possible to have an element "x" whose | content was declared CDATA or RCDATA in which could occur start-tags | (and associated end-tags) that may or may not be in the DTD. The | parser would look at these and only recognise that end-tag which | started it. Nope. The parser only looks for the end-tag open delimiter-in-context, i.e., the string "</", followed by a name start character, a letter. This is covered by the perhaps confusing requirement: 7.6 Content : The [content] of an element declared to be [character data] or [replaceable character data] is terminated only by an {etago} delimiter-in-context (which need not open a valid [end-tag]) or a valid {net}. Such termination is an error if it would have been an error had the [content] been [mixed content]. | How can this be achieved? You can use a marked section, instead, such as <![ CDATA [ ........ ]]> where you run into the same problem with the string "]]>", but are otherwise relieved of the particular problems you have right now. A good idea is to refer to CDATA entities for examples of SGML markup. Best regards, </Erik> -- Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no> ISO 8879 SGML Chairman, SGML SIGhyper <SGML.SIGhyper@ifi.uio.no> ISO 10744 HyTime "Memento, terrigena. Memento, vita brevis." 
ISO 10646 UCS

-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
Date: 28 Apr 1994 23:13:53 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940429.1992@naggum.no>
References: <1994Apr28.210539.23985@sics.se>
Subject: Re: Normalizing SGML with sgmls?

[Hakan Soderstrom]

| Somebody happens to know about a back-end to sgmls that puts the output
| together to SGML again? What I need is an inexpensive way of
| normalizing SGML, i.e. expand any markup minimization features.

too much information is lost from the original document for this to be generally possible or desirable. you need the SGML declaration and the prolog untouched, and the ESIS output that SGMLS provides does not support this "information set". further, you may not want to have all entities expanded, only markup minimization.

translation from one feature set to another is a very complicated task, and specialized tools must do the job. further, it is not generally possible due to the incredibly silly syntax-changing aspects of some features in combination with the destructive nature of the CDATA and RCDATA declared content. if you avoid using those, things become _much_ easier, so don't use them.

    <!SGML "ISO 8879:1986" ... CONCUR NO ... >
    <!DOCTYPE example [
    <!ELEMENT example CDATA>
    ]>
    <example><(ignore)this>foobar</(ignore)this></example>

now, try

    <!SGML "ISO 8879:1986" ... CONCUR YES ... >
    <!DOCTYPE example [
    <!ELEMENT example CDATA>
    ]>
    <example><(ignore)this>foobar</(ignore)this></example>

both <(ignore)this> and </(ignore)this> will _vanish_ because of the rules of the CONCUR feature. the problem is that there's no way you can keep them from vanishing, because no markup is recognized in CDATA. you have a different document, and you may not even know it! there's no warning from the parser, _nothing_ to let you know you suddenly lost vital data in this example.
you have in effect prohibited this document from being used with CONCUR. maybe not a disaster, since CONCUR is unimplemented and should remain so, but with LINK, where an example would be slightly pathological, a document instance can prohibit processing with LINK, either by producing syntax errors or by making data vanish as ignored markup.

CDATA and RCDATA as declared content should not be used. there is no need for them, and they will harm your information, as well as confuse the user.

going from some markup minimization features to none is simpler, but not generally possible if you use entities that hold minimization-dependent markup. this points to a weakness in the technique of using such entities, which is often the case with short reference maps. there's no way short of expanding such entities. if you don't want the space savings and document organization in the entity structure to get lost in the process, you need selective entity replacement.

contrary to what one might think, an SGML document is bound very tightly to its SGML declaration and feature set, including minimization, but the SGML declaration describes more than the document instance, and encroaches on the user's ability to process the document. editing an SGML declaration to remove or add features is not a task for the faint of heart.

the obvious conclusion is that the SGML declaration should never be changed, at least not without being willing to pay the penalty of checking that the document is indeed the same before and after. this is _most_ unfortunate.

in other words -- you don't _want_ to expand markup minimization features with a quick and dirty solution. you may not want to do it at all. those programs that purport to produce "normalized SGML" may destroy some of your information without letting you know about it; maybe not when you "normalize" it, but when you read it back in with another parser. some may even produce "normalized SGML" that isn't SGML at all (MARK-IT comes to mind).
best regards,
</erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no> | memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS | memento, vita brevis.

----------------------------------------------------------------------------

From: hajagos@frame.com (Lani Hajagos)
Newsgroups: comp.text.sgml
Subject: More FrameBuilder and SGML
Date: 26 Jan 1993 01:44:23 UT
Organization: UTexas Mail-to-News Gateway
Lines: 60
Sender: daemon@cs.utexas.edu
Message-ID: <9301260142.AA02622@lani.corp.frame.com.frame>

In article 1879@gmd.de, thomas@gmd.de asks "Are you claiming that there is a one-to-one mapping between FrameBuilder documents and SGML documents? Or only that some (or even most) SGML documents can be mapped into FrameBuilder, and vice versa?"

I don't believe that any SGML product on the market currently supports 100% of ISO 8879, so I certainly won't make the claim for FrameBuilder. FrameBuilder represents the SGML entity structure of a document without using SGML syntax. There are several SGML constructs which have no direct correspondence in FrameBuilder:

- I mentioned attributes in my previous message. These can be mapped into several different FrameBuilder structures, depending on their intended usage. There is no concept of current or content reference attributes, however.

- FrameBuilder has no direct analog to the SGML entity structure. FrameBuilder's Book handles documents divided into multiple files, although each document in the Book must be a complete element, and books cannot be nested.

- FrameBuilder does not support SGML's optional LINK and CONCUR features.

SGML documents using syntax-specific SGML constructs such as markup minimization or CDATA and RCDATA marked sections can be imported into or exported out of FrameBuilder fairly easily. These constructs are irrelevant to the FrameBuilder model, however, and have no counterpart within FrameBuilder. FrameBuilder does have a construct analogous to INCLUDE, IGNORE, and TEMP marked sections.
This construct, called Conditional Text, however, does not support spanning partial elements.

Since FrameBuilder is an application for formatting structured documents as well as an editor for manipulating them, the question of whether FrameBuilder supports all possible SGML structures is less meaningful than whether it supports them in the intended way. Because the conceivable applications are unlimited, SGML was deliberately defined so that no one system could do so. For example, while FrameBuilder recognizes graphics in many file formats, thereby supporting several data content notations, it certainly does not recognize every possible graphics format.

As another example, tables can be represented in many ways in SGML. Of course the possible element structures can be represented as element structures within FrameBuilder. What is wanted, though, is a representation within the rich table facility that FrameBuilder inherits from FrameMaker. Very little programming is required to move the SGML tables we have seen at Frame to and from the FrameBuilder table mechanism.

In its initial release, FrameBuilder internally supports elements that correspond to entire tables and allows table cells to contain hierarchies of elements. There are no elements for table rows or columns, however, and there is no content model defining the order in which particular types of cells are expected to occur.

I hope this answers your questions.

-----------------------------------------------------------------------

Index comp.text.sgml contains the following 32 items relevant to 'RCDATA'. The first figure for each entry is its relative score, the second the number of lines in the item.

1000   78  183937.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19940619/
 929  101  155758.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19940401/
 786   65  193622.Megginson   /local/ftp/pub/SGML/comp.text.sgml/19940619/
 643   49  213319.Dixon       /local/ftp/pub/SGML/comp.text.sgml/19930622/
 572  449  183053.Skinner     /local/ftp/pub/SGML/comp.text.sgml/19930206/
 572  160  024237.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19930314/
 500  351  072218.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19930201/
 500  166  122937.Torkington  /local/ftp/pub/SGML/comp.text.sgml/19930721/
 500   84  024401.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19940619/
 429  333  051234.Connolly    /local/ftp/pub/SGML/comp.text.sgml/19930116/
 429   49  012624.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19930623/
 429   85  231353.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19940428/
 357  663  085319.Popham      /local/ftp/pub/SGML/comp.text.sgml/19920526/
 357   68  014423.Jahagos     /local/ftp/pub/SGML/comp.text.sgml/19930126/
 357   83  190816.Brueni      /local/ftp/pub/SGML/comp.text.sgml/19930221/
 357   56  010814.Bowe        /local/ftp/pub/SGML/comp.text.sgml/19930311/
 357   74  223624.Skinner     /local/ftp/pub/SGML/comp.text.sgml/19930312/
 357   95  233109.Skinner     /local/ftp/pub/SGML/comp.text.sgml/19930314/
 357   93  204143.Naggum      /local/ftp/pub/SGML/comp.text.sgml/19930317/
 357   54  182552.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19930326/
 357  332  215637.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19930524/
 357  213  234410.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19930926/
 357   37  102257.Thompson    /local/ftp/pub/SGML/comp.text.sgml/19931007/
 357  152  184256.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19931007/
 357  374  050500.Suttor      /local/ftp/pub/SGML/comp.text.sgml/19931201/
 357   25  021410.English     /local/ftp/pub/SGML/comp.text.sgml/19940303/
 357   55  132229.Holman      /local/ftp/pub/SGML/comp.text.sgml/19940310/
 357  143  152315.Rath        /local/ftp/pub/SGML/comp.text.sgml/19940407/
 357   69  160004.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19940521/
 357   56  132300.Kimber      /local/ftp/pub/SGML/comp.text.sgml/19940525/
 357   46  174907.Mills       /local/ftp/pub/SGML/comp.text.sgml/19940617/
 357   34  100400.McArthur    /local/ftp/pub/SGML/comp.text.sgml/19940619/

-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
From: bowe@acme.osf.org (John Bowe)
Subject: CDATA content and end-tags as data
Message-ID: <1993Mar11.010814.6386@osf.org>
Summary: how does one includes endtags in CDATA content?
Keywords: CDATA content, end-tag
Sender: news@osf.org (USENET News System)
Organization: Open Software Foundation, Cambridge, MA, USA
Date: 11 Mar 1993 01:08:14 UT
Lines: 46

What are the rules for including tags in CDATA and RCDATA content? For example, can I insert a chunk of example markup as the content of a CDATA element? I could not find an explicit statement about this in The SGML Handbook. What I did find was something about the character data ending when the parser finds an ETAGO followed by a valid name (does that mean *any* valid name?). I also found something about the char data ending when the *matching* end tag is found.

sgmls (1.1) gives an error. Here is the DTD and instance:

    % cat -n /tmp/cdata
         1  <!DOCTYPE doc [
         2  <!ELEMENT doc - o (data) >
         3  <!ELEMENT data - - CDATA >
         4  ]>
         5
         6  <doc>
         7  <data>
         8  <p>This is an example paragraph.</p>
         9  </data>
        10  </doc>

Here's the error message:

    % sgmls /tmp/cdata
    sgmls: SGML error at cdata, line 8 at ">":
           No element declaration for P end-tag GI; end-tag ignored
    sgmls: SGML error at cdata, line 8 at ">":
           Bad end-tag in R/CDATA element; treated as short (no GI) end-tag
    sgmls: SGML error at cdata, line 9 at ">":
           DATA end-tag ignored: doesn't end any open element (current is DOC)
    (DOC
    (DATA
    - <p>This is an example paragraph.
    )DATA
    )DOC

So, besides using (#PCDATA) as the content and

    <![ CDATA [ <p>This is an example paragraph.</p> ]]>

how does one enter markup itself (ie. end tags) as data?
Should I be able to put anything I want (besides the obvious end-tag) as the content of a CDATA element? What are your favorite ways for doing this? - john ----------------------------------------------------------------------- Newsgroups: comp.text.sgml From: ers@xgml.com (Eric R. Skinner) Subject: Re: CDATA content and end-tags as data Message-ID: <1993Mar12.223624.26543@xgml.com> Keywords: CDATA content, end-tag Organization: Exoterica Corporation References: <1993Mar11.010814.6386@osf.org> Date: 12 Mar 1993 22:36:24 UT Lines: 65 In article <1993Mar11.010814.6386@osf.org> bowe@acme.osf.org (John Bowe) writes: >What are the rules for including tags in CDATA and RCDATA content? For >example, can I insert a chunk of example markup as the content of a CDATA >element? I could not find an explicit statement about this in The SGML >Handbook. What I did find was something about the character data ending >when the parser finds an ETAGO followed by a valid name (does that mean >*any* valid name?). I also found something about the char data ending when >the *matching* end tag is found. Clause 9.6.1 (Recognition Modes) provides the first clue: "Note: Most delimiters will not be recognized when the content is character data or replaceable character data." But of course that doesn't answer the question. Clause 7.6, eerily close to this group's favorite 7.6.1, says: "The content of an element declared to be character data or replaceable character data is terminated only by an ETAGO delimiter-in-context (***which need not open a valid end-tag***) or a valid net. Such a termination is an error if it would have been an error had the content been mixed content." In other words, anything that starts to look like an end tag (ie. an ETAGO followed by a name start character) will cause the end of the CDATA element. If the end tag is invalid an error will also be generated. 
Partly, this is to allow the end tag of an enclosing element to end the CDATA element; it's also necessary owing to SGML's token lookahead restrictions.

>So, besides using (#PCDATA) as the content and
>
>    <![ CDATA [ <p>This is an example paragraph.</p> ]]>
>
>how does one enter markup itself (ie. end tags) as data? Should I be able
>to put anything I want (besides the obvious end-tag) as the content of a
>CDATA element? What are your favorite ways for doing this?

The only two characters that cause problems are < and &. You can "protect" these characters in a number of ways.

1. The clumsy entity reference expansion way, ie. AT&amp;T

2. With a null declaration, ie. AT&<!>T. This is better in that no change to the DTD is required.

3. Using a short reference string, to allow this: AT\&T. To do this you must allow "\&" as a short reference delimiter, then map "\&" to "&" in the necessary contexts. Tricky on the implementation side but the hands-down winner for clarity of markup.

The same steps could be taken to protect the "<" character.

Of course, if you are writing a document with many examples of SGML, you should consider using the SGML declaration to change various delimiters to non-reference values, to allow you to type SGML examples as you please.

Cheers,
--
Eric R. Skinner           ers@xgml.com
Exoterica Corporation     Tel +1 613 722 1700
Ottawa, Canada            Fax +1 613 722 5706

-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
From: ers@xgml.com (Eric R. Skinner)
Subject: Re: CDATA content and end-tags as data
Message-ID: <1993Mar14.233109.18643@xgml.com>
Organization: Exoterica Corporation
References: <1993Mar11.010814.6386@osf.org> <1993Mar12.223624.26543@xgml.com> <19930314.002@erik.naggum.no>
Date: 14 Mar 1993 23:31:09 UT
Lines: 87

In article <19930314.002@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes:

>An ETAGO delimiter-in-context is not only an ETAGO followed by a name start
>character, though.
If SHORTTAG YES is specified, a TAGC satisfies the >contextual constraints (Clause 9.6.2), and if CONCUR YES is specified in >the SGML declaration, a GRPO is enough. In the reference concrete syntax, >this means that also "</>" and "</(" will terminate such an element, with >the respective features used. (Note that there is no contextual constraint >on the GRPO, which it seems that there should have been.) You're right... I only told a small part of the story. To make a complete list, we need to include a NET delimiter which is valid if a NET has been used to end the start tag. > >| Partly, this is to allow the end tag of an enclosing element to end the >| CDATA element; > >Right, but note that this is true even if the end-tag for the element is >not declared minimizable (something that tend to confuse people). Absolutely true. It forces the end of the CDATA then causes an error. Then on to favorite ways of protecting characters. I suggested the use of "\&" and "\<" as short references because in general those are the two characters that can cause trouble. Erik writes: > o Use a multicode syntax and use an MSSCHAR (markup-scan-suppress > character) before literal characters. (If the backslash is not used > for data, declaring it as a function character would give you the > benefit of the escape character without the implementation costs of > more short reference strings. Note that both schemes require changes > to the concrete syntax. There is a significant problem with using MSSCHAR (and the related MSOCHAR and MSICHAR, in that while an MSSCHAR "suppresses recognition of markup for the next character in the same entity" (paraphrased 9.7), nowhere does the standard say that the MSSCHAR itself is to be discarded. Hence, the MSSCHAR is treated as data in the context of the surrounding element (ie PCDATA or CDATA or RCDATA) and is passed to the application. It's a pain for the application to then rip out the MSSCHARs. 
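[Editorial sketch, not part of the original thread.] The termination rules quoted at the start of this article (an ETAGO followed by a name start character; "</>" when SHORTTAG YES is specified; "</(" when CONCUR YES is specified; and a NET when the start-tag was net-enabling) can be written as a small Python scanner. The function name `find_content_end` and its keyword flags are hypothetical, and the sketch assumes the reference concrete syntax with ASCII letters as the only name start characters. It only locates where CDATA or RCDATA content stops; it does not decide whether stopping there is an error.

```python
import string

# Assumption: reference concrete syntax, where name start characters are letters.
NAME_START = set(string.ascii_letters)


def find_content_end(text, shorttag=False, concur=False, net_enabled=False):
    """Return the index where CDATA/RCDATA content would end, or -1 if it never does."""
    for i, ch in enumerate(text):
        if net_enabled and ch == "/":
            return i  # NET ends the element when the start-tag was net-enabling
        if text.startswith("</", i) and i + 2 < len(text):
            following = text[i + 2]
            if following in NAME_START:
                return i  # ETAGO in context, whether or not the end-tag is valid
            if shorttag and following == ">":
                return i  # empty end-tag "</>"
            if concur and following == "(":
                return i  # "</(" with CONCUR YES
    return -1


# The content of <data> from John Bowe's example, up to the intended end-tag.
content = "\n<p>This is an example paragraph.</p>\n</data>"
end = find_content_end(content)
print(repr(content[:end]))
```

On John Bowe's example the scan stops at "</p" rather than at "</data", which is consistent with the data that sgmls reported alongside its error messages.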
I think it's certainly possible to define "\&" and "\<" as short references in all contexts; getting it right should not be too tricky. It works just like an MSSCHAR from the user's point of view, covers all necessary cases, and doesn't result in spurious characters being passed to the application. >Hmmm. I consider this not so sound advice. A CDATA marked section is much >simpler to deal with than changing the delimiters, but it is of course >possible to change them. I have only changed the delimiters (to control >characters) for processing mail and news articles, where this and extensive >use of short references made it unnecessary to modify the input files in >any way. Needless to say, "<" and "&" cannot be "magic" in existing >material that was not intended for SGML processing. Well, using CDATA marked sections is fine except that you now need to protect the "]]" character sequence if it is part of your data. There is no general solution short of changing the delimiters. If I were writing an SGML text I would seriously consider using alternate delimiters; short of implementation complexities (which should not be the user's problem) I don't see a serious argument against it. The SGML declaration which implements this in a comprehensive fashion is not terribly tricky -- anyone want a copy? Incidentally, I should point out that Exoterica has a document entitled "Understanding the SGML Declaration" available for free. It explains the syntax of the declaration in lots of detail with good examples. For your copy send mail to info@xgml.com with your mailing address. The complete library of free documents: ECM03 Understanding the SGML Declaration ECM11 Content Model Algebra ETR13 Record Boundary Processing in SGML Cheers, -- Eric R. 
Skinner ers@xgml.com Exoterica Corporation Tel +1 613 722 1700 Ottawa, Canada Fax +1 613 722 5706 ----------------------------------------------------------------------- Newsgroups: comp.text.sgml From: Erik Naggum <SGML@ifi.uio.no> Message-ID: <19930317.015@erik.naggum.no> Date: 17 Mar 1993 20:41:43 UT Supersedes: <19930317.014@erik.naggum.no> Editing-Time: 83 min Subject: SGML markup examples in SGML Lines: 85 In article <1993Mar11.010814.6386@osf.org>, John Bowe asked about rules for including markup as data in SGML documents, and about favorite ways of doing it. After thinking about this a little, I think the best answer is to go one step back, ask what was to be accomplished, and find alternate ways to accomplish it: Using CDATA and RCDATA elements or marked sections is clearly _one_ of the possible solutions, but not an entirely satisfactory one because of the many situations that Eric Skinner and I had to cover in our replies. One very simple answer is to use external entities, but if entities map to files, this can lead to a large number of files. IMNSHO, a good SGML system should allow more than one storage mechanism for entities. Another solution, although not necessarily as immediately visually gratifying as a solution which would let you see what you get (so to speak), I think it would make sense to use entity references for each delimiter role, named after the delimiter role. Consider the following entity declarations: <!ENTITY stago CDATA "<"> <!ENTITY etago CDATA "</"> <!ENTITY tagc CDATA ">"> Your example can now easily be put into an ordinary mixed content element: <example> &stago;p&tagc;This is an example paragraph.&etago;p&tagc; </example> Note that since all entities have data text as their entity text, there are no special cases to consider. For convenience, a full entity set is appended to this message. (It is validated, and has been tested on cute furry animals: no one died.) 
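[Editorial sketch, not part of the original posting.] Replacing delimiters with references to entities named after their roles is mechanical enough to script. The Python function below, `escape_markup` (a hypothetical name), rewrites "</", "<", ">" and "&" as references to the etago, stago, tagc and ero entities. It assumes those four entities are declared as CDATA entities in the way the posting describes, and that no other delimiters need protecting in the text at hand.

```python
import re

# "</" is listed first so it is matched as ETAGO rather than as STAGO plus "/".
_DELIMS = re.compile(r"</|<|>|&")
_ENTITY_FOR = {"</": "&etago;", "<": "&stago;", ">": "&tagc;", "&": "&ero;"}


def escape_markup(sample: str) -> str:
    """Replace markup delimiters in `sample` with role-named entity references."""
    return _DELIMS.sub(lambda match: _ENTITY_FOR[match.group(0)], sample)


print(escape_markup("<p>This is an example paragraph.</p>"))
```

A single regular-expression pass is used so that the "&" characters introduced by the replacements are never themselves rewritten.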
Best regards, </Erik> -- Erik Naggum ISO 8879 SGML +47 2295 0313 Oslo, Norway ISO 10744 HyTime <erik@naggum.no> ISO 9899 C Memento, terrigena <SGML@ifi.uio.no> ISO 10646 UCS Memento, vita brevis <![ -------------------------------------------------------------------------- Entity Set for Reference Delimiter Set. Created by Erik Naggum <SGML@ifi.uio.no>, 1993-03-17. PUBLIC "+//ISBN 82-7640-000//ENTITIES Reference Delimiter Set//EN" -------------------------------------------------------------------------- [ <!ENTITY and CDATA "&" > <!ENTITY com CDATA "--" > <!ENTITY cro CDATA "&#" > <!ENTITY dso CDATA "[" > <!ENTITY dsc CDATA "]" > <!ENTITY dtgo CDATA "[" > <!ENTITY dtgc CDATA "]" > <!ENTITY ero CDATA "&" > <!ENTITY etago CDATA "</" > <!ENTITY grpo CDATA "(" > <!ENTITY grpc CDATA ")" > <!ENTITY lit CDATA '"' > <!ENTITY lita CDATA "'" > <!ENTITY mdo CDATA "<!" > <!ENTITY mdc CDATA ">" > <!ENTITY minus CDATA "-" > <!ENTITY msc CDATA "]]" > <!ENTITY net CDATA "/" > <!ENTITY opt CDATA "?" > <!ENTITY or CDATA "|" > <!ENTITY pero CDATA "%" > <!ENTITY pio CDATA "<?" > <!ENTITY pic CDATA ">" > <!ENTITY plus CDATA "+" > <!ENTITY refc CDATA ";" > <!ENTITY rep CDATA "*" > <!ENTITY rni CDATA "#" > <!ENTITY seq CDATA "," > <!ENTITY stago CDATA "<" > <!ENTITY tagc CDATA ">" > <!ENTITY vi CDATA "=" > ]]> From: drmacro@ralvm13.VNET.IBM.COM Message-ID: <19930326.103751.421@almaden.ibm.com> Date: 26 Mar 1993 18:25:52 UT Newsgroups: comp.text.sgml Subject: Re: < and > ??? 
Disclaimer: This posting represents the poster's views, not those of IBM
News-Software: UReply 3.1
References: <DSCHIEB.93Mar26100352@muse.cv.nrao.edu> <19930326.075059.448@almaden.ibm.com>
Lines: 44

In <19930326.075059.448@almaden.ibm.com> drmacro@ralvm13.VNET.IBM.COM writes:
>
>Another approach might be to still use a notation for
>C code, but encapsulate the code itself within a CDATA
>marked section (assuming nothing in the code will ever
>look like an end tag open delimiter), e.g.:
>
>  <CodeExample notation=cplusplus>
>  <![ CDATA [
>  This is c++ code<template>
>  ]]>
>  </CodeExample>
>
>Your application will get the data and be told the notation
>and so it can provide the same sort of functions as
>the NDATA entity method.
>

Wayne Wohler reminded me that the CodeExample element in this case must have a declared content of CDATA or RCDATA, so the CDATA marked section would be incorrect. The correct example would be:

    <CodeExample notation=cplusplus>
    This is c++ code<template>
    Hope "</" followed by name-start-character is not in the data
    </CodeExample>

Where CodeExample is defined thusly:

    <!ELEMENT CodeExample - - CDATA >
    <!ATTLIST CodeExample notation NOTATION (cplusplus) #REQUIRED >

And the notation 'cplusplus' would be defined:

    <!NOTATION Cplusplus SYSTEM "cpp2TeX.filter" -- Filter C++ to TeX -->

Eliot Kimber                              Internet: drmacro@ralvm13.vnet.ibm.com
Dept E14/B500                             IBMMAIL: USIB2DK9@IBMMAIL
Network Programs Information Development  Phone: 1-919-254-5160
IBM Corporation
Research Triangle Park, NC 27709

-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
Date: 26 Sep 1993 23:44:10 UT
From: Eliot Kimber <drmacro@vnet.IBM.COM>
Message-ID: <19930926.170123.489@almaden.ibm.com>
Subject: DTD for DTDs

A few weeks back, somebody asked if there was a DTD for DTDs. There was no response. Even though I was doing my best to take a day off from thinking about SGML, my brain woke me at 6:30 spinning with a desire to write such a DTD, so I did.
It happens we can use this to provide some intelligent SGML processing for DTD management and documentation, so it is not time wasted.

I tested the DTD with Author/Editor (which makes a nifty little DTD editor when you set up the obvious style sheet so it looks more or less like you're editing a real DTD), but I haven't scrubbed it really hard.

The DTD is designed to allow the creation of arbitrary DTD fragments, not complete DTDs (e.g., it does not include a DOCTYPE declaration element). It is not just a literal translation of the declaration productions but tries to capture some of the semantics not expressed in the productions alone. I tried to account for all the places that comments are allowed, but I may have missed some or defined some ambiguous content models.

I've used long element names, so you'll need an SGML declaration that defines long names; the OSF-BOOK or DOCBOOK declarations should work fine.

<!--===============================================================-->
<!-- -->
<!-- DTD Fragment DTD Version 1.0 -->
<!-- -->
<!-- Describes components of SGML document type declarations (DTD) -->
<!-- as defined in ISO 8879. -->
<!-- -->
<!-- Author: W.
Eliot Kimber, drmacro@vnet.ibm.com --> <!-- --> <!-- Date: 26 September 1993 --> <!-- --> <!-- --> <!--===============================================================--> <!-- PublicDTD document type --> <?PUBLICID: +//ISBN 0-933186::IBM//DTD DTD Fragment V1.0//EN> <!ELEMENT DTDFragment O O ( ElementDecl | TextEntityDecl | InternalParmEntityDecl | ExternalParmEntityDecl | DataEntityDecl | NotationDecl | AttlistDecl | UseMapDecl | ShortRefDecl | CommentDecl | MarkedSection | ParmEntRef)* > <!ATTLIST DTDFragment ID ID #IMPLIED > <!--===============================================================--> <!-- Common Components --> <!--===============================================================--> <!ELEMENT CommentDecl - - (Comment+) > <!ELEMENT Comment - - (#PCDATA | GI | ParmEntName | GeneralEntName | Notation | MapName)* -- Restriction: Cannot contain double dash -- > <!ELEMENT GI - - (#PCDATA) -- Restriction: single name --> <!ELEMENT GIGroup - - (GI+) -- GI Group outside of content models --> <!ELEMENT Notation - - (#PCDATA) -- Restriction: single name --> <!ELEMENT AttName - - (#PCDATA) -- Restriction: single name --> <!ELEMENT AttValue - - (#PCDATA) -- Restriction: Cannot mixe LIT and LITA in content -- > <!ELEMENT AttSpecList - - (AttSpec* | ParmEntRef) > <!ELEMENT AttSpec - - (AttName?, AttValue) > <!ELEMENT (GeneralEntName | ParmEntName) - - (#PCDATA) -- Restriction: single name --> <!-- NOTE: "%" is not part of parameter entity names --> <!ELEMENT ParmEntRef - - (ParmEntName) > <!ELEMENT GeneralEntRef - - (GeneralEntName) > <!ELEMENT MinimumLit - - (#PCDATA) -- Restriction: S and RE characters normalized -- > <!ELEMENT NameGroup - - (Name | ParmEntRef)+ > <!ELEMENT Name - - (#PCDATA) -- Restriction: single SGML Name --> <!ELEMENT NotationGroup - - (Notation | ParmEntRef)+ > <!--===============================================================--> <!-- Element Declaration Components --> <!--===============================================================--> 
<!ELEMENT ElementDecl - -
          ( Comment*,
            (GI | GIGroup | ParmEntRef),
            (ContentModel | ParmEntRef),
            ((Exceptions, Comment*) | (Comment*, Exceptions) | Comment)?,
            Comment*) >
<!ATTLIST ElementDecl
          ID ID #IMPLIED
          StartOmit (sreq | somit) sreq
          EndOmit (ereq | eomit) ereq >
<!ELEMENT ContentModel - O (GIToken | ContentModel | PCData | ParmEntRef)+ >
<!ATTLIST ContentModel
          Any (any) #CONREF
          Connector (seq | or | and) #REQUIRED
          Occurrence (once | zeroormore | oneormore) once >
<!ELEMENT GIToken - - (GI | ParmEntRef) >
<!ATTLIST GIToken
          Occurrence (once | zeroormore | oneormore) once >
<!ELEMENT PCData - O EMPTY>
<!ATTLIST PCData
          Occurrence NAME #FIXED zeroormore >
<!ELEMENT Exceptions - -
          ( (Exclusions, (Comment*, Inclusions)?) | Inclusions) >
<!ELEMENT (Exclusions | Inclusions) - - (GIGroup | ParmEntRef) >

<!--===============================================================-->
<!-- Attribute List Declaration Components -->
<!--===============================================================-->

<!ELEMENT AttlistDecl - -
          ((GI | GIGroup | Notation | NotationGroup | ParmEntRef),
           AttDefinitions+) >
<!ELEMENT AttDefinitions - - (NormalAtt | FixedAtt | ParmEntRef | Comment)+ >
<!ELEMENT NormalAtt - - (AttName, DeclValue, Default) >
<!ELEMENT DeclValue - O (NameGroup | NotationGroup) >
<!ATTLIST DeclValue
          DataType (CDATA | ENTITY | ENTITIES | ID | IDREF | IDREFS |
                    NAME | NAMES | NMTOKEN | NMTOKENS | NUMBER | NUMBERS |
                    NUTOKEN | NUTOKENS) #CONREF >
<!ELEMENT Default - O (AttValue) >
<!ATTLIST Default
          DefaultBehavior (REQUIRED | CURRENT | CONREF | IMPLIED) #CONREF >
<!ELEMENT FixedAtt - - (AttName, DeclValue, Comment*, AttValue) >

<!--===============================================================-->
<!-- Entity Declaration Components -->
<!--===============================================================-->

<!ELEMENT TextEntityDecl - -
          (Comment*, GeneralEntName, Comment*,
           (ReplacementText | ExternalLocation), Comment*) >
<!ELEMENT InternalParmEntityDecl - -
          (Comment*, ParmEntName, Comment*, ParmReplacementText, Comment*) >
<!ELEMENT ExternalParmEntityDecl - -
          (Comment*, ParmEntName, Comment*, ExternalLocation, Comment*) >
<!ELEMENT ParmReplacementText - -
          (GI | GIToken | GIGroup | Notation | NotationGroup | ParmEntRef |
           ContentModel | AttDefinitions | FixedAtt) >
<!ELEMENT ReplacementText - - (#PCDATA) >
<!ATTLIST ReplacementText
          TextType (CDATA | SDATA | PI | STARTTAG | ENDTAG | MS | MD |
                    SIMPLE) simple >
<!ELEMENT ExternalLocation - - (PublicID?, Comment*, SystemID?) >
<!ELEMENT PublicID - - (MinimumLit) >
<!ELEMENT SystemID - - (#PCDATA) >
<!ELEMENT DataEntityDecl - -
          (Comment*, GeneralEntName, Comment*, (ExternalLocation),
           Notation, Comment*, AttSpecList?) >
<!ATTLIST DataEntityDecl
          Type (CDATA | SDATA | NDATA | SUBDOC) NDATA >

<!--===============================================================-->
<!-- Notation Declaration Components -->
<!--===============================================================-->

<!ELEMENT NotationDecl - -
          (Comment*, Notation, Comment*, ExternalLocation, Comment*) >

<!--===============================================================-->
<!-- Shortref Declaration Components -->
<!--===============================================================-->

<!ELEMENT ShortrefDecl - -
          (Comment*, MapName, Comment*, ShortrefDelimiter, Comment*,
           GeneralEntName, Comment*) >
<!ELEMENT (MapName | ShortrefDelimiter) - - (#PCDATA)>
<!ELEMENT UseMapDecl - -
          (Comment*, (MapName | Empty), Comment*, (GI | GIGroup), Comment*) >
<!ELEMENT Empty - O EMPTY >

<!--===============================================================-->
<!-- Marked Section Declaration Components -->
<!--===============================================================-->

<!ELEMENT MarkedSection - - ((MSKeyword | ParmEntRef), MSBody) >
<!ELEMENT MSBody - - ANY >
<!ELEMENT MSKeyword - O EMPTY >
<!ATTLIST MSKeyword
          Value (INCLUDE | IGNORE | CDATA | RCDATA | TEMP) #REQUIRED >

<!--===============================================================-->
<!-- End of DTD Fragment -->
<!--===============================================================-->

--
Eliot Kimber                              Internet: drmacro@vnet.ibm.com
Dept E14/B500                             IBMMAIL: USIB2DK9@IBMMAIL
Network Programs Information Development  Phone: 1-919-254-5160
IBM Corporation
Research Triangle Park, NC 27709
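
[Editorial sketch, not part of the posting above.] To make the intent of the DTD Fragment DTD more concrete, here is roughly how a single element declaration, <!ELEMENT para - O (#PCDATA | emph)* >, might be encoded as an instance of it. The element names "para" and "emph" and the system identifier "dtdfrag.dtd" are hypothetical, chosen only for illustration; the instance was written by hand against the declarations above and has not been run through a parser.

<!-- Hypothetical system identifier; substitute wherever the DTD is stored -->
<!DOCTYPE DTDFragment SYSTEM "dtdfrag.dtd">
<DTDFragment>
<!-- Start-tag required, end-tag omissible: the "- O" of the declaration -->
<ElementDecl StartOmit="sreq" EndOmit="eomit">
<GI>para</GI>
<!-- (#PCDATA | emph)* : an "or" group that may repeat zero or more times -->
<ContentModel Connector="or" Occurrence="zeroormore">
<!-- PCData is declared EMPTY, so it takes no end-tag -->
<PCData>
<GIToken><GI>emph</GI></GIToken>
</ContentModel>
</ElementDecl>
</DTDFragment>

The omitted-tag minimization of the original declaration ends up in the StartOmit and EndOmit attributes, and the connector and occurrence indicator of the model group end up in the Connector and Occurrence attributes of ContentModel, which seems to be the kind of "semantics not expressed in the productions alone" the posting refers to.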