SGML: CDATA and RCDATA

[CR: 19960903]

A collection of postings to comp.text.sgml (CTS), more or less germane to CDATA and RCDATA as declared content, in light of SGML's recognition modes under the influence of active FEATURES set in the SGML declaration. "To him that hath ears. . ." Erik Naggum warns against this use of CDATA and RCDATA (see below). Other experts, as far as I know, agree (rather fully) with Erik's appraisal (e.g., Steve DeRose/David Durand, Joe English). Please send me updates/corrections/additions/demurrals. -- Robin Cover



Newsgroups: comp.text.sgml
Date: 01 Apr 1994 15:57:58 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940401.1664.erik@naggum.no>
References: <JCOLLIE.94Mar30102934@blue.weeg.uiowa.edu>
Subject: Re: Question about content models

[Jeffrey C. Ollie]

|   What is the difference between the content models "(#PCDATA)" and
|   "CDATA"?  It is my understanding that both define a string of
|   characters.  Are they synonyms, or are there other differences?

actually, they both allow data characters, with a minor twist: PCDATA
(parsed character data) actually means that if the characters cannot be
interpreted as markup, they are valid in that context as data characters.
CDATA means that characters in that element are declared to be data
characters and is only terminated by an end-tag open in context (a </
followed by (, >, or a letter; or / if the start-tag was "net-enabling".)
this may change as the FEATURES clause in the SGML declaration changes.  if
prematurely ended, the error is the same as if these characters occurred in
a non-CDATA element, so there's not really any surprises lurking here.
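
As a rough illustration of the termination rule just described, the
following Python sketch finds where CDATA declared content would end.  It
is not a parser.  The function name and its three flags are hypothetical,
and tying ">" to SHORTTAG and "(" to CONCUR is an assumption about which
FEATURES enable which context.

```python
import re

def cdata_content_end(text, shorttag=True, concur=False, net_enabled=False):
    """Return the index at which CDATA declared content would end, or None."""
    followers = "A-Za-z"          # "</" followed by a letter: an end-tag
    if shorttag:
        followers += ">"          # "</>" is an empty end-tag
    if concur:
        followers += "("          # "</(" opens a document type specification
    pattern = "</[" + followers + "]"
    if net_enabled:
        pattern += "|/"           # a bare "/" is the null end-tag
    match = re.search(pattern, text)
    return match.start() if match else None
```

A "</" followed by a space or a digit is left alone, which is why the
content only looks protected until somebody types an end-tag in it.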

after some study of the impact on CDATA and RCDATA, I have found that they
serve no useful purpose, give a false sense of security that text will not
be parsed, and confuse users.  therefore, I recommend that people not use
CDATA or RCDATA at all.

if the functionality of CDATA and RCDATA is needed, it would be much better
for all parties involved if this is controlled in the document instance
rather than in the DTD.  not only is parsing made simpler, unexpected and
undesirable things happen in CDATA and RCDATA elements: users can no longer
use comments, references to a character they need appear verbatim, omitted
or minimized tags no longer work, etc.

the functionality of CDATA and RCDATA is available with marked sections.
marked sections stand out, are visible to the user at the point where they
occur, and won't screw things up unexpectedly.  furthermore, if the
element's content is (#PCDATA), the user is free to use a marked section or
entity references, other content may occur in the same element, and things
obtain a much more elegant look when you don't have to "know" whether an
element has CDATA or RCDATA contents.  the application software will not
know whether the content was declared CDATA or allowed to contain PCDATA,
so the only party you constrain is the user.

the specific thing that caused me to reject CDATA and RCDATA completely was
that changing an element's content from CDATA to (#PCDATA) is not possible
without also checking all instances of that element.  this means that if
you need more information in an element, or want a smaller granularity to
what was previously just "character data", you lose, or, more precisely,
your information receives a serious blow, and may not recover.  that is,
you may intentionally avoid fixing all the "broken" documents, and thus you
pay a hefty fee for the initial short-cut of using CDATA if that data once
again becomes useful.  CDATA and RCDATA are dangerous and best avoided.

what is gained by CDATA and RCDATA?  that a few characters will not be
treated as markup, that's all.  if these characters occur elsewhere in the
document, they will have to be replaced by entity references, or other
strange ways to "hide" them from the parser.  training users to escape
their special characters costs enough already without their also having
to remember that these rules have exceptions that apply only to certain
elements.  training them to use marked sections may incur additional cost,
but at least it has no impact on them if they don't use it, and if they use
it, it's just an added benefit.

there are several ways to hide characters from the parser.  suppose we want
to say "AT&T", here are some ways to do it:

through a character entity:             AT&amp;T
through a numeric character reference:  AT&#38;T
through an empty comment declaration:   AT&<!>T
in a marked section:                    <![ CDATA [AT&T]]>
using the markup-scan-suppress code
 (where \ is the MSSCHAR):              AT\&T
in an entity (declared):                <!ENTITY ATT CDATA "AT&T">
             (referenced):              &ATT;
in a clever entity (declared):          <!ENTITY T CDATA "&T">
                   (referenced):        AT&T
by making \& a short ref delimiter:     AT\&T

this exact same issue was discussed 1993-03-11 through --14 between John
Bowe, Eric Skinner and myself.  the URL

        wais://ifi.uio.no/comp.text.sgml?MSSCHAR

will give you 5 documents all pertinent to this topic.  (lynx supports WAIS
searches, if your gee-whiz graphical WWW browser doesn't.)

|   Is there a way to define a content model such that only certain
|   characters are permitted, and a certain number of them?  For example, I
|   would like to define an element that contains a ZIP code (5 digits).

no.

best regards,
</erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>  |  memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS  |  memento, vita brevis.

for information on SGML and HyTime, try ftp.ifi.uio.no:/pub/SGML first.


--------------------------------------------------------------------------


Newsgroups: comp.text.sgml
Date: 19 Jun 1994 18:39:37 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940619.3062@naggum.no>
References: <1994Jun17.174907.14944@chemabs.uucp> <19940619.3045@naggum.no>
<2u1utg$fmc@news.delphi.com>
Subject: Re: SGML DTD help needed for binary data.

[Jeffrey McArthur]

|   Erik Naggum writes:
|   > CDATA and RCDATA should in fact not be used at all.
|
|   That is a little strong.

it was reserved.  closer to what I actually think, and which _is_ a little
strong, is that CDATA and RCDATA mar the otherwise elegant picture of a
language by providing the means to destroy central principles to the design
of that language.  CDATA and RCDATA have absolutely no utility that cannot
be accomplished _easier_ without them (when considering the whole picture).
redundancy is bad enough, but when it becomes destructive, it is time to
speak up and make people aware of the evil potential of these features.
CDATA and RCDATA must not be used if you want your documents to outlive
your system or software, or the current version of either.  they are _bad_
for you and for your information.

now, that's a _little_ strong.  further escalation available by mail.

|   I would like to hear a good argument why a sort key should not be
|   CDATA.

and _I_ would like to hear a good argument for why there isn't a God.  how
about some arguments in _favor_ of CDATA, since this is your position?  the
onus of proof is on he who asserts the positive.

|   They should be used with care.  Consider the problem of a sort key.
|   Sorting can become a complex task.  For example companies with numbers
|   in their names may be sorted based on the spelling of the number.  For
|   example, 3M may sort in the T's (Three M), and 16 Magazine may sort in
|   the S's (Sixteen).  Other publishers may want 3M to sort before 16
|   Magazine that precedes the companies that start with the letter A.

this is irrelevant to a discussion of CDATA.

however, as an example, consider a name containing a small y with diaeresis,
the famous ISO 8859-1 character that the reference concrete syntax of SGML
in its limited wisdom decided to SHUN.  it needs to be in the content of an
element declared CDATA (because the DTD designer thought "we'll never need
an entity reference in _this_ baby"), and, voila!, you can't use &#255, you
can't represent your data, and you can't change the DTD because somewhere
else someone used the ampersand as data where it looks like an entity
reference if you allow RCDATA or PCDATA.  this is destructive.  this is bad
karma.  you should become an ex-DTD designer if you use CDATA.  RCDATA is
only a little bit better.  everywhere in the document, an author can insert
<!-- comments --> to leave information to others in the collaborative
editing team, _except_ in RCDATA elements.  the elements can't be changed
to PCDATA because the elements frequently include start-tags, and changing
them would have to be done manually since the information about the special
parsing conditions is not localized with the affected information.  this is
also destructive.  this is also bad karma.  next life, when you're digging
up mushrooms with your snout, you will regret that RCDATA element.

instead, use marked sections, teach users how to avoid markup recognition,
and make it work the same _everywhere_.  consistency is a major win in this
game.  CDATA and RCDATA screw up the consistency, confuse users, and make
funny remnants of markup appear in the final document unless you're _very_
careful, and no parser on the market today will tell you about this.

(CDATA and RCDATA marked sections localize the special parsing information
with the data, and are therefore perfectly OK, indeed essential.  CDATA
attributes are of a different kind of CDATA, and do in fact accept entity
references, as do all attribute literals.)

</Erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>       |  memento, terrigena
ISO 8652 Ada/ISO 8879 SGML/ISO 9899 C/ISO 10646 UCS  |  memento, vita brevis

ftp://ftp.ifi.uio.no/pub/SGML           wais://ftp.ifi.uio.no/comp.text.sgml
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
Date: 19 Jun 1994 19:36:22 UT
From: David Megginson <dmeggins@aix1.uottawa.ca>
Organization: Department of English, University of Ottawa
Message-ID: <DMEGGINS.94Jun19153622@aix1.uottawa.ca>
References: <1994Jun17.174907.14944@chemabs.uucp> <19940619.3045@naggum.no>
<2u1utg$fmc@news.delphi.com> <19940619.3062@naggum.no>
Subject: Re: SGML DTD help needed for binary data.

I support Erik on this point -- I do not think that CDATA or RCDATA should
_ever_ be used as a declared content model for an element.  First, there
are the practical problems.  Consider this DTD fragment from a document
about SGML (and thus, which will quote a lot of SGML literally):

  <!ELEMENT example             - - CDATA>

The following document fragment would _not_ work!

  <example>
  Here is some sample SGML text with an <tag>embedded tag</tag>
  and an &entity;
  </example>

The problem is that you cannot use any end tag in a CDATA content model
anyway.  You _could_ use RCDATA with the appropriate entities:

  <!ELEMENT example             - - RCDATA>

  <example>
  Here is some sample SGML text with a <tag>embedded tag&lt;/tag>
  and an &amp;entity;
  </example>

but if your goal was to save typing, you're not gaining much, especially if
the SGML sample is long.  You also have to worry about entity references as
well as end tags, so your text will be full of &amp; entities. The best way
to do it is just

  <example>
  <![CDATA[
  Here is some sample SGML text with a <tag>embedded tag</tag>
  and an &entity;
  ]]>
  </example>

It is especially nice because it allows you to cut and paste directly from
the SGML document without worrying about converting some characters to
entities.
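
A tool that pastes literal text this way has one thing to watch for: the
marked section ends at the first "]]>".  The Python sketch below (the
function name is hypothetical) wraps text in a CDATA marked section and
splits any embedded "]]>" across two adjacent marked sections.  It assumes
the reference concrete syntax delimiters.

```python
def cdata_marked_section(text):
    """Wrap text in a CDATA marked section, splitting any embedded "]]>"."""
    # Close the section after "]]" and reopen it before ">", so the
    # three characters never appear together inside one marked section.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"
```

Text without "]]>" passes through unchanged between the delimiters, tags
and entity references included.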

The second reason is that the DTD should not impose the choice of CDATA or
RCDATA on the user.  If you are writing an example using a marked section,
you will have to use some entities to avoid trouble, and RCDATA will be
appropriate; if you are writing an example with a lot of entities, however,
RCDATA will force you to use &amp; too often.  Let the DTD decide the
document's _structure_ and the user decide the document's _content_.

I wonder whether it would be possible to remove the CDATA and RCDATA
content models from the standard altogether, or at least, to mark them as
counter-indicated.

David

--
David Megginson                Department of English, University of Ottawa,
dmeggins@aix1.uottawa.ca       Ottawa, Ontario, CANADA  K1N 6N5
dmeggins@acadvm1.uottawa.ca    Phone: +1 613 564 6850 (Office)
ak117@freenet.carleton.ca             +1 613 564 9175 (FAX)


----------------------------------------------------------------------
Joe English on CDATA as declared content

Message-id:     <9607240020.AA11739@trystero.art.com>
From:           Joe English <joe@trystero.art.com>
To:             www-html@w3.org
Date:           Tue, 23 Jul 1996 17:20:33 PDT
Subject:        Cougar DTD: Do not use CDATA declared content for SCRIPT



The 12-July-1996 draft of the "Cougar" HTML DTD [1] declares:

    <!ELEMENT SCRIPT - - CDATA -- script statements -->

This will not work.

In particular, the use of CDATA declared content is incompatible with
JavaScript (which, I presume, will be one of the primary scripting
languages used in HTML documents).

The main reason for this is that the arguments to JavaScript's 
'document.write()' method [2], which inserts text and HTML markup
into a document, may contain end-tags, e.g.:


    <SCRIPT>
	document.write("<H1>", "Foo", "</H1>")
    </SCRIPT>


Elements with CDATA declared content cannot contain any
sequence of characters that "looks like" an end-tag  --
ETAGO (</) followed by a letter -- since that will prematurely
terminate the element.  There is no way around this; it is
a fundamental problem with CDATA declared content.

Here are a few alternatives:

1) Use <!ELEMENT SCRIPT - - (#PCDATA)>, and require all occurrences
of '<', '&', and '>' in the content to be replaced with '&lt;',
'&amp;', and '&gt;'.  This is more consistent with the rest of HTML.

2) Use <!ELEMENT SCRIPT - - (#PCDATA)> and add browser support 
for CDATA marked sections:

    <SCRIPT><![ CDATA [
	document.write("<H1>", "Foo", "</H1>")
    ]]></SCRIPT>

This is the approach favored by most other SGML applications.

3) Allow scripts to be included by external reference:

	<SCRIPT SRC="http://www.foo.com/myscript.js"></SCRIPT>

This approach may increase network latency, but has the advantage of 
better backward-compatibility with SCRIPT-unaware user agents.


 * * *


CDATA declared content is in general a bad idea (it should
not be used for STYLE either, and IMO the XMP and LISTING
elements should be removed entirely.)  Of all of SGML's 
broken features, CDATA declared content is among the worst. 
For more details, please refer to the relevant entries on 
Robin Cover's SGML Web Page [3] under "Other Grammar/Parsing 
Issues and FEATURES" [4].

Many of the issues brought up there are not particularly
relevant to the Web, though there are other problems with CDATA
declared content that make it especially dangerous for HTML.
[I've expounded on this before on html-wg, but because of the
current lack of a working archive I can't cite references :-(]

Two things that come to mind are that the presence of *any*
element with CDATA or RCDATA declared content in the HTML DTD
makes it much more difficult to write a Web search engine -- it
becomes necessary to parse against the DTD instead of simple
lexical scanning, e.g., with tools like Dan Connolly's lexical
analyzer [5] -- and that it greatly increases the amount of SGML
knowledge necessary for authors to construct a valid document
including such elements.
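
The search-engine point can be made concrete with a toy scanner in
Python.  This is a sketch under stated assumptions, not Dan Connolly's
analyzer: the regular expression, the function name, and the
cdata_elements parameter are all hypothetical.  The scanner gives
different answers depending on whether it has been told, from the DTD,
which elements have CDATA declared content.

```python
import re

TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9.-]*[^>]*>")
NAME = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")
END_TAG_OPEN = re.compile(r"</[A-Za-z]")

def start_tags(text, cdata_elements=()):
    """List the element names of the start-tags a scanner would report."""
    cdata = {name.upper() for name in cdata_elements}
    names = []
    pos = 0
    while True:
        match = TAG.search(text, pos)
        if match is None:
            return names
        pos = match.end()
        token = match.group()
        if token.startswith("</"):
            continue                      # end-tags are not reported
        name = NAME.match(token).group(1).upper()
        names.append(name)
        if name in cdata:
            # Declared CDATA: nothing is markup until "</" plus a letter.
            end = END_TAG_OPEN.search(text, pos)
            if end is None:
                return names
            pos = end.start()
```

With cdata_elements left empty, the scanner reports the H1 inside the
SCRIPT example above as an element; told that SCRIPT is CDATA, it does
not.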



[1] <URL: http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd >
[2] <URL: http://www.netscape.com/eng/mozilla/2.0/handbook/
	javascript/ref_t-z.html#write_method >
[3] <URL: http://www.sil.org/sgml/ >
[4] <URL: http://www.sil.org/sgml/topics.html#miscGrammar >
[5] <URL: http://www.w3.org/pub/WWW/TR/WD-sgml-lex >


--Joe English

  joe@art.com


-----------------------------------------------------------------------

From: Stephen Dixon <stephen@gp.co.nz>
Newsgroups: comp.text.sgml
Subject: CDATA and RCDATA understand confused
Message-ID: <1993Jun23.093319.1796@gp.co.nz>
Date: 22 Jun 1993 21:33:19 UT
Summary: END-TAG error in CDATA, RCDATA declared content.
Keywords: ETAGO, CDATA, RCDATA, GI's
Distribution: world
Organization: GP Print Ltd, Wellington, New Zealand
Lines: 39

Greetings All.

I seem to have got myself confused somewhere with regard to CDATA and
RCDATA.

I had thought that it was possible to have an element "x" whose
content was declared CDATA or RCDATA in which could occur start-tags
(and associated end-tags) that may or may not be in the DTD. The
parser would look at these and only recognise that end-tag which
started it.

Example text:

  <x>
  <s><st>Title type text</st>
  <ss>text a plenty ...</ss></s>
  ...
  </x>

Errors are produced for the end-tag </st>, </ss> and </s>. Correct?
(Both parsers I have access to agree by producing errors for those
three end-tags (and other end-tags within the <x> element).)

Using an entity like &etago; for the end-tag open is not an option as
the end-tag needs to be seen (like the start-tag) so that the
following processing application can use it.

In this case the elements <s>, <st> and <ss> are defined, but they may
represent GI's that are not defined as elements in the DTD.

How can this be achieved?  Have I gone wrong? Where?

Apologies if this has been discussed before.

--
   _____   Regards: Stephen J. Dixon           
  /    /   Thoughts etc. my own. 
 / GP /    email: stephen@gp.co.nz
/____/     SGML: Seeking to understand it better.

-------------------------------------------------------------

Newsgroups: comp.text.sgml
From: ers@xgml.com (Eric R. Skinner)
Subject: Re: 7.6.1 Record Boundaries revisited
Message-ID: <1993Feb6.183053.6512@xgml.com>
Organization: Exoterica Corporation
References: <19930201.007@erik.naggum.no>
Date: 06 Feb 1993 18:30:53 UT
Lines: 441

In article <19930201.007@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes:
>A request for interpretation of ISO 8879 7.6.1 (Record Boundaries) has been
>submitted to ISO/IEC JTC 1/SC 18/WG 8 from the CALS Industry Standards
>Groups for the revision of SGML.  I thought this might be interesting for
>readers of this newsgroup, and have also included my comments at the end.


[This article was previously posted with editing errors, cancelled
and re-posted.  Apologies to any who received both posts.]

The following is Exoterica's commentary on record boundary processing,
as stated in Exoterica document ETR13-0792.  I am providing it here
for information purposes, as it describes Exoterica's interpretation
of clause 7.6.1.

This document was made available to the participants of the Techdoc
discussion, and has been submitted to ISO WG8, although I do not
know the document number assigned to it at this time.

If you redistribute this article, please retain all copyright
notices.

Hope this helps...


---------

RECORD BOUNDARY PROCESSING IN SGML


ETR1307-92


Exoterica Corporation
1545 Carling Ave.
Suite 404
Ottawa, Ontario
Canada
K1Z 8P9
Tel 613-722-1700
Fax 613-722-5706

Information in this document is subject to change without notice and
does not represent a commitment on the part of Exoterica Corporation.

Change History:
Release 1  23 July 1992


PROPRIETARY RIGHTS NOTICE

All rights reserved by Exoterica Corporation. No part of this material
may be reproduced, translated or transmitted in any form or by any
means, electronic, mechanical, or otherwise, including photocopying
and recording, WITHOUT INCLUDING ALL COPYRIGHT NOTICES AND THIS NOTICE.

DISCLAIMER

Exoterica Corporation makes no representation or warranty, express or
implied with respect to this publication or the programs or
information described therein. In no event shall Exoterica
Corporation, its employees or contractors be liable for specific,
indirect, or consequential damages.

(c) Exoterica Corporation, 1992
    All rights reserved.


1 Record Boundaries
2 Record End Handling
3 ESIS
4 Record Start Handling
5 CDATA and RCDATA
6 Character References
7 Conclusion

RECORD BOUNDARY PROCESSING IN SGML

This paper explains Exoterica's interpretation of ISO 8879 with
respect to record boundary handling and why we decided to do things
the way we do. Clause 7.6.1 of ISO 8879, which describes the handling
of record boundaries, contains some difficulties in the way it is
written. We have tried to interpret it and implement our SGML parser
in a manner consistent with the wording of both this clause and others
in ISO 8879, as well as in a manner that provides the maximum benefits
to our users.

1  Record Boundaries

The SGML standard describes text files in terms of "records",
which are bounded by record-start (RS) and record-end (RE) characters.
This is not the only way of looking at text files, and is not the way
text files are stored by most computer systems. In fact, ISO 8879 does
not require a document to be organized in records. SGML's
record-oriented view does, however, have the advantage of providing a
model for describing precisely what a "blank record" is as well
as providing a basis for supporting the short reference minimization
feature of SGML.

Exoterica has traditionally embedded its SGML parser in its products
in a way that supports the standard's description of text records. No
matter what computer system you are running on, the Exoterica software
gives the SGML parser each text record terminated by an RE and with
the following text record started by an RS. We do this for a number of
reasons:

- Short reference minimization is an important part of many SGML
  applications and short references that include RE and RS need to be
  available on all systems.

- Our software must exhibit consistent behavior on all of the large
  variety of platforms on which it runs. Consistent behavior is required
  if data interchange is to be practical.

- On computer systems that use the ASCII line-feed character to
  delimit text records (i.e. unix), the literal interpretation of this
  character as an RS would cause its being discarded and would result in
  lines of text that were run together (i.e. if line-feeds in unix files
  were discarded by the SGML parser, a word at the end of a line would
  be run together with the word at the start of the following line).

- The task of an SGML system is to process the text submitted to it,
  not just to copy or discard the text. This processing includes
  recognizing and processing the starts and ends of lines of text.
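
The record-oriented view described above can be sketched in a few lines
of Python.  This is an illustration only, not Exoterica's implementation:
the function name is hypothetical, and RS and RE are given the values they
have in the reference concrete syntax (characters 10 and 13).

```python
RS = "\n"  # record start: character 10 in the reference concrete syntax
RE = "\r"  # record end: character 13 in the reference concrete syntax

def to_records(text):
    """Present each line of text as RS + line + RE, whatever the platform."""
    # str.splitlines() accepts LF, CR LF and CR line endings (and a few
    # rarer separators), so the result does not depend on the host system.
    return "".join(RS + line + RE for line in text.splitlines())
```

A unix file "one\ntwo\n" and a file with CR LF line endings both come out
as RS "one" RE RS "two" RE, so the line boundary survives as an RE even
though every RS is later ignored.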

Exoterica's new products, being released this autumn, support user
control over this behavior, while providing the traditional behavior
in the default case. We expect this control to be used only in the
support of wildly variant syntaxes (unusual uses of the SGML
Declaration) and in supporting consistency with the behavior of other
vendors' products. In particular, we expect that most users will
continue to want consistency across platforms, and full access to
SGML's minimization features, and will therefore want to take
advantage of the traditional behavior.

2  Record End Handling

The poor wording of Clause 7.6.1 in ISO 8879 results in an ambiguity
as to when REs are to be "ignored". In the present absence of a
revision to ISO 8879 that resolves this difficulty, we, and other
implementors, have had to come up with our own resolutions based on
trying to maintain consistency with other parts of ISO 8879 and
providing the kind of behavior that users expect.

The ambiguity revolves around the inadequately defined term "ignored".
Ignored presumably means that a character is discarded at some point
in processing. The ambiguity is in determining when a character (an RE
or RS) is ignored:

- REs and RSes are clearly not ignored a priori (prior to the SGML
  parser seeing them) because they can be recognized as short reference
  delimiters or parts thereof. So the discarding has to be done at some
  later point in processing.

- REs "attributed to markup" and RSes are ignored before the parsed data
  in the SGML document is returned to the application, because the
  ignoring has to be done at some time.

- The critical issue is whether REs attributed to markup and RSes are
  ignored instead of, or in addition to their being recognized as data.
  If they are recognized as "parsed character data", then they are
  available for matching with #PCDATA content tokens in the content
  models of mixed content elements (elements that have content models
  that contain one or more #PCDATA content tokens), and can cause errors
  if parsed character data is not allowed by a content model (i.e. a
  #PCDATA content token is not available to be matched with).

There are a number of considerations that lead us to believe that ISO
8879 says that all REs in the content of mixed content elements are
recognized as parsed character data, independent of whether they are
also "ignored" (RSes are discussed in more detail later on in this
paper):

- REs, RSes, spaces and other white-space characters are recognized as
  "s separators" in element content (production [26]) and are ignored by the
  provisions of Clause 6.2.1. The recognition of white-space characters
  in element content is done as part of the recognition of data and
  markup (it is provided for in the grammar of SGML). The provisions of
  Clause 7.6.1 are only required, therefore, to deal with white-space
  characters that are not distinguished from data characters by the
  grammar of SGML.

- The text of Clauses 7.6 and 7.6.1 never describes the REs to be
  ignored as "markup" (it refers to REs "attributed" to markup). The
  last sentence of the note in Clause 7.6 describes any SGML character
  in mixed content that is not recognized as markup to be data.
  Therefore, characters in mixed content (but not in element content),
  that are not recognized as markup, are data.

  Similarly, production [25] does not allow for an RE or RS separately, but
  only as a "data character".

- In the first note in Clause 7.6.1, it is recognized that the rules
  preceding the note only produce the "intuitive results" when "data can
  occur anywhere in the content of an element". An example is provided
  of a content model that would produce unintuitive results:
          (x, #PCDATA)
  It is stated that the rules for "ignoring" REs notwithstanding, an RE
  prior to the element "x" would produce an error. This could only be so
  if the RE were treated as parsed character data and an attempt were
  made to match it with a #PCDATA content model token prior to its being
  ignored.

  The note in Clause 11.2.4 expands on the explanation of the potential
  difficulty caused by this and similar models, and recommends against
  their use.

- The note in Clause 11.2.4 states that "separator characters, which
  are recognized as separators in element content, are treated as data
  in mixed content." This statement would seem to be quite explicit, but
  there is, however, a difficulty: the definition of "separator
  characters" (definition 4.277) is written so that it is not clear
  whether SPACE or RE characters are intended to be separator
  characters, and the proposed revision (WG 8 paper N 1035) makes it
  explicit that "separator characters" are members of the SEPCHAR class
  (i.e. TAB in the Reference Concrete Syntax), and do not include RE, RS
  or SPACE.

  On the other hand, the statement quoted clearly applies to SPACE
  characters as well as to SEPCHAR characters. Furthermore, the
  example that immediately follows the statement describes the
  difficulty caused by RE characters occurring in the locations
  described in the statement. It would therefore seem that the statement
  is intended to apply to characters that can occur in "s separators"
  (RE, RS, SPACE and SEPCHAR class characters) rather than just SEPCHAR
  characters.

- A problem similar to that of leading REs occurs within elements.
  Consider the following declarations:
        <!ELEMENT a - O (b, c?, #PCDATA)>
        <!ELEMENT (b, c) - - (#PCDATA)>
  and the following use of these elements:
        <a><b>data</b>
        <c>more data ...

  As there is no provision for discarding the RE following the </b> in
  ISO 8879, it is therefore matched with the #PCDATA content model token
  in the element "a" and the following start tag for "c" is in error.

  Note that this is basically the same problem as occurs with REs at the
  start of mixed content elements, but that it has no simple resolution
  of the "always ignore leading white-space" variety. Resolution of the
  difficulty introduced by the RE in the example would require looking
  ahead to the <c> start-tag.  Look-ahead over a large number of
  characters could be required if there were a large amount of white
  space or the presence of comments intervening between the white-space
  and the <c>. ISO 8879 is quite strict in requiring delimiters and
  other constructs to be recognized with a minimum of look-ahead. Such a
  resolution would require breaking this provision of the standard.

  In the absence of an alternative straight-forward resolution, treating
  the case in the example as an error seems the only reasonable action.
  As a consequence of this consideration and of the desire to treat all
  REs in mixed content in as consistent as possible a fashion, REs
  cannot be allowed in all contexts in mixed content.
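
The last example can be traced with a small Python sketch that matches a
token sequence against the content model (b, c?, #PCDATA).  It is a
hand-written state machine for this one model, not a general content
model matcher.  The function name and the token names "b", "c" and "data"
are hypothetical, and the RE after </b> is represented as a "data" token,
following the interpretation argued for above.

```python
def match_a_content(tokens):
    """Match tokens against (b, c?, #PCDATA).

    Return the index of the first token in error, or None if all match.
    """
    state = "want_b"
    for index, token in enumerate(tokens):
        if state == "want_b":
            if token != "b":
                return index
            state = "after_b"
        elif state == "after_b":
            if token in ("c", "data"):
                state = "pcdata"      # "data" here skips the optional c
            else:
                return index
        elif token != "data":         # state "pcdata": only data may follow
            return index
    # b is required; c and #PCDATA may both be absent.
    return len(tokens) if state == "want_b" else None
```

With the RE present, match_a_content(["b", "data", "c"]) returns 2: the RE
has already been matched with #PCDATA, so the start-tag for "c" is in
error.  Without it, match_a_content(["b", "c", "data"]) returns None.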

The fact that this resolution of the ambiguity produces results that
are not "intuitive", is recognized in the notes in Clauses 7.6.1 and
11.2.4. Clause 11.2.4 further strongly recommends against the use of
mixed content models that, in particular, do not allow #PCDATA at
their start. The very existence of these notes indicates that thought
was given to the mechanism and consequences of RE processing. What
does not seem to have been realized is that more than clarification of
the issues was required: the normative text (i.e. not the notes)
should have been written to be clear and unambiguous as to the
required processing.

Although the behavior of Exoterica's SGML parser with respect to REs
is not always what the user would desire, it does seem to be the
approach most consistent with the wording of ISO 8879. Any other
approach would have to ignore ISO 8879 in a number of clauses.
Implementors of standards are not free to ignore parts of a standard
and still claim conformance for their products, so we feel obliged to
take this approach.

The text that most strongly supports the approach we have taken is in
notes in ISO 8879. Notes are not normative text (i.e. the text in
notes and non-normative annexes is not officially part of an ISO
standard, and should not contain any of the requirements of a
standard). However, in the absence of any other source of resolution
of an ambiguity in the normative text, the non-normative text of the
standard is the best place to look for a resolution that can be
accepted by most users of the standard.

3  ESIS

The preceding discussion of RE handling seems to conclude that REs are
both ignored and not ignored. This is not as bizarre a conclusion as
it may at first seem as there are many examples in SGML of text that
affects parsing but is not part of the document as seen by the
application receiving the results of SGML parsing. For example, an RE
on a line of text that contains a comment declaration is ignored if
the line only contains the comment, but is not ignored if either the
line also contains other data characters or if the line does not
contain the comment declaration. In effect, the presence or absence of
the comment declaration, that is not part of the tagging or text of
the SGML document, influences how the SGML parser processes parsed
character data.

Characters can be both ignored and not ignored because the view of an
SGML document as seen by an SGML parser is different from the view of
an SGML document as seen by an application that receives a parsed
document from an SGML parser. An SGML parser sees all the markup, REs
and RSes and other text of the document. An application sees the
effects of the tags in the returned document structure and the data
that is not "ignored".

This two-fold view of parsed documents is fundamental to all data
description languages. It has been formalized for SGML in the
definition of ESIS (Element Structure Information Set), which has been
developed by ISO/IEC JTC 1 SC 18 WG 8, the ISO group responsible for
the maintenance of ISO 8879. In the same way that ISO 8879 now defines
the form of an SGML document, ESIS defines "the set of information
that is acted upon by implementations of structure-controlled
applications", an application that "only operates on the element
structure that is described by SGML markup, never on the markup
itself." A definition of ESIS is contained in a normative annex of
the newly approved (July 1992) ANSI standard, "Conformance Testing for
Standard Generalized Markup Language (SGML) Systems". REs attributed
to markup are "ignored" by the ESIS view of a document, but not in the
document as viewed by an SGML parser.

4  Record Start Handling

Determining how RSes should be processed is subject to the same
difficulties as for REs, except that in the case of RSes, there are
not the helpful notes that there are for REs. On the one hand, there
is the simple statement in Clause 7.6.1 that "if an RS in content is
not interpreted as markup, it is ignored." On the other hand:

- The provisions of the grammar (productions [25] and [26]) and the last
  sentence of the note in Clause 7.6 apply equally well to RSes as to
  REs: RSes are parsed data characters. In addition, the first sentence
  of Clause 7.6.1 (quoted in the paragraph above) refers to RSes "not
  interpreted as markup", strengthening the applicability of the note in
  Clause 7.6 to RSes.

- The wording used to describe what happens to an RS is the same as
  that for a leading or trailing RE: "it is ignored". There is no
  indication that RSes should be ignored at a different phase or in a
  different manner than REs attributed to markup.

- The note in Clause 11.2.4 also applies equally to RSes as REs. In
  particular: "separator characters, which are recognized as separators
  in element content, are treated as data in mixed content."

The Exoterica SGML parser matches all RSes in mixed content to #PCDATA
tokens prior to (or as well as) discarding them simply because ISO
8879 seems to indicate that RSes are handled in the same manner as REs
attributable to markup. If ISO 8879 were more straightforwardly
written, the first sentence in Clause 7.6.1 could be taken out of
context and at face value, but the interpretation of most other parts
of the standard requires substantial reading in context, and an
exception cannot be justified in this one case.

The argument in favor of the current behavior of the Exoterica SGML
parser with respect to RSes is not as strong as that for REs. It is
more difficult to make a categorical statement as to which way to go
with RSes. However, the balance seems to be on the side of treating
RSes in mixed content as parsed character data, matching them to
#PCDATA content tokens, and reporting any errors that may result.

Because the Exoterica products "normalize" text prior to its being
submitted to the SGML parser (all record boundaries are converted to
include both RE and RS), any difficulties caused by the type of content
model recommended against in the notes in Clauses 7.6.1 and 11.2.4 of
ISO 8879 have been primarily caused by out-of-place REs. The
difficulty is only ever ascribable to RSes when explicit &#RS;
character references are used.
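
As a rough illustration of the normalization just described (a sketch
only, not Exoterica's code), the Python below rewrites text so that every
record carries both an RS and an RE. It assumes the reference concrete
syntax code points (RS = 10, RE = 13); the function name
normalize_records is ours, chosen for the example.

```python
import re

RS = "\n"  # record start, character 10 (assumed reference concrete syntax)
RE = "\r"  # record end, character 13 (assumed reference concrete syntax)

def normalize_records(text):
    """Bound every text record explicitly by RS and RE, whatever line
    convention (LF, CR, or CR LF) the input happened to use."""
    records = re.split(r"\r\n|\r|\n", text)
    # A trailing line terminator yields one empty final piece; it is not
    # a record of its own, so drop it.
    if records and records[-1] == "":
        records.pop()
    return "".join(RS + record + RE for record in records)
```

With this convention "a\nb\r\n" and "a\rb" both become "\na\r\nb\r", so
the parser sees the same record structure regardless of the originating
system.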

5  CDATA and RCDATA

The arguments that apply to mixed content apply equally well, one way
or the other, to the content of CDATA (character data) and RCDATA
(replaceable character data) elements, and the content of such
elements should be processed in the same manner as that of mixed
content elements. The difficulties encountered in mixed content do not
occur in these cases, however, as they cannot contain any markup other
than entity and character references (in RCDATA).

6  Character References

Clause 9.5 states that the replacement character for a character
reference is treated as though it were entered directly. Therefore,
&#RE; and &#RS; must be ignored in exactly the same manner (in
particular, at the same phase of processing) as if the corresponding
characters were entered in the document. In particular, they must be
discarded if RE or RS would be discarded in the same context.
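
A minimal sketch of the point above, in Python: once named character
references are replaced, the text is indistinguishable from text in which
the characters were typed directly, so any later step that discards an RE
or RS treats both alike. The code points (RS = 10, RE = 13) are those of
the reference concrete syntax, the helper name is hypothetical, and only
references closed with ";" are handled.

```python
import re

# Function characters of the reference concrete syntax (assumed here).
FUNCTION_CHARACTERS = {"RS": "\n", "RE": "\r", "SPACE": " "}

_NAMED_REF = re.compile(r"&#(RS|RE|SPACE);")

def replace_named_char_refs(text):
    """Replace &#RS;, &#RE; and &#SPACE; by the characters they name."""
    return _NAMED_REF.sub(lambda m: FUNCTION_CHARACTERS[m.group(1)], text)
```

For example, replace_named_char_refs("&#RS;data&#RE;") yields the same
string as a record typed with real boundaries, "\ndata\r".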

7  Conclusion

Implementors of computer software based on national or international
standards must implement the standard as written, where at all
possible. ISO 8879 fails to be clear and precise in its description of
record boundary processing, but it does seem to have a distinct
intent, as follows:

- The term "ignored" in Clause 7.6.1 means ignored (or discarded) with
  respect to ESIS information only. The parsing process cannot ignore
  the presence of an RE or RS.

- An RE or RS in the content of a mixed content element (i.e. one that
  allows #PCDATA anywhere) is treated as is any other parsed data
  content for the purpose of SGML parsing, even though it is discarded
  for ESIS purposes.

  There are two classes of difficulties associated with record boundary
  processing in SGML: an unnecessarily difficult to read standard, and
  different text line delimiting conventions in different computer
  systems. These difficulties can be reduced in the following ways:

- The text of ISO 8879 should be revised so as to be clear and precise
  as to the required processing of record boundaries. The current WG 8
  project to revise ISO 8879 provides an opportunity to do this,
  although it will probably take two to four years to complete the
  revision cycle. SGML system implementors, SGML system users and other
  standards incorporating SGML have to decide what to do in the
  meantime.

- The CALS standards describing how documents are transmitted between
  computer systems, particularly the latest members of the MIL-M-28001
  and MIL-M-1840 series, should specify how text records are to be
  represented in explicit terms. The best approach would be to specify
  that text records are explicitly bounded by record-start and
  record-end characters when transmitted, no matter what the conventions
  of the transmitting and receiving systems are (i.e. placing characters
  for both RE and RS between text records), although any specification
  in this regard would be better than none. Requiring both REs and RSes
  at text record boundaries would have the additional advantage that an
  RE character preceding each RS would simplify the issues involved in
  discussing record boundary handling.

  Using the CALS standards as a vehicle to resolve record boundary
  handling difficulties is practical, because the CALS standard revision
  cycle is usually considerably shorter than the ISO cycle.

---------
-- 
Eric R. Skinner                ers@xgml.com
Exoterica Corporation   Tel +1 613 722 1700
Ottawa, Canada          Fax +1 613 722 5706
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
From: Erik Naggum <SGML@ifi.uio.no>
Message-ID: <19930314.002@erik.naggum.no>
Date: 14 Mar 1993 02:42:37 UT
Editing-Time: 115 min
References: <1993Mar11.010814.6386@osf.org> <1993Mar12.223624.26543@xgml.com>
Subject: Re: CDATA content and end-tags as data
Lines: 152

Some additional comments to a generally sound reply from Eric R. Skinner.

[John Bowe]
:
|   What are the rules for including tags in CDATA and RCDATA content?

Briefly stated, you cannot include a valid end-tag in character data and
replaceable character data content because doing so would terminate the
content (prematurely).

|   I could not find an explicit statement about this in The SGML Handbook.

See B.8.3 Unparsable Sections [50:15], which discusses marked sections with
the "CDATA" or "RCDATA" status keyword.

|   What I did find was something about the character data ending when the
|   parser finds an ETAGO followed by a valid name (does that mean *any*
|   valid name?).

This is actually a complication of the rules, which do not include the
"valid name" part, so your parenthetical question is not relevant.

|   I also found something about the char data ending when the *matching*
|   end tag is found.

B.13.1.1 [59:1] discusses this: "Only the correct end-tag (or that of an
element in which this element is nested) will be recognized."  However
generally useful, this is in conflict with the requirements in 7.6.

ISO 8879, clause 7.6, paragraph 2:

        The |content| of an element declared to be |character data| or
        |replaceable character data| is terminated only by an ETAGO
        delimiter-in-context (which need not open a valid |end-tag|) or a
        valid NET.  Such termination is an error if it would have been an
        error had the |content| been |mixed content|.  [321:6-11]

[Eric R. Skinner]
:
|   In other words, anything that starts to look like an end tag (ie. an
|   ETAGO followed by a name start character) will cause the end of the
|   CDATA element.  If the end tag is invalid an error will also be
|   generated.

An ETAGO delimiter-in-context is not only an ETAGO followed by a name start
character, though.  If SHORTTAG YES is specified, a TAGC satisfies the
contextual constraints (Clause 9.6.2), and if CONCUR YES is specified in
the SGML declaration, a GRPO is enough.  In the reference concrete syntax,
this means that also "</>" and "</(" will terminate such an element, with
the respective features used.  (Note that there is no contextual constraint
on the GRPO, which it seems that there should have been.)
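
To make the termination rule concrete, here is a small Python sketch (not
taken from any parser) that locates the first "</" recognized as an ETAGO
delimiter-in-context in CDATA or RCDATA content. It assumes the reference
concrete syntax, models the SHORTTAG and CONCUR features as two boolean
parameters, and leaves out termination by a NET; the function name is
made up for the example.

```python
import string

# Name start characters in the reference concrete syntax (an assumption:
# a variant concrete syntax could declare more).
NAME_START = set(string.ascii_letters)

def cdata_content_end(text, shorttag=False, concur=False):
    """Return the index of the first "</" that is an ETAGO
    delimiter-in-context, or -1 if nothing terminates the content."""
    pos = text.find("</")
    while pos != -1:
        following = text[pos + 2:pos + 3]  # "" when "</" ends the text
        if (following in NAME_START
                or (shorttag and following == ">")   # "</>" with SHORTTAG YES
                or (concur and following == "(")):   # "</(" with CONCUR YES
            return pos
        pos = text.find("</", pos + 1)
    return -1
```

Under these assumptions "x</>y" is not terminated with both features off
(the result is -1), but ends at index 1 once shorttag=True is passed, and
"x</(y" ends at index 1 only with concur=True. Note that any name start
character after "</" ends the content, whether or not the end-tag turns
out to be valid.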

|   Partly, this is to allow the end tag of an enclosing element to end the
|   CDATA element;

Right, but note that this is true even if the end-tag for the element is
not declared minimizable (something that tends to confuse people).

|   it's also necessary owing to SGML's token lookahead restrictions.

Hmmm.

[John Bowe]
:
|   So, besides using (#PCDATA) as the content and
|
|       <![ CDATA [ <p>This is an example paragraph.</p> ]]>
|
|   how does one enter markup itself (ie. end tags) as data?  Should I be
|   able to put anything I want (besides the obvious end-tag) as the
|   content of a CDATA element?

I would avoid (replaceable) character data content for examples of SGML
markup, though, and instead use it for content within which I do not want
any other elements or markup recognition.  Such as a "trademark" or
"abbreviation" element within which "&" is not recognized: <tm>AT&T</>.

|   What are your favorite ways for doing this?

[Eric R. Skinner]
:
|   The only two characters that cause problems are < and &.

In CDATA content, "<" and "&" do not cause problems, but the sequence "</"
might.  In RCDATA content, "&" might additionally cause problems.

|   You can "protect" these characters in a number of ways.
|
|   1.  The clumsy entity reference expansion way, ie.    AT&amp;T

|   2.  With a null declaration, ie.  AT&<!>T.  This is better in
|       that no change to the DTD is required.

|   3.  Using a short reference string, to allow this:  AT\&T.
|       To do this you must allow "\&" as a short reference delimiter,
|       then map "\&" to "&" in the necessary contexts.  Tricky on the
|       implementation side but the hands-down winner for clarity of
|       markup.

Hmmm, if making no changes to the DTD is beneficial, adding a short reference
string and declaring it in all short reference maps will be a major set of
modifications.  I think I'd go for the "clumsy entity reference expansion
way" over this, but that may also be because I consider the backslash an
annoyance, and because it might lead people to think that a backslash is
somehow an escape character with general applicability.  Getting the short
reference maps right may also be tricky, and they do not apply in RCDATA
content or in attribute values.  But see below.

Here are my favorite ways, in no particular order:

 o  Always use a space or other punctuation before and after a literal "&"
    that could be misinterpreted.

 o  Define CDATA entities for trademarks, so that if you later need to make
    a list of them, they're in one place.  You can also define a "trademark"
    element, and change all the trademark entities at once.

 o  Use a multicode syntax and use an MSSCHAR (markup-scan-suppress
    character) before literal characters.  (If the backslash is not used
    for data, declaring it as a function character would give you the
    benefit of the escape character without the implementation costs of
    more short reference strings.  Note that both schemes require changes
    to the concrete syntax.  The standard does not say whether two MSSCHARs
    following each other will result in the latter being treated as a
    data character, but I have implemented it that way.)

 o  Use CDATA marked sections (not a good idea for small parts, though).

 o  Use entity references or numeric character entity references.  A good
    DTD already includes an entity declaration for these characters.  "amp"
    and "lt" are the usual entity names.

|   Of course, if you are writing a document with many examples of
|   SGML, you should consider using the SGML declaration to change
|   various delimiters to non-reference values, to allow you to
|   type SGML examples as you please.

Hmmm.  I consider this not so sound advice.  A CDATA marked section is much
simpler to deal with than changing the delimiters, but it is of course
possible to change them.  I have only changed the delimiters (to control
characters) for processing mail and news articles, where this and extensive
use of short references made it unnecessary to modify the input files in
any way.  Needless to say, "<" and "&" cannot be "magic" in existing
material that was not intended for SGML processing.

Best regards,
</Erik>
--
Erik Naggum                 ISO  8879 SGML                   +47 2295 0313
Oslo, Norway                ISO 10744 HyTime
<erik@naggum.no>            ISO  9899 C                 Memento, terrigena
<SGML@ifi.uio.no>           ISO 10646 UCS             Memento, vita brevis
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
From: Erik Naggum <SGML@ifi.uio.no>
Message-ID: <19930201.007@erik.naggum.no>
Date: 01 Feb 1993 07:22:18 UT
Subject: 7.6.1 Record Boundaries revisited
Lines: 345

A request for interpretation of ISO 8879 7.6.1 (Record Boundaries) has been
submitted to ISO/IEC JTC 1/SC 18/WG 8 from the CALS Industry Standards
Groups for the revision of SGML.  I thought this might be interesting for
readers of this newsgroup, and have also included my comments at the end.

The text of the letter reads as follows (lines having a | in column 1 have
been very slightly edited, but no content has been changed):


    1992 September 15

|   To: ISO/IEC JTC 1/SC 18/WG 8

    During the evaluation of an SGML document (including both DTD and
    document instance) submitted to the Air Force CALS Test Bed, an error
    was reported during the parsing operation.  The DTD was parsed without
    error using the Exoterica, Datalogics and SoftQuad parsers.  When the
    SGML document instance was parsed, all three parsers reported errors on
    the same lines.

    During the detailed evaluation of the DTD and the document instance by
    the Air Force CALS Test Bed and Exoterica, the source of the error was
    found to be invalid placement of RE and RS characters in the document
    instance.  The document instance had the end of one tag on one line and
    the start of another on the next line.  The construction of the DTD did
    not permit this as the end of line character was seen as data.

    Discussions with the company submitting the file indicated an
    interpretation problem in ISO 8879, clause 7.6.1.

    A discussion by interested people was held at TechDoc in San Francisco
    on 27 August 1992.  A general consensus of the meaning of the clause
    was made but the feeling was expressed that a written clarification is
    necessary to ISO 8879, clause 7.6.1.  Two other recommendations were
    made as well.

    The data in question involved a file in which all desired characters,
    including record-start and record-end characters, were transmitted as
    variable-length records.  The file resulting from the dechunking used
    at CTN equated each chunk with an input line, and added the character
    pair ASCII 13 ASCII 10 which the CALS 28001A SGML declaration
    identifies as RE and RS characters.  Neither of those characters was
    contained in the document instance before chunking.  Furthermore, the
    chunk-to-line mapping convention was not stated anywhere in 1840A, nor
    in the ANSI tape standard it referenced for variable-length records,
    nor in 28001A.

    The three recommendations of the TechDoc discussion group are:

    1.  Clarify the wording of ISO 8879, clause 7.6.1.  When ISO 8879 is
        revised, consider modifying its intent, as discussed below.  (ISO
        8879 Action)
    2.  Since 1840B recognizes the difficulties introduced by variable-
        length chunking and eliminates such chunking, it should include a
        comment on the use of RS and RE characters.  In particular, it
        should clearly state that the system preparing document instances
        for transmission is responsible for including all desired RS and RE
        characters, and that the receiving system should not add such
        characters as the result of processing tape records.  (CALS Digital
        Standards Office (CDSO) Action)
    3.  The DTD should be modified to avoid mixed content models where data
        cannot occur anywhere in the content of the relevant elements.

    The issue needing clarification in Clause 7.6.1 of ISO 8879 is whether
    an SGML parser must determine that an RE is data before deciding to
    ignore it under the conditions of this clause.  While the consensus of
    the meeting was that this is indeed the case, there was concern that
    the wording of the Standard is difficult to interpret.  Furthermore,
    this interpretation leads to unfortunate cases in which a parser must
    report an error due to the presence of an RE which, if allowed, would
    be ignored.  When ISO 8879 is revised, the group recommends that this
    possibility be eliminated.

    The problems arise in mixed content models where data cannot occur
    everywhere (e.g., if the model is not repeatable or uses a connector
|   other than OR).  For example, consider the declarations:

        <!ELEMENT a - - (b , #PCDATA)>
        <!ELEMENT c - - (d | #PCDATA)>

    and the document instance segments

        <a>
        <b>contents</b>
        </a>

    and

        <c>
        <d>contents</d>
|       </c>

    Assume that each of these lines begins with an RS and ends with an RE.
    In the first case, data is not permitted immediately following the <A>
    start-tag.  Thus, the RE after that tag is invalid, even though it
    would be ignored if data were permitted in that context.  In the second
    case, data is allowed after the <C> start-tag.  However, the presence
    of data means that the #PCDATA branch of the OR-group has been
    selected, and that the C element may not therefore also contain a D.
    Thus, the RE after the <C> start-tag indicates that the #PCDATA branch
    has been selected.  This RE is ignored because it is the first one in
    the element and is not preceded by an RS, data, or proper subelement.
    The <D> start-tag is invalid, because the C cannot contain a D.

    Although the situation did not arise in the particular document that
    triggered this discussion, a question was raised about RS and RE
    characters in CDATA and RCDATA elements.  Should the first RE in such
    an element be ignored if no RS, data, or proper subelement preceded it?
    Should the last RE in such an element be ignored if no data or proper
    subelement follows it?

    The TechDoc meeting concluded that these REs should be ignored
    according to the above mentioned interpretation of 7.6.1; that is,
    there should be no difference for CDATA and RCDATA content versus other
    kinds of content with respect to 7.6.1.


    George Elwood
    Senior Systems Engineer
    Air Force CALS Test Bed
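
The two cases in the letter can be replayed with a toy content-model
check. This is only a sketch in Python under the interpretation the
letter describes: an RE is matched against the model as data before any
decision to ignore it. The token names ("DATA" for a data character or
RE, an element name for a complete subelement) and the function
first_invalid are inventions for the example, and only flat "," and "|"
groups are handled.

```python
def first_invalid(connector, names, tokens):
    """Return the index of the first token the model rejects, or None if
    the whole token sequence is valid.  "#PCDATA" matches any run of
    "DATA" tokens, including an empty one."""
    if connector == ",":
        pos = 0
        for name in names:
            if name == "#PCDATA":
                while pos < len(tokens) and tokens[pos] == "DATA":
                    pos += 1
            elif pos < len(tokens) and tokens[pos] == name:
                pos += 1
            else:
                return pos
        return None if pos == len(tokens) else pos
    if connector == "|":
        # Try each branch alone; report the error of the branch that
        # got furthest, which is the one the data actually selected.
        errors = []
        for name in names:
            err = first_invalid(",", [name], tokens)
            if err is None:
                return None
            errors.append(err)
        return max(errors)
    raise ValueError("unsupported connector: " + connector)
```

For (b , #PCDATA) the content RE, b, RE fails at index 0, the RE after
the start-tag. For (d | #PCDATA) the same shape of content fails at index
1, the D start-tag, because the leading RE has already selected the
#PCDATA branch.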


For reference, the clause in question:


    7.6.1 Record Boundaries

    If an RS in |content| is not interpreted as markup, it is ignored.

    Within |content|, an RE remaining after replacement of all references
    and recognition of markup is treated as data unless its presence can be
    attributed solely to markup.  That is:

    a)  The first RE in an element is ignored if no RS, data, or proper
        subelement preceded it.
    b)  The last RE in an element is ignored if no data or proper
        subelement follows it.
    c)  An RE that does not immediately follow an RS or RE is ignored if no
        data or proper subelement intervened.

    In applying these rules to an element, subelement content is ignored;
    that is, a proper or included subelement is treated as an atom that
    ends in the same record in which it begins.

    An RE is deemed to occur immediately prior to the first data or proper
    subelement that follows it (that is, after any intervening markup
    declaration, processing instruction, or included subelement).


(Notes have been elided.  Please see the [Goldfarb], p 321ff for details
and Goldfarb's annotations.)

I have several comments to the recommendations.  First off, recommendation
number 2 is in conflict with what I interpret to be the function of the
entity manager, namely to ensure that the parser sees record start and end
codes at the start and end, respectively, of a "record".  When the input
file is organized in variable-length records, an entity manager can
legitimately (indeed, must) insert RS and RE codes to inform the parser of
the record structure of the file.  If RS and RE codes are not intended, a
different record structure must be used.  Therefore, the problem lies with
the "chunking" process, not with the "dechunking" process.  Also, if there
are characters 10 and 13 in the source file prior to "dechunking", these
must be treated as non-SGML characters, and should elicit an error.

However, if the "chunking" and "dechunking" are considered to be a function
of a "transport layer" (e.g., rolling data onto and off of tapes), there is
a conflict between the specifications of the chunking and the dechunking as
those can be inferred from the above description.  In the presence of such
an error, however, the problem reported has no impact on or from clause
7.6.1.

Recommendation number 1 is much easier to deal with.  It is actually two
recommendations, 1a: "Clarify the wording ..." and 1b: "... consider
modifying its intent ...".  1a can be accomplished by the rewriting that I
did to understand this (I fully agree that the clause is difficult to
interpret):

    Short reference delimiter strings are resolved.  (RS and RE remain
    after resolution of such references (see [297:20]).)

    An RS is never sent to the application (but is not ignored by the
    parser).

    In element content, an RE is treated as an |s| and is not sent to the
    application.

    In mixed content, the following rules apply:

    a)  If the first RS, RE, data, or proper subelement in an element is an
        RE, it is not sent to the application.
    b)  If the last RE, data, or proper subelement in an element is an RE,
        it is not sent to the application.
    c)  If a record is not empty, and the first RE, data, or proper
        subelement following the RS is an RE, it is not sent to the
        application.

    In elements with |replaceable character data| or |character data|,
    only items a) and b) apply.

    RS and RE in a subelement are part of that subelement's content.

I have made several interpretations of the text, and the most important is
to define "ignore" as "not send to the application".  The text as it stands
makes it difficult to understand how an RS can affect the interpretation of
a following RE if it is truly "ignored".  The next most important rewriting
is to "tokenize" the input file into function characters, data characters,
and interpreted markup constructs.  I.e., I read a "token", and see if that
"token" is an RE function character.  "Tokens" can be markup declarations,
processing instructions, included subelements, and proper subelements, in
addition to individual function characters and lastly, data characters.
The subelements are completely parsed before they are returned as "tokens".
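
The rewritten rules and the token view lend themselves to a short sketch.
The Python below is an illustration of rules a) to c) as reworded above,
not code from any parser; the token spellings ("RS", "RE", "MARKUP" for a
markup declaration, processing instruction or included subelement, "SUB"
for a proper subelement, anything else being data) and the function names
are assumptions made for the example.

```python
RS, RE, MARKUP, SUB = "RS", "RE", "MARKUP", "SUB"

def ignored_res(tokens, apply_rule_c=True):
    """Return the indexes of the RE tokens not sent to the application.
    Pass apply_rule_c=False for CDATA and RCDATA content, where only
    rules a) and b) apply."""
    ignored = set()
    # Rule a: the first RS, RE, data, or proper subelement is an RE.
    for i, tok in enumerate(tokens):
        if tok == MARKUP:
            continue
        if tok == RE:
            ignored.add(i)
        break
    # Rule b: the last RE, data, or proper subelement is an RE.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] in (MARKUP, RS):
            continue
        if tokens[i] == RE:
            ignored.add(i)
        break
    # Rule c: a non-empty record in which only markup precedes the RE.
    if apply_rule_c:
        for i, tok in enumerate(tokens):
            if tok != RS:
                continue
            j = i + 1
            while j < len(tokens) and tokens[j] == MARKUP:
                j += 1
            if j > i + 1 and j < len(tokens) and tokens[j] == RE:
                ignored.add(j)
    return ignored

def sent_to_application(tokens, apply_rule_c=True):
    """Data, proper subelements and surviving REs, in order."""
    ignored = ignored_res(tokens, apply_rule_c)
    return [tok for i, tok in enumerate(tokens)
            if tok not in (RS, MARKUP) and i not in ignored]
```

Given the content RE, RS, "text", RE, RS, RE, RS, MARKUP, RE, RS, "more",
RE, the sketch sends "text", RE, RE, "more" to the application: the first
and last REs fall under a) and b), the RE of the markup-only record under
c), and the RE of the empty record is kept.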

I can read the above and understand it without effort, but I still get
confused if I try to read the original text; too many negations for my
internal stack.  I may be biased because I wrote it :-), but I think my
rewriting helps explain things.

Before we consider how we should "modify the intent" of the above clause,
let's see what the intent is, as things are today:

1)  A record end following a start-tag is ignored.
2)  A record end preceding an end-tag is ignored.
3)  If we use omitted end-tag minimization, a record which contains only a
    start-tag is ignored.  (In mixed content otherwise, the end-tag and
    following start-tag would need to be juxtaposed in the same record to
    cause the RE to be ignored.  In element content, an RE between an
    end-tag and a start-tag would be recognized as an |s| delimiter, and
    ignored.)
4)  A record that contains only markup (not start- or end-tags, but markup
    declarations and processing instructions) is ignored.
5)  A record that contains only included subelement(s) is ignored.
6)  The RE after an empty record is _not_ ignored.

(Item 6 is perhaps controversial.  It is not clear to me whether item c in
the list above takes precedence over item b if the empty record is the last
record before the end-tag.  ARC SGML sends this RE to the application.)

It appears to me that the intent of this clause is that a record which
contains only markup (tags included) is itself considered markup, and both
RS and RE are to be ignored, but that the requirements in the standard tend
to exclude certain REs from this general intent, and the order in which
markup and data are recognized makes the whole thing overly complicated.

The TechDoc discussion group recommends that the possibility be eliminated
that an RE that would be ignored if it were allowed as data causes an
error.  I'm not thrilled with this desired modification, and instead
propose that the intent to ignore records containing markup be straightened
out.  My suggested wording is as follows:

||  7.6.1 Record Boundaries
||
||  The RS and RE of a non-empty record that does not contain data are
||  treated as |s| separators.
||
||  If the record contains data or subelements, an RE that immediately
||  follows a start-tag is ignored, and the RE (and RS) that immediately
||  precede an end-tag are ignored.
||
||  An RS or RE function character ignored by the above requirements is not
||  subjected to further markup recognition.
||
||  NOTE: I.e., short reference delimiter recognition occurs after record
||        boundary processing.

This has the impact that the difference between an included subelement and
a proper subelement is reduced.  (The difference has caused much confusion
previously, and Goldfarb even warns against it: "This implication of
designating an element to either be an inclusion or a proper subelement
should be kept in mind when designing document types.  It can have a
significant impact on the likelihood that users will create record boundary
errors."  Amen.)

The full impact of the proposed change has yet to be assessed (the number
of combinations is large), but preliminary work shows that, except for
records containing only included subelements, no RE that is ignored under
the current requirements is treated as data under the proposed new
requirements, although a number of RE's that are treated as data under the
current requirements are ignored under the new requirements, as well as not
causing short reference delimiter recognition (which today has a number of
unwanted side effects).

The example in the second note in 7.6.1 would cause the RE to never be
ignored, as if the subelement were always proper (explicit RS and RE):

        &#RS;data<outer><sub>&#RE;
        &#RS;data</sub>&#RE;
        &#RS;data</outer>&#RE;

The note states that the first RE in "outer" is that following the "sub"
element and that it is data if "sub" is a proper subelement, but ignored if
"sub" is an included subelement.  I think this is really artificial and
counter-intuitive.  Under my proposed new requirements, only the first RE
would be ignored.

Although the above proposal simplifies things greatly, the TechDoc
discussion group's recommendation number 3 is still good advice.

The question whether record boundary processing applies to CDATA and RCDATA
elements should be answered by pointing out that 7.6.1 makes no claims
about the type of the content of the element to which it applies.  Indeed,
that is part of the problem of reading this paragraph.

Finally, how do the proposed new requirements fit the bill from the CALS
Industry Standards Groups?  To review, given the declarations

        <!ELEMENT  A  (B , #PCDATA)>
        <!ELEMENT  C  (D | #PCDATA)>

the fragment

        <A>
        <B>contents</B>
        </A>

will be parsed as

        <A><B>contents</B></A>

and the fragment

        <C>
        <D>contents</D>
        </C>

will be parsed as

        <C><D>contents</D></C>

which seems to do the job.

To summarize the effect: the record boundaries of the record containing <A>
(and <C>) are ignored because the record contains no data.  The RE and RS
before </A> (and </C>) are ignored because the record contains data and
occur immediately before an end-tag.
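
For readers who want to experiment, the effect summarized above can be
approximated with the Python sketch below. It is a literal, flat reading
of the proposed wording and nothing more: it assumes every record is
bounded by RS (10) and RE (13), does not track element nesting or
inclusions, and uses a naive tag pattern. All names are hypothetical.

```python
import re

RS, RE = "\n", "\r"  # assumed reference concrete syntax code points

TOKEN = re.compile(
    r"(?P<etag></[A-Za-z][^>]*>)"
    r"|(?P<stag><[A-Za-z][^>]*>)"
    r"|(?P<rs>\n)|(?P<re>\r)"
    r"|(?P<data>[^<\r\n]+|<)"
)

def records(*lines):
    """Build input text in which each line is a record RS ... RE."""
    return "".join(RS + line + RE for line in lines)

def apply_proposed_rules(text):
    toks = [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
    dropped = set()
    data_record_res = []  # REs closing records that contain data
    start = None
    for i, (kind, _) in enumerate(toks):
        if kind == "rs":
            start = i
        elif kind == "re" and start is not None:
            body = toks[start + 1:i]
            if any(k == "data" for k, _ in body):
                data_record_res.append(i)
            elif body:
                # Non-empty record without data: RS and RE are separators.
                dropped.update((start, i))
            start = None
    for i in data_record_res:
        if toks[i - 1][0] == "stag":
            dropped.add(i)  # RE immediately after a start-tag
            continue
        j = i + 1
        while j < len(toks) and toks[j][0] == "rs":
            j += 1
        if j < len(toks) and toks[j][0] == "etag":
            dropped.update(range(i, j))  # RE (and RS) before an end-tag
    # A remaining RS is never data, so it is left out of the result too.
    return "".join(t for i, (kind, t) in enumerate(toks)
                   if kind != "rs" and i not in dropped)
```

apply_proposed_rules(records("<A>", "<B>contents</B>", "</A>")) gives
"<A><B>contents</B></A>", matching the parse shown above; an RE between
two data records, as in records("<p>", "first line", "second line",
"</p>"), survives as data.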

Comments invited, especially from all those who have complained about the
current requirements.

Best regards,
</Erik>
--
Erik Naggum                 ISO  8879 SGML                   +47 2295 0313
Oslo, Norway                ISO 10744 HyTime
<erik@naggum.no>            ISO  9899 C                 Memento, terrigena
<SGML@ifi.uio.no>           ISO 10646 UCS             Memento, vita brevis
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
Subject: DTD DTD Summary
Message-ID: <GNAT.93Jul21122937@kauri.kauri.vuw.ac.nz>
From: gnat@kauri.vuw.ac.nz (Nathan Torkington)
Date: 21 Jul 1993 12:29:37 UT
Organization: CSC, Victoria University Of Wellington, New Zealand
Lines: 159

Thanks to everyone for their responses.  These came under the headings
of:
 -- ``ooh ooh, me too!'' :-)
 -- ``there's probably one in SGMLS somewhere'' :-)
and
 -- definite assistance.

In the latter category comes this, the definitive answer from Donald
Gignac.

|  This in response to your posted request. Appended is some work I did
|  along these lines earlier with respect to publishing the various CALS
|  DTDs in the governing DoD specifications. The SGML is CALS SGML
|  (MIL-M-28001B).  You can tag a DTD three ways.
|     Enter the whole thing as CDATA.
|     Devise some ridiculous SGML Declaration to redefine all your SGML
|       markup.
|     Do something like the following and try to format it for publishing
|       somehow. We planned on the CALS OS and FOSI approach.
|  Also see appendix B "Writing a book on SGML using SGML" in Eric van
|  Herwijnen's "Practical SGML". 
| 
|  Use the following at your own risk. I have no idea if it's feasible.
| 
|  [then, when I asked if I could share his stuff]
| 
|  I do not feel the material I sent you yesterday should be available
|  from one of the official SGML archive sites since it has not been
|  thoroughly tested. On the other hand, it could be useful to people
|  working in this area. You are free to distribute this material as you
|  see fit provided the following statement is included:
| 
| 
| 
|     The following SGML declarations and data dictionary were developed by
| 
|          Donald Gignac  "gignac@oasys.dt.navy.mil"  (301) 227-3348
|          Advanced Information Systems Branch (Code 183)
|          David Taylor Model Basin
|          Headquarters, Carderock Division
|          Naval Surface Warfare Center
|          Bethesda, Maryland  USA  20084-5000
| 
|     Any information regarding the use of this material, mistakes or
|     inadequacies found therein, suggestions for improvement, etc. would
|     be greatly appreciated.
| 
|     This material is based on similar material provided to the US Navy by
|     the Datalogics Corporation. The United States Government provides no
|     guarantees regarding its completeness, correctness, or usefulness. The
|     United States Government is not responsible for any losses whatsoever
|     resulting from the use of this material.
| 
| 
| <!-- MARY: The "yesorno" ENTITY declaration below was added for parsing
| purposes. Remove it when the "dtd" declarations are inserted in the DTDs. -->
| <!ENTITY % yesorno "NUMBER">
| 
| 
| <!-- START OF DECLARATIONS FOR MARKING UP A DTD -->
| 
| <!-- BOILERPLATE ENTITY DECLARATIONS -->
| 
| <!ENTITY docname "User supplies DTD's document (root element) name here.">
| 
| <!ENTITY dtdident "User supplies DTD identifier from DTD's formal public 
|                      identifier here.">
| 
| <!ENTITY DOCTYPE 'The following set of declarations may be referred
| to using a public entity as follows:
| 
| <!DOCTYPE &docname; PUB "&dtdident;">' >
| 
| <!ENTITY NOTE1 'NOTE: In order to parse the following Document Type
| Declaration Subset alone, append the Document Type Declaration below
| to the beginning of the file:
| 
|    <!DOCTYPE &docname; [
| 
| and the associated "]>" to the end of the file.'>
| 
| <!-- WARNING: The "etago" entity (or the equivalent "&#60;/") must be
| used to provide the end tag open "</" in the RCDATA declared content 
| of "comment". If "</" is entered from the keyboard, the parser considers 
| it to be a delimiter and prematurely terminates the "comment" markup. 
| -->
| <!ENTITY etago "</">
| 
| <!-- ELEMENT AND ATTLIST DECLARATIONS -->
| 
| <!ELEMENT dtd  - -  (entset?, elemset?) +(comment)>
| <!ATTLIST dtd  docname   CDATA  #REQUIRED -- use &docname; for tag value --
|           dtdident  CDATA  #REQUIRED -- use &dtdident; on tag -- >
| 
| <!-- MARY: Are the above "docname" and "dtdident" attributes redundant?
| Even if they are, shouldn't we still have them? -->
| 
| <!ELEMENT comment  - -  RCDATA>
| <!ATTLIST comment  type  (declaration | embedded)  "declaration">
| 
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE ENTITY SET -->
| 
| <!ELEMENT entset  - -  (entdec | pentref | marksectdec)* >
| 
| <!ELEMENT entdec  - -  (paramlit | datatext | bracktext | extentspec)>
| <!ATTLIST entdec  entname   CDATA  #REQUIRED>
| 
| <!ELEMENT (paramlit | datatext | bracktext)  - -  CDATA>
| 
| <!ATTLIST datatext  type  (CDATA | SDATA | PI)  #REQUIRED>
| 
| <!ATTLIST bracktext  type (STARTTAG | ENDTAG | MS | MD)  #REQUIRED>
| 
| <!ELEMENT extentspec  - -  (enttype?)>
| <!ATTLIST extentspec  type      (PUBLIC | SYSTEM)  "PUBLIC"
|                       entident  CDATA              #REQUIRED>
| 
| <!ELEMENT enttype  - -  (datattspec?)>
| <!ATTLIST enttype  notname   CDATA                     #REQUIRED
|          type      (CDATA | NDATA | SDATA)   #REQUIRED>
| 
| <!ELEMENT datattspec  - -  (attname, valspec)*>
| 
| <!ELEMENT valspec  - -  CDATA>
| 
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE DECLARATION SEPARATORS 
|        (PARAMETER ENTITY REFERENCES AND MARKED SECTIONS) -->
| 
| <!ELEMENT pentref  - -  CDATA>
| 
| <!ELEMENT marksectdec  - -  (entdec | elemdec | attlistdec | pentref | 
|               marksectdec)* >
| <!ATTLIST marksectdec  statkeyw   CDATA   #REQUIRED>
| 
| <!-- ELEMENT AND ATTRIBUTE DECLARATIONS FOR THE ELEMENT SET -->
| 
| <!ELEMENT elemset  - -  ((elemdec, attlistdec?) | (notdec, attlistdec?) |
|           pentref | marksectdec)*>
| 
| <!ELEMENT elemdec  - o  (modgrp, excepts?)>
| 
| <!-- WARNING: When a tag corresponding to the "elemdec" element has a
| content model (i.e., the "declcont" attribute is not specified), then
| the CDATA contents of the "modgrp", "exclusions" (if present), and 
| "inclusions" (if present) tags MUST be terminated by their end tags. When
| the "declcont" attribute is specified on an "elemdec" tag, then an
| "elemdec" end tag MUST NOT be present NOR shall that "elemdec" tag 
| have content. -->
| 
| <!ATTLIST elemdec  elemname   CDATA                       #REQUIRED
|          startmin   %yesorno;                   "0"
|          endmin     %yesorno;                   "1"
|          declcont   (CDATA | RCDATA | EMPTY)    #CONREF>

I hope this is of some use,

Cheers;

Nat
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
Date: 19 Jun 1994 02:44:01 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940619.3045@naggum.no>
References: <1994Jun17.174907.14944@chemabs.uucp>
Subject: Re: SGML DTD help needed for binary data.

[Wayne Mills]

|   I'm working on writing a DTD to describe a data file.  The fly in the
|   ointment is that there is some binary data in the mix.  I have a tag
|   called GDSMM that will mark this data for me.

binary data does not in general mix well with text.  the cleanest solution
is to use an external entity, which you may also give a NOTATION and
notation attributes.  another possibility is to accept the suggestion from
SGMLS, and make all characters into (numeric) character references.

|   This is based on info I found in the book __Practical SGML__ by
|   Herwijnen.

I really hope the book does not suggest using CDATA for binary data!  CDATA
and RCDATA should in fact not be used at all.  they run counter to several
SGML principles, the most important breach of which is that elements that
have declared content (CDATA and RCDATA) are locked to the character set
used in the document, since there is no way to turn them into entity
references when (not if) needed.  thus, the document is system-dependent
despite every intention to make documents less so.  the effect of CDATA and
RCDATA can be obtained with marked sections, which are under user control,
where such decisions belong.  this doesn't help you with binary data.

|   <GDSMM>^0^^&J6Yy</GDSMM>

this does not appear to be consistent with the error messages you got, so I
assume you meant ^O (not ^0) as control-O and ^^ as control-^.  under this
assumption, a valid instance could look like this, and you can use #PCDATA
for the element.

    <GDSMM>&#15&#30&#38&#74&#54&#89&#121</GDSMM>

note that you obtain 200-400% overhead in this representation, but it's the
only safe way to do it.  binary data is not text; whereas the characters in
the text may undergo translation, the binary data should remain the same.
using this notation, they will.
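
[Editorial sketch, not part of the original posting.]  The numeric
character reference encoding described above could be generated roughly as
follows.  The names to_char_refs and overhead_percent and the refc
parameter are assumptions made for illustration; they do not come from
SGMLS or any other tool.  The reference close is left out by default, as in
the example above, which is only safe when the character following a
reference cannot continue the number.

```python
def to_char_refs(data: bytes, refc: str = "") -> str:
    """Encode each byte as an SGML numeric character reference (&#N).

    refc is the optional reference close; pass ";" when the encoded data
    may be followed by a digit.
    """
    return "".join(f"&#{byte}{refc}" for byte in data)


def overhead_percent(data: bytes, refc: str = "") -> float:
    """Extra size of the encoded form relative to the raw bytes, in percent."""
    if not data:
        return 0.0
    encoded = to_char_refs(data, refc)
    return 100.0 * (len(encoded) - len(data)) / len(data)


# control-O, control-^, then the printable characters from the example
sample = bytes([15, 30]) + b"&J6Yy"
encoded = to_char_refs(sample)
```

For the seven bytes of the example, the encoded form is the string shown in
the instance above, and the size overhead falls inside the 200-400% range
mentioned in the posting.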

unless you have a lot of small snippets of binary data like this, the clean
solution is an entity, with the option to inline the data as above:

    <!NOTATION BINARY-CGM PUBLIC "whatever">
    <!ENTITY GDSMM001 SYSTEM "whatever" NDATA BINARY-CGM>

    <!ELEMENT GDSMM (#PCDATA)>
    <!ATTLIST GDSMM external ENTITY #CONREF>

an instance can then look like this:

    <GDSMM external=GDSMM001>
or
    <GDSMM>&GDSMM001</GDSMM>
or
    <GDSMM>&#15&#30&#38&#74&#54&#89&#121</GDSMM>

depending on your needs.  I favor the attribute solution, as there is no
need to have the parser scan over the data.  your application software will
have to talk to the entity manager to obtain the data.

|   So, can anyone tell me the proper DTD approach and syntax for
|   specifying that a certain GI tags data which is binary and should be
|   "ignored" by sgmls?

the short answer is that you can't make an SGML parser "ignore" data if it
is also parsed.  (the exception is NDATA entities, where the parser only
looks for the entity end, but then, what's the point in having it go
through the parser in the first place?)

hope this helps.

best regards,
</Erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>       |  memento, terrigena
ISO 8652 Ada/ISO 8879 SGML/ISO 9899 C/ISO 10646 UCS  |  memento, vita brevis

ftp://ftp.ifi.uio.no/pub/SGML           wais://ftp.ifi.uio.no/comp.text.sgml
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
From: Erik Naggum <SGML@ifi.uio.no>
Message-ID: <19930623.009@erik.naggum.no>
Date: 23 Jun 1993 01:26:24 UT
References: <1993Jun23.093319.1796@gp.co.nz>
Subject: Re: CDATA and RCDATA understand confused
Lines: 42

[Stephen Dixon]
:
|   I had thought that it was possible to have an element "x" whose
|   content was declared CDATA or RCDATA in which could occur start-tags
|   (and associated end-tags) that may or may not be in the DTD. The
|   parser would look at these and only recognise that end-tag which
|   started it.

Nope.  The parser only looks for the end-tag open delimiter-in-context,
i.e., the string "</", followed by a name start character, a letter.  This
is covered by the perhaps confusing requirement:

    7.6 Content
     :

    The [content] of an element declared to be [character data]
    or [replaceable character data] is terminated only by an
    {etago} delimiter-in-context (which need not open a valid
    [end-tag]) or a valid {net}.  Such termination is an error
    if it would have been an error had the [content] been [mixed
    content].


|   How can this be achieved?

You can use a marked section, instead, such as

    <![ CDATA [
        ........
    ]]>

where you run into the same problem with the string "]]>", but are
otherwise relieved of the particular problems you have right now.
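
[Editorial sketch, not part of the original posting.]  A small helper can
make the remaining "]]>" restriction explicit before text is wrapped in a
CDATA marked section.  The function name wrap_cdata_section is an
assumption for illustration, and the check assumes the reference concrete
syntax, where the section ends at the first "]]>".

```python
MSC_MDC = "]]>"  # marked section close followed by markup declaration close


def wrap_cdata_section(text: str) -> str:
    """Wrap text in a CDATA marked section, refusing text that would end it early."""
    if MSC_MDC in text:
        raise ValueError('text contains "]]>" and would terminate the marked section')
    return f"<![ CDATA [{text}]]>"


wrapped = wrap_cdata_section("<p>This is an example paragraph.</p>")
```

Rejecting the input, rather than trying to repair it, leaves the decision
about how to represent "]]>" with whoever prepares the document.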

A good idea is to refer to CDATA entities for examples of SGML markup.

Best regards,
</Erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>          ISO  8879 SGML
Chairman, SGML SIGhyper <SGML.SIGhyper@ifi.uio.no>      ISO 10744 HyTime
"Memento, terrigena.  Memento, vita brevis."            ISO 10646 UCS
-----------------------------------------------------------------------


Newsgroups: comp.text.sgml
Date: 28 Apr 1994 23:13:53 UT
From: Erik Naggum <erik@naggum.no>
Organization: Naggum Software; +47 2295 0313
Message-ID: <19940429.1992@naggum.no>
References: <1994Apr28.210539.23985@sics.se>
Subject: Re: Normalizing SGML with sgmls?

[Hakan Soderstrom]

|   Somebody happens to know about a back-end to sgmls that puts the output
|   together to SGML again?  What I need is an inexpensive way of
|   normalizing SGML, i.e. expand any markup minimization features.

too much information is lost from the original document for this to be
generally possible or desirable.  you need the SGML declaration and the
prolog untouched, and the ESIS output that SGMLS provides does not support
this "information set".  further, you may not want to have all entities
expanded, only markup minimization.

translation from one feature set to another is a very complicated task, and
specialized tools must do the job.  further, it is not generally possible
due to the incredibly silly syntax-changing aspects of some features in
combination with the destructive nature of the CDATA and RCDATA declared
content.  if you avoid using those, things become _much_ easier, so don't
use them.

        <!SGML "ISO 8879:1986" ... CONCUR NO ... >
        <!DOCTYPE example [
        <!ELEMENT example CDATA>
        ]>
        <example><(ignore)this>foobar</(ignore)this></example>

now, try

        <!SGML "ISO 8879:1986" ... CONCUR YES ... >
        <!DOCTYPE example [
        <!ELEMENT example CDATA>
        ]>
        <example><(ignore)this>foobar</(ignore)this></example>

both <(ignore)this> and </(ignore)this> will _vanish_ because of the rules
of the CONCUR feature.  the problem is that there's no way you can keep
them from vanishing, because no markup is recognized in CDATA.  you have a
different document, and you may not even know it!  there's no warning from
the parser, _nothing_ to let you know you suddenly lost vital data in this
example.  you have in effect prohibited this document from being used with
CONCUR.  maybe not a disaster, since CONCUR is unimplemented and should
remain so, but with LINK, where an example would be slightly pathological,
a document instance can prohibit processing with LINK, either by producing
syntax errors or by making data vanish as ignored markup.

CDATA and RCDATA as declared content should not be used.  there is no need
for them, and they will harm your information, as well as confuse the user.

going from some markup minimization features to none is simpler, but not
generally possible if you use entities that hold minimization-dependent
markup.  this points to a weakness in the technique to use such entities,
which is often the case with short reference maps.  there's no way short of
expanding such entities.  if you don't want the space savings and document
organization in the entity structure to get lost in the process, you need
selective entity replacement.

contrary to what one might think, an SGML document is bound very tightly to
its SGML declaration and feature set, including minimization, but the SGML
declaration describes more than the document instance, and encroaches on
the user's ability to process the document.  editing an SGML declaration to
remove or add features is not a task for the faint of heart.  the obvious
conclusion is that the SGML declaration should never be changed, at least
not without being willing to pay the penalty of checking that the document
is indeed the same before and after.  this is _most_ unfortunate.

in other words -- you don't _want_ to expand markup minimization features
with a quick and dirty solution.  you may not want to do it at all.

those programs that purport to produce "normalized SGML" may destroy some
of your information without letting you know about it; maybe not when you
"normalize" it, but when you read it back in with another parser.  some may
even produce "normalized SGML" that isn't SGML at all (MARK-IT comes to
mind).

best regards,
</erik>
--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>  |  memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS  |  memento, vita brevis.

----------------------------------------------------------------------------

From: hajagos@frame.com (Lani Hajagos)
Newsgroups: comp.text.sgml
Subject: More FrameBuilder and SGML
Date: 26 Jan 1993 01:44:23 UT
Organization: UTexas Mail-to-News Gateway
Lines: 60
Sender: daemon@cs.utexas.edu
Message-ID: <9301260142.AA02622@lani.corp.frame.com.frame>


In article 1879@gmd.de, thomas@gmd.de asks "Are you claiming that there
is a one-to-one mapping between FrameBuilder documents and SGML
documents?   Or only that some (or even most) SGML documents can be
mapped into FrameBuilder, and vice versa?"

I don't believe that any SGML product on the market currently supports
100% of ISO 8879, so I certainly won't make the claim for
FrameBuilder. FrameBuilder represents the SGML entity structure of a
document without using SGML syntax. There are several SGML constructs
which have no direct correspondence in FrameBuilder:

- I mentioned attributes in my previous message. These can be mapped
into several different FrameBuilder structures, depending on their
intended usage. There is no concept of current or content reference
attributes, however.

- FrameBuilder has no direct analog to the SGML entity structure.
FrameBuilder's Book handles documents divided into multiple files,
although each document in the Book must be a complete element, and
books cannot be nested.

- FrameBuilder does not support SGML's optional LINK and CONCUR features.

SGML documents using syntax-specific SGML constructs such as markup
minimization or CDATA and RCDATA marked sections can be imported into
or exported out of FrameBuilder fairly easily. These constructs are
irrelevant to the FrameBuilder model, however, and have no counterpart
within FrameBuilder.

FrameBuilder does have a construct analogous to INCLUDE, IGNORE, and TEMP
marked sections. This construct, called Conditional Text, however,
does not support spanning partial elements.

Since FrameBuilder is an application for formatting structured
documents as well as an editor for manipulating them, the question of
whether FrameBuilder supports all possible SGML structures is less
meaningful than whether it supports them in the intended way. Because
the conceivable applications are unlimited, SGML was deliberately
defined so that no one system could do so.

For example, while FrameBuilder recognizes graphics in many file
formats, thereby supporting several data content notations, it
certainly does not recognize every possible graphics format. As another
example, tables can be represented in many ways in SGML. Of course the
possible element structures can be represented as element structures
within FrameBuilder. What is wanted, though, is a representation within
the rich table facility that FrameBuilder inherits from FrameMaker.
Very little programming is required to move the SGML tables we have
seen at Frame to and from the FrameBuilder table mechanism.  In its
initial release, FrameBuilder internally supports elements that
correspond to entire tables and allows table cells to contain
hierarchies of elements.  There are no elements for table rows or
columns, however, and there is no content model defining the order in
which particular types of cells are expected to occur.

I hope this answers your questions.


-----------------------------------------------------------------------

RCDATA




Index comp.text.sgml contains the following 32 items relevant to 'RCDATA'. The
first figure for each entry is its relative score, the second the number of lines in the
item. 

   1000 78 183937.Naggum /local/ftp/pub/SGML/comp.text.sgml/19940619/ 
   929 101 155758.Naggum /local/ftp/pub/SGML/comp.text.sgml/19940401/ 
   786 65 193622.Megginson /local/ftp/pub/SGML/comp.text.sgml/19940619/ 
   643 49 213319.Dixon /local/ftp/pub/SGML/comp.text.sgml/19930622/ 
   572 449 183053.Skinner /local/ftp/pub/SGML/comp.text.sgml/19930206/ 
   572 160 024237.Naggum /local/ftp/pub/SGML/comp.text.sgml/19930314/ 
   500 351 072218.Naggum /local/ftp/pub/SGML/comp.text.sgml/19930201/ 
   500 166 122937.Torkington /local/ftp/pub/SGML/comp.text.sgml/19930721/ 
   500 84 024401.Naggum /local/ftp/pub/SGML/comp.text.sgml/19940619/ 
   429 333 051234.Connolly /local/ftp/pub/SGML/comp.text.sgml/19930116/ 
   429 49 012624.Naggum /local/ftp/pub/SGML/comp.text.sgml/19930623/ 
   429 85 231353.Naggum /local/ftp/pub/SGML/comp.text.sgml/19940428/ 
   357 663 085319.Popham /local/ftp/pub/SGML/comp.text.sgml/19920526/ 
   357 68 014423.Jahagos /local/ftp/pub/SGML/comp.text.sgml/19930126/ 
   357 83 190816.Brueni /local/ftp/pub/SGML/comp.text.sgml/19930221/ 
   357 56 010814.Bowe /local/ftp/pub/SGML/comp.text.sgml/19930311/ 
   357 74 223624.Skinner /local/ftp/pub/SGML/comp.text.sgml/19930312/ 
   357 95 233109.Skinner /local/ftp/pub/SGML/comp.text.sgml/19930314/ 
   357 93 204143.Naggum /local/ftp/pub/SGML/comp.text.sgml/19930317/ 
   357 54 182552.Kimber /local/ftp/pub/SGML/comp.text.sgml/19930326/ 
   357 332 215637.Kimber /local/ftp/pub/SGML/comp.text.sgml/19930524/ 
   357 213 234410.Kimber /local/ftp/pub/SGML/comp.text.sgml/19930926/ 
   357 37 102257.Thompson /local/ftp/pub/SGML/comp.text.sgml/19931007/ 
   357 152 184256.Kimber /local/ftp/pub/SGML/comp.text.sgml/19931007/ 
   357 374 050500.Suttor /local/ftp/pub/SGML/comp.text.sgml/19931201/ 
   357 25 021410.English /local/ftp/pub/SGML/comp.text.sgml/19940303/ 
   357 55 132229.Holman /local/ftp/pub/SGML/comp.text.sgml/19940310/ 
   357 143 152315.Rath /local/ftp/pub/SGML/comp.text.sgml/19940407/ 
   357 69 160004.Kimber /local/ftp/pub/SGML/comp.text.sgml/19940521/ 
   357 56 132300.Kimber /local/ftp/pub/SGML/comp.text.sgml/19940525/ 
   357 46 174907.Mills /local/ftp/pub/SGML/comp.text.sgml/19940617/ 
   357 34 100400.McArthur /local/ftp/pub/SGML/comp.text.sgml/19940619/ 

-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
From: bowe@acme.osf.org (John Bowe)
Subject: CDATA content and end-tags as data
Message-ID: <1993Mar11.010814.6386@osf.org>
Summary: how does one include end-tags in CDATA content?
Keywords: CDATA content, end-tag
Sender: news@osf.org (USENET News System)
Organization: Open Software Foundation, Cambridge, MA, USA
Date: 11 Mar 1993 01:08:14 UT
Lines: 46

What are the rules for including tags in CDATA and RCDATA content?  For
example, can I insert a chunk of example markup as the content of a CDATA
element?  I could not find an explicit statement about this in The SGML
Handbook.  What I did find was something about the character data ending
when the parser finds an ETAGO followed by a valid name (does that mean
*any* valid name?).  I also found something about the char data ending when
the *matching* end tag is found.

sgmls (1.1) gives an error.  Here is the DTD and instance:

     % cat -n /tmp/cdata
     1  <!DOCTYPE doc [
     2  <!ELEMENT doc   - o     (data)  >
     3  <!ELEMENT data  - -     CDATA   >
     4  ]>
     5  
     6  <doc>
     7  <data>
     8     <p>This is an example paragraph.</p>
     9  </data>
    10  </doc>

Here's the error message:

    % sgmls /tmp/cdata 
    sgmls: SGML error at cdata, line 8 at ">":
           No element declaration for P end-tag GI; end-tag ignored
    sgmls: SGML error at cdata, line 8 at ">":
           Bad end-tag in R/CDATA element; treated as short (no GI) end-tag
    sgmls: SGML error at cdata, line 9 at ">":
           DATA end-tag ignored: doesn't end any open element (current is DOC)
    (DOC
    (DATA
    -   <p>This is an example paragraph.
    )DATA
    )DOC

So, besides using (#PCDATA) as the content and

    <![ CDATA [ <p>This is an example paragraph.</p> ]]>

how does one enter markup itself (ie. end tags) as data?  Should I be able
to put anything I want (besides the obvious end-tag) as the content of a
CDATA element?  What are your favorite ways for doing this?

        - john
-----------------------------------------------------------------------
Newsgroups: comp.text.sgml
From: ers@xgml.com (Eric R. Skinner)
Subject: Re: CDATA content and end-tags as data
Message-ID: <1993Mar12.223624.26543@xgml.com>
Keywords: CDATA content, end-tag
Organization: Exoterica Corporation
References: <1993Mar11.010814.6386@osf.org>
Date: 12 Mar 1993 22:36:24 UT
Lines: 65

In article <1993Mar11.010814.6386@osf.org> bowe@acme.osf.org (John Bowe) writes:
>What are the rules for including tags in CDATA and RCDATA content?  For
>example, can I insert a chunk of example markup as the content of a CDATA
>element?  I could not find an explicit statement about this in The SGML
>Handbook.  What I did find was something about the character data ending
>when the parser finds an ETAGO followed by a valid name (does that mean
>*any* valid name?).  I also found something about the char data ending when
>the *matching* end tag is found.

Clause 9.6.1 (Recognition Modes) provides the first clue:

"Note:  Most delimiters will not be recognized when the content is
character data or replaceable character data."

But of course that doesn't answer the question.  Clause 7.6,
eerily close to this group's favorite 7.6.1, says:

"The content of an element declared to be character data or replaceable
character data is terminated only by an ETAGO delimiter-in-context
(***which need not open a valid end-tag***) or a valid net.  Such a
termination is an error if it would have been an error had the content
been mixed content."

In other words, anything that starts to look like an end tag (ie.
an ETAGO followed by a name start character) will cause the end of
the CDATA element.  If the end tag is invalid an error will also be
generated.

Partly, this is to allow the end tag of an enclosing element to end
the CDATA element;  it's also necessary owing to SGML's token
lookahead restrictions.

>So, besides using (#PCDATA) as the content and
>
>    <![ CDATA [ <p>This is an example paragraph.</p> ]]>
>
>how does one enter markup itself (ie. end tags) as data?  Should I be able
>to put anything I want (besides the obvious end-tag) as the content of a
>CDATA element?  What are your favorite ways for doing this?

The only two characters that cause problems are < and &.  You can
"protect" these characters in a number of ways.

1.  The clumsy entity reference expansion way, ie.    AT&amp;T
2.  With a null declaration, ie.  AT&<!>T.  This is better in 
    that no change to the DTD is required.
3.  Using a short reference string, to allow this:  AT\&T.
    To do this you must allow "\&" as a short reference delimiter,
    then map "\&" to "&" in the necessary contexts.  Tricky on the
    implementation side but the hands-down winner for clarity of
    markup.

The same steps could be taken to protect the "<" character.
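
[Editorial sketch, not part of the original posting.]  Option 2 above, the
null declaration, lends itself to a mechanical rewrite of raw text before
it is placed in (#PCDATA) content.  The function name
protect_with_null_declaration is an assumption for illustration; the sketch
simply places "<!>" after every "&" and "<", so it should be applied once,
to text that has not already been protected.

```python
import re


def protect_with_null_declaration(text: str) -> str:
    """Insert an empty declaration after each "&" and "<" so neither opens markup."""
    return re.sub(r"[&<]", lambda m: m.group(0) + "<!>", text)


protected = protect_with_null_declaration("AT&T")
```

The null declaration is inserted unconditionally, even where the following
character could not have started a reference or tag; that costs a few bytes
but keeps the rule trivial to check by eye.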

Of course, if you are writing a document with many examples of
SGML, you should consider using the SGML declaration to change
various delimiters to non-reference values, to allow you to
type SGML examples as you please.

Cheers,

-- 
Eric R. Skinner                ers@xgml.com
Exoterica Corporation   Tel +1 613 722 1700
Ottawa, Canada          Fax +1 613 722 5706
-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
From: ers@xgml.com (Eric R. Skinner)
Subject: Re: CDATA content and end-tags as data
Message-ID: <1993Mar14.233109.18643@xgml.com>
Organization: Exoterica Corporation
References: <1993Mar11.010814.6386@osf.org> <1993Mar12.223624.26543@xgml.com> <19930314.002@erik.naggum.no>
Date: 14 Mar 1993 23:31:09 UT
Lines: 87

In article <19930314.002@erik.naggum.no> Erik Naggum <SGML@ifi.uio.no> writes:
>An ETAGO delimiter-in-context is not only an ETAGO followed by a name start
>character, though.  If SHORTTAG YES is specified, a TAGC satisfies the
>contextual constraints (Clause 9.6.2), and if CONCUR YES is specified in
>the SGML declaration, a GRPO is enough.  In the reference concrete syntax,
>this means that also "</>" and "</(" will terminate such an element, with
>the respective features used.  (Note that there is no contextual constraint
>on the GRPO, which it seems that there should have been.)

You're right...  I only told a small part of the story.  To make
a complete list, we need to include a NET delimiter which is valid
if a NET has been used to end the start tag.
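
[Editorial sketch, not part of the original posting.]  The termination rule
discussed here can be approximated by a scan for the first ETAGO
delimiter-in-context.  The function name cdata_end_offset and its two flags
are assumptions for illustration.  The sketch assumes the reference
concrete syntax, treats only ASCII letters as name start characters, and
does not model termination by a NET.

```python
import re
from typing import Optional


def cdata_end_offset(content: str, shorttag: bool = False,
                     concur: bool = False) -> Optional[int]:
    """Return the offset of the first "</" that would end CDATA content, or None."""
    followers = "A-Za-z"      # name start characters
    if shorttag:
        followers += ">"      # TAGC, so "</>" terminates
    if concur:
        followers += "("      # GRPO, so "</(" terminates
    match = re.search(rf"</[{followers}]", content)
    return match.start() if match else None


content = "\n   <p>This is an example paragraph.</p>\n"
offset = cdata_end_offset(content)
data_seen_by_parser = content[:offset]
```

Applied to the instance from the original question, the data ends just
before "</p", which agrees with the single data line in the sgmls output
quoted earlier in the thread.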

>
>|   Partly, this is to allow the end tag of an enclosing element to end the
>|   CDATA element;
>
>Right, but note that this is true even if the end-tag for the element is
>not declared minimizable (something that tends to confuse people).

Absolutely true.  It forces the end of the CDATA then causes an error.

Then on to favorite ways of protecting characters.  I suggested the
use of "\&" and "\<" as short references because in general those
are the two characters that can cause trouble.

Erik writes:
> o  Use a multicode syntax and use an MSSCHAR (markup-scan-suppress
>    character) before literal characters.  (If the backslash is not used
>    for data, declaring it as a function character would give you the
>    benefit of the escape character without the implementation costs of
>    more short reference strings.  Note that both schemes require changes
>    to the concrete syntax.)

There is a significant problem with using MSSCHAR (and the related
MSOCHAR and MSICHAR), in that while an MSSCHAR "suppresses recognition
of markup for the next character in the same entity" (paraphrased
9.7), nowhere does the standard say that the MSSCHAR itself is to be
discarded.  Hence, the MSSCHAR is treated as data in the context of
the surrounding element (ie PCDATA or CDATA or RCDATA) and is passed
to the application.  It's a pain for the application to then rip out
the MSSCHARs.

I think it's certainly possible to define "\&" and "\<" as short
references in all contexts;  getting it right should not be too
tricky.  It works just like an MSSCHAR from the user's point of view,
covers all necessary cases, and doesn't result in spurious characters
being passed to the application.

>Hmmm.  I consider this not so sound advice.  A CDATA marked section is much
>simpler to deal with than changing the delimiters, but it is of course
>possible to change them.  I have only changed the delimiters (to control
>characters) for processing mail and news articles, where this and extensive
>use of short references made it unnecessary to modify the input files in
>any way.  Needless to say, "<" and "&" cannot be "magic" in existing
>material that was not intended for SGML processing.

Well, using CDATA marked sections is fine except that you now need
to protect the "]]" character sequence if it is part of your data.
There is no general solution short of changing the delimiters.  If
I were writing an SGML text I would seriously consider using alternate
delimiters;  short of implementation complexities (which should not
be the user's problem) I don't see a serious argument against it.  The
SGML declaration which implements this in a comprehensive fashion is
not terribly tricky -- anyone want a copy?

Incidentally, I should point out that Exoterica has a document
entitled "Understanding the SGML Declaration" available for free.
It explains the syntax of the declaration in lots of detail with
good examples.  For your copy send mail to info@xgml.com with
your mailing address.

The complete library of free documents:

ECM03  Understanding the SGML Declaration
ECM11  Content Model Algebra
ETR13  Record Boundary Processing in SGML


Cheers,



-- 
Eric R. Skinner                ers@xgml.com
Exoterica Corporation   Tel +1 613 722 1700
Ottawa, Canada          Fax +1 613 722 5706
-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
From: Erik Naggum <SGML@ifi.uio.no>
Message-ID: <19930317.015@erik.naggum.no>
Date: 17 Mar 1993 20:41:43 UT
Supersedes: <19930317.014@erik.naggum.no>
Editing-Time: 83 min
Subject: SGML markup examples in SGML
Lines: 85

In article <1993Mar11.010814.6386@osf.org>, John Bowe asked about rules for
including markup as data in SGML documents, and about favorite ways of
doing it.  After thinking about this a little, I think the best answer is
to go one step back, ask what was to be accomplished, and find alternate
ways to accomplish it: Using CDATA and RCDATA elements or marked sections
is clearly _one_ of the possible solutions, but not an entirely
satisfactory one because of the many situations that Eric Skinner and I had
to cover in our replies.  One very simple answer is to use external
entities, but if entities map to files, this can lead to a large number of
files.  IMNSHO, a good SGML system should allow more than one storage
mechanism for entities.

Another solution, although not necessarily as immediately visually
gratifying as one which would let you see what you get (so to speak), is
to use entity references for each delimiter role, named after the
delimiter role.

Consider the following entity declarations:

    <!ENTITY stago CDATA "<">
    <!ENTITY etago CDATA "</">
    <!ENTITY tagc  CDATA ">">

Your example can now easily be put into an ordinary mixed content element:

    <example>
    &stago;p&tagc;This is an example paragraph.&etago;p&tagc;
    </example>

Note that since all entities have data text as their entity text, there are
no special cases to consider.
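
A minimal sketch, not part of the original posting, of how this
substitution could be automated in Python. The function name
`escape_delimiters` is hypothetical, and mapping "&" to `&ero;` assumes the
full entity set appended to this message has been declared. Only the
delimiter strings of the reference concrete syntax are handled.

```python
import re

# Delimiter strings of the reference concrete syntax, mapped to entity
# references named after the delimiter roles (etago, stago, tagc, ero).
_DELIMS = {
    "</": "&etago;",
    "<": "&stago;",
    ">": "&tagc;",
    "&": "&ero;",
}

# "</" is listed before "<" so that the longer delimiter is matched first.
_PATTERN = re.compile(r"</|<|>|&")


def escape_delimiters(text):
    """Replace markup delimiters in text with entity references."""
    # A single pass means ampersands introduced by the replacements
    # themselves are never escaped a second time.
    return _PATTERN.sub(lambda m: _DELIMS[m.group(0)], text)


print(escape_delimiters("<p>This is an example paragraph.</p>"))
# &stago;p&tagc;This is an example paragraph.&etago;p&tagc;
```

The output matches the content of the example element shown above.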

For convenience, a full entity set is appended to this message.  (It is
validated, and has been tested on cute furry animals: no one died.)

Best regards,
</Erik>
--
Erik Naggum                 ISO  8879 SGML                   +47 2295 0313
Oslo, Norway                ISO 10744 HyTime
<erik@naggum.no>            ISO  9899 C                 Memento, terrigena
<SGML@ifi.uio.no>           ISO 10646 UCS             Memento, vita brevis


<![
--------------------------------------------------------------------------
                  Entity Set for Reference Delimiter Set.

           Created by Erik Naggum <SGML@ifi.uio.no>, 1993-03-17.

    PUBLIC "+//ISBN 82-7640-000//ENTITIES Reference Delimiter Set//EN"
--------------------------------------------------------------------------
[
<!ENTITY and   CDATA "&"  >
<!ENTITY com   CDATA "--" >
<!ENTITY cro   CDATA "&#" >
<!ENTITY dso   CDATA "["  >
<!ENTITY dsc   CDATA "]"  >
<!ENTITY dtgo  CDATA "["  >
<!ENTITY dtgc  CDATA "]"  >
<!ENTITY ero   CDATA "&"  >
<!ENTITY etago CDATA "</" >
<!ENTITY grpo  CDATA "("  >
<!ENTITY grpc  CDATA ")"  >
<!ENTITY lit   CDATA '"'  >
<!ENTITY lita  CDATA "'"  >
<!ENTITY mdo   CDATA "<!" >
<!ENTITY mdc   CDATA ">"  >
<!ENTITY minus CDATA "-"  >
<!ENTITY msc   CDATA "]]" >
<!ENTITY net   CDATA "/"  >
<!ENTITY opt   CDATA "?"  >
<!ENTITY or    CDATA "|"  >
<!ENTITY pero  CDATA "%"  >
<!ENTITY pio   CDATA "<?" >
<!ENTITY pic   CDATA ">"  >
<!ENTITY plus  CDATA "+"  >
<!ENTITY refc  CDATA ";"  >
<!ENTITY rep   CDATA "*"  >
<!ENTITY rni   CDATA "#"  >
<!ENTITY seq   CDATA ","  >
<!ENTITY stago CDATA "<"  >
<!ENTITY tagc  CDATA ">"  >
<!ENTITY vi    CDATA "="  >
]]>
-----------------------------------------------------------------------

From: drmacro@ralvm13.VNET.IBM.COM
Message-ID: <19930326.103751.421@almaden.ibm.com>
Date: 26 Mar 1993 18:25:52 UT
Newsgroups: comp.text.sgml
Subject: Re: &lt; and &gt; ???
Disclaimer: This posting represents the poster's views, not those of IBM
News-Software: UReply 3.1
References: <DSCHIEB.93Mar26100352@muse.cv.nrao.edu>
            <19930326.075059.448@almaden.ibm.com>
Lines: 44

In <19930326.075059.448@almaden.ibm.com> drmacro@ralvm13.VNET.IBM.COM writes:
>
>Another approach might be to still use a notation for
>C code, but encapsulate the code itself within a CDATA
>marked section (assuming nothing in the code will ever
>look like an end tag open delimiter), e.g.:
>
>  <CodeExample  notation=cplusplus>
>  <![ CDATA [
>   This is c++ code<template>
>  ]]>
>  </CodeExample>
>
>Your application will get the data and be told the notation
>and so it can provide the same sort of functions as
>the NDATA entity method.
>

Wayne Wohler reminded me that the CodeExample element in this
case must have a declared content of CDATA or RCDATA, so
the CDATA marked section would be incorrect.  The correct
example would be:

 <CodeExample notation=cplusplus>
  This is c++ code<template>
  Hope "</" followed by name-start-character is
  not in the data
 </CodeExample>

Where CodeExample is defined thusly:

  <!ELEMENT CodeExample - - CDATA >
  <!ATTLIST CodeExample
            notation  NOTATION (cplusplus) #REQUIRED >

And the notation 'cplusplus' would be defined:

  <!NOTATION Cplusplus SYSTEM "cpp2TeX.filter"
    -- Filter C++ to TeX -->
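
The hazard noted in the corrected example ("</" followed by a name start
character) can be checked mechanically before code is placed in a CDATA
element. A minimal Python sketch, not from the original posting:
`first_cdata_terminator` is a hypothetical name, and the pattern assumes
the reference concrete syntax, where name start characters are the ASCII
letters. It does not account for the empty end-tag "</>" or for other
delimiters that the FEATURES clause may enable.

```python
import re

# "</" followed by a name start character (an ASCII letter, assuming the
# reference concrete syntax) is recognized as an end-tag open delimiter.
_ETAGO_IN_CONTEXT = re.compile(r"</[A-Za-z]")


def first_cdata_terminator(data):
    """Return the offset of the first "</" + letter in data, or None."""
    match = _ETAGO_IN_CONTEXT.search(data)
    return match.start() if match else None


safe = "This is c++ code<template>"
risky = 'puts("</b>");'
print(first_cdata_terminator(safe))   # None
print(first_cdata_terminator(risky))  # 6
```

A result other than None means the data would end the element early and
needs a different encoding, such as the entity references discussed
earlier in this collection.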

Eliot Kimber                      Internet:  drmacro@ralvm13.vnet.ibm.com
Dept E14/B500                     IBMMAIL:   USIB2DK9@IBMMAIL
Network Programs Information Development     Phone: 1-919-254-5160
IBM Corporation
Research Triangle Park, NC 27709
-----------------------------------------------------------------------

Newsgroups: comp.text.sgml
Date: 26 Sep 1993 23:44:10 UT
From: Eliot Kimber <drmacro@vnet.IBM.COM>
Message-ID: <19930926.170123.489@almaden.ibm.com>
Subject: DTD for DTDs

A few weeks back, somebody asked if there was a DTD for DTDs.  There was no
response.  Even though I was doing my best to take a day off from thinking
about SGML, my brain woke me at 6:30 spinning with a desire to write such a
DTD, so I did.  It happens we can use this to provide some intelligent SGML
processing for DTD management and documentation, so it is not time wasted.
I tested the DTD with Author/Editor (which makes a nifty little DTD editor
when you set up the obvious style sheet so it looks more or less like
you're editing a real DTD), but I haven't scrubbed it really hard.

The DTD is designed to allow the creation of arbitrary DTD fragments, not
complete DTDs (e.g., it does not include a DOCTYPE declaration element).  It
is not just a literal translation of the declaration productions but tries
to capture some of the semantics not expressed in the productions alone.  I
tried to account for all the places that comments are allowed, but I may
have missed some or defined some ambiguous content models.  I've used long
element names, so you'll need an SGML declaration that defines long names;
the OSF-BOOK or DOCBOOK declarations should work fine.

<!--===============================================================-->
<!--                                                               -->
<!-- DTD Fragment DTD Version 1.0                                  -->
<!--                                                               -->
<!-- Describes components of SGML document type declarations (DTD) -->
<!-- as defined in ISO 8879.                                       -->
<!--                                                               -->
<!-- Author:  W. Eliot Kimber, drmacro@vnet.ibm.com                -->
<!--                                                               -->
<!-- Date:  26 September 1993                                      -->
<!--                                                               -->
<!--                                                               -->
<!--===============================================================-->
<!-- PublicDTD document type  -->
<?PUBLICID:  +//ISBN 0-933186::IBM//DTD DTD Fragment V1.0//EN>

<!ELEMENT DTDFragment O O ( ElementDecl | TextEntityDecl | InternalParmEntityDecl |
                            ExternalParmEntityDecl | DataEntityDecl |
                            NotationDecl | AttlistDecl | UseMapDecl | ShortRefDecl |
                            CommentDecl | MarkedSection | ParmEntRef)*
>
<!ATTLIST DTDFragment
            ID   ID  #IMPLIED
>

<!--===============================================================-->
<!--                   Common Components                           -->
<!--===============================================================-->
<!ELEMENT CommentDecl - - (Comment+)
>
<!ELEMENT Comment     - - (#PCDATA | GI | ParmEntName | GeneralEntName | Notation |
                           MapName)*
                          -- Restriction:  Cannot contain double dash --
>
<!ELEMENT GI          - - (#PCDATA) -- Restriction:  single name -->
<!ELEMENT GIGroup     - - (GI+) -- GI Group outside of content models -->

<!ELEMENT Notation    - - (#PCDATA) -- Restriction:  single name -->
<!ELEMENT AttName     - - (#PCDATA) -- Restriction:  single name -->
<!ELEMENT AttValue    - - (#PCDATA) -- Restriction:  Cannot mix LIT
                                      and LITA in content --
>
<!ELEMENT AttSpecList - - (AttSpec* | ParmEntRef) >
<!ELEMENT AttSpec     - - (AttName?, AttValue) >

<!ELEMENT (GeneralEntName | ParmEntName)     - - (#PCDATA) -- Restriction:  single name -->
<!-- NOTE:  "%" is not part of parameter entity names            -->

<!ELEMENT ParmEntRef      - - (ParmEntName) >
<!ELEMENT GeneralEntRef   - - (GeneralEntName) >

<!ELEMENT MinimumLit  - - (#PCDATA) -- Restriction:  S and RE characters
                                       normalized --
>
<!ELEMENT NameGroup - - (Name | ParmEntRef)+ >
<!ELEMENT Name          - -  (#PCDATA) -- Restriction:  single SGML Name -->
<!ELEMENT NotationGroup - - (Notation | ParmEntRef)+ >


<!--===============================================================-->
<!--             Element Declaration Components                    -->
<!--===============================================================-->

<!ELEMENT ElementDecl - - ( Comment*, (GI | GIGroup | ParmEntRef), (ContentModel | ParmEntRef),
                            ((Exceptions, Comment*) | (Comment*, Exceptions)
                             | Comment)?, Comment*)
>
<!ATTLIST ElementDecl
              ID          ID  #IMPLIED
              StartOmit   (sreq | somit) sreq
              EndOmit     (ereq | eomit) ereq
>
<!ELEMENT ContentModel - O (GIToken | ContentModel | PCData | ParmEntRef)+ >
<!ATTLIST ContentModel
              Any        (any) #CONREF
              Connector  (seq | or | and)  #REQUIRED
              Occurrence (once | zeroormore | oneormore) once
>
<!ELEMENT GIToken  - - (GI | ParmEntRef) >
<!ATTLIST GIToken
              Occurrence (once | zeroormore | oneormore) once
>
<!ELEMENT PCData  - O EMPTY>
<!ATTLIST PCData
              Occurrence NAME #FIXED zeroormore
>
<!ELEMENT Exceptions - - ( (Exclusions, (Comment*, Inclusions)?) | Inclusions) >
<!ELEMENT (Exclusions | Inclusions) - - (GIGroup | ParmEntRef) >

<!--===============================================================-->
<!--             Attribute List Declaration Components             -->
<!--===============================================================-->

<!ELEMENT AttlistDecl - - ((GI | GIGroup | Notation | NotationGroup | ParmEntRef),
                           AttDefinitions+) >
<!ELEMENT AttDefinitions - - (NormalAtt | FixedAtt | ParmEntRef | Comment)+ >
<!ELEMENT NormalAtt   - - (AttName, DeclValue, Default) >
<!ELEMENT DeclValue   - O (NameGroup | NotationGroup) >
<!ATTLIST DeclValue
             DataType (CDATA | ENTITY | ENTITIES | ID | IDREF | IDREFS |
                       NAME | NAMES | NMTOKEN | NMTOKENS | NUMBER |
                       NUMBERS | NUTOKEN | NUTOKENS) #CONREF
>
<!ELEMENT Default     - O (AttValue) >
<!ATTLIST Default
          DefaultBehavior  (REQUIRED | CURRENT | CONREF | IMPLIED) #CONREF
>
<!ELEMENT FixedAtt   - - (AttName, DeclValue, Comment*, AttValue) >

<!--===============================================================-->
<!--                 Entity Declaration Components                 -->
<!--===============================================================-->

<!ELEMENT TextEntityDecl  - - (Comment*, GeneralEntName, Comment*,
                               (ReplacementText | ExternalLocation),
                                Comment*)
>
<!ELEMENT InternalParmEntityDecl
                          - - (Comment*, ParmEntName, Comment*,
                               ParmReplacementText, Comment*)
>
<!ELEMENT ExternalParmEntityDecl
                          - - (Comment*, ParmEntName, Comment*,
                               ExternalLocation, Comment*)
>
<!ELEMENT ParmReplacementText - - (GI | GIToken | GIGroup | Notation | NotationGroup | ParmEntRef |
                                   ContentModel | AttDefinitions | FixedAtt) >

<!ELEMENT ReplacementText - - (#PCDATA) >
<!ATTLIST ReplacementText
             TextType  (CDATA | SDATA | PI | STARTTAG | ENDTAG | MS | MD |
                        SIMPLE) simple
>
<!ELEMENT ExternalLocation  - - (PublicID?, Comment*, SystemID?) >
<!ELEMENT PublicID          - - (MinimumLit) >
<!ELEMENT SystemID          - - (#PCDATA) >

<!ELEMENT DataEntityDecl  - - (Comment*, GeneralEntName, Comment*,
                               (ExternalLocation), Notation,
                                Comment*, AttSpecList?)
>
<!ATTLIST DataEntityDecl
          Type  (CDATA | SDATA | NDATA | SUBDOC) NDATA
>
<!--===============================================================-->
<!--                 Notation Declaration Components               -->
<!--===============================================================-->

<!ELEMENT NotationDecl  - - (Comment*, Notation, Comment*,
                               ExternalLocation, Comment*)
>

<!--===============================================================-->
<!--                 Shortref Declaration Components               -->
<!--===============================================================-->

<!ELEMENT ShortrefDecl  - - (Comment*, MapName, Comment*,
                               ShortrefDelimiter, Comment*, GeneralEntName, Comment*)
>
<!ELEMENT (MapName | ShortrefDelimiter)       - - (#PCDATA)>

<!ELEMENT UseMapDecl    - - (Comment*, (MapName | Empty), Comment*,
                               (GI | GIGroup), Comment*)
>
<!ELEMENT Empty         - O EMPTY >

<!--===============================================================-->
<!--              Marked Section Declaration Components            -->
<!--===============================================================-->

<!ELEMENT MarkedSection - - ((MSKeyword | ParmEntRef), MSBody)
>
<!ELEMENT MSBody        - - ANY
>
<!ELEMENT MSKeyword     - O EMPTY
>
<!ATTLIST MSKeyword
            Value  (INCLUDE | IGNORE | CDATA | RCDATA | TEMP) #REQUIRED
>

<!--===============================================================-->
<!--                     End of DTD Fragment                       -->
<!--===============================================================-->
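
To illustrate how a declaration might be expressed against this DTD, here
is a minimal Python sketch that is not part of the original posting. The
helper `element_decl_instance` is hypothetical; it emits instance markup
for a declaration such as `<!ELEMENT para - - (title, body+)>`, covering
only a flat group of element names, with full tags and quoted attribute
values. The output has not been validated with an SGML parser.

```python
def element_decl_instance(gi, connector, tokens):
    """Render an ElementDecl instance for a flat content model.

    tokens is a list of (name, occurrence) pairs, where occurrence is
    one of "once", "zeroormore", or "oneormore" (the values allowed by
    the Occurrence attribute in the DTD above).
    """
    parts = [
        '<ElementDecl StartOmit="sreq" EndOmit="ereq">',
        "<GI>%s</GI>" % gi,
        '<ContentModel Connector="%s">' % connector,
    ]
    for name, occurrence in tokens:
        parts.append(
            '<GIToken Occurrence="%s"><GI>%s</GI></GIToken>'
            % (occurrence, name)
        )
    parts.append("</ContentModel>")
    parts.append("</ElementDecl>")
    return "\n".join(parts)


print(element_decl_instance("para", "seq",
                            [("title", "once"), ("body", "oneormore")]))
```

Nested groups, exceptions, and comments are left out to keep the sketch
short; they would need further elements from the DTD.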

--
Eliot Kimber                      Internet:  drmacro@vnet.ibm.com
Dept E14/B500                     IBMMAIL:   USIB2DK9@IBMMAIL
Network Programs Information Development     Phone: 1-919-254-5160
IBM Corporation
Research Triangle Park, NC 27709