CDATA in attributes and content

Title: Re: CDATA in attributes and content

Author: jenglish@crl2.crl.com (Joe English)

Date: 5 Sep 1997 15:41:02 -0700


Andrew M Greene  <agreene@bitstream.com> wrote:
>
>The term "CDATA" seems to be used for two different things,
|                                      ^^^

Actually, five, possibly even more.

>and I'm a little confused.

You're not alone :-)  See http://www.art.com/~joe/sgml/cdata.html,
which attempts to explain all of this.


>Consider the following document instance:
>[...]
>     <!ATTLIST test
>        attribute CDATA #IMPLIED
>     >
>[...]
>     <!ELEMENT cdata  - - CDATA>
>[...]
>
>My understanding was that CDATA is never interpreted at all;

No: when a start-tag for an element with CDATA declared content
is encountered, the parser switches to a delimiter recognition
mode in which no markup is recognized except for the closing
delimiter.  That is, CDATA declared content is not *entirely*
uninterpreted; the parser still has to scan for an ETAGO
delimiter-in-context ("</" followed by a name start character).
(BTW, you should never, ever, ever use elements with CDATA declared
content if you're writing a DTD.)
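
The scanning rule above can be sketched in a few lines of Python.
This is only an illustration, not how any particular parser does it:
the name split_cdata_content is made up, name start characters are
assumed to be the ASCII letters (as in the reference concrete syntax),
and the null end-tag (NET) delimiter is ignored.

```python
import re

# "</" counts as a delimiter only when a name start character follows.
# Assumption: name start characters are the ASCII letters.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata_content(text):
    """Split the text that follows the start-tag of a CDATA element.

    Returns (content, rest); rest begins at the "</" that ended the
    content, or is empty if no such delimiter was found.
    """
    m = ETAGO_IN_CONTEXT.search(text)
    if m is None:
        return text, ""
    return text[:m.start()], text[m.start():]


content, rest = split_cdata_content("a &amp; <b> </ 1 </cdata> tail")
print(repr(content))  # entity references and start-tags are left alone
print(repr(rest))
```

Note that any "</" followed by a letter ends the content, whatever
name comes after it, so "x </b> y" is cut off after "x ".  A "</"
followed by a space or a digit is just data.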

>The SGML standard says (page numbers refer to the SGML Handbook):
>
>[page 409, in the section on declared content]
>  RCDATA means that the content is replaceable character data ([46] 343:1)
>  CDATA means that the content is character data ([47] 344:1)

This applies to element declared content.

>[page 423, in the section on attribute declared values]
>  CDATA means the attribute value is character data ([47] 344:1)

This applies to attribute declared values.  Here "CDATA" means
something else, as you have noted.

Attributes with declared value "CDATA" are not *further* interpreted
by the parser.  That is, the attribute value is taken as is, and is not
tokenized, folded to uppercase, or validated with respect to any additional
semantic constraints (as is the case with, say, "ID", "IDREFS",
or "ENTITIES" attribute declared values).

However, an _attribute value literal_ -- text inside quotes in a
start-tag -- *is* interpreted by the parser; specifically, it is parsed
as replaceable character data.  This means that general entity references 
(&foo;) and character references (&#nnn;) are recognized and replaced.
In addition, REs and SEPCHARs are turned into spaces.  See 7.9.3 "Attribute 
Value Specification", pp. 330-332, productions [33] and [34].
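
As a rough sketch of that first step, assuming the reference concrete
syntax (RE = 13, RS = 10, TAB as the only SEPCHAR): the entity table
and the name interpret_literal below are hypothetical, the sketch does
not rescan entity replacement text, and it requires the closing ";" on
every reference.  Dropping RS is my reading of 7.9.3 and is not stated
in the text above.

```python
import re

# Hypothetical entity declarations; a real parser takes these from the DTD.
ENTITIES = {"amp": "&", "foo": "bar"}

# One pass over the literal: character reference, general entity
# reference, RE or SEPCHAR (here: CR or TAB), or RS (here: LF).
TOKEN = re.compile(r"&#(\d+);|&([A-Za-z][A-Za-z0-9.-]*);|[\r\t]|\n")


def interpret_literal(literal, entities=ENTITIES):
    """Interpret the text between the quotes of an attribute value literal."""
    def repl(m):
        if m.group(1) is not None:      # &#nnn;
            return chr(int(m.group(1)))
        if m.group(2) is not None:      # &name; (KeyError if undeclared)
            return entities[m.group(2)]
        if m.group(0) == "\n":          # RS is dropped (assumption, see above)
            return ""
        return " "                      # RE and SEPCHAR become a space
    return TOKEN.sub(repl, literal)


print(repr(interpret_literal("a\r\n&foo;\t&#65;")))
```

This step is the same whether the attribute is declared CDATA, ID,
NAMES or anything else.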

Attribute specifications are all initially parsed the same way, regardless
of the attribute's declared value.  The productions in 7.9.4 "Attribute
Value" on pp. 332-334 apply _after_ the attribute specification has
been parsed; these rules determine how the attribute specification
is interpreted as an attribute value.  (Basically, attribute values
get parsed twice; [33] and [34] apply to the first parse, and [35]
through [43] apply to the second parse.)
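
The second step can be sketched the same way.  The function below is
hypothetical and covers only a handful of declared values; it assumes
its input has already been through the first step (references
replaced, RE and SEPCHAR turned into spaces) and that general names
are folded to uppercase, as in the reference concrete syntax.  It does
not check that tokens are valid names.

```python
# Declared values handled by this sketch (a deliberately partial list).
LIST_TYPES = {"IDREFS", "NAMES", "NMTOKENS"}
SINGLE_TYPES = {"ID", "IDREF", "NAME", "NMTOKEN"}


def attribute_value(interpreted, declared_value):
    """Turn an already-interpreted literal into an attribute value."""
    if declared_value == "CDATA":
        # CDATA: taken as is, no tokenizing, no case folding.
        return interpreted
    # Everything else here is tokenized on spaces and folded to uppercase.
    tokens = [t.upper() for t in interpreted.split(" ") if t]
    if declared_value in LIST_TYPES:
        return tokens
    if declared_value in SINGLE_TYPES:
        if len(tokens) != 1:
            raise ValueError("expected exactly one token")
        return tokens[0]
    raise NotImplementedError(declared_value)


print(repr(attribute_value("  Foo  bar ", "CDATA")))
print(repr(attribute_value("  Foo  bar ", "IDREFS")))
```

The same interpreted text gives different attribute values depending
on the declared value; only the CDATA case keeps the spacing and case
of the original.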


Hope this helps...

--Joe English

  jenglish@crl.com
Note the collection of commentary materials on CDATA ("CDATA [and RCDATA] as Declared Content") in the Grammar section of the SGML Web Page.