CDATA in attributes and content

Title: Re: CDATA in attributes and content
Author: jenglish@crl2.crl.com (Joe English)
Date: 5 Sep 1997 15:41:02 -0700
Andrew M Greene <agreene@bitstream.com> wrote:
>
>The term "CDATA" seems to be used for two different things,

Actually, five, possibly even more.

>and I'm a little confused.

You're not alone :-)  See http://www.art.com/~joe/sgml/cdata.html,
which attempts to explain all of this.

>Consider the following document instance:
>[...]
>  <!ATTLIST test
>     attribute CDATA #IMPLIED
>  >
>[...]
>  <!ELEMENT cdata - - CDATA>
>[...]
>
>My understanding was that CDATA is never interpreted at all;

No: when a start-tag for an element with CDATA declared content
is encountered, the parser switches to a delimiter recognition mode
in which no markup is recognized except for the closing delimiter.
That is, CDATA declared content is not *entirely* uninterpreted;
the parser still has to scan for an ETAGO delimiter-in-context
("</" followed by a name start character).

(BTW, you should never, ever, ever use elements with CDATA
declared content if you're writing a DTD.)

>The SGML standard says (page numbers refer to the SGML Handbook):
>
>[page 409, in the section on declared content]
>  RCDATA means that the content is replaceable character data ([46] 343:1)
>  CDATA means that the content is character data ([47] 344:1)

This applies to element declared content.

>[page 423, in the section on attribute declared values]
>  CDATA means the attribute value is character data ([47] 344:1)

This applies to attribute declared values.  Here "CDATA" means
something else, as you have noted.

Attributes with declared value "CDATA" are not *further* interpreted
by the parser.  That is, the attribute value is taken as is, and
is not tokenized, folded to uppercase, or validated with respect
to any additional semantic constraints (as is the case with, say,
"ID", "IDREFS", or "ENTITIES" attribute declared values).

However, an _attribute value literal_ -- text inside quotes in
a start-tag -- *is* interpreted by the parser; specifically,
it is parsed as replaceable character data.
This means that general entity references (&foo;) and character
references (&#nnn;) are recognized and replaced.  In addition, REs
and SEPCHARs are turned into spaces.  See 7.9.3 "Attribute Value
Specification", pp. 330-332, productions [33] and [34].

Attribute specifications are all initially parsed the same way,
regardless of the attribute's declared value.  The productions in
7.9.4 "Attribute Value" on pp. 332-334 apply _after_ the attribute
specification has been parsed; these rules determine how the
attribute specification is interpreted as an attribute value.
(Basically, attribute values get parsed twice: [33] and [34] apply
to the first parse, and [35] through [43] apply to the second parse.)

Hope this helps...

--Joe English

  jenglish@crl.com
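
The two parses of an attribute value described above can be sketched
roughly as follows.  This is only an illustration, not how any real
SGML parser is written: the names ENTITIES, parse_literal and interpret
are made up for this example, the entity table is hypothetical (a real
parser takes it from the DTD), and the sketch assumes the reference
concrete syntax (RE = carriage return, RS = line feed, SEPCHAR = tab).
It also assumes, as I read 7.9.3, that RS characters are simply dropped.

```python
import re

# Hypothetical entity declarations; a real parser reads these from the DTD.
ENTITIES = {"foo": "bar"}

# Function characters, assuming the reference concrete syntax.
RS, RE, SEPCHAR = "\n", "\r", "\t"

# A character reference (&#nnn;) or a general entity reference (&name;).
_REFERENCE = re.compile(r"&#(\d+);|&([A-Za-z][A-Za-z0-9.-]*);")


def parse_literal(literal, entities=ENTITIES):
    """First parse: read the quoted text as replaceable character data.

    Simplification: whitespace is normalized before references are
    replaced, and replacement text is not itself re-parsed.
    """
    text = literal.replace(RS, "")                      # RS is ignored
    text = text.replace(RE, " ").replace(SEPCHAR, " ")  # RE, SEPCHAR -> space

    def replace(match):
        number, name = match.groups()
        if number is not None:
            return chr(int(number))
        # Unknown entities are left untouched here; a real parser
        # would report an error instead.
        return entities.get(name, match.group(0))

    return _REFERENCE.sub(replace, text)


def interpret(value, declared_value):
    """Second parse: apply the attribute's declared value."""
    if declared_value == "CDATA":
        # Taken as is: not tokenized, not case-folded, not validated.
        return value
    # Stand-in for name-token types such as ID or IDREFS: split into
    # tokens and fold to uppercase.  Validation is not modeled.
    return " ".join(value.split()).upper()
```

With these definitions, parse_literal("R&#38;D &foo;") gives "R&D bar"
whatever the declared value is; only the second step differs, so
interpret("  sec1   sec2 ", "CDATA") returns the string unchanged while
interpret("  sec1   sec2 ", "IDREFS") returns "SEC1 SEC2".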

Note the collection of commentary materials on CDATA ("CDATA [and RCDATA] as Declared Content") in the Grammar section of the SGML Web Page.
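
As a rough illustration of the declared-content rule discussed in the
post: inside an element with CDATA declared content, the only thing the
parser looks for is "</" followed by a name start character.  The sketch
below assumes the reference concrete syntax and treats ASCII letters as
the name start characters; split_cdata_content is a made-up name, not
part of any real parser.

```python
import re

# "</" (ETAGO) recognized only when a name start character follows it.
_ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def split_cdata_content(text):
    """Split text that follows the start-tag of a CDATA element.

    Returns (content, rest): content is everything up to the first
    "</" followed by a letter, and rest begins at that "</".  Nothing
    else is recognized, so "<i>" and "&amp;" stay literal data.
    """
    match = _ETAGO_IN_CONTEXT.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]
```

Note that the content ends at the first such "</", whether or not the
name after it matches the open element.  For instance,
split_cdata_content('write("</b>")</cdata>') yields 'write("' as the
content, which may be one reason to avoid CDATA declared content.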