XML standalone document declaration

On May 08, 1998, Paul Prescod raised some questions about the design of the "standalone document declaration" in XML 1.0. His original posting and some followup messages (Tim Bray, David Megginson, James Clark) from XML-DEV are collected in this document. Last updated: May 09, 1998.


From owner-xml-dev@ic.ac.uk Fri May  8 08:28:49 1998
Date: Fri, 08 May 1998 09:18:22 -0400
From: Paul Prescod <papresco@technologist.com>
Subject: SDD bogus

Is the standalone document declaration bogus and perhaps dangerous? The
whole feature strikes me as over-complicated and over-specific for a
language like XML, but I'm aware of the historical processes that gave
rise to it.

My understanding of a typical usage scenario goes like this: a sender
creates a document. It creates it specifically so that it will be
standalone. It validates that this is the case (while it validates
everything else) and then it sends it to the receiver who hopes to consume
it without validating it. Things already strike me as a little bizarre,
because if your protocol is designed such that the consumer trusts the
sender, then couldn't the SDD be implied in your out-of-band agreement?
Further, what do you do if the SDD is other than you expect? Halt the
parse and start again with a validating processor?
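
[**Note: the question assumes the receiver can at least see what the
declaration says. A minimal sketch, assuming Python's xml.parsers.expat
binding (the thread itself names no API for this); the function name
declared_standalone is made up for illustration:

```python
import xml.parsers.expat


def declared_standalone(document):
    """Return True only if the XML declaration says standalone="yes"."""
    seen = {"standalone": -1}  # -1 is what pyexpat reports when the flag is omitted

    def on_xml_decl(version, encoding, standalone):
        # pyexpat passes 1 for "yes", 0 for "no", -1 if the flag is not given
        seen["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    return seen["standalone"] == 1
```

A receiver could call this first and fall back to a validating
processor, or reject the document, when it returns False.
]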

But that's not what I'm concerned about. I'm concerned because I believe
this to be a valid XML document:

<?xml version="1.0" standalone="yes"?> 
<!DOCTYPE MEMO SYSTEM "http://www.sgmlsource.com/memo.dtd" [ 
<!ENTITY % mess-everything-up SYSTEM "mess.ent">
<!ATTLIST MEMO SECURITY CDATA "TOP-SECRET">
]>
<MEMO></MEMO> 

In my opinion, section 5.1 will require the non-validating parser to skip
the attribute list declaration, even if memo.dtd is an empty file.
The receiver has no way of knowing that this case has occurred if it uses a
"standard parser" (since XML's semantics are, for the moment at least,
imprecisely specified, I only know what that means intuitively ... SAX,
Lark, Expat, etc. would not give you enough information to detect this
case).

This to me suggests that applications cannot trust the SDD and it must
therefore be presumed to be meaningless.

But I'm glad to be proven wrong. Despite its reputation to the contrary,
XML is intricate and deep and I may have missed something important.

[**Note on diction:

Paul clarified/qualified the use of "bogus" in a later post:

> Tim Bray wrote:
>> 
>> Having said that, Paul did raise a valid concern about the SDD (too bad
>> this issue wasn't pointed out before the spec was frozen).  
>
> Yes, I want to point out to those who do not know the dynamics here that I
> use the word "bogus" because I was in the SIG and it is as much my fault
> as anyone's that it got through. Were I talking about someone else's spec.
> I would be more tactful.
]

Paul Prescod  - http://itrc.uwaterloo.ca/~papresco

Can we afford to feed that army, 
 while so many children are naked and hungry?
Can we afford to remain passive, 
 while that soldier-army is growing so massive?
  - "Gabby" Barbadian Calypsonian in "Boots"

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)


=======================================================================
From owner-xml-dev@ic.ac.uk Fri May  8 08:39:42 1998
Date: Fri, 08 May 1998 09:28:10 -0400
From: Paul Prescod <papresco@technologist.com>
X-Mailer: Mozilla 4.04 [en] (WinNT; U)
MIME-Version: 1.0
To: xml-dev <xml-dev@ic.ac.uk>
Subject: SDD again


Let me risk another step into the language courtroom. Validating parsers
must always read the whole DTD. So the SDD is only for non-validating
parsers. Non-validating parsers do not read element type declarations. So
what is the point of this line:

"The standalone document declaration must have the value "no" if any
external markup declarations contain declarations of:"
...
"element types with element content, if white space occurs directly within
any instance of those types."

First, why does a non-validating parser care about element/mixed content?
It has no responsibility to do any marking of insignificant whitespace
anyhow. Second, if there is no class of processor that can reliably
reproduce the intended parse tree without reading the whole DTD, then
doesn't that significantly weaken the utility (okay, "purity") of the SDD?
Even if I am wrong on the last point, it seems that it does not do what it
is supposed to do properly.

 Paul Prescod  - http://itrc.uwaterloo.ca/~papresco



=======================================================================

From owner-xml-dev@ic.ac.uk Fri May  8 09:05:31 1998
Date: Fri, 08 May 1998 06:51:30 -0700
To: xml-dev <xml-dev@ic.ac.uk>
From: Tim Bray <tbray@textuality.com>
Subject: Re: SDD bogus


At 09:18 AM 5/8/98 -0400, Paul Prescod wrote:
>But that's not what I'm concerned about. I'm concerned because I believe
>this to be a valid XML document:
>
><?xml version="1.0" standalone="yes"?> 
><!DOCTYPE MEMO SYSTEM "http://www.sgmlsource.com/memo.dtd" [ 
><!ENTITY % mess-everything-up SYSTEM "mess.ent">
><!ATTLIST MEMO SECURITY CDATA "TOP-SECRET">
>]>
><MEMO></MEMO> 
>
>In my opinion, section 5.1 will require the non-validating parser to skip
>the attribute list declaration, even if memo.dtd is an empty file.

Well, it can't be valid if memo.dtd is an empty file, because you
don't have <!ELEMENT MEMO .. > anywhere.  But yes, 5.1 suggests the
attribute default shouldn't be used.

>The receiver has no way of knowing that this case has occurred if it uses a
>"standard parser"

If the sender is stupid enough to send something like this to a 
non-validating parser, he gets what he deserves.  If it's a validating
parser, then of course the emptiness of memo.dtd will be detected.

> (since XML's semantics are, for the moment at least,
>imprecisely specified, I only know what that means intuitively ... SAX,
>Lark, Expat, etc. would not give you enough information to detect this
>case).

Huh?

>This to me suggests that applications cannot trust the SDD and it must
>therefore be presumed to be meaningless.

You do raise a good question; it would seem that standalone='yes'
*ought* to mean that the rule of 5.1 about the effect of 
external PE refs could be ignored.  Hmmmm -Tim



=======================================================================

From owner-xml-dev@ic.ac.uk Fri May  8 09:16:43 1998
Date: Fri, 08 May 1998 09:51:39 -0400
From: David Megginson <ak117@freenet.carleton.ca>
Subject: SDD bogus
In-reply-to: <3553061D.BF3A3461@technologist.com>


Paul Prescod writes:

 > Is the standalone document declaration bogus and perhaps dangerous?

Yes and yes.  

The problem, I think, came from the mistaken idea that people
(i.e. desperate Perl hackers) would write custom parsers for each XML
application (like RDF), and that these people would not want to deal
with seemingly difficult problems like external entity resolution.  

In the end, as one might have predicted, there is an impressive range
of free XML processors available in several different programming
languages: someone writing an RDF tool does not need to worry about
the character and entity level of XML at all, and can work with XML
easily through a more abstract interface such as the DOM or SAX.

So, we should let the authors decide -- if an author creates a
document referencing external entities (including an external DTD
subset), then the XML parser should handle them; if the author does
not want to use external entities, then she can simply avoid
referencing any.

As many XML parser writers have shown, resolving external entities is
one of the easiest parts of XML (especially in higher-level languages
like Java or Perl, and, I presume, Python).  Allowing parsers to skip
external entities -- rather than simplifying XML -- ended up making it
much more complicated, and as you point out, the standalone declaration
really doesn't help things.


All the best,


David

-- 
David Megginson                 ak117@freenet.carleton.ca
Microstar Software Ltd.         dmeggins@microstar.com
      http://home.sprynet.com/sprynet/dmeggins/



=======================================================================

From owner-xml-dev@ic.ac.uk Fri May  8 10:17:44 1998
Date: Fri, 08 May 1998 11:00:17 -0400
From: David Megginson <ak117@freenet.carleton.ca>
Subject: SDD again
In-reply-to: <3553086A.3494D7F9@technologist.com>


Paul Prescod writes:

 > Let me risk another step into the language courtroom. Validating
 > parsers must always read the whole DTD. So the SDD is only for
 > non-validating parsers. Non-validating parsers do not read element
 > type declarations. So what is the point of this line:

Your first premise is correct, but your second one is not.  The spec
states that a validating parser must use the whole DTD; it does not
state that a non-validating parser may not use the DTD.  AElfred, for
example, reads the DTD well enough that it can even flag ignorable
whitespace based on an element type's content model, but it is
non-validating.
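
[**Note: the DTD-aware but non-validating behaviour described here can
be sketched with Python's xml.parsers.expat binding. This is an
illustration only, not AElfred's code. The memo document type, the
sample text and the function name split_whitespace are all invented
for the example. The parser reads the element type declarations,
remembers which types have element content, and sorts character data
accordingly without validating anything:

```python
import xml.parsers.expat

# Hypothetical document; the declarations sit in the internal subset.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
<!ELEMENT memo (to, body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<memo>
  <to>XML-DEV</to>
  <body> keep this </body>
</memo>"""


def split_whitespace(document):
    """Return (ignorable, significant) character data as two strings."""
    element_content = set()   # element types declared with element content
    stack = []                # names of the currently open elements
    ignorable, significant = [], []
    expat = xml.parsers.expat

    def on_element_decl(name, content_model):
        # content_model[0] is the model type; SEQ and CHOICE mean element content
        if content_model[0] in (expat.model.XML_CTYPE_SEQ,
                                expat.model.XML_CTYPE_CHOICE):
            element_content.add(name)

    def on_chars(data):
        in_element_content = bool(stack) and stack[-1] in element_content
        if in_element_content and data.strip(" \t\r\n") == "":
            ignorable.append(data)
        else:
            significant.append(data)

    parser = expat.ParserCreate()
    parser.buffer_text = True
    parser.ElementDeclHandler = on_element_decl
    parser.StartElementHandler = lambda name, attrs: stack.append(name)
    parser.EndElementHandler = lambda name: stack.pop()
    parser.CharacterDataHandler = on_chars
    parser.Parse(document, True)
    return "".join(ignorable), "".join(significant)
```

The line breaks and indentation directly inside memo come back as
ignorable, while the spaces around "keep this" stay significant
because body has mixed content.
]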

That said, I still agree that the standalone declaration is wrong.
Perhaps some day, if there's an XML 1.1, we can think about fixing it.


All the best,


David

-- 
David Megginson                 ak117@freenet.carleton.ca
Microstar Software Ltd.         dmeggins@microstar.com
      http://home.sprynet.com/sprynet/dmeggins/


=======================================================================


From owner-xml-dev@ic.ac.uk Fri May  8 10:31:07 1998
Date: Fri, 08 May 1998 08:17:12 -0700
From: Tim Bray <tbray@textuality.com>
Subject: Re: SDD bogus


At 09:51 AM 5/8/98 -0400, David Megginson wrote:
>As many XML parser writers have shown, resolving external entities is
>one of the easiest parts of XML 

Yes, but as is well-documented, difficulty is *not* the reason we made
their processing optional by non-validating processors.  The prime mover
behind this decision was a passionate presentation from Jean Paoli 
explaining that the auto-include semantic of parsed entities is just
*wrong* for web browsers.  I've attached an explanation of why at the
end of this message, but if you want to see it in context, go to section
4.4.3 of the annotated spec and click on the "H".

Having said that, Paul did raise a valid concern about the SDD (too bad
this issue wasn't pointed out before the spec was frozen).  Having said
*that*, I think, for reasons that are on the record in the same place,
that the problem the SDD exists to solve will essentially never
arise in real operational scenarios anyhow. -Tim

=================

From the annotated spec at http://xml.com/axml/axml.html

Why Are External Entities Included Optionally?

In discussion of external entities, we realized that the semantics of
external text entities (compulsory inclusion at the point where they are
encountered) are deeply incompatible with the desired behavior of Web
browsers. Consider the following example of the beginning of an XML
document:

<?xml version='1.0'?>
<!DOCTYPE doc 
[ <!ENTITY MSA SYSTEM "http://www.microsoft.com/press/311.xml">
  <!ENTITY NSA SYSTEM "http://home.netscape.com/PR/x27.xml">
]>
<doc>Netscape today
announced that &NSA;. In
response, Microsoft
issued the following
statement: &MSA;.
... 

A Web browser is typically making an aggressive effort to display text to
the user as soon as possible, in parallel with fetching it from the
network. In the example above, if a browser were required to fetch and
process all external entities, it could only display the first four words
before starting another network fetch operation. To make things worse,
bear in mind that the replacement text for the entity NSA could well
include other external entities which in turn would need to be fetched.

This type of situation is unacceptable. Hence the rule that non-validating
parsers need not fetch external entities if they don't want to.
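
[**Note: a minimal sketch of the two behaviours contrasted above,
assuming Python's xml.parsers.expat binding. Nothing is fetched: the
FAKE_WEB table stands in for the network, its replacement texts are
placeholders invented for the example, and the document has been
closed with </doc> so that it parses. The function name render is
likewise made up.

```python
import xml.parsers.expat

DOC = """<?xml version='1.0'?>
<!DOCTYPE doc
[ <!ENTITY MSA SYSTEM "http://www.microsoft.com/press/311.xml">
  <!ENTITY NSA SYSTEM "http://home.netscape.com/PR/x27.xml">
]>
<doc>Netscape today announced that &NSA;. In response, Microsoft issued the following statement: &MSA;.</doc>"""

# Placeholder replacement texts keyed by system identifier (assumed content).
FAKE_WEB = {
    "http://home.netscape.com/PR/x27.xml": "[text of x27.xml]",
    "http://www.microsoft.com/press/311.xml": "[text of 311.xml]",
}


def render(document, include_external):
    """Return the character data an application would see."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append

    def on_external_entity(context, base, system_id, public_id):
        # Parse the replacement text in a child parser; it inherits the handlers.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(FAKE_WEB[system_id], True)
        return 1  # nonzero tells Expat the entity was handled

    if include_external:
        parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return "".join(chunks)
```

With include_external=False the references are passed over and the
surrounding text is available at once; with True each reference
blocks on a lookup before the text after it can be delivered.
]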


=======================================================================

From owner-xml-dev@ic.ac.uk Sat May  9 00:00:25 1998
Date: Sat, 09 May 1998 11:36:31 +0700
From: James Clark <jjc@jclark.com>
Subject: Re: SDD bogus
Sender: owner-xml-dev@ic.ac.uk
To: Paul Prescod <papresco@technologist.com>
Cc: xml-dev <xml-dev@ic.ac.uk>
Reply-to: James Clark <jjc@jclark.com>


Paul Prescod wrote:

> I'm concerned because I believe
> this to be a valid XML document:
> 
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE MEMO SYSTEM "http://www.sgmlsource.com/memo.dtd" [
> <!ENTITY % mess-everything-up SYSTEM "mess.ent">
> <!ATTLIST MEMO SECURITY CDATA "TOP-SECRET">
> ]>
> <MEMO></MEMO>
> 
> In my opinion, section 5.1 will require the non-validating parser to skip
> the attribute list declaration, even if memo.dtd is an empty file.

This is a very good point.

Your example isn't quite right: the entity must be referenced.  Also a
non-validating parser only has to skip the ATTLIST declaration if it
skips the entity reference.  Apart from this, your interpretation of 5.1
is the obvious one.  Expat behaves consistently with this.
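
[**Note: Python's xml.parsers.expat module is a binding to Expat, so
the example can be tried with and without the parameter-entity
reference. This is a sketch under assumptions: the function name
memo_attributes is invented, the parser is left at its defaults (no
external entities are read), and a current Expat build need not
behave as the 1998 version did. For that reason the sketch only
reports what the parser delivers for the referenced case rather than
claiming a fixed result.

```python
import xml.parsers.expat

TEMPLATE = """<?xml version="1.0"{sdd}?>
<!DOCTYPE MEMO SYSTEM "http://www.sgmlsource.com/memo.dtd" [
<!ENTITY % mess-everything-up SYSTEM "mess.ent">
{ref}
<!ATTLIST MEMO SECURITY CDATA "TOP-SECRET">
]>
<MEMO></MEMO>"""


def memo_attributes(standalone, reference_entity):
    """Return the attributes reported for MEMO by a default Expat parser."""
    document = TEMPLATE.format(
        sdd=' standalone="yes"' if standalone else "",
        ref="%mess-everything-up;" if reference_entity else "",
    )
    reported = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: reported.append(dict(attrs))
    parser.Parse(document, True)
    return reported[0]


if __name__ == "__main__":
    for standalone in (True, False):
        for reference_entity in (False, True):
            print(standalone, reference_entity,
                  memo_attributes(standalone, reference_entity))
```

Without the reference the default SECURITY="TOP-SECRET" is reported.
With it, an empty result means the ATTLIST declaration was skipped,
and the application receives no other signal that this happened.
]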

I think this is a serious problem, because it breaks the principle that
if you declare your document as standalone=yes and validate it, then you
will get the same result when you parse it with any non-validating
parser (which to me is the point of the SDD).

I think a bit of creative interpretation would be in order. Section 5.1
says:

 [Non-validating processors] must not process entity declarations or
 attribute-list declarations encountered after a reference to a
 parameter entity that is not read, since the entity may have
 contained overriding declarations.

The "since" clause is false when standalone=yes, so I think this can
fairly be said to be an inconsistency in the spec (rather than simply a
poor design choice), which should be resolved by not applying this
requirement when standalone=yes.

The other way to fix this would be to tweak the definition of standalone
to say that declarations after the first reference to an external
parameter entity count as external for the purposes of determining
whether the document is standalone.
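
[**Note: the two proposed fixes can be compared with a small model.
This is a sketch, not parser code: the internal subset is reduced to
a list of (kind, text) pairs, every parameter-entity reference is
assumed to point at an external entity, and all names here
(processed_declarations, counted_as_external, exempt_standalone) are
invented for the illustration.

```python
def processed_declarations(subset, standalone, reads_external_pes,
                           exempt_standalone=False):
    """Declarations a non-validating processor acts on under section 5.1.

    exempt_standalone=True models the first fix: the skipping rule is
    not applied when the document declares standalone="yes".
    """
    processed = []
    skipping = False
    for kind, text in subset:
        if kind == "pe_ref":
            unread = not reads_external_pes
            if unread and not (exempt_standalone and standalone):
                skipping = True
            continue
        if skipping and kind in ("entity", "attlist"):
            continue  # may have been overridden by the unread entity
        processed.append((kind, text))
    return processed


def counted_as_external(subset):
    """Second fix: declarations after the first PE reference count as external."""
    for index, (kind, _) in enumerate(subset):
        if kind == "pe_ref":
            return [item for item in subset[index + 1:] if item[0] != "pe_ref"]
    return []


MEMO_SUBSET = [
    ("entity", '% mess-everything-up SYSTEM "mess.ent"'),
    ("pe_ref", "%mess-everything-up;"),
    ("attlist", 'MEMO SECURITY CDATA "TOP-SECRET"'),
]
```

Under the rule as written the ATTLIST is dropped even though the
document says standalone="yes". The first fix keeps it. The second
fix instead reports it as external, so a document that relies on
that default could no longer be declared standalone.
]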

This clearly needs to be fixed one way or the other.

> Despite its reputation to the contrary,
> XML is intricate and deep and I may have missed something important.

Yes. Entities and the SDD are both tricky: the interaction of the two is
particularly so.

James



=======================================================================