Draft of Position on SOAP's use of XML Internal Subset
From: firstname.lastname@example.org
To: email@example.com [David Fallside]
Cc: firstname.lastname@example.org
Date: Fri, 6 Dec 2002 15:29:50 -0500
Subject: Re: Draft of position on SOAP's use of XML Internal subset
December 06, 2002 note: I promised David a final draft by COB today. Here it is. Unless there are problems, I expect David will send this out officially on our behalf later today.
December 02, 2002 note: A few days ago I was asked to draft a note that would explain to the TAG and other concerned members of the W3C community some of the reasons behind SOAP's restrictions on the use of XML. A first draft was discussed on our telcon today, and I was asked to reflect a few additional points. This is a second draft, attempting to incorporate those points, along with a few other changes that I hope will be viewed as editorial or clarifying. My understanding is that email review is open until Friday, at which point David will send it on behalf of the WG. Per today's call, all substantive requests for change should be in the form of proposed revised text. Thanks.
==========Start of Draft============
The XML Protocols Workgroup appreciates this opportunity to clarify our design decisions regarding use of XML features such as the Internal Subset (for those not familiar with the term, "Internal Subset" is the official term for a DTD that appears within an XML document).
Before discussing our (lack of) use of DTDs, it's helpful to briefly clarify what SOAP is, as well as to review some use cases that influenced our decision making. The following are not necessarily official use cases and requirements, but they are representative of the concerns that many implementers considered important:
Informally, SOAP is a specification that describes certain aspects of the creation, transmission, and processing of messages. SOAP messages originate at a node called the "initial sender", flow along a message path through zero or more "intermediary" nodes, eventually reaching (in the absence of errors) an "ultimate receiver". SOAP sets out the rules for initial construction of a message, rules by which messages are processed when received at an intermediary or ultimate destination, and rules by which portions of the message can be inserted, deleted or modified by the actions of an intermediary. Thus, SOAP deals not just with the messages transiting a given hop, but with the manipulation of those messages as they go through successive intermediaries.
SOAP is a framework that's intended to be useable for a broad range of applications, on a variety of devices, and in a broad range of performance regimes. Among the goals is for SOAP to be useable as a replacement for certain high performance binary protocols such as EDI, at least in certain applications. Accordingly, the ability to run in a performance regime of hundreds or thousands of messages per second per node is highly desirable.
SOAP is designed to be hostable on a variety of so-called underlying protocols. A binding to HTTP is provided and we expect that it will be widely deployed, but the specification provides the mechanisms necessary for users (or the W3C) to create bindings to other protocols, or to create alternative bindings to HTTP.
Message Infosets and Bindings
SOAP messages are specified as XML Infosets -- see  (Note, references are to a snapshot of the latest editors' copies of the SOAP specs, reflecting some resolutions to last call issues. I believe the version I am referencing is the latest that is stable in W3C URI "date space". It is later than our last official WD.) The initial sender prepares a SOAP message in the form of what the Infoset Recommendation calls a "synthetic infoset". In other words, the initial sender typically does not have a document to parse to produce the infoset; rather, the initial sender establishes, using programming structures of its choosing (for example, something like DOM or SAX), the elements, attributes and other content of the outgoing message.
The purpose of a binding, such as the HTTP binding, is to provide a means for moving the message Infoset from one node to the next. The way in which the message is represented on the wire is completely at the discretion of the binding, and is not otherwise visible in the architecture. The HTTP binding supplied with SOAP uses an XML 1.0 serialization of the Infoset. It sends that serialization in an HTTP POST or RESPONSE, typically as MIME type application/soap+xml.
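To make the split between the Infoset and its wire form concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It builds a "synthetic" envelope in memory and then produces an XML 1.0 serialization of the kind an HTTP binding might put in a request body. The envelope namespace shown is the one from the later SOAP 1.2 Recommendation and is an assumption here, as are the urn:example:app payload namespace and the function names; none of them come from this note.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the final SOAP 1.2 Recommendation, not of the
# editors' drafts this note refers to.
ENV_NS = "http://www.w3.org/2003/05/soap-envelope"
# Hypothetical application namespace, for illustration only.
APP_NS = "urn:example:app"


def build_envelope(payload_text):
    """Build a synthetic envelope in memory; no document is parsed."""
    ET.register_namespace("env", ENV_NS)
    envelope = ET.Element(f"{{{ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{ENV_NS}}}Body")
    payload = ET.SubElement(body, f"{{{APP_NS}}}echo")
    payload.text = payload_text
    return envelope


def serialize_for_wire(envelope):
    """Produce an XML 1.0 serialization as bytes, with no DOCTYPE."""
    return ET.tostring(envelope, encoding="utf-8")


wire_bytes = serialize_for_wire(build_envelope("a < b"))
```

Only the predefined XML escapes (such as &lt; for the "<" in the payload) appear in the output. A binding between nodes that share memory could skip serialize_for_wire entirely and hand the envelope object across.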
Note that, because SOAP is Infoset based, in a situation where two nodes share a memory (run on the same processor or tightly coupled MP), it is perfectly sensible to build a binding that does its work by just passing around DOMs, SAX streams, or other in-memory representations of the Infoset. In these cases, no serialization or parsing need ever be done. Also: implementations can in principle use compressed or encrypted forms, possibly by compressing or encrypting the <...> serialization, but also possibly by using other compressed or encrypted representations. In principle bindings could also be written to send parts of the Infoset out of order, in parallel over multiple links to improve bandwidth on large messages, etc.
Use of DTD Internal Subsets
Thus, we must consider several related issues:
Q. Do DTD internal subsets or other DTD-related features appear in a SOAP message?
A. By definition, they do not. See , which says:
"A SOAP message is specified as an XML Infoset that consists of a document information item with exactly one member in its [children] property, which MUST be the SOAP Envelope element information item (see 5.1 SOAP Envelope). This element information item is also the value of the [document element] property. The [notations] and [unparsed entities] properties are both empty. The [base URI], [character encoding scheme] and [version] properties can have any legal value. The [standalone] property either has a value of "yes" or has no value.
The XML infoset of a SOAP message MUST NOT contain a document type declaration information item."
So, to the extent the Infoset recommendation is capable of reflecting the presence of DTDs, SOAP rules them out. SOAP messages do not contain DTDs. SOAP messages also must not reference external DTDs.
Q. Can DTDs or schema validation be used to supply defaults or otherwise augment or alter the contents of a SOAP message?
A. No, not insofar as such augmentation would change the results of SOAP processing. SOAP makes clear that the values of all elements and attributes pertinent to SOAP itself must be carried explicitly in each message -- neither Schema nor DTD (nor any other) validation can be used to establish defaults for SOAP's attributes, though in certain cases SOAP directly defines what the behavior will be if optional attributes are left out. That said, applications can do whatever they want with data received from SOAP bodies or header entries. If an application chooses to infer information from schema validation of information received in a SOAP message, that is the business of the application.
Q. Can a binding use DTDs in its "on the wire" format?
A. In principle, yes. Somebody could write a binding that, for example, declares entities in an internal subset, perhaps to represent commonly appearing substrings, and could call for their expansion upon receipt. Note, however, that such use of a DTD must be completely private to the binding; upon receipt an Infoset must in all cases be reconstructed to be identical to the one provided for transmission, and by definition that does not contain a DTD (see above).
Q. Does the HTTP binding provided with SOAP use DTDs as described above?
A. No. The SOAP HTTP binding uses the obvious no-DTD serialization of the SOAP message Infoset.
Q. If a DTD is present and the SOAP HTTP binding is used, what does a receiving node do?
A. If an implementation of the SOAP HTTP binding receives a message that contains a DTD, then it knows that it is talking to an erroneous implementation at the sender. It SHOULD send a so-called env:Sender fault.
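The check described in this answer can be sketched with Python's standard xml.parsers.expat module, which reports the start of a document type declaration through its StartDoctypeDeclHandler callback. The function names and the idea of returning the fault code as a plain string are assumptions made for illustration; a real receiver would build a complete SOAP fault.

```python
import xml.parsers.expat


class DoctypeFound(Exception):
    """Raised from the expat callback to stop parsing at the DOCTYPE."""


def contains_doctype(data):
    """Return True if the serialized message carries a DOCTYPE declaration."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Abort immediately; the internal subset is never processed.
        raise DoctypeFound(name)

    parser.StartDoctypeDeclHandler = on_doctype
    try:
        parser.Parse(data, True)
    except DoctypeFound:
        return True
    return False


def fault_code_for(data):
    """Hypothetical helper: fault code to report, or None if acceptable."""
    return "env:Sender" if contains_doctype(data) else None
```

Because the callback raises as soon as the declaration starts, the receiver does not expand any entities before rejecting the message. Input that is not well-formed still surfaces as the parser's own ExpatError, which this sketch leaves to the caller.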
Why did we make these decisions?
That's how SOAP works. The question is, of course: why? Primarily, the reasons are (a) performance and (b) simplicity. In the high performance regimes where some SOAP implementations will operate, the parsers will likely be tuned for SOAP message handling. Doing general entity substitution beyond that mandated by XML 1.0 (e.g. &lt;) implies a degree of buffer management, often data copying, etc., which can be a noticeable burden when going for truly high performance. This performance effect has been reported by workgroup members who are building high performance SOAP implementations.
Furthermore, a DTD in the Infoset would become another piece of the message. We would have questions to answer: what are the rules for relaying through an intermediary? If something comes into an intermediary as an entity reference, must it go out as an entity reference? If that header is removed by the intermediary, must one check whether it is the last use of the entity and should the outbound DTD have the definition removed? What does all this do to digital signatures? If we allowed an internal subset, should we change our rules to allow attributes to be defaulted? All of this is complication. So, in addition to performance, leaving out DTDs keeps things simpler, which by the way tends to avoid other performance problems.
Security is another concern. Although we have not formally demonstrated that XML with internal subset is less secure, several members of the workgroup shared an intuition that entity substitution, attribute defaulting, and other manipulation of the message content were more likely to lead to security exposures, denial of service attacks (e.g. the billion laughs entity attack), etc.
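The arithmetic behind the billion laughs concern is easy to sketch. In the usual form of that attack, each entity in an internal subset expands to several references to the previous one, so the expanded size grows exponentially with nesting depth while the document itself stays tiny. The Python below only computes sizes and generates the document text; it deliberately never parses it. The entity names, the fan-out of 10 and the depth of 9 are illustrative assumptions.

```python
def expanded_length(base_len, fanout, levels):
    """Characters produced when the outermost entity is fully expanded."""
    return base_len * fanout ** levels


def nested_entity_document(base="lol", fanout=10, levels=9):
    """Generate (but never parse) a document with nested entity definitions."""
    decls = [f'<!ENTITY e0 "{base}">']
    for i in range(1, levels + 1):
        # Each level refers to the previous level `fanout` times.
        refs = f"&e{i - 1};" * fanout
        decls.append(f'<!ENTITY e{i} "{refs}">')
    subset = "\n  ".join(decls)
    return f"<!DOCTYPE r [\n  {subset}\n]>\n<r>&e{levels};</r>"


doc = nested_entity_document()
wire_size = len(doc)                     # well under a kilobyte
expanded = expanded_length(3, 10, 9)     # 3 * 10**9 characters
```

A receiver that rejects any message containing a document type declaration never reaches the expansion step at all.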
Our reasons for disallowing reference to external DTDs were similar to those given above for the internal subset. In addition, we felt that it would not in general be appropriate to require a SOAP processor to open a connection to the Web in order to retrieve external DTDs.
Of course, the counter argument to all this is: XML allows internal subsets and external subsets, lots of off the shelf parsers would implement them for you, and indeed some might not report the presence of the DTD at all. First of all, SOAP is not the only application of XML that requires parsers to report the presence of DTDs. Surely an XML editor would as well. Indeed, there is no W3C specification for what a general purpose processor must be, just for what XML is. It is important to note that our HTTP binding does go to some trouble to ensure that all messages are XML-conformant. You CAN parse all legal SOAP messages from our HTTP binding with any XML processor. If your processor doesn't report the presence of DTDs or entity references, then you have an error checking problem. Get a processor that meets your needs. Again, many high performance SOAP implementations will have highly optimized parser implementations tuned for SOAP...our choices are designed in part to make such implementations practical.
Still, we are aware of the trade-off: our decision to limit use of constructions such as the internal subset is likely to reduce the performance of and otherwise negatively impact implementations and applications which would have otherwise been able to use certain general purpose processors; in many cases, those implementations will have to resort to additional scanning and reporting to deal with the features that we disallow.
Does SOAP define an XML Subset for the Rest of the World?
Maybe, but that certainly wasn't a goal, and there's some reason for caution. SOAP places other restrictions on its use of XML. For example (again from ):
"SOAP messages sent by initial SOAP senders MUST NOT contain processing instruction information items. SOAP intermediaries MUST NOT insert processing instruction information items in SOAP messages they relay. SOAP receivers receiving a SOAP message containing a processing instruction information item SHOULD generate a SOAP fault with the Value of Code set to "env:Sender". However, in the case where performance considerations make it impractical for an intermediary to detect processing instruction information items in a message to be relayed, the intermediary MAY leave such processing instruction information items unchanged in the relayed message."
This was the subject of long debate on distApp and in the working group, and this is not the place to reopen that debate. To give some flavor of the reasons why PIs are a problem, consider the following SOAP fragment:
<soap:Envelope>
  <soap:Header>
    <ns1:h1> ... </ns1:h1>
    <? your pi here -- does it modify ns2:h2 below ?>
    <ns2:h2> ... </ns2:h2>
    <ns3:h3> ... </ns3:h3>
  </soap:Header>
  <soap:Body>
    ...
  </soap:Body>
</soap:Envelope>
Consider an intermediary that processes and removes ns2:h2, the second header. Should it also remove the PI above when relaying the message to the next node? The PI might well be giving information about the element to follow, or else it might not. If we leave it in place, does it wind up inadvertently modifying the third header? The point is that any feature like PIs adds complication. SOAP bases all of its processing and semantics on the tree of elements. The fact that PIs are not tied to that tree in an architecturally robust manner makes it very hard to define simple or stable semantics for PIs as a SOAP message flows through a system. Furthermore, we would have other complications in the WS stack: should WSDL provide rules to describe when PIs are OK and when not? Which PIs? With what parameters? Another mess. Again, we kept it simple by ruling them out.
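The ambiguity can be reproduced in a few lines. The sketch below assumes Python 3.8 or later, where xml.etree.ElementTree.TreeBuilder accepts insert_pis=True so that PIs are kept in the tree. The message is a well-formed stand-in for the fragment above: the hint PI target, the urn:example namespaces, the envelope namespace and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the fragment above; all names are illustrative.
MESSAGE = (
    '<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"'
    ' xmlns:ns1="urn:example:ns1" xmlns:ns2="urn:example:ns2"'
    ' xmlns:ns3="urn:example:ns3">'
    "<soap:Header>"
    "<ns1:h1/><?hint applies-to-next-header?><ns2:h2/><ns3:h3/>"
    "</soap:Header>"
    "<soap:Body/>"
    "</soap:Envelope>"
)


def header_after_removing(message, tag):
    """Parse keeping PIs, drop the header block `tag`, return the Header."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
    root = ET.fromstring(message, parser=parser)
    header = root[0]
    for child in list(header):
        if child.tag == tag:
            header.remove(child)
    return header


header = header_after_removing(MESSAGE, "{urn:example:ns2}h2")
# The PI survives and now sits immediately before ns3:h3.
following = header[2].tag
```

After the removal the Header holds ns1:h1, the PI, and ns3:h3, in that order. Nothing in the tree records which element the PI was meant to annotate, so the intermediary cannot tell whether relaying it unchanged alters the meaning of the third header.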
SOAP uses XML Infosets and serializations to build a framework for messaging. By definition, SOAP envelope Infosets do not contain DTDs or entity references, and external DTDs are disallowed as well. SOAP uses pluggable bindings to move messages on the wire; those bindings have complete discretion as to how to represent the data. Some might try to play games using DOCTYPEs and DTDs on the wire, but our standard HTTP binding does not, and it's probably unlikely that others would.
Few XML applications use all the features of XML (some don't use attributes), but clearly SOAP eschews some features such as DTDs and PIs that are often viewed as relatively general purpose. This note sets out some of our reasons. All SOAP messages are conformant XML Infosets. All messages sent by our HTTP binding are conformant XML 1.0 and can if desired be processed with conformant processors. Like an XML editor, SOAP depends on knowing whether DTDs and PIs are in its XML (in our case, though, only for error checking). SOAP messages also tend to be processable at relatively high speed by carefully tuned processors. Furthermore, by prohibiting some of these features, we simplified the definition of the SOAP processing model and of description languages used with SOAP. The tradeoff is that we have somewhat complicated things for those who prefer to use certain off-the-shelf processors, and for those who want to insert arbitrary XML into SOAP messages (there are many other problems with doing that, a longer story than we have time for here).
Whether SOAP represents a good start on a general purpose subset of XML is not a question the XMLP group has actively considered. That was not a goal. We consider SOAP to be an application of XML, not a redefinition of it. We do hope the analysis above is useful to those who are indeed thinking about XML subsets, and that it clarifies the reasons for our decisions.
- for the XML Protocols WG -
P.S. Although it played no role that I am aware of in the actual decision making of the XMLP team, I'm indebted to Rich Salz for pointing out that the internet draft on "Guidelines for the Use of XML within IETF Protocols"  has some useful perspectives on related issues.
===========End of Draft=============
------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------