[This local archive copy is from the official and canonical URL, http://www.openxml.org/dev/rfc-wshp.html; please refer to the canonical source document if possible.]

White Space Handling In XML Parsing

Status: RFC first draft
Editor: arkin (arkin@openxml.org)
Original copy: http://www.openxml.org/dev/rfc-wshp.html

1. Abstract

White space handling is an unresolved issue in the present definition of XML parsers, falling outside the scope of both the DOM specification and the SAX API. This document recommends how XML parsers should treat white space appearing in the source document, and which portions of it they should deliver to the application.

This RFC is published and made available for public review in an open process. We encourage parser developers to take part in formulating the final specification and to abide by it, in an effort to provide a uniform behavioral model that will allow applications and documents to be portable across a variety of parsers.

2. The Problem

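XML defines white space as any of the characters space, tab, carriage-return and new-line. Because line ends are normalized before parsing, a literal carriage-return in the source document is always conveyed to the parser as a new-line, so this document treats white space as the set space, tab and new-line. (See sections 2.3 and 2.11 of the XML specification.)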
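The line-end handling referred to above can be sketched as follows. This is a minimal illustration in Python, assuming the rule of section 2.11 of the XML specification (a carriage-return followed by a new-line, or a carriage-return on its own, becomes a single new-line). The function names are hypothetical and not part of any parser API.

```python
# White space as seen by the parser once line ends have been normalized.
WHITE_SPACE = " \t\n"

def normalize_line_ends(source):
    """Convert CR LF pairs and lone CR characters to a single new-line."""
    return source.replace("\r\n", "\n").replace("\r", "\n")

def is_white_space(text):
    """True if text is non-empty and made up only of space, tab, new-line."""
    return text != "" and all(ch in WHITE_SPACE for ch in text)
```

The CR LF pair is replaced first so that it yields one new-line rather than two.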

White space serves two distinct purposes. The first is to introduce spaces and line breaks into the element content in a manner that has semantic significance for the XML application, whether this is to separate words and parts of text, to describe visual formatting, or otherwise.

The second is the use of white space to visually format the document in its source form, e.g. when using a text editor to edit an XML file. Such use of white space is purely to assist the reader or editor of the document. This white space is not part of the information conveyed by the document and bears no semantic significance for the XML application.

An XML parser that regards all white space as part of the element content and as conveying information might, with liberally formatted documents, deliver redundant spaces to the application, affecting performance and memory consumption. In addition, the application must employ special code to remove such white space.
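The effect can be observed with a parser that delivers all white space. The sketch below uses Python's standard xml.dom.minidom module purely as an illustration; the sample document and the helper name child_summary are assumptions made for this example.

```python
from xml.dom import minidom

# A small document indented for readability in a text editor.
SOURCE = """<list>
  <item>one</item>
  <item>two</item>
</list>"""

def child_summary(xml_text):
    """List the children of the root element as (kind, value) pairs."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary

children = child_summary(SOURCE)
# Text children of <list> consist only of indentation and line breaks.
formatting_only = [value for kind, value in children
                   if kind == "text" and value.strip(" \t\n") == ""]
```

Every text child of the list element in this document is formatting white space, yet each one reaches the application as a text node that must be stored and later skipped.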

3. Notation

XML Application: An application that manipulates XML information delivered to it as a document model. The XML application is interested in the document model and the information contained in it, but not in the source document proper. This definition is different from the use in the XML specification, where application is used to describe part of the XML parser that builds the document model.

XML Parser: An integral software component that, given an XML source document, returns a document model representing the information and structure conveyed by that source document. The XML parser is a superset of the XML processor described in the XML specification.

Document Model: The document model returned by the XML parser is equivalent to one created in a programmatic fashion. The DOM document tree is one such document model. The events triggered by a SAX parser are not considered a document model as they demand further processing. However, a document handler may process them and fire different events that can be considered a document model.

4. Scope And Effect

This specification defines a contract between the XML source document and the XML parser. The contract clearly defines which portions of the white space appearing in the source document are a meaningful part of the document content and must be delivered to the application, and which portions serve only to format the document source and should be ignored.

Given a source document that contains both types of white space, the XML parser aims to produce a document model that contains neither less nor more than the meaningful information expressed in the source document. That document model should be equivalent to one generated in a programmatic fashion.

This specification is limited to white space appearing in mixed and element content, that is, all characters appearing between the opening and closing tags of an element that are not part of any markup. White space that appears in attribute values, or as part of markup, is outside the scope of this specification.

This specification assumes that the application is not interested in processing redundant white space unless it specifically says so, and that the document itself is capable of distinguishing between relevant and redundant white space. As such, this specification has no implications for the handling of white space as defined in XSL, XQL and other processing languages.

The behavior of the parser with regard to white space is to be defined in a clear, consistent and conclusive manner so as to allow applications and documents to be used consistently with different parsers. The same consistency is to be applied to the manner in which the application and document exert control over white space handling.

5. Proposed Handling Behavior

The proposed white space handling behavior is expressed as two rule sets. The first rule set consists of implicit rules that apply if no white space handling behavior is explicitly specified. The second rule set defines the explicitly specified behavior and how to bring it into effect.

5.1. Default Behavior

The first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag are ignored.
All white space characters other than space (tab and new-line) are translated into a space character, and each run of multiple space characters is consolidated into a single space.
A sequence of white space occurring between any two pieces of markup (elements, comments, processing instructions, CDATA sections), except when it appears between two elements, is ignored.
A sequence of white space occurring between two elements is ignored if the containing element is defined to have element content. If the containing element is defined to have mixed content, such white space is treated according to the first two rules.
White space introduced through expansion of character references (e.g. &#32;) or entity references is preserved, and not considered white space per the above rules. However, white space appearing in the entity declaration is subject to the parsing rules at the time of parsing the entity declaration.
CDATA sections preserve all white space occurring between the opening <![CDATA[ and closing ]]>.
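
The first two default rules can be sketched as a post-processing pass over a tree built by Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not a conforming implementation: every element is treated as having mixed content, and because ElementTree expands character references and merges CDATA sections before the text is visible, the rules on references and CDATA sections cannot be honored here. The names collapse and apply_default_rules are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

_WHITE_SPACE_RUN = re.compile(r"[ \t\n]+")

def collapse(text):
    """Turn each run of space, tab and new-line into one space."""
    return _WHITE_SPACE_RUN.sub(" ", text or "")

def apply_default_rules(elem):
    """Apply the first two default rules to elem and its descendants."""
    children = list(elem)
    # White space right after the opening tag is ignored.
    elem.text = collapse(elem.text).lstrip(" ")
    if not children:
        # With no child elements, the same text also ends at the closing tag.
        elem.text = elem.text.rstrip(" ")
    for index, child in enumerate(children):
        apply_default_rules(child)
        child.tail = collapse(child.tail)
        if index == len(children) - 1:
            # White space right before the closing tag is ignored.
            child.tail = child.tail.rstrip(" ")

root = ET.fromstring("<p>\n  Hello,\n\t<b> big </b>   world\n</p>")
apply_default_rules(root)
```

After the pass, the text before the b element is "Hello, ", the content of b is "big", and the text after it is " world": the spaces that separate words survive as single spaces, while indentation and line breaks do not.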

5.2. Specified Behavior

An element requests that white space be preserved by specifying the attribute 'xml:space' and using the value 'preserve'. The element may specify this attribute explicitly or inherit it from the document type definition. It is recommended that elements specify this attribute explicitly.
Preserving implies that white space is passed as is to the application, without any transformation or loss, with the exception that, if the first character after the opening tag is a new-line or the last character before the closing tag is a new-line, that character is ignored.
Elements that do not specify a value for the 'xml:space' attribute inherit that value from the element in which they are contained up to the root element. If the root element does not specify a value for the 'xml:space' attribute, the value 'default' is assumed.
It is possible to instruct the XML parser to supply the root element with the 'preserve' value for the 'xml:space' attribute, if no value is explicitly specified for it. (The exact mechanism is TBD.)
When expanding an entity reference, the value of the 'xml:space' attribute of the element in which the entity is expanded has no effect on the expansion of the entity.
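
The inheritance of 'xml:space' and the new-line exception for preserved text can be sketched as below, again using Python's xml.etree.ElementTree as an illustration. The root_default parameter is a hypothetical stand-in for the yet undefined mechanism that supplies 'preserve' to the root element, attribute defaults declared in a document type definition are not considered, and preserved_text covers only an element whose content is text alone.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml: prefix through its namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root, root_default="default"):
    """Map every element to its own or inherited xml:space value."""
    result = {}

    def visit(elem, inherited):
        value = elem.get(XML_SPACE, inherited)
        result[elem] = value
        for child in elem:
            visit(child, value)

    visit(root, root_default)
    return result

def preserved_text(text):
    """Pass text through unchanged, minus one new-line at either end."""
    text = text or ""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text

doc = ET.fromstring(
    '<a><b xml:space="preserve"><c/><d xml:space="default"/></b><e/></a>'
)
spaces = effective_space(doc)
```

In this sample, b and c resolve to 'preserve', d overrides the inherited value with 'default', and a and e fall back to 'default' because the root element specifies nothing.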

6. Mixed Content vs. Element Content

XML element content is either made up only of elements (element content), or consists of both elements and text (mixed or any content). In the former case, all white space occurring before, after and between elements in the element content is ignored, and all other characters are reported as validation errors. In the latter case, white space occurring between elements is subject to the preserving or consolidation rules.

This approach is clear and consistent, with the exception that validating and non-validating parsers will parse the same document differently. In some instances it is beneficial to parse documents without the use of a DTD. In such instances it is recommended that the document be made available without redundant spaces, which would otherwise cause excessive text nodes to be generated.
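
The difference between the two kinds of parser can be sketched as follows. The set of element names passed as element_content is a hypothetical stand-in for what a validating parser would learn from the DTD, and Python's xml.etree.ElementTree is used only as an illustration. Characters other than white space in element content, which a validating parser would report as errors, are left untouched by this sketch.

```python
import xml.etree.ElementTree as ET

def drop_element_content_space(elem, element_content):
    """Remove white space-only text inside elements known to hold only elements."""
    if elem.tag in element_content:
        if elem.text is not None and elem.text.strip(" \t\n") == "":
            elem.text = None
        for child in elem:
            if child.tail is not None and child.tail.strip(" \t\n") == "":
                child.tail = None
    for child in elem:
        drop_element_content_space(child, element_content)

def count_text_pieces(elem):
    """Count the pieces of character data held anywhere in the tree."""
    return sum((e.text is not None) + (e.tail is not None)
               for e in elem.iter())

SOURCE = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"

without_dtd = ET.fromstring(SOURCE)   # nothing known about <list>
with_dtd = ET.fromstring(SOURCE)      # <list> known to have element content
drop_element_content_space(with_dtd, {"list"})
```

Without knowledge of the content model the tree holds five pieces of character data, three of them formatting only. With it, only the two item values remain.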

A. References

Extensible Markup Language (XML) 1.0 W3C Recommendation 10-Feb-98
http://www.w3.org/TR/1998/REC-xml-19980210
SAX 1.0: The Simple API for XML
http://www.megginson.com/SAX/