[This local archive copy is from the official and canonical URL, http://www.openxml.org/dev/rfc-wshp.html; please refer to the canonical source document if possible.]
White Space Handling In XML Parsing
White space handling is an unresolved issue in the present definition of XML parsers, falling outside the scope of both the DOM specification and the SAX API. This is a recommendation for the behavior of XML parsers with regard to white space appearing in the source document, and for which portions of it are to be delivered to the application.
This RFC is published and made available for public review in an open process. We encourage parser developers to take part in formulating the final specification and to abide by it, in an effort to provide a uniform behavioral model that will allow applications and documents to be portable across a variety of parsers.
2. The Problem
White space is defined by XML as any character of the set space, tab, carriage-return and new-line. Because line ends are normalized on input, a literal carriage-return (alone or followed by a new-line) is always conveyed as a single new-line. (See sections 2.3 and 2.11 in the XML specification.)
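The character set and line-end handling described above can be sketched in a few lines of Python. This is only an illustration: the names `XML_WHITESPACE`, `normalize_line_ends` and `is_xml_whitespace` are hypothetical and not part of any parser API. The set includes carriage-return because the white space production in section 2.3 of the XML specification lists it, even though line-end normalization removes literal carriage-returns before parsing.

```python
# Characters matched by the white space production in XML 1.0, section 2.3.
XML_WHITESPACE = " \t\r\n"


def normalize_line_ends(text: str) -> str:
    """Turn CR LF pairs and lone CRs into a single LF (XML 1.0, section 2.11)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def is_xml_whitespace(text: str) -> bool:
    """Return True if text is non-empty and consists only of XML white space."""
    return text != "" and all(ch in XML_WHITESPACE for ch in text)
```

Note that Python's own `str.isspace()` and argument-less `str.strip()` recognize additional Unicode characters (the no-break space, for example) that XML does not treat as white space, which is why the sketch tests against an explicit set.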
White space serves two distinct purposes. The first is to introduce spaces and line breaks into the element content in a manner that has semantic significance for the XML application, whether this is to separate words and textual parts, to describe visual formatting, or otherwise.
The second is the use of white space to visually format the document in its source form, e.g. when using a text editor to edit an XML file. Such use of white space is purely to assist the reader or editor of the document. This white space is not part of the information conveyed by the document and bears no semantic significance for the XML application.
An XML parser that regards all white space as part of the element content and as conveying information might, with liberally formatted documents, deliver redundant spaces to the application, affecting performance and memory consumption. In addition, the application must employ special code to remove such white space.
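The effect can be observed with a parser that delivers all white space, such as the non-validating `xml.dom.minidom` in the Python standard library. The sketch below parses a small, hypothetical indented document, counts the white-space-only text nodes among the children of the root, and then removes them with the kind of special code an application would otherwise have to carry. The helper names `is_blank_text` and `strip_blank_text` are assumptions made for this illustration.

```python
from xml.dom import minidom

XML_WHITESPACE = " \t\r\n"

# A liberally formatted source document (hypothetical example).
SOURCE = """<list>
  <item>one</item>
  <item>two</item>
</list>"""


def is_blank_text(node) -> bool:
    """True for a text node made up only of XML white space."""
    return node.nodeType == node.TEXT_NODE and node.data.strip(XML_WHITESPACE) == ""


def strip_blank_text(node) -> None:
    """Recursively remove white-space-only text nodes below node."""
    for child in list(node.childNodes):  # copy: the live list changes on removal
        if is_blank_text(child):
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_blank_text(child)


root = minidom.parseString(SOURCE).documentElement
nodes_before = len(root.childNodes)  # item elements plus blank text nodes
blank_before = sum(1 for n in root.childNodes if is_blank_text(n))
strip_blank_text(root)
nodes_after = len(root.childNodes)  # only the item elements remain
```

Blindly stripping every blank text node is only safe when the application knows the white space carries no meaning. In mixed content a single space between two inline elements may be significant, which is exactly the distinction this recommendation asks the parser to make.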
4. Scope And Effect
This specification defines a contract between the XML source document and the XML parser. The contract clearly defines which portions of the white space appearing in the source document are a meaningful part of the document content and must be delivered to the application, and which portions serve only to format the document source and should be ignored.
Given a source document that contains both types of white space, the XML parser aims to produce a document model that contains neither less nor more than the meaningful information expressed in the source document. That document model should be equivalent to one generated programmatically.
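As a rough illustration of that equivalence goal, the following Python sketch builds a small document through DOM calls and compares it with two parsed versions of the same hypothetical content, one compact and one indented. It uses `xml.dom.minidom`, which delivers all white space, and compares serialized forms as a simple stand-in for comparing document models. The function name `build_list` and the sample documents are assumptions made for this example.

```python
from xml.dom import minidom


def build_list(words):
    """Build a list/item document through DOM calls, with no source text."""
    doc = minidom.Document()
    root = doc.createElement("list")
    doc.appendChild(root)
    for word in words:
        item = doc.createElement("item")
        item.appendChild(doc.createTextNode(word))
        root.appendChild(item)
    return doc


built = build_list(["one", "two"]).documentElement.toxml()

compact = minidom.parseString("<list><item>one</item><item>two</item></list>")
formatted = minidom.parseString(
    "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"
)

# Serialized forms are compared as a simple stand-in for model equivalence.
same_as_compact = built == compact.documentElement.toxml()
same_as_formatted = built == formatted.documentElement.toxml()
```

With a parser that keeps all white space, only the compact source matches the programmatically built model. Under the behavior proposed here, the indented source would be expected to match it as well.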
This specification is limited to white space appearing in mixed and element content, that is, all characters appearing between the opening and closing tags of an element that are not part of any markup. White space that appears in attribute values, or as part of markup, is outside the scope of this specification.
This specification assumes that the application is not interested in processing redundant white space unless it specifically says so, and that the document itself is capable of distinguishing between relevant and redundant white space. As such, this specification has no implications for the handling of white space as defined in XSL, XQL and other processing languages.
The behavior of the parser with regard to white space is to be defined in a clear, consistent and conclusive manner so as to allow applications and documents to be used consistently with different parsers. The same consistency is to be applied to the manner in which the application and document exert control over white space handling.
5. Proposed Handling Behavior
The proposed white space handling behavior is expressed as two rule sets. The first rule set consists of implicit rules that apply if no white space handling behavior is explicitly specified. The second rule set defines the explicit behavior and how to bring it into effect.
5.1. Default Behavior
5.2. Specified Behavior
6. Mixed Content vs. Element Content
XML element content is either made up only of elements (element content), or consists of both elements and text (mixed or any content). In the former case, all white space occurring before, after and between elements in the element content is ignored, and all other characters are reported as validation errors. In the latter case, white space occurring between elements is subject to the preserving or consolidation rules.
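The distinction can be sketched as a small filter over the children of one element. This is a simplified, hypothetical model rather than a parser API: character data is represented as a Python `str`, a child element as a tuple such as `("item",)`, and the `element_only` flag stands in for what a parser would learn from the content model in the DTD. The preserving and consolidation rules are not modeled, so text in mixed content is passed through unchanged.

```python
XML_WHITESPACE = " \t\r\n"


class ValidationError(Exception):
    """Raised when character data appears where only elements are allowed."""


def filter_children(children, element_only):
    """Apply the element-content rule to a flat list of children.

    In this simplified, hypothetical model a child is either a str (character
    data) or a tuple such as ("item",) standing in for a child element.
    """
    kept = []
    for child in children:
        if not isinstance(child, str):
            kept.append(child)  # child elements are always kept
        elif not element_only:
            kept.append(child)  # mixed content: left to the preserve/consolidate rules
        elif child.strip(XML_WHITESPACE) != "":
            raise ValidationError(f"unexpected character data: {child!r}")
        # Otherwise: white space inside element content, silently ignored.
    return kept
```

For example, `filter_children(["\n  ", ("item",), "\n"], element_only=True)` keeps only the element, while the same call with `element_only=False` returns the list unchanged.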
This approach is clear and consistent, with one exception: validating and non-validating parsers will parse the same document differently, since only a parser that has read the DTD can tell element content from mixed content. In some instances it is beneficial to parse documents without the use of a DTD. In such instances it is recommended that the document be made available without redundant white space, which would otherwise cause excess text nodes to be generated.