SGML: David G. Durand's "Five Paragraphs" on White Space in XML

received="Thu Aug 28 20:09:33 1997 BST"
sent="Thu, 28 Aug 1997 15:07:03 -0500"
name="David G. Durand"
email="dgd@cs.bu.edu"

  ------------------------------------------------------------------------

My 5 paragraphs

See the archive: http://www.lists.ic.ac.uk/hypermail/xml-dev/9708/0147.html

David G. Durand (dgd@cs.bu.edu)
Thu, 28 Aug 1997 15:07:03 -0500

I decided that it wasn't very helpful to punt on my own suggestion. A few
<P>'s on whitespace follow. Adapt/preserve/wrap-fish-in them as you like.

Whitespace in XML is always significant wherever data may appear. This means
that XML truly treats each character that is not inside a tag or a comment
as a potentially meaningful piece of information. Just as comments are
generally not significant to most applications (but must be preserved by
editors and other transduction processes), whitespace in a few contexts is
generally best not made the crux of a critical distinction. For instance, in an
element whose sole allowed content according to the DTD is other elements
(SGML term: element content) it is probably a bad idea to base processing
semantics on the presence or absence of whitespace.
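
A minimal sketch of that advice, using Python's standard
xml.etree.ElementTree (my choice of tool, not something from the XML draft).
The element names and the element_view helper are made up for illustration.
The parser still reports the whitespace between the elements, but a view
built only from the elements comes out the same with or without it:

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in whitespace between elements.
PRETTY = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"
COMPACT = "<list><item>a</item><item>b</item></list>"

def element_view(xml_text):
    """Return (tag, text) pairs for the child elements, ignoring the
    whitespace that sits between them."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]

pretty_root = ET.fromstring(PRETTY)

# The parser does hand over the inter-element whitespace ...
leading_whitespace = pretty_root.text

# ... but processing based only on elements sees no difference.
same_view = element_view(PRETTY) == element_view(COMPACT)
```

Here leading_whitespace holds the newline and indentation before the first
item, while same_view is true.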

However, since linking depends critically on where any characters not in
tags occur, even such space should not be casually deleted, as it may cause
hyperlinks in other documents to break. This happens because hyperlinks
have to be able to count sub-parts of an element's content, and the
whitespace between two elements is such a sub-part.
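
To make the counting concrete, here is a small sketch with Python's
xml.dom.minidom. The "link" is only a hypothetical child index, not any real
linking syntax, and the document is invented for the example. Deleting one
space between two elements shifts what the index points at:

```python
from xml.dom import minidom

# Hypothetical documents: the second has the inter-element space deleted.
WITH_SPACE = "<doc><a/> <b/></doc>"
WITHOUT_SPACE = "<doc><a/><b/></doc>"

def child_kinds(xml_text):
    """List the node names of the root's children, text nodes included."""
    root = minidom.parseString(xml_text).documentElement
    return [node.nodeName for node in root.childNodes]

before = child_kinds(WITH_SPACE)     # the space shows up as a '#text' child
after = child_kinds(WITHOUT_SPACE)

# A made-up link meaning "the third sub-part of <doc>" (index 2).
target_index = 2
resolved_before = before[target_index]
resolved_after = after[target_index] if target_index < len(after) else None
```

Before the deletion the link resolves to the b element; afterwards there is
no third sub-part at all, so the link dangles.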

XML provides some hints as to the treatment of whitespace in the SPACE
attribute. For most applications like browsing or typesetting or importing
into a database, normalizing such whitespace should be harmless to the
semantic content. You might break links in these cases as well, but any
change can break a link.
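
As a rough sketch of how a browsing-style application might use the hint:
the code below assumes the attribute is spelled xml:space with the values
"default" and "preserve", as in the XML 1.0 Recommendation (the drafts under
discussion here called it the SPACE or XML-SPACE attribute). The text_view
helper and the sample document are hypothetical, and for brevity it looks
only at the text directly inside each start tag, not at text following child
elements:

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def text_view(elem, preserve=False):
    """Return (tag, text) pairs, collapsing runs of whitespace unless
    xml:space="preserve" is in effect (the setting is inherited)."""
    mode = elem.get(XML_SPACE)
    if mode == "preserve":
        preserve = True
    elif mode == "default":
        preserve = False
    text = elem.text or ""
    if not preserve:
        text = " ".join(text.split())
    view = [(elem.tag, text)]
    for child in elem:
        view.extend(text_view(child, preserve))
    return view

DOC = ('<doc><p>  some   loose\n text </p>'
       '<pre xml:space="preserve">  keep   this </pre></doc>')

view = text_view(ET.fromstring(DOC))
```

The p text is collapsed to single spaces, while the pre text keeps its
spacing exactly as written.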

If you are creating an editing or transduction application, you should
_not_ change any whitespace without explicit authorization from the author
-- such changes may damage links and file offsets that the user wants to
preserve. This is the same kind of restriction as you must observe when
editing an XML file with comments or PIs in it. Even if you don't use or
understand these, they must be preserved in the general case.
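
A small sketch of what "preserve" means in practice, again using Python's
xml.dom.minidom as a stand-in for a transduction tool. The sample document
is invented. It round-trips cleanly for this tiny input; I am not claiming
minidom reproduces every file byte for byte (attribute quoting and
character references, for instance, may be rewritten):

```python
from xml.dom import minidom

# Hypothetical input containing a comment, a PI and inter-element whitespace.
SOURCE = "<doc><!-- note --> <?app hint?>\n  <a/>\n</doc>"

dom = minidom.parseString(SOURCE)

# Nothing has been edited, so a careful transducer should hand back
# the comment, the PI and every whitespace character it read.
round_tripped = dom.documentElement.toxml()

unchanged = round_tripped == SOURCE
```

For this input unchanged is true: the comment, the PI and the whitespace
all survive even though the program never looked at them.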

In sum, there are two kinds of applications that can use XML. The first are
those, like editors, that should preserve all the information they can
(including all whitespace, comments, PIs, etc.) unless instructed otherwise.
We might call
these transduction applications, because they produce a representation of
the document they read as their output. The other sort of application --
call them processing applications -- is responsible for processing the
results of an XML parser, and may ignore comments and PIs, normalize
whitespace (as warranted by knowledge of the DTD, tags, or XML-SPACE
hints), and so forth. Such applications are generally creating a specific
view or result from the data in an XML document, and may do that in any way
that produces the desired result.

-- David

_________________________________________
David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com
Boston University Computer Science \ Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams
--------------------------------------------\ http://dynamicDiagrams.com/
MAPA: mapping for the WWW \__________________________



xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)