[This local archive copy mirrored from the canonical site: http://www.jclark.com/xml/canonxml.html, 980126; links may not have complete integrity, so use the canonical document at this URL if possible.]
This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.
Every well-formed XML document has a unique structurally equivalent canonical XML document. Two structurally equivalent XML documents have a byte-for-byte identical canonical XML document. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.
A canonical XML document conforms to the following grammar:
CanonXML ::= Pi* element Pi*
element  ::= Stag (Datachar | Pi | element)* Etag
Stag     ::= '<' Name Atts '>'
Etag     ::= '</' Name '>'
Pi       ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'
Atts     ::= (' ' Name '=' '"' Datachar* '"')*
Datachar ::= '&amp;' | '&lt;' | '&gt;' | '&quot;' | '&#13;' | (Char - ('&' | '<' | '>' | '"' | #xD))
Name     ::= (see XML spec)
Char     ::= (see XML spec)
S        ::= (see XML spec)
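The Datachar production can be read as an escaping rule for character data and attribute values. The following is a minimal sketch, assuming the five escaped alternatives of Datachar are the references &amp;amp;, &amp;lt;, &amp;gt;, &amp;quot; and &amp;#13;, one for each character excluded from the literal alternative. The name escape_datachars is hypothetical and is not part of this document.

```python
# Escaped forms for the characters that may not appear literally as a Datachar.
# Assumption: the escapes are &amp; &lt; &gt; &quot; and &#13;, matching the
# five characters excluded from Char in the Datachar production.
_DATACHAR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\r": "&#13;",
}


def escape_datachars(text: str) -> str:
    """Rewrite each character of text as a Datachar sequence."""
    return "".join(_DATACHAR_ESCAPES.get(ch, ch) for ch in text)
```

Every other character, including tab and line feed under this reading of the grammar, is passed through unchanged.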
Attributes are sorted lexicographically by name (in Unicode bit order).
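As an illustration of the ordering rule, the sketch below serializes an attribute list in the form of the Atts production. It assumes that "Unicode bit order" means comparison by code point, which is how Python compares strings, and that attribute values have already been escaped as Datachar sequences. The function name canonical_atts is hypothetical.

```python
from typing import Mapping


def canonical_atts(attributes: Mapping[str, str]) -> str:
    """Serialize attributes as the Atts production, sorted by name."""
    # Python compares str values code point by code point, which is the
    # ordering assumed here for "Unicode bit order".
    return "".join(
        f' {name}="{value}"' for name, value in sorted(attributes.items())
    )
```

For example, canonical_atts({"b": "2", "a": "1", "B": "3"}) places B before a and b, because upper-case Latin letters have lower code points than lower-case ones.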
A canonical XML document is encoded in UTF-8.
Ignorable white space is considered significant and is treated equivalently to data.
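The rules above can be combined into a small canonicalizer. The following is a sketch only, not a reference implementation: it uses the xml.sax module from the Python standard library, assumes the Datachar escapes are &amp;amp;, &amp;lt;, &amp;gt;, &amp;quot; and &amp;#13;, and assumes code point order for attributes. The names CanonHandler and canonicalize are hypothetical.

```python
import xml.sax


class CanonHandler(xml.sax.ContentHandler):
    """Hypothetical SAX handler that accumulates canonical XML text."""

    # Assumed escaped forms for the characters excluded from Datachar.
    _ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "\r": "&#13;"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def _escape(self, text):
        return "".join(self._ESCAPES.get(ch, ch) for ch in text)

    def startElement(self, name, attrs):
        # Attributes are written in sorted order of their names.
        atts = "".join(
            f' {key}="{self._escape(value)}"'
            for key, value in sorted(attrs.items())
        )
        self.parts.append(f"<{name}{atts}>")

    def endElement(self, name):
        # Empty elements are always written as a start tag and an end tag.
        self.parts.append(f"</{name}>")

    def characters(self, content):
        self.parts.append(self._escape(content))

    # Ignorable white space is emitted exactly like character data.
    ignorableWhitespace = characters

    def processingInstruction(self, target, data):
        # The Pi production always has one space after the target name.
        self.parts.append(f"<?{target} {data}?>")


def canonicalize(document: bytes) -> bytes:
    """Parse a well-formed document and return its canonical form in UTF-8."""
    handler = CanonHandler()
    xml.sax.parseString(document, handler)
    return "".join(handler.parts).encode("utf-8")
```

Two documents that differ only in attribute order, quoting style, empty-element syntax or the choice of character reference should then yield the same bytes, which is the property a test harness for XML processors would compare.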
James Clark