[This local archive copy is from the official and canonical URL, http://www.w3.org/1999/07/WD-xml-c14n-19990729.html; please refer to the canonical source document if possible.]

Canonical XML

W3C Working Draft 29-July-1999

This version: http://www.w3.org/1999/07/WD-xml-c14n-19990729
Latest version: http://www.w3.org/TR/xml-c14n
Editors: Tim Bray <tbray@textuality.com>; James Clark <jjc@jclark.com>; James Tauber <jtauber@jtauber.com>

Status of this document

This is a W3C Working Draft for review by W3C members and other interested parties. This draft represents the consensus of the W3C XML Syntax Working Group, which expects to make no further substantive changes. However, review by the W3C and the broader community may reveal new issues and lead to substantive changes. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than work in progress.

Comments on this WD should be sent to: w3c-xml-c14n-comments@w3.org.

Abstract

This document describes a subset of the information contained in an XML document and a syntax for expressing that subset. This syntax, called Canonical XML, is designed to encode the "logical structure" of XML documents; two XML documents whose Canonical-XML form is identical will be considered equivalent for the purposes of many applications.

Appendices

A. References
B. Acknowledgements (Non-Normative)

1. Introduction

The XML 1.0 Recommendation [XML] describes the syntax of a class of data objects called XML documents. It is possible for XML documents which are equivalent for the purposes of many applications to differ in their physical representation. In particular, they may differ in their entity structure, attribute ordering, and character encoding. This means that much equivalence testing of XML documents cannot be done at the byte-comparison level. This Canonical XML specification aims to introduce a notion of equivalence between XML documents which can be tested at the syntactic level and, in particular, by byte-for-byte comparison. In the syntax it describes, "logically equivalent" documents are byte-for-byte identical.

[Definition:] The syntax described in this specification is called Canonical XML. [Definition:] XML documents may be transformed into Canonical XML (potentially with some information loss); the result of this transformation is described as the canonical form of the original document. Canonical XML is XML: that is to say, the canonical form of any XML document is an XML document.

There are two essential aspects to the specification of Canonical XML:

Which information from an XML document is included in its canonical form (and which is not).
How information is expressed in Canonical XML.

2. Information Included in Canonical XML

The information in an XML document is that described by the XML Information Set Specification [Infoset]. This section describes, in terms of the information set, what portion of this information is preserved in Canonical XML.

Note that information not included in Canonical XML may still be used in producing it. In particular:

Attribute types serve as the basis of the normalization process for attribute values in Canonical XML, but the type of attributes is not preserved in it.
The replacement text of general parsed entities that are referenced is included in Canonical XML, but the information about which entity any character or logical structure came from is not.
Attribute values provided by default are included in Canonical XML, but the fact that the value was provided by default is not.

2.1 The Document Information Item

The canonical form includes none of the optional properties of the document information item, in particular no notation information items or entity information items.

2.2 Element Information Items

The canonical form includes all of the required properties of the element information item except for the reference to unknown entity information item. It includes none of the optional properties of the element information item.

2.3 Attribute Information Items

The canonical form includes all of the required properties, but none of the optional properties, of the attribute information item.

2.4 Processing Instruction Information Item

Canonical XML does not include processing instruction information items.

2.5 Reference to Unknown Entity Information Items

Reference to unknown entity information items are not included in the canonical form of a document. Such information items could not appear in Canonical XML because canonicalization requires the reading of declarations for all entities referenced in a document.

2.6 Character Information Items

The canonical form includes all of the required properties of the character information item except for the flag indicating whitespace within element content. None of the optional properties of the character information item are included.

In particular, no CDATA sections occur in the canonical form. They are not necessary since all syntactically-significant characters in Canonical XML are escaped in the fashion described in this specification.

2.7 Comment Information Items

Canonical XML does not include comment information items.

2.8 Document Type Declaration Information Items

Canonical XML does not include document type declaration information items.

2.9 Entity Information Items

Canonical XML does not include entity information items.

2.10 Notation Information Items

Canonical XML does not include notation information items.

2.11 Attribute Declaration Information Items

Canonical XML does not include attribute declaration information items.
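
As an informal illustration of sections 2.4 and 2.7, the sketch below parses a small hypothetical document with Python's xml.etree.ElementTree, whose default tree builder happens to discard comments and processing instructions. It is not a canonicalizer; it only shows which information survives.

```python
import xml.etree.ElementTree as ET

# Hypothetical input mixing character data, a comment and a processing instruction.
SOURCE = "<x>a<!-- dropped --><?pi dropped?>b</x>"

root = ET.fromstring(SOURCE)

# The default tree builder keeps elements, attributes and character data only.
kept_text = "".join(root.itertext())
child_count = len(root)

print(repr(kept_text), child_count)
```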

3. Document Type Definition Processing

The process of canonicalizing an XML document depends on its standalone document declaration. If the declaration is present and its value is "yes", then assuming the XML document satisfies the Standalone Document Declaration validity constraint, no external portion of the DTD can contain material which affects its canonical form.

In all other cases, the process of canonicalization requires reading the DTD. The following information from the DTD affects the canonical form of an XML document:

Default attribute values.
Declarations of general entities which are referenced in the document.
Attribute type declarations which affect the process of attribute value normalization.

Note that the process of canonicalization is effectively impossible for a non-standalone document for which some external component of the DTD cannot be retrieved. Implementors of software which is designed to produce Canonical XML should provide an interface to users such that this error condition can be signaled.

The canonical form of an XML document is standalone.
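
One possible shape for the error-signaling interface mentioned above is sketched here. The exception class, the function name, and the `fetch` callback are all hypothetical; the draft does not prescribe any particular API.

```python
from typing import Callable, Optional


class ExternalDTDUnavailable(Exception):
    """Hypothetical error raised when a required external DTD part cannot be read."""


def external_dtd_for(standalone: Optional[str], fetch: Callable[[], str]) -> Optional[str]:
    """Return the external DTD text if canonicalization needs it, else None.

    `standalone` is the value of the standalone document declaration,
    or None when the declaration is absent.
    `fetch` is a hypothetical callback that retrieves the external DTD subset.
    """
    if standalone == "yes":
        # External declarations cannot affect the canonical form.
        return None
    try:
        return fetch()
    except OSError as exc:
        raise ExternalDTDUnavailable("external DTD could not be retrieved") from exc
```

A caller would catch ExternalDTDUnavailable and report that the document cannot be canonicalized, rather than emitting a possibly incorrect canonical form.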

4. Entity and Reference Processing

The canonical form of an XML document contains no general entity references - all such references are expanded so that the canonical form contains only the replacement text. Since it contains no DTD, it also contains no parameter entity references.

Suppose a file named "e1.xml" contains the following text, with no trailing newline (#xA) character.

Hallelujah, I'm a bum!

then if the following XML document is stored in a file in the same directory

<!DOCTYPE d [
 <!ENTITY lsb '['>
 <!ENTITY rsb ']'>
 <!ENTITY bum SYSTEM "e1.xml">
]>
<d>&lsb;&bum;&rsb;</d>

its canonical form is

<d>[Hallelujah, I'm a bum!]</d>
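
The expansion of internal general entities can be observed with an ordinary XML parser. The sketch below uses Python's xml.etree.ElementTree and, as a simplifying assumption, inlines the text of "e1.xml" instead of declaring an external entity, since that parser does not load external entities by default.

```python
import xml.etree.ElementTree as ET

# Variant of the example above with the external entity's text inlined (assumption).
DOC = """<!DOCTYPE d [
 <!ENTITY lsb '['>
 <!ENTITY rsb ']'>
]>
<d>&lsb;Hallelujah, I'm a bum!&rsb;</d>"""

root = ET.fromstring(DOC)

# The parser reports only replacement text; no trace of the entities remains.
expanded = root.text
print(expanded)
```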

5. The Syntax of Canonical XML

This section describes the syntax of Canonical XML. This syntax is a proper subset of the syntax of XML 1.0. The canonical form of an XML document is identical to its original form except as described in this section.

Each Canonical XML document must match the production labeled canonXML in the grammar below, where the notation and the semantics of the word "match" are those described in the XML 1.0 specification.

Canonical XML

[1]	`canonXML`	`::=`	`element #xA`
[2]	`element`	`::=`	`Stag (Datachar | element)* Etag`
[3]	`Stag`	`::=`	`'<' Name NSDecl? (Att NSDecl?)* '>'`
[4]	`Etag`	`::=`	`'</' Name '>'`
[5]	`NSDecl`	`::=`	`#x20 'xmlns:' Prefix '=' '"' Attvalchar* '"'`
[6]	`Att`	`::=`	`#x20 Name '=' '"' Attvalchar* '"'`
[7]	`Datachar`	`::=`	`'&amp;' | '&lt;' | '&gt;' | '&#13;'`
			`| (Char - ('&' | '<' | '>' | #xD))`
[8]	`Attvalchar`	`::=`	`'&amp;' | '&lt;' | '&quot;' | '&#9;' | '&#10;' | '&#13;'`
			`| (Char - ('&' | '<' | '"' | #x9 | #xA | #xD))`
[9]	`Name`	`::=`	`(Prefix ':')? NCName`
[10]	`Prefix`	`::=`	`'n' [1-9] [0-9]*`
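
As a small illustration of how a production can be checked mechanically, production [10] translates directly into a regular expression. The function name below is hypothetical.

```python
import re

# Production [10]: Prefix ::= 'n' [1-9] [0-9]*
PREFIX = re.compile(r"n[1-9][0-9]*")


def is_canonical_prefix(prefix: str) -> bool:
    """Return True if `prefix` matches production [10] in its entirety."""
    return PREFIX.fullmatch(prefix) is not None


print(is_canonical_prefix("n1"), is_canonical_prefix("n0"))
```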

The remainder of this section expresses additional constraints beyond those expressed in the grammar and provides further explanatory material on key aspects of Canonical XML.

5.1 Character Encoding

Canonical XML uses UTF-8 as the character encoding.

For example, consider the following small document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<lang>Español</lang>

Since it is encoded in ISO-8859-1 ("ISO Latin"), the character "ñ" is stored as #xF1. In Canonical XML, however, that character would be stored using UTF-8 in two bytes whose values are #xC3 and #xB1.
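
The byte values quoted above can be checked with a few lines of Python; this is only a demonstration of the two encodings, not part of any canonicalization API.

```python
text = "Espa\u00f1ol"  # "Español"; U+00F1 is the letter n with tilde

latin1_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# One byte per character in ISO-8859-1, two bytes for U+00F1 in UTF-8.
print(latin1_bytes.hex(), utf8_bytes.hex())
```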

The Unicode standard [Unicode] allows multiple different representations of certain "combining characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. The W3C has recommended a normalized form for combining-character representation [CharModel] and further recommended that conversion to this form take place at transmission time, a practice called "early normalization". In the presence of early normalization, two Canonical XML documents with Unicode-equivalent content should not exhibit differences due to combining-character representation choices.
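
The effect of early normalization can be illustrated as follows. The sketch assumes that the normalized form in question corresponds to Unicode Normalization Form C as implemented by Python's unicodedata module; the draft itself only refers to [CharModel].

```python
import unicodedata

decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA
precomposed = "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA

# The two spellings differ as code point sequences, and so as UTF-8 bytes.
same_before = decomposed == precomposed

# After normalization (NFC assumed here) they are identical.
same_after = (unicodedata.normalize("NFC", decomposed)
              == unicodedata.normalize("NFC", precomposed))

print(same_before, same_after)
```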

5.2 Character Escaping

The XML 1.0 specification requires XML processors to perform certain simple transformations on white-space characters in XML documents, when they serve as line separators and when they appear in attribute values. The Information Set describes the resulting set of characters that are considered to be part of the document. For example:

In a document where two lines are separated by CR-NL (#xD, #xA), the information set contains a single NL (#xA) character.
Where a document contains the string "&#13;", the information set contains a single CR (#xD) character.
Where an attribute value contains a TAB (#x9) character, the information set contains either a single space, or nothing at all in the case where the TAB was bordered by other white-space characters.
Where an attribute value contains the string "&#9;", the information set contains a TAB character (#x9).
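
These transformations are carried out by the XML processor before canonicalization begins. A minimal way to observe them, assuming Python's xml.etree.ElementTree and attributes with no declared type (which are treated as CDATA):

```python
import xml.etree.ElementTree as ET

# CR-NL used as a line separator is reported as a single NL.
crlf = ET.fromstring("<a>x\r\ny</a>").text

# A character reference to CR survives as a CR character.
cr_ref = ET.fromstring("<a>x&#13;y</a>").text

# A literal TAB in an attribute value is normalized to a space.
tab_attr = ET.fromstring('<a b="x\ty"/>').get("b")

# A character reference to TAB in an attribute value survives as a TAB.
tab_ref = ET.fromstring('<a b="x&#9;y"/>').get("b")

print(repr(crlf), repr(cr_ref), repr(tab_attr), repr(tab_ref))
```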

All characters from the information set appearing in character data or attribute values are represented by their UTF-8 encoding, with the following exceptions:

In character data and attribute values, the characters "<" and "&" are represented by "&lt;" and "&amp;" respectively.
In character data, the character ">" is represented by "&gt;".
In attribute values, the double-quote character (") is represented by "&quot;".
In character data, the carriage-return (#xD) character is represented by "&#13;".
In attribute values, the characters TAB (#x9), linefeed (#xA), and carriage-return (#xD) are represented by "&#9;", "&#10;", and "&#13;" respectively.

For the following element

<x a="ñ">algorithms + data structures => programs</x>

the canonical form is

<x a="ñ">algorithms + data structures =&gt; programs</x>

For the following element

<x>markup delimiters=>(&lt; and &amp;), or <![CDATA[< and &]]>.</x>

the canonical form is

<x>markup delimiters=&gt;(&lt; and &amp;), or &lt; and &amp;.</x>
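
The escaping rules of this section can be sketched as two small functions operating on characters already taken from the information set. The function names are hypothetical, and the decimal spelling of the character references is an assumption of this sketch.

```python
def escape_data(text: str) -> str:
    """Escape character data: & < > and carriage return."""
    return (text.replace("&", "&amp;")   # must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\r", "&#13;"))


def escape_attr(value: str) -> str:
    """Escape an attribute value: & < " and TAB, linefeed, carriage return."""
    return (value.replace("&", "&amp;")  # must come first
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#9;")
                 .replace("\n", "&#10;")
                 .replace("\r", "&#13;"))


print(escape_data("markup delimiters=>(< and &), or < and &."))
```

The ampersand is replaced first so that the ampersands introduced by the later replacements are not escaped a second time.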

5.3 Prolog

Canonical-XML documents have no prolog. That is to say, the first character is the "<" marking the beginning of the root element's start-tag.

For the following XML document

<!DOCTYPE x PUBLIC "myX" "x.dtd" [
 <!ENTITY a "aVal"> ]>
<?xml-stylesheet
  href="mystyle.css"
  type="text/css" ?> <?rating    mostly-harmless?> <x>y</x>

the canonical form is

<x>y</x>

5.4 Epilog

Canonical-XML documents have a one-character epilog consisting of a single newline (#xA) character, which immediately follows the ">" marking the end of the root element's end-tag.

For the following XML document

<x>y</x><?audio stop here ?>
<!--
Local variables:
mode: xml
End:
-->

the canonical form is

<x>y</x>

5.5 Elements

In Canonical XML, all elements have a start-tag and an end-tag. For elements which have no content, the end-tag follows the start-tag with no intervening characters.

For the following element

<x>
<a n="1"/><b n="2"/>
<c n="3"/></x>

the canonical form is

<x>
<a n="1"></a><b n="2"></b>
<c n="3"></c></x>
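
For comparison, some general-purpose serializers can be asked to write empty elements as a start-tag and end-tag pair. The sketch below uses the `short_empty_elements` option of Python's xml.etree.ElementTree; its output is not otherwise guaranteed to follow the rules of this specification.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<x><a n="1"/><b n="2"/></x>')

# short_empty_elements=False forces <a ...></a> instead of <a ... />.
serialized = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(serialized)
```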

5.6 Tags

In Canonical XML, for end-tags and start-tags which contain no attributes, the ">" character closing the tag follows the element type immediately with no intervening white space. Any attributes and namespace declarations follow with each attribute and namespace declaration preceded by one space (#x20) character. When the element type and the attribute names do not have namespaces, the attributes are sorted lexicographically by attribute name (based on Unicode character code points); the ordering when namespaces are present is described in "5.8 Namespaces".

The canonical form of an XML document includes all its attributes, whether provided explicitly or by default in the original document.

For the following element

<x a="Earth"
   ñ="Wind"
   z="Fire"
>!!</x
>

the canonical form is

<x a="Earth" z="Fire" ñ="Wind">!!</x>
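
A minimal sketch of the start-tag layout for the namespace-free case follows. The function name is hypothetical, and attribute values are assumed to have been escaped already. Python compares strings by code point, which matches the ordering required here.

```python
def start_tag(name: str, attrs: dict) -> str:
    """Build a start-tag with attributes sorted by code point.

    Values in `attrs` are assumed to be escaped already.
    """
    pieces = ["<" + name]
    for attr_name in sorted(attrs):  # str ordering is by Unicode code point
        pieces.append(' %s="%s"' % (attr_name, attrs[attr_name]))
    pieces.append(">")
    return "".join(pieces)


print(start_tag("x", {"a": "Earth", "\u00f1": "Wind", "z": "Fire"}))
```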

5.7 Attributes

In the canonical form of an XML document, attribute values are normalized in the fashion required of an XML processor.

In Canonical XML, attribute names and values are separated by a single "=" character and no spaces. All attribute values are delimited by double-quote (") characters. Within attribute values, all occurrences of double-quote are replaced by "&quot;".

For the following start-tag

<x a = '"Don&apos;t!", he cried.' b = "'>'">

the canonical form is

<x a="&quot;Don't!&quot;, he cried." b="'>'">

5.8 Namespaces

In Canonical XML, namespace prefixes always have the form n1, n2 and so on. The positive integer following the n is called the index of the prefix.

A start-tag always contains namespace declarations for exactly those prefixes that are used in the element type and the attribute names occurring in the start-tag. Namespace declarations are never inherited.

Note: This approach was chosen so that canonicalization is context-independent: the canonical form of an element is independent of where it occurs in the document.

The default namespace is never used. An attribute name never has the same prefix as the element type or another attribute name. The namespace declaration for a prefix immediately follows the element type or attribute that uses the prefix. Attributes are ordered primarily by the lexicographic order of the namespace URI with which the prefix of the attribute name is associated, and secondarily by the lexicographic order of the local part of the attribute name. A null namespace URI is considered to precede a non-null namespace URI: thus all attributes without prefixes precede all attributes with prefixes.

In the start-tag namespace prefixes occur in order of prefix index. The index of the first namespace prefix in the start-tag is always 1. The indices of the prefixes occurring in the start-tag are always consecutive integers. Thus if the element type has a prefix, its prefix will be n1; the prefix of the first attribute name in the start-tag that has a prefix will be n2 if the element type has a prefix, and n1 otherwise; for subsequent attributes, the index of the prefix of the attribute name will be one greater than the index of the prefix of the name of the preceding attribute.

For example, for the following element

<doc xmlns:x="http://w3.org/2" xmlns:y="http://w3.org/1">
<x:e a="a"/>
<x:e x:a="x:a"/>
<e x:a="x:a"/>
<e x:a="x:a" y:a="y:a"/>
<e x:a="x:a" a="a"/>
<e x:a="x:a" x:b="x:b"/>
</doc>

the canonical form is

<doc>
<n1:e xmlns:n1="http://w3.org/2" a="a"></n1:e>
<n1:e xmlns:n1="http://w3.org/2" n2:a="x:a" xmlns:n2="http://w3.org/2"></n1:e>
<e n1:a="x:a" xmlns:n1="http://w3.org/2"></e>
<e n1:a="y:a" xmlns:n1="http://w3.org/1" n2:a="x:a" xmlns:n2="http://w3.org/2"></e>
<e a="a" n1:a="x:a" xmlns:n1="http://w3.org/2"></e>
<e n1:a="x:a" xmlns:n1="http://w3.org/2" n2:b="x:b" xmlns:n2="http://w3.org/2"></e>
</doc>
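
The prefix assignment and attribute ordering described in this section can be sketched as follows. The function name and the data layout are hypothetical: names are given as (namespace URI or None, local part) pairs, and attribute values and namespace URIs are assumed to be escaped already.

```python
def canonical_start_tag(elem_name, attrs):
    """Build a start-tag using n1, n2, ... prefixes.

    `elem_name` is a (namespace_uri_or_None, local_part) pair;
    `attrs` maps such pairs to already-escaped values.
    """
    index = 0

    def qualify(name):
        # Return the qualified name and the namespace declaration that follows it.
        nonlocal index
        uri, local = name
        if uri is None:
            return local, ""
        index += 1
        prefix = "n%d" % index
        return "%s:%s" % (prefix, local), ' xmlns:%s="%s"' % (prefix, uri)

    qname, decl = qualify(elem_name)
    out = ["<", qname, decl]
    # Null namespace first, then by namespace URI, then by local part.
    ordered = sorted(attrs, key=lambda n: (n[0] is not None, n[0] or "", n[1]))
    for name in ordered:
        qname, decl = qualify(name)
        out.append(' %s="%s"%s' % (qname, attrs[name], decl))
    out.append(">")
    return "".join(out)


NS1, NS2 = "http://w3.org/1", "http://w3.org/2"
print(canonical_start_tag((None, "e"), {(NS2, "a"): "x:a", (NS1, "a"): "y:a"}))
```

Because every start-tag restarts the index at 1 and redeclares each prefix it uses, the function needs no knowledge of the enclosing elements, which reflects the context-independence noted above.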

Appendices

A. References

CharModel: Character Model for the World Wide Web, ed. Martin J. Dürst. Available at http://www.w3.org/TR/WD-charmod.
Infoset: XML Information Set, eds. John Cowan and David Megginson. Available at http://www.w3.org/TR/WD-xml-infoset.
Namespaces: Namespaces in XML, eds. Tim Bray, Dave Hollander, and Andrew Layman. Available at http://www.w3.org/TR/REC-xml-names.
Unicode: The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.
XML: Extensible Markup Language (XML) 1.0, eds. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen. 10 February 1998. Available at http://www.w3.org/TR/REC-xml.

B. Acknowledgements (Non-Normative)

The work of producing this specification was accomplished by the membership of the W3C XML Syntax Working Group:

Joel Nava, Adobe (Co-chair); Tim Bray, Invited Expert (Co-chair); James Clark, Invited Expert (Co-editor); James Tauber, Invited Expert (Co-editor); Bert Bos, W3C (W3C Liaison); Joseph Reagle, W3C (W3C Liaison); Gary Bisaga, Mitre; Tim Boland, NIST, Invited Expert; Charles Frankston, Microsoft; Paul Grosso, Arbortext; Eduardo Gutentag, Sun Microsystems; Michael Hyman, Microsoft; Murata Makoto, Fuji Xerox; Michael Sperberg-McQueen, U. Ill. and W3C; Steph Tryphonas, Microstar; François Yergeau, Alis