[Cache from http://www.uddi.org/pubs/SchemaCentricCanonicalization-20020213.htm; please use this canonical URL/source if possible.]
Copyright © 2000-2002 by Accenture, Ariba, Inc., Commerce One, Inc., Compaq Computer Corporation, Fujitsu Limited, Hewlett-Packard Company, i2 Technologies, Inc., Intel Corporation, International Business Machines Corporation, Oracle Corporation, SAP AG, Sun Microsystems, Inc., VeriSign, Inc., and / or Microsoft Corporation. All Rights Reserved.
These documents are provided by the companies named above ("Licensors") under the following license. By using and/or copying this document, or the document from which this statement is linked, you (the licensee) agree that you have read, understood, and will comply with the following terms and conditions:
Permission to use, copy, and distribute the contents of this document, or the document from which this statement is linked, in any medium for any purpose and without fee or royalty under copyrights is hereby granted, provided that you include the following on ALL copies of the document, or portions thereof, that you use:
THIS DOCUMENT IS PROVIDED "AS IS," AND LICENSORS MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE; THAT THE CONTENTS OF THE DOCUMENT ARE SUITABLE FOR ANY PURPOSE; NOR THAT THE IMPLEMENTATION OF SUCH CONTENTS WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
LICENSORS WILL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THE DOCUMENT OR THE PERFORMANCE OR IMPLEMENTATION OF THE CONTENTS THEREOF.
Existing XML canonicalization algorithms such as Canonical XML and Exclusive XML Canonicalization suffer from several limitations and design artifacts (enumerated herein) which significantly limit their utility in many XML applications, particularly those which validate and process XML data according to the rules of and flexibilities afforded by XML Schema. The Schema Centric Canonicalization algorithm addresses these concerns.
This is an editors' working draft copy circulated for general review, comment, and feedback.
The design of the XML-Signature Syntax and Processing specification requires the execution of a canonicalization algorithm as part of the signature generation process. To date, two different (but closely related) canonicalization algorithms have been broadly proposed:
Exclusive XML Canonicalization suffers from the problem that it provides no means by which the default XML namespace can be listed in the InclusiveNamespaces PrefixList parameter to the algorithm, thus inappropriately relegating the default namespace to second-class status.
Additionally, both of these algorithms (collectively "the existing algorithms") share some characteristics which cause problems, some considerable, to applications considering their use:
With the advent of XML Schema, it is in fact now increasingly rare to find XML documents for which validation is accomplished using a DTD, or, indeed, due to the weak expressiveness of DTDs, to find XML documents for which a DTD which describes the content models of the elements of the document (instead of merely defining entities and the like) can in fact ever be constructed. Thus, the existing algorithms are becoming less and less useful to practical applications of XML.
The XML Schema Recommendation permits a considerable number of additional liberties of representation, including (but not limited to) the following:
within the content of an element of complex type which has a {content type} of element-only, the semantic insensitivity to the order within a sequence of elements that validates against a model group whose compositor is all. (In contrast, when such occurs within a {content type} of mixed, and so there may be non-whitespace interspersed between these elements, the elements may not reasonably be reordered, as their relationship to such characters may have semantic significance to applications.)
There are further data type canonicalization issues which appear to have been overlooked by XML Schema Datatypes:
That these limitations are indeed considerably problematic can be more readily appreciated by considering the implications to certain types of application. One increasingly common and important application of XML is that of so-called "web services". For our purposes here, web services can be thought of as networked applications where the payloads conveyed between network nodes are XML documents, often SOAP requests or responses, which in turn have XML subdocuments in their headers and body. It is observed to be the case that, almost universally, the specification of what constitutes correct and appropriate XML in such circumstances is accomplished using XML Schema.
On the server side of web service applications, it is very often the case that the semantic information conveyed by a request needs to be decomposed, analyzed, and persistently stored, often making use of an underlying relational database to do so. To the extent that such a database is used for storage and indexing purposes, this database gets populated from data received in the body of XML "update" requests. Such population is carried out by "shredding" the semantic information of the XML into a corresponding representation in relational form, losing thereafter the history of that information as having originated in an XML form. Conversely, XML "get" requests are serviced by performing relational operations against the database, then forming an appropriate XML response based on the retrieved data and the schema to which the response must conform.
Certain web service applications will wish to support the use of digital signatures on content which is manipulated by the application. In order to reasonably support such usage, and, in particular, in order to continue to reasonably allow for the shredding of data into an underlying relational store, the signatures in question need to be canonicalized with respect to the full range of liberties of representation afforded by XML Schema. In particular, the problems with the existing algorithms enumerated in the previous section cause especially difficult implementation conundrums in these situations.
The Schema Centric Canonicalization Algorithm is intended to address these concerns.
The Schema Centric Canonicalization algorithm is intended to be complementary in a hand-in-glove manner to the processing of XML documents as carried out by the assessment of schema validity by XML Schema, canonicalizing its input XML instance with respect to all those representational liberties which are permitted thereunder. Moreover, the specification of Schema Centric Canonicalization heavily exploits the details and specification of the XML Schema validity-assessment algorithm itself.
In XML Schema, the analysis of an XML instance document requires that the document be modeled at the abstract level of an information set as defined in the XML Information Set recommendation. Briefly, an XML document's information set consists of a number of information items connected in a graph. An information item is an abstract description of some part of an XML document: each information item has a set of associated named properties. By tradition, infoset property names are denoted in square brackets, [thus]. There are eleven different types of information items:
Properties on each of these items, for example the [children] property of element information items, connect together items of different types in an intuitive and straightforward way.
The representation of an XML document as an infoset lies in contrast to its representation as a node-set as defined in XPath. The two notions are conceptually quite similar, but they are not isomorphic. For a given node-set it is possible to construct a semantically equivalent infoset without loss of information; however, the converse is not generally possible. It is the infoset abstraction which is the foundation of XML Schema, and it is therefore the infoset abstraction we use here as the foundation on which to construct Schema Centric Canonicalization algorithm.
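As a (non-normative) illustration of the infoset abstraction described above, the following sketch models two of the eleven information item types; the class and property names here merely mirror the recommendation's [bracketed] property names and do not constitute definitions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal, illustrative model of element and attribute information
# items; property names mirror the [bracketed] infoset property names.

@dataclass
class AttributeItem:
    local_name: str                   # [local name]
    namespace_name: Optional[str]     # [namespace name]
    normalized_value: str             # [normalized value]

@dataclass
class ElementItem:
    local_name: str                   # [local name]
    namespace_name: Optional[str]     # [namespace name]
    attributes: List[AttributeItem] = field(default_factory=list)
    children: List["ElementItem"] = field(default_factory=list)  # [children]

# Items of different types connect through properties such as [children]:
root = ElementItem("businessEntity", "urn:uddi-org:api_v2")
root.children.append(ElementItem("name", "urn:uddi-org:api_v2"))
```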
The Schema Centric Canonicalization algorithm consists of a series of steps: creation of the input as an infoset, character model normalization, processing by XML-Schema assessment, additional infoset transformation, and serialization.
As was mentioned, the algorithm requires that the data it is to process be manifest as an infoset. If such is not provided directly as input, the data provided must be converted thereto. Two mechanisms for carrying out this conversion are defined:
In addition to the data itself, the canonicalization process requires the availability of appropriate XML Schemas and an indication of the relevant components thereof to which the data purports to conform. In order to be able to successfully execute the canonicalization algorithm, all the data must be valid with respect to these components; data which is not valid cannot be canonicalized.
The Unicode Standard allows diverse representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. However, a normalized form of such representations is also defined by the Unicode Standard.
Schema Centric Canonicalization requires that both its input infoset and all the schema components processed by the XML Schema-Assessment process be transformed as necessary so that all string-valued properties and all sequences of character information items therein be normalized into the Unicode Normalization Form C.
The third step of the Schema Centric Canonicalization requires that the input infoset be transformed into the so-called "post-schema-validation infoset" (the "PSVI") in the manner defined by the XML Schema Structures recommendation, amended as set forth below. In XML Schema, as the schema assessment process is carried out, the input infoset is augmented by the addition of new properties which record in the information items various pieces of knowledge which the assessment process has been able to infer. For example, attribute information items are augmented with a [schema normalized value] property which contains the result of, among other things, the application of the appropriate schema-specified default-value to the attribute information item (the full list of such augmentation is tabulated in the appendix to XML Schema Structures).
The PSVI output from XML Schema is next further transformed into what we define here as the "schema-canonicalized infoset" by rules of this specification that are designed to address a few remaining canonicalization issues:
Finally, the schema-canonicalized infoset is serialized into an XML text representation in a canonical manner, and this serialization forms the output of the algorithm.
The output of the Schema Centric Canonicalization algorithm whose input is the infoset of an entire XML document is well-formed XML. However, if some items in the infoset are logically omitted (that is, their [omitted] property is true), then the output may or may not be well-formed XML, depending on exactly which items are omitted (consider, for example, omitting some element information items but retaining their attributes). However, since the canonical form may be subject to further XML processing, most infosets provided for canonicalization will be designed to produce a canonical form that is a well-formed XML document or external general parsed entity. Note that the Schema Centric Canonicalization algorithm shares these issues of well-formedness of output with the existing canonicalization algorithms.
In those cases where the output of the Schema Centric Canonicalization algorithm is well-formed, the canonicalization process is idempotent: if x is the input infoset, and C represents the application of the Schema Centric Canonicalization algorithm, then C(x) is identical to C(C(x)). Moreover, in such cases C(x) is valid with respect to the same schema component(s) as is x (modulo the character sequence length issue noted in the next section).
The Schema Centric Canonicalization algorithm suffers from some of the limitations of Canonical XML. Specifically, as in Canonical XML, the [base URI] of the infoset has no representation in the canonicalized representation, the consequences of which are as in Canonical XML. However, unlike Canonical XML, Schema Centric Canonicalization does not suffer from the loss of notations and external unparsed entity references (these are canonicalized and preserved) nor from the loss of the typing of data (since, in XML Schema, the association of a schema with an XML instance is outside the scope of the specification and therefore is (trivially) preserved by the Schema Centric Canonicalization algorithm).
Also, Schema Centric Canonicalization suffers (but arguably in a minor way) from the fact that XML schema-assessment is not, strictly speaking, deterministic: when an element or attribute information item is validated against a wildcard whose {process contents} property is lax, the exact schema-assessment processing of the item which takes place depends on whether "the item, or any items among its [children] if it's an element information item, has a uniquely determined declaration available", where the term "available" here provides a degree of discretion to the validating application and thus a degree of non-determinism to the schema-assessment process. Because Schema Centric Canonicalization makes integral use of the information garnered during schema-assessment, if an item has been skipped due to a wildcard with a {process contents} of lax or skip, the output of the algorithm for that item must necessarily differ from what it would be had the item not been skipped. Thus, the non-determinism caused by lax results directly in non-determinism of the output of the algorithm.
[Note: feedback on what, if anything, should be done to deal with this situation is specifically solicited by the editors. Perhaps lax wildcard schema components could be decorated with a new schema annotation that independently of {process contents} controls the assessment process when that is carried out in the context of the Schema Centric Canonicalization algorithm. Perhaps a parameter to the algorithm could control this in a blanket fashion. Perhaps a combination of these or other approaches is warranted.]
In the canonicalized form, the lengths of certain sequences of character information items may differ from those input to the algorithm, due both to Unicode character model normalization and to namespace attribute normalization (the latter occurring only for expressions written in embedded languages such as XPath). This length adjustment can in certain circumstances affect the validity of the altered data, and can affect the ability to reference into the data with XPointer character-points and ranges.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The specification of the Schema Centric Canonicalization algorithm defines a few items residing in an XML namespace known as the Schema Centric Canonicalization algorithm namespace. The URI of this namespace is:
urn:uddi-org:schemaCentricC14N:2002-02-13
It should be clearly understood that all the details of the present document are solely a specification of the behavior of the Schema Centric Canonicalization algorithm, not of its implementation. Implementations are (of course) free to pursue any course of implementation they choose, so long as in all cases the output they yield for a given input corresponds exactly to that indicated herein. At times the details and language in this specification have been chosen to make the presentation clear and straightforward, perhaps at the performance expense of an implementation that were to robotically follow the literal wording thereof. In this regard, attention is specifically drawn to the connection of this specification with the details of the XML Schema validity-assessment specification, the PSVI augmentation process, and the augmentation of the PSVI found in §3.3: depending on the existing software artifacts and other resources upon which they can rely, good implementations are likely to significantly optimize their treatment of these matters especially.
The Schema Centric Canonicalization algorithm manipulates the semantic information of an XML instance as represented in the form of an XML Information Set. As such, if the input to the algorithm is not already in this form then it must be converted thereto in order for the algorithm to proceed. This document normatively specifies the manner in which this conversion is to be carried out for two such alternative input data-types (other specifications are free to define additional, analogous conversions). These two data-types are exactly those defined by the XML Signature Syntax and Processing recommendation as being the architected data-types for input to a Transform.
As is noted in the XML Information Set recommendation, it is not intrinsically the case that the [in-scope namespaces] property of an element information item in an infoset will be consistent with the [namespace attributes] properties of the item and its ancestors, though this is always true for an information set resulting from parsing an XML document. However, it is REQUIRED that this consistency relationship hold for the infoset input to the Schema Centric Canonicalization algorithm.
If the input to the canonicalization algorithm is an octet stream, then it is to be converted into an infoset by parsing the octet stream as an XML document in the manner described in the specification of [XML-Infoset].
Note that this is exactly the same conversion process that must be carried out by software attempting to assess the schema validity of XML data according to the XML Schema Structures recommendation.
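As a (non-normative) sketch, an implementation might realize this conversion by parsing the octet stream with any conforming XML processor; the Python standard library's DOM parser is used here purely as a stand-in for a parser exposing an infoset-like view:

```python
from xml.dom.minidom import parseString

# An octet stream carrying a (UTF-8 encoded) XML document.
octets = b'<?xml version="1.0" encoding="UTF-8"?><doc a="1"><child/></doc>'

# Parsing yields a tree whose nodes correspond to information items:
# the Document node to the document information item, Element nodes to
# element information items, Attr nodes to attribute information items.
document = parseString(octets)
root = document.documentElement          # the [document element]
value = root.getAttribute("a")           # an attribute's [normalized value]
```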
The conversion of a node-set to an infoset is straightforward, if somewhat more lengthy to describe.
A node-set is defined by the XPath recommendation as "an unordered collection of nodes without duplicates." In this context, the term "node" refers to the definition provided in the data model section of the recommendation. In that section, it is noted that XPath operates on an XML document as a tree, and that there are seven types of node that may appear in such trees:
The nodes in a given node-set must (by construction; that is, rules that would allow otherwise are lacking in XPath) all be nodes from the same underlying tree instance. If N is a node-set, then let T(N) be this tree, and let r(T(N)) be the root node of that tree. The conversion process to an infoset first converts T(N) into an equivalent infoset I(T(N)), then decorates that infoset to denote which information items therein correspond to nodes originally found in N.
Conversion of an XPath node-tree to an infoset is defined recursively in terms of the conversion of individual nodes to corresponding information items. Let n be an arbitrary XPath node, and let {n} be a node-set containing just the node n. Let i be the function taking a node as input and returning an ordered list of information items as output, defined as follows:
Having defined the function i, we now return to completing the specification of the details of the node-set to infoset conversion process.
Let N be a node-set, and consider the document information item returned by the function invocation i(r(T(N))). Define the infoset I(T(N)) to be that set of information items which are transitively reachable from i(r(T(N))) through any of the properties defined on any of the information items therein. This infoset represents the conversion of the node tree T(N) into a corresponding infoset.
Recall that the node-set N is in fact a subset of T(N). This relationship therefore needs to be represented in I(T(N)). To that end, we here define a new boolean infoset property called [omitted]. Unless otherwise indicated by some specification, the value of the [omitted] property of any information item is always to be taken to be 'false'. The present specification, however, defines that for all information items in I(T(N)) the value of [omitted] is 'true' except those items which, for some n in N, are members of the list returned from i(n).
This completes the specification of the node-set to infoset conversion process.
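The [omitted] decoration defined above can be sketched as follows (a non-normative illustration: the Item class stands in for information items of I(T(N)), and node_set_images stands in for the union of the lists returned by i(n) over all n in N):

```python
class Item:
    """Stand-in for an information item in I(T(N))."""
    def __init__(self, name):
        self.name = name
        self.omitted = False   # default value of [omitted] is 'false'

def decorate_omitted(infoset_items, node_set_images):
    """Set [omitted] to 'true' for every item of I(T(N)) except those
    which, for some n in N, are members of the list returned by i(n);
    node_set_images is the union of those lists."""
    kept = set(node_set_images)
    for item in infoset_items:
        item.omitted = item not in kept

# I(T(N)) here contains four items; i(n) over the node-set N yielded two.
doc, a, attr, b = Item("doc"), Item("elemA"), Item("attr1"), Item("elemB")
decorate_omitted([doc, a, attr, b], [a, attr])
# doc and elemB carry [omitted] = true; elemA and attr1 are retained
```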
The Unicode Standard allows diverse representations of certain "precomposed characters" (a simple example is "ç"). Thus two XML documents with content that is equivalent for the purposes of most applications may contain differing character sequences. However, a normalized form of such representations is also defined by the Unicode Standard.
It is REQUIRED in Schema Centric Canonicalization that both the input infoset provided thereto and all the schema components to be processed by the XML Schema-Assessment process used therein be transformed as necessary so that all string-valued properties and all sequences of character information items therein are normalized into the Unicode Normalization Form C as specified by the algorithm defined by the Unicode Standard.
As a (non-normative) note of implementation, in the case where the to-be-canonicalized XML instance and the XML schema specifications thereof are input to the canonicalization process as physical files, this normalization can usually be most straightforwardly accomplished simply by normalizing the characters of these files first before commencing with the remainder of the canonicalization process.
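As a further (non-normative) note, Normalization Form C is implemented by many platform libraries; in Python, for example, the standard library's unicodedata module suffices:

```python
import unicodedata

def nfc_normalize_text(text: str) -> str:
    """Normalize a character sequence into Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# "ç" decomposed: U+0063 LATIN SMALL LETTER C + U+0327 COMBINING CEDILLA
decomposed = "c\u0327"
composed = nfc_normalize_text(decomposed)   # U+00E7, a single code point
assert composed == "\u00e7"
assert len(decomposed) == 2 and len(composed) == 1
```

Note that, as discussed in the section on limitations, normalization can change the length of a character sequence, here from two code points to one.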
Once the input infoset is normalized with respect to its character model, the Schema Centric Canonicalization algorithm carries out schema assessment by appealing to the third approach listed in §5.2 Assessing Schema-Validity of the XML Schema recommendation and attempting to carry out strict assessment of the element information item which is the value of the [document element] property of the document information item of the infoset.
In XML Schema, as the schema assessment process is carried out, the infoset input to that process is augmented by the addition of new properties which record in the information items various pieces of knowledge which the assessment process has been able to discern. For example, attribute information items are augmented with a [schema normalized value] property which contains the result of, among other things, the application of the appropriate schema-specified default-value to the attribute information item.
The Schema Centric Canonicalization algorithm makes use of this augmentation. Specifically, suppose I is the character-normalized version of the infoset which is input to the algorithm, possibly after conversion from another data-type. Then the next step of the algorithm forms the so-called "post-schema-validation infoset" (the "PSVI", or more precisely, PSVI(I)) in exactly the manner prescribed as a consequence of the assessment process defined in the XML Schema Structures specification as amended in the manner set forth below. If PSVI(I) cannot be so formed, due to, for example, a failure of validation, then the Schema Centric Canonicalization algorithm terminates with a fatal error.
In XML Schema Structures, the augmentation process of schema assessment fails to record a small number of pieces of information which it has learned and which we find crucially necessary to have knowledge of here. Accordingly, the PSVI generation process referred to by this specification is exactly that of the XML Schema Structures recommendation as amended as follows:
3.8.5 Model Group Information Set Contributions
If the schema-validity of an element information item has been assessed as per Element Sequence Valid (§3.8.4) by a model group whose {compositor} is all, then in the post-schema-validation infoset it has the following property:
PSVI Contributions for element information items
- [validating model group all]
- An ·item isomorphic· to the model group component involved in such assessment.
The fourth step of the Schema Centric Canonicalization algorithm further augments and transforms the PSVI to produce the "schema-canonicalized infoset". This involves a pruning step, a namespace prefix desensitization step, a namespace attribute normalization step, and a data-type canonicalization step.
Some information items in the PSVI in fact do not actively participate in the schema assessment process of XML Schema. They are either ignored completely by that process, or used in an administrative capacity which is not central to the outcome. Thus, these items need to be pruned from the PSVI in order that they not affect the output of canonicalization. Similarly, declarations of notations and unparsed entities which are not actually referenced in the canonicalized representation should also be removed.
To this end, the [omitted] property is set to 'true' for any information item info in the PSVI for which at least one of the following is true:
The goal of namespace prefix desensitization is to first identify that data in the infoset which contains information representing an expression written in an embedded language which makes use of the XML Namespaces specification, and next to annotate the infoset in order to indicate exactly where and in what manner uses of particular XML namespace prefixes in fact occur. That is, desensitization is a two-step process: a data location step, followed by an annotation step.
Note that the notion of embedded language used here includes not only languages (such as XPath) which are represented in XML as the content of certain strings but also those (such as XML Query) which make use of structured element content. In all cases, in order to be namespace-desensitizeable it is ultimately REQUIRED that all problematical references to XML namespace prefixes do in fact lie in information identified as being of a simple type (usually strings). It is, however, permitted that these prefixes may be found in simple types which are attributes and / or the content of elements perhaps deep in the sub-structure of the element rooting the occurrence of the embedded language.
Moreover, in order to be namespace-desensitizeable, it is REQUIRED that the semantics of each embedded language not be sensitive to the specific namespace prefixes used, or the character-count length thereof: one MUST be permitted to (consistently) rewrite any or all of the prefixes used in an occurrence of a language with arbitrary other (appropriately declared) prefixes, possibly of different length, without affecting the semantic meaning in question.
Each particular embedded language for which namespace desensitization is to be done MUST be identified by a name assigned to it by an appropriate authority. It is REQUIRED that this name be of data-type anyURI. This specification assigns the URI http://www.w3.org/TR/1999/REC-xpath-19991116 as the name of the embedded language which consists of strings which conform to any of the grammatical productions of the XPath 1.0 specification.
The data location step of desensitization makes use of canonicalization-specific annotations to XML Schema components. It is the case in XML Schema that the XML representation of all schema components allows the presence of attributes qualified with namespace names other than the XML Schema namespace itself; this is manifest in the schema-for-schemas as the presence of an
<xs:anyAttribute namespace="##other" processContents="lax"/>
definition in the schema for each of the various schema components. As is specified in XML Schema Structures, such attributes are represented in the infoset representation of the schema inside the {attributes} property of an Annotation schema component which in turn is the value of the {annotation} property of the annotated schema component in question (i.e.: the Annotation is the {annotation} of the Attribute Declaration, the Element Declaration, or whatever). Within the Schema Centric Canonicalization algorithm namespace, we define a couple of attributes intended for use as such annotations to schema components:
In order to specify how these attributes are used, we define an auxiliary function in order to model the inheritance of annotations in schemas from types to elements and attributes and from base types to derived types. Let i be an information item, a be a string (representing the name of an attribute), and ns be either a URI (representing the name of an XML namespace) or the value absent. Define the function getAnnot(i, a, ns) as follows:
The data location step of desensitization is carried out as follows. Let sccns be the Schema Centric Canonicalization namespace. Consider in turn each attribute and element information item x in the pruned PSVI. If x is an element information item and getAnnot(x, "embeddedLangAttribute", sccns) is not absent, then x is identified as being associated with the embedded language which is the value of the attribute of x whose name is the value of getAnnot(x, "embeddedLangAttribute", sccns). Otherwise, if getAnnot(x, "embeddedLang", sccns) is not absent, then x is identified as being associated with the embedded language which is the value thereof. Otherwise, x is not associated with any embedded language.
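The decision cascade just described can be sketched as follows. This is a non-normative illustration: get_annot stands in for the getAnnot auxiliary function defined earlier, the Item class and the "langHint" attribute name are hypothetical, and only the selection logic mirrors the specification:

```python
SCCNS = "urn:uddi-org:schemaCentricC14N:2002-02-13"
XPATH_LANG = "http://www.w3.org/TR/1999/REC-xpath-19991116"

class Item:
    """Stand-in for an attribute or element information item."""
    def __init__(self, attributes):
        self.attributes = attributes     # name -> value, illustrative only

def embedded_language(x, get_annot):
    """Return the embedded-language URI associated with x, or None.

    get_annot(x, name, ns) stands in for the specification's getAnnot
    auxiliary function; None models the value 'absent'."""
    # First preference: an annotation naming which attribute of x
    # carries the language identifier.
    attr_name = get_annot(x, "embeddedLangAttribute", SCCNS)
    if attr_name is not None:
        return x.attributes.get(attr_name)
    # Second preference: an annotation carrying the language directly.
    lang = get_annot(x, "embeddedLang", SCCNS)
    if lang is not None:
        return lang
    return None     # x is not associated with any embedded language

# Illustrative use: the schema annotates x's type so that its
# (hypothetical) "langHint" attribute names the embedded language.
annots = {"embeddedLangAttribute": "langHint"}
x = Item({"langHint": XPATH_LANG})
lang = embedded_language(x, lambda item, name, ns: annots.get(name))
```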
Further, in order to enhance the usefulness of Schema Centric Canonicalization, any schema component representing the type of the element named "XPath" contained in elements of type "TransformType" defined in the namespace http://www.w3.org/2000/09/xmldsig# (which is the XML Signature Syntax and Processing namespace) is to be considered by definition as possessing an embeddedLang attribute with value http://www.w3.org/TR/1999/REC-xpath-19991116 in the {attributes} property of its {annotation} property.
Following the data location step, the processing of the attribute and element information items identified as being associated with embedded languages is carried out by the annotation step of namespace prefix desensitization in what is necessarily an embedded-language-specific manner. Implementations of the Schema Centric Canonicalization algorithm will need to understand the syntax and perhaps some semantics of each of the embedded languages whose uses they encounter as they carry out canonicalization. Should an embedded language which is not appropriately understood be encountered, the Schema Centric Canonicalization algorithm terminates with a fatal error. All implementations of the Schema Centric Canonicalization algorithm MUST in this sense fully understand the (XPath) embedded language identified as http://www.w3.org/TR/1999/REC-xpath-19991116.
In all cases the execution of the annotation step is manifest in the augmented PSVI in a uniform manner. Specifically, let x be an attribute or element information item which is identified by the language-specific processing as containing one or more uses of XML namespace prefixes in its [schema normalized value] property y. If any of these uses of XML namespace prefixes in y is in a form other than an occurrence of a QName, then a fatal error occurs. Otherwise, x is additionally augmented by the language-specific processing with a [prefix usage locations] property which contains, corresponding to the sequence of QNames in y, an ordered sequence of one or more triples (offset, prefix, namespace URI) where
and these triples occur in increasing order by offset.
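For the XPath embedded language, the annotation step amounts to locating each prefixed QName in the string value and recording its triple. A much-simplified, non-normative sketch follows; a real implementation must lex according to the full XPath grammar, whereas the regular expression here only approximates ASCII QName occurrences:

```python
import re

# NCName per XML Namespaces, approximated here for ASCII names only.
QNAME = re.compile(r"([A-Za-z_][\w.-]*):([A-Za-z_][\w.-]*)")

def prefix_usage_locations(value, in_scope):
    """Return the ordered (offset, prefix, namespace URI) triples for
    the prefixed QNames occurring in `value`, resolved against the
    in-scope namespace prefix -> URI mapping."""
    triples = []
    for m in QNAME.finditer(value):
        prefix = m.group(1)
        if prefix in in_scope:            # only declared prefixes count
            triples.append((m.start(), prefix, in_scope[prefix]))
    return triples                        # increasing order by offset

ns = {"u": "urn:uddi-org:api_v2"}
locs = prefix_usage_locations("/u:businessDetail/u:businessEntity", ns)
# → [(1, 'u', 'urn:uddi-org:api_v2'), (18, 'u', 'urn:uddi-org:api_v2')]
```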
This concludes the specification of the namespace prefix desensitization step.
The next step in the series of infoset transformations carried out by the Schema Centric Canonicalization algorithm is that of normalizing the actual XML namespace prefix declarations in use. The XML namespace recommendation allows namespaces to be multiply declared throughout an XML instance, possibly with several and different namespace prefixes used for the same namespace. In the canonical representation, we remove this flexibility, declaring each namespace just as needed, and using a deterministically constructed namespace prefix in such declaration. In this procedure, we borrow heavily from some of the similar work carried out in the Exclusive XML Canonicalization recommendation. We begin with some definitions.
The execution of the namespace attribute normalization step adds [normalized namespace attributes] properties to certain element information items in the infoset. Let e be any element information item whose [omitted] property is false. Then the [normalized namespace attributes] property of e is that unordered set of attribute information items defined recursively as follows.
Let Ne be the set of all namespace information items n in the [in-scope namespaces] property of e where n is visibly utilized by e. Let NAp be the set of attribute information items in the [normalized namespace attributes] property of any self-ancestor of p, where p is the output parent of e, if p is not no value; or the empty set if no such output parent exists. Let namespaces(Ne) be the set of strings consisting of the [namespace name] properties of all members of Ne, and let namespaces(NAp) be the set of strings consisting of the [normalized value] properties of all members of NAp.
For each namespace URI u in namespaces(Ne) - namespaces(NAp) (so, the name of each namespace with a prefix newly utilized at e), the [normalized namespace attributes] property of e contains an attribute information item whose properties are as follows:
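The set difference that selects which namespaces are newly declared at e can be sketched as follows; the function name and list-of-strings representation are assumptions made for illustration, and the result is sorted here only to make the example deterministic (the property itself is an unordered set).

```python
def newly_declared(visibly_utilized, ancestor_declared):
    """namespaces(Ne) - namespaces(NAp): namespace URIs whose declaration
    must be emitted at e because no output ancestor already declares them."""
    return sorted(set(visibly_utilized) - set(ancestor_declared))

print(newly_declared(
    ["urn:a", "urn:b", "urn:a"],   # namespaces(Ne), duplicates collapse
    ["urn:a"]))                    # namespaces(NAp)
# → ['urn:b']
```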
XML namespace prefixes used in the [normalized namespace attributes] property (which are manifest in the [local name] properties of the attribute information items therein) are chosen as follows. Let e be any element containing a [normalized namespace attributes] property. Let l be the ordered list resulting from sorting the [normalized namespace attributes] property of e according to the sort function described below. Let k be the maximum over all the ancestors a of e of the integers used per (b) above to form the [local name] property of any attribute item in the [normalized namespace attributes] property of a, or -1 if no such attribute items exist. Then the attributes of l, considered in order, use, in order, the integers k+1, k+2, k+3, and so on in the generation of their [local name] as per (b) above, excepting only that if wildcardOutputRoot(e) is true, then (in order to avoid collisions) any integer which would result in a [local name] property which was the same as the [prefix] property of some namespace item in the [in-scope namespaces] property of e is skipped.
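The counter-allocation and collision-skipping rule just described can be sketched as follows. The "n" + integer prefix shape is a hypothetical stand-in for rule (b), which is not reproduced here; only the allocation order and the skipping behavior are taken from the text above.

```python
def allocate_prefixes(count, k, in_scope_prefixes, is_wildcard_output_root):
    """Allocate `count` normalized prefixes starting from integer k+1,
    skipping collisions with in-scope prefixes at a wildcard output root."""
    prefixes, i = [], k + 1
    while len(prefixes) < count:
        candidate = "n%d" % i          # hypothetical rule-(b) prefix shape
        i += 1
        if is_wildcard_output_root and candidate in in_scope_prefixes:
            continue                   # skip the integer to avoid a collision
        prefixes.append(candidate)
    return prefixes

print(allocate_prefixes(3, -1, {"n1"}, True))   # → ['n0', 'n2', 'n3']
```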
Now that the declaration of necessary namespace attributes has been successfully normalized (and, canonically, the default namespace has been left undeclared), we apply these declarations in the appropriate places by defining appropriate [normalized prefix] and [prefix & schema normalized value] properties. Let info be any information item in the infoset whose [omitted] property is false. Then:
Moreover, if info contains a [prefix usage locations] property, then info also contains a [prefix & schema normalized value] property which is identical to the [schema normalized value] property of info except for differences formed according to the following procedure. Consider in turn each triple t found in the [prefix usage locations] property of info. Let normalizedPrefixUse(t) be those characters of the [prefix & schema normalized value] property of info which correspond to the characters of the [schema normalized value] property of info whose zero-based character-offsets lie in the semi-open interval [offset, offset+cch-1+z), where
Then the characters of normalizedPrefixUse(t) are determined as follows:
This completes the specification of the namespace attribute normalization step.
The XML Schema Datatypes specification defines for a certain set of its built-in data-types a canonical lexical representation of the values of each of those data types. To that identified set of canonical representations Schema Centric Canonicalization adds several new rules; in some cases, it refines those rules provided by XML Schema.
The most complicated part of data type canonicalization lies in dealing with character sequences which are, as a matter of application-level schema design, considered to be case insensitive. It is important that case-insensitivity of application data be integrated into the canonicalization algorithm: if it were not, then applications might be forced to remember the exact case used for certain data when they otherwise would not need to, a requirement which may well be in tension with the application semantics of case-insensitivity, and thus quite possibly a significant implementation burden.
The relevant technical reference for case-mapping considerations for Unicode characters is a technical report published by the Unicode Consortium. Case-mapping of Unicode characters is more subtle than readers might naively intuit from their personal experience. The mapping process can at times be both locale-specific (Turkish has special considerations, for example) and context-dependent (some characters case-map differently according to whether they lie at the end of a word or not). Mapping of case can change the length of a character sequence. Upper and lower cases are not precise duals: there exist pairs of strings which are equivalent in their upper case-mapping but not in their lower case, and vice versa.
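The non-duality and length-change points can be demonstrated concretely with Python's built-in Unicode case operations (which approximate, but do not fully implement, the locale-sensitive mappings the report describes):

```python
# "ß" upper-cases to "SS" (a length change), but "SS" lower-cases to "ss",
# not back to "ß" — so the two mapping directions are not duals.
s1, s2 = "Straße", "STRASSE"
print(s1.upper() == s2.upper())        # equal under upper-case mapping
print(s1.lower() == s2.lower())        # NOT equal under lower-case mapping
print(s1.casefold() == s2.casefold())  # equal under case-folding
```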
In order to accommodate these flexibilities, we define several attributes within the Schema Centric Canonicalization algorithm namespace in order to assist with the identification of data which is to be considered case-insensitive and the precise manner in which that is to be carried out. As was the case for the embeddedLang and embeddedLangAttribute attributes previously defined, these attributes are intended to be used as annotations of relevant schema components.
The caseMap attribute, which is of type language, is defined in the Schema Centric Canonicalization algorithm namespace. When used as an attribute annotation to a schema component, a caseMap attribute indicates that case-mapping is to be performed on data which validates against the schema component according to the case-mapping rules of the fixed locale identified by the value of the attribute.
The caseMapAttribute attribute, which is of type QName, is defined in the Schema Centric Canonicalization algorithm namespace. When used as an attribute annotation to a schema component, a caseMapAttribute attribute indicates that an information item which validates against the schema component in question (necessarily an element information item) is to be case-mapped during the canonicalization process according to the rules of the locale dynamically indicated in that information item as the value of a certain attribute thereof, namely the attribute whose qualified name is indicated in the value of the caseMapAttribute attribute (and which must be of type language or a restriction thereof).
The caseMapKind attribute, which is of type string but restricted to the enumerated values "upper", "lower", and "fold", is defined in the Schema Centric Canonicalization algorithm namespace. When used as an attribute annotation to a schema component, a caseMapKind attribute indicates whether upper-case or lower-case mapping or case-folding is to be carried out as part of the canonicalization process. If this attribute is contextually absent but at least one of caseMap or caseMapAttribute is contextually present, upper-case mapping is carried out.
Traditional ASCII-like case insensitivity can be most easily approximated by simply specifying "fold" for the caseMapKind attribute and omitting both caseMap and caseMapAttribute. Also, schema designers are cautioned to be careful in combining case-mapping annotations together with length-limiting facets of strings and URIs, due to the length-adjustment that may occur during canonicalization.
The data-type canonicalization step of Schema Centric Canonicalization is carried out according to the following rules:
Thus, a canonical lexical representation for all non-union simple types is defined.
The function caseMap takes three input parameters:
The upper-case or lower-case mapping process of the caseMap function is carried out in the context of the indicated locale according to the (respectively) UCD_upper() or UCD_lower() functions as specified by the Unicode Consortium. The case-folding process is carried out by mapping characters through the CaseFolding.txt file in the Unicode Character Database as specified by the Unicode Consortium.
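A hedged sketch of the three caseMapKind behaviors, using Python's built-in Unicode support: note that str.upper() and str.lower() are not locale-aware, so a conforming implementation must instead apply the locale-sensitive UCD_upper()/UCD_lower() mappings (the Turkish dotted/dotless i being the classic divergence), while str.casefold() does follow the Unicode CaseFolding.txt data.

```python
def case_map(value, kind):
    """Approximate the caseMapKind behaviors; locale handling is omitted."""
    if kind == "upper":
        return value.upper()      # approximation of UCD_upper()
    if kind == "lower":
        return value.lower()      # approximation of UCD_lower()
    if kind == "fold":
        return value.casefold()   # per Unicode CaseFolding.txt
    raise ValueError("unknown caseMapKind: %r" % kind)

print(case_map("Maße", "fold"))   # → masse
```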
To carry out the data-type canonicalization step in the Schema Centric Canonicalization algorithm, the [schema normalized value] property of all element and attribute information items in the output of the namespace attribute normalization step whose [member type definition] (if present) or [type definition] (otherwise) property is a simple type is replaced by the defined canonical lexical representation of the member of the relevant value space which is represented by the [schema normalized value].
The infoset which is output from the data-type canonicalization step is the schema-canonicalized infoset.
The final step in the Schema Centric Canonicalization algorithm is the serialization of the schema-canonicalized infoset into a sequence of octets.
In the description of the serialization algorithm that follows, at various times a statement is made to the effect that a certain sequence of characters is to be emitted or output. In all cases, it is to be understood that the actual octet sequences emitted are the corresponding UTF‑8 representations of the characters in question. The character referred to as "space" has a character code of (decimal) 32, the character referred to as "colon" has a character code of (decimal) 58, and the character referred to as "quote" has a character code of (decimal) 34.
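A minimal illustration of this convention: the named characters are plain ASCII code points, and what is actually emitted is their UTF-8 octet sequence (the `emit` helper name is an assumption for illustration).

```python
SPACE, COLON, QUOTE = chr(32), chr(58), chr(34)

def emit(chars):
    """Return the octet sequence actually output for a character sequence."""
    return chars.encode("utf-8")

print(emit(QUOTE + "a" + QUOTE))   # → b'"a"'
```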
Also, the algorithm description makes use of the notation "info[propertyName]". This is to be understood to represent the value of the property whose name is propertyName on the information item info.
The serialization of the schema-canonicalized infoset, and thus the output of the overall Schema Centric Canonicalization algorithm, is defined to be the octet sequence that results from the function invocation serialize(d), where d is the document information item in the schema-canonicalized infoset, and serialize is the function defined below.
The serialize function is defined recursively in terms of the serialization of individual types of information item. Let the functions recurse, sort, escape, wildcarded, and wildcardOutputRoot be defined as set forth later. Let info be an arbitrary information item. Let serialize be the function taking an information item as input and returning a sequence of octets as output which is defined as follows.
The function recurse is a function which takes as input an ordered list infos of information items and proceeds as follows.
First, character information items in infos whose [omitted] property is 'true' are pruned by removing them from the list. Next, the pruned list is divided into an ordered sequence of sub-lists l1 through lk according to the rule that a sub-list which contains character items may not contain other types of information items, but otherwise k is as small as possible. The result of recurse is then the in-order concatenation of processing in order each sub-list li in turn in the following manner:
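The pruning and partitioning just described can be sketched as follows. The tuple representation of information items ("char" vs. other kinds) is an assumption made purely for illustration; the point is that k is minimized by taking maximal runs, with character items never sharing a sub-list with other kinds of items.

```python
from itertools import groupby

def partition(infos):
    """Prune omitted character items, then split into maximal sub-lists
    l1..lk such that character items are segregated and k is minimal."""
    pruned = [i for i in infos
              if not (i[0] == "char" and i[1].get("omitted"))]
    return [list(run)
            for _, run in groupby(pruned, key=lambda i: i[0] == "char")]

items = [("elem", {}), ("char", {"omitted": True}),
         ("char", {}), ("char", {}), ("elem", {})]
print(partition(items))
# → [[('elem', {})], [('char', {}), ('char', {})], [('elem', {})]]
```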
The function escape is that function which takes as input a string s and returns a copy of s where each occurrence of any of the five characters & < > ' " in s is replaced by its corresponding predefined entity.
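The escape function as just defined can be sketched directly; processing character by character (rather than with successive string replacements) avoids accidentally re-escaping the ampersands introduced by earlier substitutions.

```python
# The five characters with predefined XML entities and their replacements.
PREDEFINED = {"&": "&amp;", "<": "&lt;", ">": "&gt;",
              "'": "&apos;", '"': "&quot;"}

def escape(s):
    """Replace each occurrence of & < > ' " with its predefined entity."""
    return "".join(PREDEFINED.get(c, c) for c in s)

print(escape('a < b & c'))   # → a &lt; b &amp; c
```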
The function sort takes as input an unordered set or an ordered list of information items and returns an ordered list of those information items arranged in increasing order according to the function compare, unless some of the information items do not have a relative ordering, in which case a fatal error occurs.
The function compare takes two information items a and b as input and returns an element of {less than or equal, greater than or equal, no relative ordering} as output according to the following:
The function wildcarded takes an element or attribute information item as input and returns a boolean indicating whether validation was not attempted on that item. In the Schema Centric Canonicalization algorithm, validation of an information item is not attempted only as a consequence of the item or a parent thereof being validated against a wildcard whose {process contents} property is either skip or lax.
Let i be the information item input to wildcarded. The function is then defined as follows:
The function wildcardOutputRoot takes an element item as input and returns a boolean indicating whether the item is an appropriate one on which to place the contextual namespace declarations necessary for dealing with wildcarded items contained therein. Let e be the information item input to wildcardOutputRoot. The function is then defined as follows:
The XML-Signature Syntax and Processing recommendation (XML DSIG) defines the notion of a canonicalization algorithm together with the use of URIs as identifiers for such algorithms. In XML DSIG, the use of canonicalization algorithms is architected in three places:
XML Encryption makes similar use of these algorithms.
This specification asserts that the URI of the Schema Centric Canonicalization algorithm namespace is the identifier (in the sense of XML DSIG) of a canonicalization algorithm. This identifier denotes the Schema Centric Canonicalization algorithm. The algorithm does not require or permit any explicit parameters.
As is discussed in Exclusive XML Canonicalization, many applications from time to time find it useful to be able to change the enveloping context of a subset of an XML document without changing the canonical form thereof.
In such situations, if Schema Centric Canonicalization is the algorithm of relevance, then applications SHOULD avoid references to notations or unparsed entities in the document subset in question, since the canonical representation of the notation and entity declarations referred to (which must, for security, be part of the canonical form) are defined in a document type declaration, the presence of which significantly complicates the task of re-enveloping.
This section discusses a few key decision points as well as a rationale for each decision.
Several of the eleven different types of information items either can never appear in an infoset which successfully validates according to XML Schema or can in no way affect the outcome thereof. Accordingly, representations of such information items never appear in the output of the Schema Centric Canonicalization algorithm. These types of information item are the following:
Believing their reasoning to be sound, we adopt the attitude of Canonical XML towards the processing of whitespace in character content, namely that no special processing is carried out:
"All whitespace within the root document element MUST be preserved (except for any #xD characters deleted by line delimiter normalization). This includes all whitespace in external entities."
Moreover, for analogous reasons, we adopt the attitude of Exclusive XML Canonicalization towards the lack of special processing of the xml:lang and the xml:space attributes.
The Unicode Technical Report on Case Mappings distinguishes case-mapping from a similar process termed case-folding. Unlike case-mapping, case-folding is a locale-independent operation, and does not encounter the issue that strings may be equal or differ depending on the direction in which they are case-mapped. As the report makes clear, case-folding is only an approximation to language-specific processing rules, and is primarily aimed at legacy systems in which locale information is simply not available to support more complete processing.
The Schema Centric Canonicalization algorithm supports the use of either case-mapping or case-folding in user schemas.
XML Schema Datatypes does not define a canonical lexical representation for data of type anyURI. In the present specification, thought was given to reconsidering this position. As described in the specification of Uniform Resource Identifiers, various aspects of the syntactic structure of URIs are considered case insensitive: the scheme part of the URI is an example (or probably is one: contrast §3.1 with §6 of RFC 2396 on this point), and various particular schemes have substructure that is so. However, the lack of clarity in that specification on the issue, the intrinsic inability of any one Schema Centric Canonicalization implementation to understand the universe of possible URI schemes it might encounter (and so case-canonicalize them all appropriately), and the absence of compelling pragmatic problems caused by simply having all anyURI data canonicalize to itself together do not amount to enough of a concern to warrant differing from XML Schema Datatypes on this issue.
None at present.