Common XML - Final Review Draft Specification, 27 July 2000 (Review ends 1 September 2000)

1. Introduction - What is Common XML?
1.1 Common XML Development
2. Common XML Core
2.1 Basic Requirements for Common XML Documents
2.2 Elements
2.3 Attributes
2.4 Namespaces
2.5 Textual Content
3. Extending Common XML
3.1 Comments
3.2 Processing Instructions
3.3 CDATA Sections
3.4 XML Declaration
3.5 Document Type Declaration
3.5.1 Internal Subset
3.5.2 External Subset
3.5.3 Categories of Declarations and their Uses
3.6 Other XML Extensions

1. Introduction - What is Common XML?

XML provides a solid foundation on which very different but thoroughly interoperable data processing systems can be built. However, XML includes a number of features and options that make it difficult to ensure that varying applications will receive the same view of a document. These difficulties may appear even in cases where all applications involved share a complete understanding of the vocabulary used by the document. The situation is complicated further when documents must pass through multiple levels of parsing and processing on their way to a target application.

Common XML begins with a frequently used and thoroughly reliable subset of the features provided by the XML 1.0 and Namespaces in XML W3C Recommendations. Common XML defines a very small core, but allows developers to move beyond that core if needed. Additional features from XML 1.0 and Namespaces are still available. This specification describes the impact on interoperability of features beyond the core. Documents created using the core Common XML feature set should always present the same information set when processed by a non-validating parser that conforms to the XML 1.0 specification, and should present consistent information in both namespace-aware and non-namespace-aware environments.

While some developers may be discouraged to find their favorite features left out, they should be aware that Common XML doesn't prohibit the use of anything in XML 1.0 - it just provides warnings about possible issues involving interoperability. By sticking to the core of Common XML, developers can ensure consistency, but it may be appropriate to move beyond the core of Common XML to meet particular needs.

Common XML is effectively a set of guidelines for what can and cannot be counted on in XML processing. It is not a set of rules for creating parsers or other software. Common XML is intended for use within the existing framework of XML 1.0 processors and applications.

1.1 Common XML Development

This is a final review draft, with comments requested until 1 September 2000. There may be minor revisions during that period, all of which will be preserved at http://www.egroups.com/files/sml-dev/Work/. At the end of that period the SML-DEV mailing list will decide whether to declare this document complete, continue work on it, or end the project.

Common XML is a project of the SML-DEV mailing list. Archives and additional information are available at http://www.egroups.com/group/sml-dev/info.html. This is a draft specification and subject to change. Comments and suggestions are welcome, either on the SML-DEV mailing list or addressed to the current editor at simonstl@simonstl.com.

2. Common XML Core

The Common XML Core describes a set of features that can be used to transfer information reliably among applications using conformant non-validating XML 1.0 parsers. All XML parsers should report the same document content to the application, and tools writing XML documents back out generally preserve this information, making it easier and safer to roundtrip information through multiple layers of processing.

If XML features are not described in this section, they are not part of the Common XML Core. They may, however, be used with Common XML as described in Section 3.

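As a rough illustration of the roundtrip claim above, the following Python sketch parses one small document with two standard-library parsers (ElementTree and minidom) and compares what each reports. The sample document and the helper names view_with_etree and view_with_minidom are assumptions made for this example, not part of Common XML.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample restricted to core features: a single root element,
# attributes, nested elements, and plain text.
DOC = '<order id="42"><item sku="A1">Widget</item><item sku="B2">Gadget</item></order>'


def view_with_etree(text):
    """Return (name, attributes, leading text) per element, in document order."""
    root = ET.fromstring(text)
    return [(el.tag, dict(el.attrib), el.text or "") for el in root.iter()]


def view_with_minidom(text):
    """Build the same triples from a DOM tree produced by minidom."""
    result = []

    def walk(node):
        attrs = dict(node.attributes.items())
        # Collect the text that appears before the first child element,
        # which is what ElementTree exposes as .text.
        leading = ""
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                break
            if child.nodeType == child.TEXT_NODE:
                leading += child.data
        result.append((node.tagName, attrs, leading))
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                walk(child)

    walk(minidom.parseString(text).documentElement)
    return result


def roundtrip(text):
    """Parse, write back out, and return the re-serialized document."""
    return ET.tostring(ET.fromstring(text), encoding="unicode")


if __name__ == "__main__":
    print(view_with_etree(DOC) == view_with_minidom(DOC))
    print(view_with_etree(roundtrip(DOC)) == view_with_etree(DOC))
```

Both comparisons are expected to hold for a document this simple; the features discussed in Section 3 are the ones that tend to break this kind of check.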
2.1 Basic Requirements for Common XML Documents

All Common XML documents are conformant XML 1.0 documents as defined in [XML]. Common XML documents have a single root element. Elements may have attributes and contain textual content and/or other elements. Namespaces may be used as defined in [Namespaces], though with an extra restriction described in section 2.4 below.

2.2 Elements

Elements are identified through markup exactly the same way as in XML 1.0. Start tags identify the beginnings of elements and may contain attribute values. End tags identify the ends of elements and may not contain attribute values. Empty-element tags may be used to represent elements without content and may contain attribute values. The rules for naming elements are the same in Common XML as they are in XML 1.0, as are the rules for nesting elements and handling whitespace inside of elements.

2.3 Attributes

Attributes in Common XML are used exactly as they are in XML 1.0. Attributes may appear inside of start tags or empty-element tags and provide additional information regarding the element defined by that tag. The rules for naming attributes and describing acceptable attribute content remain the same in Common XML as they are in XML 1.0. Either single or double quotes may be used to demarcate attribute values, but picking one style and sticking with it generally makes documents easier to process with text-manipulation tools.

Attributes will be normalized per section 3.3.3 of [XML]. Documents that have no document type definition will have their attributes normalized as described for CDATA. Further information on normalization when DTDs are present is available in section 3.5 of this specification.

To simplify some kinds of processing, it is strongly recommended that developers using Common XML avoid using the same name for an attribute and a child element of the same element.
This avoids issues caused by the distinction between attributes and child elements in contexts where such distinctions aren't important.

The distinctions between elements and attributes also raise some complex questions of which approach to use when. While it is possible to create documents that store all of their textual data inside attribute values rather than as element content, this approach is often hard to read and can place severe limits on future document structure. Developers should consider a balance between element and attribute structures when designing XML documents.

2.4 Namespaces

Namespaces are defined within Common XML documents using the mechanisms defined in [Namespaces] - attribute values defined using xmlns* attributes. Because Namespaces is an addition to XML 1.0, and not all applications or parsers process namespaces the same way, the Common XML Core adds one restriction to ease interoperability among these systems.

The easiest way to avoid problems with namespace conflicts within documents is to make all namespace declarations on the root element of the document. While this denies document creators the use of the scoping features built into Namespaces in XML, it simplifies processing tremendously. Processors can see at the root level of the document all of the namespaces they will have to manage, prefix conflicts are made impossible (because XML 1.0 doesn't permit multiple attributes with the same name to appear in well-formed documents), and applications are relieved of the overhead involved in tracking namespaces on a per-element basis.

Prefixes, including the default prefix, should only have one mapping to a URI within a document. While the Namespaces in XML specification explicitly permits reuse of locally-scoped prefixes, this can make it difficult to work with documents using namespaces in environments that don't recognize namespaces per se or which have to re-serialize these documents after processing.
Prefixes are still only significant within the scope for which they are declared - this guideline does not propose to give prefixes global applicability across a document.

In general, it is wise to use the same prefix for the same namespace across multiple documents. This makes it much easier to process documents with XML processors that aren't namespace-aware. Similarly, the use of multiple prefixes (or prefixes and the default namespace) to refer to the same URI may create complexity, as programs that don't recognize namespaces mistake these differing prefixes for different types.

The use of namespace prefixes with attributes can also raise ambiguities. In general, namespace prefixes should only be used on attributes when that attribute is in a different namespace than the element to which it is applied.

If external entities are used in XML document assembly (a possibility noted in 3.5, below), those external entities should contain all the namespace declarations they need, rather than relying on the parent document to provide the declarations. This removes potential ambiguity and should reduce the number of cases where namespace values change unexpectedly.

Finally, the use of URIs as namespace identifiers has raised complex issues involving the use of features like relative URIs to identify namespaces. Because relative URIs may have different values depending on the location of the document and on new features under development at the W3C, such as [XBase], the use of absolute URIs is strongly recommended.

2.5 Textual Content

Textual content within Common XML documents is defined the same way as in XML 1.0. Documents written to conform to the guidelines of the Common XML Core should use only the UTF-8 and UTF-16 encodings (defined in [Unicode]), as these are the only encodings that all conformant XML parsers are required to support.

Common XML Core documents may use the built-in entities (&lt;, &gt;, &amp;, &apos;, &quot;) as well as both decimal and hexadecimal character references.

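As a minimal sketch of the point above, the following Python snippet parses a document that uses only the five built-in entities plus one decimal and one hexadecimal character reference, none of which require a DTD. It uses the standard-library xml.etree.ElementTree module; the sample document and the helper names parsed_view and roundtrip are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every special character is written with a built-in
# entity or a character reference, so no declarations are needed.
SAMPLE = (
    '<note title="Q&amp;A &quot;draft&quot;">'
    "1 &lt; 2 &gt; 0; caf&#233; = caf&#xE9;; it&apos;s"
    "</note>"
)


def parsed_view(text):
    """Return (attribute value, element text) as the parser reports them."""
    root = ET.fromstring(text)
    return root.get("title"), root.text


def roundtrip(text):
    """Parse, serialize, and parse again to check that the values survive."""
    serialized = ET.tostring(ET.fromstring(text), encoding="unicode")
    return parsed_view(serialized)


if __name__ == "__main__":
    print(parsed_view(SAMPLE))
    print(roundtrip(SAMPLE) == parsed_view(SAMPLE))
```

The serializer may choose different escapes than the original document did (for example, writing the accented character literally), but the values reported to the application should be the same.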
The XML 1.0 specification provides fairly complex rules for whitespace handling by XML processors, which may or may not be followed by XML applications, especially through transformations. Document authors should avoid using multiple spaces to establish semantic meaning, relying instead on markup structures. Single spaces between words generally survive. A wide variety of applications, notably XHTML, discard or otherwise normalize most whitespace during processing. In cases where whitespace must be preserved in element content, document authors should always use the xml:space attribute described in Section 2.10 of [XML]. Whitespace beyond single spaces cannot be reliably preserved in attribute values.

3. Extending Common XML

While the core of Common XML is designed for maximum interoperability and reliability, there are many features in XML which developers may need for their applications. Common XML permits the use of these additional features. The sections below describe various XML features and identify potential problems with regard to interoperability. By moving beyond the Common XML Core, some reliability is lost, but that may be acceptable in a wide variety of situations. Document creators may use these features at their own risk.

3.1 Comments

Comments provide additional human-readable information within documents in a way that doesn't affect the core document structure. [XML] Section 2.5 explicitly permits XML processors (parsers) to drop comments from the information they pass to applications, making it impossible to count on comments surviving a round trip from XML file through parser and application to XML file.

Developers can achieve most of the functionality of comments with elements in situations where semantics ("this is a comment") are understood, and this brings some other potential benefits.
Comments declared as elements are more likely to survive a round trip through various parsers, and can easily be presented (or not) in browsers using style sheets. Unlike proper XML comments, however, comment elements are not free to appear anywhere in the document structure when the document is validated.

3.2 Processing Instructions

Unlike comments, processing instructions must be reported to the application by XML parsers. However, they occupy a similarly ambiguous position, as they are "not part of the document's character data". The only processing instruction to be standardized by the W3C (in [Associating]) comes with a notice: "The W3C does not anticipate recommending the use of processing instructions in any future specification."

These difficulties are compounded by the lack of a generally used mechanism for identifying how processing instructions should be used. While NOTATION declarations may be used to provide additional information about processing instructions, there is no infrastructure provided for using that information. Many simple XML applications discard or ignore processing instruction information and may not treat it as important, reducing the likelihood that the information will be preserved in a roundtrip from document to application and back.

As with comments, developers can achieve most of the functionality of processing instructions with elements in situations where semantics ("this is a processing instruction") are understood, and this brings some other potential benefits. Processing instructions declared as elements are more likely to survive a round trip through various parsers, even those which don't understand the content. Processing instruction elements may also rely on XML parsers to break components of a PI into separate parts, while traditional processing instructions require additional interpretation at the application level.
Unlike proper XML processing instructions, however, processing instruction elements are not free to appear anywhere in the document structure when the document is validated.

Processing instructions like that defined by the W3C in [Associating], which appear in the prolog of a document, may be ignored, but introduce fewer questions of scope. Generally, these processing instructions are considered to supply information regarding the processing of the document as a whole.

3.3 CDATA Sections

CDATA sections provide a useful mechanism for keeping information from being interpreted as markup within XML 1.0. Preserving CDATA sections in roundtrips is unusual, however. Some applications have used CDATA to indicate semantic meaning as well. Because many applications normalize information from CDATA sections to appear like regular character content, CDATA sections should not be used in cases where the existence of the CDATA section is considered important.

3.4 XML Declaration

The XML Declaration is optional in XML 1.0, and is only needed for Common XML when encodings other than UTF-8 or UTF-16 are being used. (It could become important again if Common XML tracks a future version of XML and needs to identify version information.) The declarations within the XML declaration either do very little within XML 1.0 (version and standalone), or identify situations where interoperability is no longer guaranteed (encoding). The XML Declaration should be used in all cases where the encoding is not UTF-8 or UTF-16.

3.5 Document Type Declaration and Document Type Definition

XML 1.0's handling of Document Type Declarations and their contents has created a wide variety of interoperability issues. In general, these issues arise from two features of XML 1.0:

1) Non-validating XML processors are required to read the internal subset of the document type definition, but not the external subset or any external resources described by the internal subset.
There is no mechanism permitting documents to require retrieval of external resources.

2) The contents of the document type declaration may have an impact on the content of the document through attribute defaults, normalization, and entity declarations.

As a result of these two features, different XML processors may legitimately present applications with very different views of documents depending on whether they are validating or non-validating parsers. Because non-validating parsers may optionally retrieve external resources, even two non-validating parsers may return very different portrayals of a document. (These issues are described in more detail in [XPDL].)

While use of the document type declaration may be necessary in certain cases, particularly where validating XML 1.0 parsers must be used at some point in document processing, it can cause significant problems. The guidelines below present some usages of these tools and suggest ways to minimize interoperability issues.

3.5.1 Internal DTD Subset

Because all XML processors, validating or non-validating, are required to read and process the internal DTD subset, at least up to the point where it makes reference to external resources, the internal DTD subset can be a useful way to include things like commonly used entities within a document, or to specify attribute defaults.

This should be done carefully if an external DTD subset is also in use, as attribute list and entity declarations in the internal subset will override declarations of attributes or entities made in the external subset. If the declarations change the rules in ways that applications relying on validation are not prepared for, those applications may fail when presented with documents using the external subset. The internal subset is in many ways an 'escape clause', and should be used with caution.

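The effect of internal subset declarations on document content can be sketched as follows, assuming Python's standard xml.etree.ElementTree module, which is built on the non-validating Expat parser. The sample documents, the entity name org, the attribute name status, and the helper view are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose internal subset declares one internal entity
# and one attribute default.
WITH_SUBSET = """<!DOCTYPE memo [
  <!ENTITY org "Example Corp">
  <!ATTLIST memo status CDATA "draft">
]>
<memo>From &org;</memo>"""

# The same content written out explicitly, with no document type declaration.
EXPLICIT = '<memo status="draft">From Example Corp</memo>'


def view(text):
    """Return (name, attributes, text) for the root element as parsed."""
    root = ET.fromstring(text)
    return root.tag, dict(root.attrib), root.text


if __name__ == "__main__":
    # The entity reference is expanded and the defaulted attribute is
    # supplied, even though neither appears literally in the memo element.
    print(view(WITH_SUBSET))
    print(view(WITH_SUBSET) == view(EXPLICIT))
```

Moving the same two declarations to an external subset would make the result depend on whether a given parser chooses to retrieve it, which is the interoperability risk described in this section.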
3.5.2 External Subset and Identifiers

The external DTD subset makes possible some of XML's most powerful features, but its use creates interoperability, reliability, and even security problems.

There are several problems facing the use of the external subset, even with processors that attempt to retrieve it. There is no guarantee that the resource described by a system identifier will in fact be available at all times it is needed, or that it won't be changed. Public identifiers may allow applications to keep their own versions of DTDs available, but public identifiers have to rely on an infrastructure that is built application-by-application, and is not guaranteed to be available across XML implementations.

These trade-offs may still be acceptable to developers who need these features.

3.5.3 Categories of Declarations and their Uses

Some categories of declarations (element type declarations, attribute list declarations that don't specify default values, notation declarations) only provide additional information about a document or its purported structure without actually modifying its content. Others (attribute list declarations, entity declarations) may change the document content in ways that will be lost should the document be processed by a non-validating parser.

Developers who want to maintain interoperability with both validating and non-validating parsers should consider creating separate declaration sets for structural declarations and content-modifying declarations, putting the latter in the internal subset when possible.

3.6 Other XML Extensions

The W3C and other organizations are at work on a number of other extensions to XML 1.0 that may affect its information set, notably [Schemas] and [XInclude]. These extensions will likely emerge over time in various XML processing tools, but the transition period may be lengthy.
Developers who want to maintain maximum interoperability with the widest range of XML processing environments should avoid using these tools in ways which modify the document information set. It may be possible to convert the information set-modifying content of both [Schemas] and [XInclude] to forms that can be included in the internal DTD subset for interchange with applications where these tools are not yet supported.

4. References

[Associating] - Associating Style Sheets with XML Documents. James Clark, ed. Available at http://www.w3.org/TR/xml-stylesheet.

[Namespaces] - Namespaces in XML. Tim Bray, Dave Hollander, and Andrew Layman, eds. Available at http://www.w3.org/TR/REC-xml-names.

[Schemas] - XML Schema, Parts 0-2. Henry Thompson, Paul Biron, David Fallside, et al., eds. Available at http://www.w3.org/TR/xmlschema-0 (Primer), http://www.w3.org/TR/xmlschema-1 (Structures), and http://www.w3.org/TR/xmlschema-2 (Datatypes).

[Unicode] - The Unicode Consortium. The Unicode Standard, version 3.0. ISBN 0-201-61633-5. Described at http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.

[XInclude] - XML Inclusions (XInclude). Jonathan Marsh and David Orchard, eds. Available at http://www.w3.org/TR/xinclude.

[XBase] - XML Base (XBase). Jonathan Marsh, ed. Available at http://www.w3.org/TR/xmlbase.

[XPDL] - XML Processing Description Language (XPDL). Simon St.Laurent, ed. Available at http://purl.oclc.org/NET/xpdl.

[XML] - Extensible Markup Language (XML) 1.0. Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, eds. 10 February 1998. Available at http://www.w3.org/TR/REC-xml.