Common XML - Final Review Draft Specification, 27 July 2000
(Review ends 1 September 2000)

1. Introduction - What is Common XML?
	1.1 Common XML Development
2. Common XML Core
	2.1 Basic Requirements for Common XML Documents
	2.2 Elements
	2.3 Attributes
	2.4 Namespaces
	2.5 Textual Content
3. Extending Common XML
	3.1 Comments
	3.2 Processing Instructions
	3.3 CDATA Sections
	3.4 XML Declaration
	3.5 Document Type Declaration
		3.5.1 Internal DTD Subset
		3.5.2 External Subset and Identifiers
		3.5.3 Categories of Declarations and their Uses
	3.6 Other XML Extensions
4. References

1. Introduction - What is Common XML?

XML provides a solid foundation on which very different but thoroughly 
interoperable data processing systems can be built.  However, XML includes a 
number of features and options that make it difficult to ensure that varying 
applications will receive the same view of a document.  These difficulties may 
appear even in cases where all applications involved share a complete 
understanding of the vocabulary used by the document. The situation is 
complicated further when documents must pass through multiple levels of parsing 
and processing on their way to a target application.

Common XML begins with a frequently used and thoroughly reliable subset of the 
features provided by the XML 1.0 and Namespaces in XML W3C Recommendations.  
Common XML defines a very small core, but allows developers to move beyond that 
core if needed.  Additional features from XML 1.0 and Namespaces are still 
available. This specification includes descriptions of the impact of features 
beyond the core on interoperability.

Documents created using the core Common XML feature set should always present 
the same information set when processed by a non-validating parser that conforms 
to the XML 1.0 specification, and should present consistent information in both 
namespace-aware and non-namespace-aware environments.

While some developers may be discouraged to find their favorite features left 
out, they should be aware that Common XML doesn't prohibit the use of anything 
in XML 1.0 - it just provides warnings about possible issues involving 
interoperability.  By sticking to the core of Common XML, developers can ensure 
consistency, but it may be appropriate to move beyond the core of Common XML to 
meet particular needs.  

Common XML is effectively a set of guidelines for what can and cannot be counted 
on in XML processing.  It is not a set of rules for creating parsers or other 
software.  Common XML is intended for use within the existing framework of XML 
1.0 processors and applications.


1.1 Common XML Development

This is a final review draft, with comments requested until 1 September 2000.  
There may be minor revisions during that period, all of which will be preserved 
at http://www.egroups.com/files/sml-dev/Work/.  At the end of that period the 
SML-DEV mailing list will decide whether to declare this document complete, 
continue work on it, or end the project.

Common XML is a project of the SML-DEV mailing list.  Archives and additional 
information are available at http://www.egroups.com/group/sml-dev/info.html.  
This is a draft specification and subject to change.  Comments and suggestions 
are welcome, either on the SML-DEV mailing list or addressed to the current 
editor at simonstl@simonstl.com.


2. Common XML Core

The Common XML Core describes a set of features that can be used to transfer 
information reliably among applications using conformant non-validating XML 1.0 
parsers.  All XML parsers should report the same document content to the 
application, and tools writing XML documents back out generally preserve this 
information, making it easier and safer to roundtrip information through 
multiple layers of processing.

If XML features are not described in this section, they are not part of the 
Common XML Core.  They may, however, be used with Common XML as described in 
Section 3.


2.1 Basic Requirements for Common XML Documents

All Common XML documents are conformant XML 1.0 documents as defined in [XML].  
Common XML documents have a single root element.  Elements may have attributes 
and contain textual content and/or other elements.  Namespaces may be used as 
defined in [Namespaces], though with an extra restriction described in section 
2.4 below.
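
As a rough illustration (not part of the specification), the Python sketch below 
parses a small document that stays within the core: one root element, 
attributes, child elements, and textual content.  The vocabulary (order, item, 
sku) is invented for the example, and xml.etree.ElementTree stands in for any 
non-validating XML 1.0 parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, invented for this sketch.
DOC = """<order id="A17">
  <item sku="x-1">Widget</item>
  <item sku="x-2">Gadget</item>
</order>"""

root = ET.fromstring(DOC)

# One root element, attributes, child elements, and textual content.
summary = (
    root.tag,
    root.get("id"),
    [(item.get("sku"), item.text) for item in root.findall("item")],
)
print(summary)
```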


2.2 Elements

Elements are identified through markup exactly the same way as in XML 1.0. Start 
tags identify the beginnings of elements and may contain attribute values. End 
tags identify the ends of elements and may not contain attribute values.  
Empty-element tags may be used to represent elements without content and may 
contain attribute values.

The rules for naming elements are the same in Common XML as they are in XML 1.0, 
as are the rules for nesting elements and handling whitespace inside of 
elements.


2.3 Attributes

Attributes in Common XML are used exactly as they are in XML 1.0.  Attributes 
may appear inside of start or empty tags and provide additional information 
regarding the element defined by that start or empty tag.  The rules for naming 
attributes and describing acceptable attribute content remain the same in Common 
XML as they do in XML 1.0.  Either single or double quotes may be used to 
delimit attribute values, but picking one style and sticking with it makes 
processing with text-manipulation tools easier.

Attributes will be normalized per section 3.3.3 of [XML].  Documents that have 
no document type definition will have their attributes normalized as described 
for CDATA.  Further information on normalization when DTDs are present is 
available in section 3.5 of this specification. 
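
The effect of CDATA-style normalization can be seen in a short Python sketch.  
It assumes the expat-based parser behind xml.etree.ElementTree; the element and 
attribute names are invented.  Literal tabs and line breaks inside an attribute 
value reach the application as spaces, while a character reference keeps the 
character it names.

```python
import xml.etree.ElementTree as ET

# The Python escapes put a literal tab and newline inside the "text" value.
# There is no DTD, so the value is normalized as CDATA.
doc = '<note text="one\ttwo\nthree" ref="a&#10;b"/>'
note = ET.fromstring(doc)

normalized = note.get("text")   # literal whitespace characters become spaces
via_charref = note.get("ref")   # the character reference keeps its line feed
print(repr(normalized), repr(via_charref))
```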

To simplify some kinds of processing, it is strongly recommended that developers 
using Common XML avoid using the same name for an attribute and a child element 
of the same element.  This avoids issues caused by the distinction between 
attributes and child elements in contexts where such distinctions aren't 
important.
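
One way to check a document against this recommendation is sketched below in 
Python.  The function name name_collisions and the sample vocabulary are 
hypothetical, and the check ignores namespaces for simplicity.

```python
import xml.etree.ElementTree as ET

def name_collisions(root):
    """Return (element tag, shared name) pairs where an attribute and a
    child element of the same element use the same name."""
    found = []
    for elem in root.iter():
        child_names = {child.tag for child in elem}
        for attr_name in elem.attrib:
            if attr_name in child_names:
                found.append((elem.tag, attr_name))
    return found

doc = ET.fromstring(
    '<book title="Short"><title>Long form</title><author/></book>'
)
collisions = name_collisions(doc)
print(collisions)
```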

The distinctions between elements and attributes also raise some complex 
questions of which approach to use when.  While it is possible to create 
documents that store all of their textual data inside attribute values rather 
than as element content, this approach is often hard to read and can place 
severe limits on future document structure.  Developers should consider a 
balance between element and attribute structures when designing XML documents.


2.4 Namespaces

Namespaces are defined within Common XML documents using the mechanisms defined 
in [Namespaces] - attribute values defined using xmlns* attributes.  Because 
Namespaces is an addition to XML 1.0, and not all applications or parsers 
process namespaces the same way, the Common XML Core adds one restriction to 
ease interoperability among these systems.

The easiest way to avoid problems with namespace conflicts within documents is 
to make all namespace declarations on the root element of the document.  While 
this denies document creators the use of the scoping features built into 
Namespaces in XML, it simplifies processing tremendously.  Processors can see at 
the root level of the document all of the namespaces they will have to manage, 
prefix conflicts are made impossible (because XML 1.0 doesn't permit multiple 
attributes with the same name to appear in well-formed documents), and 
applications are relieved of the overhead involved in tracking namespaces on a 
per-element basis. 

Prefixes, including the default prefix, should only have one mapping to a URI 
within a document.  While the Namespaces in XML specification explicitly permits 
reuse of locally-scoped prefixes, this can make it difficult to work with 
documents using namespaces in environments that don't recognize namespaces per 
se or which have to re-serialize these documents after processing.  Prefixes are 
still only significant within the scope for which they are declared - this 
guideline does not propose to give prefixes global applicability across a 
document.
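
The two guidelines above (declare namespaces on the root element, and give each 
prefix a single mapping) can be checked mechanically.  The following Python 
sketch is one possible approach, assuming the start-ns events of 
xml.etree.ElementTree.iterparse; the function name namespace_problems and the 
example.com URIs are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

def namespace_problems(xml_text):
    """Report declarations below the root and prefixes bound to two URIs."""
    problems = []
    bindings = {}       # prefix ("" is the default namespace) -> first URI seen
    open_elements = 0
    stream = io.BytesIO(xml_text.encode("utf-8"))
    events = ("start-ns", "start", "end")
    for event, payload in ET.iterparse(stream, events=events):
        if event == "start-ns":
            prefix, uri = payload
            # start-ns is reported before the start of the declaring element,
            # so a root-level declaration arrives while nothing is open yet.
            if open_elements > 0:
                problems.append(f"prefix '{prefix}' declared below the root")
            if bindings.setdefault(prefix, uri) != uri:
                problems.append(f"prefix '{prefix}' mapped to more than one URI")
        elif event == "start":
            open_elements += 1
        else:
            open_elements -= 1
    return problems

good = ('<a:doc xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">'
        '<b:part/></a:doc>')
bad = ('<a:doc xmlns:a="http://example.com/a">'
       '<a:part xmlns:a="http://example.com/other"/></a:doc>')

good_report = namespace_problems(good)
bad_report = namespace_problems(bad)
print(good_report, bad_report)
```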

In general, it is a wise idea to use the same prefix for the same namespace 
across multiple documents.  This makes it much easier to process documents with 
XML processors that aren't namespace aware.

Similarly, the use of multiple prefixes (or prefixes and the default namespace) 
to refer to the same URI may create complexity as programs that don't recognize 
namespaces mistake these differing prefixes for different types.  
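
The difference between the two views can be sketched in Python.  Here 
xml.etree.ElementTree plays the namespace-aware processor and xml.sax, with 
namespace processing left off, plays the processor that does not recognize 
namespaces.  The namespace URI and prefixes are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET

class RawNames(xml.sax.ContentHandler):
    """Collect element names as written, without namespace processing."""
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

def raw_names(xml_text):
    handler = RawNames()
    # xml.sax leaves namespace processing off unless it is requested.
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names

URI = "http://example.com/inv"   # hypothetical namespace URI
doc_a = f'<inv:item xmlns:inv="{URI}"/>'
doc_b = f'<i:item xmlns:i="{URI}"/>'

# The namespace-aware view treats both documents as the same element type.
aware = (ET.fromstring(doc_a).tag, ET.fromstring(doc_b).tag)
# The view without namespace processing sees two different names.
unaware = (raw_names(doc_a), raw_names(doc_b))
print(aware, unaware)
```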

The use of namespace prefixes with attributes can also raise ambiguities.  In 
general, namespace prefixes should only be used on attributes when that 
attribute is in a different namespace than the element to which it is applied.

If external entities are used in XML document assembly (a possibility noted in 
3.5, below), those external entities should contain all the namespace 
declarations they need, rather than relying on the parent document to provide 
the declarations.  This removes potential ambiguity and should reduce the 
number of cases where namespace values change unexpectedly.

Finally, the use of URIs as namespace identifiers has raised complex issues 
involving the use of features like relative URIs to identify namespaces.  
Because relative URIs may have different values depending on the location of the 
document and new features under development at the W3C like [XBase], the use of 
absolute URIs is strongly recommended.
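
A simple check for this recommendation is sketched below in Python.  It treats 
any namespace name without a URI scheme as relative, which is an assumption of 
this sketch rather than a rule from [Namespaces]; the function name and URIs are 
invented.

```python
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

def relative_namespace_uris(xml_text):
    """Return declared namespace names that lack a URI scheme."""
    stream = io.BytesIO(xml_text.encode("utf-8"))
    declared = [
        uri
        for _event, (_prefix, uri) in ET.iterparse(stream, events=("start-ns",))
    ]
    return [uri for uri in declared if not urlsplit(uri).scheme]

doc = '<d xmlns:ok="http://example.com/ns" xmlns:rel="../schemas/ns"/>'
flagged = relative_namespace_uris(doc)
print(flagged)
```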

2.5 Textual Content

Textual content within Common XML documents is defined the same way as in XML 
1.0. Documents written to conform to the guidelines of the Common XML Core 
should only use UTF-8 and UTF-16 encodings (defined in [Unicode]), as these are 
the only encodings that all conformant XML parsers are required to support.  
Common XML Core documents may use the built-in entities (&lt;, &gt;, &amp;, 
&apos;, &quot;) as well as both decimal and hexadecimal character references.
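
As a small illustration, the Python sketch below (using 
xml.etree.ElementTree as an example parser) shows that a built-in entity, a 
decimal character reference, and a hexadecimal character reference for the same 
character all reach the application as identical text.

```python
import xml.etree.ElementTree as ET

# "<" written three ways, followed by the other built-in entities.
elem = ET.fromstring("<t>&lt; &#60; &#x3C; &amp; &apos; &quot; &gt;</t>")
text = elem.text
print(text)
```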

The XML 1.0 specification provides fairly complex rules for whitespace handling 
by XML processors, which may or may not be followed by XML applications, 
especially through transformations.  Document authors should avoid using 
multiple spaces to establish semantic meaning, relying instead on markup 
structures. Single spaces between words generally survive.  There is a wide 
variety of applications, notably XHTML, that discard or otherwise normalize most 
whitespace during processing.  In cases where whitespace must be preserved in 
element content, document authors should always use the xml:space attribute 
described in Section 2.10 of [XML].  Whitespace beyond single spaces cannot be 
reliably preserved in attribute values.
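
The sketch below shows, in Python, how an application might see the xml:space 
attribute.  The parser passes whitespace in element content through unchanged; 
the attribute is only a signal that later stages should keep it.  The element 
names and the helper wants_preserved_space are hypothetical, and the helper 
looks only at the element itself, although xml:space also applies to 
descendants.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

doc = ET.fromstring(
    '<doc><code xml:space="preserve">  a   b\n</code><p>text</p></doc>'
)

def wants_preserved_space(elem):
    # Checks only this element; a real application would also track ancestors.
    return elem.get(XML_SPACE) == "preserve"

code = doc.find("code")
flags = (wants_preserved_space(code), wants_preserved_space(doc.find("p")))
print(repr(code.text), flags)
```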


3. Extending Common XML

While the core of Common XML is designed for maximum interoperability and 
reliability, there are many features in XML which developers may need for their 
applications.  Common XML permits the use of these additional features.  The 
sections below describe various XML features and identify potential problems 
with regard to interoperability.  By moving beyond the Common XML Core, some 
reliability is lost, but that may be acceptable in a wide variety of situations.  
Document creators may use these features at their own risk.


3.1 Comments

Comments provide additional human-readable information within documents in a way 
that doesn't affect the core document structure.  [XML] Section 2.5 explicitly 
permits XML processors (parsers) to drop comments from the information they pass 
to applications, making it impossible to count on comments surviving a round 
trip from XML file through parser and application to XML file.
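
The loss is easy to reproduce.  In the Python sketch below, 
xml.etree.ElementTree with its default settings stands in for a processor that 
drops comments; the same note carried in an element (the name comment is 
invented for the example) survives the round trip.

```python
import xml.etree.ElementTree as ET

with_comment = "<doc><!-- reviewed by QA --><item>1</item></doc>"
with_element = "<doc><comment>reviewed by QA</comment><item>1</item></doc>"

# Parse each document and write it back out with the default parser settings.
roundtrip_comment = ET.tostring(ET.fromstring(with_comment), encoding="unicode")
roundtrip_element = ET.tostring(ET.fromstring(with_element), encoding="unicode")

print(roundtrip_comment)
print(roundtrip_element)
```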

Developers can achieve most of the functionality of comments with elements in 
situations where semantics ("this is a comment") are understood, and this brings 
some other potential benefits.  Comments declared as elements are more likely to 
survive a round-trip through various parsers, and can easily be presented (or 
not) in browsers using style sheets.  Comment elements lack the freedom to 
appear anywhere in document structures during validation that proper XML 
comments have, however.


3.2 Processing Instructions

Unlike comments, processing instructions must be reported to the application by 
XML parsers.  However, they occupy a similarly ambiguous position, as they are 
"not part of the document's character data".  The only processing instruction to 
be standardized by the W3C (in [Associating]) comes with a notice "The W3C does 
not anticipate recommending the use of processing instructions in any future 
specification."

These difficulties are compounded by the lack of a generally used mechanism for 
identifying how processing instructions should be used.  While NOTATION 
declarations may be used to provide additional information about processing 
instructions, there is no infrastructure provided for using that information.  
Many simple XML applications discard or ignore processing instruction 
information and may not treat it as important, reducing the likelihood that the 
information will be preserved in a roundtrip from document to application and 
back.

As with comments, developers can achieve most of the functionality of processing 
instructions with elements in situations where semantics ("this is a processing 
instruction") are understood, and this brings some other potential benefits.  
Processing instructions declared as elements are more likely to survive a round-
trip through various parsers, even those which don't understand the content.  
Processing instruction elements may also rely on XML parsers to break components 
of a PI into separate parts, while traditional processing instructions require 
additional interpretation at the application level.  Processing instruction 
elements lack the freedom to appear anywhere in document structures during 
validation that proper XML processing instructions have, however.
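
The difference in how much work the parser does can be sketched in Python.  The 
target and element name render and its pseudo-attributes are invented.  With 
xml.dom.minidom, the processing instruction arrives as a target plus one opaque 
string, while the element form arrives with its attributes already separated.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Processing instruction: the parser reports a target and one data string.
pi_doc = minidom.parseString('<?render mode="draft" copies="2"?><doc/>')
pi = pi_doc.firstChild
pi_view = (pi.target, pi.data)

# Element form: the parser has already split the attributes apart.
elem = ET.fromstring('<doc><render mode="draft" copies="2"/></doc>').find("render")
elem_view = dict(elem.attrib)

print(pi_view, elem_view)
```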

Processing instructions like that defined by the W3C in [Associating], which 
appear in the prolog of a document, may be ignored, but introduce fewer 
questions of scope.  Generally, these processing instructions are considered to 
supply information regarding the processing of the document as a whole.


3.3 CDATA Sections

CDATA sections provide a useful mechanism for keeping information from being 
interpreted as markup within XML 1.0.  Preserving CDATA sections in roundtrips 
is unusual, however.  Some applications have used CDATA to indicate semantic 
meaning as well.  Because many applications normalize information from CDATA 
sections to appear like regular character content, CDATA sections should not be 
used in cases where the existence of the CDATA section is considered important.
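
A short Python sketch of this normalization, using xml.etree.ElementTree as an 
example of an application that does not record CDATA boundaries: the CDATA form 
and the escaped form produce the same text, and the CDATA section is gone when 
the document is written back out.

```python
import xml.etree.ElementTree as ET

cdata = ET.fromstring("<expr><![CDATA[a < b && c]]></expr>")
escaped = ET.fromstring("<expr>a &lt; b &amp;&amp; c</expr>")

# Both forms give the application identical character content.
same_text = cdata.text == escaped.text

# Writing the CDATA version back out yields ordinary escaped text.
written = ET.tostring(cdata, encoding="unicode")
print(same_text, written)
```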


3.4 XML Declaration

The XML Declaration is optional in XML 1.0, and is only needed for Common XML 
when encodings other than UTF-8 or UTF-16 are being used.  (It could become 
important again if Common XML tracks a future version of XML and needs to 
identify version information.)

The declarations within the XML declaration either do very little within XML 1.0 
(version and standalone), or identify situations where interoperability is no 
longer guaranteed (encoding).  The XML Declaration should be used in all cases 
where the encoding is not UTF-8 or UTF-16.
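
The Python sketch below illustrates why the declaration matters for other 
encodings.  It assumes a parser (here the one behind xml.etree.ElementTree) that 
happens to support ISO-8859-1, which XML 1.0 does not require.  The same bytes 
parse with the encoding declaration and are rejected without it.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><n>' + text + "</n>"
).encode("iso-8859-1")
undeclared = ("<n>" + text + "</n>").encode("iso-8859-1")

# With the declaration, the parser knows how to decode the bytes.
parsed = ET.fromstring(declared).text

# Without it, the bytes are read as UTF-8 and are not well-formed.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(parsed, undeclared_ok)
```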


3.5 Document Type Declaration and Document Type Definition

XML 1.0's handling of Document Type Declarations and their contents has created 
a wide variety of interoperability issues.  In general, these issues arise from 
two features of XML 1.0:

	1) Non-validating XML processors are required to read the 
	internal subset of the document type definition, but not 
	the external subset or any external resources described by
	the internal subset.  There is no mechanism permitting 
	documents to require retrieval of external resources.

	2) The contents of the document type declaration may have
	an impact on the content of the document through attribute
	defaults, normalization, and entity declarations.

As a result of these two features, different XML processors may legitimately 
present applications with very different views of documents depending on whether 
they are validating or non-validating parsers.  Because non-validating parsers 
may optionally retrieve external resources, even two non-validating parsers may 
return very different portrayals of a document.  (These issues are described in 
more detail in [XPDL].)

While use of the document type declaration may be necessary in certain cases, 
particularly where validating XML 1.0 parsers must be used at some point in 
document processing, it can cause significant problems.  The guidelines below 
present some usages of these tools and suggest ways to minimize interoperability 
issues.

3.5.1 Internal DTD Subset

Because all XML processors, validating or non-validating, are required to read 
and process the internal DTD subset, at least up to the point where it makes 
reference to external resources, the internal DTD subset can be a useful way 
to include things like commonly used entities within a document, or to specify 
attribute defaults.  
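
As an illustration, the Python sketch below uses xml.etree.ElementTree, a 
non-validating parser, on a document whose internal subset declares an entity 
and an attribute default.  The vocabulary (memo, status, org) is invented.  Both 
declarations affect what the application receives.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE memo [
  <!ENTITY org "Example Org">
  <!ATTLIST memo status CDATA "draft">
]>
<memo>From &org;</memo>"""

memo = ET.fromstring(doc)

# The entity reference is expanded and the default attribute is supplied,
# even though the start tag in the document carries no attributes.
result = (memo.text, memo.get("status"))
print(result)
```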

This should be done carefully if an external DTD subset is also in use, as 
attribute list and entity declarations in the internal subset will override 
declarations of attributes or entities made in the external subset.  If the 
declarations change the rules in ways that applications relying on validation 
are not prepared for, those applications may fail when presented with documents 
using the external subset.  The internal subset is in many ways an 'escape 
clause', and should be used with caution.

3.5.2 External Subset and Identifiers

The external DTD subset makes possible some of XML's most powerful features, but 
its use creates interoperability, reliability, and even security problems.  
There are several problems facing the use of the external subset, even with 
processors that attempt to retrieve it.  There is no guarantee that the resource 
described by a system identifier will in fact be available at all times it is 
needed, or that it won't be changed.  Public identifiers may allow applications 
to keep their own versions of DTDs available, but public identifiers have to 
rely on an infrastructure that is built application-by-application, and is not 
guaranteed to be available across XML implementations.
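
The Python sketch below shows the other side of the same coin for a parser that 
does not retrieve the external subset, as is the case for 
xml.etree.ElementTree.  The file memo.dtd is hypothetical and is never fetched; 
suppose it declares a default value for a status attribute.  The application 
never sees that default.

```python
import xml.etree.ElementTree as ET

# "memo.dtd" is a hypothetical external subset that is never fetched here.
# Suppose it contains: <!ATTLIST memo status CDATA "draft">
doc = '<!DOCTYPE memo SYSTEM "memo.dtd"><memo>text</memo>'

memo = ET.fromstring(doc)

# The default lives only in the external subset, so it is not applied.
status = memo.get("status")
print(status)
```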

These trade-offs may still be acceptable to developers who need these features.

3.5.3 Categories of Declarations and their Uses

Some categories of declarations (element type declarations, attribute list 
declarations that don't specify default values, notation declarations) only 
provide additional information about a document or its purported structure 
without actually modifying its content.  Others (attribute list declarations 
that specify default values, entity declarations) change the document content 
in ways that can be lost if the document is processed by a non-validating 
parser that does not read them.

Developers who want to maintain interoperability with both validating and non-
validating parsers should consider creating separate declaration sets for 
structural declarations and content-modifying declarations, putting the latter 
in the internal subset when possible.


3.6 Other XML Extensions

The W3C and other organizations are at work on a number of other extensions to 
XML 1.0 that may affect its information set, notably [Schemas] and [XInclude].  
These extensions will likely emerge over time in various XML processing tools, 
but the transition period may be lengthy.  Developers who want to maintain 
maximum interoperability with the widest range of XML processing environments 
should avoid using these tools in ways which modify the document information 
set.  It may be possible to convert the information set-modifying content of 
both [Schemas] and [XInclude] to forms that can be included in the 
internal DTD subset for interchange with applications where these tools are not 
yet supported.


4. References

[Associating] - Associating Style Sheets with XML Documents.  James Clark, ed. 
Available at http://www.w3.org/TR/xml-stylesheet.

[Namespaces] - Namespaces in XML.  Tim Bray, Dave Hollander, and Andrew Layman, 
eds. Available at http://www.w3.org/TR/REC-xml-names.

[Schemas] - XML Schema, Parts 0-2.  Henry Thompson, Paul Biron, David Fallside, 
et al., eds. Available at http://www.w3.org/TR/xmlschema-0 (Primer), 
http://www.w3.org/TR/xmlschema-1 (Structures), and 
http://www.w3.org/TR/xmlschema-2 (Datatypes).

[Unicode] - The Unicode Consortium. The Unicode Standard, version 3.0. ISBN 
0-201-61633-5. Described at 
http://www.unicode.org/unicode/standard/versions/Unicode3.0.html.

[XInclude] - XML Inclusions (XInclude).  Jonathan Marsh and David Orchard, eds. 
Available at http://www.w3.org/TR/xinclude.

[XBase] - XML Base (XBase).  Jonathan Marsh, ed. Available at 
http://www.w3.org/TR/xmlbase.

[XPDL] - XML Processing Description Language (XPDL).  Simon St.Laurent, ed. 
Available at http://purl.oclc.org/NET/xpdl.

[XML] - Extensible Markup Language (XML) 1.0.  Tim Bray, Jean Paoli, and C. M. 
Sperberg-McQueen, eds. 10 February 1998. Available at 
http://www.w3.org/TR/REC-xml.