XML Node Inclusion Mechanism

Abstract

This note describes the syntax and semantics of a simple node inclusion mechanism for XML. Inclusion allows documents and parts of documents to be reused automatically in multiple documents. It should be considered a structured alternative to XML's text entity mechanism.

The note builds upon Web Characterization [Webchar], the XML Information Set [InfoSet] and the XLink and XPointer specifications.

Scope

There are many reasons for wanting to include text. We divide them up into two categories: reuse for text management versus quotation with a rhetorical or sociological purpose. For the purposes of this document, we will refer to the former as inclusion and the latter as transclusion.

For instance you might want to include boilerplate text for a copyright. In that case, the original context is not useful. The fact that the text was dynamically assembled is only an implementation detail. There is no rhetorical or sociological reason for the reuse. It is just an efficient way of structuring the text. This is inclusion.

In another case, you might want to quote a verse from Hamlet. In this case there are various rhetorical and sociological implications of the reuse. The fact that the verse comes from Shakespeare is relevant to readers of the document. In this case, you would need the rendition to indicate clearly that the text is included. It may also provide a means for recovering the verse's original context. This is transclusion.

This specification is too simple to completely cover rhetorically motivated quotation (transclusion). We assert that in general transclusion requires intelligent rendition decisions which can only be handled with sophisticated stylesheet support. Another complication with transclusion is that it must be possible to transclude diverse media types. Inclusion does not have that requirement.

Processing Model

Overview

An inclusion process takes an XML document information set as input and produces a result information set. This process may involve many individual inclusions (e.g., copyright boilerplate could be included from one source and an introductory paragraph from another source).

Inclusion Links

All XLinks in the input document with a link type of xml:include are inclusion links. Inclusion links explicitly or implicitly have two anchors: source and content.

Source Anchor

The source anchor may be identified as an anchor described in a locator with the role "source". It must address a single node in the same document as the link. If an inline link has no locator named "source", then the local resource serves as the source anchor.

Target Anchor

The content anchor, which is the target of the inclusion, may be identified by a locator with the role "content". If the link has no such locator but has exactly one remote resource, then that resource may be used as the content anchor.

Note: According to these rules, the simplest inclusion reference (without using defaulted attributes) uses this syntax:

<xml:include xml:link="simple" href="blah blah blah"/>
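As a rough illustration of the rules above, the following Python sketch locates simple inclusion links in a parsed document. It assumes the inline form shown in the note, relies on the standard library's minidom parser treating the reserved xml prefix as bound to the XML namespace, and uses a hypothetical helper name, inclusion_links, with made-up href values. Extended links with explicit "source" and "content" locators are not handled.

```python
from xml.dom import XML_NAMESPACE, minidom

def inclusion_links(doc):
    """Return (source element, content href) pairs for simple inclusion links."""
    links = []
    for el in doc.getElementsByTagNameNS(XML_NAMESPACE, "include"):
        # Assumption: only the inline "simple" link form is recognised here.
        if el.getAttributeNS(XML_NAMESPACE, "link") == "simple":
            # With no "source" locator, the link element itself is the source
            # anchor, and its single remote resource is the content anchor.
            links.append((el, el.getAttribute("href")))
    return links

# Illustrative input; the file names are invented for this example.
doc = minidom.parseString(
    '<doc><title>Report</title>'
    '<xml:include xml:link="simple" href="legal.xml#copyright"/>'
    '<xml:include xml:link="simple" href="intro.xml"/></doc>'
)
hrefs = [href for _, href in inclusion_links(doc)]
```

Each returned pair gives the node to be replaced and the reference that a processor would resolve to obtain the included nodeset.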

Software called an inclusion processor works from the information sets of the input document and of the documents containing the included nodes, and generates an information set called the result.

Process

The processing is recursive. It starts with the document node and progresses down to inclusion nodes and their inclusions.

The result of processing the input document node is a result document node.

The children of the result document node are the nodeset result of processing the input document's content (prolog, document element and epilog).

The result of processing a node that serves as the source of an inclusion is a copy of the nodeset addressed by the link's content anchor.

The result of processing a document type declaration, processing instruction or character is an identical DTD, PI or character, as long as the node was not the source of an inclusion.

The result of processing any non-source element node is a result element node with the same generic identifier and attributes. The content of the result element node is the result of processing the content of the input element.
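The rules above can be sketched as a recursive copy. This is a minimal sketch, not a definitive implementation: it assumes inline xml:include elements are the only source nodes, and it takes a caller-supplied resolve function (a hypothetical name) that maps an href to the list of nodes it addresses. The names is_source, process_content and include_once are also invented for illustration.

```python
from xml.dom import XML_NAMESPACE, Node, minidom

def is_source(node):
    # Assumption: only inline <xml:include> elements act as source anchors.
    return (node.nodeType == Node.ELEMENT_NODE
            and node.namespaceURI == XML_NAMESPACE
            and node.localName == "include")

def process_content(parent, out_parent, out_doc, resolve):
    """Append the result of processing parent's children to out_parent."""
    for child in parent.childNodes:
        if is_source(child):
            # Source node: replaced by a copy of the content nodeset.
            for node in resolve(child.getAttribute("href")):
                out_parent.appendChild(out_doc.importNode(node, True))
        elif child.nodeType == Node.ELEMENT_NODE:
            # Non-source element: same name and attributes, processed content.
            copy = out_doc.importNode(child, False)
            out_parent.appendChild(copy)
            process_content(child, copy, out_doc, resolve)
        elif child.nodeType != Node.DOCUMENT_TYPE_NODE:
            # Text, processing instructions and comments are copied unchanged.
            out_parent.appendChild(out_doc.importNode(child, True))
        # Doctype nodes are skipped: minidom cannot import them, so a fuller
        # processor would have to recreate the declaration itself.

def include_once(input_doc, resolve):
    """One iteration: build a result document from the input document."""
    out_doc = minidom.Document()
    process_content(input_doc, out_doc, out_doc, resolve)
    return out_doc

# Illustrative documents; the href value is invented for this example.
legal = minidom.parseString("<copyright>(c) 1999</copyright>")
main = minidom.parseString(
    '<doc><title>T</title><xml:include xml:link="simple" href="legal.xml"/></doc>'
)
result = include_once(main, lambda href: [legal.documentElement])
```

The included nodes are copied as they are, without being processed themselves, so any inclusion links they contain survive into the result tree.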

Iterations

The process of evaluating each node from the document down to the leaves (other than the children of source nodes) is called an iteration of the process. In many contexts it will make sense to process the result tree and the result of that process and so forth until there are no more source elements in a result tree. This is called a deep inclusion process.

Note: a deep inclusion can be implemented in a single pass but the specification describes it as multiple passes because in some cases this may be convenient.
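To illustrate the single-pass variant mentioned in the note, the sketch below processes each included node recursively before appending it, so no source elements remain in the result. The names emit, deep_include and resolve are hypothetical, and the cycle check is an addition of this sketch; the note itself does not say what should happen when documents include each other.

```python
from xml.dom import XML_NAMESPACE, Node, minidom

def emit(node, out_parent, out_doc, resolve, active):
    """Append the deep-inclusion result of one input node to out_parent."""
    is_element = node.nodeType == Node.ELEMENT_NODE
    if is_element and node.namespaceURI == XML_NAMESPACE and node.localName == "include":
        href = node.getAttribute("href")
        if href in active:
            # Assumption: a cycle is treated as an error rather than looping.
            raise ValueError(f"inclusion cycle through {href!r}")
        for included in resolve(href):
            # Included nodes are processed in turn, which makes the pass deep.
            emit(included, out_parent, out_doc, resolve, active | {href})
    elif is_element:
        copy = out_doc.importNode(node, False)
        out_parent.appendChild(copy)
        for child in node.childNodes:
            emit(child, copy, out_doc, resolve, active)
    elif node.nodeType != Node.DOCUMENT_TYPE_NODE:
        # Text, processing instructions and comments; doctypes are skipped.
        out_parent.appendChild(out_doc.importNode(node, True))

def deep_include(doc, resolve):
    out_doc = minidom.Document()
    for child in doc.childNodes:
        emit(child, out_doc, out_doc, resolve, frozenset())
    return out_doc

# Illustrative in-memory "files"; the names are invented for this example.
docs = {
    "main.xml": '<doc><xml:include xml:link="simple" href="intro.xml"/></doc>',
    "intro.xml": '<intro>Hi <xml:include xml:link="simple" href="legal.xml"/></intro>',
    "legal.xml": "<copyright>1999</copyright>",
}

def resolve(href):
    return [minidom.parseString(docs[href]).documentElement]

result = deep_include(minidom.parseString(docs["main.xml"]), resolve)
```

When it terminates, this produces the same tree as repeating single iterations until no source elements are left.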

Addressing

Although the Web is designed to allow anchors into manifestations of documents, it does not define a syntax that differentiates between links into the resource and links into the various client and server generated manifestations of the document, including transformation result trees. Typically, any reference is interpreted as being valid in all manifestations but this is not always the case and is specifically not the case with links into including documents.

Until a generalized syntax is defined, we define an extension to XPointer and the XSL query language that allows us to do inclusions.

include() is a function that takes a single document node as an argument and returns a nodeset representing the result of the inclusion process. By default it works upon the current node.

Note: Therefore a reference such as somedoc.xml#include()//TITLE refers to all titles in the result of an inclusion process applied to somedoc.xml.

deep-include() is a function that takes a single document node as an argument and returns a nodeset representing the result of the deep inclusion process. By default it works upon the current node.
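As a small illustration of how such a reference might be taken apart, the sketch below splits it into the document URI, the inclusion function and the remaining path. It assumes the argument-less form shown in the note's example; the InclusionRef type and parse_ref helper are hypothetical names, and no real XPointer processor is implied.

```python
import re
from typing import NamedTuple, Optional

class InclusionRef(NamedTuple):
    document: str            # resource to load, e.g. "somedoc.xml"
    function: Optional[str]  # "include", "deep-include", or None
    path: str                # remainder of the fragment, e.g. "//TITLE"

# Assumption: only the argument-less forms include() and deep-include().
_REF = re.compile(r"^(?P<doc>[^#]*)#(?P<fn>deep-include|include)\(\)(?P<path>.*)$")

def parse_ref(ref):
    match = _REF.match(ref)
    if match is None:
        # An ordinary reference: it addresses the resource, not a result tree.
        doc, _, fragment = ref.partition("#")
        return InclusionRef(doc, None, fragment)
    return InclusionRef(match["doc"], match["fn"], match["path"])

example = parse_ref("somedoc.xml#include()//TITLE")
```

A processor could then load the document, run the named inclusion process on it, and evaluate the remaining path against the result tree.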

Limitations

The result of inclusion may not be valid according to the input DTD. This mechanism does not provide specific support to ensure that it will be. This responsibility is placed upon the creator of the including document. This is no more onerous than the same responsibility in the XML text entity mechanism. There is probably a market for authoring and validation software that will follow inclusion references and ensure that the logical result document will be valid.

In the long term, it would be useful to have an XPointer/XSL QL extension that changed the document type declaration on a document node. Then result document types could be different from source document types.

The mechanism does not preserve authorship information. The underlying XML data model does not support this concept. In other words, the technology does not defend against plagiarism. In our opinion, this strictly mechanical layer is not the correct place to enforce a high-level concept like ownership. People who want to plagiarize can use many other techniques just as easily as they can use this one.

IDs in the included documents must be chosen so that they do not clash. Future versions of SGML and XML schemas will probably support ID scopes to avoid this problem.
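Until ID scopes exist, a result document can be checked for clashes after inclusion. The sketch below is one way to do that, assuming that IDs are carried in an attribute named "id"; in practice the DTD decides which attributes have type ID. The duplicate_ids helper is a hypothetical name.

```python
from collections import Counter
from xml.dom import minidom

def duplicate_ids(doc, id_attr="id"):
    """Return the ID values that occur on more than one element."""
    counts = Counter(
        el.getAttribute(id_attr)
        for el in doc.getElementsByTagName("*")
        if el.hasAttribute(id_attr)
    )
    return sorted(value for value, n in counts.items() if n > 1)

# Illustrative result tree in which two included fragments reuse id="s1".
result = minidom.parseString(
    '<doc><sec id="s1"/><sec id="s2"/><note id="s1"/></doc>'
)
clashes = duplicate_ids(result)
```

An empty list means that the included fragments chose their IDs without colliding.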

Future Work

It is only possible to include parts of other resources that have an information set that is compatible with XML's. The term "compatible with" is loosely defined at this point but could be made more explicit if there were information sets for multiple media types and if those information sets could build upon each other through subtyping. Right now, neither HTML nor generic SGML has an information set, so parts of those documents cannot be included.

[NB: Please do not cross-post responses to this proposal because unfortunately discussions do not span mailing lists very easily. I look forward to opinions on both of the included lists ([email protected], [email protected]).]

Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

Prepared by Robin Cover for The SGML/XML Web Page archive.

Date: Fri, 28 May 1999 19:21:25 -0500
From: Paul Prescod <[email protected]>
Reply-To: [email protected]
Subject: xlxp-dev: XML Inclusion Proposal