[This local archive copy mirrored from the canonical site: http://www.chilli.net.au/~ricko/XML-cut-n-paste.htm 980204; links may not have complete integrity, so use the canonical document at this URL if possible.]

Note-xml-cnp-19980131


A Cut and Paste Infrastructure for XML

Note 1-February-1998

Author:
Rick Jelliffe
<ricko@allette.com.au>


Status of this document

This document is a NOTE for discussion by the W3C XML-related groups. It is a personal critique of XML 1.0 PR, especially in the light of the namespace proposal, RDF and XML-data. It may be seen as a contribution to user requirements for an XML 1.1.


Abstract

XML Cut'n'Paste is a proposal for various conventions which address many sophisticated uses of Extensible Markup Language (XML) while remaining true to its underlying model, as an application of SGML.

The proposals are:


Version 0.4

1. Motivation and Summary

XML was originally conceived with an emphasis on delivering SGML documents over the WWW. However, documents and document fragments have a life after delivery: they will be edited and reused. This is the "cut'n'paste" problem: how can document validity be maintained when a document is made from arbitrary parts of other documents, each of which may have its own markup declarations? (Indeed, is the notion of document type reasonable in such circumstances? Yes, because the SGML notion of type is a synthetic one: a type is whatever the markup declarations and additional syntactic requirements construct, not some external fact.)

The cut'n'paste problem applies equally whether the fragments are physically placed in the document, or referred to by entity references, or by external links. XML 1.0 uses the SGML Convention of locating declarations in the prolog (i.e., forcing constants into headers): this convention complicates matters in this environment.

A further issue which the cut'n'paste problem raises is how to prevent name clashes.

Without a viable and simple cut'n'paste solution, XML document fragments will suffer from entropy, as they lose the interesting declarations in the prolog during their travels, and from bloat, as constants are taken from the prolog and marked up explicitly in the instance.

This proposal is therefore an infrastructure of enhancements and conventions which will simplify future cut'n'paste solutions. In particular, more attention has been paid to the paste issue: how to insert data or declarations into a stream (or, rather, how to wrap some data in a new element) without irreparably harming content-model validity. It is a conservative mechanism, in that it avoids reinventing any wheels.

The constraints on this proposal are

1.1 Contemporaneous Discussions

At the current time, many of the same issues as this proposal raises are being discussed in other contexts. The primary sources of discussion are:

There is a substantial overlap between these documents. Because of this overlap, this proposal also addresses some issues which are strictly outside the area of cut'n'paste, but which are elaborated to show that this proposal also meets some other groups' technical requirements. (In particular, the DATA attribute type.)

RDF can be seen as requiring cut'n'paste with weaker content models. XML-data can be seen as requiring cut'n'paste with stronger typing of data. In my view, adopting XML cut'n'paste, as proposed here, will substantially clean up XML-data and RDF: integrating them into future cut-and-paste XML editors, and providing a more powerful toolkit for them to work with. This proposal should not be seen as an alternative to RDF and XML-data, but merely an infrastructure for them which will reduce the areas they must cover, which lets them concentrate on satisfying their user requirements. In a similar display of constructive neutrality, this proposal also reconciles the namespace proposals with Architectural forms, allowing multiple schemas to be specified on the same element structure.

My concern about some of the discussions is that they propose solutions which increase the gaps between XML well-formedness and XML validity and between XML validity and SGML validity, and also compromise the underlying SGML model. In my view, the SGML model is the main reason for SGML's success and power: too much concentration on syntactical issues merely averts attention from this.

Current proposals weaken entities (the Namespace PIs point to resources that are not entities), PIs (the XML-data proposal uses elements for declarations and PIs), and elements (RDF has chosen a tagging scheme that does not fit in with XML content models). This proposal aims to show that, in large part, there are solutions possible which improve and unify XML rather than fragment it confusingly.

2. XML Element Type

This section defines an element type which can be used in any place where elements are allowed.

XML Element Type
<!ELEMENT xml ANY >
<!ATTLIST xml
 xml:id ID #IMPLIED
 xml:lang NMTOKEN #IMPLIED
 xml:alt CDATA #IMPLIED
 xml:class CDATA #IMPLIED >

This element type can be used:

See RDF Re-expressed using XML Cut'n'Paste for an example.

SGML Conformance Note: In SGML terminology, this element type is an implied inclusion in the content model of the document type element. XML does not allow inclusions: this element type is the exception that overcomes this. The XML DTD of a valid XML document which used this element type could be converted into a valid SGML DTD trivially by adding the following to the content model of the document element type.

+( xml )

The xml element is not really an inline SUBDOC, because it exists in the same name space as the exterior instance. Also, the exterior content model can be resumed inside the xml element, in any element which has a declared content type of ANY. (To do: adapt the SGML Extended Facilities "transparent element" system to cope with this.)

3. Instance Declarations

This section defines a convention to allow any XML markup declaration to be declared in the instance using a processing instruction (PI). This mechanism allows a pasted document to carry its declarations with it as part of an instance.

A pasted document should be placed inside an xml element to maintain content model validity. The effect of instance declarations is not scoped by this element; just as with any XML declaration, the first declaration wins. However, any user-defined declarations used for additional syntactic constraints can scope themselves by the containing xml element. (Note: Name conflicts are solved by using prefixes. )

Examples of Instance Declarations
<?xml:element a ( #PCDATA ) ?>
<?xml:attlist b c CDATA #IMPLIED ?>
<?xml:entity d SYSTEM "e.xml" ?>
<?xml:notation f SYSTEM "g.pl" ?>

These do not extend the kinds of declarations available in XML. They merely make it possible to declare them at the place they are found in the instance.

This locality of declaration may be useful for stream-based systems which annotate XML documents on the fly, for which writing and resolving the appropriate markup declarations back in the prolog may give unacceptable server performance.

SGML Conformance Note: Because an XML declaration is already an SGML declaration, these PIs can be converted into an SGML markup declaration trivially, and moved into the prolog. The SGML rules disallowing redefinition are respected.
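As a rough illustration of how trivial this conversion could be, the following Python sketch rewrites instance-declaration PIs as ordinary markup declarations that could be moved to the prolog. It is only a sketch under assumptions: the function names, the regular expression, and the decision to apply "first declaration wins" to everything except ATTLIST declarations are hypothetical and not part of this proposal.

```python
import re

# Hypothetical mapping from the PI targets used in this note to the
# keywords of ordinary markup declarations.
PI_TO_KEYWORD = {
    "xml:element": "ELEMENT",
    "xml:attlist": "ATTLIST",
    "xml:entity": "ENTITY",
    "xml:notation": "NOTATION",
}

PI_PATTERN = re.compile(r"<\?(xml:[a-z]+)\s+(.*?)\s*\?>", re.DOTALL)


def declared_name(keyword, body):
    """Return the (keyword, name) pair that identifies a declaration."""
    tokens = body.split()
    if keyword == "ENTITY" and tokens[0] == "%":
        return ("ENTITY %", tokens[1])
    return (keyword, tokens[0])


def hoist_instance_declarations(instance):
    """Collect instance-declaration PIs as prolog-style declarations.

    Assumption: the first declaration of a name wins and later ones are
    dropped, except for ATTLIST declarations, which are all kept.
    """
    seen = set()
    declarations = []
    for target, body in PI_PATTERN.findall(instance):
        keyword = PI_TO_KEYWORD.get(target)
        if keyword is None:
            continue  # some other PI, not an instance declaration
        key = declared_name(keyword, body)
        if keyword != "ATTLIST":
            if key in seen:
                continue
            seen.add(key)
        declarations.append(f"<!{keyword} {body} >")
    return declarations
```

Applied to the four example PIs above, this would yield the corresponding ELEMENT, ATTLIST, ENTITY and NOTATION declarations in document order.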

Furthermore, the system of allowing such PIs to act as declarations has been raised at ISO JTC1/WG4, and may find its way into a future revision of the SGML standards. Already HyTime uses PIs in a consonant way, for declarations.

4. DATA Attribute Type

XML is weaker than users will expect in the area of providing a fixed but extensible set of types for data content and attributes.

The XML PR should be corrected to bring out the existing feature that a notation attribute on an element defines the notation of that element (in particular, of its data content, but informed by any attributes). For example, the following declaration declares that the contents of an element are a date:

<!NOTATION is8601
 PUBLIC "ISO/IEC 8601//NOTATION date//EN">
<!ELEMENT date ( #PCDATA ) >
<!ATTLIST date
 notation NOTATION ( is8601 ) "is8601" >

Web SGML introduced a new type of attribute to set the notation used for attributes. XML should adopt this. The keyword DATA is used.

<!NOTATION is8601
 PUBLIC "ISO/IEC 8601//NOTATION date//EN">
<!ELEMENT person ( #PCDATA ) >
<!ATTLIST person
 born DATA is8601 #IMPLIED >
<person born="1960-05-18">Rick Jelliffe</person>

Adopting this convention substantially lessens the need for some external schema syntax, such as proposed by XML-data. This attribute can be added to XML with trivial impact on XML syntax and parsers, compared to XML-data, for example. Furthermore, it clarifies the use of notations, which people coming from XML do not understand. (Notations are analogous to MIME media-types, but may be on a much finer grain.)

The system identifier of a notation for an XML document should be the location of a validator for that notation rather than a rendering engine per se.
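To make the idea of a notation validator concrete, here is a minimal Python sketch of an application checking the born attribute from the example above. Everything in it is an assumption made for illustration: the registry of validators, the table recording which attributes were declared with DATA, and the choice of Python's own date parser as the "is8601" validator. In the proposal itself the validator would be located through the notation's system identifier.

```python
import datetime
import xml.etree.ElementTree as ET


def validate_is8601(value):
    """Accept values that Python can parse as an ISO 8601 calendar date."""
    try:
        datetime.date.fromisoformat(value)
    except ValueError:
        return False
    return True


# Hypothetical registry: notation name -> validator function.
VALIDATORS = {"is8601": validate_is8601}

# Hypothetical record of DATA attribute declarations:
# (element type, attribute name) -> notation name.
DATA_ATTRIBUTES = {("person", "born"): "is8601"}


def invalid_data_attributes(document):
    """Return (element, attribute, value) triples that fail their notation."""
    failures = []
    for element in ET.fromstring(document).iter():
        for name, value in element.attrib.items():
            notation = DATA_ATTRIBUTES.get((element.tag, name))
            if notation and not VALIDATORS[notation](value):
                failures.append((element.tag, name, value))
    return failures
```

The point of the sketch is the division of labour: the declaration names a notation, and the checking itself is delegated to whatever validator that notation identifies.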

SGML Conformance Note: Data attributes on attribute definitions using DATA keyword are not being proposed here. This is a conforming use, as an additional requirement.

4.1 Regular Expressions

The SGML Extended Facilities also include a standard way to specify the lexical type of an attribute or of an element's data content using POSIX regular expressions. This might be very handy for developers. It is mentioned here as a convention that could usefully be made for XML, but is not part of these proposals.
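A lexical type of this kind might be checked as in the short Python sketch below. The table of patterns and the function name are hypothetical, and Python's re module stands in for a POSIX regular expression engine; the pattern shown uses only constructs that the two share.

```python
import re

# Hypothetical lexical types keyed by (element type, attribute name).
LEXICAL_TYPES = {("person", "born"): r"[0-9]{4}-[0-9]{2}-[0-9]{2}"}


def matches_lexical_type(element, attribute, value):
    """Check a value against the lexical type declared for it, if any."""
    pattern = LEXICAL_TYPES.get((element, attribute))
    if pattern is None:
        return True  # no lexical type declared, so nothing to check
    return re.fullmatch(pattern, value) is not None
```

Note that a lexical type only constrains the shape of the value: "1960-99-99" would pass this check, which is why a notation validator may still be wanted.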

5. Name Space Prefixes

In this proposal, responsibility for preventing namespace clashes belongs to the creator of documents. There are two ways to do this. The first way is by owner prefixing. The second way is by entity name prefixing.

The term "namespace" is used to mean different things. These two mechanisms clarify which is meant.

5.1 Owner Prefixing

Owner prefixing is found in SGML Formal Public Identifiers, in XML PI targets, in the Namespace proposal http://www.w3.org/TR/1998/NOTE-xml-names, and in SGML architecture specifications.

This section proposes that the last two should be reconciled in some convenient way, since their intent is identical: the namespace proposal adds the prefixing convention; SGML architectures bring in many useful ideas for schemas. However, since the difference in the actual declarations is purely cosmetic, the issue can be resolved elsewhere without impacting this proposal.

The delimiter ":" is proposed for the owner prefix. As in the namespace proposal, there can only be one owner. The owner will typically be a schema definition. ("Owner" is used analogically, and does not indicate any property rights.)

An element may belong to several schemas. The primary schema is the one indicated by the GI (element type name). This, by default, has no particular name, since it belongs to the DTD. Therefore any convenient prefix can be used. ISO's SGML Extended Facilities' "Architectural Forms Definition Requirements" allows multiple schemas to be associated with an element structure. It should be adopted, with whatever cosmetic changes are required.

5.2 Entity-Name Prefixing

Owner prefixing is useful in some circumstances, while directory-style (hierarchical) name resolution is useful in others, so this proposal provides both kinds.

The subject of modules and name spaces has been discussed at ISO JTC1/WG4 for several years, in particular in response to a Japanese proposal. The current version of this proposal, as it has developed, is to use the parameter entity name as a prefix for resolving declarations.

The ISO 9070 delimiter "::" is proposed for this purpose. A name may have multiple prefixes, since a parameter entity could itself contain other parameter entities.

For example, if there is some schema (notation) which defines date formats, in this case with an explicit notation declaration in some external entity:

<!-- This declaration sits in entity "date.xml" -->
<!NOTATION is8601
 PUBLIC "ISO/IEC 8601//NOTATION date//EN">

I can use this as follows:

<!ENTITY % RJ SYSTEM "date.xml" >
%RJ;
<!ELEMENT person ( #PCDATA ) >
<!ATTLIST person
born DATA RJ::is8601 #IMPLIED >
<person born="1960-05-18">Rick Jelliffe</person>

In the previous example, the parameter entity was referenced, so the declarations are part of the direct document type. However, it is possible to have parameter entities that are defined but not referenced. Entity-name prefixing should still apply to these: they may contain declarations for parallel schemas (architectures).
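The resolution of a name with several "::" prefixes can be pictured as a walk down the entity structure. The Python sketch below is only an illustration: the nested dictionary is a hypothetical in-memory stand-in for parsed parameter entities (it mirrors the nesting used in Example 1, where entity RJ declares entity date, which declares the notations), and the function name is invented.

```python
# Hypothetical picture of the entity structure: each parameter entity
# lists the names it declares directly and the parameter entities
# declared inside it.
ENTITY_TREE = {
    "RJ": {
        "names": set(),
        "entities": {
            "date": {
                "names": {"is8601:simple", "is8601:complex"},
                "entities": {},
            },
        },
    },
}


def resolve(qualified_name, entities):
    """Follow the "::" prefixes of a name; return (entity path, local name)."""
    *prefixes, local_name = qualified_name.split("::")
    scope = {"names": set(), "entities": entities}
    for prefix in prefixes:
        if prefix not in scope["entities"]:
            raise LookupError(f"no parameter entity named {prefix!r}")
        scope = scope["entities"][prefix]
    if local_name not in scope["names"]:
        raise LookupError(f"{local_name!r} is not declared in {'::'.join(prefixes)!r}")
    return tuple(prefixes), local_name
```

Because the split is on "::" only, an owner prefix such as the "is8601:" in "is8601:simple" stays attached to the local name, so the two prefixing mechanisms do not interfere with each other.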

The entity name prefix system is required as a companion to parameter entity declarations using instance syntax. It is not possible to put parameter entity references in data, so this is an alternative way to invoke them.

Aside: The ability to display directories as if they were HTML pages, with predefined header text, already shows that it is possible to treat a directory as an entity containing other entities and data content. The current directory system is primitive, only allowing header text, or, for FTP directories, annotations to each file. Entity-name prefixing can be seen as reconciling directories with documents, to a certain extent. However, file-system directories are storage-manager-level structures, and entities are an entity-manager-level structure, so one starts where the other leaves off.

SGML Conformance Note: This is currently not valid SGML. However, the parameter entity containing the declarations only needs to be trivially transformed to use fully qualified names to become valid SGML. Entity-name prefixing has been favorably received by ISO JTC1/WG4 as a good approach to this problem which enhances the value of the entity mechanism.

The SGML proposal is to introduce a special keyword on parameter entities, "MODULE", to register the entity with the prefix resolution system. This proposal does not require this at this stage: all parameter entities are candidates.

Open Questions: It may well be that XML requires some more systematic name remapping convention. Whether that would follow the lines of ISO architectural name remapping or use a syntax like Cascading Stylesheets (though inside Processing Instructions!) is left open; the proposals here have tried to be modest, and not over-engineered. It is well to keep in mind the enormous success of HTML without strong typing, or the ability to construct elaborate schemas.

Entity-name prefixes are a kind of pointer mechanism to locate declarations through the entity structure. XLL extended pointers might usefully have some kind of location mechanism based on this. In particular, the ability to locate a PI in a particular entity might be useful.

6. First-class Processing Instructions

This section proposes that PIs should be further defined to maximize their usefulness, and to prevent elements being used in their place. Using elements instead of PIs complicates the cut'n'paste issue.

SGML Conformance Note: Some SGML systems will complain if there is no ID attribute (on an element) for a corresponding IDREF. However, there is some recognition at ISO JTC1/WG4 that SGML may need to further define SGML PIs to make them more useful in the future.

7. Documentation

If a document can have fragments cut out and pasted, not only will the declarations also have to be readily cut-out-able and paste-able, but also the documentation should be available at the same level of granularity as the declarations. These (links to) documentation should be bundle-able in with the declarations. The convention proposed here makes this convenient.

A special version of the XML inline declaration PIs should be defined. It uses the keyword SEEALSO and an XLL link, and is terminated by the keyword of the declaration (plus the % if it is a parameter entity) and the name declared. In this way, documentation can be added to any declaration, whether in the instance or the header.

The XLL link points to human-readable documentation for the declarations.

The x*l:name attribute can be used to associate a public identifier as well, which can also be used to key additional syntactic constraints on the element, without having to specify them within the document. This is another route for associating syntactic schema information with declarations.

<?xml:entity % RJ SYSTEM "date-types.xml" ?>
<?xml:seealso href="std-date-types.html" ENTITY % RJ ?>

Note that this creates a new type of addressing: every current declaration is addressable by its type (ENTITY, ENTITY %, ATTLIST, ELEMENT, NOTATION) and the name given. We can use this unique name rather than attempting to attach ID attributes or imposing a scoping or order restriction. This may look strange, but is no stranger than anything else. See Example 2.

The xml:group processing instruction allows a short form which is less ugly.

<?xml:group ?>
  <?xml:entity % RJ SYSTEM "date-types.xml" ?>
  <?xml:seealso href="std-date-types.html" ?>
<?xml:end-group ?>

SGML Conformance Note: The SEEALSO parameter was recently introduced in Web SGML, but is currently available only on the SGML declaration. A finer grain will be useful. This use of PIs is OK by SGML standard.
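The short form can be read as "every declaration in the group shares the group's documentation link". The following Python sketch builds an index from declaration address (type and name, as described above) to documentation href. It is a sketch under assumptions: the regular expressions, the function names, and the rule that a seealso PI applies to the declarations that precede it in the same group are all hypothetical readings of this section.

```python
import re

PI = re.compile(r"<\?(xml:[a-z-]+)\s*(.*?)\s*\?>", re.DOTALL)
HREF = re.compile(r'href="([^"]*)"')

# PI targets treated as declarations for the purposes of this sketch.
DECLARATION_TARGETS = {"xml:element", "xml:attlist", "xml:entity", "xml:notation"}


def declaration_key(target, body):
    """Address a declaration by its type and declared name."""
    keyword = target.split(":", 1)[1].upper()
    tokens = body.split()
    if keyword == "ENTITY" and tokens[0] == "%":
        return ("ENTITY %", tokens[1])
    return (keyword, tokens[0])


def documentation_index(text):
    """Map each declaration inside an xml:group to the group's seealso href."""
    index = {}
    group = None  # keys of the declarations seen in the currently open group
    for target, body in PI.findall(text):
        if target == "xml:group":
            group = []
        elif target == "xml:end-group":
            group = None
        elif target == "xml:seealso":
            match = HREF.search(body)
            if group is not None and match:
                for key in group:
                    index[key] = match.group(1)
        elif group is not None and target in DECLARATION_TARGETS:
            group.append(declaration_key(target, body))
    return index
```

For the grouped example above, the index would hold a single entry addressing the parameter entity RJ and pointing at "std-date-types.html".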

8. Example 1

Here is an example combining several of the proposals: the xml element, instance declarations, the data attribute type with validators, owner-prefixing, and nested entity-name prefixing. It can be seen how these features together allow a very clean bundling together of a document fragment. Cleanliness is next to pastiness.

<!-- These declarations sit in entity "date-types.xml" -->

<!-- Schema defined using namespace PI: "owner" is is8601 -->
<?xml:namespace as="is8601" name="Universal Date Format" ?>

<!-- Equivalent declaration using ISO architectures: "owner" is is8601 -->
<?is10797:AFDR name="is8601" system-id="Universal Date Format" ?>

<!-- The actual declarations. The system identifier is a validator. -->
<!NOTATION is8601:simple
 PUBLIC "ISO/IEC 8601//NOTATION simple date yyyymmdd//EN"
 "http://www.somewhere.qq/simple-date.java">

<!NOTATION is8601:complex
 PUBLIC "ISO/IEC 8601//NOTATION complex date yyymmdd:mmss//EN"
 "http://www.somewhere.qq/complex-date.java">
<!-- This declaration sits in entity "rixdex.xml", to give a meatier example -->
<!ENTITY % date SYSTEM "date-types.xml">

<!-- It does not really matter if the parameter entity is referenced or not, in this case -->
%date;
<?XML encoding="8859-1"?>
<!-- This is the example document -->
<!DOCTYPE thing SYSTEM "SOMEWHERE" >
<thing>
...
<xml>
<?xml:entity % RJ SYSTEM "rixdex.xml" ?>
<?xml:element myfrag:person ( #PCDATA ) ?>
<?xml:attlist myfrag:person
  born DATA RJ::date::is8601:simple #IMPLIED ?>
<myfrag:person born="1960-05-18">Rick Jelliffe</myfrag:person>
</xml>
...
</thing>

9. Example 2

This example is a slightly different version of the previous example. In this example, the first entity date-types.xml collects together all the kinds of notations that we can use for dates: they form an entity set, but this does not seem to be what is meant by a schema. The second entity rixschema.xml is the schema proper.

<!-- These declarations sit in entity "date-types.xml" -->

<!-- Schema defined using namespace PI: "owner" is is8601 -->
<?xml:namespace as="is8601" name="Universal Date Format" ?>

<!-- The actual declarations. The system identifier is a validator. -->
<!NOTATION is8601:simple
  PUBLIC "ISO/IEC 8601//NOTATION simple date yyyymmdd//EN"
  "http://www.somewhere.qq/simple-date.java">

<!NOTATION is8601:complex
  PUBLIC "ISO/IEC 8601//NOTATION complex date yyymmdd:mmss//EN"
  "http://www.somewhere.qq/complex-date.java">
<!-- These declarations sit in entity rixschema.xml -->

<?xml:namespace as="myschema" ?>

<?xml:group?>
  <!ENTITY % date SYSTEM "date-types.xml" >
  <?xml:seealso href="std-date-types.html" ?>
<?xml:end-group ?>

<?xml:group ?>
  <!ELEMENT myschema:person ( #PCDATA ) >
  <?xml:seealso href="myschema-person.html" ?>

  <!-- We can do XML-data supertypes here if we want to, too -->
  <?xml:namespace as="xml-data" href="http://www.w3.org/TR/1998/NOTE-XML-data"?>
  <?xml-data:supertype type="some-other-element" ?>

  <!-- We can specify all sorts of other things without disturbing validity too -->
  <?xml:namespace as="naughty-watch" href="http://www.concerned-parents.org/"?>
  <?naughty-watch:sanitise cusswords="strip" ?>
<?xml:end-group ?>

<?xml:group?>
  <!ATTLIST myschema:person
    born DATA date::is8601:simple #IMPLIED >
  <?xml:seealso href="myschema-person-atts.html" ?>
<?xml:end-group ?>

...
<?XML encoding="8859-1"?>
<!-- This is the example document -->
<!DOCTYPE thing SYSTEM "SOMEWHERE" >
<thing>
...
<xml>
<?xml:entity % RJ SYSTEM "rixschema.xml" ?>
<RJ::myschema:person born="1960-05-18">Rick Jelliffe</RJ::myschema:person>
</xml>
...
</thing>

Appendixes

A. Does RDF Need Anything Extra?

The current drafts of RDF use a content model that cannot be expressed in standard XML declarations. There are two approaches to this:

The first is the one I have taken before, which is to say that the RDF information could be marked up in a completely standard way, and does not require abandoning content models. If it can, maybe it should. If RDF WG don't want to, that is fine, but it is a style call, not because XML is incapable of representing the information.

The second way, which may be more constructive, is to ask whether in fact RDF elements are actually processing instructions. If we regard RDF markup as being information for an RDF application and ipso facto independent of the formal structure of a document, we are left with a much cleaner interaction of RDF with the logical structure of the document. RDF also drops out of consideration as a test case for XML-data and XML cut'n'paste: it does not require any changes whatsoever to XML 1.0.

A processing instruction (PI) has the following characteristics:

These seem pretty close to what RDF requires: a clean superimposed layer.

A further side benefit is that it allows RDF serializations to be asynchronous with the element structure: to overlap each other and to end in a branch different to (but ultimately contiguous with) the one in which it started. This may be useful, it may not.

In practical terms, RDF's last example becomes this:

<?namespace href="http://www.nist.gov/RDFschema" as="NIST"?>
<?namespace href="http://www.w3.org/schemas/rdf-schema" as="RDF"?>
<?RDF:serialization?>
  <?RDF:assertions href="John_Smith"?>
       <NIST:weight>
           <?RDF:resource id="weight_001"?>
               <NIST:units href="#pounds"/>
               <?RDF:PropValue?>200<?RDF:end-PropValue?>
           <?RDF:end-resource?>
       </NIST:weight>
   <?RDF:end-assertions?>
<?RDF:end-serialization?>

Where the putative content models

   <!ELEMENT NIST:weight ( #PCDATA | NIST:units )* >
   <!ELEMENT NIST:units EMPTY >

are still satisfied, and everyone is clean and happy. I think it might be more salable to vendors and users too, in that nothing extra is required for it other than vanilla XML 1.0.
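The claim that the element structure is untouched by the RDF PIs can be checked with any ordinary parser. The Python sketch below uses the standard xml.sax module (in its default, non-namespace mode, so that prefixed names are treated as plain names) to record element content and PI targets separately. The handler class and the helper function are hypothetical names, and the sketch checks only what the parser reports; it does not validate against the content models.

```python
import xml.sax

RDF_AS_PIS = """<?namespace href="http://www.nist.gov/RDFschema" as="NIST"?>
<?namespace href="http://www.w3.org/schemas/rdf-schema" as="RDF"?>
<?RDF:serialization?>
<?RDF:assertions href="John_Smith"?>
<NIST:weight>
  <?RDF:resource id="weight_001"?>
  <NIST:units href="#pounds"/>
  <?RDF:PropValue?>200<?RDF:end-PropValue?>
  <?RDF:end-resource?>
</NIST:weight>
<?RDF:end-assertions?>
<?RDF:end-serialization?>
"""


class Separator(xml.sax.ContentHandler):
    """Record the element structure and the PIs separately."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.children = {}      # element name -> child names and "#PCDATA"
        self.instructions = []  # PI targets in document order

    def startElement(self, name, attrs):
        if self.stack:
            self.children[self.stack[-1]].append(name)
        self.children.setdefault(name, [])
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # Ignore whitespace-only text and merge adjacent text chunks.
        if self.stack and content.strip():
            siblings = self.children[self.stack[-1]]
            if not siblings or siblings[-1] != "#PCDATA":
                siblings.append("#PCDATA")

    def processingInstruction(self, target, data):
        self.instructions.append(target)


def separate(document):
    """Return (element content by name, list of PI targets)."""
    handler = Separator()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.children, handler.instructions
```

Seen this way, NIST:weight contains just a NIST:units element and some character data, while the RDF layer is available as a separate stream of PI targets for an RDF application to interpret.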

In my book (The SGML Cookbook: Document Patterns for SGML and XML, due out in a month) I comment that people are looking too much at "Is this syntax or semantics?" or "Is this data or metadata?" rather than "Is this a pattern in elements, processing instructions or entities?" I think this might be such a case.

B. XML-data Simulated (Rough: incomplete)

The goals of XML-data seem to be to support syntactical schemas (i.e., markup languages) and semantic schemas (i.e., information modeling). In the case of semantic schemas, there are many existing languages and approaches. Why not just use one of those? Or at least, why not take entity-relationship and UML and all the database methodologies and use them to generate user requirements? It seems a little cart-before-the-horse to make a data modeling language without a data modeling theory. Perhaps this is RDF's goal. Maybe I am missing the point. Perhaps they already have done this as part of their development.

To do:

But the smarter things that SGML has provided already: link, conref, the "&" connector, even strong typing and lexical typing are obviously not such a major user requirement that people are lining up in the street crying out for them.

Indeed, HTML has none of these things and is incredibly popular. Some say that useful validation is often either simple types (date etc) or very complicated rules which are better expressed in programs. So the cut'n'paste system provides access to validators: giving links rather than declarations.

Since PIs can be upgraded to exhibit nesting structure, and to perform inline declarations as well as header declarations, and since it is merely a simple matter to say, for example, that all PIs inside an xml:group are to be parsed as elements, the debate over whether to use elements or PIs/declarations comes down to a debate about what a PI is and what an element is. In my view, PIs represent the great escape from the tethers of validity: they allow arbitrary extra information to be added about validity and modeling and yet do not upset the requirements for strict content models.

C. Acknowledgements