[This local archive copy mirrored from the canonical site: http://www.chilli.net.au/~ricko/XML-bind.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
This document is a NOTE for discussion by the W3C XML-related groups. It is primarily an alternative to the namespace proposal and a contribution to defining XLink requirements. RDF and XML-data designers may also find it relevant.
Current linking systems are based on links from elements (i.e., instances of element types in a document). This paper holds that several of the technologies under development by the W3C working groups are better characterized as links from types and names. Suggestions are made for a general architecture to handle this, and for how this can be integrated into XLink. Namespaces, parts of RDF, and SGML Open Catalogs are re-characterized as XLinks from types and names.
At the current time, the following technologies are being worked on by various working groups:
Furthermore, the following technologies are widely deployed in SGML publishing:
Various other deficiencies in XML have been noted:
All these proposals have in some part the need to associate one name or identifier with another. (From here on, I will use "identifier" to mean an SGML system or public identifier, e.g., a URI, and "name" to mean an internal XML name, e.g., the Generic Identifier of an element type or an ID in an ID or IDREF attribute.)
XLink is part of the XML effort, and gives standard attributes and general semantics for link elements. Associating one name or identifier with another can be viewed as a kind of linking: in particular, it is linking by name, and more particularly linking by type name. (From here on, I will assume that an RDF bag is a synthetic type, included in the general term "type".)
XLink does not currently allow "type linking", though the XML-data proposal provides easier anchors for linking to, using just element links. This paper suggests that if type linking is added to XLink, then the various technologies (namespaces, parts of RDF, etc.) can be fruitfully re-characterized as XLinks. This unifies the syntax, and also clarifies the essential differences and overlaps of these technologies.
All these technologies ask the fundamental question "Which name is meant?", which we will separate from the question "How should this be expressed?".
As an aside, this paper limits itself to linking from element type names and from IDREF attributes; however, these are merely particular cases of linking by patterns in attribute name/value pairs. I have not explored this more general mechanism here, since I am concentrating on linking from element types (either direct types using the GI, or synthetic types constructed by bagging together all IDREF attributes with the same value).
First, let us set a general architecture. This architecture is in place primarily to allow us to characterize various proposals, but also as a serious suggestion. Any real documents may have richer or simpler variants on the basic model.
name -1-> local catalog -2-> remote catalog(s)
In this model, we have a name (a GI, an ID, or maybe attribute names or enumerated values) which is associated by some declaration or convention (1) with a "local catalog" (an XLink within the document itself), which contains XLinks to (2) one or more "remote catalogs", at least one of which is the "owner's catalog". The remote catalogs are themselves XLinks.
Various forms of minimization and defaulting are possible, but are not an essential part of this model. For this paper, I use element syntax for all parts of this model, but I also use PIs to provide shorthand "macros" which may be considered to expand into the equivalent elements. This is the primary kind of minimization.
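The two steps of the model can be sketched as a small lookup. This is only an illustration under stated assumptions: the class names `Link` and `Catalog`, the function `resolve`, the "owner" role and the example.org addresses are hypothetical and do not come from any XLink draft, and a real processor would fetch remote catalogs over the network rather than from a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One locator inside a catalog: a role and the resource it points to."""
    role: str
    href: str


@dataclass
class Catalog:
    """A catalog is modelled here as nothing more than a list of links."""
    links: list = field(default_factory=list)


def resolve(name, bindings, remote_catalogs):
    """Follow name -1-> local catalog -2-> remote catalog(s).

    bindings maps a name to its local catalog (step 1).
    remote_catalogs maps an href to an already-fetched remote catalog (step 2).
    Returns every link found in the remote catalogs reachable from the name.
    """
    local = bindings.get(name)
    if local is None:
        return []
    found = []
    for pointer in local.links:
        remote = remote_catalogs.get(pointer.href)
        if remote is not None:
            found.extend(remote.links)
    return found


# Hypothetical data: one owner's catalog that registers a DTD for the name.
owners_catalog = Catalog([Link("xml:dtd", "http://example.org/para.dtd")])
local_catalog = Catalog([Link("owner", "urn:example.org/catalog")])
bindings = {"example.org:para": local_catalog}
remotes = {"urn:example.org/catalog": owners_catalog}

for link in resolve("example.org:para", bindings, remotes):
    print(link.role, link.href)
```

Because the deployed document only holds the pointer to the owner's catalog, the owner can add links to that catalog later without the document being revised.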
Starting backward, the Remote Catalog is merely an XLink, presumably an extended XLink. The most important Remote Catalog is the Owner's Catalog. This is kept at the site nominated by the nominal owner of the name (in a creative sense, not strictly a legal sense) and allows the owner to register various links of interest, which then become available to all conforming documents. The deployed documents do not need revision to make use of the remote catalogs as they are updated. A remote catalog is known by a URN, to give maximum location-independence.
I recommend that W3C create standard type names for all technologies it introduces, to allow standardized software. In particular, the following types should be made available:
This mechanism can be extended to allow signature validation and many other uses. It allows the developer of a vocabulary to add resources appropriate to the vocabulary at will, and for these resources to be immediately available to users of the documents.
The Remote Catalog is a resource, and it could also be formally declared with an entity declaration. Like other questions of whether to use entity declarations or HREFs, it is a matter of appropriateness. In this paper, I use just direct HREFs, without bias against entity declarations.
<!-- A remote catalog. An extended link pointing at all sorts |
The Local Catalog is an XLink, which is just a simple link to the owner's catalog in the default case. However, it can be made into an extended link that points to other remote catalogs. For example, if the owner is CML, and Microsoft wished to also support CML in its browsers, it could also have a remote catalog for this, and a link to this remote catalog could be put in the local catalog.
The Local Catalog allows users of the document to add links to various other stakeholders and interested parties. In particular, it stops a technology-enforced monopoly of defining what a type is (inasmuch as a type may be considered in part to be defined by the operations which can be performed on it).
The Local Catalog is a resource, and it could also be formally declared with an entity declaration. Like other questions of whether to use entity declarations or HREFs, it is a matter of appropriateness. In this paper, I use just direct HREFs, without bias against entity declarations.
<!-- A local catalog. Here, a simple link pointing to a |
Here is an example of a local catalog, given just as attributes on an element. (The "subtree interpretation" does not mean that there is a change in notation in the contained elements, nor that there is any form of name minimization intended.)
<!-- A local catalog. Here, a simple link pointing to a |
The act of associating a name with a local catalog I call "binding". Binding in particular allows a name to be formally associated with its namespace. However, name clashes are not prevented by binding, but by having unique names in the first place. Unique names are the only mechanism which allows simple editing of documents, where one element can be cut from one document and pasted into another without requiring renaming (of GIs and IDs).
Using this architecture, we can characterize the namespace proposal as being the binding of a name to a simple XLink, where the nature of the remote resource is user-defined, and where a PI syntax is used.
As an aside, I note the near failure and compatibility problems of PIs in SGML, which I attribute to the underdefinition of their syntax. If the resource to which the namespace links is left undefined, as has been raised in discussions, then it is probable that market forces will decide its use. In the current climate, and with US anti-trust laws, this may even end up in a steady state with one dominant technology and one subsidiary one, as is so typically the case with US-based technology.
Using this architecture, SGML Open Catalogs can be seen as a local catalog, primarily to link a public identifier to a system identifier. A simple syntax is used by SO catalogs.
Using this architecture, the initial xml-bind mechanism I proposed can be seen as being the binding of all names to extended XLinks, where the nature of the remote resource is fixed by convention (i.e., no remote catalog is used) and the binding mechanism uses a text-substitution mechanism ($1 for the owner part of the name before ":" and $2 for the other part).
RDF is more tricky in where it fits in. RDF seems more aimed at linking individual elements rather than element types, but it could be extended to handle element types. The bag mechanism in fact creates a synthetic type. The architecture above can apply to the names of synthetic types (e.g., BAG_IDs). And since RDF uses namespaces, the architecture is a tool underlying RDF's XML implementation anyway. (More to be done on this.) RDF is furthermore complicated in that it allows un-named types, i.e., types made by marking up the element with surrounding tags, rather than by using attributes or GIs of the element. RDF shows in this that it does not expect documents to have a regular structure, but to be, to some extent, just elements thrown together chaotically.
Using this architecture, XML-data can be more readily used (as indeed can any other architecture schema) since xml-bind provides a name remapping system. XML-data does not provide any mechanism of going from a name to a declaration except directly.
Using this architecture, the DOCTYPE declaration can be seen as an elision of the local and remote catalog, with just a declaration for binding all names to a resource.
The new xml:link type of "xml-bind" should be introduced. It signifies that the link is a binding.
The new xml:link attribute "as" should be introduced. This attribute allows "*" for any string. In the absence of any ":", the string is taken to be the owner. If the string is empty, or starts with a ":", it applies to all un-prefixed elements. In an HREF, $1 and $2 can be used to substitute for the left and right sides of the ":" in a name.
<x xml:link="simple" type="bind" as="rj.com" href="..." .../>
equivalent to
<?xml-bind xml:link="simple" type="bind" as="rj.com:*" href="..." ...?>
and
<?xml-bind xml:link="simple" type="bind" as="rj.com:$2" href="..." ...?>
These allow bindings to individual elements, to groups of elements with the same owner, and to other names with different conventions, or to restricted matches of names.
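The matching and substitution rules for "as" can be sketched roughly as follows. This is an illustration, not a definitive reading of the proposal. It assumes that a name is split at its first ":", that an empty local part or "$2" in an "as" value means "any local name" (which makes the three forms above equivalent), and that a value such as ":para" restricts the match to the un-prefixed name "para". The function names and the example.org address are hypothetical.

```python
import re


def split_name(name):
    """Split 'owner:local' into (owner, local); un-prefixed names get owner ''."""
    owner, sep, local = name.partition(":")
    return (owner, local) if sep else ("", name)


def normalize_as(as_value):
    """Expand the shorthand forms of an 'as' value into (owner, local) patterns."""
    if ":" not in as_value:
        # No colon: the whole string is the owner ('' selects un-prefixed names).
        return as_value, "*"
    owner, _, local = as_value.partition(":")
    if local in ("", "$2"):
        # Assumption: both spell "any local name".
        local = "*"
    return owner, local


def _star_match(pattern, text):
    """Match text against a pattern in which '*' stands for any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text) is not None


def matches(as_value, name):
    """Return True if the binding with this 'as' value applies to the name."""
    pattern_owner, pattern_local = normalize_as(as_value)
    owner, local = split_name(name)
    return _star_match(pattern_owner, owner) and _star_match(pattern_local, local)


def expand_href(href, name):
    """Substitute $1 (owner part) and $2 (local part) of the name into an href."""
    owner, local = split_name(name)
    return href.replace("$1", owner).replace("$2", local)


# The three equivalent forms all bind the same element type name.
for as_value in ("rj.com", "rj.com:*", "rj.com:$2"):
    print(as_value, matches(as_value, "rj.com:para"))

print(expand_href("http://example.org/$1/$2.dtd", "rj.com:para"))
```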
Note that a DOCTYPE declaration
<!DOCTYPE #IMPLIED SYSTEM "xxx.dtd">
could have the equivalent form
<?xml-bind xml:link="simple" type="bind" as="" href="xxx.dtd" role="xml:dtd"?>
The doctype declaration is both a kind of entity declaration and a binding.
This paper uses the same colon-delimited prefix which the namespace proposal has accepted. The prefix can be called the "owner prefix", where owner does not imply any property rights.
The simplest mechanism for keeping unique names is to prefix them. The only way to have prefixes which can withstand simple editing is to have a disciplined ownership regime. This is standard practice in ISO 9070 public identifiers, SGML public identifiers, URNs, MIME media types, Internet addresses, and even in Java package names.
I define simple editing as cutting from one document and pasting to another using a simple text editor by a naive user who does not understand the namespace implications and therefore will not also cut and paste any declarations or PIs.
The namespace proposal should be robust enough that simple editing will not result in name or ID clashes. In the current namespace proposal, the binding of the name to the URL takes place at the point of the namespace declaration. The name prefixes themselves have no mechanism to prevent name clashes. This is not robust against simple editing.
The proposal in this paper is that the name prefix must be an owner prefix, and that the owner prefix should adopt the owner mechanism of URIs. The prefix should therefore be a domain name, with subdirectories. This mechanism is robust against simple editing.
The xml-bind proposal therefore requires the addition of "/" as a NAME character. However, the name must not start with a "/" (nor end with one), since "/" is also used in delimiters. (Note that due to the SGML longest-delimiter-matches rule, an end-tag will never be interpreted as a start-tag with a GI starting with "/" in any case. However, it could be confusing to implementors and users.)
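As a rough sketch of what such names look like to a processor, the following splits an owner-prefixed name into its domain, subdirectories and local part, and enforces only the "/" rule stated above. The function name `parse_owned_name` and the sample name "rj.com/chem:molecule" are hypothetical, and no other XML NAME constraints are checked.

```python
def parse_owned_name(name):
    """Split an owner-prefixed name into (domain, subdirectories, local part).

    Only the rule that a name may not start or end with '/' is enforced.
    Un-prefixed names are returned with a domain of None.
    """
    if not name or name.startswith("/") or name.endswith("/"):
        raise ValueError("a name must not start or end with '/': %r" % name)
    prefix, sep, local = name.partition(":")
    if not sep:
        return None, [], name
    domain, *subdirs = prefix.split("/")
    return domain, subdirs, local


print(parse_owned_name("rj.com/chem:molecule"))
print(parse_owned_name("rj.com:para"))
```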
Various forms of minimization are worth considering. Those I propose here are suggestions, and are not essential to the xml-bind proposal.
I do not recommend any form of minimization which uses partial names, nested inside declarations. This form of minimization is not robust against simple editing and, though appealing in the abstract, is fraught with danger, since it gives documents which are nominally well-formed, but actually badly marked up.
PIs can be used like macros in text processing languages. This is a subsequent stage to the formal parsing of the XML, of course. A PI-macro post-processor can expand PI-macros into elements.
In the particular case of xml-bind, it means that syntaxes such as the namespace PI can be adopted, but with the semantic that they expand by a (nominal) post-process into element nodes which are type-links in the parse tree (GROVE).
In this way, the xml-bind proposal is not in conflict with the namespace proposal, but merely regards it as a minimized syntax which may be useful.
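The nominal post-process can be sketched with Python's standard `xml.dom.minidom`, which keeps PIs as nodes in the parsed tree. This is a minimal illustration under assumptions: the PI data is read as quoted pseudo-attributes, the replacement element is arbitrarily named "xml-bind", only PIs inside the document element are expanded (a PI in the prolog cannot become a second top-level element), and the example.org address is hypothetical.

```python
import re
from xml.dom import minidom

# Pseudo-attributes of the form name="value" inside the PI data.
PSEUDO_ATTR = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def expand_pi_macros(xml_text, target="xml-bind", element_name="xml-bind"):
    """Replace each <?target ...?> inside the document element with an element.

    Returns the modified minidom Document.
    """
    doc = minidom.parseString(xml_text)

    def walk(node):
        # Copy the child list, since it is modified during the walk.
        for child in list(node.childNodes):
            is_pi = child.nodeType == child.PROCESSING_INSTRUCTION_NODE
            if is_pi and child.target == target:
                elem = doc.createElement(element_name)
                for key, value in PSEUDO_ATTR.findall(child.data):
                    elem.setAttribute(key, value)
                node.replaceChild(elem, child)
            else:
                walk(child)

    walk(doc.documentElement)
    return doc


source = (
    '<doc>'
    '<?xml-bind xml:link="simple" type="bind" as="rj.com:*" '
    'href="http://example.org/catalog"?>'
    '<p/>'
    '</doc>'
)
print(expand_pi_macros(source).documentElement.toxml())
```

The document stays valid against its own declarations as authored, since the PIs are only turned into elements in the tree handed to later processing.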
PI syntax has the great advantage that PIs can be added to a document without making the document invalid against its XML markup declarations. However, elements may be favoured by some, especially those from the HTML world who may not understand or accept the rationale for PIs. Element nodes may be easier to navigate and process using Xptrs: this remains to be seen. PIs also have the disadvantage of being point tags: ranges have to be delimited using less visually pleasing conventions than elements, which turns many people off.
As an aside, I note that RDF may gain the same advantage by using PI-macros instead of elements: a document may be decorated with assertions about elements without foregoing element validity, but the PIs can be post-processed into elements for convenient processing.
The major method of minimization proposed is the use of defaulting. The simple convention of cgi-bin shows that a predefined name in which executables can be put is useful: the name cgi-bin can be remapped at the server. So there is no compelling practical reason why partial URIs cannot be predefined by W3C, and mapped at the server to appropriate locations. Remapping is not bad practice; it is common practice and useful.
This paper proposes that if there is no local catalog, an xml-bind processor defaults to constructing a local catalog (i.e., a single link) using the href urn:$1/xml-bind/$2. Most importantly, this mechanism is robust in the face of simple editing.
In practice, it means that I can create a vocabulary and make documents with it (using the owner prefixing as above), and that I do not need any explicit local catalog. I can put all the resources relevant to the name on a remote catalog on my server. Any user can cut and paste the elements: the owner prefixes assure uniqueness and portability, and the defaulting rules look after binding the name (i.e., by synthesizing a local catalog from the type name).
With this defaulting, a local catalog only needs to be constructed if the local user wishes to annotate the type with their own resources, or if they wish to override the default mechanism.
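The defaulting rule and its override can be sketched in a few lines. The default pattern urn:$1/xml-bind/$2 is the one proposed above. Everything else is an assumption for illustration: the function names, the dictionary standing in for explicit local catalogs, the example.org override address, and the choice to give un-prefixed names no default.

```python
DEFAULT_HREF = "urn:$1/xml-bind/$2"  # the default pattern proposed in the text


def default_href(name):
    """Synthesize the default local-catalog href from an owner-prefixed name."""
    owner, sep, local = name.partition(":")
    if not sep:
        # Assumption: an un-prefixed name has no owner, so no default applies.
        return None
    return DEFAULT_HREF.replace("$1", owner).replace("$2", local)


def local_catalog_href(name, explicit_bindings):
    """Prefer an explicit local catalog; otherwise fall back to the default."""
    if name in explicit_bindings:
        return explicit_bindings[name]
    return default_href(name)


# No local catalog in the document: the default is synthesized from the name.
print(local_catalog_href("rj.com:para", {}))
# prints urn:rj.com/xml-bind/para

# A local user overrides the default to point at their own catalog.
overrides = {"rj.com:para": "http://example.org/my-catalog"}
print(local_catalog_href("rj.com:para", overrides))
# prints http://example.org/my-catalog
```

Since the default depends only on the name itself, an element pasted into another document resolves to the same catalog without any declaration travelling with it.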
SGML allows the end tag to be minimized to "</>", meaning the most recently opened element closes. This is not valid or well-formed in XML currently. Gavin Nicol has also noted in the XML SIG that compression algorithms used by modems and other intermediate data transmission devices may in fact already compress the data more than this mechanism provides.
Another paper of interest may be A Cut and Paste Infrastructure for XML.