Standard Deviations from Norm

If You Can Name It, You Can Claim It!

04 Apr 2000

Issue 3

Table of Contents

Curses, Foiled Again!
Names Are Not Addresses

Public Identifiers
Uniform Resource Names

Resolving Names
Catalog Files
Understanding Catalog Files
Adding Catalog Support to Your Applications
Supporting XML Catalogs
Catalogs In Action

catalog
eresolve
Catalogs in XT

May All Your Names Resolve Successfully!

System identifiers suck! XML requires me to supply system identifiers for external references, and it requires those identifiers to be Uniform Resource Identifiers (URIs); both facts are a frequent source of considerable irritation. In this column, we'll explore how you can use OASIS Catalog files (or their XML equivalent) to avoid these difficulties.

Using Catalog files became a lot easier earlier this month when Arbortext released its Java Catalog classes to the XML community. Using these classes, it's simple to add Catalog support to your favorite Java parser. (Equivalent support for parsers in other languages should be fairly easy to construct from the free, Open Source Java classes, although Arbortext has no immediate plans to undertake this effort.)

You can download the classes or view the JavaDoc API Documentation online. You can also read Arbortext's press release about the code.

But first, let's consider the scope of the problem.

Curses, Foiled Again!

There are several common ways that the system identifier problem raises its ugly head:

I have an XML document that I want to publish on the web or include in the distribution of some piece of software. On my system, I keep the doctype of the document in some local directory, so my doctype declaration reads:
```
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
                  "file:///n:/share/doctypes/docbook/xml/docbookx.dtd">
```
As soon as I distribute this document, I immediately begin getting error reports from customers who can't read the document because they don't have DocBook installed at the location identified by the URI in my document. Drat!
Or I remember to change the URI before I publish the document:
```
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
                  "http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">
```
And the next time I try to edit the document, I get errors because I happen to be working on my laptop on a plane somewhere and can't get to the net. Blast!
Just as often, I get tripped up this way: I'm working collaboratively with a colleague. She's created initial drafts of some documents that I'm supposed to review and edit. So I grab them and find that I can't open or publish them because I don't have the same network connections she has or I don't have Epic installed in the same place. And if I change the system identifiers so they work on my system, she has the same problems when I send them back to her. Drat and blast!

All of this makes me want to pull my hair out because there's a perfectly good solution for this problem: public identifiers. They're defined in XML; they just aren't used very effectively because users currently cannot rely on applications to resolve them in an interoperable manner.

Public identifiers provide global, unique names for entities independent of their storage location.

Names Are Not Addresses

Despite opinions to the contrary^[1], I maintain that names and addresses are distinct. If I claim that I want version 3.1 of the DocBook DTD, or the 1911 edition of Webster's dictionary, or Issue 2 of Standard Deviations from Norm, that's what I want, irrespective of its location on the net (or even whether it's available on the net). While it is possible to view a URL as a name, I don't think that's the natural interpretation.

There are currently two ways that I might reasonably assign an address-independent name to an object: public identifiers or Uniform Resource Names (URNs)^[2].

Uniform What?

There are three sorts of universal identifiers that are commonly discussed: URLs, URIs, and URNs. Here's the lowdown on what these acronyms really mean:

URL

URL is short for Uniform Resource Locator and is defined by RFC 1738. A URL gives the location of a resource on the web: a particular object, on a particular server, accessed with a particular protocol. For example, the URL for Standard Deviations from Norm is http://www.arbortext.com/Think_Tank/Norm_s_Column/norm_s_column.html. That means the document (or, more generally, the resource) "/Think_Tank/Norm_s_Column/norm_s_column.html" on the server www.arbortext.com, accessed via HTTP. A URL is an address.

URN

URN is short for Uniform Resource Name and is defined by RFC 2141. A URN is a globally unique name for a resource; it consists of the prefix "urn:", followed by a Namespace Identifier (NID) and a Namespace Specific String (NSS). Official namespace identifiers are assigned by the Internet Assigned Numbers Authority and each NID owner is responsible for subdividing the namespace using NSS. Experimental NIDs begin with "x-" and there is no guarantee of uniqueness among experimental URNs.

The experimental URN urn:x-oasis:docbook-xml-v4.0 consists of the NID "x-oasis" and the NSS "docbook-xml-v4.0." A URN is a name.

URI

URI is short for Uniform Resource Identifier and is defined by RFC 2396. The term URI refers to the general, compact syntax of the strings that are used to create URLs, URNs, and other uniform identifiers.

URIs blur the distinction between names and addresses in an unfortunate way. We now have specifications that use URIs that look syntactically like URLs for purposes to which URNs would have been better suited.

Public Identifiers

Public identifiers are part of XML 1.0. They can occur in any form of external entity declaration. They allow you to give a globally unique name to any entity. For example, the XML version of DocBook V4.0 is identified with the following public identifier:

-//OASIS//DTD DocBook XML V4.0//EN

You'll see this identifier in the two doctype declarations I used earlier. This identifier gives no indication of where the resource (the DTD) may be found, but it does uniquely name the resource. That public identifier, now and forever, refers to the XML version of DocBook V4.0.

Uniform Resource Names

URNs are a form of URI. Like public identifiers, they give a location-neutral, globally unique name to an entity. For example, OASIS might choose to identify the XML version of DocBook V4.0 with the following URN:^[3]

urn:x-oasis:docbook-xml-v4.0

Like a public identifier, a URN can now and forever refer to a specific entity in a location-independent manner.
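
To make the anatomy of a URN concrete, here is a minimal Java sketch that splits one into its NID and NSS. The class name UrnParts is made up for this illustration, and the sketch only checks the overall "urn:NID:NSS" shape; it does not validate the characters RFC 2141 allows in each part.

```java
import java.util.Locale;

/** Splits a URN of the form "urn:NID:NSS" into its parts (illustrative only). */
final class UrnParts {
    final String nid;  // Namespace Identifier
    final String nss;  // Namespace Specific String

    private UrnParts(String nid, String nss) {
        this.nid = nid;
        this.nss = nss;
    }

    static UrnParts parse(String urn) {
        // The leading "urn:" is matched case-insensitively.
        if (!urn.regionMatches(true, 0, "urn:", 0, 4)) {
            throw new IllegalArgumentException("Not a URN: " + urn);
        }
        // The NID runs up to the next colon; everything after it is the NSS.
        int colon = urn.indexOf(':', 4);
        if (colon <= 4 || colon == urn.length() - 1) {
            throw new IllegalArgumentException("Missing NID or NSS: " + urn);
        }
        return new UrnParts(urn.substring(4, colon), urn.substring(colon + 1));
    }

    /** Experimental NIDs begin with "x-". */
    boolean isExperimental() {
        return nid.toLowerCase(Locale.ROOT).startsWith("x-");
    }

    public static void main(String[] args) {
        UrnParts p = parse("urn:x-oasis:docbook-xml-v4.0");
        // Prints: x-oasis / docbook-xml-v4.0 / experimental=true
        System.out.println(p.nid + " / " + p.nss + " / experimental=" + p.isExperimental());
    }
}
```

Nothing in the sketch says where the DTD lives, which is exactly the point: the string is a name, and something else has to turn it into an address.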

Resolving Names

Having extolled the virtues of location-independent names, I must admit that a name isn't very useful if you can't find the thing it refers to. In order to do that, you must have a name resolution mechanism that allows you to determine what resource is referred to by a given name.

One important feature of this mechanism is that it can allow resources to be distributed, so you don't have to go to http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd to get the XML version of DocBook V4.0, if you have a local copy.

There are a few possible resolution mechanisms:

The application just "knows". Sure, it sounds a little silly, but this is currently the mechanism being used for namespaces. Applications know what the semantics of namespaced elements are because they recognize the namespace URI.
OASIS Catalog files provide a mechanism for mapping public and system identifiers, allowing resolution to both local and distributed resources. This is the resolution scheme we're going to consider for the balance of this column.
Many other mechanisms are possible. There are already a few for URNs, including at least one built on top of DNS, but they aren't widely deployed.

Catalog Files

Catalog files are straightforward text files that describe a mapping from names to addresses. Here's a simple one:

PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
       "docbook/xml/docbookx.dtd"
SYSTEM "urn:x-oasis:docbook-xml-v4.0"
       "docbook/xml/docbookx.dtd"
DELEGATE "-//Arbortext//" "file:///c:/epic/doctypes/catalog"

This file maps both the public identifier and the URN I mentioned earlier to a local copy of DocBook on my system. If the doctype declaration uses the public identifier for DocBook, I'll get DocBook regardless of the (possibly bogus) system identifier! Likewise, my local copy of DocBook will be used if the system identifier contains the DocBook URN.

The DELEGATE entry instructs the resolver to use the catalog "c:\epic\doctypes\catalog" for any public identifier that begins with "-//Arbortext//". The advantage of DELEGATE in this case is that I don't have to parse that catalog file unless I encounter a public identifier that I reasonably expect to be in there.
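
To show how little machinery a file like this needs, here is a rough Java sketch that splits catalog text into entries. It is not the Arbortext parser: the class name CatalogTokens is invented, keywords are matched case-insensitively as an assumption, only the six entry types covered in this column are recognized, and catalog comments are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative tokenizer for simple catalog text; not the Arbortext parser. */
final class CatalogTokens {
    // Entry keywords recognized by this sketch.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "BASE", "CATALOG", "OVERRIDE", "DELEGATE", "PUBLIC", "SYSTEM"));

    // A token is a double-quoted literal, a single-quoted literal, or a bare word.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|'([^']*)'|(\\S+)");

    /** Returns one list per entry: the keyword followed by its arguments. */
    static List<List<String>> parse(String text) {
        List<List<String>> entries = new ArrayList<>();
        List<String> current = null;
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String bare = m.group(3);
            if (bare != null && KEYWORDS.contains(bare.toUpperCase(Locale.ROOT))) {
                // An unquoted keyword starts a new entry.
                current = new ArrayList<>();
                current.add(bare.toUpperCase(Locale.ROOT));
                entries.add(current);
            } else if (current != null) {
                // A quoted literal or a bare argument belongs to the current entry.
                String value = m.group(1) != null ? m.group(1)
                        : m.group(2) != null ? m.group(2) : bare;
                current.add(value);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "PUBLIC \"-//OASIS//DTD DocBook XML V4.0//EN\"",
                "       \"docbook/xml/docbookx.dtd\"",
                "SYSTEM \"urn:x-oasis:docbook-xml-v4.0\"",
                "       \"docbook/xml/docbookx.dtd\"",
                "DELEGATE \"-//Arbortext//\" \"file:///c:/epic/doctypes/catalog\"");
        for (List<String> entry : parse(text)) {
            System.out.println(entry);
        }
    }
}
```

Each entry comes back as a keyword followed by its arguments, which is all a resolver needs in order to build its lookup tables.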

Understanding Catalog Files

Catalog files are officially defined by OASIS Technical Resolution TR9401, but for our purposes, the following informal description will suffice^[4].

A Catalog is a text file that contains a sequence of entries. Of the 13 types of entries that are possible, we'll consider only six in this article: BASE, CATALOG, OVERRIDE, DELEGATE, PUBLIC, and SYSTEM.

BASE uri

Catalog entries can contain relative URIs. The BASE entry changes the base URI for subsequent relative URIs. The initial base URI is the URI of the catalog file.

CATALOG catalogURI

Adds the catalog file specified by the catalogURI to the end of the current catalog. This allows one catalog to refer to another.

OVERRIDE YES|NO

The OVERRIDE setting determines whether or not system identifiers specified in the catalog are to be used in favor of system identifiers supplied in the document. Suppose you have an entity in your document for which both a public identifier and a system identifier have been specified, and the catalog only contains a mapping for the public identifier (e.g., a matching PUBLIC catalog entry). If OVERRIDE is YES, the system identifier supplied in the matching PUBLIC catalog entry will be used. If it is NO, the system identifier in the document will be used. (If the catalog contained a matching SYSTEM catalog entry giving a mapping for the system identifier, that mapping would have been used, the public identifier would never have been considered, and the setting of OVERRIDE would have been irrelevant.)

Generally, the purpose of catalogs is to override the system identifiers in XML documents, so override should be enabled in your catalogs.

DELEGATE partialPublicId catalogURI

The DELEGATE entry specifies that public identifiers that begin with partialPublicId should be resolved using the catalog specified by the catalogURI. If multiple DELEGATE entries match the public identifier, they will each be searched, starting with the longest partialPublicId and continuing to the shortest.

The DELEGATE entry differs from the CATALOG entry in the following way: alternate catalogs referenced with a CATALOG entry are parsed and included in the current catalog. Delegated catalogs are only considered, and consequently only loaded and parsed, if necessary. Delegated catalogs are also used instead of the current catalog, not as part of the current catalog.

PUBLIC publicId systemId

Maps the public identifier publicId to the system identifier systemId.

SYSTEM systemId otherSystemId

Maps the system identifier systemId to the alternate system identifier otherSystemId.
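
The relative-URI handling described for BASE can be sketched with nothing more than java.net.URI. The class name and the example.org value below are hypothetical, and resolving the BASE value itself against the previous base is an assumption of this sketch, not something the description above spells out.

```java
import java.net.URI;

/** Illustrative handling of relative catalog URIs; not the Arbortext implementation. */
final class BaseUriSketch {
    /** Resolves a possibly relative catalog URI against the current base URI. */
    static URI resolve(URI base, String reference) {
        return base.resolve(reference);
    }

    public static void main(String[] args) {
        // Initially, the base is the URI of the catalog file itself.
        URI base = URI.create("file:///share/doctypes/catalog");
        // Prints: /share/doctypes/docbook/xml/docbookx.dtd
        System.out.println(resolve(base, "docbook/xml/docbookx.dtd").getPath());

        // A BASE entry replaces the base for the entries that follow it.
        base = resolve(base, "http://example.org/dtds/");  // hypothetical BASE value
        // Prints: http://example.org/dtds/docbook/xml/docbookx.dtd
        System.out.println(resolve(base, "docbook/xml/docbookx.dtd"));
    }
}
```

The same relative system identifier lands in two different places depending on the base in effect, which is why the position of a BASE entry in the catalog matters.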

Catalog resolution occurs in the following order:

If a SYSTEM entry matches the specified system identifier, it is used.
If a PUBLIC entry matches the specified public identifier and either OVERRIDE is YES or no system identifier is provided, it is used.
If no exact match was found for the public identifier, but it matches one or more of the partial public identifiers specified in DELEGATE entries, the delegated catalogs are searched for a matching public identifier. (Note that the system identifier is never provided to the delegated catalogs, so a SYSTEM entry in a delegated catalog that would have matched the system identifier of the entity in question is never considered.)
If there's still no match, ENTITY, DOCTYPE, and NOTATION entries are considered. (These entries aren't discussed in this article, but are fully described in the technical resolution.)
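
The first three steps of that order can be modeled in a few dozen lines of Java. This is a simplified, in-memory sketch under stated assumptions, not the Arbortext Catalog class: the names are invented, the OVERRIDE setting is recorded per PUBLIC entry, and ENTITY, DOCTYPE, and NOTATION entries are left out, as are the subtleties that TR9401 covers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative in-memory model of the lookup order; not the Arbortext Catalog class. */
final class LookupOrderSketch {
    private final Map<String, String> systemEntries = new HashMap<>();
    private final Map<String, String> publicEntries = new HashMap<>();
    // OVERRIDE setting that was in effect where each PUBLIC entry appeared.
    private final Map<String, Boolean> overrideAt = new HashMap<>();
    private final Map<String, LookupOrderSketch> delegates = new HashMap<>();

    void addSystem(String systemId, String uri) {
        systemEntries.put(systemId, uri);
    }

    void addPublic(String publicId, String uri, boolean override) {
        publicEntries.put(publicId, uri);
        overrideAt.put(publicId, override);
    }

    void addDelegate(String partialPublicId, LookupOrderSketch catalog) {
        delegates.put(partialPublicId, catalog);
    }

    /** Returns the mapped URI, or null when the document's own system identifier applies. */
    String resolve(String publicId, String systemId) {
        // 1. A matching SYSTEM entry is used first.
        if (systemId != null && systemEntries.containsKey(systemId)) {
            return systemEntries.get(systemId);
        }
        if (publicId == null) {
            return null;
        }
        // 2. A matching PUBLIC entry counts only if OVERRIDE is YES
        //    or the document supplied no system identifier.
        if (publicEntries.containsKey(publicId)) {
            boolean usable = overrideAt.get(publicId) || systemId == null;
            return usable ? publicEntries.get(publicId) : null;
        }
        // 3. No exact match: consult delegated catalogs, longest prefix first.
        List<String> prefixes = new ArrayList<>();
        for (String prefix : delegates.keySet()) {
            if (publicId.startsWith(prefix)) {
                prefixes.add(prefix);
            }
        }
        prefixes.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String prefix : prefixes) {
            // The system identifier is deliberately not passed along.
            String hit = delegates.get(prefix).resolve(publicId, null);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LookupOrderSketch catalog = new LookupOrderSketch();
        catalog.addSystem("urn:x-oasis:docbook-xml-v4.0", "docbook/xml/docbookx.dtd");
        catalog.addPublic("-//OASIS//DTD DocBook XML V4.0//EN", "docbook/xml/docbookx.dtd", true);

        LookupOrderSketch delegated = new LookupOrderSketch();
        // Hypothetical entry in the delegated catalog.
        delegated.addPublic("-//Arbortext//DTD Example//EN", "example.dtd", true);
        catalog.addDelegate("-//Arbortext//", delegated);

        // Prints: docbook/xml/docbookx.dtd
        System.out.println(catalog.resolve("-//OASIS//DTD DocBook XML V4.0//EN", "http://bogus/x.dtd"));
        // Prints: example.dtd
        System.out.println(catalog.resolve("-//Arbortext//DTD Example//EN", "http://bogus/y.dtd"));
    }
}
```

A null result here stands for "the catalog has no opinion", in which case a parser would fall back to the system identifier given in the document.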

Adding Catalog Support to Your Applications

If you work with Java applications using a parser that supports the SAX Parser interface, adding Catalog support to your applications is a snap. The SAX Parser interface includes an entityResolver hook designed to give an application the opportunity to do this sort of indirection. The com.arbortext.catalog package implements the full OASIS Catalog semantics and provides a class that implements the SAX EntityResolver interface.

All you have to do is set up a com.arbortext.catalog.CatalogEntityResolver on your parser's entityResolver hook. The code listing in Example 1 demonstrates how straightforward this is:

Example 1. Adding a CatalogEntityResolver to Your Parser

import com.arbortext.catalog.*;

...
    CatalogEntityResolver cer = new CatalogEntityResolver();
    Catalog myCatalog = new Catalog();
    myCatalog.loadSystemCatalogs();
    cer.setCatalog(myCatalog);
...
    yourParser.setEntityResolver(cer);

The system catalogs are loaded from the system catalog path, stored in the System property xml.catalog.files. (For all the gory details about these classes, consult the API documentation.) You can explicitly parse your own catalogs (perhaps taken from command line arguments or a Preferences dialog) instead of or in addition to the system catalogs:

myCatalog.parseCatalog(catalogFile);
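
If you want to see the entityResolver hook at work without the Arbortext classes, the following self-contained sketch uses only the SAX support that ships with the JDK. It is a stand-in, not the CatalogEntityResolver: the class names are invented, it uses the SAX2 XMLReader rather than the SAX1 Parser, and it maps public identifiers to in-memory text instead of files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative use of the SAX entity resolver hook; not the CatalogEntityResolver. */
final class ResolverHookSketch {
    /** Maps public identifiers to replacement text held in memory (stand-ins for files). */
    static final class MapResolver implements EntityResolver {
        private final Map<String, String> byPublicId = new HashMap<>();

        void map(String publicId, String content) {
            byPublicId.put(publicId, content);
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            String content = publicId == null ? null : byPublicId.get(publicId);
            if (content == null) {
                return null;  // Let the parser fall back to the system identifier.
            }
            InputSource source = new InputSource(new StringReader(content));
            source.setSystemId(systemId);
            return source;
        }
    }

    /** Parses the document with the resolver installed and returns its character data. */
    static String parse(String xml, EntityResolver resolver) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        MapResolver resolver = new MapResolver();
        resolver.map("-//Arbortext//TEXT Test Public Identifier//EN", "resolved by name");
        String xml = "<!DOCTYPE test [\n"
                + "<!ENTITY testpub PUBLIC \"-//Arbortext//TEXT Test Public Identifier//EN\"\n"
                + "                 \"bogus-system-identifier.xml\">\n"
                + "]>\n"
                + "<test>&testpub;</test>";
        System.out.println(parse(xml, resolver));
    }
}
```

The bogus system identifier is never opened, because the resolver answers for the public identifier first. A catalog-backed resolver plugs into the same hook; it just consults catalog entries instead of a hard-coded map.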

Supporting XML Catalogs

The Catalog class can also load XML Catalogs. At present, the only XML Catalog format recognized is John Cowan's XML Catalog format (formerly XCatalogs). XML Catalogs are indistinguishable from OASIS Catalogs to your application; all you have to do to enable XML Catalog processing is supply the name of a class that implements the SAX Parser interface. In Example 2, the Apache XML Project's Xerces parser is used.

Example 2. Adding Support for XML Catalogs

import com.arbortext.catalog.*;

...
   CatalogEntityResolver cer = new CatalogEntityResolver();
   Catalog myCatalog = new Catalog();
   myCatalog.setParserClass("org.apache.xerces.parsers.SAXParser"); // support XML Catalogs
   myCatalog.loadSystemCatalogs();
   cer.setCatalog(myCatalog);
...
   yourParser.setEntityResolver(cer);

Catalogs In Action

The Arbortext Catalogs distribution includes two test programs that you can use to see how this all works. In order to use these programs, you must have the catalog.jar and catalog-apps.jar files on your CLASSPATH. The eresolve program also requires a recent version of Xerces on your CLASSPATH.

The README file in the catalog distribution describes each of the demonstration programs in more detail.

catalog

The catalog program takes several catalogs and a request and displays the system identifier returned by the Catalog.

You can see this program in action in Example 3.

Example 3. Using the catalog Command

>java catalog -d 0 -c /share/doctypes/catalog PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
Ignoring system catalogs.
Set debug to: 0
Adding catalog: /share/doctypes/catalog
Resolving PUBLIC:
        Public: -//OASIS//DTD DocBook XML V4.0//EN
        System: null

Resolved: file:/share/doctypes/docbook/xml/docbookx.dtd

eresolve

The second program, eresolve, uses the CatalogEntityResolver class. A complete test environment is provided in the test directory:

catalog

This is a Catalog with a few simple entries:

OVERRIDE YES
PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN" "testpub.xml"
SYSTEM "urn:x-arbortext:test-system-identifier" "testsys.xml"

OVERRIDE NO
PUBLIC "-//Arbortext//TEXT Test Override//EN" "override.xml"

test.xml

This is a test document that contains several external entities:

<!DOCTYPE test [
<!ENTITY testpub PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN"
                 "bogus-system-identifier.xml">
<!ENTITY testsys SYSTEM "urn:x-arbortext:test-system-identifier">
<!ENTITY testovr PUBLIC "-//Arbortext//TEXT Test Override//EN"
                 "testovr.xml">
]>
<test>
&testpub;
&testsys;
&testovr;
</test>

This XML document demonstrates several Catalog features:

If parsed without a catalog, the parse will fail since bogus-system-identifier.xml won't be found (and neither would the URN, unless you happen to have some other URN resolution mechanism running).

If parsed with the included catalog, the following substitutions will be made:

&testpub; will be replaced with the contents of testpub.xml, due to the mapping provided by the first PUBLIC entry in the catalog.
&testsys; will be replaced with the contents of testsys.xml, due to the mapping provided by the SYSTEM entry in the catalog.
&testovr; will be replaced with the contents of testovr.xml, due to the system identifier given in its entity declaration; the mapping provided by the second PUBLIC entry in the catalog is not used because the entity declaration did provide a system identifier and the matching public identifier occurs where OVERRIDE is NO.

You can see this process in action in Example 4.

Example 4. Using the eresolve Command

>java eresolve -d 2 -c test\catalog test\test.xml
Set debug to 2
Adding catalog: test\catalog
Loading catalog: test\catalog
Parsing test\test.xml
Resolved: -//Arbortext//TEXT Test Public Identifier//EN
        file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testpub.xml

Resolved: urn:x-arbortext:test-system-identifier
        file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testsys.xml

Done parsing test\test.xml

Catalogs in XT

This last example demonstrates Catalog resolution in a real application. The Catalog distribution includes a modified version of the primary driver from XT, com.arbortext.xsl.sax.Driver. It differs from the com.jclark.xsl.sax.Driver class only in the addition of Catalog support. You can use it to convert the document in the test directory to HTML, as shown in Example 5. You must have the xt.jar and xp.jar files on your CLASSPATH in order to run this example.

Example 5. Using Catalogs in XT

>java -Dxml.catalog.files=test\catalog com.arbortext.xsl.sax.Driver
    test\test.xml test\style.xsl

Note that this example uses the system property xml.catalog.files to set the catalog path because the Driver does not support a command-line option to specify catalog files.

May All Your Names Resolve Successfully!

We hope that these classes become a standard part of all the major XML Parsers. As XML processors incorporate this code, users will be able to use public identifiers in XML documents, confident that they can move those documents from one system to another and around the Web and still reach the appropriate external file or Web page.

Norman Walsh lives in beautiful, rural western Massachusetts where he hacks XML for fun and profit. He can name lots of things that he's unable to locate: his car keys, for example.

^[1]The Myth of Names and Addresses, Tim Berners-Lee, December 19, 1996.

^[2]URIs that rely on the domain name system to identify objects (in other words, all URLs) are addresses, not names, even though the domain name provides a level of indirection and the illusion of a stable name.

^[3]This URN uses an experimental namespace identifier (NID). In practice, if OASIS were going to assign URNs, it would go through the process described in RFC 2611, URN Namespace Definition Mechanisms, to obtain an official NID.

^[4]There are a few subtleties in catalog processing, especially with respect to CATALOG and DELEGATE entries. These are implemented correctly in the Catalog classes, but the particulars won't be considered in detail here. If you're curious, all the gory details are in TR9401.