If You Can Name It, You Can Claim It!
Copyright © 2000 by Arbortext, Inc.
04 Apr 2000
Issue 3
System identifiers suck! The fact that XML requires me to supply system
identifiers for external references, and the fact that these identifiers are
required to be Uniform Resource Identifiers (URIs), are frequent sources of
considerable irritation.
In this column, we'll explore how you can use OASIS Catalog files (or their
XML equivalent) to avoid these difficulties.
Using Catalog files became a lot easier earlier this month when Arbortext
released its Java Catalog classes to the XML community.
Using these classes, it's simple to add Catalog support to your favorite Java
parser. (Equivalent support for parsers in other languages should be fairly
easy to construct from the free, Open Source Java classes, although
Arbortext has no immediate plans to undertake this effort.)
You can download
the classes or view the JavaDoc
API Documentation online. You can also read Arbortext's press release about the code.
But first, let's consider the scope of the problem.
There are several common ways that the system identifier problem raises
its ugly head:
-
I have an XML document that I want to publish on the web or
include in the distribution of some piece of software. On my system, I keep
the doctype of the document in some local directory, so my doctype declaration
reads:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"file:///n:/share/doctypes/docbook/xml/docbookx.dtd">
As soon as I distribute this document, I immediately begin getting error
reports from customers who can't read the document because they don't have
DocBook installed at the location identified by the URI in my document. Drat!
-
Or I remember to change the URI before I publish the document:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd">
And the next time I try to edit the document, I get errors
because I happen to be working on my laptop on a plane somewhere and can't
get to the net. Blast!
-
Just as often, I get tripped up this way: I'm working collaboratively
with a colleague. She's created initial drafts of some documents that I'm
supposed to review and edit. So I grab them and find that I can't open or
publish them because I don't have the same network connections she has or
I don't have Epic installed in the same place. And if I change the system
identifiers so they work on my system, she has the same problems when I send
them back to her. Drat and blast!
All of this makes me want to pull my hair out because there's a perfectly
good solution for this problem: public identifiers. They're defined in XML;
they just aren't used very effectively because users currently cannot rely
on applications resolving them in an interoperable manner.
Public identifiers provide global, unique names for entities independent
of their storage location.
Despite opinions to the contrary[1], I maintain that names and addresses are distinct. If I claim
that I want version 3.1 of the DocBook DTD, or the 1911 edition of Webster's
dictionary, or Issue 2 of Standard Deviations from Norm,
that's what I want, irrespective of its location on the net (or even if it's
available on the net). While it is possible to view a URL as a name, I
don't think that's the natural interpretation.
There are currently two ways that I might reasonably assign an address-independent
name to an object: public identifiers or Uniform
Resource Names (URNs)[2].
Public identifiers are part of XML
1.0. They can occur in any form of external entity declaration. They
allow you to give a globally unique name to any entity. For example, the XML
version of DocBook V4.0 is identified with the following public identifier:
-//OASIS//DTD DocBook XML V4.0//EN
You'll see this identifier in the two doctype declarations I used earlier.
This identifier gives no indication of where the resource (the DTD) may be
found, but it does uniquely name the resource. That public identifier, now
and forever, refers to the XML version of DocBook V4.0.
URNs are a form of URI. Like public identifiers, they give a location-neutral,
globally unique name to an entity. For example, OASIS might choose to identify
the XML version of DocBook V4.0 with the following URN:[3]
urn:x-oasis:docbook-xml-v4.0
Like a public identifier, a URN can now and forever refer to a specific
entity in a location-independent manner.
Having extolled the virtues of location-independent names, it must be
said that a name isn't very useful if you can't find the thing it refers to.
In order to do that, you must have a name resolution mechanism that allows
you to determine what resource is referred to by a given name.
One important feature of this mechanism is that it can allow resources
to be distributed, so you don't have to go to http://www.oasis-open.org/docbook/xml/4.0/docbookx.dtd to get the XML version of DocBook V4.0, if you have a local copy.
There are a few possible resolution mechanisms:
-
The application just "knows". Sure, it sounds
a little silly, but this is currently the mechanism being used for namespaces.
Applications know what the semantics of namespaced elements are because they
recognize the namespace URI.
-
OASIS Catalog files provide a mechanism for mapping public
and system identifiers, allowing resolution to both local and distributed
resources. This is the resolution scheme we're going to consider for the balance
of this column.
-
Many other mechanisms are possible. There are already a few
for URNs, including at least one built on top of DNS, but they aren't widely
deployed.
Catalog files are straightforward text files that describe a mapping
from names to addresses. Here's a simple one:
PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
"docbook/xml/docbookx.dtd"
SYSTEM "urn:x-oasis:docbook-xml-v4.0"
"docbook/xml/docbookx.dtd"
DELEGATE "-//Arbortext//" "file:///c:/epic/doctypes/catalog"
This file maps both the public identifier and the URN I mentioned earlier
to a local copy of DocBook on my system. If the doctype declaration uses the
public identifier for DocBook, I'll get DocBook regardless
of the (possibly bogus) system identifier! Likewise, my local copy of DocBook
will be used if the system identifier contains the DocBook URN.
The DELEGATE entry instructs the resolver to use the catalog "c:\epic\doctypes\catalog"
for any public identifier that begins with "-//Arbortext//".
The advantage of DELEGATE in this case is that I don't have to parse that
catalog file unless I encounter a public identifier that I reasonably expect
to be in there.
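To make the idea of a catalog as "a text file that maps names to addresses" concrete, here is a minimal sketch of reading such a file into lookup tables. It is not the com.arbortext.catalog implementation: the class name MiniCatalogParser and its methods are made up for illustration, it handles only PUBLIC, SYSTEM, and DELEGATE entries, and it assumes the catalog contains no comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is not the com.arbortext.catalog implementation.
class MiniCatalogParser {
    final Map<String, String> publicEntries = new LinkedHashMap<>();
    final Map<String, String> systemEntries = new LinkedHashMap<>();
    final Map<String, String> delegateEntries = new LinkedHashMap<>();

    // Split catalog text into bare keywords and the contents of quoted strings.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '"' || c == '\'') {
                int end = text.indexOf(c, i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated string at offset " + i);
                }
                tokens.add(text.substring(i + 1, end));
                i = end + 1;
            } else {
                int start = i;
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            }
        }
        return tokens;
    }

    // Collect PUBLIC, SYSTEM, and DELEGATE entries; anything else is skipped.
    static MiniCatalogParser parse(String text) {
        MiniCatalogParser catalog = new MiniCatalogParser();
        List<String> t = tokenize(text);
        int i = 0;
        while (i < t.size()) {
            Map<String, String> target = null;
            switch (t.get(i)) {
                case "PUBLIC": target = catalog.publicEntries; break;
                case "SYSTEM": target = catalog.systemEntries; break;
                case "DELEGATE": target = catalog.delegateEntries; break;
                default: break;
            }
            if (target != null && i + 2 < t.size()) {
                // Keyword followed by its two arguments.
                target.put(t.get(i + 1), t.get(i + 2));
                i += 3;
            } else {
                i++;
            }
        }
        return catalog;
    }
}
```

Feeding the three-entry catalog above to MiniCatalogParser.parse would leave one entry in each of the three maps, keyed by the public identifier, the URN, and the "-//Arbortext//" prefix respectively.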
Catalog files are officially defined by OASIS
Technical Resolution TR9401, but for our purposes, the following informal
description will suffice[4].
A Catalog is a text file that contains a sequence of entries. Of the
13 types of entries that are possible, we'll consider only the following six
in this article: BASE, CATALOG, OVERRIDE, DELEGATE, PUBLIC, and SYSTEM:
-
BASE uri
-
Catalog entries can contain relative URIs. The BASE entry changes the
base URI for subsequent relative URIs. The initial base URI is the URI of
the catalog file.
-
CATALOG catalogURI
-
Adds the catalog file specified by the catalogURI
to the end of the current catalog. This allows one catalog to refer to another.
-
OVERRIDE YES|NO
-
The OVERRIDE setting determines whether or not system identifiers specified
in the catalog are to be used in favor of system identifiers supplied in the
document. Suppose you have an entity in your document for which both a public
identifier and a system identifier have been specified, and the catalog only
contains a mapping for the public identifier (e.g., a matching PUBLIC catalog
entry). If OVERRIDE is YES, the system identifier supplied in the matching
PUBLIC catalog entry will be used. If it is NO, the system identifier in the
document will be used. (If the catalog contained a matching SYSTEM catalog
entry giving a mapping for the system identifier, that mapping would have
been used, the public identifier would never have been considered, and the
setting of OVERRIDE would have been irrelevant.)
Generally, the purpose of catalogs is to override the system identifiers
in XML documents, so override should be enabled in your catalogs.
-
DELEGATE partialPublicId catalogURI
-
The DELEGATE entry specifies that public identifiers that begin with partialPublicId
should be resolved using the catalog specified by the catalogURI.
If multiple DELEGATE entries match the public identifier, they will each be
searched, starting with the longest partialPublicId
and continuing to the shortest.
The DELEGATE entry differs from the CATALOG entry in the following way:
alternate catalogs referenced with a CATALOG entry are parsed and included
in the current catalog. Delegated catalogs are only considered, and consequently
only loaded and parsed, if necessary. Delegated catalogs are also used instead
of the current catalog, not as part of the current catalog.
-
PUBLIC publicId systemId
-
Maps the public identifier publicId to the
system identifier systemId.
-
SYSTEM systemId otherSystemId
-
Maps the system identifier systemId to the
alternate system identifier otherSystemId.
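Two of the entry descriptions above lend themselves to a short illustration: how a relative URI is resolved against a base, and the order in which delegated catalogs are tried. The following is only a sketch under my own assumptions, not code from the Arbortext classes; the class name CatalogEntrySketch, both method names, and the catalog names used in the usage note are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is not the com.arbortext.catalog implementation.
class CatalogEntrySketch {

    // Resolve a relative URI from a catalog entry against the current base URI
    // (initially the URI of the catalog file, later whatever BASE sets).
    static URI resolveAgainstBase(String baseUri, String relative) {
        return URI.create(baseUri).resolve(relative);
    }

    // Return the delegated catalogs to search for publicId, ordered from the
    // longest matching partial public identifier to the shortest.
    static List<String> delegatedCatalogs(String publicId, Map<String, String> delegates) {
        List<String> prefixes = new ArrayList<>();
        for (String prefix : delegates.keySet()) {
            if (publicId.startsWith(prefix)) {
                prefixes.add(prefix);
            }
        }
        // Longest prefix first.
        prefixes.sort((a, b) -> Integer.compare(b.length(), a.length()));
        List<String> catalogs = new ArrayList<>();
        for (String prefix : prefixes) {
            catalogs.add(delegates.get(prefix));
        }
        return catalogs;
    }
}
```

With DELEGATE entries for "-//Arbortext//" and "-//Arbortext//DTD", a public identifier beginning "-//Arbortext//DTD" would have the catalog for the longer prefix searched first. Nothing is loaded here; the method only decides which catalogs are candidates, which is the point of delegation.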
Catalog resolution occurs in the following order:
-
If a SYSTEM entry matches the specified system identifier,
it is used.
-
If a PUBLIC entry matches the specified public identifier
and either OVERRIDE is YES or no system identifier is provided, it is used.
-
If no exact match was found for the public identifier, but
it matches one or more of the partial public identifiers specified in DELEGATE
entries, the delegated catalogs are searched for a matching public identifier.
(Note that the system identifier is never provided to the delegated catalogs,
so a SYSTEM entry in a delegated catalog that would have matched the system
identifier of the entity in question is never considered.)
-
If there's still no match, ENTITY, DOCTYPE, and NOTATION entries
are considered. (These entries aren't discussed in this article, but are fully
described in the technical
resolution.)
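The first two steps of this order can be sketched in a few lines. This is an illustration under stated assumptions, not the Arbortext resolver: the class ResolutionOrderSketch and its methods are hypothetical, each PUBLIC entry simply remembers the OVERRIDE setting that was in effect where it appeared, and the DELEGATE, ENTITY, DOCTYPE, and NOTATION steps are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of steps 1 and 2 only; not the com.arbortext.catalog code.
class ResolutionOrderSketch {

    // A PUBLIC entry plus the OVERRIDE setting in effect where it appeared.
    private static final class PublicEntry {
        final String systemId;
        final boolean override;

        PublicEntry(String systemId, boolean override) {
            this.systemId = systemId;
            this.override = override;
        }
    }

    private final Map<String, String> systemEntries = new HashMap<>();
    private final Map<String, PublicEntry> publicEntries = new HashMap<>();

    void addSystem(String systemId, String otherSystemId) {
        systemEntries.put(systemId, otherSystemId);
    }

    void addPublic(String publicId, String systemId, boolean override) {
        publicEntries.put(publicId, new PublicEntry(systemId, override));
    }

    // Returns the mapped system identifier, or null if the catalog has no
    // applicable match and the document's own system identifier should be used.
    String resolve(String publicId, String systemId) {
        // Step 1: a matching SYSTEM entry always wins.
        if (systemId != null && systemEntries.containsKey(systemId)) {
            return systemEntries.get(systemId);
        }
        // Step 2: a matching PUBLIC entry is used only if OVERRIDE is YES
        // or the document supplied no system identifier.
        if (publicId != null) {
            PublicEntry entry = publicEntries.get(publicId);
            if (entry != null && (entry.override || systemId == null)) {
                return entry.systemId;
            }
        }
        // Steps 3 and 4 (DELEGATE; ENTITY, DOCTYPE, NOTATION) are omitted here.
        return null;
    }
}
```

Note how the OVERRIDE check only matters when the document supplies both identifiers; a PUBLIC entry recorded under OVERRIDE NO still applies when there is no system identifier to defer to.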
If you work with Java applications using a parser that supports the
SAX Parser interface, adding Catalog support to your applications
is a snap. The SAX Parser interface includes an entityResolver
hook designed to provide an application with an opportunity to do this sort
of indirection. The com.arbortext.catalog package implements
the full OASIS Catalog semantics and provides an appropriate class that implements
the SAX entityResolver interface.
All you have to do is set up a com.arbortext.catalog.CatalogEntityResolver
on your parser's entityResolver hook. The code listing
in Example 1 demonstrates how straightforward this is:
Example 1. Adding a CatalogEntityResolver to Your Parser
import com.arbortext.catalog.*;
...
CatalogEntityResolver cer = new CatalogEntityResolver();
Catalog myCatalog = new Catalog();
myCatalog.loadSystemCatalogs();
cer.setCatalog(myCatalog);
...
yourParser.setEntityResolver(cer);
The system catalogs are loaded from the system catalog path, stored
in the System property xml.catalog.files. (For all the
gory details about these classes, consult the
API documentation.) You can explicitly parse your own catalogs (perhaps
taken from command line arguments or a Preferences dialog) instead of or in
addition to the system catalogs:
myCatalog.parseCatalog(catalogFile);
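If you're curious what sits behind that hook, the following sketch shows the general shape of a SAX entity resolver using only the org.xml.sax interfaces that ship with the JDK. It is not the CatalogEntityResolver source: MapEntityResolver and its map method are hypothetical names, and the lookup is a plain in-memory table rather than a real catalog.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical sketch; the real CatalogEntityResolver consults a Catalog instead.
class MapEntityResolver implements EntityResolver {
    private final Map<String, String> publicToSystem = new HashMap<>();

    // Register a mapping from a public identifier to a system identifier.
    void map(String publicId, String systemId) {
        publicToSystem.put(publicId, systemId);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String mapped = (publicId == null) ? null : publicToSystem.get(publicId);
        if (mapped == null) {
            // Returning null asks the parser to open the system identifier itself.
            return null;
        }
        // Point the parser at the mapped location instead of the document's URI.
        InputSource source = new InputSource(mapped);
        source.setPublicId(publicId);
        return source;
    }
}
```

A resolver like this is installed the same way as in Example 1, by passing it to the parser's setEntityResolver method; the parser then calls resolveEntity before it opens each external entity.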
The Catalog class can also load XML Catalogs. At present, the only XML
Catalog format recognized is John Cowan's XML
Catalog format (formerly XCatalogs). XML Catalogs are indistinguishable
from OASIS Catalogs to your application; all you have to do to enable XML
Catalog processing is supply the name of a class that implements the SAX Parser
interface. In Example 2, the Apache XML Project's Xerces
parser is used.
Example 2. Adding Support for XML Catalogs
import com.arbortext.catalog.*;
...
CatalogEntityResolver cer = new CatalogEntityResolver();
Catalog myCatalog = new Catalog();
myCatalog.setParserClass("org.apache.xerces.parsers.SAXParser"); // support XML Catalogs
myCatalog.loadSystemCatalogs();
cer.setCatalog(myCatalog);
...
yourParser.setEntityResolver(cer);
The Arbortext Catalogs distribution includes two test programs that
you can use to see how this all works. In order to use these programs, you
must have the catalog.jar and catalog-apps.jar
files on your CLASSPATH. The eresolve
program also requires a recent version of Xerces
on your CLASSPATH.
The README file in the catalog distribution describes
each of the demonstration programs in more detail.
The catalog program takes several catalogs and a
request and displays the system identifier returned by the Catalog.
You can see this program in action in Example 3.
Example 3. Using the catalog Command
>java catalog -d 0 -c /share/doctypes/catalog PUBLIC "-//OASIS//DTD DocBook XML V4.0//EN"
Ignoring system catalogs.
Set debug to: 0
Adding catalog: /share/doctypes/catalog
Resolving PUBLIC:
Public: -//OASIS//DTD DocBook XML V4.0//EN
System: null
Resolved: file:/share/doctypes/docbook/xml/docbookx.dtd
The second program, eresolve, uses the CatalogEntityResolver
class. A complete test environment is provided in the test
directory:
-
catalog
-
This is a Catalog with a few simple entries:
OVERRIDE YES
PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN" "testpub.xml"
SYSTEM "urn:x-arbortext:test-system-identifier" "testsys.xml"
OVERRIDE NO
PUBLIC "-//Arbortext//TEXT Test Override//EN" "override.xml"
-
test.xml
-
This is a test document that contains several external entities:
<!DOCTYPE test [
<!ENTITY testpub PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN"
"bogus-system-identifier.xml">
<!ENTITY testsys SYSTEM "urn:x-arbortext:test-system-identifier">
<!ENTITY testovr PUBLIC "-//Arbortext//TEXT Test Override//EN"
"testovr.xml">
]>
<test>
&testpub;
&testsys;
&testovr;
</test>
This XML document demonstrates several Catalog features:
If parsed without a catalog, the parse will fail since bogus-system-identifier.xml
won't be found (and neither would the URN, unless you happen to have some
other URN resolution mechanism running).
If parsed with the included catalog, the following substitutions will
be made:
-
&testpub; will be replaced with the
contents of testpub.xml, due to the mapping provided
by the first PUBLIC entry in the catalog.
-
&testsys; will be replaced with the
contents of testsys.xml, due to the mapping provided
by the SYSTEM entry in the catalog.
-
&testovr; will be replaced with the
contents of testovr.xml, due to the system identifier
given in its entity declaration; the mapping provided by the second PUBLIC
entry in the catalog is not used because the entity declaration did provide
a system identifier and the matching public identifier occurs where OVERRIDE
is NO.
You can see this process in action in Example 4.
Example 4. Using the eresolve Command
>java eresolve -d 2 -c test\catalog test\test.xml
Set debug to 2
Adding catalog: test\catalog
Loading catalog: test\catalog
Parsing test\test.xml
Resolved: -//Arbortext//TEXT Test Public Identifier//EN
file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testpub.xml
Resolved: urn:x-arbortext:test-system-identifier
file:/N:/viewstores/nwalsh_saffron/Epic/src/xml/catalog/test/testsys.xml
Done parsing test\test.xml
This last example demonstrates Catalog resolution in a real application.
The Catalog distribution includes a modified version of the primary driver
from XT, com.arbortext.sax.xsl.Driver. It differs from
the com.jclark.xsl.sax.Driver class only in the addition
of Catalog support. You can use it to convert the document in the test
directory to HTML, as shown in Example 5. You must have the xt.jar
and xp.jar files on your CLASSPATH
in order to run this example.
Note that this example uses the system property xml.catalog.files
to set the catalog path because the Driver does not support
a command-line option to specify catalog files.
We hope that these classes become a standard part of all the major XML
Parsers. As XML processors incorporate this
code, users will be able to utilize public identifiers in XML documents
with the confidence that they will be able to move those documents from one
system to another and around the Web knowing that they will also be able to
refer to the appropriate external file or Web page.
Norman
Walsh lives in beautiful, rural western Massachusetts where he hacks
XML for fun and profit. He can name lots of things that he's unable to locate;
his car keys, for example.