SGML DOCTYPE

Date: Tue, 13 Jan 1998 09:58:12 -0800
From: "W. Eliot Kimber" <eliot@isogen.com>
Organization: ISOGEN International Corp.
To: xml-dev@ic.ac.uk
Subject: Re: DOCTYPE (was Re: Announcement: SAX 1998-01-12 Draft)

Peter Murray-Rust wrote:
> In conclusion (IMO) the DOCTYPE statement really only serves to
> identify the address of the external subset. It is equivalent to:
> 
> <!DOCTYPE FOO [
> <!ENTITY % foo "foo.dtd">
> %foo;
> ]>

Exactly correct. The DOCTYPE declaration tells you *nothing* about the
abstract type of the document (that is, the general class of documents
of which the author intended it to be an instance).

> How do we determine the TYPE of a document?  There is no good mechanism.

Not true.  All that is necessary is to provide some way to point to a
separate definition of the type.  The SGML architecture mechanism,
defined in ISO/IEC 10744:1997 and implemented in the SP parsers (as well
as in purpose-built code), provides just such a mechanism.  In December,
James and I submitted for WG4 approval an enhancement to the formal
mechanism that lets it be used with XML documents. See
http://www.ornl.gov/sgml/wg8/document/1957.htm.

The idea is a simple one: you use a PI to associate a local name with the
"type" and then use a URL or public identifier to point to the
documentation and the DTD that defines the type.

For example, ISOGEN has defined for its own use a base architecture from
which a variety of specific document types can be derived.  I can invoke
the use of this architecture like so:

<?XML 1.0 ?>
<?IS10744:arch 
  name="ISOBase" 
  public-id="+//IDN isogen.com//NOTATION ISOGEN Base Architecture//EN"
  dtd-system-id="http://www.isogen.com/ISOBase/isobase.mdt"
?>
<foo ISOBase="paragraph">Foo is now clearly a kind of ISOBase
paragraph</foo>

By default, the architecture ("type") name is used as the name of the
attribute you use to map local elements to element types in the
architecture (which types you can determine by looking at the
architectural DTD).
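
A minimal sketch of that lookup in Python, assuming the PI's contents can
be read as simple name="value" pairs.  The function name read_mappings and
the regular expression are hypothetical, not part of any standard or of SP.
The <?XML 1.0 ?> line is left out because the parser used here would not
accept it as written.

```python
import re
from xml.parsers import expat

DOC = """<?IS10744:arch
  name="ISOBase"
  public-id="+//IDN isogen.com//NOTATION ISOGEN Base Architecture//EN"
  dtd-system-id="http://www.isogen.com/ISOBase/isobase.mdt"
?>
<foo ISOBase="paragraph">Foo is now clearly a kind of ISOBase
paragraph</foo>"""

# Assumption: the PI data is a run of name="value" pseudo-attributes.
PSEUDO_ATTR = re.compile(r'([\w.-]+)\s*=\s*"([^"]*)"')

def read_mappings(xml_text):
    """Return (declared architectures, element-to-architecture mappings)."""
    archs = {}      # architecture name -> pseudo-attributes from its PI
    mappings = []   # (local element name, architecture name, architectural type)

    def on_pi(target, data):
        if target == "IS10744:arch":
            attrs = dict(PSEUDO_ATTR.findall(data))
            archs[attrs["name"]] = attrs

    def on_start(name, attrs):
        # By default the architecture name doubles as the mapping attribute name.
        for arch_name in archs:
            if arch_name in attrs:
                mappings.append((name, arch_name, attrs[arch_name]))

    # No namespace processing, so the colon in the PI target is an
    # ordinary name character.
    parser = expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return archs, mappings

archs, mappings = read_mappings(DOC)
print(mappings)
```

Nothing here touches a DOCTYPE declaration; the PI and the attribute carry
everything the sketch needs.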

Note that the presence or absence of a DOCTYPE declaration is
irrelevant--all the information you need to interpret the Foo element as
an ISOBase paragraph is in the instance.  The only thing a DOCTYPE
declaration would add would be the convenience of setting a default
value for the ISOBase attribute.

Note also that it requires no parser-level code to interpret and support
the mapping because it's using normal XML syntax: PIs and attributes. 
It also doesn't require anything like the colonized names because the
name mapping is done through an attribute, which has the advantage that
the same element can be mapped to different architectures at the same
time.  For example, I might want to also indicate that the Foo element
corresponds to something in the RDF spec:

<?XML 1.0 ?>
<?IS10744:arch 
  name="ISOBase" 
  public-id="+//IDN isogen.com//NOTATION ISOGEN Base Architecture//EN"
  dtd-system-id="http://www.isogen.com/ISOBase/isobase.mdt"
?>
<?IS10744:arch
  name="rdf"
  public-id="+//IDN w3c.org//NOTATION Resource Definition Format//EN"
  dtd-system-id="http://www.w3c.org/RDF/rdf.dtd"
?>
<foo ISOBase="paragraph" rdf="some-rdf-element-type"
>Foo is now clearly a kind of ISOBase paragraph</foo>

When you're doing ISOBase-related processing, you ignore the RDF mapping,
and when you're doing RDF-related processing, you ignore the ISOBase
mapping.  Or you can consider both at once; it's up to your processor.
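
As a rough illustration of that choice, the Python sketch below reads the
same element under one architecture or under both.  The function name
element_views is hypothetical, and the architecture names are passed in
directly rather than read from the PIs to keep the example short.

```python
from xml.parsers import expat

DOC = """<foo ISOBase="paragraph" rdf="some-rdf-element-type"
>Foo is now clearly a kind of ISOBase paragraph</foo>"""

def element_views(xml_text, arch_names):
    """Return (local name, {architecture: architectural type}) per element,
    looking only at the architectures the caller cares about."""
    views = []

    def on_start(name, attrs):
        views.append((name, {a: attrs[a] for a in arch_names if a in attrs}))

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return views

iso_only = element_views(DOC, ["ISOBase"])      # the rdf mapping is ignored
both = element_views(DOC, ["ISOBase", "rdf"])   # both mappings at once
print(iso_only)
print(both)
```

The element itself never changes; only the set of mapping attributes the
processor chooses to look at does.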

The document can be validated against either of the architectural DTDs
by using a tool like SP, which has that facility built in, or by
explicitly generating the document that reflects the mapping and then
validating it against the architectural DTD.  For example, the ISOBase
"architectural instance" of the above is:

<?XML 1.0?>
<!DOCTYPE paragraph SYSTEM "http://www.isogen.com/ISOBase/isobase.mdt">
<paragraph>Foo is now clearly a kind of ISOBase paragraph</paragraph>
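
Generating such an instance explicitly could look something like the
Python sketch below.  It is a simplification under stated assumptions:
architectural_instance is a hypothetical name, elements that carry no
mapping attribute are dropped while their text passes through, and all
attributes are discarded, which is less than a full architecture engine
such as the one in SP would do.

```python
from xml.parsers import expat
from xml.sax.saxutils import escape

DOC = """<foo ISOBase="paragraph" rdf="some-rdf-element-type"
>Foo is now clearly a kind of ISOBase paragraph</foo>"""

def architectural_instance(xml_text, arch_name):
    """Rewrite the document using the element types named by arch_name."""
    out = []
    stack = []  # architectural type (or None) for each open element

    def on_start(name, attrs):
        arch_type = attrs.get(arch_name)
        stack.append(arch_type)
        if arch_type is not None:
            out.append("<%s>" % arch_type)

    def on_end(name):
        arch_type = stack.pop()
        if arch_type is not None:
            out.append("</%s>" % arch_type)

    def on_text(data):
        # Re-escape character data so the output stays well-formed.
        out.append(escape(data))

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return "".join(out)

result = architectural_instance(DOC, "ISOBase")
print(result)
```

The output could then be given a DOCTYPE declaration that points at the
architectural DTD and handed to any validating parser.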

That's all there is to it.  The idea that DOCTYPE declarations tell you
something useful is one of the top five Big Lies of SGML.

For more on the subject of architectures, see
http://www.isogen.com/papers/archintro.html, which goes into more
detail about using architectures within an XML context.

If anyone would like to see real code that does architecture-based
processing, I would be happy to provide it in any of the languages in
which I've done it (Perl, Rexx, DSSSL, ACL, VisualBasic--sorry, no Java,
only because I haven't had a need to do Java programming yet--note the
preponderance of *interpreted* languages in this list :-).

Cheers,

Eliot
