Architectural Processing with SPAM
Subject: Architectural Processing with SPAM
From: "W. Eliot Kimber"
I just realized that the SPAM application included with the latest version
of SP can generate architectural instances when used with the
-A (architecture) flag.
One of the things this means is that you can use SPAM to "extract" data
from documents just by mapping particular element types to architectural
forms in some architecture and then use SPAM to generate the architectural
instance containing only those forms.
For example, say you have an architecture for bibliographic information
(the sort of thing you put in document metadata--Docbook's is a good
example). An architecture for bibliographic data would be very useful. One
problem with bibliographic information is that it's often the most
idiosyncratic part of a document type. Metadata must reflect the specific
needs of an enterprise to be worth the trouble to create it. However, if
you want to collect a bunch of documents from different sources together
and build a coherent bibliography from them, you need to have some
regularity in the form of the bibliographic metadata.
While it's not too hard to define a set of bibliographic elements (e.g.,
MARC), it's very difficult to get any group of people larger than two to
agree on what the details of names and content models should be, for the
reasons given above.
Enter architectures. Architectures side-step the name issue by making
names arbitrary. Architectural forms can be mapped to any element type in
your DTD. Thus, if you can get agreement on a set of semantic elements no
matter what the names are, you're done--just define some arbitrary names
and get on with it (e.g., MARC, which uses numeric identifiers for the
fields--can't get much more arbitrary than that).
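To make the renaming idea concrete, here is a minimal sketch in Python (not SGML, and nothing to do with how SP works internally). It assumes two source vocabularies: the Doctitle/Author names used later in this post, and a second, purely hypothetical pair (Headline/Byline). Both are mapped onto the same two form names, so the resulting records line up regardless of what each DTD called its elements.

```python
# Sketch only: Headline and Byline are hypothetical element names;
# MF0001 and MF0002 are the form names used in this post.
DOCBOOK_LIKE = {"Doctitle": "MF0001", "Author": "MF0002"}
HOUSE_DTD = {"Headline": "MF0001", "Byline": "MF0002"}


def to_forms(record, mapping):
    """Rename mapped fields to their architectural form; drop the rest."""
    return {mapping[name]: text for name, text in record.items() if name in mapping}


first = to_forms(
    {
        "Doctitle": "This is the document title",
        "Author": "I Wrote This",
        "Date": "23 July 96",
    },
    DOCBOOK_LIKE,
)
second = to_forms({"Headline": "Another title", "Byline": "Someone Else"}, HOUSE_DTD)

# Both records now use the same field names, whatever the source DTD called them.
bibliography = [first, second]
for entry in bibliography:
    print(entry["MF0001"], "/", entry["MF0002"])
```

The only thing the two sources had to agree on was the meaning of the two fields, not what to call them.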
In fact, you could treat MARC as an architecture just by declaring a
document type that uses the MARC field numbers as architectural form
names:
<!-- Meta-DTD for MARC architecture -->
<!AFDR "ISO/IEC 10744:1992">
<!ENTITY % marcfields "MF0001 | MF0002" >
<!ELEMENT MarcDoc O O (%marcfields;)* >
<!ATTLIST MarcDoc
  MarcArc NAME #FIXED "marcdoc" >
<!ELEMENT MF0001 - O (#PCDATA)
  -- Marc field 0001: Document title (or whatever it really is) --
>
<!ELEMENT MF0002 - O (#PCDATA)
  -- Marc field 0002: Document author (or whatever it really is) --
>
<!-- End of architectural meta-DTD -->
Now you just need to map the elements in your document's metadata to the
appropriate MarcArc element types (corresponding to the different MARC
fields), and hey-presto, you can extract a "MARC" entry just by running
SPAM on it and specifying -A marcarc.
Because SP (and therefore SPAM) supports the SGML LINK feature, you can do
the architecture application in a link process declaration (LPD), removing
the need to put all the architectural stuff in your client DTD. An LPD for
a typical Docbook-like document using the above architecture would look
something like this:
<!LINKTYPE MarcStuff Book #IMPLIED [
<!-- Declare architecture naming attributes: -->
<!ATTLIST Book MarcArc NAME #FIXED "MarcDoc" >
<!ATTLIST Doctitle MarcArc NAME #FIXED "MF0001" >
<!ATTLIST Author MarcArc NAME #FIXED "MF0002" >
<!-- Declare the MarcArc architecture. -->
<!NOTATION MarcArc PUBLIC
HyTime Architecture Definition Document//EN"
  -- A document architecture conforming to the
     Architectural Form Definition Requirements of
     International Standard ISO/IEC 10744. --
>
<!NOTATION AFDRMeta PUBLIC "ISO/IEC 10744//NOTATION
AFDR Meta-DTD Notation//EN">
<!ATTLIST #NOTATION MarcArc
  ArcFormA -- Attribute name: architectural form --
    NAME #FIXED MarcArc
  ArcNamrA -- Attribute name: attribute renamer --
    NAME #FIXED MarcNames
  ArcDocF -- Architectural form name: document element --
    NAME #FIXED "MarcDoc"
  ArcDTD -- Architecture meta-DTD. Must be declared --
         -- As a data entity in the client document --
    NAME #FIXED "MarcArc"
>
<!-- Declare an entity for the MarcArc meta-DTD: -->
<!ENTITY MarcArc
  PUBLIC "-//MARC//DTD AFDR Marc Architecture Meta-DTD//NE"
  "marcarc.mdt" NDATA AFDR >
<!-- Define link rules for each element type
     mapped to an arch form: -->
<!LINK #INITIAL
  book [ ]
  doctitle [ ]
  author [ ]
>
]><!-- End of LINKTYPE declaration -->
The LPD serves to both declare the use of the architecture and map the
relevant element types to the relevant architectural forms. That means you
don't have to have any of that syntax in the base DTD if you don't want
to.
The LINKTYPE declaration follows the doctype declaration, so the full
document would look like this (using tag omission):
<!DOCTYPE Book ... >
<!LINKTYPE MarcStuff ...>
<Doctitle>This is the document title
<Author>I Wrote This
<Date>23 July 96
[In practice, you can put the meat of the LPD in an external parameter
entity, reducing the actual syntax in the document to something like:
<!LINKTYPE Marcstuff Book #IMPLIED [
<!ENTITY % stuff SYSTEM "marc2bk.lpd" >
%stuff;
]>
Entity managers could also provide the options of slipping LPDs in between
DOCTYPE declarations and document elements if they wanted to as a function
of a specialized storage manager.]
When you run this through SPAM with "-A marcarc" (to generate the
architectural instance) and "-a marcstuff" (to activate the MarcStuff
LPD), you get this result (I added line breaks for readability):
C:>spam -Amarcarc -momittag -amarcstuff testmarc.sgm
<MF0001>This is the document title</MF0001>
<MF0002>I Wrote This</MF0002>
SPAM has effectively extracted the relevant parts of what would normally
be a much larger document and put them into the desired format, namely the
MarcDoc document type. By using the "-p" flag of SPAM, I could have
included the DOCTYPE declaration for the architectural instance, which is
identical to the meta-DTD shown above.
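For readers who want to see the effect without SP installed, the extraction can be mimicked with a short Python sketch. This is not how SPAM is implemented, and it simplifies the AFDR rules considerably: it assumes the document has already been reduced to (element type, text) pairs, and it simply drops any element that has no architectural form. The mapping is the one declared by the LPD's ATTLIST declarations.

```python
# Mapping taken from the LPD: client element type -> architectural form.
FORM_OF = {"Book": "MarcDoc", "Doctitle": "MF0001", "Author": "MF0002"}

# The sample document as (element type, text) pairs; a stand-in for real parsing.
SAMPLE = [
    ("Doctitle", "This is the document title"),
    ("Author", "I Wrote This"),
    ("Date", "23 July 96"),
]


def architectural_lines(elements, form_of):
    """Emit one tagged line per mapped element; unmapped elements are dropped."""
    lines = []
    for name, text in elements:
        form = form_of.get(name)
        if form is not None:
            lines.append(f"<{form}>{text}</{form}>")
    return lines


for line in architectural_lines(SAMPLE, FORM_OF):
    print(line)
```

Date has no entry in the mapping, so it does not appear in the result, which is the same filtering effect described above.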
The meta-DTD, with the AFDR declaration removed, can then be used to
process the generated architectural instance normally.
Pretty darn cool, I think.
I'm not a MARC expert, but my guess is that extending this example to
handle the full MARC specification is mostly an exercise in typing.
W. Eliot Kimber, firstname.lastname@example.org
Senior SGML Consultant and HyTime Specialist
Passage Systems, Inc., 10596 N. Tantau Ave., Cupertino, CA 95014-3535
(408) 366-0300 (Cupertino), (512) 339-1400 (Austin),
"If I never had existed, would you still remember me?..."
--Austin Lounge Lizards, "1984 Blues"