SGML: Architectural Processing with SPAM

Subject:      Architectural Processing with SPAM
From:         "W. Eliot Kimber" 
Date:         1996/07/24
Message-Id:   <01bb799a.eea90780$>
Newsgroups:   comp.text.sgml

I just realized that the SPAM application included with the latest version
of SP can generate architectural instances when used with the 
-A (architecture) flag. 

One of the things this means is that you can use SPAM to "extract" data
from documents just by mapping particular element types to architectural
forms in some architecture and then use SPAM to generate the architectural
instance containing only those forms.

For example, say you have an architecture for bibliographic information
(the sort of thing you put in document metadata--Docbook's is a good
example). An architecture for bibliographic data would be very useful. One
problem with bibliographic information is that it's often the most
idiosyncratic part of a document type. Metadata must reflect the specific
needs of an enterprise to be worth the trouble to create it. However, if
you want to collect a bunch of documents from different sources together
and build a coherent bibliography from them, you need to have some
regularity in the form of the bibliographic metadata.

While it's not too hard to define a set of bibliographic elements (e.g.,
MARC), it's very difficult to get any group of people larger than two to
agree on what the details of names and content models should be, for the
reasons given above.

Enter architectures. Architectures side-step the name issue by making
names arbitrary. Architectural forms can be mapped to any element type in
your DTD. Thus, if you can get agreement on a set of semantic elements no
matter what the names are, you're done--just define some arbitrary names
and get on with it (e.g., MARC, which uses numeric identifiers for the
fields--can't get much more arbitrary than that).

In fact, you could treat MARC as an architecture just by declaring a
document type that uses the MARC field numbers as architectural form
names, e.g.:

<!-- Meta-DTD for MARC architecture -->
<!AFDR "ISO/IEC 10744:1992">
<!ENTITY % marcfields "MF0001 | MF0002" >
<!ELEMENT MarcDoc O O (%marcfields;)* >
<!ATTLIST MarcDoc
   MarcArc  NAME #FIXED "marcdoc"
>
<!ELEMENT MF0001 - O (#PCDATA)
  -- Marc field 0001: Document title (or whatever it really is) --
>
<!ELEMENT MF0002 - O (#PCDATA)
  -- Marc field 0002: Document author (or whatever it really is) --
>
<!-- End of architectural meta-DTD -->

Now you just need to map the elements in your document's metadata to the
appropriate MarcArc element types (corresponding to the different MARC
fields), and hey-presto, you can extract a "MARC" entry just by running
SPAM on it and specifying -A marcarc.
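
As a rough sketch of what that mapping looks like when done directly in a
client DTD: the Report, Heading, and Writer element types below are made
up for illustration, standing in for the metadata elements of some other
document type. The supporting declarations for the MarcArc architecture
itself (its notation and the entity for the meta-DTD) are assumed to be
present as well and are not shown here.

<!-- Fragment of a hypothetical Report DTD -->
<?ArcBase MarcArc>
<!-- Map this DTD's element types to MarcArc forms: -->
<!ATTLIST Report   MarcArc NAME  #FIXED "MarcDoc" >
<!ATTLIST Heading  MarcArc NAME  #FIXED "MF0001"  >
<!ATTLIST Writer   MarcArc NAME  #FIXED "MF0002"  >

In a real DTD the MarcArc attribute would have to be folded into each
element type's existing attribute definition list rather than declared
separately. The element type names differ from document type to document
type, but the form names they point at stay the same.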

Because SP (and therefore SPAM) supports the SGML LINK feature, you can do
the architecture application in a link process declaration (LPD), removing
the need to put all the architectural stuff in your client DTD. An LPD for
a typical Docbook-like document using the above architecture would look
something like this:

<!LINKTYPE MarcStuff Book #IMPLIED [
 <?ArcBase MarcArc>
 <!-- Declare architecture naming attributes: -->
 <!ATTLIST Book     MarcArc NAME  #FIXED "MarcDoc" >
 <!ATTLIST Doctitle MarcArc NAME  #FIXED "MF0001"  >
 <!ATTLIST Author   MarcArc NAME  #FIXED "MF0002"  >
 <!-- Declare the MarcArc architecture. -->
 <!NOTATION MarcArc PUBLIC
           "ISO/IEC 10744:1992//NOTATION
            HyTime Architecture Definition Document//EN"
            -- A document architecture conforming to the
               Architectural Form Definition Requirements of
               International Standard ISO/IEC 10744.     --
 >
 <!NOTATION AFDR PUBLIC
           "ISO/IEC 10744:1992//NOTATION
            AFDR Meta-DTD Notation//EN">
 <!ATTLIST #NOTATION MarcArc
           ArcFormA -- Attribute name: architectural form --
                    NAME     #FIXED MarcArc
           ArcNamrA -- Attribute name: attribute renamer --
                    NAME     #FIXED MarcNames
           ArcDocF  -- Architectural form name: document element --
                    NAME     #FIXED "MarcDoc"
           ArcDTD   -- Architecture meta-DTD. Must be declared --
                    -- as a data entity in the client document --
                    NAME     #FIXED "MarcArc"
 >
 <!-- Declare an entity for the MarcArc meta-DTD: -->
 <!ENTITY MarcArc
          PUBLIC "-//MARC//DTD AFDR Marc Architecture Meta-DTD//EN"
                 "marcarc.mdt" NDATA AFDR >

 <!-- Define link rules for each element type
      mapped to an arch form: -->
 <!LINK #INITIAL
     book [ ]
     doctitle [ ]
     author [ ]
 >
]><!-- End of LINKTYPE declaration -->

The LPD serves to both declare the use of the architecture and map the
relevant element types to the relevant architectural forms. That means you
don't have to have any of that syntax in the base DTD if you don't want
to.

The LINKTYPE declaration follows the doctype declaration, so the full
document would look like this (using tag omission):

<!DOCTYPE Book ... >
<!LINKTYPE MarcStuff ...>
<Doctitle>This is the document title
<Author>I Wrote This
<Date>23 July 96

[In practice, you can put the meat of the LPD in an external parameter
entity, reducing the actual syntax in the document to something like:

<!LINKTYPE Marcstuff Book #IMPLIED [
 <!ENTITY % stuff SYSTEM "marc2bk.lpd" >
 %stuff;
]>

Entity managers could also provide the option of slipping LPDs in between
DOCTYPE declarations and document elements, if they wanted to, as a
function of a specialized storage manager.]

When you run this through SPAM with "-A marcarc" (to generate the
architectural instance) and "-a marcstuff" (to activate the MarcStuff
LPD), you get this result (I added line breaks for readability):

C:>spam -Amarcarc -momittag -amarcstuff testmarc.sgm
<MF0001>This is the document title</MF0001>
<MF0002>I Wrote This</MF0002>

SPAM has effectively extracted the relevant parts of what would normally
be a much larger document and put them into the desired format, namely the
MarcDoc document type. By using the "-p" flag of SPAM, I could have
included the DOCTYPE declaration for the architectural instance, which is
identical to the meta-DTD shown above.

The meta-DTD, with the AFDR declaration removed, can then be used to
process the generated architectural instance normally.
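
A minimal sketch of such a standalone document, assuming the meta-DTD has
been copied, minus its AFDR declaration, to a file called marcdoc.dtd (a
hypothetical name) and that it declares MF0001 and MF0002:

<!DOCTYPE MarcDoc SYSTEM "marcdoc.dtd">
<MF0001>This is the document title</MF0001>
<MF0002>I Wrote This</MF0002>

The MarcDoc start-tag and end-tag are omitted here, which the "O O" in its
element declaration permits when OMITTAG is in effect. A document like
this should then parse with any ordinary SGML parser (nsgmls, for
instance) with no architectural processing involved.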

Pretty darn cool, I think. 

I'm not a MARC expert, but my guess is that extending this example to
handle the full MARC specification is mostly an exercise in typing. 

<Address HyTime=bibloc>
W. Eliot Kimber,
Senior SGML Consultant and HyTime Specialist
Passage Systems, Inc., 10596 N. Tantau Ave., Cupertino, CA 95014-3535
(408) 366-0300 (Cupertino), (512) 339-1400 (Austin)</Address>
"If I never had existed, would you still remember me?..." 
--Austin Lounge Lizards, "1984 Blues"