SGML: SUBDOC Processing (Kimber)

From owner-adepters@arbortext.com Thu Apr  3 11:24:34 1997
Date: Thu, 03 Apr 1997 10:17:31 -0900
From: "W. Eliot Kimber" <eliot@isogen.com>
Subject: SUBDOC Processing (Long post)
To: adepters@arbortext.com

         -----------------------------------------------------------

Warning: long post.  The second part of this post is essentially the
content of the paper I will be submitting for SGML '97: Architectures: Why
I Demand Them.

In response to a couple of mails about my postings on SUBDOC processing, I
thought I would answer the two most common SUBDOC questions:

1. When the parser encounters a SUBDOC entity reference, it doesn't parse
the entity.  Why not?

A. Subdocument entities are like data entities: the parser merely reports
the reference; it doesn't automatically parse the referenced document.  However,
modern parsers (SP, Omnimark, Markit) should provide the ability to have
multiple parsing sessions with the output going to a single application
processor.  You can do this with ADEPT in the same way it processes
references to external text entities: just open the referenced entity as a
file.  With SP and Omnimark, it should be possible to have multiple parser
processes (but I don't know how to do it because I always use NSGMLS).
With NSGMLS, you can simply have the program processing the output of
NSGMLS call a new parser process when it sees a subdocument reference, e.g.:

    if ($attname eq $subdoc_attname) {
       # Put the current input stream on a stack
       push(@input_files, $input_fileID);
       # Parse the subdocument
       $input_fileID = &Parse_Subdoc(&Ent_SysID($attvalue{$level, $attname}));
       # When we reach the end of the current input file, we pop the stack.
       # If the stack is empty, we've finished processing the root document.
    }

Where the function Parse_Subdoc() shells out to NSGMLS and returns the file
handle of the output file created by NSGMLS (you could probably do
something cleaner with pipes, but I'm not that sophisticated of a Perl
programmer).
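To make the stack mechanics concrete, here is a toy sketch in Python (my
choice of language, not the Perl above; the event strings and the
Parse_Subdoc stand-in are invented) of a processor that suspends the current
input stream when it sees a subdoc reference and resumes it afterwards:

```python
# Sketch: stack-driven processing of a compound document.  parse_subdoc()
# stands in for launching a new NSGMLS session on the subdoc entity; here
# it just returns a canned in-memory event stream so the sketch runs.

SUBDOC_EVENTS = {
    "chap1": ["(CHAPTER", "-Chapter one text", ")CHAPTER"],
}

def parse_subdoc(sysid):
    """Stand-in for shelling out to NSGMLS on the subdoc entity."""
    return iter(SUBDOC_EVENTS[sysid])

def process(root_events):
    """Process an ESIS-like event stream, descending into subdoc refs."""
    stack = []                      # suspended parent input streams
    stream = iter(root_events)
    out = []
    while True:
        event = next(stream, None)
        if event is None:           # end of current input file
            if not stack:
                break               # root document finished
            stream = stack.pop()    # resume the parent document
        elif event.startswith("&SUBDOC "):
            stack.append(stream)    # suspend the current stream
            stream = parse_subdoc(event.split()[1])
        else:
            out.append(event)
    return out

events = ["(BOOK", "&SUBDOC chap1", ")BOOK"]
print(process(events))
# -> ['(BOOK', '(CHAPTER', '-Chapter one text', ')CHAPTER', ')BOOK']
```

The application downstream of process() sees one seamless event stream,
just as if the compound document had been a single document.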

If you think of ADEPT as representative of the textbook SGML processing
application (which it is), it should be clear that the processing
application (ADEPT) can handle input from multiple documents at once.  If
you can do this, then you can process subdocuments with a minimum of extra
effort because the application has access to all the data in the compound
document, just as if it had been in a single document.  You just need to
construct your processor so that it expects and can handle compound documents.

Note that the use of SUBDOCs for compound documents also means that you
don't necessarily need to process all the child documents in order to
validate the higher-level structures of the document.  This can be
essential for larger documents (like aircraft manuals) where validation by
ADEPT or some other parser would require checking out every component from
the database.  Even on subtrees of the document, this could be
prohibitively time consuming (as when the author does "check
completeness" in ADEPT).  Subdoc avoids that problem by letting you choose
when you resolve the dependencies and when you don't.  It also means you
can validate the parts in parallel because no subdoc can have parsing
dependencies on its parent or child documents.  If you had enough Omnimark
licenses and enough processors, you could validate an entire database of
subdocuments at once by firing off a zillion parallel parsing processes.
If you don't use subdocuments, you can never be sure the data is truly
valid unless you parse from the one true root document.  For a 50 000 page
aircraft maintenance manual, this may be prohibitive to do except at the
end of a development cycle, for example.
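A sketch of the parallel idea (mine, not from the original post): because
the jobs share nothing, you can just farm them out.  A stub stands in for
the real validator call (something like "nsgmls -s file"); the stub's toy
tag-balancing check is invented so the sketch is runnable:

```python
# Sketch: validating every subdocument in parallel.  Each subdoc is
# independently valid or invalid, so no job depends on any other.
from concurrent.futures import ThreadPoolExecutor

def validate(doc):
    """Stub for a real parser run (e.g. 'nsgmls -s doc').
    Here: True iff the toy open/close tags balance."""
    depth = 0
    for tok in doc.split():
        if tok.startswith("("):
            depth += 1
        elif tok.startswith(")"):
            depth -= 1
        if depth < 0:
            return False
    return depth == 0

subdocs = {
    "chap1.sgm": "(CHAP (TITLE )TITLE )CHAP",
    "chap2.sgm": "(CHAP (TITLE )CHAP",        # bad nesting
}

with ThreadPoolExecutor() as pool:
    results = dict(zip(subdocs, pool.map(validate, subdocs.values())))
print(results)
# -> {'chap1.sgm': True, 'chap2.sgm': False}
```

In practice each worker would spawn a parser process, so threads (rather
than extra Python processes) are enough to keep all the CPUs busy.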

2. As each subdocument can have its own DTD, won't the authors be able to
get around our DTD by using subdocuments?

A. Not any more than they can today.  You must remember that the document
type *IS A PROPERTY OF THE DOCUMENT*.  Even though we tend to think of DTDs
as being things you can control centrally, in fact, any control you think
you have is illusory.  That's because any SGML-knowledgeable author can
simply replace the reference to the central DTD with one of their own
construction, and there's nothing you can do to prevent it.  Even with
Adept, authors can always change their catalogs or compile their own
versions of the DTD.

In other words, the use of "our DTD" is a matter of policy.  It is a policy
that *CANNOT BE ENFORCED BY SGML MEANS*.  It can only be enforced by laying
down rules and watching to see that people follow them.  Some tools,
especially those that use compiled DTDs, provide some enforcement
capability by making it more difficult (but not impossible) to use variant
DTDs, but of course that works only if you use the tool.
The problem is that the concept of "our DTD" in SGML *HAS NO DIRECT
SYNTACTIC EXPRESSION OR REFERENCE*.  The document type definition is an
abstraction, weakly expressed through some set of SGML element and
attribute declarations.  But the *DTD file* is *NOT* the DTD.  It is merely
an external parameter entity for which SGML provides a shorthand form of
reference.  Its use is optional (despite the fact that tools like
ADEPT*Editor and Author/Editor require it, making them non-conforming
applications [NOTE: ADEPT*Editor does not claim to be a conforming SGML
application, even though it conforms in pretty much every way except its
failure to completely process internal DOCTYPE declaration subsets.]).
Note that Framemaker+SGML gets around this problem in two ways: 1) it's not
an SGML tool; 2) each document carries the full EDD, meaning that each
document can have its own declarations: there's no notion in Framemaker of
using EDDs by reference (at least not that I know of, certainly not in the
MIF spec I was working with a year ago). 

As the discussion of the doc_type() function points out, there is *NOTHING*
in the DOCTYPE declaration that actually tells you what the real document
type is: the "document type name" merely names the document element type.
The external identifier, if any, simply identifies an external parameter
entity.  Consider this perfectly valid DOCTYPE declaration:

<!DOCTYPE A SYSTEM >

What is the document type? In fact, as intended by the author (me), the
document type is HTML (this being the declaration for a subdocument
consisting of a single A element).  But there's nothing in the declaration
that tells you that.  Oops.

Thus, the problem is not that subdocs might allow authors to use a
different DTD, it's that you have no way of knowing *today* what DTD
authors are using.  Consider this perfectly valid document fragment:

<!DOCTYPE BOOK SYSTEM "book.dtd" >
<book>
  ...

Is this a Docbook document? Could be.  Don't know.  Can't tell by parsing.
As long as the document conforms to its own declarations, it is valid.  
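To put the point in code (a toy sketch of mine, not anything from the
thread): the only thing you can extract from a DOCTYPE declaration
syntactically is the generic identifier of the document element, which
tells you nothing about the abstract document type:

```python
# Sketch: the "document type name" in a DOCTYPE declaration merely names
# the document element type.  Nothing here says "this is Docbook" or
# "this is HTML".
import re

def doctype_name(decl):
    """Return the document type name, i.e. the document element's GI."""
    m = re.match(r"<!DOCTYPE\s+(\S+)", decl)
    return m.group(1) if m else None

print(doctype_name('<!DOCTYPE BOOK SYSTEM "book.dtd" >'))  # BOOK
print(doctype_name("<!DOCTYPE A SYSTEM >"))                # A
```

Both answers are correct and both are useless for deciding whether the
document is Docbook, HTML, or something homegrown.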

Obviously, the only part of the system that can determine whether or not
this is really a Docbook document is a processing application that knows
what Docbook documents look like and can tell when it doesn't have one.

Given that this is the case, this application can just as easily validate
subdocuments for conformance to its requirements as it can validate single
documents.

Thus, it doesn't matter whether you have subdocuments or not: the problem
is the same and the solution is the same.

Note that architectures can help here by providing a *validatable* way to
define and enforce document type policies.  This is because architectures
are *NOT* direct properties of documents but are only used by reference.
Thus, architectures can be centrally defined and enforced.  

Because architectures are defined using normal SGML declaration syntax, any
document can be automatically validated against the architectures it claims
derivation from.  

Thus, if instead of saying "the DTD" you say "the architecture", now you
have something you *can* automatically validate and enforce.  For example,
you might have a policy that says "all documents that support business
process X must conform to the Docbook architecture."  You can then validate
this conformance, irrespective of the declarations used for the document
itself.  Consider:

<!DOCTYPE BOOK SYSTEM "book.dtd" [
  <?ArcBase Docbook >
  <!NOTATION AFDRMeta PUBLIC "ISO/IEC 10744//NOTATION
                      AFDR Meta-DTD Notation//EN"
  >
  <!ENTITY docbook-dtd PUBLIC "-//Hal and O'Reilly//DTD Docbook//EN" 
                       "docbook.dtd" CDATA AFDRMeta
                        -- This entity points to the Docbook
                           declaration set -->
  <!NOTATION Docbook PUBLIC "-//Hal and O'Reilly//NOTATION 
                             Docbook Doctype Definition//EN" 
                     -- Note: this public ID points to the *documentation*
                        for Docbook -->
  <!ATTLIST #NOTATION Docbook
       ArcFormA NAME  #FIXED "Docbook"     
                -- Used to map client elements to Docbook forms --
       ArcDTD   CDATA #FIXED "docbook-dtd" 
                -- Reference to the architectural meta-DTD entity --
 >
]
>
<book>
  ...

These declarations define this document as being derived from the Docbook
"architecture" (which is just the Docbook DTD used as an architectural
meta-DTD without change).  An architecture-aware parser like SP can
validate conformance of the document to the meta-DTD.  Thus, you can now
tell through a general and automatic process whether or not my "BOOK"
document complies with my "derived from Docbook" policy.  If it does
comply, I also know that I can always transform the document into a Docbook
instance if need be.

When applied to the use of SUBDOC, it should be clear that you can now
impose policies for all members of a compound document by defining which
architectures they must conform to.  Because there is no *syntactic*
difference between architecture declarations and doctype declarations, you
can use any existing document type as an architecture (even if the derived
documents use exactly the same declarations).

Thus, when using architectures in this way, there is *NO DANGER* of authors
subverting your business rules and document content policies because you
can enforce those policies by checking conformance to the appropriate
architectural meta-DTDs.  Problem solved.  Not only that, but it is no
longer necessary to impose draconian restrictions on the DTDs of individual
documents.  Thus, sub-enterprises within the scope of the general policy
can specialize from the architecture to meet their unique local
requirements without affecting the ability of their documents to interact
with other documents within the same policy scope.

ADEPT implementation note: ADEPT has all the functionality you need to
implement architecture validation. All it requires is compiling your
architectural meta-DTD as an ADEPT doctype, then creating a general
architectural document instance derivation process that creates a new
document according to the architectural mapping defined in the client
document's own DTD (which is defined through defaulting rules defined in
the HyTime standard's AFDR annex and through explicit mapping using the
"architectural form naming atribute, e.g., "docbook" in the example above).
 If this new document validates against the meta-DTD, the document conforms
to the policy.  If it doesn't, it's toast.  This same process can also be
used to create architectural instances that can then be pushed through
down-stream processes that operate at the architectural level (printing,
DynaText, database loading, etc.).
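A toy sketch of that derivation step (Python, my invention; the real rules
are the AFDR defaulting rules in the HyTime standard, which this ignores
almost entirely): each client element maps to the architectural form named
by its form-naming attribute, defaulting to its own generic identifier:

```python
# Sketch: deriving an architectural instance from a client document.
# Toy tree representation: (gi, attrs, children).  The form-naming
# attribute ("Docbook", per the ArcFormA declaration above) gives the
# explicit mapping; absent that, the element's own name is the form name.

def derive(node, arc_att="Docbook"):
    """Map a client element tree onto its architectural instance."""
    gi, attrs, children = node
    form = attrs.get(arc_att, gi)   # explicit form name, else default to GI
    return (form, [derive(c, arc_att) for c in children])

client = ("SAFETY-WARNING", {"Docbook": "important"},
          [("PARA", {}, [])])
print(derive(client))
# -> ('important', [('PARA', [])])
```

The derived tree is then just an ordinary document instance, which is why
it can be validated against the meta-DTD, or pushed through downstream
processes, with no architecture-aware tooling at all.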

Note that this same process can also be used to resolve subdoc references
to create a single architectural instance from a compound document.  This
is how you get around the fact that most down-stream tools don't support
subdoc.  However, as most publishing processes extract from the development
database into a publishing database anyway, you probably already have a
stage in your process where both the architectural instance creation and
subdoc reference resolution could be put without affecting the tools on
both sides of that stage.

Cheers,

E.
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
Highland Consulting, a division of ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 95202.  214.953.0004
www.isogen.com
</Address>