[Mirrored from: http://www.isogen.com/papers/subdoc.html]

Re-Usable SGML: Why I Demand SUBDOC

W. Eliot Kimber

This paper was originally presented at SGML '96.

Abstract

This paper discusses the issues of SGML re-use and shows why they can only be solved generally through the use of subdocuments. The paper explores the following general issues:

General text entities are not re-usable
How to enable interoperation of documents with possibly different document types?
How to effect the cross-document addressing needed when a single document is composed of many subdocuments?

The SGML standard only defines two object types that can have independent existence: documents and subdocuments. Thus it is clear that only documents and subdocuments can be reliably re-used. In particular, external general text entities are not useful candidates for general re-use. My plea then is for tools to add the functions necessary to support the use of subdocuments for the re-use of semantic fragments. For most applications, such as browsers, this means treating the content of subdocument entities as though it had occurred in a general text entity for the purpose of processing (not parsing). For parsers, it means providing a mechanism to either parse multiple documents in parallel or to suspend the parsing of the parent document while the subdocument is parsed and then integrating the parsing result of the subdocument with the data resulting from the parsing of the parent document. For editors, it means allowing the declaration and editing of subdocument entities. Editors, in particular, may also need to provide ways to define constraints on what document types or architectures are to be allowed for subdocuments in specific application environments (families of DTDs).

I think that these conventions provide a clear and simple way to make the use of subdocuments in general less problematic and more fruitful. The full promise of SGML cannot be realized until the problem of fragment re-use is solved and I am firmly convinced that subdocuments are the key to that solution.

Biography

W. Eliot Kimber

Eliot is a Senior SGML Consultant and HyTime specialist for Passage Systems Inc. Before joining Passage Systems in 1994, Eliot was one of the chief architects (with Wayne Wohler and Don Day) of IBM's IBM ID Doc project, an effort to provide an SGML-based authoring and production system for IBM's product documentation. Eliot is an active participant in the SGML and related standards processes, most recently with the HyTime Technical Corrigendum and the W3C's XML project. Pending completion of the Technical Corrigendum, Eliot will publish a book on HyTime, Practical Hypermedia: An Introduction to HyTime, part of the Charles F. Goldfarb Series on Open Information Management from Prentice Hall Professional Technical Reference.

Problem: General text entities are not re-usable

A basic principle of re-use is that to be re-usable an object must be largely self contained and self describing. This is a principle both of object-oriented programming (expressed as "encapsulation" in object-oriented terms) and of SGML. SGML documents are self contained and, through the document type definition, self describing (remembering that a document type definition includes the document type declaration and supporting documentation and other defining objects, such as style sheets). Thus, for an object to be generally re-usable it must be self contained and self describing.

Many users of SGML use (or try to use) external text entities for information re-use. External text entities are not self contained and self describing (and therefore not generally re-usable) because:

1. IDs and entity names cannot be guaranteed unique by SGML-defined means when a general text entity is used in more than one document.
2. General text entities cannot be validated in isolation.
3. There are no SGML-defined structural constraints on general text entities. Thus you cannot predict, in the general case, the shape of the structures contained in a general text entity.

In short, general text entities can only be processed by an SGML parser in the context in which they are referenced: as far as SGML is concerned, they have no meaningful separate existence.[Note: Other applications can, of course, process text entities in any way they please, but such processing cannot be SGML processing because SGML only defines processing of text entities in terms of their reference from within a larger, complete, document.]

The most severe problem with general text entity re-use, and the one that cannot be solved by policies that define restrictions on them, is ID and entity name uniqueness. Problems 2 and 3 can be solved, to some degree, by defining enterprise- or tool-specific policies for the structure of general text entities. For example, many enterprises require general text entities to consist of a single element (e.g., a single chapter, a single table, etc.). This is a reasonable constraint but one that cannot be enforced through SGML-defined means. This means that you cannot depend on documents created outside the scope of your policy to meet this requirement. Note that defining the policy in this way is almost the same as using subdocuments because any such entity can be made into a subdocument entity simply by adding the appropriate document type declaration (more about this later).

The ID and entity name uniqueness problem, by contrast, cannot realistically be solved by defining policies. The only policy that can avoid this problem is one that treats a set of documents as a single name space so that IDs and entity names are unique across all the documents in the set. There are two severe problems with this policy.

First, you must have a way to enforce this policy, which usually means having some automated system that assigns and manages IDs and entity names. This can be done at small scales, say fewer than 50 authors working together, but at larger scales, especially in distributed enterprises, it becomes prohibitively difficult if not impossible. Second, even in the case of a small enterprise with a completely closed set of documents, you cannot guarantee that the document set will remain closed for the life of the documents (assuming they have a life span of more than a couple of years). For example, your enterprise might merge with a different enterprise, requiring that your document sets be integrated. When you consider the potential of interconnecting many documents over an intranet or the Internet, it becomes clear that any mechanism based on a single, universal ID name space will break down quite quickly.

As shown in Figure 1, if the same general text entity is used by two different documents, it is possible for an ID in the entity to conflict with one in one of the documents.

Figure 1. Re-Use of General Text Entities

The document on the left, the one for which the entity was originally created, can use the entity without difficulty. However, when the authors of the right-hand document decide to use the entity in their document, they discover that they already have an element with an ID of "bar". This results in a conflict that may be unresolvable without considerable expense to one document or the other.

For example, the author of the first document may have many references to the division in the entity. Changing the ID of that division to something that does not conflict with the second document would require reworking all the references in the first document. Likewise, the second document may also have many references to its division with the ID "bar". Therefore, changing that ID so it didn't conflict with the ID of the re-used division would require changing all those references. When this problem is scaled up so that it occurs for many documents, it should become clear that the conflict may be unresolvable for practical considerations.
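The conflict can be sketched in markup. The file names, the Book, Div, Title, and P element types, and the book.dtd declaration set below are hypothetical, chosen only for illustration; the ID "bar" is the one used in the discussion above. The first listing is the shared general text entity:

```sgml
<!-- shared.sgm: external general text entity. It has no DOCTYPE
     declaration of its own, so it cannot be parsed by itself. -->
<Div id="bar">
<Title>Shared Division</Title>
<P>Text intended for re-use.</P>
</Div>
```

The second listing is a document that tries to re-use it:

```sgml
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!ENTITY shared SYSTEM "shared.sgm">
]>
<Book>
<Div id="bar">
<Title>A Division Written for This Book</Title>
<P>This division already owns the ID "bar".</P>
</Div>
<!-- The replacement text of the entity is parsed as part of this
     document, so its id="bar" is a second use of the same ID
     value and is reported as an error. -->
&shared;
</Book>
```

Neither author did anything wrong in isolation; the error only appears when the two name spaces are merged by the entity reference.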

Obviously, if your goal is to develop an SGML system that enables and encourages a large amount of re-use you must find a solution to this problem that avoids the problems of name conflict and does not depend on fragile and unenforceable policies.

Solution: Use subdocuments for re-usable fragments

Subdocument entities eliminate the re-use problems inherent in general text entities because they are themselves complete documents. This means that they can be validated independently, that they represent unique ID and entity name spaces, and that there are strict rules for their construction. Thus subdocuments satisfy the criteria for re-usable objects. Figure 2 shows how the re-used division in Figure 1 can be re-cast as a subdocument entity.

Figure 2. Re-Use of Subdocument Entities

The re-used division has been made into a subdocument simply by adding a DOCTYPE declaration where the document type is "DIV" and, in this case, the external declaration subset is the same one used for the other documents. Because the original general entity consisted of a single element, it was easy to turn it into a subdocument.
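In markup, the change amounts to something like the following sketch. The file name div.sgm, the entity name shared, the Title and P element types, and book.dtd are hypothetical; the document type name (written here as Div) is the one named in the description above. The sketch also assumes that the SGML declaration in use enables the SUBDOC feature. First, the re-used division as a complete document:

```sgml
<!-- div.sgm: now a complete document with its own ID and
     entity name space -->
<!DOCTYPE Div SYSTEM "book.dtd">
<Div id="bar">
<Title>Shared Division</Title>
<P>Text intended for re-use.</P>
</Div>
```

Then the prolog of a document that re-uses it (prolog only, not a complete document):

```sgml
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!-- The SUBDOC keyword marks the entity as a subdocument,
       parsed separately from the document that refers to it -->
  <!ENTITY shared SYSTEM "div.sgm" SUBDOC>
]>
```

Because div.sgm is parsed as a document in its own right, an element with id="bar" in the Book document no longer collides with the one inside the subdocument.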

The use of subdocuments does pose some potential problems:

As defined by ISO 8879, there is no direct way to define within a document what the rules are for the content of a subdocument
SGML does not provide, by itself, a way to do cross-document addressing.

The lack of a cross-document addressing mechanism is a general problem in SGML that affects more than just subdocument use. Pending the SGML revision, when a way to do simple cross-document ID references will almost certainly be added to SGML, you must find some way to do addressing among documents and subdocuments.

Constraining the content of subdocuments is a problem because, unlike general text entities, which are parsed as part of the documents that refer to them, subdocument entities can have any document types and content they want, at least as far as SGML parsing is concerned. Of course, processing applications may have certain expectations and rules about the content of subdocument entities and can define and enforce these in a number of ways, including ungraceful failure should they encounter something they don't expect.

It is important to realize, however, that this potential always exists in an SGML system. There is nothing in a conforming SGML system that prevents an author from replacing or changing their document type declaration so that it does not meet the processing system's expectations. Even when applications do not support the SUBDOC feature it is still incumbent upon them to recognize and respond gracefully and appropriately to data that is unexpected.

The key point is that the thing that defines the requirements of what a document can contain is the processing application, not the SGML parser. SGML document type declarations can only express part of an application's total set of constraints (this is why thinking that SGML validation is enough is such a dangerous fallacy). An SGML parser can only provide syntactic validation; it cannot provide semantic validation. Only the processing application (and the human users of it) can do that.

An example of giving a processing system data it doesn't expect is using the style sheet for one DTD with documents of another DTD. The documents may be syntactically valid as far as SGML parsing can tell, but clearly the system is not getting the data it expects, namely documents in the DTD the style sheet was designed for. What happens in a case like this is entirely a function of how the processing application handles such situations.

In addition, different types of applications vary in their sensitivity to semantic constraints. SGML editors, for example, are largely independent of the document type used (at least for the purpose of distinguishing data from markup and enabling basic editing). Composition systems, on the other hand, may be very sensitive to the semantic constraints because the formatting is tied directly to specific element types and combinations of element types. Search and retrieval systems may or may not be sensitive to semantics, depending on how their indexing is done.

Thus, the use of subdocuments doesn't really change the normal level of potential for anarchy in SGML systems. It does, however, make it more convenient to inadvertently give a processing application data it doesn't expect. The problem then is to define ways that applications and document type designers can express their expectations about what is meaningful content for subdocuments.

As a rule, it is easier to define systems that are less general and therefore tied to a fixed set of semantics. However, it is not impossible, or even necessarily that difficult, to build systems that are more general and less tied to a specific set of semantics. In any case, there are now formal mechanisms for expressing the degree of semantic constraint a particular system needs.

Defining Expectations: Conventions for the Use of Subdocuments

The problems raised by subdocument use can largely be solved by conventions that are easy to define and use. These conventions represent policies that can be expressed and validated by SGML facilities or that are easy for applications to validate and enforce.

The first convention is an agreement that subdocuments should be used only for data that is semantically a fragment of the document that refers to the subdocument. In other words, the data in the subdocument could be included in the document as a general text entity (more or less). The SGML standard doesn't define any precise meaning for subdocument references. I think it makes the most sense to only use subdocuments for semantic fragments of documents. If an entity is not a semantic fragment of the document that refers to it, then it should be defined as a document entity, not a subdocument entity.[Note: In hindsight, SGML should have an entity type of "DOCUMENT", parallel to SUBDOC. A proposal to this effect has been made for the SGML revision. SGML does define a public text class of "DOCUMENT". The HyTime standard also defines a notation of SGML to be used for declaring document entities.]

In addition to this general principle, I have identified three key conventions that make the use of subdocuments for re-use more reliable:

Limit references to subdocuments to ENTITY attributes
Require subdocuments to use the same (or compatible) DTDs
Use HyTime indirect addressing for ID references

Limiting References to ENTITY Attributes

The purpose of this convention is to make it clearer where subdocument entity references are allowed within a particular document type. As currently defined, SGML allows subdocument entity references anywhere that replaceable character data is allowed.[Note: There are many, myself included, who feel that SUBDOC and data entity references should only be allowed from ENTITY attributes and not allowed in character content. The ability to impose this restriction may become an optional feature of SGML in the SGML revision.] Thus, it is syntactically possible to refer to subdocument entities from places where it is probably not considered sensible to do so. What would it mean, for example, to refer to a subdocument entity within the content of a paragraph (assuming the subdocument does not consist of a valid subelement of the paragraph)? I think this is a reasonable concern (and one that results from what is probably a bug in SGML anyway).

Fortunately, it is easy to avoid this situation by restricting subdocument references to ENTITY attributes. This has the advantage that you can define in a document type where it is meaningful to make such references. This lets you match your document type to your re-use policies. It also means that processing applications will only see such references at points where they expect to see them. Finally, because the reference is made from an element, you can define policies about the content of subdocuments in terms of the element type making the reference and the document element of the subdocument. These policies can be easily checked using the information available to SGML tools. The most restrictive form of this policy is requiring that subdocuments have the same element type as the referencing element and use the same declaration set as the referencing document.

For example, in a particular document type you might only want to allow division elements to be made into re-usable objects. You could express this either by allowing an ENTITY attribute on Division elements or by creating a special referential element type that is only allowed where Division elements are allowed. Figure 3 shows these two techniques.

Figure 3. Referring to Subdocuments Via ENTITY Attributes

<!-- Division that can refer to a subdocument entity for
     its content -->
<!ELEMENT Division - O (%Div.Content;) >
<!ATTLIST Division
     Content  ENTITY #CONREF
     -- Reference to subdocument entity that contains
        the content of this division --
>

<!-- Specialized element type for referring to subdocument
     entities. -->
<!ELEMENT Re-Used-Division - O EMPTY >
<!ATTLIST Re-Used-Division
     Division  ENTITY #REQUIRED
     -- Reference to subdocument entity that contains
        the content of this division --
>
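
A document instance might use these two declarations as sketched below. The entity name introdiv, the file name intro.sgm, and the Book document type are hypothetical, and the sketch assumes an SGML declaration that enables SUBDOC and raises NAMELEN enough for names such as Re-Used-Division.

```sgml
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!ENTITY introdiv SYSTEM "intro.sgm" SUBDOC>
]>
<Book>
<!-- Technique 1: the #CONREF attribute is specified, so this
     Division is treated as empty (no end-tag); the application
     takes its content from the subdocument -->
<Division Content="introdiv">
<!-- Technique 2: a dedicated referencing element that is
     declared EMPTY -->
<Re-Used-Division Division="introdiv">
</Book>
```

In both cases the reference appears only where the document type explicitly provides for it, which is the point of the convention.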

Depending on the need for constraint, you could allow any element to refer to a subdocument entity for its content. If you are using enabling architectures (as described in the forthcoming HyTime Technical Corrigendum) you can define your policies in terms of architectures and architectural forms. A typical policy would be that the subdocument must be of the same architectural form as the referencing element, but not necessarily the same element type. This enables the interoperation and re-use of information among documents that share a common architecture but that may be specialized for specific kinds of data.[Note: This approach works very well with the "microdocument" systems championed by Omnimark Technologies. Small documents with smaller, more focused document types improve the information and make systems more efficient while the use of architectures provides a formal definitional framework for the system and information as a whole.] Rules for references to element types can be expressed using the reference type control feature of the SGML Extended Facilities Architecture (defined in the HyTime Technical Corrigendum).

Another reason for only referring to subdocument entities from ENTITY attributes is that inline references to subdocument entities, unlike references to general text entities, can only occur in PCDATA or RCDATA content. This means that unless you use ENTITY attributes for subdocument references, there may be contexts where subdocument entity references need to be allowed but cannot be without also allowing character data (the use of subdocuments for divisions being a likely example). For example, if the content model for the body of your documents is "(Chapter+)", you would not be able to refer to re-used chapters held in subdocuments by using direct references to subdocument entities because character data is not allowed where Chapter elements are allowed.
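The Chapter case can be sketched as follows. The Body content model is the one given above; the remaining declarations, the Content attribute, the entity name chap1, and the file name chap1.sgm are hypothetical.

```sgml
<!ELEMENT Body    - O (Chapter+) >
<!ELEMENT Chapter - O (Title, P+) >
<!ELEMENT (Title | P) - O (#PCDATA) >
<!ATTLIST Chapter
     Content  ENTITY #CONREF
     -- Subdocument entity that holds the re-used chapter --
>
<!ENTITY chap1 SYSTEM "chap1.sgm" SUBDOC >
```

With these declarations, an instance can mix re-used and local chapters:

```sgml
<Body>
<!-- A direct "&chap1;" reference is not possible here: Body has
     element content, so character data is not allowed -->
<Chapter Content="chap1">
<Chapter>
<Title>A Local Chapter</Title>
<P>Text written directly in this document.</P>
</Chapter>
</Body>
```

The content model stays "(Chapter+)", so the document type does not have to admit character data just to make re-use possible.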

Requiring Subdocuments to Use Compatible DTDs

This convention is an extension of the basic convention that subdocuments are semantic fragments of their parent documents. In the most restrictive case, you require that subdocuments use exactly the same document type declaration (not counting entity declarations) as the parent document.

This requirement can be less restrictive depending on the nature of the document type and the flexibility of the applications that support it. For example, you might have a family of related document types, all of which can be handled by your processing system (perhaps you have a set of master style sheets that will work for any document type in the family). In this case, it probably makes sense to allow subdocuments to use any document type in the family.

In the least restrictive case, your processing system can handle anything it gets and therefore imposes no restrictions on the document types allowed for subdocuments, leaving it purely to authors to define what constitutes a valid semantic fragment.

The key is that you, as the application or document type designer, get to say what compatible means for your application and you have a continuum of choices from the most restrictive (exactly the same document type) to the least restrictive (any document type).

Enabling architectures provide a formal means to define compatibility. Two document types derived from the same architecture are inherently compatible. Use of architectures can be strict, with policies that require that every element type be derived from an architectural form in a known architecture. Architecture use can be loose, requiring only that some parts of a document type be derived from some architecture. Different processing applications and different processing tasks will require different degrees of strictness.

Note that given a typical industry- or enterprise-standard DTD, it is easy to treat that DTD as an architecture from which more specialized document types can be derived. Existing processing defined in terms of the standard DTD continues to work as before. In this situation, requiring that all element types be derived from forms in the standard architecture makes sense and ensures that documents can always be processed completely in terms of the architecture. Note that the SP suite of tools provides facilities for automatically deriving "architectural" documents (documents whose document type is the architectural meta-DTD) from documents derived from the architecture.

This means that, given the strict architecture use policy, any document derived from the standard architecture (which was the standard DTD) can be turned into a document in the standard DTD by this automatic process. Thus, with this policy in effect, you can process documents with specialized document types using tools written for a standard DTD simply by first doing the automatic transform. This gives authors and subenterprises the ability to create specialized (and quite likely, smaller) document types for their own use without jeopardizing the ability of those documents to interoperate and without requiring significant change to existing tools.
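As a rough sketch of what deriving a specialized element type from an architecture looks like in a DTD, the declarations below follow the architecture conventions as I understand them from the later published edition of ISO/IEC 10744 and from SP's architecture support. The Technical Corrigendum was still pending when this paper was written, so the exact form of the processing instruction should be checked against the final text. The architecture name stddoc, the public identifier, and the Procedure, Title, and Step element types are all hypothetical, and the architecture support attributes (which, among other things, locate the meta-DTD) are omitted.

```sgml
<!-- Declare that this DTD uses a base architecture named "stddoc";
     the notation declaration identifies the architecture -->
<?IS10744 ArcBase stddoc>
<!NOTATION stddoc PUBLIC
     "-//Example//NOTATION AFDR ARCBASE Standard Document Architecture//EN" >

<!-- A specialized element type derived from the Division form of
     the standard architecture. By default the attribute that names
     the form has the same name as the architecture. -->
<!ELEMENT Procedure - O (Title, Step+) >
<!ATTLIST Procedure
     stddoc  NAME #FIXED "Division"
     -- Architectural form in the stddoc architecture --
>
```

Under a strict policy, every element type in the specialized DTD would carry such a mapping, so an architecture-aware processor could present each Procedure to existing tools as a Division.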

Using HyTime Indirect Addressing for ID References

Because subdocuments are separate documents syntactically, any reference to an element within a subdocument from its parent document (or other subdocuments included by the parent) is an inter-document reference, just like a reference to another book. This means you need a way to represent cross-document references. HyTime provides a complete and robust referential mechanism that is itself an ISO standard (ISO/IEC 10744).

The HyTime function needed to support cross-document ID references is minimal and can be easily implemented by tool vendors and in purpose-built tools. Such customizations already exist for tools like ADEPT*Editor, and common SGML processing tools like Omnimark, NSGMLS, and Perl are more than capable of providing the same functions. You need only support the nameloc and nmlist architectural forms and need not even support multiple locations (meaning that each nameloc refers to a single target element).

In an ideal system, the creation of subdocuments and the management of references among them would be managed automatically by editing tools and document management systems. Any direct ID reference can be easily converted to an indirect HyTime address using a very simple algorithm. Doing cross-document addressing of this sort at larger scales requires some sort of document and link management tool, but the need for such systems is there whether you use subdocuments or not.
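One plausible reading of that conversion, offered as an illustration rather than a definitive procedure: for each direct reference whose target now lives in a subdocument, create a nameloc with a locally unique ID, give it an nmlist that names the target ID and the subdocument entity, and point the original reference at the nameloc. The nameloc and nmlist forms and their nametype and docorsub attributes are HyTime's; the XRef element type, its RefID attribute, and the entity, file, and ID names are hypothetical.

```sgml
<!ELEMENT nameloc - O (nmlist*) >
<!ATTLIST nameloc
     HyTime   NAME   #FIXED "nameloc"
     id       ID     #REQUIRED >
<!ELEMENT nmlist  - O (#PCDATA) >
<!ATTLIST nmlist
     HyTime   NAME   #FIXED "nmlist"
     nametype (entity | element) entity
     docorsub ENTITY #IMPLIED >
<!ENTITY shared SYSTEM "div.sgm" SUBDOC >
```

In the instance, the reference changes as follows:

```sgml
<!-- Before: a direct ID reference, which only works when the
     target is in the same document -->
<XRef RefID="bar">

<!-- After: the reference targets a local nameloc, which in turn
     names the ID "bar" inside the subdocument entity "shared" -->
<XRef RefID="barloc">
<nameloc id="barloc">
<nmlist nametype="element" docorsub="shared">bar</nmlist>
</nameloc>
```

The referencing element still uses an ordinary local ID reference, so only the resolution step needs HyTime awareness.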

You could also use other addressing schemes, such as TEI locators or URLs; which you use will depend upon the nature of your application and the scope of interoperability you need.

A proposal to the SGML revision has been made (and accepted in principle) to extend the ID reference syntax to enable simple cross-document ID references without the need for indirection or other addressing syntaxes. This proposal will almost certainly be accepted (because it's something that should have been in SGML from the beginning). It will not, however, replace HyTime's addressing methods. The change to SGML will simply allow you to bind the name of a document entity to an ID within that entity. It will not define any indirect addressing mechanisms.

Note that unless you are creating documents that live completely separate lives and are never delivered as anything but paper, you need a robust mechanism for cross-document addressing in any case. Having solved that problem, you can then apply it to the use of subdocument entities. So while it may seem that the use of subdocuments raises a difficult problem by requiring cross-document addressing, in fact the problem is always there and subdocument use only increases the volume of addresses you may have to manage.

In Conclusion

The SGML standard only defines three object types that can have independent existence: documents, subdocuments, and non-SGML entities. Thus it is clear that only documents and subdocuments can be reliably re-used. In particular, external general text entities are not useful candidates for general re-use (and external parameter entities can only be re-used in very controlled environments).

My plea then is for tools to add the functions necessary to support the use of subdocuments for the re-use of semantic fragments. For most applications, such as browsers, this means treating the content of subdocument entities as though it had occurred in a general text entity for the purpose of processing (not parsing). For parsers, it means providing a mechanism to either parse multiple documents in parallel or to suspend the parsing of the parent document while the subdocument is parsed and then integrating the parsing result of the subdocument with the data resulting from the parsing of the parent document.[Note: I am encouraged by the fact that Omnimark Version 3 includes the ability to process multiple documents in a single session, enabling the direct processing of subdocuments.] For editors, it means allowing the declaration and editing of subdocument entities. Editors, in particular, may also need to provide ways to define constraints on what document types or architectures are to be allowed for subdocuments in specific application environments (families of document types).

This is also a plea for the implementation of architecture-based systems: systems that understand the formal architecture mechanisms defined by the HyTime Technical Corrigendum and implemented by James Clark in his SP parser. Tools should provide easy and efficient mechanisms for basing processing on architectural forms, not just element types and attributes, and should provide facilities for defining policies in architectural terms. For example, it should be possible to configure SGML editors so that they only allow references to subdocuments that have particular document types or are derived from particular architectures.

I think that these conventions provide a clear and simple way to make the use of subdocuments practical, fruitful, and compelling. The full promise of SGML cannot be realized until the problem of fragment re-use is solved and I am firmly convinced that subdocuments are the key to that solution.