Re-Usable SGML: A Plea for SUBDOC

A poster session first presented at SGML '95.

Author: W. Eliot Kimber, Passage Systems Inc.

Copyright © Passage Systems Inc., 1995, 1996

If you are using Panorama or PanoramaPRO, you can view the SGML version of this paper.

Problem: General text entities are not re-usable

External general text entities are not generally re-usable because:

The most severe problem, and the one that cannot be solved by policies that define restrictions on general text entities, is ID and entity name uniqueness. As Figure 1 (Re-Use of General Text Entities) shows, if the same general text entity is used by two different documents, an ID in the entity can conflict with an ID in one of those documents.

Figure 1. Re-Use of General Text Entities
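
A minimal sketch of the kind of conflict described above. The file names, element types, and ID values are hypothetical, chosen only to illustrate the point; the three parts of the listing stand for three separate files.

<!-- shared.sgm: a general text entity used by two documents -->
<Division id="intro">
<Title>Introduction</Title>
<P>Shared text.</P>
</Division>

<!-- Document A: no other element has the ID "intro", so the
     entity can be used without error -->
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!ENTITY shared SYSTEM "shared.sgm">
]>
<Book>
<Division id="overview">...</Division>
&shared;
</Book>

<!-- Document B: already uses the ID "intro", so the same
     entity now causes a duplicate ID error -->
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!ENTITY shared SYSTEM "shared.sgm">
]>
<Book>
<Division id="intro">...</Division>
&shared;
</Book>

Nothing in shared.sgm changed between the two uses. Whether it is usable depends entirely on the document that happens to refer to it.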

Solution: Use subdocuments for re-usable fragments

Subdocument entities eliminate the re-use problems inherent in general text entities because they are themselves complete documents. This means that they can be validated independently, that they represent unique ID and entity name spaces, and that there are strict rules for their construction. Figure 2. Re-Use of Subdocument Entities shows how the re-used division in Figure 1 can be re-cast as a subdocument entity.

Figure 2. Re-Use of Subdocument Entities
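
As a rough sketch of the same idea, the shared division can be given its own document type declaration and declared with the SUBDOC keyword. The file names, IDs, and the Content attribute on Division are assumptions made for illustration, and the SGML declaration in use is assumed to enable the feature (for example, SUBDOC YES 1).

<!-- intro.sgm: now a complete document with its own ID and
     entity name spaces -->
<!DOCTYPE Division SYSTEM "book.dtd">
<Division id="intro">
<Title>Introduction</Title>
<P>Shared text.</P>
</Division>

<!-- Parent document: its own ID "intro" does not conflict with
     the ID "intro" inside the subdocument -->
<!DOCTYPE Book SYSTEM "book.dtd" [
  <!ENTITY shared SYSTEM "intro.sgm" SUBDOC>
]>
<Book>
<Division id="intro">...</Division>
<Division Content="shared">
</Book>

Because intro.sgm has its own prolog, it can be parsed and validated by itself, without reference to any document that re-uses it.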

However, subdocuments pose a potential problem of their own: as defined by ISO 8879, there is no direct way to define within a document what the rules are for the content of a subdocument. In other words, unlike a general text entity, which is parsed as part of the document that refers to it, a subdocument entity can have any document type and content it wants, at least as far as SGML parsing is concerned. Of course, processing applications may have certain expectations and rules about the content of subdocument entities and can define and enforce these in a number of ways, including ungraceful failure should they encounter something they don't expect.

It is important to realize, however, that this potential always exists in an SGML system. There is nothing in a conforming SGML system that prevents an author from replacing or changing their document type declaration into something that does not meet the processing system's expectations. Even when applications do not support the SUBDOC feature, it is still incumbent upon them to recognize and respond gracefully and appropriately to data that is unexpected.

An example of giving a processing system data it doesn't expect is using the style sheet for one DTD with documents of another DTD. The documents may be syntactically valid as far as SGML parsing can tell, but clearly the system is not getting the data it expects to get, namely documents in the DTD the style sheet was designed for. What happens in a case like this is entirely a function of how the processing application handles such situations.

Thus, the use of subdocuments doesn't really change the normal level of potential for anarchy in SGML systems. It does, however, make it more convenient to inadvertently give a processing application data it doesn't expect. The problem then is to define ways that applications and document type designers can express their expectations about what is meaningful content for subdocuments.

Defining Expectations: Conventions for the Use of Subdocuments

The first convention is an agreement that subdocuments should be used only for data that is semantically a fragment of the document that refers to the subdocument. In other words, the data in the subdocument could be included in the document as a general text entity (more or less). The SGML standard doesn't define any precise meaning for subdocument references. I think it makes the most sense to only use subdocuments for semantic fragments of documents.

In addition to this general convention, I have identified three key conventions that make the use of subdocuments for re-use more reliable:

  1. Limit references to subdocuments to ENTITY attributes
  2. Require subdocuments to use the same (or compatible) DTDs
  3. Use HyTime indirect addressing for ID references

Limiting References to ENTITY Attributes

The purpose of this convention is to make it clearer where subdocument entity references are allowed within a particular document type. For example, in a particular document type you might only want to allow division elements to be made into re-usable objects. You could express this either by allowing an ENTITY attribute on Division elements or by creating a special referential element type that is only allowed where Division elements are allowed. Figure 3. Referring to Subdocuments Via ENTITY Attributes shows these two techniques.

<!-- Division that can refer to a subdocument entity for 
     its content -->
<!ELEMENT Division - O (%Div.Content;) >
<!ATTLIST Division
     Content  ENTITY #CONREF
     -- Reference to subdocument entity that contains
        the content of this division --
>

<!-- Specialized element type for referring to subdocument
     entities. -->
<!ELEMENT Re-Used-Division - O EMPTY >
<!ATTLIST Re-Used-Division
     Division  ENTITY #REQUIRED
     -- Reference to subdocument entity that contains
        the content of this division --
>

Figure 3. Referring to Subdocuments Via ENTITY Attributes

Depending on the need for constraint, you could allow any element to refer to a subdocument entity for its content.
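
A sketch of how a document instance might use the two declarations in Figure 3. The entity name, file name, and enclosing Book element are hypothetical.

<!DOCTYPE Book SYSTEM "book.dtd" [
  <!-- Hypothetical subdocument entity; SUBDOC marks it as a
       complete document rather than a text fragment -->
  <!ENTITY install SYSTEM "install.sgm" SUBDOC>
]>
<Book>
<!-- Technique 1: the Division element itself names the
     subdocument. Because Content is a #CONREF attribute,
     the element has no content or end-tag of its own here -->
<Division Content="install">

<!-- Technique 2: a dedicated referential element -->
<Re-Used-Division Division="install">
</Book>

In both cases the parser only checks that "install" is a declared entity. Treating the subdocument as the content of the division is left to the processing application.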

Another reason for referring to subdocument entities only from ENTITY attributes is that inline references to subdocument entities, unlike references to general text entities, can only occur in PCDATA or RCDATA content. This means that unless you use ENTITY attributes for subdocument references, there may be contexts where subdocument entity references need to be allowed but can't be without also allowing character data (the use of subdocuments for divisions being a likely example).

Requiring Subdocuments to Use Compatible DTDs

This convention is an extension of the basic convention that subdocuments are semantic fragments of their parent documents. In the most restrictive case, you require that subdocuments use exactly the same document type declaration (not counting entity declarations) as the parent document.

This requirement can be less restrictive, depending on the nature of the document type and the flexibility of the applications that support it. For example, you might have a family of related DTDs, all of which can be handled by your processing system (perhaps you have a set of master style sheets that will work for any DTD in the family). In this case, it probably makes sense to allow subdocuments to use any DTD in the family.

In the least restrictive case, your processing system can handle anything it gets and therefore imposes no restrictions on the document types allowed for subdocuments, leaving it purely to authors to define what constitutes a valid semantic fragment.

The key is that you, as the application or document type designer, get to say what compatible means for your application and you have a continuum of choices from the most restrictive (exactly the same document type) to the least restrictive (any document type).
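
As an illustration of the most restrictive case, the parent and the subdocument could name the same external DTD and differ only in their document element and local entity declarations. The public identifier and file name below are invented for the example, and the DTD is assumed to allow Division as a document element.

<!-- Parent document -->
<!DOCTYPE Book PUBLIC "-//Example Corp//DTD Book Family//EN" [
  <!ENTITY install SYSTEM "install.sgm" SUBDOC>
]>

<!-- install.sgm: same external DTD, different document element -->
<!DOCTYPE Division PUBLIC "-//Example Corp//DTD Book Family//EN">

One way an application might enforce this convention is to compare the public identifiers of the two document type declarations, or to check the subdocument's identifier against a list of DTDs in the family. SGML itself performs no such check.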

Using HyTime Indirect Addressing for ID References

Because subdocuments are separate documents syntactically, any reference to an element within a subdocument from its parent document (or other subdocuments included by the parent) is an inter-document reference, just like a reference to another book. This means you need a way to represent cross-document references. HyTime provides a complete and robust referential mechanism that is itself an ISO standard (ISO/IEC 10744).
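
A rough sketch of what such an indirect reference might look like, using the HyTime named location address (nameloc) form. The XRef element type, the ID values, and the entity name are hypothetical, and the attribute names reflect my reading of ISO/IEC 10744, so check them against the standard before relying on them. In practice the HyTime attributes would normally be fixed in the DTD rather than specified on each element.

<!-- In the parent document; "install" is assumed to be a
     declared SUBDOC entity and "prereqs" an ID inside it -->
<nameloc HyTime="nameloc" id="preloc">
<nmlist HyTime="nmlist" nametype="element"
        docorsub="install">prereqs</nmlist>
</nameloc>

<!-- The cross-reference is an ordinary IDREF to the nameloc,
     which resolves within the parent document's ID space -->
<P>See <XRef linkend="preloc">the prerequisites</XRef>.</P>

The reference in the text never names the subdocument's ID directly. The nameloc carries both the entity and the ID, so the parent's own ID name space stays self-contained.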

You could also use other addressing schemes, such as TEI locators or URLs, and which you use will depend upon the nature of your application and the scope of interoperability you need.

In Conclusion

The SGML standard only defines two object types that can have independent existence: documents and subdocuments. Thus it is clear that only documents and subdocuments can be reliably re-used. In particular, external general text entities are not useful candidates for general re-use.

My plea then is for tools to add the functions necessary to support the use of subdocuments for the re-use of semantic fragments. For most applications, such as browsers, this means treating the content of subdocument entities as though it had occurred in a general text entity for the purpose of processing (not parsing). For parsers, it means providing a mechanism either to parse multiple documents in parallel or to suspend the parsing of the parent document while the subdocument is parsed and then integrate the parsing result of the subdocument with the data resulting from the parsing of the parent document. For editors, it means allowing the declaration and editing of subdocument entities. Editors, in particular, may also need to provide ways to define constraints on what document types or architectures are to be allowed for subdocuments in specific application environments (families of DTDs).

I think that these conventions provide a clear and simple way to make the use of subdocuments in general less problematic and more fruitful. The full promise of SGML cannot be realized until the problem of fragment re-use is solved and I am firmly convinced that subdocuments are the key to that solution.