[This local archive copy mirrored from the canonical site: http://www.arbortext.com/sgmlxept.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

  SGML Exceptions and XML

by ArborText, Inc.

This paper briefly describes SGML exceptions (inclusions and exclusions) and discusses how “exception users” can handle their DTDs and data in XML, which does not allow exceptions.

Table of Contents
What Are SGML Exceptions?
When and How Can Exceptions Be Used?
Why Aren't Exceptions Allowed in XML?
How Can Exception-Using DTDs Become XML-Compliant?
Should Exception-Using DTDs Become XML-Compliant?



What Are SGML Exceptions?

SGML exceptions are “global” control parameters for content models in DTDs. An exception affects not just the content model of the ELEMENT declaration in which it appears, but also the content models of any subelements that appear inside that element.

There are two kinds of exception:

  • An exclusion disallows the specified subelements from appearing anywhere directly inside this element or inside any of its further contents.
  • An inclusion allows the specified subelements to appear freely anywhere within the element being declared and its further contents.

Exclusions

An exclusion consists of a hyphen (-), followed by one or more element names surrounded by parentheses and separated by vertical bars (|), ampersands (&), or commas (,). For example:

<!ELEMENT title (#PCDATA|emph)* -(indexterm|link)>
<!ELEMENT emph (#PCDATA|indexterm|link)*>
<!ELEMENT indexterm (#PCDATA)>

Even though emph appears to allow indexterm and link inside it, it's not allowed to contain them whenever emph itself appears inside title.

(Note that if you exclude an element, it can't also appear as a required part of the proper content model.)

Inclusions

An inclusion consists of a plus sign (+), followed by one or more element names surrounded by parentheses and separated by vertical bars (|), ampersands (&), or commas (,). For example:

<!ELEMENT document (front, body) +(indexterm|link)>
<!ELEMENT front (title, author)>

Even though the content model for front never even mentions indexterm or link, it can freely contain them whenever it appears inside document because of the inclusion on the document content model.

Note that if front happens to appear inside other elements as well, its ability to contain the “included elements” will depend on the inclusions provided on those other content models.

Combinations and Nesting

If a single element declaration needs to supply both exclusions and inclusions, the exclusion must come first. For example:

<!ELEMENT title (#PCDATA) -(indexterm) +(footnote)>

If an element is both excluded and included, the exclusion “wins.” For example:

<!ELEMENT chapter (title, para+)>
<!ELEMENT appendix (title, para+) -(xref)>
<!ELEMENT para (#PCDATA) +(xref)>

The xref element will be allowed inside a para when the para appears inside a chapter, but not when it appears inside an appendix.
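
The precedence rule can be sketched in Python. This is a simplified model rather than an SGML parser: each content model is reduced to the set of element names it mentions (order and occurrence indicators are ignored), and the DTD table and the is_allowed function are hypothetical names introduced only for this illustration.

```python
# Simplified model of the declarations above: each entry maps an element
# name to (names in its content model, inclusions, exclusions).
# Order and occurrence indicators are deliberately ignored.
DTD = {
    "chapter":  ({"title", "para"}, set(), set()),
    "appendix": ({"title", "para"}, set(), {"xref"}),
    "para":     (set(), {"xref"}, set()),
    "title":    (set(), set(), set()),
}


def is_allowed(child, path):
    """Return True if `child` may appear inside the last element of `path`.

    `path` lists the open elements from the outermost to the innermost.
    """
    model, _, _ = DTD[path[-1]]
    included = set()
    excluded = set()
    # Exceptions declared on any open element stay in effect further down.
    for name in path:
        _, inclusions, exclusions = DTD[name]
        included |= inclusions
        excluded |= exclusions
    if child in excluded:  # the exclusion "wins" over any inclusion
        return False
    return child in model or child in included


print(is_allowed("xref", ["chapter", "para"]))   # True
print(is_allowed("xref", ["appendix", "para"]))  # False
```

The exclusion test comes first, so an element that is both excluded by an ancestor and included by a descendant is rejected.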

 

When and How Can Exceptions Be Used?

Exceptions have several useful functions. However, they must be used cautiously because their effect extends far beyond the ELEMENT declaration on which they appear and because included elements are treated specially by SGML parsers.

Exceptions for Content Model Simplicity

Exceptions are a powerful shorthand for controlling many content models at once and for creating the illusion of multiple content models for a single element type.

For example, if you intend for a “floating” annotation element to be allowed literally anywhere in your document, it's much more convenient (and easier to read) to mention annot once as an inclusion up at the top than to insert it in between all the other elements in all the content models:

<!ELEMENT doc (...) +(annot)>
<!ELEMENT front (title, author)>
<!ELEMENT body (division+)>
<!ELEMENT back (appendix+)>
<!-- vs. -->
<!ELEMENT doc (annot*, front, annot*, body, annot*, back, annot*)>
<!ELEMENT front (annot*, title, annot*, author, annot*)>
<!ELEMENT body (annot*, (division, annot*)+)>
<!ELEMENT back (annot*, (appendix, annot*)+)>

Exceptions for Validation Power

In some cases, exceptions are more than a shorthand; they're the only way to enforce a constraint by means of validation with an SGML parser.

For example, let's say that:

  • Your para element can contain footnote (in the old-fashioned way, without inclusions) because you might need to annotate a sentence that appears inside a paragraph:
    <!ELEMENT para (#PCDATA|emph|footnote)*>
  • Your footnote element can contain para because your footnotes are sometimes large enough to contain several paragraphs, supporting graphics, lists, and so on:
    <!ELEMENT footnote (para|list|graphic)+>

Your DTD now allows the footnote element to contain itself indirectly, because a footnote could contain a paragraph that contains a footnote.

If you want to disallow footnotes from appearing inside themselves, you can't do it by removing mention of footnote from the content model of footnote, because it doesn't appear there in the first place. The only way (short of inventing new element types) is to exclude footnotes from themselves:

<!ELEMENT footnote (para|list|graphic)+ -(footnote)>

Problems with Exceptions in Reuse

Exceptions can be powerful, but they can also get you into trouble. For example, let's say that:

  • Your main document DTD allows for term definitions, and also has an inclusion of annotations at the top level:
    <!ELEMENT document (...) +(annot)>
    ..
    <!ELEMENT defined-term (term, definition)>
    <!ELEMENT term (#PCDATA)>
    <!ELEMENT definition (para+)>
  • You have an auxiliary DTD that serves to structure your terminology database, using the same basic defined-term markup (you might have modularized your DTDs so that they share the identical ELEMENT declarations for defined-term):
    <!ELEMENT terminology (title, intro, defined-term+)>
    ..
    <!ELEMENT defined-term (term, definition)>
    <!ELEMENT term (#PCDATA)>
    <!ELEMENT definition (para+)>
  • You plan to “mine” your documents for newly defined terms and add them to your SGML terminology database.

The inclusions in effect for both versions of defined-term are not identical, because annot is implicitly allowed inside one version and entirely absent inside the other. Therefore, someone might legitimately add an annotation to a definition while working inside a regular document, and when the definition is pulled into a terminology document context, the definition will be invalid.

To solve your reuse problem, you would have to ensure that the same set of inclusions was in effect for both contexts. For example, you could put an inclusion of annot on terminology or on the shared declaration for defined-term itself.
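
One way to spot this kind of mismatch before it causes invalid documents is to compare the inclusions in effect along each path to the shared element. The following Python fragment is a minimal sketch under stated assumptions: the dictionaries and the inclusions_in_effect function are hypothetical, and the paths are shortened because the original DTDs elide the intermediate elements.

```python
# Inclusions declared by each element in the two DTDs described above.
# Elements that are not listed declare no inclusions.
DOCUMENT_DTD = {"document": {"annot"}}
TERMINOLOGY_DTD = {}


def inclusions_in_effect(path, declared):
    """Collect the inclusions contributed by every open element in `path`."""
    in_effect = set()
    for name in path:
        in_effect |= declared.get(name, set())
    return in_effect


# The elements between document and defined-term are elided in the DTD,
# so these paths are shortened for illustration.
in_doc = inclusions_in_effect(
    ["document", "defined-term", "definition"], DOCUMENT_DTD)
in_term = inclusions_in_effect(
    ["terminology", "defined-term", "definition"], TERMINOLOGY_DTD)

# Anything printed here is allowed in one context but not in the other.
print(sorted(in_doc - in_term))  # ['annot']
```

Declaring the annot inclusion on terminology (or on defined-term in both DTDs) makes the two sets equal, which is the fix described above.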

Problems with Inclusions and Record Ends

You might think that the following two ELEMENT declarations are equivalent:

1. This declaration uses inclusions:

<!ELEMENT para (#PCDATA) +(emph|trademark)>

2. This declaration uses a proper content model:

<!ELEMENT para (#PCDATA|emph|trademark)*>

They are equivalent in many ways, since they both allow emph and trademark elements to appear freely within para. For example, the following instance will be valid under both:

<para>The
<trademark>KoolSurf</trademark>
program has <emph>everything</emph>
you need for carefree Web surfing.  It even reads the
pages <emph>for</emph> you!
</para>

However, SGML parsers do not treat included subelements identically with “proper subelements” (elements that appear directly in the para content model). The difference is in the pattern of “record end” signals that are retained or discarded. Let's reveal the record end signals in this example, and add line numbers for easy reference:

1: <para>TheRE
2: <trademark>KoolSurf</trademark>RE
3: program has <emph>everything</emph>RE
4: you need for carefree Web surfing.  It even reads theRE
5: pages <emph>for</emph> you!RE
6: </para>RE

Record ends surrounding included subelements are discarded, whereas record ends surrounding proper subelements are retained (and probably ultimately turned into spaces by a formatting application, if the text is supposed to be wrapped). Thus, under declaration 1, the REs at the ends of lines 1, 2, and 3 would be discarded, and under declaration 2, they would be retained.

Presumably, in order not to have words run together, you would want the behavior under declaration 2. Therefore, you should be cautious about using inclusions for “real document content” when a repeatable-OR content model will give you the same validation power.

 

Why Aren't Exceptions Allowed in XML?

Exceptions can be a useful tool for the experienced DTD developer, though they should be used sparingly and cautiously. However, the XML Working Group removed SGML exceptions from XML Version 1.0 for two reasons:

Simplification of validation

A validating “XML processor” (equivalent to an SGML parser) must be much more complex if it is required to handle exception-checking of element content, and one of XML's goals is to be relatively easy to implement.

XML parsing model expression

Because exceptions are a very powerful shorthand and because they interact with content in subtle ways, their power is difficult to express using standard formal-language theory and practice, and another of XML's goals is to be described in a formal and concise manner.

It is worth noting that the Working Group intends to place them on the list of capabilities to be reconsidered in preparing future versions of XML.

 

How Can Exception-Using DTDs Become XML-Compliant?

Since XML does not allow exceptions, any exception-using DTD that needs to conform to XML will have to undergo a process to remove its exceptions (possibly among other changes not discussed in this paper).

If your goal is to remove exceptions but to keep the DTD as robust in its expressiveness and validation as before, you'll find that some exceptions are harder to remove than others.

In some cases, you just need to create more complicated content models, but in others, you need to split out some of your existing element types into new context-specific element types so that you can give a different content model to each one.

Removing Convenience Inclusions

Where you have used inclusions as shorthand, removing them involves changing content models to produce the same expressive power in the DTD, which will make the content models more complex.

For example, let's say your DTD has an inclusion of footnote and sidebar at the top level:

<!ELEMENT doc (front?, body, back?) +(footnote|sidebar)>
<!ELEMENT front (title, (author|authorgroup)?, publisher, pubdate?)>
<!ELEMENT body (division+)>
<!ELEMENT back ((appendix+, glossary?) | glossary)>
..

To remove the inclusions and keep the same expressiveness would require the following declarations:

<!ELEMENT doc ((footnote|sidebar)*, (front, (footnote|sidebar)*)?,
        body, (footnote|sidebar)*, (back, (footnote|sidebar)*)?)>
<!ELEMENT front ((footnote|sidebar)*, title, (footnote|sidebar)*,
        ((author|authorgroup), (footnote|sidebar)*)?, publisher,
        (footnote|sidebar)*, (pubdate, (footnote|sidebar)*)?)>
<!ELEMENT body ((footnote|sidebar)*, (division, (footnote|sidebar)*)+)>
<!ELEMENT back ((footnote|sidebar)*, (((appendix, (footnote|sidebar)*)+,
        (glossary, (footnote|sidebar)*)?) |
        (glossary, (footnote|sidebar)*)))>
..

These declarations are difficult to read and therefore to maintain, though the use of parameter entities could help them to become a bit clearer:

<!ENTITY % inc "(footnote|sidebar)*">
<!ELEMENT doc ((%inc;), (front, (%inc;))?, body, (%inc;),
        (back, (%inc;))?)>
<!ELEMENT front ((%inc;), title, (%inc;),
        ((author|authorgroup), (%inc;))?, publisher, (%inc;),
        (pubdate, (%inc;))?)>
<!ELEMENT body ((%inc;), (division, (%inc;))+)>
<!ELEMENT back ((%inc;),
        (((appendix, (%inc;))+, (glossary, (%inc;))?) |
        (glossary, (%inc;))))>
..

If you've used different “inclusion mixtures” at different levels, or have excluded some of the inclusion elements at lower levels, the complexity will increase, and even parameter entities won't help very much.
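
The mechanical part of this rewrite can be sketched in Python. The interleave function below is a hypothetical helper, not an existing tool, and it handles only a flat comma-separated sequence whose items carry at most one trailing occurrence indicator; OR groups such as the back content model, nested exceptions, and exclusions at lower levels would need more work.

```python
def interleave(items, floater="(footnote|sidebar)*"):
    """Rewrite a sequence content model so `floater` may appear between items.

    `items` holds element names or parenthesized groups, each with an
    optional occurrence indicator: "body", "front?", "division+".
    """
    parts = [floater]
    for item in items:
        if item[-1] in "?+*":
            # Tie the trailing floater to the optional or repeatable item so
            # that two adjacent floater positions never compete.
            parts.append(f"({item[:-1]}, {floater}){item[-1]}")
        else:
            parts.extend([item, floater])
    return "(" + ", ".join(parts) + ")"


print(interleave(["front?", "body", "back?"]))
print(interleave(["division+"]))
```

Given the doc and body sequences, the function produces the same content models as the hand-written declarations shown earlier in this section.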

You should consider whether the original validation power is worth the complexity. In some cases, you might prefer a simpler content model that does not exactly duplicate the original DTD's functionality. It might be looser or tighter, and might have some intermediate element types added to “break up” the content models. For example, in the example above, you might choose to allow footnotes only in mixed content (#PCDATA mixtures) and sidebars only as peers to paragraphs, both of which seem like reasonable restrictions.

(Remember that record ends are treated differently around included subelements and proper subelements, so even after you've recast the content models to be similar, SGML parsers will still not treat them identically.)

Removing Must-Have Exceptions

Our earlier example of nested footnotes in paragraphs demonstrates why exception removal sometimes requires changing the element structure in the document type. Here is the example again:

<!ELEMENT para (#PCDATA|emph|footnote)*>
<!ELEMENT footnote (para|list|graphic)+ -(footnote)>
..

Because footnote is not literally mentioned anywhere in its own content model, the only way to ensure that a para inside a footnote doesn't contain another footnote is to use an exclusion, unless you're willing to split out an element type into multiple types. Here's how you could split out the paragraph element:

<!ELEMENT para (#PCDATA|emph|footnote)*>
<!ELEMENT footnote (ftntpara|list|graphic)+>
<!ELEMENT ftntpara (#PCDATA|emph)*>
..

Doing this has a strong design impact on your DTD. In DTDs where element types are reused in many different contexts, the requirement to add context-specific element types could cause a combinatorial explosion. Also, such changes would require conversion of legacy instances and updating of all DTD documentation materials, not to mention all the other components of your SGML-based system that rely on the DTD.

Therefore, it is probably a better strategy in these cases to redesign the exception use into a non-exception-using version that has looser content model constraints. The easiest way to do this would be to retain the original content model and simply delete the exclusion:

<!ELEMENT para (#PCDATA|emph|footnote)*>
<!ELEMENT footnote (para|list|graphic)+>
..

In order to check your documents for previously illegal combinations, you would have to develop application software that does the checking outside the SGML parsing process.
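
As a rough illustration of such an external check, the following Python sketch reports footnotes nested inside other footnotes. It assumes the instance is already well-formed XML and uses the standard xml.etree.ElementTree module; the nested_footnotes function and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def nested_footnotes(root):
    """Return every footnote element that has a footnote ancestor."""
    found = []

    def walk(element, inside_footnote):
        if element.tag == "footnote":
            if inside_footnote:
                found.append(element)
            inside_footnote = True
        for child in element:
            walk(child, inside_footnote)

    walk(root, False)
    return found


sample = ET.fromstring(
    "<doc><para>Text<footnote><para>Note"
    "<footnote><para>Nested</para></footnote>"
    "</para></footnote></para></doc>"
)
print(len(nested_footnotes(sample)))  # 1
```

A check like this would run after ordinary validation, since the loosened DTD no longer rejects the nested case.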

Extended Exception Removal Example

Let's look at an extended exception-using DTD scenario, and what would be involved in removing the exceptions. Following is a portion of the original exception-using DTD, with commentary.

<!-- doc has an inclusion of ndx elements. -->
<!ELEMENT doc (front, body, back) +(ndx)>

<!-- front implicitly has it also, because it only appears in doc. -->
<!ELEMENT front (title, subtitle?, abstract?)>

<!-- title excludes ndx. -->
<!ELEMENT title (#PCDATA|emph)* -(ndx)>

<!-- emph doesn't mention ndx. In the context of title, it can't contain
ndx, but in other contexts, it might. -->
<!ELEMENT emph (#PCDATA)>

Now let's work through each of the ELEMENT declarations:

1. You can allow ndx in and around all the subelements of doc easily, because doc has a simple content model:

<!ELEMENT doc (ndx*, front, ndx*, body, ndx*, back, ndx*)>

2. Here's what might be the first “logical” attempt to remove exceptions from front; it is ambiguous because of the interaction of ndx and the optional elements:

<!ELEMENT front (ndx*, title, ndx*, subtitle?, ndx*, abstract?, ndx*)>

The second attempt fixes the problem, but is more complicated and less obvious.

<!ELEMENT front (ndx*, title, ndx*, (subtitle, ndx*)?, (abstract, ndx*)?)>

3. The title element isn't supposed to contain ndx, so it's unnecessary to mention ndx at all in the non-exception-using version of the DTD.

<!ELEMENT title (#PCDATA|emph)*>

4. The emph element isn't supposed to contain ndx when it's in the title element, but it might be allowed to contain ndx in other contexts. You would need to declare one new element type per “exclusion context” to reflect this. This involves both inventing a new element and mentioning it in the relevant parent's content model. Also, don't forget that you would have to convert your document instances and update your DTD documentation.

<!ELEMENT emph-in-title (#PCDATA)>
<!ELEMENT emph (#PCDATA|ndx)*>

Because of the emph-in-title invention, don't forget that you now have to use it in your declaration for title:

<!ELEMENT title (#PCDATA|emph-in-title)*>
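
The instance conversion that goes with this change can be sketched as follows. This is a minimal example that assumes the instances are already available as well-formed XML; it uses the standard xml.etree.ElementTree module, and the rename_emph_in_titles function and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def rename_emph_in_titles(root):
    """Rename every emph inside a title to emph-in-title; return the count."""
    renamed = 0
    for title in root.iter("title"):
        for emph in title.iter("emph"):
            emph.tag = "emph-in-title"
            renamed += 1
    return renamed


doc = ET.fromstring(
    "<doc><title>A <emph>bold</emph> claim</title>"
    "<para>Plain <emph>text</emph></para></doc>"
)
print(rename_emph_in_titles(doc))  # 1
print(ET.tostring(doc, encoding="unicode"))
```

Only the emph inside title is renamed; the emph inside para keeps its original name and its looser content model.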

 

Should Exception-Using DTDs Become XML-Compliant?

The considerations for removing or keeping exceptions can be complex. Let's take the issues one by one.

Q: Regardless of XML concerns, given what I now know about exceptions, should I get rid of the exceptions in my DTD?

If you feel that the exceptions in your DTD are used in problematic ways, you may very well want to remove these uses. However, if you like the creation and validation results you're getting, there's no reason to remove them unilaterally.

Q: If I'm happy with my exceptions otherwise, should I remove them in order to help make my full-SGML DTD XML-compliant?

To determine your answer, consider your reasons for using XML:

1. Is the purpose of using XML the general delivery of XML data on the Web or elsewhere? If so, then you may be able to continue using your full-SGML DTD as is, because your strategy is highly likely to involve writing out XML-compliant instances that don't rely on a DTD, and it won't matter that the DTD you validated against has exceptions in it. You will need to track XML and Web developments in order to determine whether distributing merely “well-formed” documents without DTDs is the right strategy for you.

(Note that there are a few other important constructs besides exceptions that you may want to consider removing from your DTD, even if you're just delivering DTDless XML instances.)

2. Is the purpose of using XML the interchange of editable source files with others who can handle full SGML? If so, then you can continue to use your full-SGML DTD as is, because if you found exceptions useful, your interchange partners will too.

(Again, note that there are a few other important constructs besides exceptions that you may want to consider removing from your DTD, even if you're just delivering DTDless XML instances.)

3. Is the purpose of using XML the interchange of editable source files with others who can handle only XML and not full SGML? If so, then hypothetically you'll want to create an XML equivalent (to the extent possible) for your full-SGML DTD, because your interchange partners' creation and validation tools will not be able to handle a full-SGML DTD. (This is hypothetical because there are not yet high-quality XML-only validating processors or other components of an XML-only production system.)

Note that the answers to these questions are likely to change over time, as more XML tools become available and as the relationship between XML and full SGML becomes clearer.

Q: I've decided to remove the exceptions from my DTD. Should I attempt to preserve the validation power of the original DTD in the modified version?

To determine your answer, consider the importance of handling highly constrained output in your system:

1. If your processes would feel relatively little impact with fewer constraints in place, then you can choose relatively simple methods of exception removal that result in a somewhat similar (but not identical) DTD.

In practical terms, if your DTD uses only exclusions, you might choose simply to remove them. If your DTD uses inclusions, you might restrict the occurrence of the included elements as part of choosing how to migrate them into proper content model locations.

2. If your processes require highly constrained input, you might choose to migrate some inclusions and exclusions over to proper content models, as discussed above. Where this solution is unsatisfactory, you will need to implement proprietary extra-DTD validation processes. These might take the form of programs written in generic tools such as Perl, or in SGML-specific parsing/transformation languages.

Note that the issue of non-DTD schema languages is receiving a great deal of attention in the XML community, and eventually it may be possible to use an open, nonproprietary means of checking validation constraints that couldn't have been expressed easily in DTD form.

Q: Is there an automatic or programmatic way to remove exceptions from my DTD?

Several scholars are researching the feasibility of automatically converting an exception-using DTD into an exception-free DTD. Some of the simple cases can probably be handled already, but a production-quality converter (for example, one that retains your existing parameter entity structure and handles complex cases) does not yet exist, and may not exist for some time. For now, you will probably need to do such conversions by hand; you may want to use a DTD analysis tool (either commercial or freely available) to help you determine whether the resulting DTD has strayed too far from the original.