[This local archive copy mirrored from the canonical site: http://www.arbortext.com/sgmlxept.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
SGML Exceptions and XML
by ArborText, Inc.
This paper briefly describes SGML exceptions (inclusions and exclusions) and discusses how exception users can handle their DTDs and data in XML, which does not allow exceptions.
SGML exceptions are global control parameters for content models in DTDs. An exception affects not just the content model of the ELEMENT declaration in which it appears, but models for any subelements that appear inside that element.
There are two kinds of exception:
An exclusion consists of a hyphen (-), followed by one or more element names surrounded by parentheses and separated by vertical bars (|), ampersands (&), or commas (,). For example:
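A minimal sketch of an exclusion, assuming illustrative content models for title and emph (the models are assumptions chosen to fit the description that follows):

```sgml
<!-- Assumed content models, for illustration only -->
<!ELEMENT title - - (#PCDATA | emph)* -(indexterm | link)>
<!ELEMENT emph  - - (#PCDATA | indexterm | link)*>
```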
Even though emph appears to allow indexterm and link inside it, it's not allowed to contain them whenever emph itself appears inside title.
(Note that if you exclude an element, it can't also appear as a required part of the proper content model.)
An inclusion consists of a plus sign (+), followed by one or more element names surrounded by parentheses and separated by vertical bars (|), ampersands (&), or commas (,). For example:
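A minimal sketch of an inclusion, assuming illustrative content models for document and front (body, title, and author are hypothetical subelements):

```sgml
<!-- Assumed content models, for illustration only -->
<!ELEMENT document - - (front, body) +(indexterm | link)>
<!ELEMENT front    - - (title, author+)>
```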
Even though the content model for front never even mentions indexterm or link, it can freely contain them whenever it appears inside document because of the inclusion on the document content model.
Note that if front happens to appear inside other elements as well, its ability to contain the included elements will depend on the inclusions provided on those other content models.
Combinations and Nesting
If a single element declaration needs to supply both exclusions and inclusions, the exclusion must come first. For example:
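As a sketch, a hypothetical sidebar element that bans footnotes but admits index terms anywhere inside it could be declared as follows (the element names and content model are assumptions):

```sgml
<!-- Hypothetical declaration: the exclusion group precedes the inclusion group -->
<!ELEMENT sidebar - - (title, para+) -(footnote) +(indexterm)>
```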
If an element is both excluded and included, the exclusion wins.
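One way this can arise, sketched with assumed declarations for book, chapter, appendix, and para, is an inclusion on an outer element meeting an exclusion on an inner one:

```sgml
<!-- Assumed declarations, for illustration only -->
<!ELEMENT book     - - (chapter+, appendix*) +(xref)>
<!ELEMENT chapter  - - (title, para+)>
<!ELEMENT appendix - - (title, para+) -(xref)>
<!ELEMENT para     - - (#PCDATA)>
```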
The xref element will be allowed inside a para when the para appears inside a chapter, but not when it appears inside an appendix.
Exceptions have several useful functions. However, they must be used cautiously because their effect extends far beyond the ELEMENT declaration on which they appear and because included elements are treated specially by SGML parsers.
Exceptions for Content Model Simplicity
Exceptions are obviously a powerful shorthand way to control many content models at once and to create the illusion of having multiple content models for a single element type.
For example, if you intend for a floating annotation element to be allowed literally anywhere in your document, it's much more convenient (and easier to read) to mention annot once as an inclusion up at the top than to insert it in between all the other elements in all the content models:
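A sketch of the contrast, using a hypothetical doc/section/para structure (only annot comes from the text above; the other names and models are assumptions):

```sgml
<!-- With an inclusion: annot is mentioned once, at the top -->
<!ELEMENT doc     - - (title, section+) +(annot)>
<!ELEMENT section - - (title, para+)>
<!ELEMENT para    - - (#PCDATA | emph)*>

<!-- Without it, every model has to mention annot explicitly, e.g.
     section: (annot*, title, annot*, (para, annot*)+)
     para:    (#PCDATA | emph | annot)*                          -->
```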
Exceptions for Validation Power
In some cases, exceptions are more than a shorthand; they're the only way to enforce a constraint by means of validation with an SGML parser.
For example, let's say that:
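Paragraphs may contain footnotes, and footnotes are made up of paragraphs. Sketched as declarations (the exact content models are assumptions):

```sgml
<!-- Assumed content models, for illustration only -->
<!ELEMENT para     - - (#PCDATA | footnote)*>
<!ELEMENT footnote - - (para+)>
```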
Your DTD now allows the footnote element to contain itself indirectly, because a footnote could contain a paragraph that contains a footnote.
If you want to disallow footnotes from appearing inside themselves, you can't do it by removing mention of footnote from the content model of footnote, because it doesn't appear there in the first place. The only way (short of inventing new element types) is to exclude footnotes from themselves:
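A sketch of such a self-exclusion, assuming footnote is built from paragraphs:

```sgml
<!-- Assumed content model; the exclusion bans footnote at any depth inside footnote -->
<!ELEMENT footnote - - (para+) -(footnote)>
```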
Problems with Exceptions in Reuse
Exceptions can be powerful, but they can also get you into trouble. For example, let's say that:
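You maintain two document types that share a declaration for defined-term, and only one of them has an inclusion of annot at the top. A sketch of that situation (the content models, and the term and definition subelements, are assumptions):

```sgml
<!-- Regular documents: annot is included everywhere below document -->
<!ELEMENT document     - - (title, (para | defined-term)+) +(annot)>
<!-- Terminology documents: no inclusion in effect -->
<!ELEMENT terminology  - - (title, defined-term+)>
<!-- Shared declaration, reused in both contexts -->
<!ELEMENT defined-term - - (term, definition)>
```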
The inclusions in effect for both versions of defined-term are not identical, because annot is implicitly allowed inside one version and entirely absent inside the other. Therefore, someone might legitimately add an annotation to a definition while working inside a regular document, and when the definition is pulled into a terminology document context, the definition will be invalid.
To solve your reuse problem, you would have to ensure that the same set of inclusions was in effect for both contexts. For example, you could put an inclusion of annot on terminology or on the shared declaration for defined-term itself.
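Either option might look like the following sketch, which assumes a terminology element built from a title and defined-term elements:

```sgml
<!-- Option A: add the inclusion to the second context -->
<!ELEMENT terminology  - - (title, defined-term+) +(annot)>

<!-- Option B: put it on the shared declaration instead
     defined-term: (term, definition) +(annot)             -->
```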
Problems with Inclusions and Record Ends
You might think that the following two ELEMENT declarations are equivalent:
1. This declaration uses an inclusion:
2. This declaration uses a proper content model:
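The two forms being compared can be sketched as follows (assumed content models, consistent with the emph and trademark elements discussed in this section):

```sgml
<!-- Declaration 1: inclusion -->
<!ELEMENT para - - (#PCDATA) +(emph | trademark)>

<!-- Declaration 2: proper (repeatable-OR) content model
     para: (#PCDATA | emph | trademark)*                   -->
```

Only one declaration of para can be active in a given DTD, which is why the second form is shown as a comment.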
The two declarations are equivalent in many ways, since they both allow emph and trademark elements to appear freely within para. For example, the following instance will be valid under both:
However, SGML parsers do not treat included subelements identically to proper subelements (elements that appear directly in the para content model). The difference is in the pattern of record end signals that are retained or discarded. Let's reveal the record end signals in this example, and add line numbers for easy reference:
Record ends surrounding included subelements are discarded, whereas record ends surrounding proper subelements are retained (and probably ultimately turned into spaces by a formatting application, if the text is supposed to be wrapped). Thus, under declaration 1, the REs at the ends of lines 1, 2, and 3 would be discarded, and under declaration 2, they would be retained.
Presumably, in order not to have words run together, you would want the behavior under declaration 2. Therefore, you should be cautious about using inclusions for real document content when a repeatable-OR content model will give you the same validation power.
Exceptions can be a useful tool for the experienced DTD developer, though they should be used sparingly and cautiously. However, the XML Working Group removed SGML exceptions from XML Version 1.0 for two reasons:
Simplification of validation
A validating XML processor (equivalent to an SGML parser) must be much more complex if it is required to handle exception-checking of element content, and one of XML's goals is to be relatively easy to implement.
XML parsing model expression
Because exceptions are a very powerful shorthand and because they interact with content in subtle ways, their power is difficult to express using standard formal-language theory and practice, and another of XML's goals is to be described in a formal and concise manner.
It is worth noting that the Working Group intends to place them on the list of capabilities to be reconsidered in preparing future versions of XML.
Since XML does not allow exceptions, any exception-using DTD that needs to conform to XML will have to undergo a process to remove its exceptions (possibly among other changes not discussed in this paper).
If your goal is to remove exceptions but to keep the DTD as robust in its expressiveness and validation as before, you'll find that some exceptions are harder to remove than others.
In some cases, you just need to create more complicated content models, but in others, you need to split out some of your existing element types into new context-specific element types so that you can give a different content model to each one.
Removing Convenience Inclusions
Where you have used inclusions as shorthand, removing them involves changing content models to produce the same expressive power in the DTD, which will make the content models more complex.
For example, let's say your DTD has an inclusion of footnote and sidebar at the top level:
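A sketch of such a DTD, assuming a small hypothetical doc/section/para structure (only footnote and sidebar are named in the text; the rest is assumed):

```sgml
<!-- Assumed structure; the single inclusion covers every level below doc -->
<!ELEMENT doc     - - (title, section+) +(footnote | sidebar)>
<!ELEMENT section - - (title, para+)>
<!ELEMENT title   - - (#PCDATA | emph)*>
<!ELEMENT para    - - (#PCDATA | emph)*>
<!ELEMENT emph    - - (#PCDATA)>
```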
To remove the inclusions and keep the same expressiveness would require the following declarations:
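A sketch of what the expansion might look like for a hypothetical doc/section/para structure (all content models are assumptions). The declarations for footnote and sidebar are omitted here, and they would need the same treatment for the two versions to match exactly:

```sgml
<!-- Element content: the formerly included elements are threaded between tokens -->
<!ELEMENT doc     - - ((footnote | sidebar)*, title, (footnote | sidebar)*,
                       (section, (footnote | sidebar)*)+)>
<!ELEMENT section - - ((footnote | sidebar)*, title, (footnote | sidebar)*,
                       (para, (footnote | sidebar)*)+)>
<!-- Mixed content: the formerly included elements join the OR group -->
<!ELEMENT title   - - (#PCDATA | emph | footnote | sidebar)*>
<!ELEMENT para    - - (#PCDATA | emph | footnote | sidebar)*>
<!ELEMENT emph    - - (#PCDATA | footnote | sidebar)*>
```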
These declarations are difficult to read and therefore to maintain, though the use of parameter entities could help them to become a bit more readable.
If you've used different inclusion mixtures at different levels, or have excluded some of the inclusion elements at lower levels, the complexity will increase, and even parameter entities won't help very much.
You should consider whether the original validation power is worth the complexity. In some cases, you might prefer a simpler content model that does not exactly duplicate the original DTD's functionality. It might be looser or tighter, and might have some intermediate element types added to break up the content models. For example, in the example above, you might choose to allow footnotes only in mixed content (#PCDATA mixtures) and sidebars only as peers to paragraphs, both of which seem like reasonable restrictions.
(Remember that record ends are treated differently around included subelements and proper subelements, so once you've recast the content models to be similar, they're still not perfectly identical in their treatment by SGML parsers.)
Removing Must-Have Exceptions
Our earlier example of nested footnotes in paragraphs demonstrates why exception removal sometimes requires changing the element structure in the document type. Here is the example again:
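Sketched with assumed content models, the exception-using version looks like this:

```sgml
<!-- Assumed content models, for illustration only -->
<!ELEMENT para     - - (#PCDATA | footnote)*>
<!ELEMENT footnote - - (para+) -(footnote)>
```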
Because footnote is not literally mentioned anywhere in its own content model, the only way to ensure that a para inside a footnote doesn't contain another footnote is to use an exclusion, unless you're willing to split out an element type into multiple types. Here's how you could split out the element types:
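A sketch of one possible split, in which para-in-fn is a hypothetical context-specific element type introduced purely for illustration:

```sgml
<!-- para-in-fn is an assumed name for the footnote-only paragraph type -->
<!ELEMENT para       - - (#PCDATA | footnote)*>
<!ELEMENT footnote   - - (para-in-fn+)>
<!ELEMENT para-in-fn - - (#PCDATA)>
```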
Doing this has a strong design impact on your DTD. In DTDs where element types are reused in many different contexts, the requirement to add context-specific element types could cause a combinatorial explosion. Also, such changes would require conversion of legacy instances and updating of all DTD documentation materials, not to mention all the other components of your SGML-based system that rely on the DTD.
Therefore, it is probably a better strategy in these cases to redesign the exception use into a non-exception-using version that has looser content model constraints. The easiest way to do this would be to retain the original content model and simply delete the exclusion:
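Under the same assumed content models as before, the looser version is simply:

```sgml
<!-- Nested footnotes are now valid as far as the parser is concerned -->
<!ELEMENT para     - - (#PCDATA | footnote)*>
<!ELEMENT footnote - - (para+)>
```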
In order to check your documents for previously illegal combinations, you would have to develop application software that does the checking outside the SGML parsing process.
Extended Exception Removal Example
Let's look at an extended exception-using DTD scenario, and what would be involved in removing the exceptions. Following is a portion of the original exception-using DTD, with the declarations numbered for reference:
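A sketch of what such a DTD portion might look like, inferred from the discussion that follows. The content models, and the body, author, and date elements, are assumptions. The numbers are carried in comments so that the fragment stays parseable:

```sgml
<!-- 1 --> <!ELEMENT doc   - - (front, body) +(ndx)>
<!-- 2 --> <!ELEMENT front - - (title, author?, date?)>
<!-- 3 --> <!ELEMENT title - - (#PCDATA | emph)* -(ndx)>
<!-- 4 --> <!ELEMENT emph  - - (#PCDATA)>
```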
Now let's work through each of the ELEMENT declarations:
1. You can allow ndx in and around all the subelements of doc easily, because doc has a simple content model:
2. Here's what might be the first logical attempt to remove exceptions from front; it is ambiguous because of the interaction of ndx and the optional elements:
The second attempt fixes the problem, but is more complicated and less obvious.
3. The title element isn't supposed to contain ndx, so it's unnecessary to mention ndx at all in the non-exception-using version of title:
4. The emph element isn't supposed to contain ndx when it's in the title element, but it might be allowed to contain ndx in other contexts. You would need to declare one new element type per exclusion context to reflect this. This involves both inventing a new element and mentioning it in the relevant parent's content model. Also, don't forget that you would have to convert your document instances and update your DTD documentation.
Because of the emph-in-title invention, don't forget that you now have to use it in your declaration for title:
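A sketch of the resulting title declaration, together with the rest of an exception-free DTD portion following steps 1 through 4. All content models are assumptions, as are the body, author, and date elements. Only ndx, doc, front, title, emph, and emph-in-title come from the text:

```sgml
<!-- 1: ndx allowed in and around the subelements of doc -->
<!ELEMENT doc   - - (ndx*, front, ndx*, body, ndx*)>

<!-- 2: first attempt, ambiguous because an ndx after title could match
        either of two ndx* tokens when author is absent:
        (ndx*, title, ndx*, author?, ndx*, date?, ndx*)
     Second attempt, with each trailing ndx* tied to its optional element: -->
<!ELEMENT front - - (ndx*, title, ndx*, (author, ndx*)?, (date, ndx*)?)>

<!-- 3 and 4: title never mentions ndx and uses the context-specific emph -->
<!ELEMENT title         - - (#PCDATA | emph-in-title)*>
<!ELEMENT emph-in-title - - (#PCDATA)>
<!ELEMENT emph          - - (#PCDATA | ndx)*>
```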
The considerations for removing or keeping exceptions can be complex. Let's take the issues one by one.
Q: Regardless of XML concerns, given what I now know about exceptions, should I get rid of the exceptions in my DTD?
If you feel that the exceptions in your DTD are used in problematic ways, you may very well want to remove these uses. However, if you like the creation and validation results you're getting, there's no reason to remove them unilaterally.
Q: If I'm happy with my exceptions otherwise, should I remove them in order to help make my full-SGML DTD XML-compliant?
To determine your answer, consider your reasons for using XML:
1. Is the purpose of using XML the general delivery of XML data on the Web or elsewhere? If so, then you may be able to continue using your full-SGML DTD as is, because your strategy is highly likely to involve writing out XML-compliant instances that don't rely on a DTD, and it won't matter that the DTD you validated against has exceptions in it. You will need to track XML and Web developments in order to determine whether distributing merely well-formed documents without DTDs is the right strategy for you.
(Note that there are a few other important constructs besides exceptions that you may want to consider removing from your DTD, even if you're just delivering DTDless XML instances.)
2. Is the purpose of using XML the interchange of editable source files with others who can handle full SGML? If so, then you can continue to use your full-SGML DTD as is, because if you found exceptions useful, your interchange partners will too.
(Again, note that there are a few other important constructs besides exceptions that you may want to consider removing from your DTD, even if you're just delivering DTDless XML instances.)
3. Is the purpose of using XML the interchange of editable source files with others who can handle only XML and not full SGML? If so, then hypothetically you'll want to create an XML equivalent (to the extent possible) for your full-SGML DTD, because your interchange partners' creation and validation tools will not be able to handle a full-SGML DTD. (This is hypothetical because there are not yet high-quality XML-only validating processors or other components of an XML-only production system.)
Note that the answers to these questions are likely to change over time, as more XML tools become available and as the relationship between XML and full SGML becomes clearer.
Q: I've decided to remove the exceptions from my DTD. Should I attempt to preserve the validation power of the original DTD in the modified version?
To determine your answer, consider the importance of handling highly constrained output in your system:
1. If your processes would feel relatively little impact with fewer constraints in place, then you can choose relatively simple methods of exception removal that result in a somewhat similar (but not identical) DTD.
In practical terms, if your DTD uses only exclusions, you might choose simply to remove them. If your DTD uses inclusions, you might restrict the occurrence of the included elements as part of choosing how to migrate them into proper content model locations.
2. If your processes require highly constrained input, you might choose to migrate some inclusions and exclusions over to proper content models, as discussed above. Where this solution is unsatisfactory, you will need to implement proprietary extra-DTD validation processes. These might take the form of programs written in generic tools such as Perl, or in SGML-specific parsing/transformation languages.
Note that the issue of non-DTD schema languages is receiving a great deal of attention in the XML community, and eventually it may be possible to use an open, nonproprietary means of checking validation constraints that couldn't have been expressed easily in DTD form.
Q: Is there an automatic or programmatic way to remove exceptions from my DTD?
Several scholars are researching the feasibility of automatically converting an exception-using DTD into an exception-free DTD. Some of the simple cases can probably be handled already, but a production-quality converter (for example, one that retains your existing parameter entity structure and handles complex cases) does not yet exist, and may not exist for some time. For now, you will probably need to do such conversions by hand; you may want to use a DTD analysis tool (either commercial or freely available) to help you determine whether the resulting DTD has strayed too far from the original.