Improving the Quality of SGML Documents

Bernd Nordhausen, Passage Systems


Usually OmniMark programmers use context translation for converting non-SGML source into SGML. In this paper, we show how to exploit the power of context translation by transforming SGML documents into SGML of higher quality.

This is useful because it is far easier to achieve a non-SGML to SGML translation in a series of steps than all at once. By using a series of translation programs, the programmer can not only divide the task into smaller ones, but also use a variety of tools that might be better suited to each individual task. For example, the first step might convert the non-SGML source into SGML purely based on the structure available in the source document. A second step can then focus on conversion based on content. Another step might handle any necessary graphics conversion by converting graphics file entities.

In this paper we describe a step that improves the quality of an SGML document by enhancing the structure of the document based on content. The process we describe accepts valid SGML documents as input and produces valid SGML documents of higher quality as output. While this specific program is tailored towards the PCIS DTD, the methods described here are applicable to any document type definition. Throughout this paper we refer by line number to the actual code, which can be found in the appendix.

Context-Translation And Referents

The methods described here make use of two unique features of OmniMark: context translation and referents. Context translation is an ``up-translation intermixed with a simultaneous down-translation.'' While context translations have traditionally been used for converting non-SGML text into SGML text, we show that this feature is extremely useful in SGML to SGML transformations.

Referents are buffers that can be output before their final value is known. Referents are very useful in handling ID IDREF relations.
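
The idea behind a referent can be illustrated outside OmniMark. The following Python sketch is not OmniMark code and does not reproduce the appendix; the class ReferentOutput and its method names are hypothetical. It only mimics the behaviour described above: a placeholder is written to the output immediately, and its value is supplied later, before the output is finalized.

```python
class ReferentOutput:
    """Output buffer whose named placeholders are resolved at the end."""

    def __init__(self):
        self._parts = []    # literal strings and ("ref", name) markers, in order
        self._values = {}   # referent name -> final value

    def write(self, text):
        self._parts.append(text)

    def write_referent(self, name):
        # Emit a placeholder now; its value may be set later.
        self._parts.append(("ref", name))

    def set_referent(self, name, value):
        self._values[name] = value

    def resolve(self):
        # Raises KeyError if a referent was output but never set
        # (a choice made for this sketch).
        return "".join(
            self._values[part[1]] if isinstance(part, tuple) else part
            for part in self._parts
        )


out = ReferentOutput()
out.write('<xref idref="')
out.write_referent("3dagger")          # value not yet known
out.write('">&dagger;</xref>')
out.set_referent("3dagger", "x1234")   # the target id is seen later
result = out.resolve()
```

Because the value is looked up only in resolve(), a reference can precede the element it points to, which is the situation that arises with ID IDREF pairs.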

In this paper we show how these two features can be combined to achieve three traditionally hard tasks: turning implicit references into explicit references, content based transformation, and rearrangement of blocks.

The basis of the code described here is a simple normalizer: it accepts SGML input and simply outputs the same SGML. The up-translation part reads the input and sends it to the SGML domain (the down-translation). The SGML domain outputs the SGML using an implied element rule. To make the output more readable, appropriate indentation is added. Additional code is then added to achieve the three tasks.
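
As a rough illustration of what such a normalizer does, the following Python sketch reads markup and writes it back out with indentation. It is not the OmniMark program from the appendix: the function name normalize is hypothetical, there is no DTD or SGML parser behind it, and it assumes that every element has an explicit end tag.

```python
import re

# Start/end tags, other markup (e.g. processing instructions), or character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*>|<[^<>]*>|[^<]+")


def normalize(sgml, indent="  "):
    """Echo the input, one tag or text run per line, indented by depth."""
    lines, depth = [], 0
    for m in TOKEN.finditer(sgml):
        text = m.group(0)
        if m.group(2) is None:        # character data or other markup
            if text.strip():
                lines.append(indent * depth + text.strip())
        elif m.group(1):              # end tag
            depth -= 1
            lines.append(indent * depth + text)
        else:                         # start tag
            lines.append(indent * depth + text)
            depth += 1
    return "\n".join(lines)


normalized = normalize(
    "<division><title>General Description</title><para>Text</para></division>"
)
```

Note that the sketch strips and re-indents character data, so it is suitable only where whitespace in content is not significant.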

Turning Implicit References Into Explicit Ones

Many authors do not use the referencing facilities of their word processor. In the pre-SGML days this would often go unnoticed, as they would insert the ``correct'' visible markup for footnotes and other references.

In this case, the converted SGML document includes a number of <note.in.text> elements that have a label but no reference pointing to them. Instead, the text simply contains a ``visual'' reference to the <note.in.text>.

The goal here is to convert the following block containing an implicit reference:

text  &dagger; more text
<note.in.text label=&dagger;>

into the following block containing an explicit reference:

text <xref idref="x1234">&dagger;</xref> more text
<note.in.text id="x1234" label=&dagger;>

To make matters more complicated, the scope of the implicit references is limited to a page. That is, a ``dagger'' on page 3 points to the note.in.text with label="&dagger;" on page 3, while the same ``&dagger;'' on page 5 points to the note.in.text on page 5. The idea is to ``catch'' the entity and surround it with a cross reference that points to the id attribute of the appropriate note.in.text.

  1. The pagebreak information itself was preserved as a processing instruction, and the PageNumber counter is incremented every time a pagebreak processing instruction is encountered. (lines 384-386)

  2. In the SGML domain, using a translate rule, a dagger entity is surrounded by an xref element whose idref is a referent. The referent is uniquely identified by the PageNumber counter and the name of the entity. (lines 367-372)

  3. When a note.in.text with a dagger label is encountered in the SGML domain, we set the referent to the id of that element. The referent is named by the page number and the label with which it is referenced, %d(PageNumber)dagger, which ensures that the referent is unique for each page. (lines 333-347)
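
The three steps above can be sketched in Python as follows. This is an illustration of the page-scoped naming scheme only, not the OmniMark code from the appendix. The spelling of the processing instruction (<?pagebreak>), the function name link_daggers, and the generated ids (x1, x2, ...) are assumptions made for the sketch.

```python
import re

# The note alternative is listed before the bare entity so that the
# label inside the note start tag is not mistaken for a reference.
TOKEN = re.compile(
    r"(?P<page><\?pagebreak>)"
    r"|(?P<note><note\.in\.text label=&dagger;>)"
    r"|(?P<dagger>&dagger;)"
)


def link_daggers(sgml):
    page = 1
    ids = {}        # referent name, e.g. "3dagger" -> id of the note
    next_id = 0
    pieces = []     # literal strings and ("ref", name) placeholders
    pos = 0
    for m in TOKEN.finditer(sgml):
        pieces.append(sgml[pos:m.start()])
        pos = m.end()
        name = f"{page}dagger"
        if m.lastgroup == "page":
            page += 1                       # step 1: count pagebreaks
            pieces.append(m.group())
        elif m.lastgroup == "dagger":       # step 2: wrap entity in xref
            pieces.append('<xref idref="')
            pieces.append(("ref", name))
            pieces.append('">&dagger;</xref>')
        else:                               # step 3: note sets the referent
            next_id += 1
            ids[name] = f"x{next_id}"
            pieces.append(f'<note.in.text id="{ids[name]}" label=&dagger;>')
    pieces.append(sgml[pos:])
    # A dagger with no matching note on its page raises KeyError here.
    return "".join(ids[p[1]] if isinstance(p, tuple) else p for p in pieces)


doc = (
    "text &dagger; more text\n<note.in.text label=&dagger;>\n"
    "<?pagebreak>\n"
    "other &dagger; text\n<note.in.text label=&dagger;>\n"
)
linked = link_daggers(doc)
```

Because the referent name includes the page counter, the dagger on the second page resolves to the second note rather than the first.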

Content Based Transformation

Content based transformation is the transformation of element types based on the content of elements. This transformation is especially useful for converting from structural to content oriented markup. For example, the PCIS DTD contains both structure oriented element types such as division and para and content oriented element types such as general.desc and features.summary. The latter are generally preferred, as they describe the content of the element better than simple structure oriented tags.

However, many legacy documents do not have special tags for content oriented element types. That is, legacy documents contain styles indicating the start of a new division header, while the text contains keywords that describe the content tag. For example, the following block is typical output of a conversion based purely on markup.

    <title>General Description</title>

The goal is to transform this fragment into the following which contains content oriented markup:

<general.desc><title>General Description</title>

The following procedure shows how this can be achieved using a context translation in OmniMark. The goal is to change the start element type in the find domain based on a pattern consisting of the element type and the text, and to let the parser output the appropriate end tag. We achieve this by outputting a shorttag </> to the parser domain, which then outputs the appropriate end tag. Since most standard SGML declarations do not allow the SHORTTAG feature, we dynamically enable the feature in the find domain before sending the input to the SGML parser.

  1. Set the SHORTTAG feature in the SGML declaration to yes dynamically. (lines 150-151)

  2. Whenever an end tag is encountered in the find domain, send a shorttag </> to the parser stream. (lines 230-231)

  3. Search for the pattern that indicates the transformation of the element type, change the element type in the find domain, and send the new element type to the parser. (lines 220-221)

  4. Let the parser domain output the correct end tag automatically.
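
The procedure above can be approximated in Python as two small passes. This is a sketch under stated assumptions, not the OmniMark code from the appendix: the function names find_domain and parser_domain, the keyword table CONTENT_TYPES, and the enclosing division element in the sample input are all assumptions, and a stack of open element names stands in for the SGML parser.

```python
import re

# Hypothetical keyword table: title text -> content oriented element type.
CONTENT_TYPES = {"General Description": "general.desc"}

RETAG = re.compile(r"<division>(\s*<title>([^<]*)</title>)")
END_TAG = re.compile(r"</[A-Za-z][\w.-]*>")
TAG = re.compile(r"</>|<([A-Za-z][\w.-]*)[^<>]*>")


def find_domain(sgml):
    """Retag divisions by title text and replace end tags with </>."""
    def retag(m):
        new_type = CONTENT_TYPES.get(m.group(2).strip())
        if new_type is None:
            return m.group(0)              # no keyword: leave as division
        return f"<{new_type}>{m.group(1)}"
    return END_TAG.sub("</>", RETAG.sub(retag, sgml))


def parser_domain(sgml):
    """Expand each </> into the end tag of the innermost open element."""
    stack = []
    def expand(m):
        if m.group(0) == "</>":
            return f"</{stack.pop()}>"
        stack.append(m.group(1))
        return m.group(0)
    return TAG.sub(expand, sgml)


source = "<division><title>General Description</title><para>Text</para></division>"
intermediate = find_domain(source)
transformed = parser_domain(intermediate)
```

The point of the empty end tag is that the find domain never needs to know which element it is closing; only the start tag has to be rewritten. The sketch assumes every element has an explicit end tag, whereas a real SGML parser takes that information from the DTD.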

Rearrangement Of Blocks

Much thought has gone into the order of elements for standard DTDs. Unfortunately, the order in which the DTD requires the text and elements does not always correspond with legacy documents or even standard writing style guides. The most common example is the figure caption. While most writing styles prefer the caption to appear after the actual figure, most common document type definitions, including DocBook and PCIS, specify that the caption of a figure appears before the actual figure.

Rather than trying to rearrange the blocks during the first step, it is easier to achieve this transformation once the document is already in SGML, because we can use structural information from the SGML.

In this example, we show how to transform from

    <avo avo.ptr="fig1"></avo>
    <caption>Figure caption</caption>

into

    <caption>Figure caption</caption>
    <avo avo.ptr="fig1"></avo>

This time we exploit the ability to capture text in the find domain in a referent and to output that referent in the SGML domain.

  1. Based on a pattern in the find domain, any caption gets saved into a unique referent. (lines 207-210)

  2. In the SGML domain in the fig element rule, we output the referent before the avo element. (lines 275-294)

However, there might be cases when a) a figure has no caption, and b) a caption appears outside an avo element.

We handle the first case by making the caption referent unique not only to the caption but to the caption within a figure. That is, there are two counters, fig.number and caption.number, which are incremented whenever a fig or caption element is encountered, respectively. The caption referent uses both counters to make itself unique.

To address the second case, we ``remember'' whether a referent for the caption has been output by using a switch. The caption is output in the element domain only if the referent has not already been output. (lines 297-311)
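
The rearrangement can be sketched in Python as follows. This is a simplified illustration, not the OmniMark code from the appendix: the function name move_captions is hypothetical, a single figure counter replaces the two counters described above, and the caption placeholder is emitted directly after the fig start tag, which puts the caption before the avo whenever the avo is the first child of the fig.

```python
import re

TOKEN = re.compile(
    r"(?P<fig><fig(?:\s[^>]*)?>)"
    r"|(?P<endfig></fig>)"
    r"|(?P<caption><caption>.*?</caption>)",
    re.DOTALL,
)


def move_captions(sgml):
    captions = {}    # figure number -> captured caption (the referent value)
    pieces = []      # literal strings and ("ref", figure number) placeholders
    fig_number = 0
    in_fig = False
    pos = 0
    for m in TOKEN.finditer(sgml):
        pieces.append(sgml[pos:m.start()])
        pos = m.end()
        kind = m.lastgroup
        if kind == "fig":
            fig_number += 1
            in_fig = True
            pieces.append(m.group())
            pieces.append(("ref", fig_number))   # caption goes here, if any
        elif kind == "endfig":
            in_fig = False
            pieces.append(m.group())
        elif in_fig and fig_number not in captions:
            captions[fig_number] = m.group()     # captured, not output in place
        else:
            pieces.append(m.group())             # caption outside a figure
    pieces.append(sgml[pos:])
    # A figure without a caption resolves its placeholder to nothing.
    return "".join(
        captions.get(p[1], "") if isinstance(p, tuple) else p for p in pieces
    )


sample = (
    '<fig><avo avo.ptr="fig1"></avo><caption>Figure caption</caption></fig>'
    '<fig><avo avo.ptr="fig2"></avo></fig>'
    "<caption>Loose</caption>"
)
rearranged = move_captions(sample)
```

In the sample, the first figure has its caption moved, the second figure has no caption and is left unchanged, and the caption that appears outside any figure is output where it was found. Whitespace between elements is not adjusted by the sketch.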


In this paper we demonstrated how the unique features of OmniMark can be exploited to achieve an SGML to SGML transformation resulting in documents of higher quality. This rests on the philosophy that a conversion from a non-structured source to SGML is better achieved in a number of smaller steps rather than all at once.

We showed how context translation can be combined with the use of referents to achieve traditionally difficult tasks such as transforming implicit references into explicit ones, transformation based on content, and rearrangement of blocks.