**Project Manager:** Keith E. Shafer, Senior Research Scientist

**Abstract**

While there is now an international standard for mathematical markup, no systems produce formatted documents from the complete standard. This report describes how mathematical markup is translated at OCLC.

OCLC's Electronic Journals Online (EJO) provides a typeset-quality display of journal articles via the Guidon document viewer. Guidon formats files coded in the TeX typesetting language to produce online pages. EJO, however, accepts source documents marked up in the Standard Generalized Markup Language (SGML); thus, source documents must be translated to TeX to produce displayable files.

To facilitate this translation to TeX, we added translation capabilities to the SGML Grammar Builder interpreter, Fred. (See "Manipulating Tagged Text" for an overview of Fred's translation processes.) The goal of this project was to use Fred to translate the set of tagged structures that comprise the international standard for SGML mathematical markup (found in ISO 12083) to TeX for use in EJO.

Many of the ISO 12083 structures have direct mappings to TeX control
sequences. For instance, the tag "bold" maps directly to the
TeX sequence "\bf" and the entity "Dgr" maps directly to
"\Delta". Other ISO 12083 mathematical tags do not have direct
mappings to TeX control sequences. These tags require either a
choice of which TeX control sequence to use or the insertion of
additional control sequences to produce correct formatting. Many of
these situations can be resolved by looking at the context of a
given tag. Three kinds of context are commonly used in
the mathematical translation: *ancestor*, *descendant*, and *sibling*.
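The direct mappings can be pictured as simple lookup tables. The following is a minimal Python sketch, not Fred's actual translation syntax; the table and function names are hypothetical, and only the "bold" and "Dgr" entries are taken from the text.

```python
# Hypothetical lookup tables for direct mappings.
# Only the two entries below come from the report; a real table
# would cover many more tags and entities.
TAG_TO_TEX = {"bold": r"\bf"}
ENTITY_TO_TEX = {"Dgr": r"\Delta"}


def translate_tag(name):
    """Return the TeX control sequence for a tag, or None when the
    tag has no direct mapping and context must be consulted."""
    return TAG_TO_TEX.get(name)


def translate_entity(name):
    """Return the TeX control sequence for an entity, or None."""
    return ENTITY_TO_TEX.get(name)
```

A result of `None` stands for the cases discussed next, where the translator has to inspect ancestors, descendants, or siblings before it can choose a control sequence.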

Text justification is a good example of the use of
ancestor information. The justification of a fraction in the
ISO 12083 mathematical standard can be specified in the
*fraction* start tag as an attribute. In TeX, horizontal fill is
generally used to manually justify text by placing space before or
after the element to be justified. To translate *fraction* into TeX,
horizontal fill must be specified in the numerator or denominator
substructures. The translation program must look "up" at the
enclosing fraction structure for the value of its alignment
attribute to know where to properly insert the fill for the
substructures. In some instances, the program may need to look
"up" into the enclosing mathematical formula to discover the proper
alignment.
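The ancestor lookup can be sketched as a walk up a parsed document tree. This is an illustrative Python sketch under stated assumptions, not Fred's implementation: the `Node` class, the element names, the attribute name `align`, and its values `left` and `right` are all hypothetical stand-ins for the ISO 12083 names.

```python
class Node:
    """A minimal parse-tree node with a link back to its parent."""

    def __init__(self, tag, attrs=None, children=None, text=""):
        self.tag = tag
        self.attrs = attrs or {}
        self.text = text
        self.parent = None
        self.children = children or []
        for child in self.children:
            child.parent = self


def inherited_attr(node, name, default=None):
    """Look "up" through the ancestors for the nearest value of name."""
    current = node.parent
    while current is not None:
        if name in current.attrs:
            return current.attrs[name]
        current = current.parent
    return default


def justify(node):
    """Place TeX horizontal fill according to the inherited alignment
    (assumed values: "left", "right"; anything else adds no fill)."""
    align = inherited_attr(node, "align", "center")
    if align == "left":
        return node.text + r"\hfill"   # fill after pushes the text left
    if align == "right":
        return r"\hfill " + node.text  # fill before pushes the text right
    return node.text
```

For example, a numerator whose enclosing fraction carries `align="left"` is emitted as `a\hfill`. If the fraction has no alignment of its own, the walk continues to the enclosing formula.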

Similarly, translation of the radical structure uses descendant information. TeX has two control sequences for radicals: one generates a simple square root and the other generates a general root with an explicit radix. The translation program determines which control sequence to use by checking the number of immediate substructures of the radical structure. If there is one substructure, indicating that there is no radix, the simple square root control sequence is selected. If there are two substructures, the general root sequence is selected.
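The descendant check for radicals can be sketched as follows. This Python sketch assumes the plain TeX control sequences `\sqrt` and `\root ... \of`, and it assumes the radicand comes first and the optional radix second; that ordering and the function name are assumptions, not something the report states.

```python
def radical_to_tex(parts):
    """Choose the TeX radical form from the number of immediate
    substructures (already translated to TeX strings).

    Assumed ordering: parts[0] is the radicand, parts[1] the radix.
    """
    if len(parts) == 1:
        # No radix: simple square root.
        return r"\sqrt{" + parts[0] + "}"
    if len(parts) == 2:
        radicand, radix = parts
        # Explicit radix: general root.
        return r"\root " + radix + r" \of {" + radicand + "}"
    raise ValueError("a radical has one or two substructures")
```

With these assumptions, `radical_to_tex(["x"])` selects the square root form and `radical_to_tex(["x", "3"])` selects the general root form.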

The generation of TeX array cell separators requires
*sibling* knowledge. In the ISO 12083 mathematical standard, every
array cell is marked with a start tag, and usually, the cell is
completely delimited via an end tag. TeX, on the other hand, marks
only the separation of cells. This means that the translation program must be able to determine whether a cell is last in a list of cells. If it is, the translation program does not generate a separator.
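The sibling test amounts to asking whether a cell is the last one in its row. A minimal Python sketch, assuming the cells have already been translated to TeX strings and using `&`, the TeX cell separator (the function name is hypothetical):

```python
def row_to_tex(cells):
    """Join translated cells with the TeX separator, emitting no
    separator after the last cell in the row."""
    pieces = []
    for index, cell in enumerate(cells):
        pieces.append(cell)
        is_last = index == len(cells) - 1
        if not is_last:
            pieces.append(" & ")
    return "".join(pieces)
```

The explicit `is_last` test mirrors the description above. In idiomatic Python the same result comes from `" & ".join(cells)`.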

Translation in the previous examples involved simple text substitutions or text insertions. Some translations are more complex in that they require the placement of text in
locations other than those where the tags occur. An example is
the placement of superscripts and subscripts before an
element. The ISO 12083 mathematical standard uses a syntax that
specifies all of the superscripts and subscripts for an
element after the element. For example, an N with a leading
superscript *i* and a trailing superscript *j* is encoded as:

<subform> N </subform> <sup loc=pre> i </sup> <sup> j </sup>

Assigning the value *pre* to the attribute *loc* specifies
that the superscript *i* occurs before the subform N. TeX
encodes this whole structure as

"$^iN^j$",

so the ^i that corresponds to <sup loc=pre> i </sup> must be moved in front of the target subform N when the text is translated.
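The relocation can be sketched by collecting the scripts first and emitting them around the base afterward. This Python sketch is illustrative only: the tuple layout, the kind names `sup` and `sub`, and the function name are hypothetical, and any `loc` value other than `pre` is treated as trailing. It wraps each script in braces so that multi-character scripts stay grouped.

```python
def scripts_to_tex(base, scripts):
    """Emit leading scripts before the base and trailing ones after it.

    scripts is a list of (kind, loc, text) tuples in document order,
    where kind is "sup" or "sub" and loc is "pre" for a leading script.
    """
    marks = {"sup": "^", "sub": "_"}
    before, after = [], []
    for kind, loc, text in scripts:
        piece = marks[kind] + "{" + text + "}"
        if loc == "pre":
            before.append(piece)  # moved in front of the base
        else:
            after.append(piece)   # stays behind the base
    return "".join(before) + base + "".join(after)
```

For the example above, `scripts_to_tex("N", [("sup", "pre", "i"), ("sup", "post", "j")])` yields `^{i}N^{j}`, the braced equivalent of the TeX encoding shown.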

We were able to handle most of the ISO 12083 mathematical standard with the current translation capabilities of Fred. However, parts of the standard required postprocessing. In particular, ISO 12083 allows arrays to be specified as a sequence of columns as well as a sequence of rows, whereas TeX specifies them only as rows. We currently pass any array specified in column order to a process that transforms the column-order specification into a corresponding row-order specification.
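The column-to-row transformation is a transposition. A minimal Python sketch of the idea, assuming each column is a list of already-translated cell strings and that all columns have the same length; it is not the postprocessor used at OCLC, and the function name is hypothetical.

```python
def columns_to_rows(columns):
    """Transpose a column-order array into row order."""
    heights = {len(column) for column in columns}
    if len(heights) > 1:
        raise ValueError("all columns must have the same number of cells")
    # zip(*columns) pairs up the k-th cell of every column into row k.
    return [list(row) for row in zip(*columns)]
```

For instance, the columns `[["a", "c"], ["b", "d"]]` become the rows `[["a", "b"], ["c", "d"]]`, which can then be emitted in TeX's row order.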

Another problematic SGML structure is overlapping underlines and
overlines. In SGML these are specified by reference mark tags that
have an *id* attribute. These reference tags can be used by the
*underline* and *overline* structures to determine where to start or
finish. No TeX structure corresponds directly to this.

We are continuing to expand Fred's translation capabilities based on suggestions from our translation users and the EJO development team. Plans are currently underway to make Fred translation services available outside OCLC in the near future.

Project Staff: Roger Thompson, Senior Systems Analyst