Translating Mathematical Markup for Electronic Journals

Project Manager: Keith E. Shafer, Senior Research Scientist


Abstract

While there is now an international standard for mathematical markup, no systems produce formatted documents from the complete standard. This report describes how mathematical markup is translated at OCLC.


OCLC's Electronic Journals Online (EJO) provides a typeset-quality display of journal articles via the Guidon document viewer. Guidon formats files coded in the TeX typesetting language to produce online pages. EJO, however, accepts source documents marked up in the Standard Generalized Markup Language (SGML); these source documents must therefore be translated to TeX to produce displayable files.

To facilitate this translation to TeX, we added translation capabilities to the SGML Grammar Builder interpreter, Fred. (See "Manipulating Tagged Text" for an overview of Fred's translation processes.) The goal of this project was to use Fred to translate the set of tagged structures that comprise the international standard for SGML mathematical markup (found in ISO 12083) to TeX for use in EJO.

Many of the ISO 12083 structures have direct mappings to TeX control sequences. For instance, the tag "bold" maps directly to the TeX sequence "\bf", and the entity "Dgr" maps directly to "\Delta". Other ISO 12083 mathematical tags do not have direct mappings to TeX control sequences. These tags require either a choice among several TeX control sequences or the insertion of additional control sequences to produce correct formatting. Many of these situations can be resolved by looking at the context of a given tag. Three kinds of context are commonly used in the mathematical translation: ancestor, descendant, and sibling.
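The direct mappings described above can be pictured as a simple lookup table. The following Python sketch is only an illustration and is not Fred's own syntax; the function names are hypothetical, and only the "bold" and "Dgr" entries come from the text.

```python
# Illustrative lookup tables. Only these two entries are taken from the
# report; a real translation table would be far larger.
TAG_TO_TEX = {"bold": r"\bf"}
ENTITY_TO_TEX = {"Dgr": r"\Delta"}


def translate_tag(name):
    """Return the TeX control sequence for a directly mappable tag, or None."""
    return TAG_TO_TEX.get(name)


def translate_entity(name):
    """Return the TeX control sequence for a directly mappable entity, or None."""
    return ENTITY_TO_TEX.get(name)


def needs_context(name):
    """True when a tag has no direct mapping and must be resolved from context."""
    return name not in TAG_TO_TEX
```

A tag for which translate_tag returns None is one of the cases that calls for the contextual rules discussed next.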

Ancestor

Text justification is a good example of the use of ancestor information. The justification of a fraction in the ISO 12083 mathematical standard can be specified as an attribute of the fraction start tag. In TeX, text is generally justified manually with horizontal fill, that is, by placing stretchable space before or after the element to be justified. To translate a fraction into TeX, horizontal fill must be specified in the numerator or denominator substructures. The translation program must look "up" at the enclosing fraction structure for the value of its alignment attribute to know where to insert the fill for the substructures. In some instances, the program may need to look further "up" into the enclosing mathematical formula to discover the proper alignment.
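The ancestor lookup can be sketched as a walk up a tree of parent pointers. In this Python illustration the Node class, the attribute name "align", and its values "left" and "right" are assumptions made for the example; they are not taken from ISO 12083 or from Fred.

```python
class Node:
    """A minimal tagged structure with a pointer to its enclosing structure."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.parent = parent


def inherited_attr(node, name, default=None):
    """Look 'up' through the ancestors until one of them defines the attribute."""
    current = node.parent
    while current is not None:
        if name in current.attrs:
            return current.attrs[name]
        current = current.parent
    return default


def justify(node, tex_body):
    """Insert horizontal fill around a numerator or denominator body."""
    align = inherited_attr(node, "align", "center")  # hypothetical attribute
    if align == "left":
        return tex_body + r"\hfill"   # fill after the text pushes it left
    if align == "right":
        return r"\hfill " + tex_body  # fill before the text pushes it right
    return tex_body                   # centered: no extra fill in this sketch
```

Because the walk continues past the fraction, a numerator whose fraction carries no alignment picks up the value set on the enclosing formula.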

Descendant

Similarly, translation of the radical structure uses descendant information. TeX has two control sequences for radicals: one generates a simple square root and the other generates a general root with an explicit radix. The translation program determines which control sequence to use by checking the number of immediate substructures of the radical structure. If there is one substructure, indicating that there is no radix, the simple square root control sequence is selected. If there are two substructures, the general root sequence is selected.
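The descendant test can be sketched as a count of immediate substructures. This Python illustration assumes the two control sequences are plain TeX's \sqrt and \root ... \of, and that the radicand comes first with the optional radix second; both points are assumptions, as is the function name.

```python
def translate_radical(children):
    """Choose a radical control sequence from the number of substructures.

    `children` holds the already translated TeX for each immediate
    substructure. The order (radicand, then optional radix) is assumed
    for this sketch.
    """
    if len(children) == 1:
        # No radix present: simple square root.
        return r"\sqrt{%s}" % children[0]
    if len(children) == 2:
        # Explicit radix present: general root.
        radicand, radix = children
        return r"\root %s \of {%s}" % (radix, radicand)
    raise ValueError("a radical needs one or two substructures")
```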

Sibling

The generation of TeX array cell separators requires sibling knowledge. In the ISO 12083 mathematical standard, every array cell is marked with a start tag, and the cell is usually closed with an end tag. TeX, on the other hand, marks only the separation between cells. This means that the translation program must be able to determine whether a cell is the last in a list of cells. If it is, the translation program does not generate a separator.
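The sibling check can be sketched as follows. This Python illustration assumes the cells of one row have already been translated to TeX strings and that "&" is the cell separator, as in TeX alignments; the function name is hypothetical.

```python
def translate_row(cells):
    """Join translated cells, emitting a separator after every cell but the last."""
    parts = []
    for index, cell in enumerate(cells):
        parts.append(cell)
        is_last = index == len(cells) - 1  # does this cell have a following sibling?
        if not is_last:
            parts.append(" & ")
    return "".join(parts)
```

The explicit loop mirrors the decision a tag-by-tag translator has to make at each cell; in a setting where the whole row is available at once, " & ".join(cells) gives the same result.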

Translation in the previous examples involved simple text substitutions or text insertions. Some translations are more complex in that they require the placement of text in locations other than those where the tags occur. An example is the placement of superscripts and subscripts before an element. The ISO 12083 mathematical standard uses a syntax that specifies all of the superscripts and subscripts for an element after the element. For example, an N with a leading superscript i and a trailing superscript j is encoded as:

<subform> N </subform> <sup loc=pre> i </sup> <sup> j </sup>

The value assignment of pre to the attribute loc specifies that the superscript i occurs before the subform N. TeX encodes this whole structure as

$^iN^j$,

so the ^i that corresponds to <sup loc=pre> i </sup> must be moved in front of the target subform N when the text is translated.
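The relocation can be sketched by collecting the scripts that follow an element and emitting the leading ones first. In this Python illustration the tuple layout, the "post" value used for trailing scripts, and the function name are assumptions; the braces around each script are added so that multi-character scripts stay grouped, which for single characters is equivalent to the form shown above.

```python
def translate_scripted(base, scripts):
    """Place leading scripts before the base and trailing scripts after it.

    `scripts` is a list of (kind, loc, text) tuples in source order, where
    kind is "sup" or "sub" and loc is "pre" or "post" (assumed names).
    """
    marker = {"sup": "^", "sub": "_"}
    pre = "".join(marker[kind] + "{" + text + "}"
                  for kind, loc, text in scripts if loc == "pre")
    post = "".join(marker[kind] + "{" + text + "}"
                   for kind, loc, text in scripts if loc != "pre")
    return "$" + pre + base + post + "$"


# The example from the text: N with leading superscript i and trailing j.
example = translate_scripted("N", [("sup", "pre", "i"), ("sup", "post", "j")])
```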

We were able to handle most of the ISO 12083 mathematical standard with the current translation capabilities of Fred. However, some parts of the standard required postprocessing. In particular, ISO 12083 allows an array to be specified as a sequence of columns as well as a sequence of rows, whereas TeX specifies arrays only as rows. We currently pass any array specified in column order to a separate process that transforms the column-order specification into the corresponding row-order specification.
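The column-to-row transformation amounts to a transpose. The Python sketch below is not the postprocessor used at OCLC; the function names, the padding of short columns with empty cells, and the use of "&" and "\cr" as cell and row separators are assumptions made for the illustration.

```python
def columns_to_rows(columns):
    """Transpose a column-order array (list of columns) into row order."""
    height = max((len(column) for column in columns), default=0)
    # Pad short columns with empty cells so that every row is complete.
    padded = [list(column) + [""] * (height - len(column)) for column in columns]
    return [list(row) for row in zip(*padded)]


def rows_to_tex(rows):
    """Emit row-order TeX: '&' between cells, '\\cr' between rows."""
    return r" \cr ".join(" & ".join(row) for row in rows)
```

For example, the columns [a, c] and [b, d] become the rows [a, b] and [c, d], which rows_to_tex writes as "a & b \cr c & d".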

Another problematic SGML structure is the overlapping of underlines and overlines. In SGML these are specified with reference mark tags that carry an id attribute. The underline and overline structures refer to these reference marks to determine where they start or finish. TeX has no structure that corresponds directly to this.

We are continuing to expand Fred's translation capabilities based on suggestions from our translation users and the EJO development team. Plans are currently underway to make Fred translation services available outside OCLC in the near future.

Project Staff: Roger Thompson, Senior Systems Analyst