Manipulating Tagged Text

Project Manager: Keith E. Shafer, Senior Research Scientist

Abstract

While the Standard Generalized Markup Language (SGML) is intended to offer freedom from vendor-dependent data, it is difficult to translate arbitrary SGML into multiple output formats. To address this problem, we have incorporated translation capabilities into the SGML Grammar Builder project.

SGML is a meta-language for writing a Document Type Definition (DTD). A DTD describes how a document conforming to it should be marked up: the structural tags that may occur in the document, the ordering of the tags, and a host of other SGML features. A DTD describes a class of tagged documents in a vendor-independent way. This flexibility to describe and create vendor-independent documents via DTDs is one of SGML's greatest benefits.

Vendor independence, however, comes at a price. For example, users of powerful word processing systems expect to edit and view formatted documents, not just create ASCII documents with embedded SGML tags. SGML can really free users only when tools exist to translate tagged text into other formats, or when traditional word processors accept SGML input. Users avoid the constraints of proprietary formats but must accept other constraints imposed by SGML tools.

Despite the drawbacks, the use of SGML is increasing because its benefits far outweigh the disadvantages. For instance, we have several different tagged text data sources for reference databases at OCLC. Not all are true SGML sources: by definition, SGML requires a DTD, and some of OCLC's tagged data sources do not strictly adhere to one. Fortunately, the SGML Grammar Builder interpreter does not require an SGML DTD. In fact, it can induce a DTD from sample documents.
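As a rough illustration of what inducing structure from samples involves, the following Python sketch records which child tags were observed under each tag. It is not the interpreter's induction method, which is not described here. The function name observe_children and the sample documents are hypothetical, and the samples are assumed to be well-formed enough for Python's XML parser. A real induction would go further and generalize ordering and repetition into content models.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def observe_children(samples):
    """Map each tag name to the sorted child tag names seen under it."""
    seen = defaultdict(set)
    for sample in samples:
        # iter() visits the root element and every descendant.
        for element in ET.fromstring(sample).iter():
            seen[element.tag].update(child.tag for child in element)
    return {tag: sorted(children) for tag, children in seen.items()}

# Hypothetical sample documents.
samples = [
    "<article><title>A</title><para>x</para></article>",
    "<article><title>B</title><para>y</para><note>z</note></article>",
]
observed = observe_children(samples)
print(observed["article"])
```

The second sample adds a note element that the first lacks, so the observed set for article grows as more samples are read. This is the sense in which the description is induced rather than declared.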

The tagged text that OCLC receives must be manipulated into other formats for database loading and interface presentations. For example, Electronic Journals Online (EJO) documents are typeset using TeX, so their tagged source must be translated into TeX. Some of the same documents are also made available via the World Wide Web (WWW), so the same source must also be translated into the HyperText Markup Language (HTML). Accordingly, translating tagged text into multiple output formats is of primary interest.
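The idea of one tagged source feeding several output formats can be sketched in Python as follows. This is only an illustration under stated assumptions: the tag names, the FORMATS tables, and the translate function are hypothetical, the document is represented as nested (tag, children) pairs, and special characters are not escaped for either target.

```python
# Hypothetical per-format tables: tag name -> (text emitted before, text emitted after).
FORMATS = {
    "html": {
        "title": ("<h1>", "</h1>\n"),
        "para": ("<p>", "</p>\n"),
        "emph": ("<em>", "</em>"),
    },
    "tex": {
        "title": ("\\section*{", "}\n"),
        "para": ("", "\n\n"),
        "emph": ("\\emph{", "}"),
    },
}

def translate(node, fmt):
    """Translate a node, which is either a string or a (tag, children) pair."""
    if isinstance(node, str):
        return node
    tag, children = node
    # Tags without a table entry contribute only their translated content.
    before, after = FORMATS[fmt].get(tag, ("", ""))
    return before + "".join(translate(child, fmt) for child in children) + after

# One hypothetical source document, translated twice.
doc = ("article", [
    ("title", ["Fred"]),
    ("para", ["A ", ("emph", ["small"]), " example."]),
])
html_output = translate(doc, "html")
tex_output = translate(doc, "tex")
print(html_output)
print(tex_output)
```

Keeping the source fixed and varying only the table is what makes a second or third output format cheap to add.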

A few general translation tools are available, but most force users to map into a predefined DTD or do not offer sufficient options to meet the translation needs at OCLC. Consequently, we have added a translation language to the SGML Grammar Builder interpreter, Fred.

To translate a tagged document using Fred, a user supplies a translation script and, optionally, an entity translation table. The entity translation table provides for textual substitutions based on SGML entities, which behave much like named textual variables or pointers. For instance, a document could contain the entity &food; which might be replaced by the word pizza in the translated output.
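The substitution described above can be sketched in a few lines of Python. The table format, the function name substitute_entities, and the simplified pattern for entity names are assumptions made for illustration. They do not reproduce Fred's table syntax.

```python
import re

# Hypothetical entity translation table: entity name -> replacement text.
ENTITY_TABLE = {"food": "pizza"}

# Matches references of the form &name; (a simplification of SGML name rules).
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def substitute_entities(text, table):
    """Replace each known entity reference and leave unknown ones untouched."""
    def replace(match):
        name = match.group(1)
        return table.get(name, match.group(0))
    return ENTITY_REF.sub(replace, text)

translated = substitute_entities("We ordered &food; and &drink;.", ENTITY_TABLE)
print(translated)  # We ordered pizza and &drink;.
```

Leaving unknown references in place, rather than dropping them, keeps missing table entries visible in the output.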

To perform a translation, Fred first processes the tagged document to extract the tags and build a structural representation of the document. Then Fred traverses the document structure, applying the complete translation script to each tag in the document.
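The two phases can be pictured with the following Python sketch: first build a tree from the tags, then walk the tree and apply every rule to every tag. The Node class, the build_tree and traverse functions, and the tag pattern (no attributes, strictly nested tags) are simplifying assumptions, not a description of Fred's internals.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # holds Node and str items

# Start tag, end tag, or a run of text; attributes are not handled.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")

def build_tree(source):
    """Phase 1: extract the tags and build a structural representation."""
    root = Node("#document")
    stack = [root]
    for closing, name, text in TOKEN.findall(source):
        if text:
            stack[-1].children.append(text)
        elif closing:
            if len(stack) == 1 or stack[-1].tag != name:
                raise ValueError(f"unexpected end tag: {name}")
            stack.pop()
        else:
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
    return root

def traverse(node, script):
    """Phase 2: pre-order walk applying every rule in the script to each tag."""
    results = [rule(node) for rule in script]
    for child in node.children:
        if isinstance(child, Node):
            results.extend(traverse(child, script))
    return results

tree = build_tree("<article><title>Fred</title><para>Hello</para></article>")
visited = traverse(tree, [lambda node: node.tag])
print(visited)  # ['#document', 'article', 'title', 'para']
```

Here the script is a single rule that reports the tag name, which makes the visiting order easy to see. A real script would pair conditions with actions.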

The user-supplied translation script uses a language composed of two parts: conditions and actions. There are several types of conditions including checks on tag names, tag attributes and variables, and document structure. Translation conditions can be combined using standard Boolean operators with normal precedence, and can be parenthesized for grouping and readability. Guided by the translation script, Fred scans the document tree structure from the current tag looking for tags that match the desired conditions.
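Conditions of this kind can be modeled as composable predicates. The sketch below is in Python rather than in the translation language itself, whose syntax is not shown here. The Tag and Condition classes and the helpers name_is, has_attr, and inside are hypothetical. Python's ~, &, and | operators stand in for NOT, AND, and OR and bind in that order, and parentheses group as usual.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Tag:
    name: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Tag"] = None

class Condition:
    """Wraps a test on a tag so that tests combine with ~, & and |."""
    def __init__(self, test):
        self.test = test
    def __call__(self, tag):
        return self.test(tag)
    def __and__(self, other):
        return Condition(lambda tag: self(tag) and other(tag))
    def __or__(self, other):
        return Condition(lambda tag: self(tag) or other(tag))
    def __invert__(self):
        return Condition(lambda tag: not self(tag))

def name_is(name):
    """Check on the tag name."""
    return Condition(lambda tag: tag.name == name)

def has_attr(key, value):
    """Check on a tag attribute."""
    return Condition(lambda tag: tag.attrs.get(key) == value)

def inside(name):
    """Check on document structure: is any ancestor named `name`?"""
    def test(tag):
        ancestor = tag.parent
        while ancestor is not None:
            if ancestor.name == name:
                return True
            ancestor = ancestor.parent
        return False
    return Condition(test)

# A title that is either inside an abstract or marked type="main".
wanted = name_is("title") & (inside("abstract") | has_attr("type", "main"))

article = Tag("article")
abstract = Tag("abstract", parent=article)
abstract_title = Tag("title", parent=abstract)
sub_title = Tag("title", {"type": "sub"}, parent=article)
print(wanted(abstract_title), wanted(sub_title))  # True False
```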

Translation actions include text addition, text removal, text movement, text sorting, and the use of integer- and string-valued variables. As with other programming languages, translation actions can be nested and blocked, and can include sub-blocks which can specify additional conditions and actions.
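Several of these actions can be illustrated with a small Python sketch. Each branch stands in for a condition followed by its block of actions. The tag names, the Node class, and the render function are hypothetical, and text movement is not shown.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def render(node, variables):
    """Translate one node, updating the shared variables as a side effect."""
    if node.tag == "note":
        return ""  # text removal: notes are dropped from the output
    if node.tag == "item":
        variables["count"] += 1  # integer-valued variable
        return f"{variables['count']}. {node.text}\n"  # text addition
    if node.tag == "list":
        # Text sorting: children are emitted in alphabetical order.
        ordered = sorted(node.children, key=lambda child: child.text)
        return "".join(render(child, variables) for child in ordered)
    # Default: pass the text through and recurse into the children.
    return node.text + "".join(render(child, variables) for child in node.children)

doc = Node("list", children=[
    Node("item", "pear"),
    Node("note", "draft"),
    Node("item", "apple"),
])
variables = {"count": 0}
output = render(doc, variables)
print(output)
```

The counter is incremented after sorting, so the numbering follows the sorted order rather than the source order.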

The SGML Grammar Builder interpreter is currently being used for several translation tasks including EJO mathematics, EJO HTML documents, and the Internet Cataloging project. More detailed reports on some of these projects appear in this issue of the Annual Review of Research.

Project Staff: Roger Thompson, Senior Systems Analyst