Converting ACM Authors' Articles to SGML

Project Manager: Bradley C. Watson, Research Scientist


The Association for Computing Machinery (ACM) has contracted with OCLC to provide an end-to-end electronic publishing system for its journal publications. This report focuses on the article conversion component, which converts accepted articles from the author's original format to Standard Generalized Markup Language (SGML), based on ACM's SGML Document Type Definition (DTD). The conversion component works with documents in several word-processing and text-editing formats, including WordPerfect, MS Word, FrameMaker, and LaTeX, that originate on a variety of platforms, including DOS/Windows, Macintosh, OS/2, and Unix. The key transformations in the process are: (1) from the native format to Rich Text Format (RTF), and (2) from RTF to SGML. In most cases the word processor itself creates the RTF file in the first step; Exoterica Corporation's OMNIMARK programming language is used for the second step.

The Association for Computing Machinery (ACM) has contracted with OCLC to provide an end-to-end electronic publishing system for its journal publications based on Standard Generalized Markup Language (SGML). The OCLC System of Total Electronic Publishing Services™ (STEPS) serves as the basis for the ACM publishing system. OCLC STEPS is composed of five components (Fig. 1).

The role of the Office of Research (OR) in the ACM project in particular, and in the STEPS project in general, is to produce the document capture and conversion engine.

The Office of Research was chosen to support this aspect of OCLC STEPS because of its ongoing research in the area of text markup, specifically SGML, and its consequent expertise in the general arena of text manipulation and conversion techniques. More specifically, it has acquired knowledge and experience in the use of Exoterica Corporation's OMNIMARK language, a computer language specifically designed for producing programs that translate text from one format to another, especially, though not necessarily, to and from SGML. ACM specified use of OMNIMARK where applicable in the conversion processes.

Fig. 1 STEPS Components

Conversion: The Problem

The process of automatically converting documents produced under one set of markup rules to another set of such rules is not yet well understood. However, the primary elements are known. Such a process must be based, either explicitly or implicitly, on formal grammar theory at both the syntactic and semantic levels, because any set of markup rules is actually the grammar of a language.

At the syntactic level, conversions between markup languages must deal with what are admittedly mundane matters: differences of vocabulary and formation rules. The mapping usually runs in one of two directions. In the first, presentation-level rules and their associated markup tags (rules and tags that specify how the document looks on paper or on screen) are mapped to logical-structure rules and tags (rules and tags that specify how parts of the document relate to each other logically). In the second, the reverse is required: logical-structure rules are mapped to presentation rules. Automatic conversion from logical markup to presentation markup is relatively straightforward in theory and in practice, but conversion from presentation markup to logical markup is not. In fact, it is doubtful that an algorithm covering all possible cases can be defined.
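The asymmetry between the two directions can be illustrated with a small Python sketch. The element and style names below are hypothetical and are not taken from the ACM DTD. The point is only that a style sheet maps each logical element to exactly one rendering, while a single rendering may correspond to several logical elements.

```python
from collections import defaultdict

# Hypothetical style sheet: each logical element has exactly one rendering,
# so the logical-to-presentation direction is an ordinary function.
LOGICAL_TO_PRESENTATION = {
    "title": "bold-centered",
    "section-heading": "bold",
    "term": "bold",
    "emphasis": "italic",
    "citation-title": "italic",
}


def invert(mapping):
    """Group logical elements by the presentation style they share."""
    inverse = defaultdict(list)
    for logical, presentation in mapping.items():
        inverse[presentation].append(logical)
    return dict(inverse)


PRESENTATION_TO_LOGICAL = invert(LOGICAL_TO_PRESENTATION)


def candidates(presentation):
    """Return every logical element that could have produced this style."""
    return PRESENTATION_TO_LOGICAL.get(presentation, [])


print(candidates("bold-centered"))  # ['title']
print(candidates("bold"))           # ['section-heading', 'term']
```

A run of bold text has two candidate interpretations here, and nothing in the presentation markup alone says which one the author meant. Some additional source of information is needed to choose.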

For the OCLC STEPS and ACM projects, both conversion directions must be accomplished: logical to presentation to support exporting SGML documents from the database to be printed, and presentation to logical to support importing of documents into the SGML database. Our responsibility is for the latter.

Pretagging Eases Conversion

Fortunately, while such conversions are difficult, perhaps intractable, in the abstract, specific sets of data that are marked up in a known, controlled manner are a much more reasonable source for them. In such cases, further control can be exerted to constrain the possible transformations to a particular set. Alternatively, the known transformations can be precoded into the document by the author, using markers or tags at specific points to indicate that a particular logical structure begins or ends there. The ACM project uses this solution: authors are required to precode accepted articles before final submission.

Because the relationship between a sign and its signified is arbitrary, as the linguist Ferdinand de Saussure first noted in the early part of this century, the tags that mark specific logical structures are determined by the conversion system designer. ACM decided that the tags would be mnemonic. For instance, the title tags are "Title:" and "END: Title". Descriptive tags were chosen because SGML tags are often hard to decipher. For instance, while the title tags (<title> and </title>) in the DTD are easily recognized, tags for other structures, such as inline equations (<f> and </f>), are cryptic.
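The correspondence between mnemonic tags and SGML tags can be sketched in a few lines of Python. This is an illustration only, not the OMNIMARK code used in the project. The "Title" pair and the <title> and <f> elements come from the text above; the mnemonic name "Equation" is an assumption made for the example.

```python
import re

# Mnemonic tag name -> SGML element name.
# "Title" -> "title" is taken from the text; "Equation" is a hypothetical
# mnemonic for the <f> (inline equation) element.
TAG_TO_ELEMENT = {"Title": "title", "Equation": "f"}

_NAMES = "|".join(re.escape(name) for name in TAG_TO_ELEMENT)
_END_TAG = re.compile(r"\s*END: (%s)\b" % _NAMES)
_START_TAG = re.compile(r"\b(%s):\s*" % _NAMES)


def to_sgml(text):
    """Replace mnemonic start/end tags with SGML start/end tags."""
    # End tags are handled first so that only start tags remain afterwards.
    text = _END_TAG.sub(lambda m: "</%s>" % TAG_TO_ELEMENT[m.group(1)], text)
    text = _START_TAG.sub(lambda m: "<%s>" % TAG_TO_ELEMENT[m.group(1)], text)
    return text


print(to_sgml("Title: Converting Articles to SGML END: Title"))
# <title>Converting Articles to SGML</title>
```

A plain textual substitution like this would also rewrite an ordinary sentence that happens to contain "Title:", which is one reason a real converter has to take the tag's context into account.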

The STEPS document conversion system, as customized for ACM, defines 26 tags to indicate the major logical structures of articles. While many more structures in a typical ACM journal article must be tagged via SGML, the list was trimmed to avoid burdening the authors with tagging each element. Therefore the design goal of the system is to convert each article so that it complies with a subset of the DTD that includes only those 26 SGML elements. All other elements are tagged by ACM conversion specialists using the SGML editing environment component.

Authors tag the 26 logical structures using a set of macros, one per structure, that have been designed and developed for each word processor/operating environment. Generally, the author selects the text to be tagged and then chooses the tag from a menu of macros. In LaTeX, the author keys the macro directly into the text, since LaTeX is an ASCII text-based system that depends on manual insertion of LaTeX commands.

Converting to SGML

Once an article has been tagged by the author, it is converted from its native word-processing format to Rich Text Format (RTF), a generic format that allows for export and import across word-processing platforms. In most cases, this conversion is done by the native word processor. The exception is LaTeX documents, which are not created in a word processor but are tagged manually by the authors in text editors. LaTeX documents are converted to RTF by a program written in OMNIMARK.

Besides LaTeX, the system supports WordPerfect, MS Word, and FrameMaker. The supported operating environments are DOS/Windows, Macintosh, OS/2, and Unix. Not all text processors are supported in all operating environments (Table 1). While other word processors can produce RTF, ACM chose to limit word processor support to these three.

Table 1 Supported Text Processors and Environments

              MS DOS   Windows   Macintosh   UNIX   OS/2

WordPerfect   YES      YES       YES         YES    YES
MS Word       YES      YES       YES         NO     YES
FrameMaker 4  NO       YES       YES         YES    YES
LaTeX         YES      YES       YES         YES    YES
ASCII         YES      YES       YES         YES    YES
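A conversion front end could consult this matrix before accepting a submission. The following Python sketch encodes Table 1 as data; the function name is hypothetical, and only the two unsupported combinations need to be listed explicitly.

```python
PROCESSORS = ("WordPerfect", "MS Word", "FrameMaker 4", "LaTeX", "ASCII")
ENVIRONMENTS = ("MS DOS", "Windows", "Macintosh", "UNIX", "OS/2")

# The only NO entries in Table 1.
UNSUPPORTED = {("MS Word", "UNIX"), ("FrameMaker 4", "MS DOS")}


def is_supported(processor, environment):
    """Return True if Table 1 lists the combination as supported."""
    if processor not in PROCESSORS:
        raise ValueError("unknown text processor: %r" % processor)
    if environment not in ENVIRONMENTS:
        raise ValueError("unknown environment: %r" % environment)
    return (processor, environment) not in UNSUPPORTED


print(is_supported("MS Word", "UNIX"))       # False
print(is_supported("FrameMaker 4", "OS/2"))  # True
```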

Because text processors do not all generate RTF alike, a tailored RTF-to-SGML conversion is needed to handle each processor's idiosyncrasies. In theory, since RTF is a generic interchange markup, there should be no such differences, but there are. Another reason for limiting the number of supported text processor/operating environments is that a set of tag-generating macros must be developed and then maintained for each one. ACM therefore chose to limit its exposure to the escalating costs and complexity associated with additional environments.

Once an article is in RTF format, it is processed using two OMNIMARK programs. The first program clears up the idiosyncratic aspects of the RTF file introduced by the original text processing environment. The second program converts the generic RTF to SGML. The document is then ready for insertion into the SGML database. However, since the authors only mark 26 different elements, it is not fully tagged. Manual editing finishes the tagging and corrects errors introduced during earlier processing, whether by the author's tagging or by the OMNIMARK programs' interpretation of those tags.
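The two-stage design can be sketched as follows. This is a minimal Python illustration and not the project's OMNIMARK code. The cleanup rule, the idea that a processor emits doubled \par control words, and the <p> element name are all assumptions made for the example. The structural point is that processor-specific knowledge lives only in the first stage, so the second stage can be written once.

```python
import re

# Stage 1: per-processor cleanup rules as (pattern, replacement) pairs.
# The rule shown is a hypothetical quirk, not a documented one.
CLEANUP_RULES = {
    "MS Word": [(re.compile(r"(\\par\s*){2,}"), r"\\par ")],
}


def normalize_rtf(rtf, processor):
    """Remove processor-specific quirks so stage 2 sees generic RTF."""
    for pattern, replacement in CLEANUP_RULES.get(processor, []):
        rtf = pattern.sub(replacement, rtf)
    return rtf


def rtf_to_sgml(rtf):
    """Toy stage 2: wrap each \\par-terminated run in a <p> element."""
    runs = rtf.split("\\par")
    if runs and not runs[-1].strip():
        runs.pop()  # nothing follows the final \par
    return "\n".join("<p>%s</p>" % run.strip() for run in runs)


def convert(rtf, processor):
    """Run both stages in order."""
    return rtf_to_sgml(normalize_rtf(rtf, processor))


sample = "First paragraph\\par\\par Second paragraph\\par"
print(convert(sample, "MS Word"))
```

With the cleanup rule applied, the sample yields two <p> elements. If the same input is processed with no rule registered for its processor, the toy second stage emits an empty <p></p> between them, which is the kind of defect the first stage exists to prevent.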

Future Work

Converting ACM journal articles to SGML from known presentation-level formats that include author-inserted tags for the major logical structures, such as title, author, sections, and paragraphs, is relatively simple using existing computer-based capabilities such as RTF and OMNIMARK. The presence of the author-inserted tags alone guarantees a high degree of success in the conversion process. However, in the majority of existing repositories, the documents are not consciously or consistently pretagged in any way that is directly related to a desirable logical structure.

The goal of future Office of Research efforts in conversion applications will be to go to the next level: to address the larger problem of converting generically tagged presentation documents to some minimal form of logical structure with SGML. OCLC is committed to using SGML-compliant documents within its information delivery systems, such as Electronic Journals Online, and to conforming to the standards emerging for documents accessible over the Internet and the World Wide Web. That commitment makes such research a necessity.

Project Staff: Paul Corrigan, Manager, OCLC-ACM Project; Thomas Dehn, Consulting Systems Analyst, Research & Special Projects; Mary Faure, Technical Writer/Editor, Documentation; Kevin Flash, Manager, Electronic Publishing Solutions Section; Thomas Fought, Senior Systems Analyst, Research & Special Projects; Gayne Gunderson, Certified Quality Analyst, Quality Assurance; Christine Mrockowski, Applications Analyst, IDI; Joanne Murphy, User Documentation Specialist, Documentation