TeX and SGML: A Recipe for Disaster?

Peter Flynn, University College, Cork

Abstract

The relationship between TeX and SGML has often been uneasy, with adherents to one system or the other displaying symptoms reminiscent of the religious wars popular between devotees of TeX and of wordprocessors. SGML and TeX can in fact coexist successfully, provided features of one system are not expected of the other. This paper presents one method of achieving such a cohabitation.

Introduction

For many years, SGML and its relationship with TeX has been a frequent topic of presentation and discussion. Those who read the TeXhax digest and the comp.text.tex Usenet newsgroup will be familiar with the sometimes extensive crosspostings to the sgml-l mailing list and the comp.text.sgml newsgroup.

Two extremes are apparent in the misunderstandings: that SGML is some kind of desktop publishing (DTP) system; and that TeX or LaTeX are exclusively for structured documentation. Such misunderstandings highlight how little information about the design of either system is available to the novice, but they also reveal the capabilities and limitations of both systems.

Indeed, understanding of both TeX and SGML is in a parlous state even in the printing and publishing industry, where one would expect greater sophistication: in this author's personal hearing, so-called experts from major publishing houses have criticised TeX's `lack of fonts' and SGML's `lack of font control'.

(It is perhaps worth emphasising the difference at this stage, for the non-expert: TeX is a typographic system principally for the creation of beautiful books [Knuth84], but also of other printed documents, being intended for putting marks on paper; SGML (Standard Generalised Markup Language, ISO 8879) [Goldfarb90] is the international standard for describing the structure of documents, intended for document storage and control, which could, of course, include typesetting as one of many possibilities.)

Publishing: the view from outside

A recent article [Beard93] quotes John Watson, London Editorial Director of Springer-Verlag:

We can use LaTeX files, which many of our authors of books or papers with complex maths find convenient, but if they need serious editing, it's so expensive we have to mark up hardcopy and send it back to the author to make the changes. TeX and LaTeX are only a stop-gap. SGML hasn't really reached our authors yet. What's really needed is a WYSIWYG system that's as universal as TeX, preferably in the public domain so all our authors and freelances can use it, and easy for subject specialists to edit on screen. And of course the output should be Linotron- as well as Postscript- compatible. [My italics]

This view of the world expresses an attitude common in the publishing field, that editing TeX is difficult, that the nature of TeX is impermanent, and that the only goal of all writing is for it to be printed on paper. While SGML has indeed `not reached our authors yet', that is hardly the fault of SGML, when editing systems for handling SGML are readily available for most platforms.

The speaker's desires are very laudable, however much one may agree or disagree with the implied benefits of WYSIWYG systems, in that the software should be universal, easy to use and in the public domain. The speaker's complaints, however, deserve further analysis.

Editing

The speaker seems here to be confusing two aspects of the technical editorial process: mathematics editing and editing text for production, both of which are indeed matters for the specialist, as those who use TeX in a professional pre-production capacity with publishers as clients have long recognised.

In the confusion, sight has been lost of the fact that editing a file of TeX source code need be no more of a problem than editing any other kind of file, and is probably less of a problem the better structured the text is. If the publisher's authors are unable or unwilling to adhere to the very straightforward guidelines put out by most publishers, it is a little ingenuous to blame TeX for their deficiencies.

There are large numbers of literate and numerate graduates with sometimes extensive TeX experience: if (as seems to be implied) editing may now be entrusted to authors, a publisher has little excuse for not employing some of these graduates on non-specialist editorial work. It is, however, as unnerving to hear publishers so anxious to encourage authors to undertake pre-press editing as it would be to hear them encourage non-mathematicians to undertake mathematical editing: it is precisely because the authors do not normally possess the specialist knowledge to do this that the work is handled by in-house or contract editors. The mechanics of editing a TeX document are not specially difficult, given proficiently-written macros, and there are some crafty editor programs around to assist this task. Training courses in elementary TeX abound, so if a publisher is serious about cutting pre-press costs by using TeX, the way lies open.

The typographic skill resides in implementing the layout: taking the typographer's specifications and turning them into TeX macros to do the job, ideally leaving the author and subsequent editor with as little trouble as possible to get in the way of the creative spirit. This implementation of design is increasingly being left to the author, who may understandably resent having to undertake what is usually seen as a task for the publisher, and who may be ill-equipped to perform this task, especially if a purely visual DTP system is being used.

Impermanence

TeX has been around for nearly 15 years, which is long enough for the mantle of impermanence to be shrugged off: there is no other system which can claim anywhere near that level of stability and robustness. However, the present writer would be among the first to disclaim any pretensions on the part of TeX to being the final solution to a publisher's problems (although properly implemented it has no difficulty in seeing off the competition). It is difficult, however, to see what TeX is supposed to be a stopgap for, because the logical conclusion drawn from the quotation above is that SGML is some kind of printing system, which it is not, although it can be used for that purpose in conjunction with something like TeX.

Printing as a goal

WYSIWYG TeX systems exist for both PC and Macintosh platforms, if a user feels compelled to see type springing into existence prematurely. There are also editors for SGML which are WYSIWYG, ranging from the simple to the sophisticated. The misconception seems to be that printing on paper is always going to be the goal of the writer and the publisher, but even if we accept this goal as the current requirement, there appears to be no reason why both TeX and SGML cannot be used together to achieve this.

The increasing importance being attached to hypertext systems, especially in academic publishing, is amply evidenced by the presentations at scholarly conferences, for example the recent meeting of the Association for Literary and Linguistic Computing and the Association for Computing and the Humanities [Flynn93]. While paper publication will perhaps always be with us, alternative methods are of increasing importance, and systems such as SGML are acknowledged as providing a suitable vehicle for the transfer and storage of documents requiring multiple presentations [Sperberg90].

Software Development

Before we leave this analysis, it is worth asking if publishers who are seeking an easy-to-use, widely-available, public-domain WYSIWYG structured editor would be prepared to back their demands with funding for the development of such a system. Organisations such as the Free Software Foundation are well-placed to support and coordinate such an effort, and there are ample human resources (and considerable motivation) in the research and academic environment to achieve the target.

Document Type Disasters

The newcomer to SGML is often perplexed by the apparent complexity of even simple Document Type Definitions (DTDs, which specify how a document is structured). Although there are several excellent SGML editors on the market, many users are still editing SGML in a plain file editor with perhaps the use of macro key assignments to speed the use of tags and entity references. Worse, getting the document printed in a typographic form for checking by proofreaders who are unfamiliar with SGML can be a daunting task without adequate software.

While we have said that such software is readily available, there are two inhibiting factors, cost and complexity. Although we are now beginning to see wordprocessor manufacturers take an interest in SGML (WordPerfect, for example), the impecunious researcher or student is still at a disadvantage, as WYSIWYG software for SGML is still expensive for an individual.

The problem of complexity is not easily solved: designing a document at the visual level of typography is already understood to be a specialist task in most cases, and designing a document structure, which is a purely conceptual task, without visual representation, is at a different level of abstraction. However, document structure design is not normally the province of the publisher's author, and should not affect the author's use of a structured-document editor, once the initial concept has been accepted.

Into print

The comp.text.sgml newsgroup repeatedly carries requests from intending users for details of available editing and printing software, which are usually answered rapidly with extensive details. The low level of SGML's public image (the `quiet revolution' [Rubinsky92]) indicates one possible reason why the system is still regarded with misgivings by some people.

There have been several attempts in the past to develop systems which would take an SGML instance and convert its text to a TeX or LaTeX file for printing. The earliest appears to have been Daphne, developed in the mid 1980s by the Deutsche Forschungsnetz in Berlin, and the most recent is gf from Gary Houston in New Zealand [Houston93] (available from the Darmstadt ftp server). Several other programs exist, including some written in TeX itself, but the principal stumbling-block seems to be the desire to make the program read and parse the DTD so that the instance can be interpreted and converted accordingly.

A DTD contains information principally about the structure of the documents which conform to it, rather than about its visual appearance. (It is of course perfectly possible to encode details on visual appearance in SGML, but this is more the province of the analyst or historian, who wishes to preserve for posterity the exact visual nature of a document.) The DTD is used to ensure conformance, often by an editor while the document is being written or modified, or by a parser (a program which checks the syntax and conformity of an instance to its DTD). Given the easy availability of various versions of a formal SGML parser (sgmls, from various ftp archives), there seems to be little point in embedding that process again in a formatter. Indeed, one conversion system reported to this author takes the route of using sgmls output as its input.

Through all these systems, however, runs the thread that somewhere in the SGML being used must reside all the typographical material needed to make the conversion to TeX (or indeed any typographical system) a one-shot process. As has been pointed out, this implies that the author or writer using SGML to create the document must embed all the necessary typographical data in the instance. Yet this is entirely the opposite of the natural use of SGML, which is to describe document structure or content, not its appearance. Predicating typographic matters ties the instance to one particular form of appearance, which may be wholly irrelevant.

Style and content

One of TeX's strongest features is that of the style file, a collection of macros to implement a particular layout or format. In particular, where this uses some form of standardised naming for the macros, as with LaTeX or eplain, the portability of the document is greatly enhanced. A single word changed in the documentstyle and the entire document can be re-typeset in an entirely different layout, with (usually) no further intervention.

The convergence of SGML and TeX for the purposes of typesetting brings two main advantages: the use of TeX's highly sophisticated typesetting engine and the formally parsed structure of the SGML instance. In such a union, those elements of the DTD which do have a visual implication would migrate to a macro file, in which specific coding for the visual appearance of the current edition could be inserted, and the SGML instance would migrate to a TeX or LaTeX file which would use these macros.

In this way, we would avoid entirely the predication of form within the SGML: it becomes irrelevant for the author to have to be concerned with the typographic minutiæ of how the publication will look in print (although obviously a temporary palliative can be provided in the form of a WYSIWYG editor). We also avoid tying the instance to any one particular layout, thus enabling the republication (or other reuse) in a different form at a later date with the minimum of effort.

The most undemanding form of conversion is thus one where the appearance is completely unreferenced in the SGML encoding. This means that the publisher (or typesetter) has all the hooks on which to hang a typographic implementation, but is not restricted or compelled to use any particular one of them.

A pilot program: sgml2tex

The author's own pilot attempt at this form of conversion can be seen in the SGML2TeX program, available by anonymous ftp from curia.ucc.ie in pub/tex/sgml2tex.zip. This was developed in PCL (a language written explicitly for high-speed development on the 80x86 chips); WEB should probably be the basis for a future version.

The program reads an SGML instance character by character, and converts all SGML tags into TeX-like control sequences, by removing the < and > delimiters and prepending `\start' or `\finish' to the tag name. Attributes are similarly treated, within the domain of the enclosing element, and with their value given in curly braces as a TeX macro argument. Entity references are converted to simple TeX control sequences of the same name.
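The conversion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PCL source of the pilot: it uses regular expressions in place of a character-by-character scan, it folds names to upper case (following the \startTT form used in this paper), and the function name convert and the naming of attribute macros (element name followed by attribute name) are hypothetical.

```python
import re

# A start-tag, an end-tag, or an entity reference in a fully tagged instance.
TOKEN = re.compile(
    r"<(?P<end>/?)(?P<name>[A-Za-z][A-Za-z0-9.-]*)(?P<attrs>[^>]*)>"
    r"|&(?P<entity>[A-Za-z][A-Za-z0-9.-]*);"
)
# name="value" pairs inside a start-tag (quoted values only).
ATTR = re.compile(r'([A-Za-z][A-Za-z0-9.-]*)\s*=\s*"([^"]*)"')


def convert(instance):
    """Rewrite SGML tags and entity references as TeX-like control sequences."""

    def replace(match):
        if match.group("entity"):
            # &bsol; becomes \bsol
            cs = "\\" + match.group("entity")
        else:
            name = match.group("name").upper()  # assumption: names are upper-cased
            if match.group("end"):
                cs = "\\finish" + name
            else:
                cs = "\\start" + name
                # Hypothetical naming: \<ELEMENT><ATTRIBUTE>{value}
                for attr, value in ATTR.findall(match.group("attrs")):
                    cs += "\\" + name + attr.upper() + "{" + value + "}"
        # A control word must be separated from a letter that follows it.
        following = instance[match.end():match.end() + 1]
        if following.isalpha() and not cs.endswith("}"):
            cs += " "
        return cs

    return TOKEN.sub(replace, instance)


print(convert("prepending `<tt>&bsol;start</tt>' or"))
```

Because TeX skips spaces after a control word, a production converter would also have to protect an inter-word space that follows a tag; the sketch makes no attempt at this, and it assumes element names consist of letters only, since TeX control words cannot contain digits.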

The output from the program is a .tex file and a .sty file. The .tex file contains an `\input' of the .sty file at the start, and also a `\bye' at the end; otherwise it is merely a representation of the instance in a form digestible by TeX or LaTeX. The .sty file contains a null definition of every element, attribute and entity encountered in the instance. Thus the fragment

     prepending `<tt>&bsol;start</tt>' or

becomes

     prepending `\startTT\bsol start\finishTT' or

in the .tex file, with the following definitions in the .sty file:

     \def\startTT{}
     \def\finishTT{}
     \def\bsol{}

All line-ends, multiple spaces and tabs in the instance are condensed to single space characters.
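
The production of the two output files can be sketched in Python in the same spirit. The fragment below is an illustration under assumptions rather than the pilot's own code: the helper names condense, null_definitions and wrap are hypothetical, the file name doc.sty is invented for the example, and the input is taken to be text whose tags and entity references have already been turned into control sequences.

```python
import re

# A backslash followed by letters, i.e. a TeX control word.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")


def condense(text):
    """Collapse line-ends, tabs and runs of spaces into single spaces."""
    return re.sub(r"\s+", " ", text)


def null_definitions(converted):
    """Return one empty \\def per distinct control word, in order of first use."""
    seen = []
    for name in CONTROL_WORD.findall(converted):
        if name not in seen:
            seen.append(name)
    return "".join("\\def\\" + name + "{}\n" for name in seen)


def wrap(converted, style_name):
    """Put an \\input of the style file before the text and \\bye after it."""
    return "\\input " + style_name + "\n" + condense(converted) + "\n\\bye\n"


converted = "prepending `\\startTT\\bsol start\\finishTT' or"
print(null_definitions(converted), end="")   # contents for the .sty file
print(wrap(converted, "doc.sty"), end="")    # contents for the .tex file
```

The definitions are written here in order of first appearance; since each is independent of the others, the order within the .sty file is immaterial.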

It must be made clear that this pilot is not a parser: it does not read any DTD and has no understanding of the SGML being processed, although a planned rudimentary configuration file will allow a small amount of control over the elimination of specific elements where no conversion is desired. There is also no capability yet for handling any degree of minimisation, so all markup must be complete and orthogonal (as many parsers and editors already have the capability to output such non-minimised SGML code, this should not cause any problems). As the DTD is not involved, the instance being converted must therefore also have passed the parsing stage: it is the user's responsibility to ensure that only validly-parsed instances are processed. Additionally, no attempt has been made to support scientific, mathematical or musical tagging, as this is outside the scope of the pilot.

As it stands, therefore, the output file is a valid TeX file, although trying to process it with null definitions in the .sty file would result in its being treated as a single gigantic paragraph. However, editing the .sty file enables arbitrarily complex formatting to be implemented: the present document is a simple example (available at http://curia.ucc.ie/info/TeX/aston/tug93.html).

Conclusion

The pilot program certainly is a stopgap, being severely limited: there are many other related areas where SGML design, editing, display and printing tools are still needed. There is still no portable and widespread public-domain dedicated SGML editor such as would encourage usage (although an SGML-sensitive modification for emacs exists and the interest of WordPerfect has been noted). Although SGML import is becoming available for some high-end DTP systems, migration and conversion tools are still at a formative stage.

One particular gap is highlighted by the need for a program to assist the user in building a DTD, with a graphical interface which would show the structure diagrammatically, so that permitted and prohibited constructs can be analysed, and a valid DTD generated.

SGML has now passed the phase of `new product' and is on its way to greater acceptance, but the real disaster would be for it to become an isolated system, unrelated to other efforts in computing technology. This will only be avoided by the concerted efforts of users and intending users in demanding software which can bridge the gaps.

Bibliography

  1. [Knuth84] Knuth DE, The TeXbook, Addison-Wesley, 1984
  2. [Goldfarb90] Goldfarb C, The SGML Handbook (ISO 8879), OUP, 1990
  3. [Beard93] Beard J, The art and craft of good science, in Personal Computer World, June 1993, p.350
  4. [Fyffe69] Fyffe C, Basic Copyfitting, Studio Vista, London, 1969, p.60
  5. [Sperberg90] Sperberg-McQueen CM & Burnard L (eds), Guidelines for the Encoding and Interchange of Machine-Readable Texts, Draft version 1.1, ACH/ACL/ALLC, Chicago & Oxford, 1990
  6. [Rubinsky92] Rubinsky Y, The Quiet Revolution, keynote speech, SGML'92 Conference, Danvers, Mass., October 1992
  7. [Houston93] Houston G, announcement of new version of gf, Article 19930604.065824.Houston in comp.text.sgml
  8. [Flynn93] Flynn P, Summary of ACH/ALLC Meeting, Georgetown DC, June 1993, in http://curia.ucc.ie/tlh/curia/doc/achallc.html