[Mirrored from: http://mish161.cern.ch/sc4wg6/math/pike.htm]

SGML and the Semantic Representation of Mathematics

Roy Pike, Clerk Maxwell Professor of Theoretical Physics, King's College, Strand, London, U.K.
Stephen Buswell, Stephen Healey, Martin Pike, Stilo Technology Ltd., Empire House, Mount Stuart Square, Cardiff, UK.
11 April 1996

Abstract

Currently, most mathematics DTDs in widespread use are presentation-based; that is, the markup relates to the layout of the mathematics on the page or screen rather than to the mathematical content. Such an approach makes interchange between different SGML applications, and between SGML applications and computational applications, very difficult. This paper proposes a semantics-based DTD for mathematics, and describes a mechanism for selecting the particular branch of mathematics in use and for extending the DTD to cover areas of mathematics not yet covered. Issues related to presentation, and the implications for applications, are discussed. Examples of possible mappings between the DTD and notations used by a typical computational program are given.

Since the publication of the ISO 8879 SGML standard in 1986 there has been interest in its possible use for describing mathematical formulae, and several DTDs have been put forward for this purpose: for example, ISO 9573 (adopted by CALS) [1], the AAP DTD [2], a DTD developed by the Euromath consortium [3] and one by Elsevier Science BV [4]. None of these so far, however, has grasped the nettle of providing an SGML document template, free from formatting instructions, which reflects the true semantics (meaning) of the mathematical content. In Annex A of ISO 8879 it is stated that

"Markup should describe a document's structure and other attributes
rather than specify processing to be performed on it, as descriptive markup
need be done only once and will suffice for all future processing".

The potential benefits for mathematics of adhering strictly to this philosophy would be no less than those proclaimed for SGML in its other uses and, in fact, it is difficult to see any reason to use SGML for mathematics if it is written in a presentation-based rather than a semantically based form. The benefits of divorcing the author from the typesetting alone would be considerable for both the author and the publisher. Indeed the newer forms of the TeX [5] language, beginning with LaTeX [6] and now, in particular, REVTeX [7], have moved a long way in this direction.

We can almost see SGML in there trying to get out!

The "Mathematical Expressions" described below will be semantically and syntactically complete and may, therefore, be machine converted into input for mathematical manipulation systems such as Reduce, Maple, MathCad, Matlab and Mathematica for evaluation, plotting, manipulation, etc. In view of the ever increasing use of such applications, this would be of great interest to the community. The expressions may in principle also be used in programming languages such as C, PASCAL or FORTRAN, and such marked-up mathematics may also be converted for any typesetting system or to TeX. As a proof of concept, we have developed a prototype TeX converter which generates the TeX for simple formulae from the parsed SGML. Conversion to formats like Mathematica Full Form will be possible using the same techniques. A prototype converter for a Unix typesetting system is also in progress. We will show in Appendix A an example of a formula, marked up in SGML, converted to its TeX equivalent and also plotted out graphically using the "Mathematica" system for doing mathematics by computer [8].
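The idea of driving several output formats from one semantic form can be illustrated with a minimal sketch. It is not the prototype converter mentioned above: the tree representation, the function names (plus, power, sin) and the templates are all assumptions made for illustration, and no attention is paid to operator precedence or bracketing.

```python
# Hypothetical semantic tree: a leaf is a string, a function is (name, [operands]).
# The names "plus", "power" and "sin" are illustrative, not taken from the DTD.

TEX_TEMPLATES = {
    "plus": "{0} + {1}",
    "power": "{0}^{{{1}}}",
    "sin": r"\sin\left({0}\right)",
}

FULLFORM_HEADS = {"plus": "Plus", "power": "Power", "sin": "Sin"}


def to_tex(expr):
    """Render the tree as TeX by filling in one template per function."""
    if isinstance(expr, str):
        return expr
    name, operands = expr
    return TEX_TEMPLATES[name].format(*(to_tex(op) for op in operands))


def to_fullform(expr):
    """Render the same tree in a Mathematica Full Form style, Head[arg, ...]."""
    if isinstance(expr, str):
        return expr
    name, operands = expr
    inner = ", ".join(to_fullform(op) for op in operands)
    return FULLFORM_HEADS[name] + "[" + inner + "]"


# sin(x) + x^2, written once and converted twice.
expr = ("plus", [("sin", ["x"]), ("power", ["x", "2"])])
print(to_tex(expr))
print(to_fullform(expr))
```

Both converters walk the same tree; only the per-function table differs, which is the sense in which one semantic source can serve typesetting and computation alike.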

Debate, in fact, has raged long and hard about even the possibility of accomplishing this goal. In this paper we consider previous objections and show how they can be overcome. We put forward a DTD fragment which we feel meets the objective of providing a semantic representation of mathematics and which may also be adaptable for use in other scientific disciplines where similar problems arise. There is still a great deal to be done before this work could be incorporated into ISO 12083, and coexistence with other DTDs will be required for some time. However, we are optimistic that there are no fundamental problems between our present proposals and an acceptable long-term solution for the semantic markup of mathematics in SGML.

If this is the case, what then has delayed such a development for so long? We can find the main arguments in a paper by Poppelier et al. [9]. Let us first note that in this paper the authors suggest that the proliferation of DTDs for the same class of documents is a weakness of SGML as it is currently being used. If a semantically based DTD is to be developed for mathematics it will be crucial that the community rallies around a single version, since these problems will otherwise be multiplied many fold. Close coordination with developments of HTML in this direction must also be achieved.

One of the first statements in ref [9] that needs consideration is that mathematics is so vast that it may be impossible to design a single DTD that covers every kind of formula. The same point has been made by Soiffer [10], who says

"Computer algebra systems have hundreds to thousands of built-in functions.
Listing them all and supporting them in a renderer would be quite a chore and the list would change with each new version."

It is certainly true that the designer of a DTD for mathematics faces a fearsome champ de bataille and clearly needs a stout heart. The very fact, however, that the field has already been well trodden by the pioneers of computer algebra systems who, as we speak, are putting forward new releases of their products designed for use as typesetting systems (and who, unless the SGML community acts sufficiently quickly, will pre-empt the market with their own de facto standards), should give us the necessary encouragement.

Are we to be worried by the scale of the project? When one considers the latest version of Microsoft's word-processing package, for example, which sits in some 35 MBytes of one's disc space, we think not. Even several thousand of Neil Soiffer's built-in functions are a very small proportion of the number of words in the thesaurus alone of the word-processor.

Nevertheless, to help break the problem down to size we have divided mathematics up into sections which may be switched in and out of the DTD by using SGML's marked-section facility. For example, arithmetic is the first section and is always switched in. Most elementary mathematics will be covered by the functions provided in this section. Specialists working in different fields will be able to switch in other sections as appropriate. We have made a start by defining the following subfields as separate marked sections:

In no way, of course, do we intend this breakdown to be complete or definitive and we envisage committees of experts being recruited to complete this entity set as comprehensively as possible for release 1 of the DTD. In each sub-field it is intended that all the functions normally required will be available, although we describe an escape mechanism for new or missing functions below. The section %Other above is meant to indicate this further work to be done. The amount of effort which will be required to complete these lists will not be insignificant but should, in fact, be quite manageable. The eight-character limit for names is not convenient for mathematics and, since it will most probably be changed to 16 or 32 in the next upgrade of SGML, it has not been observed in this DTD. This will allow enough freedom to avoid duplication of element names between fields. We have also assumed that typing full math tags will not be necessary with any application sophisticated enough to tackle mathematics, so brevity has given way to clarity for the most part.

As an example, in Appendix C we have marked up a portion of a recent article [11] from Paul Ginsparg's Los Alamos/SISSA bulletin board in the field of many-body quantum mechanics. We show the document fragment itself in Appendix D. This required only some of the following declared Quantum Mechanics functions, together with some functions from the algebra, trigonometry, functional analysis and Sturm-Liouville theory sections:

[DTD fragment lost in this copy; only the closing marked-section delimiter ]]> survives.]

The DTD is targeted at the originator; conversion of legacy hard copy or even of TeX mathematics at publishing houses will not always be feasible except by specialists in the particular branch of mathematics concerned. Mathematics of any sort is a basic document constituent and will be contained in a required and repeatable "or" model group:

The key concept in this DTD is that of a "MathExpression" which may be a single #PCDATA "MathElement" (which may be labelled), or a function with MathElement parameters and arguments (operands), or a recursive sequence of such functions:

Thus each function may have other functions as its parameters and arguments but finally, at the lowest level, all functions, whether parameters or arguments themselves, will have parameters and arguments which will be MathElements.
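The recursion described here can be sketched as a small data structure. This is an illustration only, under the assumption that a MathExpression is either a leaf (a MathElement) or a function whose operands are themselves MathExpressions; the class names mirror the terms used in the text, while the function names "sqrt", "plus" and "times" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class MathElement:
    """Lowest-level item: character data, optionally labelled."""
    text: str
    label: str = ""


@dataclass
class Function:
    """A function whose parameters and arguments are MathExpressions."""
    name: str
    operands: List["MathExpression"] = field(default_factory=list)


MathExpression = Union[MathElement, Function]


def leaves(expr):
    """Collect the MathElements reached by descending through every function."""
    if isinstance(expr, MathElement):
        return [expr.text]
    found = []
    for operand in expr.operands:
        found.extend(leaves(operand))
    return found


# sqrt(a + b*c): three levels of functions, bottoming out in MathElements.
expr = Function("sqrt", [
    Function("plus", [
        MathElement("a"),
        Function("times", [MathElement("b"), MathElement("c")]),
    ]),
])
print(leaves(expr))
```

However deeply the functions nest, the traversal always terminates on MathElements, which is the property the paragraph above relies on.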

Each MathExpression is a single quantity which in principle is calculable. Adornments to functions, e.g. integral, sum and product signs, would be added by a formatting application but are irrelevant to the SGML document. Syntactical information specifying the numbers of parameters and arguments of each function, together with their agreed relative order of entry and meaning, is held in #IMPLIED CDATA "Syntax" attributes which may be consulted via an attribute window, a help file or an instruction manual by the originator or by others for reference. The model group for functions makes no distinction between parameters and arguments; this lies only in their prescribed order of entry. This approach results in what we will call an "SGML form", which contains only semantic information, for any MathExpression. Entry and formatting in any other form will be handled by the application and translated to the internal SGML form.

For example, logbasen will have a single mandatory parameter (the base) and a single argument (which could be another function). It can be agreed (for ISO 12083, say) that the parameter is specified as the first element in the model group and the argument second, and this will be stated for reference in its Syntax attribute. On the other hand, a quantum-mechanical commutator will have no parameters but two ordered arguments (either or both of which may again be functions).

Another example is a one-dimensional definite integral which has three parameters: the lower and upper limits and the integration variable, and a single argument as the integrand. In the SGML form these must be entered, say, in the above order. In this case the limit parameters as well as the argument may again be functions (the use of an equation for a limit will be a formatting choice). A final example is a fraction, which has two arguments, the numerator first, say, and the denominator second, each of which, of course, can again be functions; in this case (as in all others) an #IMPLIED "Layout" attribute may be overridden to specify the style of the fraction when printed, although this should not be necessary in a sufficiently intelligent application and could not be used in any case without defining layout options for each function to be rendered. In general, the author should not be concerned with such decisions.
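Since the distinction between parameters and arguments lies only in the agreed order of entry, an application needs little more than a table of roles per function. The following is a minimal sketch under that assumption; the role names and the table itself are hypothetical, and only the orderings follow the examples given in the text.

```python
# Hypothetical "Syntax" table: for each function, the agreed order of its
# parameters and arguments. Role names are assumptions for illustration.
SYNTAX = {
    "logbasen": ("base", "argument"),
    "commutator": ("first_argument", "second_argument"),
    "integral": ("lower_limit", "upper_limit", "variable", "integrand"),
    "fraction": ("numerator", "denominator"),
}


def label_operands(name, operands):
    """Pair each operand with its role, relying only on order of entry."""
    roles = SYNTAX[name]
    if len(operands) != len(roles):
        raise ValueError(
            f"{name} expects {len(roles)} operands, got {len(operands)}"
        )
    return dict(zip(roles, operands))


# A definite integral of f from 0 to 1 with respect to x.
print(label_operands("integral", ["0", "1", "x", "f"]))
```

An operand may itself be a nested function; the table only fixes how many operands there are and what each position means.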

The left- and right-hand sides of an equation are each MathExpressions and are separated by a verb. Verbs have been separated into an arithmetic set and a set used in functional analysis which may be switched in if required. Provision is made for continued equations, e.g. x = y = z ..., and equation arrays.
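A continued equation can be pictured as expressions alternating with verbs. The sketch below assumes that flat representation and a small, hypothetical verb table; it is not the content model of the DTD, and the expressions are plain strings for brevity.

```python
# Hypothetical arithmetic verbs and their TeX renderings.
VERB_TEX = {"equals": "=", "lessthan": "<", "approx": r"\approx"}


def render_equation(parts):
    """Render [expr, verb, expr, verb, expr, ...] as a single TeX line."""
    if len(parts) < 3 or len(parts) % 2 == 0:
        raise ValueError("expected expression, verb, expression, ...")
    out = [parts[0]]
    for verb, expr in zip(parts[1::2], parts[2::2]):
        out.append(VERB_TEX[verb])
        out.append(expr)
    return " ".join(out)


print(render_equation(["x", "equals", "y", "equals", "z"]))
```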

In addition to the problem of vastness, a second perceived major problem has been that of new notations. Soiffer commented further in ref [10]:

"More significantly, users can define their own new
functions, so the list can never be complete."

This worry is also expressed in ref [9], where the authors state:

"Another problem is that mathematics is by its nature extensible... Notations are changed or new notations are invented almost every day..."

We feel that this problem is not as serious as it might seem. What does one mean by a new function and why does that pose a problem? It is true that mathematicians are always defining new functions, for example, one might wish to define f(x) = x + 1, or f(x,y) = x + y. This type of new function is not really new and gives us no difficulty, even if the author wishes to define fred(x) = x+1. All these cases are covered by our arithmetic function where the first element is a #PCDATA function name and the following elements define the independent variables. If, instead of fred, the author wished his function to have some weird symbol, we still can allow him to use a character entity definition from some public set, or in the worst case, to use ISO 9573 to design his own glyph. The problem arises when the author wishes to define his new function in an unknown new format like writing fred backwards and upside down over the top of the variable. To cope with this type of new notation we have an element with a #REQUIRED CDATA Attribute which explains exactly what the new notation demands. Of course, if this facility is used it must be interpreted by hand wherever the document lands up and the author will not be very popular. If he must, however, he must and the escape route exists for these cases. We can take it that they will be sufficiently rare not to endanger the whole project.
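The case of an author-named function such as fred can be sketched as follows. The markup is a well-formed XML stand-in for SGML so that Python's standard parser can read it, and the element names UserFunction and M are assumptions for illustration; only the convention that the first element carries the function name and the rest carry the independent variables comes from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for fred(x, y): name first, then the variables.
MARKUP = "<UserFunction><M>fred</M><M>x</M><M>y</M></UserFunction>"


def to_fullform(markup):
    """Turn the generic function element into a Head[arg, ...] string."""
    root = ET.fromstring(markup)
    name, *variables = [child.text for child in root]
    return f"{name}[{', '.join(variables)}]"


print(to_fullform(MARKUP))
```

No new element is needed for fred: the name is data, so the same declaration serves any function the author cares to define in conventional notation.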

It seems that there has been a third stumbling block impeding the creation of a semantically based math DTD, an implicit problem that is difficult to bring into the light of day. This reaches right into the understanding of mathematical notation. It has gone by the name of "ambiguity". It is noted that, to take a simple example, xbar can denote a mean or a complex conjugate. This is said to pose a problem. Why? Because the originator or the expert in the field has not been considered as a player in the SGML mark-up or, if they have, it has been assumed that they themselves must type in xbar and the application will have to interpret its meaning. This, of course, is a real problem for legacy data and we agree with the authors of ref [9] that conversion of such data to semantic form is only possible by the originator or by an expert. However, the originator will certainly know whether he is calling for a mean or a complex conjugate and in SGML form he will be required to choose the appropriate element. The "problem" clearly then ceases to exist.

The plot thickens when we look further into the future, when screen-edited SGML math applications become available. These will allow input to be made in several different forms. For example, the non-minimised SGML form of the expression a + b is < M>a< M>b. There is no difficulty in principle in translating automatically between these two forms, and indeed the equivalent of this is done as a matter of course in the Mathematica package, where a + b is always translated into the internal "Full Form" Plus[a,b]. Thus, with such a future package, the minimised SGML form for the mean of x could be entered as xbar, and if the same application without warning translates xbar as < CompConj >x, something is clearly wrong with the application and not with the SGML!

It is with this type of reasoning that we argue that those who are concerned with ambiguous notation are like the engineer who worries about how he is going to do something before he has decided what he is going to do.
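The xbar example can be made concrete with a short sketch. It assumes two semantic elements, CompConj (named in the text) and Mean (a hypothetical name for the other reading), which happen to share a printed form but map to different computational functions; the tables below are illustrative only.

```python
# Both elements may be printed as x-bar, but they mean different things.
TEX = {
    "Mean": r"\bar{{{0}}}",
    "CompConj": r"\bar{{{0}}}",
}

# Mathematica-style targets: Mean[...] and Conjugate[...].
FULLFORM = {
    "Mean": "Mean[{0}]",
    "CompConj": "Conjugate[{0}]",
}


def render(element, operand, table):
    """Fill the operand into the template chosen by the semantic element."""
    return table[element].format(operand)


for element in ("Mean", "CompConj"):
    print(element, render(element, "x", TEX), render(element, "x", FULLFORM))
```

The printed forms coincide while the computational forms do not, so the ambiguity lives entirely in the presentation; once the originator has chosen the element, nothing remains for an application to guess.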

In summary we have presented a new approach to the use of SGML to mark up mathematics which, in contrast to previous efforts, is strictly semantically based. We have considered why this process has taken so long and have answered the usual counter-arguments. We have pointed out the advantages of this approach which, however, still needs further input and concurrence from the mathematics, SGML and HTML communities. In Appendix B we give the current form of our draft Math DTD fragment.

Acknowledgements

We gratefully acknowledge helpful discussions with and encouragement from Eric van Herwijnen, Nico Poppelier, Klaus Harbo, Chris Rowley, Conrad Wolfram, Neil Soiffer, Bob Kelly and a number of other members of the ISO/TC46/SC4/WG6 working group.