CML: Syntax and Semantics

An important property of SGML is that it clearly separates the syntax of information from the semantics. Since these may not be concepts you've come across, here's a simple introduction, and I hope that computer scientists and SGML experts will forgive imprecisions.

Some analogies might help. Syntax could represent the nuts and bolts that a car is built from, whilst semantics describes what you want to use it for. Syntax is the frequencies that TV or radio use, whilst semantics is the programming.

Syntax

Most legacy systems (existing chemical software) have a sizeable amount of code dealing with problems like:

Will it display on a Mac?
What happens to lines longer than 80 characters?
How do I pass a carriage-return?
Are there any non-printing characters in this whitespace?
Can I send it through a mailer?
How do I read a VAX file?

and, later:

Which part of the program do I pass this data to?
How many more lines are there in this file?
Has this file been transferred without corruption?
How do I read in a triangular matrix?
Can I strip the table headings of this output before I send it to another program?
Does the whitespace in this table represent a data item?
Am I expecting to read another table before I come to the bibliography?

These are all syntactic problems that are not related to molecules (or any other discipline), and SGML allows you to tackle them independently of what the information means and how it should be used.

I have configured CML so that you don't need to worry about some of these problems, and others (e.g. how control characters are passed) have well-established mechanisms in SGML. SGML allows you to determine the abstract structure of a document, for example: "first we have a FOO, which must contain one or more XYZZYs but no BARs. Then there is either a WIDGET or a PLUGH (but not both) and then no more than one FOO". Note that what a FOO is doesn't matter at a syntactic level.
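To make the FOO example concrete, here is a minimal sketch in Python (not SGML, and not how sgmls works internally). It treats a content model as a regular expression over the names of the child elements. The wrapper element DOC and the function name are assumptions made up for the illustration; nothing here knows what a FOO actually is.

```python
import re

# Hypothetical content models, written as regular expressions over child
# element names. Each name is followed by a space so names cannot run together.
CONTENT_MODELS = {
    # a FOO, then either a WIDGET or a PLUGH, then at most one more FOO
    "DOC": re.compile(r"FOO (WIDGET |PLUGH )(FOO )?"),
    # one or more XYZZYs and nothing else (so no BARs)
    "FOO": re.compile(r"(XYZZY )+"),
}


def children_are_valid(parent, children):
    """Return True if the sequence of child names fits the parent's model."""
    model = CONTENT_MODELS[parent]
    sequence = "".join(name + " " for name in children)
    return model.fullmatch(sequence) is not None


print(children_are_valid("DOC", ["FOO", "WIDGET"]))           # True
print(children_are_valid("DOC", ["FOO", "WIDGET", "PLUGH"]))  # False
print(children_are_valid("FOO", ["XYZZY", "BAR"]))            # False
```

The check is purely structural: renaming FOO to anything else changes nothing, which is the point of keeping syntax apart from semantics.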

Essentially this is all that SGML gives you - a validated abstract document structure, with validated character data. For example:
<PAIN QUANT=10>Danger!</PAIN>
could be part of a valid SGML document, but without defined semantics, what it means could depend upon the language spoken in the country :-).

The syntactic components of CML are:

The DTDs. These must never be altered without consent as they are used to validate the structure of the CML documents. Changes will almost certainly cause the parsing to abort.
A parser (we use sgmls). The parser takes an SGML document (e.g. 1ins.cml) and checks the syntax against the DTDs. If it's syntactically incorrect the parser gives a (partially intelligible) error message and aborts. If it's OK, the parser outputs a transformed document (1ins.esis). This is easier for a postprocessor to read, and the default values have been added. (The document has also been normalised, i.e. closing tags have been added where required.)
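As a rough illustration of why the parser output is easier for a postprocessor, here is a minimal Python sketch that reads line-oriented output of the kind sgmls produces: one record per line, with a leading character giving the record type ("A" for an attribute of the next element, "(" and ")" for start and end tags, "-" for character data, "C" for a conforming document). The sample lines are hand-written for this sketch and are not taken from a real 1ins.esis; the attribute type keyword in particular depends on the DTD and is an assumption, as is the function name.

```python
def read_esis(lines):
    """Turn ESIS-style lines into (event, name, payload) tuples.

    Escape sequences inside character data are left undecoded in this sketch.
    """
    events = []
    pending = {}        # attributes seen since the last start tag
    conforming = False
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # e.g. "QUANT TOKEN 10" -> name, declared-type keyword, value
            name, _, value = rest.partition(" ")
            kind, _, text = value.partition(" ")
            pending[name] = None if kind == "IMPLIED" else text
        elif code == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif code == ")":
            events.append(("end", rest, None))
        elif code == "-":
            events.append(("data", None, rest))
        elif code == "C":
            conforming = True
    return events, conforming


# Hand-written sample, not real parser output.
SAMPLE = ["AQUANT TOKEN 10", "(PAIN", "-Danger!", ")PAIN", "C"]

events, ok = read_esis(SAMPLE)
for event in events:
    print(event)
print("conforming:", ok)
```

Because every tag is explicit and every attribute is spelled out, the postprocessor needs neither the DTD nor any knowledge of minimisation.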

Important note. SGML allows for considerable 'minimisation' (i.e. components which can be unambiguously inferred by a parser can be omitted, either to save space or to make the document more readable). For example in HTML you can write:
<UL>
<LI>Item one
<LI>Item two
</UL>
There is no need for a closing </LI> on each line because it's possible to work out when the <LI> should be terminated. Note, however, that you need to know how to write an SGML parser to do this, and it's non-trivial. One way round this is for the software to read the output of the parser (the *.esis file); if necessary the later parsing software from J. Clark (SP, SPAM and nsgmls) can be used to produce normalised CML files. Note that you must have access to the DTDs if you are going to parse or normalise an SGML file.
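To show what "working out" the missing </LI> involves, here is a toy Python sketch for the flat list above. It uses Python's html.parser tokeniser, which is not an SGML parser, and the rule "a new <LI> or a closing </UL> ends the open <LI>" is hard-coded rather than read from a DTD. It ignores attributes and does not handle nested lists; a real normaliser gets such rules for every element from the DTD, which is why this is non-trivial in general. The class and function names are made up for the illustration.

```python
from html.parser import HTMLParser


class LiCloser(HTMLParser):
    """Re-emit a flat list with the omitted </LI> end tags written out."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.li_open = False

    def _close_li(self):
        if self.li_open:
            self.out.append("</LI>")
            self.li_open = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case; attributes are dropped.
        if tag == "li":
            self._close_li()          # a new item ends the previous one
            self.li_open = True
        self.out.append("<" + tag.upper() + ">")

    def handle_endtag(self, tag):
        if tag == "ul":
            self._close_li()          # the end of the list ends the last item
        elif tag == "li":
            self.li_open = False      # an explicit end tag was already present
        self.out.append("</" + tag.upper() + ">")

    def handle_data(self, data):
        self.out.append(data)


def normalise(text):
    parser = LiCloser()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(normalise("<UL>\n<LI>Item one\n<LI>Item two\n</UL>"))
```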

Semantics

Semantics can be added to a parsed CML document in several ways:

Humans reading the document (preferably with a browser like cmlcost).
Linking the terms in the document to a glossary, so that further meaning is added.
Inputting the document into a program which has been written or adapted to take CML input and to 'know' the meanings of the terms.
Transforming the document into some other semantically-rich format.

We haven't really started on CML semantics yet - there are many possibilities:

running cmlcost with links to a glossary of terms.
retrieval of glossary items which can be used for semantic validation ("Is this a reasonable value for this item?", "Are the values of FOO and BAR in this file compatible?").
transformation of values ("Our journal insists on SI units, so let's transform all Angstrom units".)
document architecture. ("Do all references point to items in the bibliography?" "Are any in journals we publish?").
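A few of these possibilities can be sketched in Python. This is only an illustration of the kind of check meant here, not existing CML software: the glossary layout, the term name bond_length, the plausible range and the function names are all assumptions. The one fixed fact used is that 1 Angstrom is 0.1 nm.

```python
ANGSTROM_IN_NM = 0.1  # 1 Angstrom = 1e-10 m = 0.1 nm

# Hypothetical glossary: term -> (unit, plausible minimum, plausible maximum).
# The range is illustrative only.
GLOSSARY = {"bond_length": ("nm", 0.05, 0.35)}


def to_si(value, unit):
    """Transform Angstrom values to nanometres; pass other units through."""
    if unit == "angstrom":
        return value * ANGSTROM_IN_NM, "nm"
    return value, unit


def is_reasonable(term, value, unit):
    """Semantic validation: is this a reasonable value for this item?"""
    value, unit = to_si(value, unit)
    expected_unit, low, high = GLOSSARY[term]
    return unit == expected_unit and low <= value <= high


def dangling_references(cited_ids, bibliography_ids):
    """Document architecture: which references point at nothing?"""
    return sorted(set(cited_ids) - set(bibliography_ids))


print(to_si(1.54, "angstrom"))
print(is_reasonable("bond_length", 1.54, "angstrom"))
print(is_reasonable("bond_length", 15.4, "angstrom"))
print(dangling_references(["r1", "r3"], ["r1", "r2"]))
```

None of these checks needs to look at tags or whitespace; they work on values that the syntactic layer has already validated and handed over.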

Much of this can be done with simple add-on software to cmlcost.
