CML: Syntax and Semantics

An important property of SGML is that it clearly separates the syntax of information from the semantics. Since these may not be concepts you've come across, here's a simple introduction, and I hope that computer scientists and SGML experts will forgive imprecisions.

Some analogies might help. Syntax could represent the nuts and bolts that a car is built from, whilst semantics describes what you want to use it for. Syntax is the frequencies that TV or radio use, whilst semantics is the programming.

Syntax

Most legacy systems (existing chemical software) have a sizeable amount of code dealing with problems like: and, later: These are all syntactic problems that are not related to molecules (or any other discipline), and SGML allows you to tackle them independently of what the information means and what it should be used for.

I have configured CML so that you don't need to worry about some of these problems, and others (e.g. how control characters are passed) have well-established mechanisms in SGML. SGML allows you to determine the abstract structure of a document, for example: "first we have a FOO, which must contain one or more XYZZYs but no BARs. Then there is either a WIDGET or a PLUGH (but not both) and then no more than one FOO". Note that what a FOO is doesn't matter at a syntactic level.
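To make the idea of an abstract structure concrete, here is a minimal Python sketch that checks a list of child element names against the rule quoted above. It is only an illustration: a real SGML parser reads such rules from the DTD, and the container name DOC, the dictionary CONTENT_MODELS and the function children_are_valid are hypothetical names invented for this example.

```python
import re

# Hypothetical content models, written as regular expressions over
# space-separated child element names. A real SGML parser would read
# these rules from the DTD. DOC is an assumed name for the enclosing
# element; FOO, XYZZY, WIDGET and PLUGH come from the example in the text.
CONTENT_MODELS = {
    # a FOO, then either a WIDGET or a PLUGH, then at most one more FOO
    "DOC": re.compile(r"FOO (WIDGET|PLUGH)( FOO)?"),
    # one or more XYZZYs and nothing else (so no BARs)
    "FOO": re.compile(r"XYZZY( XYZZY)*"),
}


def children_are_valid(parent, children):
    """Return True if the child names match the parent's content model."""
    model = CONTENT_MODELS[parent]
    return model.fullmatch(" ".join(children)) is not None


print(children_are_valid("DOC", ["FOO", "WIDGET"]))           # allowed
print(children_are_valid("DOC", ["FOO", "WIDGET", "PLUGH"]))  # both WIDGET and PLUGH
print(children_are_valid("FOO", ["XYZZY", "BAR"]))            # a BAR inside a FOO
```

Notice that the check never asks what a FOO is; it only looks at names and their order.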

Essentially this is all that SGML gives you - a validated abstract document structure, with validated character data. For example:
<PAIN QUANT=10>Danger!</PAIN>
could be part of a valid SGML document, but without defined semantics, what it means could depend upon the language spoken in the country :-).
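As a rough illustration of how little a purely syntactic reading tells you, the sketch below feeds the PAIN fragment to Python's standard html.parser module and records the events it reports. This is an assumption-laden stand-in: html.parser is a forgiving HTML tokeniser, not a validating SGML parser, and the names EventLogger and parse_events are hypothetical.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, character data and end tags as plain tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases names; upper-case them again for display.
        attributes = {name.upper(): value for name, value in attrs}
        self.events.append(("start", tag.upper(), attributes))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag.upper()))


def parse_events(text):
    """Return the list of syntactic events found in text."""
    logger = EventLogger()
    logger.feed(text)
    logger.close()
    return logger.events


for event in parse_events("<PAIN QUANT=10>Danger!</PAIN>"):
    print(event)
```

The result is an element name, an attribute with the value "10" and some character data. Nothing in these events says whether PAIN is a sensation or a loaf of bread; that is left to the semantics.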

The syntactic components of CML are:

Important note. SGML allows for considerable 'minimisation' (i.e. components which can be unambiguously inferred by a parser can be omitted, either to save space or to make the document more readable). For example in HTML you can write:
<UL>
<LI>Item one
<LI>Item two
</UL>
There is no need for a closing </LI> on each line because it's possible to work out when the <LI> should be terminated. Note, however, that you need to know how to write an SGML parser to do this, and it's non-trivial. One way round this is for the software to read the output of the parser (the *.esis file); if necessary the later parsing software from J. Clark (SP, SPAM and nsgmls) can be used to produce normalised CML files. Note that you must have access to the DTDs if you are going to parse or normalise an SGML file.
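To give a feel for what normalisation involves, here is a small Python sketch that restores the omitted </LI> end tags in the list above. It is deliberately naive and is not how SP or nsgmls work: the two inference rules are hard-coded for this one case instead of being derived from a DTD, attributes are dropped, and the names LiNormaliser and normalise are hypothetical.

```python
from html.parser import HTMLParser


class LiNormaliser(HTMLParser):
    """Re-emit markup, adding the </LI> end tags that were left out.

    Only two hard-coded rules are applied; a real SGML parser derives
    the equivalent rules from the DTD. Attributes are not preserved.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the elements that are currently open
        self.out = []    # pieces of the normalised output

    def _close(self):
        self.out.append("</%s>" % self.stack.pop().upper())

    def handle_starttag(self, tag, attrs):
        if tag == "li" and self.stack and self.stack[-1] == "li":
            self._close()  # rule 1: a new item ends the previous item
        self.stack.append(tag)
        self.out.append("<%s>" % tag.upper())

    def handle_endtag(self, tag):
        if tag != "li" and self.stack and self.stack[-1] == "li":
            self._close()  # rule 2: the end of the list ends the last item
        if self.stack and self.stack[-1] == tag:
            self._close()

    def handle_data(self, data):
        self.out.append(data)


def normalise(text):
    """Return text with the inferred </LI> end tags written out."""
    parser = LiNormaliser()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


print(normalise("<UL>\n<LI>Item one\n<LI>Item two\n</UL>\n"))
```

Even this toy needs to know that one LI cannot sit directly inside another and that a UL cannot close while an item is still open. In real SGML that knowledge lives in the DTD, which is why the DTD must be available when parsing or normalising.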

Semantics

Semantics can be added to a parsed CML document in several ways:

We haven't really started on CML semantics yet - there are many possibilities:

Much of this can be done with simple add-on software to cmlcost.

© Peter Murray-Rust
http://www.dl.ac.uk/CBMT/pmr.html