[Cache from http://www.ch.ic.ac.uk/cml/; please use this canonical URL/source if possible. More current information from http://www.xml-cml.org.]
CML - Chemical Markup Language
[a] Glaxo Research & Development, Stevenage, Herts., UK,
[b] Department of Chemistry, Imperial College, London, SW7 2AY
Abstract 40. Presented as a poster at the 210th ACS Meeting
in Chicago on 21 August, 1995
http://www.ch.ic.ac.uk/cml/
The contents of this poster
'Chemical Markup Language' represents a collaborative approach to tackling
some of the problems of the interchange of chemical information over the
Internet and other networks. This poster is the first public announcement
of this initiative, but a lot of work has already been done and so there
are links to several distributed resources. It's arranged as a hyperdocument,
so be prepared for the different styles! Some of these resources
change daily so as you view this poster in the coming days you will see
constant change 'underneath' this top page.
CML is based on the ideas and success of HTML and the WWW and might be thought
of as 'HTML with some chemistry added'. This is a useful starting point,
and recently (1995/8/20) we got approval to use HTML 2.0 as a part of CML.
CML will, however, need its own renderers (browsers) and so we cannot yet
show you this part of CML. (One approach may be to use standard browsers
with helper applications).
There are, however, major differences from HTML in that CML requires
strict adherence to SGML. (HTML 'ignores' unknown tags and bad syntax;
CML will object.) CML relies heavily on the use of carefully structured
glossaries rather than adding semantics to the language itself. This allows
great flexibility in that the chemical community can add new terms or even
glossaries without having to revise the language or modify the software.
Special interest groups can devise glossaries without having to seek
'permission' from the CML project.
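The contrast between lenient and strict handling, and the role of an extensible glossary, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (`mol`, `item`), the `term` attribute and the glossary entries are hypothetical and are not taken from the CML dtd, and Python's `html.parser` is a tolerant HTML tokenizer standing in for a real SGML parser.

```python
from html.parser import HTMLParser

# Hypothetical tag vocabulary; the real CML dtd is not reproduced here.
KNOWN_TAGS = {"mol", "item"}


class MarkupError(ValueError):
    """Raised when a tag or glossary term is not recognised."""


class Checker(HTMLParser):
    def __init__(self, glossary, strict=True):
        super().__init__()
        self.glossary = glossary  # term -> definition, extensible by anyone
        self.strict = strict
        self.terms = []           # glossary terms seen in the document
        self.ignored = []         # unknown tags skipped in lenient mode

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            if self.strict:
                # Strict behaviour: object to the unknown tag.
                raise MarkupError(f"unknown tag <{tag}>")
            # Lenient, HTML-like behaviour: silently skip it.
            self.ignored.append(tag)
            return
        term = dict(attrs).get("term")
        if term is not None:
            # Meaning comes from the glossary, not from the language itself.
            if term not in self.glossary:
                raise MarkupError(f"term {term!r} is not in the glossary")
            self.terms.append(term)


def check(text, glossary, strict=True):
    """Return (terms, ignored_tags); raise MarkupError on a violation."""
    checker = Checker(glossary, strict)
    checker.feed(text)
    checker.close()
    return checker.terms, checker.ignored
```

Adding a new term in this sketch means adding a key to the glossary dictionary; neither `KNOWN_TAGS` nor the checker has to change, which is the flexibility described above.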
CML has been designed to cater for a wider range of structured datatypes
than HTML (which deals primarily with hypertext: text, image, movie, audio).
CML allows definition of:
- Numeric and string data in scalar, array, matrix or tabular form.
The glossaries may contain run-time code (e.g. in tcl, perl) for validation
or transformation of such data items.
- Molecular information (with sufficient power to represent
the common molecular formats chosen in chemical MIME). This includes
stoichiometry, connexion tables, crystallography, symmetry, chirality,
atom and bond types, etc. Moreover it can be easily extended through the
glossary mechanism to cater for an indefinite number of ATOM and BOND
attributes. Provision is also made for multiple observations of the same
molecule (e.g. conformational analysis, molecular dynamics, NMR ensembles).
- Scientific units are given special attention with a separate glossary.
This will allow the development of code for automatic conversion.
- Certain specified files (e.g. chemical MIME types) can be included in
CML files. Thus a CML file for a protein could contain PDB and SWISS-PROT
files with accompanying hypertext.
- NOTE: at present CML does not explicitly provide for chemical
reactions and protein sequences. However we are optimistic that rapid
progress will be made with them after the release of CML V1.0. We also
have not catered for generic molecular structures (substructures, Markush
types, combichem, etc.) and will invite discussion :-).
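As an illustration of the automatic conversion that a separate units glossary could make possible, here is a minimal Python sketch. The glossary layout (unit name mapped to a dimension and a factor to a reference unit) and the function name `convert` are assumptions made for this example; they are not part of CML.

```python
# Hypothetical units glossary: unit name -> (dimension, factor to reference unit).
# Reference units here are metres for length and J/mol for molar energy.
UNITS = {
    "m": ("length", 1.0),
    "nm": ("length", 1e-9),
    "angstrom": ("length", 1e-10),
    "kJ/mol": ("molar energy", 1000.0),
    "kcal/mol": ("molar energy", 4184.0),  # thermochemical calorie
}


def convert(value, from_unit, to_unit, glossary=UNITS):
    """Convert a scalar between two units of the same dimension."""
    from_dim, from_factor = glossary[from_unit]
    to_dim, to_factor = glossary[to_unit]
    if from_dim != to_dim:
        raise ValueError(f"cannot convert {from_dim} to {to_dim}")
    return value * from_factor / to_factor


def convert_all(values, from_unit, to_unit, glossary=UNITS):
    """Apply the same conversion to every element of an array-like."""
    return [convert(v, from_unit, to_unit, glossary) for v in values]
```

Because the conversion code reads everything it needs from the glossary, a new unit can be supported by adding one entry rather than by changing the code.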
CML is still evolving rapidly and we are offering it as a project in which
the community can join. Ideas are welcome, but contributions such as
tools, glossary creation, program interfaces, viewers, etc. are even more
valuable and we'd like to hear such offers. We propose to form an electronic
project to develop CML in a rapid fashion and offers of financial or material
assistance would be very welcome. The project will be open and not aligned
to any particular interest groups.
Some starting points
Introduction:
The last five years or so have seen the introduction of a number of so-called
self-defining structured file descriptors for chemical content, including the CIF
format designed for use in the area of crystallography, and a subsequent generalisation
(MIF), and the recently introduced CXF format from Chemical Abstracts.
These in turn are based on the more general STAR and the ASN.1 symbolic notations.
Recent mechanisms of electronic delivery based on the
World-Wide Web system are however derived from an alternative content
descriptor termed
Standard Generalized Markup Language
(SGML). Specific applications of SGML are derived from a "document type definition" or a "dtd". One such has been
specifically developed for use on the World-Wide Web, and in its current
proposed version is known as HTML 2.0. This provides for a range of
content types, including one tagged as <MATH> which allows for formal
definition of symbolic mathematics. Taking this dtd as our model, we now
propose an extension of this dtd
specifically in the area of theoretical chemistry,
which we propose to call CML or Chemical Markup Language. We envisage this
as the first part of a more ambitious project to eventually develop an
SGML based dtd with a much wider scope for defining chemical content.
In this report, we outline
the first self-consistent working version of CML and discuss some of the benefits
and applications that might flow from this concept.
Discussion and Implementation
A powerful reason for adopting an SGML based content definition is
the very substantial amount of software tools available for its
development, compared for example with the STAR format for which there
is a relative dearth of such tools. In addition, by using the current
definition of HTML 2.0, many generic types are available to us, as
well as programs such as Arena capable of rendering this content on
a computer screen. Our CML extension is available as
SGML experiment Collaborative HyperGlossary.
In order to demonstrate a "proof-of-concept", we have implemented this
dtd into a modified version of
the MOPAC 93 program, which can now both read and write its output
in formally structured CML.
Presently this system can receive MOPAC internal and cartesian co-ordinates
as well as the CML output from a previous calculation.
Calculations in this demonstration program have to be
limited to a small selection of MOPAC keywords and molecules with
6 or fewer heavy atoms and no more than 20 hydrogen atoms.
Errors are currently handled by MOPAC itself in the form of the original
output file rather than the CML file, on the assumption
that the final archival version will not contain errors relating to
e.g. geometry definition, geometry optimisation, etc.
MIME Considerations.
Presently the Content-type in the HTTP header is set
to chemical/x-cml and the expected filename extension is
.cml. Whilst the ultimate
intention is to produce a CML renderer, currently you should use
your WWW browser to display these files. The browser
will need to be configured
to understand this new MIME type, and should display the CML, tags and
all. A formal CML browser of course would interpret the tags, and display
only the content in appropriate form.
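For readers who script their own tools, the mapping between the .cml extension and the chemical/x-cml type can be registered with Python's standard `mimetypes` module. This is a small sketch only: the filename `peroxide.cml` and the helper `content_type_header` are hypothetical, and configuring an actual browser or server is done through its own settings.

```python
import mimetypes

# Register the extension and content type named in the text.
mimetypes.add_type("chemical/x-cml", ".cml")


def content_type_header(filename):
    """Build the HTTP Content-Type header line for a file name."""
    ctype, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return f"Content-Type: {ctype or 'application/octet-stream'}"


print(content_type_header("peroxide.cml"))
```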
CML Working Example
One of the powerful features of the World-Wide Web
system is that it can be used to deliver real working examples of
a new concept. Here, we have created a WWW form
which allows a small molecule to be defined and submitted to
our modified MOPAC-93 code. The output, appropriately marked up in
CML is returned to your browser (or ultimately to a CML browser when
available). To use this system, enter
a small molecule in the text area below. This can be achieved by:
- Creating a molecule file in a text editor
- Highlighting this file in the editor
- Copying the selection to the clipboard
- Pasting the information from the clipboard to the text area
- Alternatively, just type a small molecule into this area. An
example might be:
PM3
Hydrogen Peroxide
CML Example
H
O 1 1
O 1 1 111 1
H 1 1 111 1 111 1 3 2 1
Note the blank line after the last atom line! No more than 6 heavy atoms
(atoms other than hydrogen) are allowed in order to permit interactive
response to the submission.
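The size limits can be checked before a molecule is submitted. The Python sketch below assumes the layout of the example above (a keyword line, a title line, a comment line, then one atom per line ending at the first blank line); the function names are hypothetical and the sketch does not reproduce the checks performed by the modified MOPAC program itself.

```python
MAX_HEAVY = 6       # heavy-atom limit stated in the text
MAX_HYDROGEN = 20   # hydrogen limit stated in the text

EXAMPLE = """PM3
Hydrogen Peroxide
CML Example
H
O 1 1
O 1 1 111 1
H 1 1 111 1 111 1 3 2 1

"""


def parse_submission(text):
    """Split a submission into keywords, title, comment and element symbols."""
    lines = text.splitlines()
    if len(lines) < 4:
        raise ValueError("expected three header lines and at least one atom")
    keywords = lines[0].split()
    title, comment = lines[1].strip(), lines[2].strip()
    symbols = []
    for line in lines[3:]:
        if not line.strip():
            break  # the blank line marks the end of the atom list
        symbols.append(line.split()[0])
    return keywords, title, comment, symbols


def within_limits(symbols):
    """Return True if the molecule respects both size limits."""
    hydrogens = sum(1 for s in symbols if s.upper() == "H")
    heavy = len(symbols) - hydrogens
    return heavy <= MAX_HEAVY and hydrogens <= MAX_HYDROGEN
```

Calling `parse_submission(EXAMPLE)` yields the element symbols of the four atoms, which `within_limits` then accepts, since hydrogen peroxide has two heavy atoms and two hydrogens.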
A previously prepared
example of a CML output file is also available.
Applications of CML
We outline here a number of possible applications of CML. In general,
these could form components of, say, an electronic journal, in which the
text and image based discussion of a chemical theme could be supplemented
with "live" molecular information, coordinates, derived properties etc.
- Because of the structured form of CML, the distinction between input and
output formats for programs can be eliminated. Thus the output CML
of MOPAC can equally well be used as the input if desired, with
appropriate editing of keywords etc. In theory, MOPAC CML could be
equally appropriate as, say, Gaussian input.
- We view as a high priority the development of a
CML rendering program. At its simplest, such a program could extract
the atom types
and coordinates and display on the screen a rotatable image of the molecule.
More sophisticated versions could search for, say, eigenvector tags, and
compute and display molecular orbitals. Any property suitably tagged in the
CML definition could in principle be rendered by a suitable CML
viewer. If built on top of code capable of rendering the full HTML 2.0
dtd, then such a viewer would provide an integrated text, image and
chemical viewing system.
- Because the content of a CML
document is precisely defined semantically, high quality indexing can now
be performed on molecular properties, and not just simple keyword strings.
For example, one can envisage an indexing program such as Lycos indexing
against a specifically defined dtd such as CML, thus allowing much
richer structuring of chemical content. The lack of any mechanism to
index such chemical content on the World-Wide Web is currently one of
its most severe restrictions.
- The development of better chemical indexing methods would
in turn lead to the possibility of automated glossary constructions,
where hyperlinks
between related documents could be automatically inserted by suitable
programs or scripts.
- Because CML is a superset of HTML, there is a greater
possibility that existing editors (e.g. aswedit for editing
HTML 3.0) and browsers (e.g. Arena for viewing HTML 3.0) could be
adapted for use in this area. The introduction of object based
computer operating systems based on compound document architectures
such as OpenDoc and Taligent
would allow facile integration of new component parts into existing
systems.
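To make the indexing idea in the list above more concrete, here is a minimal Python sketch of an index built on a molecular property (elemental composition) rather than on keyword strings. The document names and the dictionary representation are hypothetical stand-ins for data that an indexing program would extract from tagged CML content.

```python
from collections import defaultdict

# Hypothetical input: document name -> elemental composition of its molecule.
DOCUMENTS = {
    "peroxide.cml": {"H": 2, "O": 2},
    "water.cml": {"H": 2, "O": 1},
    "methane.cml": {"C": 1, "H": 4},
}


def build_element_index(documents):
    """Map each element symbol to the set of documents that contain it."""
    index = defaultdict(set)
    for name, composition in documents.items():
        for element in composition:
            index[element].add(name)
    return index


def find_containing(index, *elements):
    """Return the sorted names of documents containing all given elements."""
    if not elements:
        return []
    matches = [index.get(element, set()) for element in elements]
    return sorted(set.intersection(*matches))
```

A query such as `find_containing(index, "C", "H")` is answered from the structured composition data, which a plain keyword search over the document text could not do reliably.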
The CML project
We are now looking for participants in the CML project. There is no
restriction but we may have to be selective and we are looking for those
with something to bring to the project. The present aim of the project is
to take the work done so far and build on it to provide a working
proof-of-concept. Its components might comprise:
- A generic CML browser. This would be capable of spawning tools or
helper applications and might also interface with HTML. Upload of
local files would be very valuable.
- Specific tools to render aspects of CML (2-D diagrams, 3-D structures,
numeric data).
- A WWW FORMs interface which generates CML files.
- Tools to index CML files and to manage sets (subset, merge, etc.)
- A CML-sensitive authoring tool (hopefully glossary-aware)
- Chemical perception software. CML provides at least one method
for representing a molecule, but it may not be your preferred one
(for example, aromatic rings may be Kekule structured). It will be useful
to have perception and conversion tools.
- CML-compliant input and output to some common programs or databases.
- Creation of subject area glossaries, with indexing terms.
Please use this form to register your interest in the CML project. Think
what you or your organisation might be able to contribute -
this could be something you already have!
Conclusion
This project is designed from the outset as a modular development,
with contributions from many diverse sectors of the molecular sciences
possible. In this, we also follow the HTML 2.0 model, which has
been developed jointly by many people, under the management of
an editorial team. Finally, we expect this will develop
as an open project for the benefit of the community as a whole.