[Cache from http://www.ch.ic.ac.uk/cml/; please use this canonical URL/source if possible. More current information from http://www.xml-cml.org.]


CML - Chemical Markup Language

Peter Murray-Rust[a], Henry S. Rzepa[b] and Christopher Leach[b]

[a] Glaxo Research & Development, Stevenage, Herts., UK,
[b] Department of Chemistry, Imperial College, London, SW7 2AY
Abstract 40. Presented as a poster at the 210th ACS Meeting
in Chicago on 21 August, 1995

http://www.ch.ic.ac.uk/cml/

The contents of this poster

'Chemical Markup Language' represents a collaborative approach to tackling some of the problems of the interchange of chemical information over the Internet and other networks. This poster is the first public announcement of this initiative, but a lot of work has already been done and so there are links to several distributed resources. It's arranged as a hyperdocument, so be prepared for the different styles! Some of these resources change daily so as you view this poster in the coming days you will see constant change 'underneath' this top page.

CML is based on the ideas and success of HTML and the WWW and might be thought of as 'HTML with some chemistry added'. This is a useful starting point, and recently (1995/8/20) we got approval to use HTML 2.0 as a part of CML. CML will, however, need its own renderers (browsers) and so we cannot yet show you this part of CML. (One approach may be to use standard browsers with helper applications.)

There are, however, major differences from HTML in that CML requires strict adherence to SGML. (HTML 'ignores' unknown tags and bad syntax; CML will object.) CML relies heavily on the use of carefully structured glossaries rather than adding semantics to the language itself. This allows great flexibility in that the chemical community can add new terms or even glossaries without having to revise the language or modify the software. Special interest groups can devise glossaries without having to seek 'permission' from the CML project.
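The difference between ignoring and rejecting unknown tags can be sketched as follows. This is only an illustration: the tag names in `ALLOWED_TAGS` are hypothetical and are not taken from the CML dtd, and the checker assumes every element is explicitly closed, which a real SGML dtd need not require.

```python
from html.parser import HTMLParser

# Hypothetical tag vocabulary for illustration only; not the real CML dtd.
ALLOWED_TAGS = {"cml", "mol", "atom", "p"}


class StrictTagChecker(HTMLParser):
    """Rejects unknown tags and mismatched close tags instead of ignoring them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag <{tag}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if not self.open_tags or self.open_tags[-1] != tag:
            raise ValueError(f"unexpected close tag </{tag}>")
        self.open_tags.pop()


def check_strict(document: str) -> bool:
    """Return True if the document uses only known, properly nested tags."""
    checker = StrictTagChecker()
    checker.feed(document)
    checker.close()
    if checker.open_tags:
        raise ValueError(f"unclosed tag <{checker.open_tags[-1]}>")
    return True
```

A lenient HTML browser would silently skip a tag it does not know; here `check_strict("<cml><blink>x</blink></cml>")` raises `ValueError` instead.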

CML has been designed to cater for a wider range of structured datatypes than HTML (which deals primarily with hypertext: text, image, movie, audio). CML allows definition of:

CML is still evolving rapidly and we are offering it as a project in which the community can join. Ideas are welcome, but contributions such as tools, glossary creation, program interfaces, viewers, etc. are even more valuable and we'd like to hear such offers. We propose to form an electronic project to develop CML in a rapid fashion and offers of financial or material assistance would be very welcome. The project will be open and not aligned to any particular interest groups.

Some starting points

Introduction:

The last five years or so have seen the introduction of a number of so-called self-defining structured file descriptors for chemical content, including the CIF format designed for use in the area of crystallography, a subsequent generalisation (MIF), and the recently introduced CXF format from Chemical Abstracts. These in turn are based on the more general STAR and ASN.1 symbolic notations. Recent mechanisms of electronic delivery based on the World-Wide Web system are, however, derived from an alternative content descriptor termed Standard Generalized Markup Language (SGML). Specific applications of SGML are derived from a "document type definition" or "dtd". One such has been specifically developed for use on the World-Wide Web, and in its current proposed version is known as HTML 2.0. This provides for a range of content types, including one tagged as <MATH> which allows for formal definition of symbolic mathematics. Taking this dtd as our model, we now propose an extension of it specifically in the area of theoretical chemistry, which we propose to call CML or Chemical Markup Language. We envisage this as the first part of a more ambitious project to eventually develop an SGML-based dtd with a much wider scope for defining chemical content. In this report, we outline the first self-consistent working version of CML and discuss some of the benefits and applications that might flow from this concept.

Discussion and Implementation

A powerful reason for adopting an SGML-based content definition is the very substantial number of software tools available for its development, compared for example with the STAR format, for which there is a relative dearth of such tools. In addition, by using the current definition of HTML 2.0, many generic types are available to us, as well as programs such as Arena capable of rendering this content on a computer screen. Our CML extension is available as SGML experiment Collaborative HyperGlossary.

In order to demonstrate a "proof-of-concept", we have implemented this dtd in a modified version of the MOPAC 93 program, which can now both read and write its output in formally structured CML. Presently this system can receive MOPAC internal and cartesian co-ordinates as well as the CML output from a previous calculation. Calculations in this demonstration program have to be limited to a small selection of MOPAC keywords and to molecules with six or fewer heavy atoms and no more than 20 hydrogen atoms. Errors are currently handled by MOPAC itself in the form of the original output file rather than the CML file, on the assumption that the final archival version will not contain errors relating to e.g. geometry definition, geometry optimisation, etc.

MIME Considerations.

Presently the Content-type in the HTTP header is set to chemical/x-cml and the expected filename extension is .cml. Whilst the ultimate intention is to produce a CML renderer, currently you should use your WWW browser to display these files. The browser will need to be configured to understand this new MIME type, and should then display the CML, tags and all. A formal CML browser would of course interpret the tags and display only the content in an appropriate form.
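As a minimal sketch of the mapping described above, the Python standard library's `mimetypes` module can be taught the new type and extension. The helper name `content_type_for` and the fallback to `application/octet-stream` are assumptions made for illustration, not part of the CML project.

```python
import mimetypes

# Register the experimental type and extension named in the text.
mimetypes.add_type("chemical/x-cml", ".cml")


def content_type_for(filename: str) -> str:
    """Return a Content-type header value a server could send for filename."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown
    # (an assumption of this sketch).
    return guessed or "application/octet-stream"
```

With this registration in place, any file ending in .cml resolves to chemical/x-cml, which is the same association a browser or server would need in its own configuration.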

CML Working Example

One of the powerful features of the World-Wide Web system is that it can be used to deliver real working examples of a new concept. Here, we have created a WWW form which allows a small molecule to be defined and submitted to our modified MOPAC-93 code. The output, appropriately marked up in CML, is returned to your browser (or ultimately to a CML browser when available). To use this system, enter a small molecule in the text area below. This can be achieved by:
  1. creating a molecule file in a text editor
  2. highlighting this file in the editor
  3. copying the selection to the clipboard
  4. pasting the information from the clipboard into the text area
Alternatively, just type a small molecule into this area. An example might be:
    PM3
    Hydrogen Peroxide
    CML Example
    H
    O  1  1
    O  1  1  111  1
    H  1  1  111   1   111  1  3 2 1

Note the blank line after the last atom line! No more than 6 heavy (non-hydrogen) atoms are allowed, in order to permit an interactive response to the submission.
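The size limits of the demonstration (at most 6 heavy atoms and 20 hydrogens) could be checked before submission along the following lines. This is a sketch under stated assumptions: the input is taken to have three header lines (keywords, title, comment) followed by one atom per line with the element symbol first, ending at a blank line. The function names are hypothetical and are not part of MOPAC or of the submission form.

```python
from typing import Tuple

MAX_HEAVY_ATOMS = 6   # limit stated in the text
MAX_HYDROGENS = 20    # limit stated in the text


def count_atoms(mopac_input: str) -> Tuple[int, int]:
    """Return (heavy, hydrogen) counts for a MOPAC-style input.

    Assumes three header lines (keywords, title, comment) followed by one
    atom per line, terminated by a blank line or the end of the text.
    """
    heavy = hydrogen = 0
    for line in mopac_input.splitlines()[3:]:
        fields = line.split()
        if not fields:  # the blank line ends the geometry
            break
        if fields[0].upper() == "H":
            hydrogen += 1
        else:
            heavy += 1
    return heavy, hydrogen


def within_demo_limits(mopac_input: str) -> bool:
    """True if the molecule is small enough for the interactive demo."""
    heavy, hydrogen = count_atoms(mopac_input)
    return heavy <= MAX_HEAVY_ATOMS and hydrogen <= MAX_HYDROGENS


# The hydrogen peroxide example, one input line per line.
EXAMPLE = """PM3
Hydrogen Peroxide
CML Example
H
O  1  1
O  1  1  111  1
H  1  1  111   1   111  1  3 2 1

"""
```

For the hydrogen peroxide example, `count_atoms(EXAMPLE)` gives two heavy atoms and two hydrogens, so `within_demo_limits(EXAMPLE)` is true.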

A previously prepared example of a CML output file is also available.

Applications of CML

We outline here a number of possible applications of CML. In general, these could form components of, say, an electronic journal, in which the text and image based discussion of a chemical theme could be supplemented with "live" molecular information, coordinates, derived properties etc.
  1. Because of the structured form of CML, the distinction between input and output formats for programs can be eliminated. Thus the output CML of MOPAC can equally well be used as the input if desired, with appropriate editing of keywords etc. In theory, MOPAC CML could be equally appropriate as, say, Gaussian input.
  2. We view as a high priority the development of a CML rendering program. At its simplest, such a program could extract the atom types and coordinates and display on the screen a rotatable image of the molecule. More sophisticated versions could search for, say, eigenvector tags, and compute and display molecular orbitals. Any property suitably tagged in the CML definition could in principle be rendered by a suitable CML viewer. If built on top of code capable of rendering the full HTML 2.0 dtd, then such a viewer would provide an integrated text, image and chemical viewing system.
  3. Because the content of a CML document is precisely defined semantically, high quality indexing can now be performed on molecular properties, and not just simple keyword strings. For example, one can envisage an indexing program such as Lycos indexing against a specifically defined dtd such as CML, thus allowing much richer structuring of chemical content. The lack of any mechanism to index such chemical content on the World-Wide Web is currently one of its most severe restrictions.
  4. The development of better chemical indexing methods would in turn lead to the possibility of automated glossary construction, where hyperlinks between related documents could be automatically inserted by suitable programs or scripts.
  5. Because CML is a superset of HTML, there is a greater possibility that existing editors (e.g. asWedit for editing HTML 3.0) and browsers (e.g. Arena for viewing HTML 3.0) could be adapted for use in this area. The introduction of object based computer operating systems based on compound document architectures such as OpenDoc and Taligent would allow facile integration of new component parts into existing systems.
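The simplest renderer and indexing tasks listed above both start by pulling atom types and coordinates out of tagged text. The sketch below shows that step only. The `<atom ...>` markup is hypothetical and is not actual CML syntax, and the coordinates in `SAMPLE` are arbitrary placeholder values rather than a real geometry.

```python
import re
from collections import Counter

# Hypothetical markup for illustration; these are NOT actual CML tags,
# and the coordinates are arbitrary placeholders.
SAMPLE = """
<atom type="O" x="0.000" y="0.700" z="0.000">
<atom type="O" x="0.000" y="-0.700" z="0.000">
<atom type="H" x="0.900" y="0.900" z="0.000">
<atom type="H" x="-0.900" y="-0.900" z="0.000">
"""

ATOM_RE = re.compile(
    r'<atom\s+type="(\w+)"\s+x="([-\d.]+)"\s+y="([-\d.]+)"\s+z="([-\d.]+)"\s*>'
)


def extract_atoms(text):
    """Return a list of (element, (x, y, z)) tuples found in the text."""
    return [
        (el, (float(x), float(y), float(z)))
        for el, x, y, z in ATOM_RE.findall(text)
    ]


def element_counts(text):
    """Count atoms per element, a simple property an indexer could key on."""
    return Counter(el for el, _ in extract_atoms(text))
```

A viewer would pass the result of `extract_atoms` to its drawing code, while an indexer could store `element_counts` alongside the document so that searches can match on composition rather than on keyword strings.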

The CML project

We are now looking for participants in the CML project. There is no restriction but we may have to be selective and we are looking for those with something to bring to the project. The present aim of the project is to take the work done so far and build on it to provide a working proof-of-concept. Its components might comprise:

Please use this form to register your interest in the CML project. Think what you or your organisation might be able to contribute - this could be something you already have!

Conclusion

This project is designed from the outset as a modular development, open to contributions from many diverse sectors of the molecular sciences. In this, we also follow the HTML 2.0 model, which has been developed jointly by many people under the management of an editorial team. Finally, we expect this will develop as an open project for the benefit of the community as a whole.