The intention behind this work is to facilitate discussion of the requirements for a standardised XML format for taxonomic information by providing a basic example implementation. In this regard XDELTA should be seen as a 'proof of concept' project: such a format is likely to have additional requirements over and above a basic 1:1 translation of DELTA into XML.
This page is a work in progress. Please forward any comments to the author at the following address: ldodds@ingenta.com
A full description of the DELTA format can be found here.
The features of the DELTA system include
Discussions relating to the DELTA format and to this paper can be carried out on the DELTA mailing list.
XML is a text-based language and can be very easy to author by hand. There are also a lot of freely available tools for manipulating (parsing, reading, writing, translating) XML data. Because all XML formats share a basic common grammar, any XML parser can parse any XML document. For custom markup languages (e.g. CML or XDELTA), a Document Type Definition (DTD) can be defined which specifies exactly what information can be inserted in an XML document, and where in that document the information must be placed. A parser can then validate a document against its DTD to confirm that the document is valid and can therefore be processed successfully.
See the next section for additional resources on XML. The above is only a (very) brief introduction.
This freely available toolset, which includes parsers suitable for use in any programming language, means that there is no need to rely on custom parsing routines to process any XML data. Writing parser software can be very time-consuming and difficult, particularly if the data format is poorly defined or includes non-standard extensions which may or may not be shared by all applications. Relying on this shared toolset means that XML based applications gain the benefits of very robust parser technology.
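As a minimal sketch of what relying on a shared parser looks like, the following uses the `xml.etree.ElementTree` module from the Python standard library. The element names are borrowed from the XDELTA DTD fragment quoted later on this page; the content and the helper names `child_names` and `is_well_formed` are invented for illustration. Note that this parser only checks well-formedness; validating against a DTD needs a validating parser, which this module is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical XDELTA-like fragment: element names follow the DTD fragment
# on this page, the text content is invented for illustration.
DOC = """<character>
  <description>wing colour</description>
  <comment>upper surface only</comment>
  <txt/>
</character>"""


def child_names(xml_text):
    """Parse with a generic XML parser and return (root tag, child tags)."""
    root = ET.fromstring(xml_text)
    return root.tag, [child.tag for child in root]


def is_well_formed(xml_text):
    """Return True if the text obeys the basic grammar shared by all XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(child_names(DOC))
# The closing tag for <description> is missing, so the parser rejects this.
print(is_well_formed("<character><description>oops</character>"))
```

No format-specific parsing code was written here: the same two functions would work unchanged on a CML document or any other XML format.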
Well defined and standardised programmatic interfaces to XML parsers (the SAX and DOM APIs) mean that development of XML applications can, once a DTD has been agreed upon, move quickly onto the core functionality rather than having to reinvent the 'parser wheel' every time the format changes slightly.
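The difference between the two interfaces can be sketched with the SAX and DOM implementations in the Python standard library (`xml.sax` and `xml.dom.minidom`). This is only an illustration: the sample document and the names `TagRecorder`, `sax_tags` and `dom_description` are assumptions, not part of XDELTA.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document using element names from the XDELTA DTD fragment.
DOC = "<character><description>wing colour</description><num/></character>"


class TagRecorder(xml.sax.ContentHandler):
    """SAX style: the parser calls us back for each event as it reads."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # Record every opening tag in document order.
        self.tags.append(name)


def sax_tags(xml_text):
    """Return the element names seen by a SAX handler."""
    handler = TagRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tags


def dom_description(xml_text):
    """DOM style: the parser builds a whole tree, which we then navigate."""
    dom = minidom.parseString(xml_text)
    node = dom.getElementsByTagName("description")[0]
    return node.firstChild.data


print(sax_tags(DOC))
print(dom_description(DOC))
```

In both cases the application code deals only with elements and text; tokenising and error handling are left to the parser.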
Extensibility, flexibility, openness
Defining an XML DTD involves (or should involve) a lot of analysis of the data which is to be used by the application(s). This analysis is then captured within the DTD, which helps to document the format. This means that XML is to a large extent self-documenting. Reading a DTD is not a difficult task, so it is easy to begin working with data, as well as authoring new XML documents.
Taking care over the analysis behind the DTD can yield great benefits in terms of the flexibility and extensibility of the format. Provided that the core structure of the DTD remains the same, additional elements and attributes can be added to the document type without invalidating data which conforms to an earlier version. This is far better than pigeon-holing data into comment blocks, for example: doing so overloads the meanings of the fields in the format and further complicates the parser routines.
For example this fragment from the XDELTA DTD might define a character:
<!ELEMENT character (description, comment?, (multi|num|txt))>
The above means that a character element (or tag) must have a description and an optional comment (signified by the ? operator), and must contain either multiple choice, numeric, or text data. Assume that this has been in use for several months when it is realised that the format needs to be able to hold additional notes about a character as well as image reference information. In the original DELTA format, to avoid breaking existing software, this information has to be held in comment blocks, which makes the meaning of such blocks more confusing both to the parser writer and to the author of a document. In XML, however, we can do the following:
<!ELEMENT character (description, comment?, (multi|num|txt), image?, notes?)>
We've added optional image and notes elements to the document type. By making them optional we've ensured that existing data which has been marked up with an earlier version of the DTD is still valid, but we have also greatly increased the expressive power of the document without having to overload the meaning of any particular element. The content of the image and/or notes element can be as complex as required.
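The backwards compatibility argument can be checked with a small sketch. A DTD content model is essentially a regular expression over child element names, so the two declarations above are hand-translated here into Python regular expressions. This is not a real DTD validator (a validating parser would do this job); the sample documents and the names `OLD_MODEL`, `NEW_MODEL` and `matches` are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child element name becomes a token
# followed by a space, '?' stays optional and '|' stays a choice.
OLD_MODEL = re.compile(r"description (comment )?(multi|num|txt) ")
NEW_MODEL = re.compile(
    r"description (comment )?(multi|num|txt) (image )?(notes )?"
)


def matches(model, xml_text):
    """Check the child sequence of a <character> element against a model."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return root.tag == "character" and model.fullmatch(children) is not None


# Hypothetical data: one document written for the earlier declaration,
# one that uses the new optional notes element.
OLD_DOC = "<character><description>wing colour</description><txt/></character>"
NEW_DOC = (
    "<character><description>wing colour</description><txt/>"
    "<notes>invented note text</notes></character>"
)

print(matches(OLD_MODEL, OLD_DOC), matches(NEW_MODEL, OLD_DOC))
print(matches(OLD_MODEL, NEW_DOC), matches(NEW_MODEL, NEW_DOC))
```

The older document satisfies both models, while the newer document satisfies only the extended one. That is the sense in which adding optional elements keeps existing data valid.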
In certain cases, however, backwards compatibility is not possible. We may have decided that all characters should have notes associated with them to help the user to make an identification. This presents the problem that existing data must be amended to include this new element. This is easily achieved, however, by carrying out transformations of the document structure. These transformations can be managed through the use of what are known as 'stylesheets'. A stylesheet defines the mapping between two different document types, with no requirement that the target document is actually XML data: it can as easily be plain text, HTML, another XML format, PDF, and so on. Here we see that the well-defined structure of XML documents provides a great deal of flexibility when it comes to processing the data that it captures.
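To give a feel for such a transformation, the sketch below upgrades old data by adding a notes element to every character that lacks one. It is written in Python with `xml.etree.ElementTree` purely as a stand-in: it is not a stylesheet, and a real conversion would more likely be expressed as one. The function name `add_missing_notes` and the placeholder text are assumptions.

```python
import xml.etree.ElementTree as ET


def add_missing_notes(xml_text, placeholder=""):
    """Return the document with a <notes> element appended to every
    <character> that has none (notes comes last in the content model)."""
    root = ET.fromstring(xml_text)
    # Copy the matches into a list so the tree is not modified mid-iteration.
    for character in list(root.iter("character")):
        if character.find("notes") is None:
            notes = ET.SubElement(character, "notes")
            notes.text = placeholder
    return ET.tostring(root, encoding="unicode")


# Hypothetical document written against the earlier declaration.
OLD_DOC = "<character><description>wing colour</description><txt/></character>"

upgraded = add_missing_notes(OLD_DOC, "to be written")
print(upgraded)
```

Running the function a second time on its own output changes nothing, so it can safely be applied to a mixture of old and already upgraded data.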
Re-evaluating the DELTA (and other) formats
Another benefit to be gained from analysing the description of taxonomic data in XML is that it provides the opportunity to revisit the DELTA format (as well as others such as NEXUS) and re-assess whether they are still meeting the requirements of the scientific community. Are there new requirements that need to be met? Are there old requirements which are no longer a priority? Are there outstanding problems that need to be resolved? The DELTA documentation has many references to future enhancements as well as obsolete constructs in the data. Starting with a new format gives the opportunity to clean out any 'dead wood' as well as ensuring that all processing and data capture practices are well standardised.
For example the DELTA standard states (page 5, Taxonomic Descriptions):
Note. Comments in item names were implemented before text characters, and often contained material, such as synonomy, which would now be better placed in text characters. These comments are now generally used for the authority, as in the example below. The interpretation of inner comments is currently not defined; they may be used in future extensions of the DELTA format.
Here we see that text characters are a more recent addition than comments, that some material in comments is better suited to text characters, and that comments are now generally used for authority references. This example is not meant to criticise the format or the standard; it simply highlights how a format evolves over time. A well-specified XML standard should minimise these types of problem.
XML is still a relatively young language, so although there are many freely available tools, some of them are still in early versions, with the obvious problems that this can cause. However, XML is also growing rapidly: new parsers, validators, authoring packages and a host of other utilities are constantly being developed.
XML isn't that far removed from HTML, so prior experience in this area can help to reduce the learning curve for the end-user of the format. The last few months have seen a surge in the number of XML authoring tools available.
Lack of current software
The DELTA format is well established and there is a good array of packages available for processing its data. This software is well-known and well-used - users are familiar with it, and many bugs will have already been eliminated. Moving to a new standard may well involve writing new applications, although hopefully incorporation of an XML parser into packages such as Intkey and NaviKey could avoid re-inventing the wheel.
Indeed, some software will be made obsolete by an XML standard. I am currently working on sample stylesheets which demonstrate how the data can be automatically converted to HTML for use on the Internet; this encroaches on some of the functionality of CONFOR, for example.
XML is not a 'magic bullet'
As with any newly hyped tool or language, XML can be wrongly viewed as the 'magic bullet' of software. Unfortunately this is not the case, and while it provides a great deal of benefits, these are only a portion of those available to a well-designed system. XML is not a substitute for design: it's a starting point. These messages from the XML-DEV mailing list reiterate this point. [1], [2], [3].
The DTD itself contains a lot of comments which go into more detail of its structure.
I see this version of the DTD as being mainly a discussion point. I accept that changes to the DTD may be required after it has been reviewed by users/maintainers of the DELTA format, and would encourage the discussion of what additional information could be captured by the format over and above that found in DELTA. The NEXUS format seems to hold additional information, and this is an obvious example of how the format can be expanded.
Go to the download section to grab a copy of the DTD.
As far as licensing is concerned, the Artistic License will probably suffice. At present treat XDELTA as 'free, but used at your own risk'.
The latest version of the XDELTA DTD can be found here (you may need to shift-click the link):
http://www.bath.ac.uk/~ccslrd/delta/xdelta.dtd
If you are not familiar with reading XML DTDs then I'll direct you to the section of XML pointers, or this quick 10 minute guide to reading a DTD.
Some style-sheets which demonstrate the manipulation of XDELTA documents can be found here:
An example XML data file containing data for the Lepidoptera is also available. This is derived from the example DELTA Lepidoptera data which is available here.
A zip file containing this page, the dtd, and the example stylesheets and data can also be downloaded.
In developing the XDELTA format, I've been using the XP Java parser, and the XT XSLT processor. You'll need the latest version of the XT processor to use the example stylesheets. The Java Development Kit can be downloaded from the Java Home Page. The RXP application is also useful for validating documents.