XML and the CIMI DTD

by Richard Light

Brief: A summary of the issues involved in making the CIMI DTD XML-compliant, a discussion of the gains and losses relative to the current CIMI DTD's full SGML functionality, and a proposal for the changes required to make the CIMI DTD XML-compliant.

The Issues

1. Inclusion exceptions

The major change required to make the CIMI Full Text DTD XML-conformant is the removal of all inclusion and exclusion exceptions. These define 'floating' elements which can occur at any point in the document's structure. The principal instances from CIMI's point of view are <topic>, <context> and <hot-spot>, but the TEI element types <index>, <milestone>, <pb> and <lb> are also inclusion exceptions.
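
To make the removal step concrete, here is a minimal Python sketch of stripping inclusion and exclusion exceptions from a single element declaration. The declaration for <text> is invented for illustration and is not copied from the CIMI or TEI DTD, and `split_exceptions` is a hypothetical helper name. The sketch assumes exceptions are written as `+(...)` or `-(...)` with no space after the sign.

```python
import re

# An exception is a '+' or '-' immediately followed by a parenthesised name group.
EXCEPTION = re.compile(r'\s*([+-])\(([^()]*)\)')

def split_exceptions(decl):
    """Return (declaration without exceptions, inclusions, exclusions)."""
    inclusions, exclusions = [], []
    for sign, names in EXCEPTION.findall(decl):
        target = inclusions if sign == '+' else exclusions
        target.extend(name.strip() for name in names.split('|'))
    return EXCEPTION.sub('', decl), inclusions, exclusions

# Hypothetical SGML declaration with three floating elements.
sgml_decl = '<!ELEMENT text - - (front?, body, back?) +(topic | context | hot-spot)>'
cleaned, included, excluded = split_exceptions(sgml_decl)
print(cleaned)
print(included)
```

The returned list of names is the useful by-product: each one is an element type that needs a new home once it can no longer float.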

In general, I suggest that these elements should be placed in the TEI Header, and should point to the elements to which they refer.

While this new approach uses a different technique to express the required concept, it is semantically equivalent to the current approach, so there is no loss of SGML functionality. The only drawback (which is the principal reason I didn't suggest this approach originally) is that users might find it hard to make the links from <context>s and <topic>s in the header to the correct elements within the document. It is the sort of thing which software ought to be able to help with, but there is no SGML editing tool that I am aware of which offers this facility.

Separate declarations for each element type

XML does not allow multiple element types to be declared in one go. This is a feature of the original TEI DTD, but the normalized version that CIMI uses already has a separate declaration for each element type. It does not, in any case, affect SGML functionality.

No minimization on element declarations

This means that the tag omission indicators (- and O) are not allowed in element declarations, since XML makes no use of this SGML feature.

This XML requirement means that the current DTD is not valid as it stands. However, all that is required is to remove the offending declarations. The result for SGML functionality - which is that start- and end-tag omission is not allowed - is a general XML feature rather than an issue for CIMI's specific DTD.
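
Removing the indicators is a mechanical edit. The following Python sketch shows one way it might be done, assuming one element type per declaration (as in the normalized DTD) and whitespace around each indicator. The sample declarations are illustrative only, and `strip_minimization` is a hypothetical name.

```python
import re

# Matches the two tag-omission indicators that follow the element name.
MINIMIZATION = re.compile(r'(<!ELEMENT\s+\S+)\s+[-Oo]\s+[-Oo](?=\s)')

def strip_minimization(decl):
    """Drop the start- and end-tag omission indicators from one declaration."""
    return MINIMIZATION.sub(r'\1', decl)

print(strip_minimization('<!ELEMENT p - O (#PCDATA)>'))
print(strip_minimization('<!ELEMENT lb - O EMPTY>'))
```

A declaration that already lacks the indicators passes through unchanged, so the function can be run over a DTD that has been partly converted.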

2. Mixed-content models

Mixed content models (where both data content and markup are allowed - for example within paragraphs) have to take a particular, simple, form in XML. As it happens, all of the low-level element types we introduced in the CIMI DTD already have the approved form. Only those which occur at a higher level (like <pGrp> for a group of paragraphs) share the TEI's "pernicious mixed content" model. This needs to be sorted out by TEI. Simplifying these mixed content models will slightly reduce the expressive power of the DTD, but at the same time it will remove problems caused by this pernicious mixed content that CIMI users have encountered in the past, such as obscure parsing errors.
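
The form XML accepts is either `(#PCDATA)` alone or `#PCDATA` followed by alternatives in a single starred group, for example `(#PCDATA | emph | name)*`. As a rough illustration, this Python sketch tests a content model against that form. It assumes parameter entities have already been expanded. The second model in the test data is made up to show a "pernicious" shape and is not the real <pGrp> declaration. `is_xml_mixed` is a hypothetical name.

```python
import re

# '(#PCDATA)' optionally starred, or '(#PCDATA | name | ...)*' with a mandatory star.
XML_MIXED = re.compile(
    r'^\(\s*#PCDATA\s*(?:\)\*?|(?:\|\s*[A-Za-z_][\w.\-]*\s*)+\)\*)$'
)

def is_xml_mixed(model):
    """True if a content model containing #PCDATA has the form XML permits."""
    return bool(XML_MIXED.match(model.strip()))

for model in ['(#PCDATA | emph | name)*', '(#PCDATA)', '(head?, (#PCDATA | p)+)']:
    print(model, is_xml_mixed(model))
```

Run over every declaration that mentions #PCDATA, a check like this would produce the list of element types whose models still need simplifying.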

3. The '&' connector

This connector (which means that all the elements mentioned must occur, but in any order) is not supported in XML. However, it is only used within three TEI element types in the CIMI DTD (<publicationStmt>, <cit> and <respStmt>). There will be no major loss of SGML functionality from changing these content models.

#CURRENT attribute values

The #CURRENT attribute value is not supported in XML. This is used within TEI for all of the <divN> element types. #CURRENT is like a #REQUIRED attribute value, except that so long as you put one in for the first occurrence of an element type, that value is copied forward to all subsequent instances of the element type. TEI will obviously have to decide whether to make this optional or mandatory for XML. If it becomes mandatory, the functionality will be the same as for SGML, but markers-up will have more work to do!
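
The copy-forward behaviour can be made explicit in a document before it is delivered as XML. The Python sketch below is one possible approach, using the standard library's ElementTree. The attribute name `type` and the sample document are assumptions for the purpose of the example, and `fill_current` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def fill_current(root, attr):
    """Make #CURRENT-style defaults explicit by copying values forward."""
    last_seen = {}  # element type -> most recently specified value
    for elem in root.iter():  # iter() walks the tree in document order
        if attr in elem.attrib:
            last_seen[elem.tag] = elem.attrib[attr]
        elif elem.tag in last_seen:
            elem.set(attr, last_seen[elem.tag])
    return root

doc = ET.fromstring(
    '<body>'
    '<div1 type="chapter"/><div1/>'
    '<div1 type="appendix"/><div1/>'
    '</body>'
)
fill_current(doc, "type")
print([d.get("type") for d in doc.iter("div1")])
# prints ['chapter', 'chapter', 'appendix', 'appendix']
```

Note that the sketch fills the attribute on any element type that has carried it earlier in the document. A real conversion would restrict this to the element types whose attribute is actually declared #CURRENT in the DTD.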

4. Comments within declarations

It will no longer be possible to put 'inline' comments (surrounded by -- ... --) inside <!ELEMENT and <!ATTLIST declarations. This will require a small change to the program that produces the normalized CIMI DTD. It does not affect SGML functionality.
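
The change to the normalizing program could be as small as the following Python sketch, which removes `-- ... --` comments from one declaration. It is an illustration, not the actual program, and `strip_inline_comments` is a hypothetical name. It assumes that no quoted literal inside the declaration contains a double hyphen.

```python
import re

# An inline comment runs from one '--' to the next '--' and may span lines.
INLINE_COMMENT = re.compile(r'\s*--.*?--', re.DOTALL)

def strip_inline_comments(decl):
    """Remove SGML inline comments from a single <!ELEMENT or <!ATTLIST declaration."""
    return INLINE_COMMENT.sub('', decl)

print(strip_inline_comments('<!ELEMENT topic - - (#PCDATA) -- subject keyword -->'))
```

The function should be applied only to the text of <!ELEMENT and <!ATTLIST declarations, since free-standing comments (<!-- ... -->) remain legal in XML and would be damaged by it.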

5. Miscellaneous changes

There are some more, minor changes that will affect the design of the TEI DTD, and thus the CIMI Full Text DTD. However, none of them affect SGML functionality, and none will require any action on our part.

6. Other changes implied by a switch to XML

  1. PUBLIC identifiers
    XML does not allow PUBLIC identifiers anywhere in a document unless they also include a SYSTEM identifier (which XML will interpret as a URL). This goes against CIMI's current strategy, which is to keep system-specific (and thus, changeable) information outside the documents themselves. For example, in Project CHIO the mapping of PUBLIC to SYSTEM identifiers is carried out in an SGML Open Catalog file at delivery time.
  2. Linking syntax
    The <xptr> and <xref> element types in the current DTD use TEI Extended Pointers to express links to any part of the same, or another, document. Although XML's XPointers are modelled on Extended Pointers, they have a different syntax (dictated by the need for these pointers to appear as fragment identifiers at the end of URLs). Also, some aspects of the TEI syntax have been left out of XML's XPointer spec. Thus, any <xptr> or <xref> links in existing documents would need to be re-phrased so as to conform to the XML version of the extended pointer syntax. There is also a slight loss of functionality, although I think it is extremely unlikely that anyone will actually have used the rather obscure features which are being dropped. The main area of loss is that XPointers are only designed to point into XML documents, so features which can be used to specify e.g. areas within images have been lost.
  3. Users' markup practices: Other rules
    When working within an XML framework, users will have to observe some additional rules, principally:


Propose recommendations for incorporating XML into the CIMI DTD, e.g. should the XML CIMI DTD be separate from the full DTD, can it be incorporated into the full DTD, or are there other options?

Implementation options

  1. Make the core CIMI DTD XML-compliant
    The first option is to make changes to the current DTD so that it is XML-compliant.

    I would advise against this option. It is far from clear to me that moving totally to XML would confer any practical benefits on users of the CIMI DTD. The changes that are required for basic XML conformance will entail a significant reworking of every document entered so far. For example, every <topic> and <context> element will need to be re-sited in the TEI Header.

    Also, the need to include SYSTEM identifiers removes much of the value for CIMI of the SGML approach, which aims to keep documents as future-proof as possible by using only PUBLIC identifiers for external entities. So, in this sense, users would actually be worse off with XML.

  2. Write an additional XML-compliant DTD (for delivery only of CIMI documents)
    This is the option I would recommend. In effect, the conversion from SGML to XML DTD is exactly as outlined above - the difference is that we consider the XML version to be a vehicle primarily for Web delivery of the SGML, not a replacement for the SGML DTD.
  3. Other options
    There are, in my view, no other options to consider. The CIMI full-text DTD has to be treated as a whole, and there is no way that parts of it, for example, can be made XML-compliant. [This is a separate issue from deciding how to include Dublin Core and other metadata schemes 'within' the CIMI DTD. I have commented on those issues in Briefing Paper 1.2.]
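
For the delivery-only route in option 2, the SYSTEM identifiers that XML requires could in principle be added during conversion, leaving the SGML masters with PUBLIC identifiers only. The Python sketch below illustrates the idea under several assumptions: the catalog uses only quoted `PUBLIC "public id" "system id"` entries, and the public identifier and URL shown are invented. `parse_catalog` and `add_system_ids` are hypothetical names.

```python
import re

def parse_catalog(text):
    """Read catalog entries of the form: PUBLIC "public id" "system id"."""
    return dict(re.findall(r'PUBLIC\s+"([^"]+)"\s+"([^"]+)"', text))

def add_system_ids(doc, catalog):
    """Append the catalogued system identifier after each bare PUBLIC identifier."""
    def repl(match):
        pubid = match.group(1)
        sysid = catalog.get(pubid)
        if sysid is None:
            return match.group(0)  # not in the catalog: leave untouched
        return f'PUBLIC "{pubid}" "{sysid}"'
    # The lookahead skips PUBLIC identifiers that already carry a system identifier.
    return re.sub(r'PUBLIC\s+"([^"]+)"(?!\s*")', repl, doc)

catalog = parse_catalog('PUBLIC "-//EXAMPLE//DTD Demo//EN" "http://example.org/demo.dtd"')
print(add_system_ids('<!DOCTYPE TEI.2 PUBLIC "-//EXAMPLE//DTD Demo//EN">', catalog))
```

Because the system-specific information is still drawn from the catalog at delivery time, the stored documents stay free of it, which is the property the current strategy is meant to protect.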


Identify those who have created XML-compliant DTDs relevant to the CIMI DTD and briefly describe how they did it; provide pointers to relevant sites for detailed information

Conversion of other relevant DTDs

I am not aware of any significant XML-compliant DTDs that are relevant to CIMI. Jon Bosak of Sun has done a conversion of the DocBook DTD to XML for an experimental version of Sun's AnswerBook project. To see this service (which generates HTML on the fly from the XML) operating in normal mode, point your Web browser at http://docs.sun.com. To see TOCs, or actual chunks of documentation in XML format, use one of the following:

  1. To get a chapter-level TOC of the entire contents of the server:
    http://docs.sun.com/ab2/@xmlToc
  2. To get a chapter-level TOC of the manuals in the alluser category:
    http://docs.sun.com/ab2/alluser/@xmlToc
  3. To get a chapter-level TOC of the Solaris Advanced User's Guide:
    http://docs.sun.com/ab2/alluser/ADVOSUG/@xmlToc
  4. To get a particular chapter from the manual (as listed in the TOC):
    http://docs.sun.com/ab2/alluser/ADVOSUG/@xmlChunk/1120

 

As part of the research for my book "Presenting XML" (published this week by Sams.net), I went through the process of converting the HTML 2.0 DTD to an XML-compliant form. See Chapter 12 for details. I have used that work as a checklist when listing the changes required for an XML version of the CIMI DTD.

TEI attitude to XML

I have discussed the issue of an XML version of the TEI framework with Lou Burnard, and he said that:

___________

Richard Light
26 September 1997


©1997 CIMI