[This local archive copy is from the official and canonical URL, http://www.personal.u-net.com/~sgml/xml-edi/edi-dtds.htm, 1999-02-05; please refer to the canonical source document if possible.]

The Role of Document Type Definitions in Electronic Data Interchange

Martin Bryan, The SGML Centre

This paper looks at the role document type definitions (DTDs) will play in the management of XML-based business-to-business electronic data interchange. It is designed to clarify a number of issues that seem to be causing concern at the beginning of 1999 during discussions relating to the best way to set up and manage shareable repositories of business message types. In particular it looks at the XML process model, XML namespaces, and the role of repositories in the creation of DTDs, the creation of applications and the management of DTDs.

The XML Process Model

A well-formed XML document or message consists of a set of nested elements which are contained within a root element. Each element has an element name that identifies the role played by the contents of the element. Element names are used to identify the start and end of each element, which XML requires to be clearly identified by the presence of a start tag and an end tag containing matched pairs of element names. An XML start tag can also contain specifications for one or more attributes, each of which consists of an attribute name and a quoted attribute value. The only constraints placed on a well-formed XML document are those required by the XML syntax, which defines which characters are used to identify the start and end of each type of markup used to qualify the contained data, and the way in which the component parts of this markup must be separated by whitespace or by the use of an = between attribute names and attribute values.
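The well-formedness rules described above can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the `order` message and the helper name `is_well_formed` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: a root element, two nested elements, one quoted attribute.
WELL_FORMED = '<order id="A-17"><buyer>Acme</buyer><total>12.50</total></order>'
# The same message with a mismatched end tag, so it is not well-formed.
NOT_WELL_FORMED = '<order id="A-17"><buyer>Acme</total></order>'


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(WELL_FORMED)
child_names = [child.tag for child in root]  # element names in document order
```

Note that this only tests the syntax rules; no DTD is involved, so nothing is checked about which elements or attributes are allowed.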

A document type declaration can be associated with an XML message to allow a more detailed set of constraints to be applied to the structure of the message. These constraints are expressed in the form of a document type definition (DTD) that can be used to:

DTDs consist of a set of declarations, the format of which depends on the type of object being defined. Each declaration starts with a keyword that identifies the type, and therefore format, of the declaration. The permitted set of keywords is:

A DTD may also contain any number of comment declarations that contain explanations of the role of the DTD and its components, etc. Comment declarations start with a pair of hyphens in place of a keyword, and have a matching pair of hyphens before the declaration close.

When a validating XML parser encounters a document type declaration at the start of an XML message it will automatically check that the sequence in which elements occur within the XML message conforms to the constraints expressed in the element declaration for the containing element within the DTD. It will also check that the values of any attributes conform to the restrictions declared in any attribute list declarations associated with the element name within the DTD. In addition it will check that any attribute declared as required in the DTD is present, and that any attribute declared with a fixed value has that value wherever it is specified. If an attribute is not present, but has been assigned a default value in the DTD, the parser will automatically add the attribute name and default value to the list of attributes associated with the element in the message. Where an entity reference, consisting of the entity name preceded by & and followed by a semicolon (;), is encountered the reference will be replaced by the replacement text specified in the entity declaration with the specified name. If an entity has been associated with a specific notation (an unparsed entity, which XML allows to be named only in attribute values rather than through an entity reference) then the name assigned to the notation will be searched for in the notation declarations in the DTD, and the data will be passed to the process(es) identified by the notation declaration with that name (providing the application has a mechanism for processing data of that type).
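Two of the behaviours described above, attribute defaulting and entity replacement, can be observed with Python's standard library parser. This is a sketch under stated assumptions: the `invoice` DTD, its element names and the `co` entity are invented for illustration, and the DTD is placed in the internal subset because the standard library parser does not fetch external DTDs. The parser is also non-validating, so the checks on element sequence and required attributes are not demonstrated here.

```python
import xml.etree.ElementTree as ET

# Hypothetical message with its DTD held in the internal subset.
MESSAGE = """<!DOCTYPE invoice [
  <!-- Illustrative declarations only; not taken from any real message set -->
  <!ELEMENT invoice (supplier, amount)>
  <!ELEMENT supplier (#PCDATA)>
  <!ELEMENT amount (#PCDATA)>
  <!ATTLIST amount currency CDATA "EUR">
  <!ENTITY co "Acme Trading Ltd">
]>
<invoice><supplier>&co;</supplier><amount>100</amount></invoice>"""

root = ET.fromstring(MESSAGE)

# The entity reference &co; is replaced by its declared replacement text.
supplier = root.find("supplier").text

# The currency attribute is absent from the message, so the parser
# supplies the default value declared in the attribute list declaration.
currency = root.find("amount").get("currency")
```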

As part of this validation process an XML document parser can choose to create a document object model (DOM) for the message. This model will contain a description of each element in the document hierarchy, and associate with each node in that tree the set of attribute names and values that the parser has detected in the message or added as the result of checking for default or fixed values within the DTD. The information in this model can be made available to other processes that need to be applied to the message through the DOM interfaces themselves, as a stream of parsing events via the widely adopted Simple API for XML (SAX), or via an application-dependent API.
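The difference between a tree-based and an event-based interface can be sketched with the `xml.dom.minidom` and `xml.sax` modules from Python's standard library. The `order` message and the `ElementNames` handler class are hypothetical.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical message used for both styles of access.
MESSAGE = '<order id="A-17"><buyer>Acme</buyer><total>12.50</total></order>'

# Tree view: the whole message is held in memory as a node hierarchy.
dom = xml.dom.minidom.parseString(MESSAGE)
order = dom.documentElement
dom_attributes = dict(order.attributes.items())
dom_children = [node.tagName for node in order.childNodes
                if node.nodeType == node.ELEMENT_NODE]


class ElementNames(xml.sax.ContentHandler):
    """Collect element names as the parser reports each start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


# Event view: the handler is called back while the message is being read.
handler = ElementNames()
xml.sax.parseString(MESSAGE.encode("utf-8"), handler)
```

The tree form suits processes that need random access to the message; the event form suits processes that can work through the message once from start to end.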

XML messages can refer to parts of their own document object model, or to components of the document object model of another message, by use of the XML Pointer Language (XPointer). This language allows the component parts of a document to be identified either from the contents of the DOM or by their position within the DOM.
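Python's standard library does not implement XPointer, so the following sketch uses the limited path syntax of `xml.etree.ElementTree` as a stand-in to illustrate the two ways of identifying a component: by position and by content. The `order` message and the `sku` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical message with two items.
MESSAGE = """<order>
  <item sku="A1"><name>Nut</name></item>
  <item sku="B2"><name>Bolt</name></item>
</order>"""

root = ET.fromstring(MESSAGE)

# Identify a component by its position: the second item child of the root.
by_position = root.find("item[2]")

# Identify a component by its content: the item whose sku attribute is B2.
by_content = root.find("item[@sku='B2']")
```

Both expressions select the same element here; the positional form breaks if items are reordered, while the content-based form does not.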

The DOM of an XML message can be transformed into an alternative structure using the XML transformation language that is defined as part of the Extensible Stylesheet Language (XSL) specification. As part of this process the name and order of elements and attributes can be changed, and new sets of default attributes can be assigned to elements.
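The kind of restructuring described above can be sketched by hand. This is not XSL: Python's standard library has no XSL processor, so the example below simply rebuilds the tree in code to show elements being renamed and reordered and a new default attribute being assigned. All element names, the mapping table and the `version` attribute are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source message.
SOURCE = "<order><total>12.50</total><buyer>Acme</buyer></order>"

# Hypothetical mapping from source element names to target element names,
# listed in the order required in the target structure.
RENAME_IN_ORDER = [("buyer", "customer"), ("total", "amount")]


def transform(source_root):
    """Build a new tree with renamed, reordered children and a default attribute."""
    target = ET.Element("purchase", {"version": "1"})
    for old_name, new_name in RENAME_IN_ORDER:
        found = source_root.find(old_name)
        if found is not None:
            ET.SubElement(target, new_name).text = found.text
    return target


result = ET.tostring(transform(ET.fromstring(SOURCE)), encoding="unicode")
```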

The XSL specification also defines a set of elements and attributes that an XML message can be transformed into so that the message can be displayed on a computer screen by an XSL-aware tool. It is not, however, a requirement of XSL that documents be transformed to this format. XSL allows documents to be transformed into any format required for subsequent processing.

XML Namespaces

To make it easier to manage sets of element and attribute names within DTDs that are made up of components taken from more than one source, W3C has recently approved an extension to the XML specification that allows the names of elements and attributes to be assigned a prefix that can be used to associate a name with the Internet uniform resource locator (URL) of the definition of the namespace.

Whilst initial examples of the use of Namespaces in XML have concentrated on the use of namespaces to qualify element names, it is anticipated that a more useful role for namespaces within commercial environments will be through the use of namespaces to qualify attributes. The reason for this is that a namespace-qualified attribute value represents a tuple that can be directly related to entries within a repository. To understand this, consider the following namespace-qualified attribute value specification: x:y="z".

The namespace qualifier (x) is linked, via a namespace declaration attribute in the same or a containing element, to an Internet resource, which will typically be a database/repository. The namespace declaration attribute will typically have the form: xmlns:x="http://www.standardsbody.org/repository-x". The attribute name associated with this namespace qualifier (e.g. y) will identify a relevant field or process within the repository, which will be used to find information or processes relating to the value defined as the attribute value (e.g. z).
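The tuple described above can be recovered from a parsed message. The sketch below uses `xml.etree.ElementTree`, which reports a qualified attribute name in the form `{namespace-URL}local-name`; the `price` element and the helper name `qualified_attributes` are hypothetical, while the namespace declaration and the x:y="z" attribute are those used in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical element carrying the namespace-qualified attribute x:y="z".
MESSAGE = ('<price xmlns:x="http://www.standardsbody.org/repository-x" '
           'x:y="z">12.50</price>')


def qualified_attributes(element):
    """Return (namespace URL, attribute name, value) for each qualified attribute."""
    triples = []
    for name, value in element.attrib.items():
        if name.startswith("{"):
            namespace_url, local_name = name[1:].split("}", 1)
            triples.append((namespace_url, local_name, value))
    return triples


triples = qualified_attributes(ET.fromstring(MESSAGE))
```

An application could then use the first member of each tuple to select the repository, the second to select the field or process within it, and the third as the value to look up.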

The Role of Repositories in the Process of Creating DTDs

The following types of repositories are expected to be relevant for the creation of DTDs:

  1. Repositories containing DTDs for complete EDI messages, or for subsets of messages defined according to a specified message implementation guide (a DTD repository).

    An example of such a repository is the General MIGS Repository (GMR) set up by the CEN/ISSS MIGS project for the storage of EDIFACT message implementation guidelines (MIGS) produced by European and other standards bodies and trade associations.

  2. Repositories containing storage entities whose contents are XML declarations that can be used to associate specific sets of elements and attributes with a DTD (a declarations repository).

    At present such repositories do not exist, though OASIS are currently discussing the setting up of such a repository.

  3. Repositories containing semantic definitions of the components of one or more sets of messages together with pointers to sources of relevant XML declarations (a semantic repository/glossary).

    Examples of such repositories include the ISO Basic Semantic Repository and the UN-EDIFACT Directories (neither of which currently has pointers to sources of XML declarations for their objects, but both of which have plans to extend themselves to provide access to such information) and the XML/EDI Glossary currently under construction for the XML/EDI Group.

It would be possible to build a single repository to serve all these roles, but more practically it can be envisaged that there are likely to be multiple repositories, some of which will only deal with subsets of these functions. What is important is that each of these forms of repository be able to supply the relevant information to application developers in a format that will be relevant for building the DTDs needed to validate messages to be used by specific applications.

For an application developer to be able to identify an existing DTD for a relevant message type a repository must provide the following facilities:

A repository that contains storage entities whose contents are XML declarations designed to be used to build new DTDs should provide the following facilities to an application developer:

For a glossary or other form of semantic definition repository to be of use in the creation of DTDs it must offer the following facilities:

Ideally a semantic repository will also offer the functionality found in a DTD or declarations repository. It may also be linked to a mechanism for generating formal data models, such as UML data-entity diagrams.

The Role of Repositories in the Creation of Applications

Repositories can also be used to provide other forms of information that may be relevant to the development of XML/EDI applications. Possible extensions include:

The Role of Repositories in the Management of DTDs

XML DTDs should only be updated using fully-managed procedures that lead to the creation of uniquely identified versions of DTDs. XML document type declarations should use formal public identifiers to clearly identify specific versions in addition to a system identifier containing a URL that identifies both the source and the version of the document. A typical well-managed document type declaration will have the form:

<!DOCTYPE message-x
PUBLIC "-//Standards-body-x//DTD for Message-X, 10th October 1999//EN"
"http://www.standards-body.org/DTDs/message-x-19991010.dtd">

When a specific version of a DTD has been identified the receiving system will be able to cache the DTD the first time it is received in the full knowledge that the message structure will not change while this version of the document type declaration is associated with the message.

DTD repositories should provide access to all versions of a particular DTD. They should also provide a mechanism whereby the latest version of the DTD can be identified by using a shortened form of the system identifier, using techniques similar to those employed by W3C for identifying the latest version of an XML-related specification. For example, to reference the latest version of the DTD whose specific version has been entered above the document type declaration could be shortened to read:

<!DOCTYPE message-x
PUBLIC "-//Standards-body-x//DTD for Message-X//EN"
"http://www.standards-body.org/DTDs/message-x.dtd">

Undated DTDs should not be cached on the receiving system. Instead, each time the application receives a message associated with an undated document type declaration it should check that the last version it cached is still the latest version. (This can be done using the PEP extension method to HTTP, but requires the XML/EDI Group or a similar body to get agreement between repository creators as to the format this XML message needs to take.)
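The caching rule described above can be sketched as follows. The sketch reads the public and system identifiers from a message with `xml.dom.minidom` and treats a DTD as safely cacheable only when its system identifier carries a date stamp. The date-stamp pattern (eight digits before `.dtd`) is an assumption based on the file naming used in the examples, and the function names are hypothetical. In XML syntax the system identifier follows the public identifier directly, which is the form used in the test messages here.

```python
import re
import xml.dom.minidom

# Assumption: a dated DTD has an eight-digit date stamp before the .dtd suffix.
DATED = re.compile(r"-\d{8}\.dtd$")

DATED_MESSAGE = (
    '<!DOCTYPE message-x PUBLIC '
    '"-//Standards-body-x//DTD for Message-X, 10th October 1999//EN" '
    '"http://www.standards-body.org/DTDs/message-x-19991010.dtd">'
    '<message-x/>'
)
UNDATED_MESSAGE = (
    '<!DOCTYPE message-x PUBLIC '
    '"-//Standards-body-x//DTD for Message-X//EN" '
    '"http://www.standards-body.org/DTDs/message-x.dtd">'
    '<message-x/>'
)


def external_ids(message):
    """Return the (public identifier, system identifier) pair of a message."""
    doctype = xml.dom.minidom.parseString(message).doctype
    return doctype.publicId, doctype.systemId


def may_cache(system_id):
    """A dated DTD never changes, so a cached copy can be reused without checking."""
    return bool(DATED.search(system_id))


dated_ids = external_ids(DATED_MESSAGE)
undated_ids = external_ids(UNDATED_MESSAGE)
```

For an undated identifier a receiving application would still need to ask the repository whether its cached copy is current; how that check is made is outside the scope of this sketch.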

Where DTDs are being designed for use within closed communities it may not be necessary to use formal public identifiers. Provided there are well-defined procedures for ensuring that a community updates its DTDs in a controlled manner it may be sufficient to rely on local file naming conventions within a system identifier. This could take the form of, say:

<!DOCTYPE message-x
SYSTEM "/DTDs/message-x-v1.dtd">

The association of processing templates, such as XSL stylesheets, with messages should be done at application level, and will not generally rely on the referencing of repository copies of stylesheets or processing modules. Normally the stylesheet or processing template will be copied from the repository to a local filestore prior to modification to meet local processing needs. Conventions should be set up between the communicating parties that allow the relevant processing instructions to be stored and recalled by reference to relative addresses. These conventions require the processing rules to be placed in a directory that is a fixed number of steps away from the directory used to store messages and the applications used to process them. This will, for example, allow XSL stylesheets to be referenced using instructions of the form:

<?xml-stylesheet href="/processes/message-x-v1-en.xsl" type="text/xsl"?>