The Role of Document Type Definitions in Electronic Data Interchange

Martin Bryan, The SGML Centre

This paper looks at the role document type definitions (DTDs) will play in the management of XML-based business-to-business electronic data interchange. It is designed to clarify a number of issues that seem to be causing concern at the beginning of 1999 during discussions relating to the best way to set up and manage shareable repositories of business message types. In particular it looks at:

The XML Process Model
The Role of XML Namespaces
The Role of Repositories in the Process of Creating DTDs
The Role of Repositories in the Creation of Applications
The Role of Repositories in the Management of DTDs.

The XML Process Model

A well-formed XML document or message consists of a set of nested elements which are contained within a root element. Each element has an element name that identifies the role played by the contents of the element. Element names are used to identify the start and end of each element, which XML requires to be clearly identified by the presence of a start tag and an end tag containing matched pairs of element names. An XML start tag can also contain specifications for one or more attributes, each of which consists of an attribute name and a quoted attribute value. The only constraints placed on a well-formed XML document are those required by the XML syntax, which defines which characters are used to identify the start and end of each type of markup used to qualify the contained data, and the way in which the component parts of this markup must be separated by whitespace or by the use of an = between attribute names and attribute values.
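The well-formedness rules can be illustrated with a short Python sketch. The message and its element names are hypothetical, and the parser used (ElementTree from the Python standard library) checks well-formedness only, not conformance to any DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: one root element, nested elements, quoted attributes.
WELL_FORMED = (
    '<order id="A-17"><buyer>Acme Ltd</buyer>'
    '<line qty="3">Widget</line></order>'
)

# The same kind of message with the end tags of <buyer> and <order> swapped.
NOT_WELL_FORMED = '<order id="A-17"><buyer>Acme Ltd</order></buyer>'


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(WELL_FORMED)
child_names = [child.tag for child in root]
```

Here `is_well_formed` accepts the first message and rejects the second, because the second does not contain matched pairs of start and end tags.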

A document type declaration can be associated with an XML message to allow a more detailed set of constraints to be applied to the structure of the message. These constraints are expressed in the form of a document type definition (DTD) that can be used to:

restrict the context in which certain elements can appear
restrict the set of attribute names that can be associated with a specific element name
restrict the set of values that can be assigned to an attribute
provide default or fixed values for attributes whose values do not change from message to message
extend XML's default set of entity names that can be used to identify special characters, to import part of the message from other files, or to provide predefined contents for parts of the message
restrict the data type of a particular element or entity to match a predefined notation.

DTDs consist of a set of declarations, the format of which depends on the type of object being defined. Each declaration starts with a keyword that identifies the type, and therefore format, of the declaration. The permitted set of keywords is:

ELEMENT to indicate that the declaration contains a model that constrains the permitted contents of a particular element
ATTLIST to indicate that the declaration defines the names, permitted values/types and default values of a list (set) of attributes that are to be associated with a named element
ENTITY to indicate that the declaration contains either the replacement text, or a pointer to a file containing the replacement text, that is to be used to replace a named entity reference within messages associated with the DTD
NOTATION to indicate that the declaration assigns a locally significant notation name to a process managed by a resource whose location is defined as part of the notation declaration.

A DTD may also contain any number of comment declarations that contain explanations of the role of the DTD and its components, etc. Comment declarations start with a pair of hyphens in place of a keyword, and have a matching pair of hyphens before the declaration close.
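As a minimal sketch of these declaration types, the following Python fragment embeds a small DTD in the internal subset of a message and asks the Expat parser from the standard library to report each declaration it reads. The message, its element, attribute, entity and notation names, and the notation URL are all hypothetical.

```python
import xml.parsers.expat

# Hypothetical message whose internal DTD subset uses each keyword at least
# once, plus one comment declaration. All names are illustrative only.
MESSAGE = """<!DOCTYPE order [
  <!-- Minimal order message, for illustration only -->
  <!ELEMENT order (buyer, logo?)>
  <!ELEMENT buyer (#PCDATA)>
  <!ELEMENT logo EMPTY>
  <!ATTLIST order currency (GBP|USD|EUR) "GBP">
  <!ATTLIST logo src ENTITY #REQUIRED>
  <!NOTATION gif SYSTEM "http://www.example.org/viewers/gif">
  <!ENTITY buyer-name "Acme Ltd">
  <!ENTITY logo-file SYSTEM "logo.gif" NDATA gif>
]>
<order><buyer>&buyer-name;</buyer></order>"""

declarations = []


def on_element(name, model):
    declarations.append(("ELEMENT", name))


def on_attlist(element_name, attribute_name, attribute_type, default, required):
    declarations.append(("ATTLIST", element_name + "/@" + attribute_name))


def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
    declarations.append(("ENTITY", name))


def on_notation(name, base, system_id, public_id):
    declarations.append(("NOTATION", name))


parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = on_element
parser.AttlistDeclHandler = on_attlist
parser.EntityDeclHandler = on_entity
parser.NotationDeclHandler = on_notation
parser.Parse(MESSAGE, True)

keywords_seen = {keyword for keyword, _ in declarations}
```

Expat is a non-validating parser, so this sketch only shows how the declarations are recognized; it does not check the message against them.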

When a validating XML document parser encounters a document type declaration at the start of an XML message it will automatically check that the sequence in which elements occur within the XML message conforms to the constraints expressed in the element declaration for the containing element within the DTD. It will also check that the values of any attributes conform to the restrictions declared in any attribute list declarations associated with the element name within the DTD. In addition it will check that any attribute declared as required in the DTD is present, and that any attribute declared with a fixed value has that value wherever it is specified. If an attribute is not present, but has been assigned a default or fixed value in the DTD, the parser will automatically add the attribute name and that value to the list of attributes associated with the element in the message. Where an entity reference, consisting of the entity name preceded by & and followed by a semicolon (;), is encountered the reference will be replaced by the replacement text specified in the entity declaration with the specified name. If this entity has been associated with a specific notation then the name assigned to the notation will be searched for in the notation declarations in the DTD, and the data will be passed to the process(es) identified by the notation declaration with that name (providing the application has a mechanism for processing data of that type).
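Two of these behaviours, the addition of default attribute values and the replacement of entity references, can be observed with a short Python sketch. The message and all names in it are hypothetical. The standard library parser used here does not validate, so the checks on element order and attribute values described above are not performed; only the defaulting and entity replacement are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: the start tag of <order> specifies no attributes,
# and the buyer's name is supplied through an entity reference.
MESSAGE = """<!DOCTYPE order [
  <!ELEMENT order (buyer)>
  <!ELEMENT buyer (#PCDATA)>
  <!ATTLIST order currency (GBP|USD|EUR) "GBP"
                  version  CDATA         #FIXED "1.0">
  <!ENTITY buyer-name "Acme Ltd">
]>
<order><buyer>&buyer-name;</buyer></order>"""

root = ET.fromstring(MESSAGE)

# Attributes as seen by the application, including any the parser added.
attributes = dict(root.attrib)

# Content of <buyer> after the entity reference has been replaced.
buyer_text = root.findtext("buyer")
```

After parsing, `attributes` holds the default and fixed values taken from the attribute list declaration, and `buyer_text` holds the replacement text of the entity rather than the reference itself.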

As part of this validation process an XML document parser can choose to create a document object model (DOM) for the message. This model will contain a description of each element in the document hierarchy, and associate with each node in that tree the set of attribute names and values that the parser has detected in the message or added as the result of checking for default or fixed values within the DTD. The information in this model can be made available to other processes that need to be applied to the message via the W3C DOM interfaces, as a stream of events via the widely adopted Simple API for XML (SAX), or via an application dependent API.
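The difference between a tree of nodes and a stream of parser events can be sketched with the DOM and SAX implementations in the Python standard library. The message is hypothetical, and the class name `StartTagRecorder` is introduced only for this illustration.

```python
import xml.sax
from xml.dom import minidom

MESSAGE = (
    '<order id="A-17"><buyer>Acme Ltd</buyer>'
    '<line qty="3">Widget</line></order>'
)

# Tree view: the parser builds the whole document object model in memory.
document = minidom.parseString(MESSAGE)
root = document.documentElement
dom_summary = [
    (node.tagName, dict(node.attributes.items()))
    for node in [root, *root.childNodes]
    if node.nodeType == node.ELEMENT_NODE
]


# Event view: SAX reports each start tag as it is read; no tree is built.
class StartTagRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append((name, dict(attrs.items())))


recorder = StartTagRecorder()
xml.sax.parseString(MESSAGE.encode("utf-8"), recorder)
```

Both views report the same element names and attribute values. The tree view suits processes that need to revisit parts of the message, while the event view suits processes that handle each element once as it arrives.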

XML messages can refer to parts of their own document object model, or to components of the document object model of another message, by use of the XML Pointer Language (XPointer). This language allows the component parts of a document to be identified from either the contents of the DOM or by their position within the DOM.
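The Python standard library has no XPointer processor, so the following sketch uses the limited XPath subset supported by ElementTree as a stand-in. It is not XPointer syntax; it only illustrates the two ways of identifying a component described above, by content and by position. The message is hypothetical.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<order>
  <line sku="W-1">Widget</line>
  <line sku="G-7">Gasket</line>
  <line sku="B-3">Bracket</line>
</order>"""

root = ET.fromstring(MESSAGE)

# Select by content: the line whose sku attribute has a given value.
by_content = root.find("./line[@sku='G-7']")

# Select by position: the third line child of the root (positions start at 1).
by_position = root.find("./line[3]")
```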

The DOM of an XML message can be transformed into an alternative structure using the XML transformation language that is defined as part of the Extensible Stylesheet Language (XSL) specification. As part of this process the name and order of elements and attributes can be changed, and new sets of default attributes can be assigned to elements.

The XSL specification also defines a set of elements and attributes that an XML message can be transformed into so that the message can be displayed on a computer screen by an XSL-aware tool. It is not, however, a requirement of XSL that documents be transformed to this format. XSL allows documents to be transformed into any format required for subsequent processing.

XML Namespaces

To make it easier to manage sets of element and attribute names within DTDs that are made up of components taken from more than one source, W3C has recently approved an extension to the XML specification that allows the names of elements and attributes to be assigned a prefix that can be used to associate a name with the Internet uniform resource locator (URL) of the definition of the namespace.

Whilst initial examples of the use of Namespaces in XML have concentrated on the use of namespaces to qualify element names, it is anticipated that a more useful role for namespaces within commercial environments will be through the use of namespaces to qualify attributes. The reason for this is that a namespace qualified attribute value represents a tuple that can be directly related to entries within a repository. To understand this, consider the following namespace qualified attribute value specification: x:y="z".

The namespace qualifier (x) is linked, via a namespace declaration attribute in the same or a containing element, to an Internet resource, which will typically be a database/repository. The namespace declaration attribute will typically have the form: xmlns:x="http://www.standardsbody.org/repository-x". The attribute name associated with this namespace qualifier (e.g. y) will identify a relevant field or process within the repository, which will be used to find information or processes relating to the value defined as the attribute value (e.g. z).
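The tuple can be made visible with a short Python sketch. The element name `price` and its content are hypothetical; the namespace declaration and the attribute specification are those used in the text. A namespace-aware parser replaces the prefix with the URL it is bound to, which yields the three parts of the tuple.

```python
import xml.etree.ElementTree as ET

REPOSITORY = "http://www.standardsbody.org/repository-x"

# Hypothetical element carrying the namespace declaration and the
# namespace qualified attribute x:y="z".
MESSAGE = '<price xmlns:x="' + REPOSITORY + '" x:y="z">12.50</price>'

element = ET.fromstring(MESSAGE)

# ElementTree expands the prefix: the key is "{namespace URL}local-name".
tuples = []
for qualified_name, value in element.attrib.items():
    if qualified_name.startswith("{"):
        namespace, local_name = qualified_name[1:].split("}", 1)
        tuples.append((namespace, local_name, value))
```

Each entry in `tuples` has the form (repository, field or process, value), which is the information an application would need to look up the value in the repository.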

The Role of Repositories in the Process of Creating DTDs

The following types of repositories are expected to be relevant for the creation of DTDs:

Repositories containing DTDs for complete EDI messages, or for subsets of messages defined according to a specified message implementation guide (a DTD repository).
An example of such a repository is the General MIGS Repository (GMR) set up by the CEN/ISSS MIGS project for the storage of EDIFACT message implementation guidelines (MIGS) produced by European and other standards bodies and trade associations.
Repositories containing storage entities whose contents are XML declarations that can be used to associate specific sets of elements and attributes with a DTD (a declarations repository).
At present such repositories do not exist, though OASIS are currently discussing the setting up of such a repository.
Repositories containing semantic definitions of the components of one or more sets of messages together with pointers to sources of relevant XML declarations (a semantic repository/glossary).
Examples of such repositories include the ISO Basic Semantic Repository and the UN-EDIFACT Directories (neither of which currently has pointers to sources of XML declarations for their objects, but both of which have plans to extend themselves to provide access to such information) and the XML/EDI Glossary currently under construction for the XML/EDI Group.

It would be possible to build a single repository to serve all these roles, but more practically it can be envisaged that there are likely to be multiple repositories, some of which will only deal with subsets of these functions. What is important is that each of these forms of repository be able to supply the relevant information to application developers in a format that will be relevant for building the DTDs needed to validate messages to be used by specific applications.

For an application developer to be able to identify an existing DTD for a relevant message type a repository must provide the following facilities:

One or more directories of the messages stored in the repository.
It should be possible to select directories based on criteria such as:
- the industry the DTD is designed to be used by
- the organization responsible for defining the DTDs
- the business process for which the DTD is used
- a name of the root element of the message (i.e. the one to be used as the name of the document type declaration)
A mechanism for searching the repository for significant component names/identifiers.
For example, show all messages containing an element or attribute name that includes the word "Deliver".
A mechanism for transmitting the DTD and all the files it references to the browser of the application developer.
If the DTD in its unexpanded form contains parameter entity references the file submitted to the developer should be a ZIP file containing all of the component parts of the DTD, rather than the fully expanded DTD as displayed to the developer for review purposes.
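The packaging step can be sketched in Python as follows. The repository is modelled as a dictionary of hypothetical file names and contents, and the sketch assumes that component files are referenced only through external parameter entity declarations that use a SYSTEM identifier; a real repository would also need to handle public identifiers and relative paths.

```python
import io
import re
import zipfile

# Hypothetical repository contents: file name -> file text.
REPOSITORY = {
    "message-x.dtd": (
        '<!ENTITY % parties SYSTEM "parties.mod">\n'
        "%parties;\n"
        "<!ELEMENT message-x (buyer, seller)>\n"
    ),
    "parties.mod": (
        '<!ENTITY % address SYSTEM "address.mod">\n'
        "%address;\n"
        "<!ELEMENT buyer (address)>\n"
        "<!ELEMENT seller (address)>\n"
    ),
    "address.mod": "<!ELEMENT address (#PCDATA)>\n",
    "unrelated.dtd": "<!ELEMENT other EMPTY>\n",
}

# Matches external parameter entity declarations with a SYSTEM identifier.
EXTERNAL_PE = re.compile(r'<!ENTITY\s+%\s+\S+\s+SYSTEM\s+"([^"]+)"')


def collect_files(start):
    """Return the start file plus every file it references, directly or not."""
    ordered, pending = [], [start]
    while pending:
        name = pending.pop()
        if name in ordered:
            continue
        ordered.append(name)
        pending.extend(EXTERNAL_PE.findall(REPOSITORY[name]))
    return ordered


def build_package(start):
    """Package the unexpanded DTD and its component files as a ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in collect_files(start):
            archive.writestr(name, REPOSITORY[name])
    return buffer.getvalue()


package = build_package("message-x.dtd")
packaged_names = zipfile.ZipFile(io.BytesIO(package)).namelist()
```

The archive contains the DTD in its unexpanded form together with the component files it refers to, and omits files that the DTD does not reference.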

A repository that contains storage entities whose contents are XML declarations designed to be used to build new DTDs should provide the following facilities to an application developer:

A mechanism for searching the repository for significant component names/identifiers.
For example, show all messages containing an element or attribute name that includes the word "Deliver".
A mechanism for indicating the DTDs in which selected/identified elements are known to be used.
The fact that an element is used in many DTDs that define messages similar in type to the one to be developed will help the application developer in the selection of appropriate components. (It might also help him to identify complete DTDs that could be easily tailored to meet his needs.)
A mechanism for showing the declarations used to formally define each element and attribute of a selected component.
Users should be shown not only the declaration for the element(s) selected but also the declarations for any elements that are included in the model group that constrains the permitted contents of the element. If these elements permit nested elements in their contents the declarations for these elements should be added to the display recursively so that the application developer can browse the declarations associated with all permitted subcomponents of the selected element.
A mechanism for indicating that a particular element, and all its permitted subcomponents, is to be added to the set of information to be used by the new message.
When an element is selected for inclusion in a new DTD the files containing all the declarations associated with it must be added to the list required to create the new DTD.
A mechanism whereby application developers can indicate that certain elements, attributes, etc, currently used in the model of an element or its subcomponents are not required in the new DTD.
The system should be able to remove the name of non-compulsory subelements from the model of an element, together with the associated element declarations. It should also be possible to indicate which optional attributes are to be omitted from the new DTD, and to indicate entries in permitted lists of attribute values that are not required for the new application.
A mechanism whereby the declarations for a selected set of elements, etc, can be "packaged" for delivery to the application developer.
Once all the elements needed to construct the DTD have been identified, and tailored appropriately, it should be possible to build a ZIP file containing all the files needed to build the new DTD for distribution to the application developer.
A mechanism to register that the transmitted elements are now being used as part of a new DTD.
This information is needed to guide future DTD developers to the new DTD, or to determine the fields in which the elements would be most appropriate. Ideally it should also be possible to record which attributes of the element have been used by the new DTD.

For a glossary or other form of semantic definition repository to be of use in the creation of DTDs it must offer the following facilities:

A mechanism for searching the repository for significant words and phrases.
It should be possible to do a free text search of the whole repository for relevant terms.
A mechanism for displaying repository records matched by search terms.
When a match is found the repository must be able to display the whole contents of the matched record so that the user can determine whether or not the identified component is of the type required. The displayed information may contain links to other data sources, such as the formal XML declarations that have been associated with the component in different DTDs.
A mechanism for showing what types of information are available in each selected repository record.
It should be possible to easily identify what type of data each link from the repository record takes you to. It should be possible to clearly distinguish between links that take you to complete DTDs that use the component, to declarations that can be used to add the component to a new DTD, or to links that simply take you to semantic definitions that describe the role of the component.

Ideally a semantic repository will also offer the functionality found in a DTD or declarations repository. It may also be linked to a mechanism for generating formal data models, such as UML data-entity diagrams.

The Role of Repositories in the Creation of Applications

Repositories can also be used to provide other forms of information that may be relevant to the development of XML/EDI applications. Possible extensions include:

The provision of XSL stylesheets that will produce a displayable version of a message on an XSL-aware browser.
Entries in a DTD repository may be linked to one or more stylesheets designed to display the message in a specific language/environment.

Entries in a declarations repository may be linked to one or more XSL templates that define the processing to be associated with the elements/attributes defined by that declaration set. Each template may refer to one or more sharable functions whose definitions will need to be recursively called to create an XSL stylesheet that can provide a start point for the application developer.

Entries in a semantic repository/glossary may reference sharable programs designed to process specific classes of message components.
The provision of an XSL stylesheet (or HTML form with associated CGI script) that defines a form that can be used to capture the data required for a specific message and convert it into an XML/EDI message.
Different stylesheets can be developed for the creation of the message by different language or user communities. By copying and modifying a stylesheet already used in association with the selected message in one environment an application developer can save many hours of work when developing a new application.
The provision of modules that will create messages conforming to a particular DTD by making calls to one or more databases using a database specific language such as SQL.
By providing a customizable template that can be used to generate a relevant set of database calls trade associations can save application developers working in their area many hours of work.
The provision of data validation modules that can be imported into applications to validate the contents of specific elements in a data capture form, or when received as part of a message.
DTDs and declaration repositories can include definitions of fixed attributes that can be used to invoke data validation modules using a "standardized" notation name, attribute value, namespace declaration and/or reference to an Internet uniform resource locator.
The provision of plug-and-play modules for processing data in defined notations within applications.
XML notations are defined in terms of references to Internet uniform resource locators. To be of use application developers have to be able to map these URLs to modules that are applicable in their application environments. By providing links to tools that know about specific types of data trade associations can reduce the time needed to develop new applications in which it is necessary to transfer non-textual data (e.g. for the exchange of X-ray images in healthcare informatics applications, or the exchange of manufacturing data or drawings as part of complex tendering/ordering processes).
The provision of modules that can reference nodes in a DOM and load their contents into specified fields within a database.
Data receivers need to be able to map data within received messages into local databases. Various techniques can be employed for this, including the use of the XSL transformation language to create a set of SQL instructions from an XML message. By providing templates for mapping a particular message to a documented database format trade associations can make it easier for application developers to associate message contents with their local databases.
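The mapping of message contents to a local database can be sketched as follows. The sketch uses Python in place of the XSL transformation language mentioned above, and the message, the table layout and the `LINE_MAPPING` template are all hypothetical. The mapping template plays the role of the information a trade association might publish alongside a message definition.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical received message.
MESSAGE = """<order id="A-17">
  <line sku="W-1" qty="3"/>
  <line sku="G-7" qty="12"/>
</order>"""

# Hypothetical mapping template: database column -> attribute of <line>.
LINE_MAPPING = {"sku": "sku", "quantity": "qty"}

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE order_lines (order_id TEXT, sku TEXT, quantity INTEGER)"
)

order = ET.fromstring(MESSAGE)
columns = ["order_id", *LINE_MAPPING]

# Build one parameterized INSERT statement from the mapping template.
statement = "INSERT INTO order_lines ({}) VALUES ({})".format(
    ", ".join(columns), ", ".join("?" for _ in columns)
)

# Values taken from the message are passed as parameters, never pasted
# into the SQL text.
for line in order.findall("line"):
    values = [order.get("id")]
    values += [line.get(attribute) for attribute in LINE_MAPPING.values()]
    connection.execute(statement, values)

rows = connection.execute(
    "SELECT order_id, sku, quantity FROM order_lines ORDER BY sku"
).fetchall()
```

Keeping the mapping in a separate template means that a developer adapting the sketch to a different local database only needs to change the column names, not the code that walks the message.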

The Role of Repositories in the Management of DTDs

XML DTDs should only be updated using fully-managed procedures that lead to the creation of uniquely identified versions of DTDs. XML document type declarations should use formal public identifiers to clearly identify specific versions in addition to a system identifier containing a URL that identifies both the source and the version of the document. A typical well-managed document type declaration will have the form:

<!DOCTYPE message-x
PUBLIC "-//Standards-body-x//DTD for Message-X, 10th October 1999//EN"
"http://www.standards-body.org/DTDs/message-x-19991010.dtd">

When a specific version of a DTD has been identified the receiving system will be able to cache the DTD the first time it is received in the full knowledge that the message structure will not change while this version of the document type declaration is associated with the message.

DTD repositories should provide access to all versions of a particular DTD. They should also provide a mechanism whereby the latest version of the DTD can be identified by using a shortened form of the system identifier, using techniques similar to those employed by W3C for identifying the latest version of an XML-related specification. For example, to reference the latest version of the DTD whose specific version was identified above, the document type declaration could be shortened to read:

<!DOCTYPE message-x
PUBLIC "-//Standards-body-x//DTD for Message-X//EN"
"http://www.standards-body.org/DTDs/message-x.dtd">

Undated DTDs should not be cached on the receiving system. Instead the application should check each time it receives a message with this document type declaration associated with it that the last version it cached is still the latest version. (This can be done using the PEP extension method to HTTP, but requires the XML/EDI Group or a similar body to get agreement between repository creators as to the format the XML message used for this check needs to take.)
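The caching rule can be sketched in Python. The document type declarations are written in the XML 1.0 form, in which the system identifier follows the public identifier directly. The sketch assumes, from the examples above, that dated versions can be recognized by a system identifier ending in an eight-digit date; the function names and that naming convention are illustrative only, not part of any agreed scheme.

```python
import re
import xml.parsers.expat

DATED = (
    "<!DOCTYPE message-x PUBLIC "
    '"-//Standards-body-x//DTD for Message-X, 10th October 1999//EN" '
    '"http://www.standards-body.org/DTDs/message-x-19991010.dtd">'
    "<message-x/>"
)
UNDATED = (
    '<!DOCTYPE message-x PUBLIC "-//Standards-body-x//DTD for Message-X//EN" '
    '"http://www.standards-body.org/DTDs/message-x.dtd">'
    "<message-x/>"
)

# Assumed convention: dated versions end in -YYYYMMDD.dtd
DATED_SYSTEM_ID = re.compile(r"-\d{8}\.dtd$")


def read_doctype(message):
    """Return the name and identifiers given in the document type declaration."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, system_id=system_id, public_id=public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(message, True)
    return found


def may_cache(message):
    """Dated DTDs may be cached; undated ones must be rechecked each time."""
    system_id = read_doctype(message)["system_id"]
    return bool(DATED_SYSTEM_ID.search(system_id))
```

A receiving application could call `may_cache` on each incoming message and only contact the repository when the result is false.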

Where DTDs are being designed for use within closed communities it may not be necessary to use formal public identifiers. Providing there are well defined procedures for ensuring that a community updates its DTDs in a controlled manner it may be sufficient to rely on local file naming conventions within a system identifier. This could take the form of, say:

<!DOCTYPE message-x
SYSTEM "/DTDs/message-x-v1.dtd">

The association of processing templates, such as XSL stylesheets, with messages should be done at application level, and will not generally rely on referencing repository copies of stylesheets or processing modules. Normally the stylesheet or processing template will be copied from the repository to a local filestore prior to modification to meet local processing needs. Conventions should be set up between the communicating parties that allow the relevant processing instructions to be stored and recalled by reference to relative addresses. These conventions require the processing rules to be placed in a directory that is a fixed number of steps away from the directory used to store messages and the applications used to process them. This will, for example, allow XSL stylesheets to be referenced using instructions of the form:

<?xml-stylesheet href="/processes/message-x-v1-en.xsl" type="text/xsl"?>