ISIS XML/EDI Project Deliverable D2:
Best Practices for Creating XML/EDI DTDs

11th October 1999

Background

XML/EDI document type definitions (DTDs) will generally be derived from one of the following three sources:

a formally defined data model,
existing EDI messages, such as those defined in EDIFACT, X12 or other EDI syntaxes, or
an informal definition of business requirements based on existing sets of terms (semantics) which can be used as the starting point for message development.

The ISIS European XML/EDI pilot project provides specific worked-through results in the first two of these categories, and offers guidance, in this document, that is applicable to all three categories.

Three main techniques for converting existing EDI messages to XML can be identified:

direct translation of all data components within a particular directory to XML elements
use of specific message implementation guidelines (MIGs) as the starting point
development of new object libraries which retain some links to existing message definitions.

The first of these approaches is being taken by existing EDI vendors in the EDIFACT and X12 communities, and the third by many US-based consortiums, such as CommerceOne and RosettaNet. This project has restricted itself to studying the effects of using MIGs as the starting point for developing XML/EDI DTDs based on existing messages.

The ISIS European XML/EDI Pilot Project has investigated how XML message structures can be developed to meet two very different sets of requirements:

those of the transport industry, for conveying the messages involved in the control of container movements, including booking and confirming on-land transport requirements
those of healthcare, for conveying patients' medical records between GPs, or between GPs and hospitals.

For healthcare, the starting point was the EHCR (Electronic Healthcare Record) model that was recently adopted as a pre-standard by CEN TC251. For transport, the starting point was a set of existing EDIFACT messages, as qualified by the message implementation guidelines (MIGs) in use in one user community (the Finnish trucking industry).

In the healthcare arena, where the starting point was a formal model developed in UML, it was possible to develop a set of rules for generating an XML DTD from the model which will be generally applicable to any data set defined using a UML model. These rules are presented in a separate document: Mapping from UML Generalized Message Descriptions to XML DTDs

A second project in the healthcare arena evaluated the use of simplified local DTDs for data capture and manipulation purposes. Data captured using the simplified DTDs can, after undergoing any required local processing, be mapped to the richer formal model for interchange with other systems through the use of XSL Transformation specifications. Details of this approach are presented in a separate document: Best practices for linking local applications to a communication standard.

In the transport arena the lack of a formal model underlying the EDIFACT messages made it difficult to identify all the inter-relationships between the data elements, composites and segments forming each message. Often these inter-relationships are defined by means of commentary within the message implementation guidelines, rather than being defined in the main directory definitions of the component parts. Nevertheless, it was possible to develop a set of rules that enable XML DTDs to be developed on the basis of the message structure defined in the MIG for a specific application of an EDIFACT message. These rules are presented in a separate document: Rules for Mapping Existing EDIFACT MIGS to XML DTDs

An alternative approach to the development of XML DTDs for EDIFACT messages is to start from a functional specification of the requirements, based on the business terms defined in the EDIFACT directory, without regard to the segments they form part of. At the time of writing, this approach has not been evaluated.

This document presents the conclusions and Best Practice guidelines that have emerged from the work of the project to date, and which are common to the two application domains (healthcare and transport) and the two formal mechanisms that served as starting points for the work (UML and EDIFACT).

It is likely that additional conclusions and guidelines, or enhancements to those presented here, will emerge in the latter stages of this project, in light of the fact that:

work is continuing on building prototype software applications that make use of the XML DTDs that have been developed using the first two approaches
W3C is in the process of developing a new XML Schema specification, which will allow the definition of Structures and Datatypes using the XML syntax rather than a DTD
work is under way within the pilot project to evaluate the capabilities of the current working draft of the XML Schema Structure and Datatypes specifications, and to develop guidelines for their use in the two application domains the project has been considering.

The final report and recommendations of the pilot project will, therefore, include an update to the present document, incorporating any later findings resulting from project work.

Guidelines that are specific to one or the other approach (starting from the EDIFACT MIG, or starting from a formal UML model) are detailed in the attached mapping documents cited above.

Overall Conclusions

The application of data modelling techniques to message design allows all of the relationships between message components to be formally recorded. Models can either relate to a single message, or to a set of related messages. Alternatively you can model the whole business process and then extract from this the information components that need to be transmitted as a single message.

Where no suitable message currently exists, and time and expertise permit, formal modelling of requirements has obvious advantages. Where the requirement is to make the maximum use of an existing message format, the starting point should be the semantics and relationships defined in a specific message implementation guideline rather than the generalized definition of the message type provided in a message directory. While the MIG should reflect the structure of the message as defined in the general message directory, this should not be allowed to mask the relationships between message components identified within the commentary of the MIG.

Note: There may be advantages in creating a formal model for the message to record the relationships between the message components used in the MIG. Often this process will identify inconsistencies in the current structure of the message and will suggest ways in which a more functionally structured message can be created.

One of the major advantages that XML will bring to the EDI community will be its unique facility to transform message structures to meet user needs. This means that an XML version of a message can be structured differently from its EDIFACT equivalent without destroying the ability of the two coding schemes to interwork. In addition local DTDs can be used to capture or manipulate subsets of larger messages for use in specific environments.

By concentrating on the functional requirements of a message rather than its physical structure, EDI practitioners will be better able to reuse message components and software "agents" designed to undertake the processing of particular sets of data. The hierarchical data structure used within XML provides a natural "containment" mechanism that can be utilized to group data that needs to be processed at the same time. By adopting a hierarchical approach to message structuring and processing, EDI users will be able to incrementally introduce formal modelling of message components.

In analyzing the similarities and differences between the mapping approaches taken by the project, the following major principles have been identified:

Messages should be simplified by removing rarely used data components
Message component names should be self-explanatory to programmers
Wherever possible default values should be defined to reduce the size of transmitted messages to the smallest set of update information required.

Message Simplification

The messages defined in EDIFACT message directories are typically designed for multiple purposes, and therefore represent overkill as far as the design of efficient XML messages is concerned. Users of EDIFACT messages are required to prepare message implementation guidelines that identify which subset of a message is to be used for a specific application (or set of applications).

Each MIG should identify the minimum set of permitted values from each referenced UN-approved code list, rather than allowing users to select any permitted value listed in the UN directory. For code lists approved by other agencies, mechanisms should be provided for local reduction of the list of permissible codes, ideally through use of references to locally accessible code sub-lists that can be referred to using parameter entity references in the XML DTD. Values selected from code lists should be recorded as attribute values rather than as the contents of elements.
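The restriction of a code list to the subset permitted by a MIG can be sketched as a small helper that generates the corresponding attribute declaration. This is only an illustration: the function name, the element and attribute names, and the three packaging codes are assumptions made for the example and are not taken from the project's DTDs.

```python
def attlist_for_codes(element, attribute, codes, default=None):
    """Build an ATTLIST declaration that enumerates only the permitted codes."""
    if not codes:
        raise ValueError("at least one permitted code is required")
    if default is not None and default not in codes:
        raise ValueError(f"default {default!r} is not a permitted code")
    enumeration = " | ".join(codes)
    # Without a default, the sender must always state the code explicitly.
    default_part = f'"{default}"' if default is not None else "#REQUIRED"
    return f"<!ATTLIST {element} {attribute} ({enumeration}) {default_part}>"


# Hypothetical subset of a packaging code list, as a MIG might permit it.
PERMITTED_PACKAGE_CODES = ["CT", "DR", "PX"]
DECLARATION = attlist_for_codes(
    "Goods", "PackageType", PERMITTED_PACKAGE_CODES, default="PX"
)
```

Because the codes become an enumerated attribute type, each code must be a valid XML name token, and a validating parser will reject any value outside the local subset.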

In cases where the EDIFACT message allows the same message component to occur at more than one point in the structure only one of the options should be selected, unless it is obvious that there is a need for the component to appear in multiple contexts (e.g. for a free text annotation). Where the same component must occur more than once at the same level in the hierarchy each element representing the component should have its name qualified by either a property of the component or by a number (e.g. DocumentDate and DeliveryDate or Date1 and Date2).
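The qualification rule for components that repeat at the same level can be sketched as follows. The function and its input format (a list of name and optional qualifier pairs) are assumptions made for illustration only.

```python
from collections import Counter


def qualify_siblings(components):
    """Give each sibling component a distinct element name.

    `components` is a list of (name, qualifier) pairs, where the
    qualifier is a property of the component or None.
    """
    totals = Counter(name for name, _ in components)
    numbered = Counter()
    result = []
    for name, qualifier in components:
        if totals[name] == 1:
            # The name is already unique at this level.
            result.append(name)
        elif qualifier:
            # Prefer a meaningful property, e.g. Document + Date.
            result.append(qualifier + name)
        else:
            # Fall back to numbering: Date1, Date2, ...
            numbered[name] += 1
            result.append(f"{name}{numbered[name]}")
    return result
```

For example, two Date components qualified as Document and Delivery become DocumentDate and DeliveryDate, while two unqualified Date components become Date1 and Date2.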

General data models described using UML are typically defined to cover a much larger scope than is required for a single message. UML models have no in-built hierarchical structure, representing networks rather than hierarchies. Therefore UML models need to be subsetted into hierarchically structured sets that identify only the objects that are required for a particular message prior to the creation of an XML definition of the message structure. The hierarchy developed should seek to ensure that each message component only occurs once at each level in the hierarchy, unless the component name is suitably qualified or numbered.

Where an element would only contain one other element then it should be removed from the XML model and be replaced by the contained element in any models that reference it.
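A minimal sketch of this rule is given below, assuming the content models are held as a simple mapping from element name to a list of child names. That representation, the function name and the element names are hypothetical, and occurrence indicators such as + and * are ignored for brevity.

```python
def collapse_single_child_wrappers(models):
    """Return a copy of `models` with single-element wrappers removed.

    `models` maps each element name to the list of names in its content
    model; "#PCDATA" marks character data and is never a key.
    """
    def resolve(name, seen=()):
        children = models.get(name)
        # A wrapper is a declared element whose model is exactly one
        # other declared element; `seen` guards against cycles.
        if (children and len(children) == 1
                and children[0] in models and name not in seen):
            return resolve(children[0], seen + (name,))
        return name

    wrappers = {name for name in models if resolve(name) != name}
    return {
        name: [resolve(child) for child in children]
        for name, children in models.items()
        if name not in wrappers
    }


MODELS = {
    "FirmBooking": ["Party", "GoodsDetails"],
    "GoodsDetails": ["GoodsItem"],          # wrapper around one element
    "GoodsItem": ["Description", "Weight"],
    "Party": ["#PCDATA"],
    "Description": ["#PCDATA"],
    "Weight": ["#PCDATA"],
}
SIMPLIFIED = collapse_single_child_wrappers(MODELS)
```

In this example GoodsDetails disappears and FirmBooking refers to GoodsItem directly. Note that the sketch would also remove a root element that contains only one other element, so in practice the root would need to be exempted.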

Message Component Naming

Modelling systems tend to use long names to uniquely identify model components within a basically flat structure. These names typically do not take into account the context in which the component will be used by a specific application.

EDIFACT message components are assigned unique alphanumeric identifiers. For each identifier there is, in the EDIFACT message directory that contains the formal definitions of each component, an English name that helps to identify the role of the component to English speakers, but this name may not be unique. The names assigned to EDIFACT message components often need to be qualified to identify the context in which the code is being used. This is typically done by associating a qualifier attribute with the message component.

XML messages are inherently hierarchical. An XML path definition used to identify an element within a hierarchy can include any or all of the parents of the element and/or its contents (e.g. FirmBooking//Party[PartyID="152433"]).
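As an illustration of such a path, the sketch below uses the limited path support in Python's standard xml.etree.ElementTree module. The message instance, including the Consignor and Consignee containers and the party names, is invented for the example. Because ElementTree evaluates paths relative to the element they are applied to, the path is written as .//Party[...] from the FirmBooking root.

```python
import xml.etree.ElementTree as ET

# Hypothetical message instance; only FirmBooking, Party and PartyID
# come from the path quoted in the text.
MESSAGE = """<FirmBooking>
  <Consignor>
    <Party><PartyID>152433</PartyID><Name>Sender Oy</Name></Party>
  </Consignor>
  <Consignee>
    <Party><PartyID>998877</PartyID><Name>Receiver Ltd</Name></Party>
  </Consignee>
</FirmBooking>"""

root = ET.fromstring(MESSAGE)

# Select every Party, at any depth, whose PartyID child has this value.
matches = root.findall(".//Party[PartyID='152433']")
names = [party.findtext("Name") for party in matches]
```

The same Party element name is used in both containers; its role is identified by its position in the hierarchy rather than by a longer, context-specific name.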

To allow applications to take advantage of the powerful structured component identification techniques provided by XML the names of XML elements and attributes used to record EDI message components should be simplified by applying the following rules:

Element and attribute names should be meaningful to the community that is using them. Where possible unique identifiers to standardized (ideally on-line) definitions of the meanings of the name should be provided as a fixed, namespace controlled, property of the element (e.g. <!ATTLIST LocalName UN-EDIFACT:Segment CDATA #FIXED "ABC" ...).

Note: Such identifiers indicate equivalences between the components of different message types that local agents can use to automate message processing.

Any part of a class or attribute name that is implicit from its containment should be omitted from an XML element or attribute name. However, this rule should not be applied where its application would result in assigning the same name to different concepts or to similar concepts represented by a different structure.
Spaces and punctuation marks should be removed and the first letter after each space, punctuation or sequence of spaces and punctuation should be capitalised. (To avoid confusion all letters other than the first letter of the name and the first letter after each removed space should be lower case characters.)
Prepositions and conjunctions should be removed by reordering the associated words (e.g. "identification of message by originator" could become OriginatorMessageIdentification).
Words which add no meaning, such as words that identify the domain of the name (e.g. "health") or the type of the contained value (e.g. "list" or "value"), should be removed from element and attribute names.
Words that are used frequently in the names of message components should be replaced, consistently, by readily intelligible abbreviations (e.g. Id for "identifier" or No for "number").

Note: The abbreviation process can be used to reduce different forms of words to a common stem (e.g. "identifier" and "identification" can both be mapped to Id).
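The mechanical parts of these rules (removing spaces and punctuation, capitalising, dropping words that add no meaning, and abbreviating) can be sketched as below. The word tables are illustrative assumptions built from the examples in the text. Reordering words to remove prepositions and omitting parts implied by containment require human judgement and are not attempted.

```python
import re

# Illustrative tables only; a real community would agree its own.
ABBREVIATIONS = {"identifier": "Id", "identification": "Id", "number": "No"}
NOISE_WORDS = {"health", "list", "value"}


def xml_name(description):
    """Derive an element or attribute name from a descriptive name."""
    # Split on any run of spaces and punctuation.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", description) if w]
    parts = []
    for word in words:
        lower = word.lower()
        if lower in NOISE_WORDS:
            continue
        # Abbreviate frequent words; otherwise capitalise the first
        # letter and lower-case the rest.
        parts.append(ABBREVIATIONS.get(lower, lower.capitalize()))
    return "".join(parts)
```

For example, "originator message identification" yields OriginatorMessageId and "package type (list)" yields PackageType. The sketch does not check that the result is a legal XML name; a description beginning with a digit, for instance, would need further handling.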

Let the computer do the work

Wherever possible the DTD should provide information that can be used to reduce the size of a message. Useful techniques include:

Defining unchanging message components as fixed attributes whose value is predefined in the DTD and, therefore, does not need to be exchanged in the message instance.
Defining message components that only rarely change their value as default values for attributes, so that only those messages that require a change need to be assigned alternative values for the relevant attribute.
Using external parameter entities to reference locally significant subsets of permitted attribute values.
Using references to unparsed external general entities to identify constraints that apply to an element. (The notation of the unparsed entity will identify which agents are needed to validate the constraints.)
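The first two techniques can be illustrated with a short sketch using Python's standard xml.parsers.expat module, which reports attribute values derived from declarations in the internal DTD subset as well as those given in the instance. The element names, attribute names and values are assumptions made for the example. The entity-based techniques depend on external resources and are not shown.

```python
from xml.parsers import expat

# Hypothetical DTD and instance. MessageType is fixed and Currency and
# Packaging have defaults, so the instance states only the exception.
MESSAGE = """<!DOCTYPE FirmBooking [
  <!ELEMENT FirmBooking (Goods+)>
  <!ELEMENT Goods (#PCDATA)>
  <!ATTLIST FirmBooking
      MessageType CDATA #FIXED "FirmBooking"
      Currency    (EUR | FIM | USD) "EUR">
  <!ATTLIST Goods
      Packaging (Pallet | Carton | Drum) "Pallet">
]>
<FirmBooking>
  <Goods>Paper reels</Goods>
  <Goods Packaging="Drum">Lubricant</Goods>
</FirmBooking>"""


def read_attributes(document):
    """Return (element name, attributes) pairs as reported by the parser."""
    seen = []
    parser = expat.ParserCreate()
    # By default expat includes attributes supplied by DTD declarations.
    parser.StartElementHandler = (
        lambda name, attrs: seen.append((name, dict(attrs)))
    )
    parser.Parse(document, True)
    return seen


ELEMENTS = read_attributes(MESSAGE)
```

The receiving application sees MessageType, Currency and Packaging on every relevant element, although the only attribute that was actually transmitted is the Packaging value that differs from the default.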

Future enhancements

The techniques proposed in this paper are based on functionality provided in the parts of the current XML 1.0 specification that relate to the creation of document type definitions (DTDs), which can be used to "validate" well-formed XML messages.

During the course of the ISIS European XML/EDI Pilot Project W3C have started a new working group to define how to describe the permitted structure of XML data streams using XML-defined schemas. Associated with this schema definition language is a mechanism for defining the data type constraints that should apply to XML element contents and attribute values.

At the time of writing these guidelines (September 1999) the XML Schema proposals were still in draft form, and tools for the validation of XML data against schemas were not generally available. Whilst many of the constraints that apply to XML/EDI messages could be defined using the datatyping parts of the XML Schema proposal, it is by no means clear that all required constraints could be handled within the limits of the proposal. Concerns have, for instance, been expressed about certain aspects of the management of large code lists, and the subsetting of code lists, when XML Schemas are used for data validation. In addition, no facilities are currently provided for identifying relationships within multilingual code lists whose presentation order will be dependent on the user's preferred language.

Over the longer term it is clear that the advantages (or disadvantages) of using XML Schemas to describe the structure of, and constraints that apply to, XML/EDI messages need to be fully evaluated. It is anticipated, however, that the rules stated in this paper, and the associated papers defining rules for mapping from EDIFACT MIGs and UML models to XML message structures, will apply equally well when XML Schemas are being used for message component validation.

Acknowledgements

This paper has been produced as a result of research undertaken by members of the ISIS European XML/EDI Pilot Project consortium (http://www.tieke.fi/isis-xmledi) with co-funding from the European Commission's Information Society Initiative for Standardization (ISIS).