Three main techniques for converting existing EDI messages to XML can be identified:
The ISIS European XML/EDI Pilot Project has investigated how XML message structures can be developed to meet two very different sets of requirements:
In the healthcare arena, where the starting point was a formal model developed in UML, it was possible to develop a set of rules for generating an XML DTD from the model which will be generally applicable to any data set defined using a UML model. These rules are presented in a separate document: Mapping from UML Generalized Message Descriptions to XML DTDs
A second project in the healthcare arena evaluated the use of simplified local DTDs for data capture and manipulation purposes. Data captured using the simplified DTDs can, after undergoing any required local processing, be mapped to the richer formal model for interchange with other systems through the use of XSL Transformation specifications. Details of this approach are presented in a separate document: Best practices for linking local applications to a communication standard.
In the transport arena, the lack of a formal model underlying the EDIFACT messages made it difficult to identify all the inter-relationships between the data elements, composites and segments forming each message. Often these inter-relationships are defined by means of commentary within the message implementation guidelines (MIGs), rather than being defined in the main directory definitions of the component parts. Nevertheless, it was possible to develop a set of rules that enable XML DTDs to be developed on the basis of the message structure defined in the MIG for a specific application of an EDIFACT message. These rules are presented in a separate document: Rules for Mapping Existing EDIFACT MIGS to XML DTDs
An alternative approach to the development of XML DTDs for EDIFACT messages is to start from a functional specification of the requirements, based on the business terms defined in the EDIFACT directory, without regard to the segments they form part of. At the time of writing, this approach has not been evaluated.
This document presents the conclusions and Best Practice guidelines that have emerged from the work of the project to date, and which are common to the two application domains (healthcare and transport) and the two formal mechanisms that served as starting points for the work (UML and EDIFACT).
It is likely that additional conclusions and guidelines, or enhancements to those presented here, will emerge in the latter stages of this project, in light of the fact that:
Guidelines that are specific to one or the other approach (starting from the EDIFACT MIG, or starting from a formal UML model) are detailed in the attached documents:
Where no suitable message currently exists, and time and expertise permit, formal modelling of requirements has obvious advantages. Where the requirement is to make the maximum use of an existing message format, the starting point should be the semantics and relationships defined in a specific message implementation guideline (MIG) rather than the generalized definition of the message type provided in a message directory. While the MIG should reflect the structure of the message as defined in the general message directory, this should not be allowed to mask the relationships between message components identified within the commentary of the MIG.
Note: There may be advantages in creating a formal model for the message to record the relationships between the message components used in the MIG. Often this process will identify inconsistencies in the current structure of the message and will suggest ways in which a more functionally structured message can be created.
One of the major advantages that XML will bring to the EDI community is its facility to transform message structures to meet user needs. This means that an XML version of a message can be structured differently from its EDIFACT equivalent without destroying the ability of the two coding schemes to interwork. In addition, local DTDs can be used to capture or manipulate subsets of larger messages for use in specific environments.
By concentrating on the functional requirements of a message rather than its physical structure, EDI practitioners will be better able to reuse message components and software "agents" designed to undertake the processing of particular sets of data. The hierarchical data structure used within XML provides a natural "containment" mechanism that can be utilized to group data that needs to be processed at the same time. By adopting a hierarchical approach to message structuring and processing, EDI users will be able to incrementally introduce formal modelling of message components.
In analyzing the similarities and differences between the mapping approaches taken by the project, the following major principles have been identified:
Each MIG should identify the minimum set of permitted values from each referenced UN-approved code list, rather than allowing users to select any permitted value listed in the UN directory. For code lists approved by other agencies, mechanisms should be provided for local reduction of the list of permissible codes, ideally through use of references to locally accessible code sub-lists that can be referred to using parameter entity references in the XML DTD. Values selected from code lists should be recorded as attribute values rather than as the contents of elements.
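As a minimal sketch of the code sub-list idea, the following Python fragment generates a parameter entity holding a reduced code list and an attribute declaration that refers to it. The entity name `PartyRoles`, the element `Party`, the attribute `role` and the three codes are hypothetical values chosen for illustration; they are not taken from any MIG discussed here.

```python
def code_list_declarations(entity_name, element, attribute, codes):
    """Build DTD text for a locally reduced code list.

    The codes are declared once in a parameter entity, and the
    attribute declaration refers to that entity, so the sub-list
    can be replaced locally without editing the element definitions.
    All names passed in are illustrative assumptions.
    """
    enumerated = " | ".join(codes)
    return (
        f'<!ENTITY % {entity_name} "{enumerated}">\n'
        f'<!ATTLIST {element} {attribute} (%{entity_name};) #REQUIRED>'
    )


# Hypothetical reduced sub-list of party role codes.
declarations = code_list_declarations("PartyRoles", "Party", "role", ["CZ", "CN", "CA"])
print(declarations)
```

Because the permitted values are carried in an attribute with an enumerated type, a validating parser can reject codes outside the sub-list. Note that parameter entity references inside a declaration are only permitted in the external DTD subset.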
In cases where the EDIFACT message allows the same message component to occur at more than one point in the structure only one of the options should be selected, unless it is obvious that there is a need for the component to appear in multiple contexts (e.g. for a free text annotation). Where the same component must occur more than once at the same level in the hierarchy each element representing the component should have its name qualified by either a property of the component or by a number (e.g. DocumentDate and DeliveryDate or Date1 and Date2).
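The naming rule for repeated components can be sketched as follows. This is an illustrative Python helper, not part of the project's published rules: the function name `qualify_names` and the input representation (a list of base name and optional qualifier pairs for one level of the hierarchy) are assumptions.

```python
from collections import Counter


def qualify_names(components):
    """Return element names that are unique at one hierarchy level.

    components: list of (base_name, qualifier_or_None) pairs.
    A component that occurs once keeps its base name. A repeated
    component is prefixed with its qualifier when one is given,
    and numbered otherwise. Clashes between identical qualifiers
    are not handled in this sketch.
    """
    totals = Counter(base for base, _ in components)
    numbered = Counter()
    names = []
    for base, qualifier in components:
        if totals[base] == 1:
            names.append(base)
        elif qualifier:
            names.append(f"{qualifier}{base}")
        else:
            numbered[base] += 1
            names.append(f"{base}{numbered[base]}")
    return names


by_property = qualify_names([("Date", "Document"), ("Date", "Delivery"), ("Party", None)])
by_number = qualify_names([("Date", None), ("Date", None)])
print(by_property)  # ['DocumentDate', 'DeliveryDate', 'Party']
print(by_number)    # ['Date1', 'Date2']
```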
General data models described using UML are typically defined to cover a much larger scope than is required for a single message. UML models have no in-built hierarchical structure, representing networks rather than hierarchies. Therefore UML models need to be subsetted into hierarchically structured sets that identify only the objects that are required for a particular message prior to the creation of an XML definition of the message structure. The hierarchy developed should seek to ensure that each message component only occurs once at each level in the hierarchy, unless the component name is suitably qualified or numbered.
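One possible way to derive a hierarchy from a network of associations is a breadth-first walk from a chosen root that keeps only the objects required for the message. The Python sketch below is an assumption about how such subsetting might be automated, not the procedure used by the project; the class names are hypothetical, and the sketch places each object only once in the whole tree, which is stricter than the rule stated above.

```python
from collections import deque


def subset_to_hierarchy(associations, root, required):
    """Turn a network of associations into a tree of required objects.

    associations: dict mapping a class name to the classes it is
                  associated with (may contain cycles).
    root:         class chosen as the top of the message hierarchy.
    required:     set of class names needed for this message.
    Returns a dict mapping each retained class to its child classes.
    """
    tree = {root: []}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in associations.get(parent, []):
            # Skip objects outside the message scope or already placed.
            if child in required and child not in tree:
                tree[child] = []
                tree[parent].append(child)
                queue.append(child)
    return tree


# Hypothetical model fragment with cyclic associations.
model = {
    "Patient": ["Encounter", "Provider"],
    "Encounter": ["Patient", "Provider", "Invoice"],
    "Provider": ["Encounter"],
}
hierarchy = subset_to_hierarchy(model, "Patient", {"Patient", "Encounter", "Provider"})
print(hierarchy)
```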
Where an element would only contain one other element then it should be removed from the XML model and be replaced by the contained element in any models that reference it.
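A minimal sketch of this simplification, assuming the message model is held as a Python dictionary that maps each element name to the list of element names it contains (an empty list marks a data-bearing leaf). The representation, the function name `collapse_single_child` and the sample element names are all hypothetical.

```python
def collapse_single_child(models):
    """Remove elements whose content model is exactly one other element.

    Each removed wrapper is replaced by its contained element in every
    model that referenced it. Elements that nothing references (such as
    the message root) are kept even if they have a single child.
    """
    models = {name: list(children) for name, children in models.items()}
    changed = True
    while changed:
        changed = False
        for name, children in list(models.items()):
            referenced = any(name in kids for kids in models.values())
            if len(children) == 1 and referenced:
                only_child = children[0]
                for other, kids in models.items():
                    models[other] = [only_child if k == name else k for k in kids]
                del models[name]
                changed = True
                break  # restart the scan on the updated model
    return models


sample = {
    "Booking": ["Header", "PartyDetails"],
    "Header": ["DocumentDate"],
    "PartyDetails": ["Party"],
    "Party": ["PartyID", "Name"],
    "DocumentDate": [],
    "PartyID": [],
    "Name": [],
}
simplified = collapse_single_child(sample)
print(simplified["Booking"])  # ['DocumentDate', 'Party']
```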
EDIFACT message components are assigned unique alphanumeric identifiers. For each identifier there is, in the EDIFACT message directory that contains the formal definitions of each component, an English name that helps to identify the role of the component to English speakers, but this name may not be unique. The names assigned to EDIFACT message components often need to be qualified to identify the context in which the code is being used. This is typically done by associating a qualifier attribute with the message component.
XML messages are inherently hierarchical. An XML path definition used to identify an element within a hierarchy can include any or all of the parents of the element and/or its contents (e.g. FirmBooking//Party[PartyID="152433"]).
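The path-based selection described above can be tried with the limited XPath support in Python's standard `xml.etree.ElementTree` module. In this sketch the sample message is invented for illustration: only `FirmBooking`, `Party` and `PartyID` come from the example path, while `Consignment`, `Name` and the second party are assumptions. Because the search starts at the `FirmBooking` root, the path is written relative to it as `.//Party[...]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical message instance used only to exercise the path.
SAMPLE = """
<FirmBooking>
  <Consignment>
    <Party>
      <PartyID>152433</PartyID>
      <Name>Example Shipper</Name>
    </Party>
    <Party>
      <PartyID>998877</PartyID>
      <Name>Example Carrier</Name>
    </Party>
  </Consignment>
</FirmBooking>
"""


def find_parties(root, party_id):
    """Return every Party descendant whose PartyID child has the given text."""
    return root.findall(f".//Party[PartyID='{party_id}']")


root = ET.fromstring(SAMPLE)
matches = find_parties(root, "152433")
names = [party.findtext("Name") for party in matches]
print(names)  # ['Example Shipper']
```

The element is located by its content and ancestry alone, so the element names themselves do not need to encode the context in which they are used.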
To allow applications to take advantage of the powerful structured component identification techniques provided by XML the names of XML elements and attributes used to record EDI message components should be simplified by applying the following rules:
During the course of the ISIS European XML/EDI Pilot Project W3C have started a new working group to define how to describe the permitted structure of XML data streams using XML-defined schemas. Associated with this schema definition language is a mechanism for defining the data type constraints that should apply to XML element contents and attribute values.
At the time of writing these guidelines (September 1999) the XML Schema proposals were still in draft form, and tools for the validation of XML data against schemas were not generally available. Whilst many of the constraints that apply to XML/EDI messages could be defined using the datatyping parts of the XML Schema proposal, it is by no means clear that all required constraints could be handled within the limits of the proposal. Concerns have, for instance, been expressed about certain aspects of the management of large code lists, and the subsetting of code lists, when XML Schemas are used for data validation. In addition, no facilities are currently provided for identifying relationships within multilingual code lists whose presentation order will be dependent on the user's preferred language.
Over the longer term it is clear that the advantages (or disadvantages) of using XML Schemas to describe the structure of, and constraints that apply to, XML/EDI messages need to be fully evaluated. It is anticipated, however, that the rules stated in this paper, and the associated papers defining rules for mapping from EDIFACT MIGs and UML models to XML message structures, will apply equally well when XML Schemas are being used for message component validation.