[This local archive copy is from the official and canonical URL, http://www.personal.u-net.com/~sgml/xml-need.htm; please refer to the canonical source document if possible.]

The need for a European XML/EDI Pilot Project

Martin Bryan, Manager of The CEN/ISSS European XML/EDI Pilot Project

Background

To read some of the hype about XML currently to be found in the press you would think that XML will be the answer to the business man's prayer. Suffice it to say it is far from that. In practice, for most businesses, XML is too new to be considered as a sensible choice for business-to-business information interchange. If it is ever to serve a role in this capacity the XML family of standards will have to offer more functionality than it currently does. This paper discusses some of the functions that need to be provided before XML can become the standard mechanism for interchanging data between business processes.

One of the problems with trying to work out the role of XML in business-to-business communications is that the whole concept of electronic data interchange (EDI) is in a state of flux. What started as a set of informal agreements between companies for the exchange of data between computer programs became a set of formal agreements between industry bodies before leading to the development of international agreements that have attempted to develop messages that apply to many different industry segments on a global basis. More recently message design based on formal data modelling techniques has come to the fore, though many of the long-standing messages in everyday use tend to be so overloaded as to make object-oriented modelling of their contents impossible. In addition methods for human-to-machine information interchange using what is referred to as interactive EDI and forms-based EDI have been proposed.

EDI started as a mechanism for interchanging data in the most compact possible form, to reduce the costs involved in using the 300 baud dedicated lines that were then the only means of connecting computers. Because of this many of the long established messages make heavy use of shorthand codes to represent information.

Over time more and more EDI messages have become used for multiple purposes, with the result that fields have become overloaded with confusing sets of semantics. In many cases only a subset of the permitted coding values for a field can be used in a particular segment of a given message. Not only is context important, but in many cases this context is dependent on what is contained in other parts of the message. For example, there is often a requirement that if a particular data element is used once in one part of the message, a related data element must appear in another part of the same message.
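A cross-field dependency of this kind can be sketched in a few lines of Python. The rule used here is hypothetical and chosen purely for illustration (if any OtherParty has the role DeliverTo, every Item must carry a DeliveryDate); the element names are borrowed from the simplified order message shown later in this paper, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET


def check_delivery_rule(order):
    """Return a list of error strings; an empty list means the rule holds.

    Hypothetical rule, for illustration only: if any OtherParty has
    Role="DeliverTo", every Item must contain a DeliveryDate.
    """
    has_deliver_to = any(
        party.get("Role") == "DeliverTo" for party in order.findall("OtherParty")
    )
    if not has_deliver_to:
        return []
    return [
        f"Item {item.findtext('ItemID')} lacks a DeliveryDate"
        for item in order.findall("Item")
        if item.find("DeliveryDate") is None
    ]


SAMPLE = """<Order>
  <OtherParty Role="DeliverTo">7012345678900</OtherParty>
  <Item><ItemID>8012345678900</ItemID><Quantity>90</Quantity></Item>
</Order>"""

errors = check_delivery_rule(ET.fromstring(SAMPLE))
```

A content model in a DTD cannot state a rule like this, because the requirement on Item depends on an attribute value found elsewhere in the message.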

Another major problem is provided by the widespread, if somewhat restricted, use of data typing to manage the contents of EDI data elements. In many EDI messages there are constraints on the type and length of data that can occur in a particular field. Typically the length restrictions can be traced back to the need to keep EDI messages to the smallest practical size, or to restrictions on the space that could be allocated to the field for display of its contents on an 80 character wide non-scrolling screen. The data typing restrictions on most EDI messages are primitive in that, if the entries do not need to conform to a given code list, they can be any numeric or alphanumeric value up to the maximum permitted number of characters. You do not, for example, normally find restrictions in EDI messages that say that the value must be expressed in units divisible by 25 or 50, or that the first character of a string must be the letter M.

Trying to model existing EDI messages using an XML document type definition (DTD), or even the Document Content Description (DCD) extensions proposed by Microsoft et al, is impossible. There are simply too few functions in these languages to fully specify all the constraints that need to be imposed on all the variants of many existing EDI messages. In practice, however, the whole of any of the complex messages rarely needs to be mapped to a single DTD or DCD description. Firms wishing to use EDI typically develop Message Implementation Guidelines (MIGs) that identify the subsets of the full message formats and associated code lists that they will develop applications to support. It is these MIGs that XML/EDI applications need to be mapped to.

Foreground

Why is there a need for a European XML/EDI pilot project? Before any tools can be developed to allow XML to be used for business-to-business electronic data interchange it is necessary to study the full range of the problems likely to be encountered. This is not easy to do. There are hundreds of EDI messages in current use, and in some cases there are thousands of MIGs being used to manage the use of a single EDI message between companies or industry segments.

It is, therefore, paramount that any study tries to cover as wide a range of types of message as possible, while restricting itself to a realistic number of messages to formally create DTDs for. There are a number of criteria that can be used for this purpose.

One set of criteria relates to the question of how to encourage the take-up of electronic data interchange by small and medium sized enterprises (SMEs), an area in which take-up of EDI is noticeably weak within Europe. Another is how to encourage the use of electronic data interchange to increase the amount of international trade both within the European single market and with countries that are not yet part of the formal market. A third criterion is how to encourage the use of electronic data exchange between SMEs and official bodies, such as customs and taxation authorities. Yet another is to look at how official bodies can use EDI to speed up their work. Finally there is a need to develop new markets for electronic data interchange.

In the European XML/EDI Pilot Project set up as part of the European Committee for Standardization's Information Society Standardization System Electronic Commerce Workshop (CEN/ISSS ECW) we have tried to combine these approaches into the smallest possible subset. We have selected three strands for our development strategy:

  1. The use of EDI to speed up the paperwork associated with international transhipment undertaken by small transport companies
  2. The use of EDI to capture data from local administration bodies for input to a central statistics office
  3. The use of EDI to allow the movement of clinical data between hospitals throughout Europe.

Why have we chosen these three strands?

The problems associated with international freight movement are central to the development of a European single market. Throughout Europe a large number of SMEs are involved in international transhipment. If these firms are to be able to take advantage of EDI they will need low cost data transfer systems that can be linked to their normal accounting and data management processes. In many cases these will take the form of a single PC, possibly supplemented by portable devices that travel with a vehicle. Not only must the EDI messages ensure that the right vehicles are in the right place at the right time, they must also ensure that the vehicle concerned has the correct paperwork for use at each border that is crossed during transhipment. Therefore there is a high value-added content in being able to automate the process of requesting and undertaking transhipments that makes the selection of the messages currently being used for this purpose a natural one for studying in a European context.

The problems of collecting data about transhipments are only one part of the growing demand on governments to prepare data that can be used to monitor the development of Europe. In addition, national statistics agencies are required to provide an increasing amount of information to the European Statistics Office (Eurostat). To be able to do this they need to send large numbers of forms to local administrators, and collect the results so that national statistics can be generated. The forms used for this purpose are often complex, and require large amounts of supplementary material to guide users how to complete each field. It is rare that one person can complete a single questionnaire. Data needs to be gathered from a number of sources, each the responsibility of a different department within the local administration. Simple HTML forms are not sufficient for the task. Where the data is already available in local databases, standard EDI message formats can be utilized to move data from database to database, but in many cases there are holes in the available information set that can only be filled by human input. There is a need, therefore, to combine the interrogation of local data sets with the ability to ask for human input where the local data set is incomplete. This will be one of the major areas of study for the European XML/EDI pilot project.

The healthcare industry is one in which EDI has not traditionally been employed for data interchange within Europe. Yet throughout Europe there are now major initiatives to establish electronic data interchange facilities between hospitals, and between the general practitioners responsible for day-to-day care in the community and the supporting doctors, consultants and laboratories that are provided in centralized locations. With the increasing freedom of movement within the European community it has become vital that medical records can be moved from country to country, which introduces problems relating to the use of different languages for recording notes, and the even worse difficulties of trying to decipher a doctor's handwriting when it is not in your native language! CEN has set up a special technical committee to study the problems of information interchange within healthcare (CEN TC 251). By working closely with this committee, the CEN/ISSS XML/EDI project group will be able to study the advantages of using formal modelling techniques to develop new forms of message that can be encoded using multiple techniques, including XML.

The Middle Ground

A combination of XML namespaces and SGML architectural forms provides a useful middle ground for developing tools that can manage business-to-business XML messages. We cannot put all the intelligence needed into the XML messages, and cannot rely on the tools having all the knowledge they need to decode particular types of messages. The basic concept that we wish to adopt, therefore, is that of invoking the help of 'intelligent agents' to take over the validation of the data in the message at the point where the XML parser has done as much as it is able to.

How would this work? There are a number of possibilities that the pilot project will need to evaluate. For example, by creating separate namespaces for processing specific types of commonly used data elements it will be possible to develop resources that can be shared across messages. By using XML's extended linking mechanism it might be possible to create compound records, such as those required to create questionnaires that contain separate parts, or patient records.

SGML's recently added facilities for defining meta-DTDs of 'architectural forms' suggest another way in which we can associate sharable functions with specific messages. When combined with XML's ability to allow more than one attribute list declaration to be associated with a given element, the possibility of adding local sets of processing architectures to messages provides interesting food for thought. What do I mean by this? Perhaps the best way to explain is by using a simplified example.

Let us consider the following simplified EDI message:

<?xml version="1.0"?>
<!DOCTYPE Order SYSTEM "http://www.sgml.u-net.com/xml-order.dtd">
<Order>
<MessageID>128576</MessageID>
<MessageDate>19970812</MessageDate>
<Buyer>5012345678900</Buyer>
<Supplier>6012345678900</Supplier>
<OtherParty Role="Carrier">7012345678900</OtherParty>
<Item>
<ItemID>8012345678900</ItemID>
<Quantity>90</Quantity>
<DeliveryDate>19981012</DeliveryDate>
</Item>
</Order>

The DTD used to validate this message could take the following form:

<!ENTITY % local-processing-attributes SYSTEM "/agents/orders.ent">
%local-processing-attributes;
<!ELEMENT Order        (MessageID, MessageDate, Buyer, Supplier,
                        OtherParty+, Item+) >
<!ATTLIST Order
  xmlns                CDATA #FIXED "http://www.sgml.u-net.com/order"
  xmlns:UN-EDIFACT     CDATA #FIXED "http://www.un.org/edifact/D96A"
  UN-EDIFACT:Prefix    CDATA #FIXED "UNH"
  UN-EDIFACT:MessageID CDATA #FIXED "ORDERS:D:96A:UN:SIMP01"
  SequenceNo           CDATA #IMPLIED >
<!ELEMENT MessageID    (#PCDATA) >
<!ATTLIST MessageID
  UN-EDIFACT:Prefix    CDATA #FIXED "BGM"
  MessageType          (Order|Deliver|Despatch|Movement|
                        Produce|Process|Treatment) "Order"
  EDI.Constraints      CDATA #FIXED "an..35" >
<!ELEMENT MessageDate  (#PCDATA) >
<!ATTLIST MessageDate
  UN-EDIFACT:Prefix    CDATA #FIXED  "DTM"
  DateType             CDATA #FIXED  "MessageDate"
  DateFormat           (Date|Period) "Date" >
<!ELEMENT Buyer        (#PCDATA) >
<!ATTLIST Buyer
  UN-EDIFACT:Prefix    CDATA #FIXED "NAD"
  Role                 (BY)  #FIXED "BY"
  Agency               CDATA #FIXED "EAN"
  EDI.Constraints      CDATA #FIXED "n..13" >
<!ELEMENT Supplier     (#PCDATA) >
<!ATTLIST Supplier
  UN-EDIFACT:Prefix    CDATA #FIXED "NAD"
  Role                 (SU)  #FIXED "SU"
  Agency               CDATA #FIXED "EAN"
  EDI.Constraints      CDATA #FIXED "n..13" >
<!ELEMENT OtherParty   (#PCDATA) >
<!ATTLIST OtherParty
  UN-EDIFACT:Prefix    CDATA #FIXED "NAD"
  Role                 (Carrier|ShipFrom|DeliverTo|Invoicee) #REQUIRED
  Agency               CDATA #FIXED "EAN"
  EDI.Constraints      CDATA #FIXED "n..13" >
<!ELEMENT Item         (ItemID, Quantity, DeliveryDate?) >
<!ATTLIST Item
  UN-EDIFACT:Prefix    CDATA #FIXED "LIN"
  EDI.MaxOccurs        CDATA #FIXED "200000" >
<!ELEMENT ItemID       (#PCDATA) >
<!ATTLIST ItemID
  Agency               (EAN|UPC|SuppliersArticleNo) "EAN"
  EDI.Constraints      CDATA #FIXED "n..13" >
<!ELEMENT Quantity     (#PCDATA) >
<!ATTLIST Quantity
  UN-EDIFACT:Prefix    CDATA #FIXED "QTY"
  Units                CDATA #IMPLIED
  EDI.Constraints      CDATA #FIXED "an..15" >
<!ELEMENT DeliveryDate (#PCDATA) >
<!ATTLIST DeliveryDate
  UN-EDIFACT:Prefix    CDATA #FIXED "DTM"
  DateType             CDATA #FIXED "DeliveryDate"
  DateFormat           (Date|DateTime|Period) "Date">

The first thing to notice is that the DTD starts by making a call to a local system file, called orders.ent, which is stored in an agents directory. This file allows the local system to add additional processing control properties to those provided in the DTD in the form of additional attribute list declarations for any of the elements in the order. Before considering what these attributes would do, however, let us start by looking at the processing controls that the DTD already contains.

In the declaration for the root document element, Order, two XML namespaces have been declared using an attribute whose name begins with xmlns. The first of these namespaces, which has no qualifying name, indicates that any element or attribute whose name is not qualified by a namespace is to be processed using the rules specified for processing orders by the company identified in the URL. (These rules are the ones that would normally be found in the Message Implementation Guidelines used to qualify the local use of an EDI message.) The second namespace declaration, which is qualified by the name UN-EDIFACT, indicates that attributes qualified by this namespace identifier are defined using rules in a particular message directory set up by the UN as part of its Electronic Data Interchange For Administration, Commerce and Transport series.

The next two attributes associated with the Order element tell the receiving system that this element is equivalent to an EDIFACT header (identified by the segment name UNH) and that the type of message being transmitted has been assigned a unique identifying sequence within those rules. The values assigned by the UN, or its agencies, to these attributes can be used to control the processing of the order by EDIFACT-aware XML tools.

The final attribute for the Order element tells the system that there needs to be a locally defined mechanism for implying the sequence number to be assigned to the order if one has not been specifically defined as part of the order. In this example this requirement is defined using the standard SGML #IMPLIED keyword, for which local processing is always required if you wish to derive a value. Nothing is said about how this should work in the SGML or XML standards, but in this example details of the rules to be applied have been associated with the attribute already, because the default namespace declaration indicates where you can find the rules for processing attributes of this type.

For each of the other elements that make up an order a similar set of attributes has been defined. Each element is associated with a particular type of EDIFACT segment. In some cases more than one element is associated with the same type of EDIFACT segment. In these cases one or more of the other attributes will be used to indicate which variant of the EDIFACT syntax is required. Sometimes this variant is fixed by use of an attribute value with the keyword #FIXED immediately in front of the default value. In others the user is allowed to override the default value defined in the DTD with one of a defined set of attribute values. In all cases the attributes used to control this are defined as processing rules defined by the developer of the DTD. This is done for two reasons. Firstly the names used for the options have been defined by the developer of the DTD to suit local processing requirements: such values need to be converted to a shorthand format before they can be used in an EDIFACT message. Secondly, in most cases only a subset of the values permitted by EDIFACT are likely to be valid. Therefore the rules in the MIG indicated by the default namespace statement should apply, rather than the full set of rules defined in the UN EDIFACT directory.
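The conversion of locally defined option names to EDIFACT shorthand might look something like the following Python sketch. The mapping table, the function name nad_segment and the agency code "9" (taken here to stand for EAN) are assumptions made for illustration; the actual codes would need to be checked against the directory and MIG in use.

```python
# Assumed mapping from the DTD's readable Role values to EDIFACT party
# qualifier codes. Verify against the directory in use before relying on it.
ROLE_TO_QUALIFIER = {
    "BY": "BY",          # Buyer (fixed in the DTD)
    "SU": "SU",          # Supplier (fixed in the DTD)
    "Carrier": "CA",
    "ShipFrom": "SF",
    "DeliverTo": "DP",
    "Invoicee": "IV",
}


def nad_segment(role, party_id, agency_code="9"):
    """Build a simplified NAD segment string for one party.

    agency_code "9" is assumed to identify EAN as the responsible agency.
    A KeyError is raised for a role that the local rules do not permit.
    """
    qualifier = ROLE_TO_QUALIFIER[role]
    return f"NAD+{qualifier}+{party_id}::{agency_code}'"


carrier = nad_segment("Carrier", "7012345678900")
```

Keeping the table local to the receiving or sending system reflects the point made above: the readable names belong to the DTD developer, and only the subset listed in the table is accepted.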

Most of the attributes we have discussed so far are ones that are specific to a given element or namespace. But there are two attributes whose names begin not with an XML namespace identifier followed by a colon, but with a prefix, ending with a period, that indicates they have specific relevance to EDI processors. These attributes indicate the association of an SGML architectural form defined specifically for the processing of EDI messages with the element concerned. The EDI.Constraints attribute indicates the data typing rules to be used to validate the contents of the element. For this simple example these constraints have been expressed using a variant of the simplistic data typing rules used for EDIFACT messages. The EDI.MaxOccurs attribute can be used to indicate the maximum number of times a repeatable element may occur in a message, a common constraint imposed on EDI messages which cannot be expressed in an XML DTD.

The advantage of using architectural forms rather than message-specific attributes for defining EDI related checks is that the same architectural form can be applied to many messages: a single intelligent agent can be used to ensure that the constraint is applied to all elements that need to be validated in this way.
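As a rough illustration of such an agent, the Python sketch below checks EDI.Constraints and EDI.MaxOccurs style rules. It is a minimal sketch under stated assumptions: the two lookup tables stand in for the #FIXED attribute values that a DTD-aware parser would report, the names satisfies and validate are hypothetical, the numeric test accepts ASCII digits only (real EDIFACT numerics may also carry a sign and decimal mark), and occurrences are counted across the whole message.

```python
import re
import xml.etree.ElementTree as ET

# Matches the simplified notation used in the DTD: "an..35", "n..13".
# A fixed-length form such as "n13" is also accepted, as an assumption.
_CONSTRAINT = re.compile(r"^(an|a|n)(\.\.)?(\d+)$")


def satisfies(value, constraint):
    """Return True if value meets a constraint such as 'n..13'."""
    match = _CONSTRAINT.match(constraint)
    if match is None:
        raise ValueError(f"unrecognised constraint: {constraint!r}")
    kind, variable, length = match.group(1), match.group(2), int(match.group(3))
    length_ok = len(value) <= length if variable else len(value) == length
    if kind == "n":
        type_ok = value.isascii() and value.isdigit()
    elif kind == "a":
        type_ok = value.isascii() and value.isalpha()
    else:  # "an": any characters, only the length is checked
        type_ok = True
    return length_ok and type_ok


# Stand-ins for the attribute values declared in the DTD.
CONSTRAINTS = {
    "MessageID": "an..35", "Buyer": "n..13", "Supplier": "n..13",
    "OtherParty": "n..13", "ItemID": "n..13", "Quantity": "an..15",
}
MAX_OCCURS = {"Item": 200000}


def validate(order, constraints=CONSTRAINTS, max_occurs=MAX_OCCURS):
    """Apply both kinds of check to every element; return error strings."""
    errors, counts = [], {}
    for element in order.iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
        rule = constraints.get(element.tag)
        if rule and not satisfies((element.text or "").strip(), rule):
            errors.append(f"{element.tag}: {element.text!r} violates {rule}")
    for tag, limit in max_occurs.items():
        if counts.get(tag, 0) > limit:
            errors.append(f"{tag}: occurs {counts[tag]} times, limit {limit}")
    return errors
```

Because the checks are driven entirely by the tables, the same two functions could serve any message whose DTD carries these attributes, which is the reuse the architectural form is meant to allow.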

Which brings me back nicely to the local processing attributes referenced at the start of the DTD. In many cases the DTD supplied with the document will contain enough information to allow the receiving processor, if it is aware of how to process the architectural forms and interpret the EDIFACT specific parts of the declarations, to link the message contents to appropriate local processes. But in some cases this may not be possible. For example, note that the DTD does not define any constraints on the date. The fact that these elements are UN-EDIFACT date/time (DTM) segments constrains them to be ISO 8601 conformant dates (e.g. of the form CCYYMMDDHHMM), but it is not a good idea for the sender of the data to tell the receiver how he should process dates. For example, a UK buyer should not constrain date processing of a French supplier, who will use a different format for defining dates. The processing rules for dates must be declared locally. This can be done by adding declarations of the following form to the local orders.ent file:

<!NOTATION LocalDate SYSTEM "/agents/Dates.DLL" >
<!ATTLIST MessageDate
  DateProcessor NOTATION (LocalDate) "LocalDate">
<!ATTLIST DeliveryDate
  DateProcessor NOTATION (LocalDate) "LocalDate">

Note that I have chosen to declare the additional attributes as ones invoking a local notation processor for the processing of dates. Using this process it is possible to provide links from any XML element to local modules that provide validation checks over and above those provided by an XML document parser.
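How a receiving system might dispatch on the notation name can be sketched as follows. This is only an illustration: the registry NOTATION_PROCESSORS and the functions local_date and process are hypothetical, and a Python function takes the place of the /agents/Dates.DLL module named in the declaration above. The accepted formats (CCYYMMDD and CCYYMMDDHHMM) are assumptions based on the dates in the example message.

```python
from datetime import datetime


def local_date(text):
    """Parse CCYYMMDD or CCYYMMDDHHMM, rejecting impossible dates."""
    formats = {8: "%Y%m%d", 12: "%Y%m%d%H%M"}
    fmt = formats.get(len(text))
    if fmt is None:
        raise ValueError(f"unsupported date length: {text!r}")
    # strptime raises ValueError for values such as 30 February.
    return datetime.strptime(text, fmt)


# Hypothetical registry keyed by the notation name declared in orders.ent.
NOTATION_PROCESSORS = {"LocalDate": local_date}


def process(element_text, notation_name):
    """Hand the element content to whichever local module the notation names."""
    return NOTATION_PROCESSORS[notation_name](element_text)


message_date = process("19970812", "LocalDate")
```

A supplier in another country could register a different function under the same notation name without any change to the DTD sent by the buyer.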

The Holes

Having indicated how you can add basic (and even advanced) data typing to XML files, or provide for constraints such as ones relating to the number of occurrences an element is permitted, you might think that XML will meet the needs of business users. Far from it. Whilst the tools it provides for semantic checking are, almost, adequate, these are not in themselves sufficient to provide an adequate basis for developing robust systems for business users. What is missing?

As any business or government department will tell you, you need forms, in triplicate at least, to trade! Forms need to be signed and filed at appropriate places in the transaction process. What facilities does XML have for forms processing, or for signing off forms? None! The Extensible Stylesheet Language (XSL) designed as an adjunct to XML provides, as initially defined, no mechanism for user input of data (entering or correcting the values of fields in forms), for adding a digital signature to a form on completion of the necessary checks, or for encrypting the contents of all or part of a form. As such it is not a suitable basis for international trading or the distribution of questionnaires. It is presumed that such functionality will be added by the time that XSL is fully defined, and that this functionality will include facilities for submitting completed forms to multiple locations in the form of XML messages. Without such functionality XML will be a non-starter as far as international trading by SMEs is concerned.

Another problem comes with the selection of options for inclusion in an interactive form. While it is possible to create lists of permitted values for specific attributes within an XML DTD, the fact that these attribute values must be specified as name tokens, without internal spaces or punctuation, limits their user friendliness. In addition, for many EDI messages the options that must be listed are too numerous to be included in an attribute definition list, or change too often to easily remain part of a DTD. An associated problem is the need to subset the list of permitted values, dependent on selections already made within the form, or dependent on the context the values are used in within the message.

Let me give you an example of the types of problems that can occur, based on the example message shown above. In the example four elements use European Article Numbers (EANs) as a shorthand form to identify information. The Buyer, Supplier and OtherParty elements use EANs to identify the names and addresses of the various parties to the shipment. The ItemID element uses EANs to uniquely identify products being purchased (this is the digital form of their bar code). However, for the latter element, alternative sets of article number may be referred to if the default value for the Agency attribute is changed.

There are millions of EANs. The international EAN register is being added to every day. There is no way you can list all valid EANs in an attribute definition list, or even in a drop-down selection list in a form. In addition, for any particular transaction, only a small subset of EANs are valid. The MIG for using an order form will restrict the set of buyers, suppliers and carriers, etc, that can be used. Selecting a supplier will limit the number of EANs that can be listed as valid item numbers. Unless XML or XSL includes a mechanism for the dynamic creation of lists of permitted values by interrogating some local or remote database with a set of selection criteria that causes a suitable table of options to be returned it will not meet the needs of international businesses.
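The kind of dynamic lookup described here could be sketched as below, using an in-memory SQLite database in place of a real article register. The table name, column names, sample rows and function names are all hypothetical and exist only to make the example self-contained.

```python
import sqlite3


def open_catalogue():
    """Build a tiny in-memory catalogue; schema and rows are invented."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items ("
        "item_ean TEXT PRIMARY KEY, supplier_ean TEXT, description TEXT)"
    )
    db.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [
            ("8012345678900", "6012345678900", "Widget, box of 10"),
            ("8012345678917", "6012345678900", "Widget, box of 50"),
            ("8099999999990", "6099999999990", "Gasket"),
        ],
    )
    return db


def permitted_items(db, supplier_ean):
    """Return (item_ean, description) pairs valid for the chosen supplier."""
    cursor = db.execute(
        "SELECT item_ean, description FROM items "
        "WHERE supplier_ean = ? ORDER BY item_ean",
        (supplier_ean,),
    )
    return cursor.fetchall()


catalogue = open_catalogue()
options = permitted_items(catalogue, "6012345678900")
```

The list offered to the user is rebuilt each time the supplier changes, so neither the DTD nor the form needs to enumerate the article numbers.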

Another problem that EANs illustrate is that of what should be displayed to users. Obviously you do not want to display just the EAN. One number looks much like another. It is too easy to pick the wrong one by mistake. But is the name sufficient? For example, if I just replace the number by the name of the company that the data identifies will that be sufficient? What if the company is IBM or McDonalds? How will you know that you have picked the correct address identifier simply from the company name? To distinguish the correct location you need to be able to recall a number of associated fields, either displaying them in the form of a table, a pop-up window showing a structured entry, or as a set of submenus that users can use to identify the correct variant from multiple entries. Until XML/XSL browsers can provide complex functions for the selection of appropriate form values they will not be widely adopted by business users for message generation.

One final problem that concerns the capture of data in XML forms is that of trying to automate the process as far as possible. For example, if a questionnaire is to be completed by a local government department that maintains databases of local statistical information then it would be nice if the application could interrogate the local database and retrieve the relevant value automatically. If the EDI-aware browser included a standard SQL query generation tool that could be triggered by adding an attribute conforming to an internationally agreed SGML architectural form that could be added to the DTD as one of the local processing attributes, it would be relatively easy to provide locally relevant methods for completing electronic questionnaires.
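A minimal sketch of this idea follows. It assumes a hypothetical attribute named EDI.Query whose value is an SQL statement supplied by the local processing attributes (and therefore under local control, not sent by the originator of the questionnaire); no such architectural form has been agreed. The table, the figures and the function name prefill are likewise invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def prefill(form, db):
    """Fill empty fields from the database.

    Returns the names of fields that still need human input because the
    local data set holds no value for them.
    """
    missing = []
    for field in form.iter():
        query = field.get("EDI.Query")  # hypothetical attribute name
        if query is None or (field.text or "").strip():
            continue  # no query attached, or already completed
        row = db.execute(query).fetchone()
        if row is None or row[0] is None:
            missing.append(field.tag)
        else:
            field.text = str(row[0])
    return missing


FORM = """<Questionnaire>
  <Population EDI.Query="SELECT value FROM stats WHERE name = 'population'"/>
  <Households EDI.Query="SELECT value FROM stats WHERE name = 'households'"/>
</Questionnaire>"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stats (name TEXT, value INTEGER)")
db.execute("INSERT INTO stats VALUES ('population', 41250)")

form = ET.fromstring(FORM)
needs_input = prefill(form, db)
```

The returned list is what a forms interface would then present to a person, combining database interrogation with human input where the local data set is incomplete.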

Ok, so these data capture problems can be solved with a few additional functions in XSL, but is that all that is needed? By now you will have guessed my answer. No! Data generation is only one part of the business cycle. The other part concerns the processing of received data. We need to develop mechanisms to pass the contents of one or more elements within a business message or questionnaire to local databases or to programs that perform the next operation in a controlled sequence of processes. To do this there needs to be a standardized mechanism within the data receiver to indicate which local workflow processes or databases the data is to be sent to. XSL has mechanisms for subsetting messages, but it does not have any mechanism for identifying the next process in a chain of processes, or for creating SQL or ODBC commands that can be used to update databases. Until we integrate XSL with things like the Simple Workflow Access Protocol (SWAP) and standardized database input/output methodologies it will be impossible to create truly portable XML-based business-to-business applications.
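To make the receiving side concrete, the sketch below passes the contents of selected elements from a received order to a local database. It is an illustration only: the order_lines table, its columns and the function name store_order are hypothetical, SQLite stands in for whatever local database is in use, and the message is a cut-down version of the order shown earlier.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_order(order, db):
    """Insert one row per Item into a hypothetical order_lines table."""
    message_id = order.findtext("MessageID")
    rows = [
        (
            message_id,
            item.findtext("ItemID"),
            item.findtext("Quantity"),
            item.findtext("DeliveryDate"),  # None when the optional date is absent
        )
        for item in order.findall("Item")
    ]
    # Parameterised statements keep message content out of the SQL text.
    db.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return len(rows)


MESSAGE = """<Order>
<MessageID>128576</MessageID>
<Item>
<ItemID>8012345678900</ItemID>
<Quantity>90</Quantity>
<DeliveryDate>19981012</DeliveryDate>
</Item>
</Order>"""

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE order_lines ("
    "message_id TEXT, item_id TEXT, quantity TEXT, delivery_date TEXT)"
)
stored = store_order(ET.fromstring(MESSAGE), db)
```

Here the choice of table is hard-wired into the program, which is exactly the portability problem described above: nothing in the message or its stylesheet tells the receiver where the data should go next.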

And in conclusion

What can we conclude from the above? Well, if we want to use XML as a tool for business information exchange we must develop tools that do more than simply parse XML data streams. Such tools must be able to:

If such facilities are to be relevant to SMEs they must be standardized so that they can be applied by a wide range of tools. The aim of the European XML/EDI pilot project is to demonstrate what needs to be done and to suggest how such functions can be standardized in such a way that they can be widely deployed throughout Europe.