[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/edidocsgml.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
Philippe Vijghen
This paper describes the use of SGML in the EDIDOC project for the European Space Agency. The project involved the creation of a flexible framework for exchanging different types of documents; the framework acts as a gateway for workflow, document conversion, security, and communication. It is used for calls for tenders, working documents, and press releases, and also covers WWW publication.
SGML was used for many aspects including attaching the different envelopes of the messages exchanged and as a technology for defining workflow scenarios. Benefits and challenges of using SGML or XML at different levels are highlighted, both technically and organizationally.
Philippe Vijghen is a project manager at ACSE sa/nv, Brussels, a Member of the SGML Technologies Group. He is a software engineer and systems architect specializing in object-oriented distributed applications and complex document-oriented Electronic Data Interchange systems; in addition to structuring documents, these systems make use of SGML at other levels, such as for external application programming interfaces. He obtained a master's degree in Electromechanical Engineering at the Free University of Brussels (ULB), and may be contacted at phv@acse.be.
More and more companies are now moving towards EDI with their partners, in the broad meaning of the term, including collaboration on document processing. However, it is still challenging to reach an agreement on technical choices at different levels (for example document formats, electronic security mechanisms, and protocols). This can be especially so when dealing with complex workflow on a WAN, across borders.
This paper results from the experience gained during the development of a flexible framework for document-oriented EDI. This framework, called EDIDOC, has been developed for the European Space Agency. The project started in 1994 and entailed about 10,000 hours of work, leading to the implementation of a system responsible for document exchange occurring in several distinct applications of the agency, including:
More than 2,000 people are involved in the document exchanges.
At the heart of the EDIDOC system a central server acts as a clearing house, giving a potential legal value to the documents exchanged by logging them and by keeping a structured trace of the exchange in a robust relational database.
This server integrates in a very generic and flexible way the key concepts needed in electronic document exchanges:
At each of those levels the server makes sure that the documents are delivered in accordance with preferences of the recipients: in the right format, with the right security package, and the right communication protocols. It really plays the role of a gateway.
For example, a telecommunications company could use EDIDOC to send invoices to its customers. According to its profile, each customer may get the invoice as:
The server also includes provision for integrating specific application logic in document exchanges. This is typically used for implementing workflow scenarios such as support for document review cycles or a chain of approvals for payment. At ESA a scenario has been developed so that when a new press release is sent to registered subscribers (in ASCII, HTML, or as a URL), EDIDOC also publishes it on the WWW and adds hyperlinks on the organization's home page.
Many utilities have been developed around the server, for example system management tools and graphical clients.
Several different groups within an organization need to exchange formal documents with those outside the organization. The benefits of having them all go through a single point, in a 3-tier architecture, are obvious:
The following picture illustrates the functional architecture of EDIDOC.
As can be appreciated, such a project involved many aspects that, although interesting, lie outside the scope of this paper. Where SGML is concerned, it proved to be the technical cornerstone of the implementation.
SGML impacted the project at various levels:
The benefits and challenges of using SGML at those different levels will be highlighted, both from a technical point of view and from an organizational point of view. When appropriate, it will be explained how far SGML is competing with other concepts, in light of personal experience.
Although the EDIDOC application was used as a basis for this paper, a few of the statements and ideas put forward result from experience gained on other projects.
As the documents sent through EDIDOC are checked and then converted to various formats according to the profile of the recipient, the benefits of having original documents in SGML format seem obvious.
Although format conversions of unstructured documents are possible, the set of target formats is often limited and, most of all, the result of the conversion is often not satisfactory. Also, such conversions should preferably occur on the desktop instead of on a server.
Anyone familiar with SGML and with earth gravity will easily understand why translations to SGML are called up-translation and translations from SGML down-translation. It is indeed difficult to automate the enrichment of document structures, and a fortiori to exploit unstructured documents. Converting correctly structured documents to other target formats is a lot easier.
Even in the case where up-translation to SGML sounds feasible on the server, it has the disadvantage of detecting mistakes very late. Using structured editors for producing SGML right at the source is better.
However, although SGML has technical advantages, the reality is that for many organizations (although not specifically for ESA) it is very difficult to impose SGML authoring tools on authors, for various reasons. This is especially so when the authors are employed by other organizations. Moreover, even if SGML is accepted, it is challenging to agree on a DTD when many partners are involved.
EDIDOC does not enforce the use of SGML for documents. At ESA, a radically different approach was even selected for sending calls for tenders to potential contractors. Complete flexibility is offered to authors, provided that they generate PDF files as output. Of course, the problem of having PDF as a source for distribution is that it then limits the flexibility for further conversions (only ASCII versions of documents are produced from PDF in our case). But this drawback is negligible in comparison with the problem that the set of authors would face if SGML were imposed.
However, authoring using SGML is still recommended in the following cases:
If few of those conditions are fulfilled, SGML should not be considered; it would be too much of a challenge for little benefit.
As EDIDOC messages can be transported by many different vehicles (for example email, X.400, and WWW), we had to define an envelope for the documents in order to specify the details of the exchange: originator, list of recipients, unique reference, subject, time stamps, document types and formats, security mechanisms, delivery options, groupware context, remote management options, error messages, and so on.
Here is a sample header for an EDIDOC message:
<esh version="2">
  <secpack mode="clear">none</secpack>
</esh>
<eh version="2">
  <user level="process" notif="always" clearmsg="safemsg"
        clearnot="safenot" nondeliv="negack" destin="most">
    <sndreq>
      <from>phv@acse.be</from>
      <ref>phv@acse.be - 10/15/97 10:46:11</ref>
      <subject>PDF version of EDIPress manual</subject>
      <docinfo>
        <type>FLAT</type>
        <format>PDF</format>
      </docinfo>
      <to>sansari</to>
    </sndreq>
  </user>
</eh>
%PDF-1.2 document is included here...
These envelopes are generated by the producers of information to send requests to EDIDOC and are parsed by the recipients who want to exploit the EDIDOC context. This means that people should be able to generate and process them on many different platforms, using different tools. In fact we developed the following APIs and tools for parsing or producing the envelopes:
For those integrating their applications with EDIDOC, it was easier to generate and process the envelopes directly by looking at the simple SGML specification. When security algorithms are required, the API-level interfaces are preferred. Of course, the end-users just use tools, much as an Internet surfer would use a browser for viewing HTML.
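As an illustration of processing an envelope directly, here is a minimal Python sketch. It assumes that the header is well-formed XML and that the document payload starts right after the `</eh>` end tag, as in the sample above. The function names `split_envelope` and `describe` and the synthetic `envelope` wrapper element are hypothetical and are not part of the EDIDOC APIs.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample header, followed by a dummy payload.
SAMPLE = (
    b'<esh version="2"><secpack mode="clear">none</secpack></esh>'
    b'<eh version="2"><user level="process" notif="always">'
    b'<sndreq><from>phv@acse.be</from>'
    b'<subject>PDF version of EDIPress manual</subject>'
    b'<docinfo><type>FLAT</type><format>PDF</format></docinfo>'
    b'<to>sansari</to></sndreq></user></eh>'
    b'%PDF-1.2 ...'
)


def split_envelope(message: bytes):
    """Return (security header, exchange header, payload bytes)."""
    end_tag = b"</eh>"
    cut = message.find(end_tag)
    if cut < 0:
        raise ValueError("no </eh> end tag found")
    cut += len(end_tag)
    # The two headers are sibling elements, so wrap them in a synthetic
    # root (an assumption of this sketch) to get one well-formed document.
    root = ET.fromstring(b"<envelope>" + message[:cut] + b"</envelope>")
    return root.find("esh"), root.find("eh"), message[cut:].lstrip()


def describe(message: bytes) -> dict:
    """Collect the fields a recipient would typically look at first."""
    esh, eh, payload = split_envelope(message)
    request = eh.find("user/sndreq")
    return {
        "security": esh.findtext("secpack"),
        "from": request.findtext("from"),
        "to": [to.text for to in request.findall("to")],
        "format": request.findtext("docinfo/format"),
        "payload_size": len(payload),
    }
```

Calling `describe(SAMPLE)` returns the originator, the recipient list, and the document format without touching the payload itself. A real client would also have to honour the security package announced in `secpack` before reading the payload.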
Using valid XML files (called light normalized SGML at that time) for structuring data exchanged between loosely coupled applications has the following selling points:
In the classic EDI context, EDIFACT has the big advantage of having well specified and standardized formats for many types of EDI messages, such as invoices and orders. The semantics of all the fields of those messages are then clearly specified.
However, we believe that SGML, and more specifically its little child XML, is more appropriate when no existing EDIFACT message fulfils the needs. Indeed, if a new message specification has to be defined, XML has the following advantages over EDIFACT:
One of the most famous examples is that Microsoft selected a light SGML specification, à la XML, for its Open Financial Connectivity, used in the electronic dialogues between Microsoft Money and banking institutions.
In its generic way for configuring convertors, EDIDOC defined, for each type of document, a pivot format. Conversions (up or down) and conformance-checking scripts are defined relative to this format. Although EDIDOC does not impose its use, SGML is the obvious choice for being a pivot format. In fact SGML not only offers a way to specify and structure documents but it even offers a few mechanisms for processing the information.
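The pivot idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `PivotRegistry`, its attributes, and the toy converters are hypothetical and say nothing about how EDIDOC configures its convertors.

```python
from typing import Callable, Dict

Converter = Callable[[str], str]


class PivotRegistry:
    """Converters for one document type, all defined relative to a pivot."""

    def __init__(self, pivot: str) -> None:
        self.pivot = pivot
        self.up: Dict[str, Converter] = {}    # source format -> pivot
        self.down: Dict[str, Converter] = {}  # pivot -> target format

    def convert(self, text: str, source: str, target: str) -> str:
        if source == target:
            return text
        if source != self.pivot:
            text = self.up[source](text)      # up-translation
        if target != self.pivot:
            text = self.down[target](text)    # down-translation
        return text


# Toy converters, for illustration only.
registry = PivotRegistry("SGML")
registry.up["ASCII"] = lambda t: "<para>" + t + "</para>"
registry.down["HTML"] = lambda t: t.replace("para>", "p>")
```

With this layout, each format needs at most one up-converter and one down-converter, instead of one converter for every pair of formats.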
The following features are particularly useful for processing non-SGML documents with full-blown SGML-aware toolkits.
The EDIDOC server offers the possibility to plug-in customized application logics, such as a workflow scenario for supporting document review cycles. The server keeps track of the processing evolution, dispatches the various relevant events and, most importantly, offers transparent access to the various document formats, security mechanisms, and communication schemes.
Technically, so-called groupware scenarios can be plugged into EDIDOC by implementing external scripts called "event handlers".
We first thought about including a generic workflow engine in EDIDOC. This engine would have processed definitions of finite-state machines or Petri nets. However, we realized that each of those models imposed annoying limitations on the developer (for example "What if a new reviewer is added in our scenario?"). Also, we preferred to have loosely coupled external event handlers, interacting with EDIDOC through a clean interface.
This interface had to impose as few constraints as possible on the tools used for developing programs defining the customized behaviour of the server for a workflow scenario. The interface between EDIDOC and the groupware event handlers has therefore been defined as a simple SGML (XML-compliant) interface.
Those event handlers get various kinds of event from the server:
In response to those events, the external script can request the EDIDOC server to:
All of those requests are passed between the script and the server as XML-formatted data flows that comply with specific DTDs: one for the standard input, one for the standard output. This gives complete flexibility to the developers for the languages and tools to be used for implementing the external scripts; if they can control the standard input and output, they have all that is needed. Of course, using an SGML toolkit gives the advantage of having robust parsing with little effort.
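To make the standard input and output contract concrete, here is a minimal Python sketch of an event handler. The two DTDs are not reproduced in this paper, so every element and attribute name used here (`event`, `type`, `ref`, `reviewer`, `requests`, `forward`) is a hypothetical placeholder, as is the function name `run`.

```python
import io
import sys
import xml.etree.ElementTree as ET


def run(stdin, stdout) -> None:
    """Read one event document and write one request document."""
    event = ET.fromstring(stdin.read())
    requests = ET.Element("requests")              # hypothetical output root
    if event.get("type") == "document-received":   # hypothetical event type
        # Ask the server to forward the document to each listed reviewer.
        for reviewer in event.findall("reviewer"):
            forward = ET.SubElement(requests, "forward")
            forward.set("to", reviewer.text or "")
            forward.set("ref", event.get("ref", ""))
    stdout.write(ET.tostring(requests, encoding="unicode"))


# Demonstration with in-memory streams instead of the real ones.
demo_out = io.StringIO()
run(io.StringIO('<event type="document-received" ref="42">'
                '<reviewer>anna</reviewer></event>'), demo_out)
```

Used as an external script, the same function would simply be called as `run(sys.stdin, sys.stdout)`. The handler needs nothing from the server beyond the two streams, which is what keeps the coupling loose.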
This is the most complicated part of the paper and covers a methodology that our Group is using on several projects. This methodology helps us to define, document, and implement applications based on multiple state diagrams. This is particularly useful for implementing workflow solutions.
We have been developing a meta-DTD for specifying classes of state diagrams. From this meta-DTD, the following documents are automatically generated:
In the meta-DTD, classes are defined as sets of finite state diagrams, including states and events.
In the generated implementation-DTD, each event is mapped to an SGML element. That element can have specific attributes and can include structured data as child elements; this allows complex structured information to be passed with each event.
SGML link rules allow external information to be attached to SGML elements in a standard, context-dependent way. We use this mechanism to attach application logic to SGML events. Thanks to the implementation-DTD, different link sets are used for each state in the diagram; they are automatically activated by standard SGML means when a new state is reached. Specific application code can then be linked to the occurrence of the various events by customizing the link rules of the link set, for example for accessing a database and forwarding events to other processes. The code to execute when an event element occurs is selected by the standard use of SGML LINK.
When a new event element is input, the SGML parser also knows the new state of the state diagram by standard SGML means. This is due to the appropriate definition of the SGML model groups and the OMITTAG feature in the generated implementation-DTD. No specific code has to be written for doing this.
The following example illustrates a definition of state elements similar to the ones generated in the sample DTD. When an event is encountered in the input stream, the SGML parser selects implicitly the new state.
<!ELEMENT stateA O O ((event1,stateB)|(event2,stateA))>
<!ELEMENT stateB O O ((event3,stateC)|(event4,stateA))>
...
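For readers without an SGML toolkit at hand, the behaviour implied by these two declarations can be mimicked in plain Python. This is only an analogue, not the mechanism described in this paper: the transition table stands in for the content models, and the per-state handler table stands in for the link sets. The names `TRANSITIONS` and `run` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Transitions read off the two declarations above. The declaration of
# stateC is elided there, so it has no outgoing transitions here.
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "stateA": {"event1": "stateB", "event2": "stateA"},
    "stateB": {"event3": "stateC", "event4": "stateA"},
    "stateC": {},
}

Handler = Callable[[str], None]


def run(events: Iterable[str],
        handlers: Dict[Tuple[str, str], Handler],
        start: str = "stateA") -> str:
    """Feed events to the diagram and return the final state."""
    state = start
    for event in events:
        if event not in TRANSITIONS[state]:
            # An SGML parser would report a content model violation here.
            raise ValueError(f"{event} is not allowed in {state}")
        action = handlers.get((state, event))  # code attached to this state
        if action is not None:
            action(event)
        state = TRANSITIONS[state][event]
    return state
```

For example, `run(["event2", "event1", "event3"], {})` stays in stateA, moves to stateB, and ends in stateC. In the SGML approach the table is not written by hand: the parser derives the same information from the generated implementation-DTD.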
At run-time, for each class of diagrams, a single SGML parsing process is in charge of managing all the state-diagrams and the associated logic; there is one concurrent parsing of the various SGML DTDs defined for each state diagram. The application language has concurrent access to the various contexts.
Such an approach offers key benefits:
This paper has demonstrated where SGML is invaluable for system development and integration, based on the EDIDOC experience.
XML can be considered as a candidate of choice for structuring data in EDI applications, when no EDIFACT message fulfils the needs. It is also very useful as a way to structure interprocess communications when integrating distributed applications.
Full-blown SGML toolkits are a must for data processing. They are particularly useful for implementing data convertors, just by using standard features such as OMITTAG, SHORTREF, LINK, and CONCUR. Finally, SGML and related development tools offer a nice way of addressing, at the same time, the definition, documentation, and implementation of workflow scenarios.
I wish to thank Jordi Farrès, of the European Space Agency, for having defined the EDIDOC requirements and for the numerous open-minded technical discussions during the analysis and design phases.
My thanks are also due to Stéphane Bidoul for his explanations regarding the implementation of state-diagrams by the means of SGML technology.
CGI: Common Gateway Interface
DTD: Document Type Definition
EDI: Electronic Data Interchange
EDIDOC: Electronic Data Interchange for DOCuments
EDIFACT: Electronic Data Interchange for Administration, Commerce and Transport
ESA: European Space Agency
FTP: File Transfer Protocol
HTML: HyperText Markup Language
MAC: Message Authentication Code
MIME: Multipurpose Internet Mail Extensions
PGP: Pretty Good Privacy
SGML: Standard Generalized Markup Language
URL: Uniform Resource Locator
WAN: Wide Area Network
WWW: World Wide Web
XML: Extensible Markup Language
Please mail your comments to Philippe Vijghen at phv@acse.be
This paper was first published in the Conference Proceedings of SGML'97 US, December 1997, pp 213-218.
Copyright © 1997 The SGML Technologies Group