[This local archive copy is from the official and canonical URL, http://www.personal.u-net.com/~sgml/xml-edi/edi-dtds.htm, 1999-02-05; please refer to the canonical source document if possible.]
This paper looks at the role document type definitions (DTDs) will play in the management of XML-based business-to-business electronic data interchange. It is designed to clarify a number of issues that seem to be causing concern at the beginning of 1999 during discussions relating to the best way to set up and manage shareable repositories of business message types. In particular it looks at:
A well-formed XML document or message consists of a set of nested
elements which are contained within a root element.
Each element has an element name that identifies
the role played by the contents of the element. Element names are used to
identify the start and end of each element, which XML requires to be clearly
identified by the presence of a start tag and an end tag
containing matched pairs of element names. An XML start tag can also
contain specifications for one or more attributes, each of
which consists of an attribute name and a quoted attribute
value. The only constraints placed on a well-formed XML document
are those required by the XML syntax, which defines which characters
are used to identify the start and end of each type of markup
used to qualify the contained data, and the way in which the component parts of
this markup must be separated by whitespace or by the use of
an = between attribute names and attribute values.
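As a minimal sketch of the terms used above, the following Python fragment parses a small well-formed message with the standard library's xml.etree.ElementTree module and lists each element with its attributes. The message itself (the order and item elements and their id and qty attributes) is invented for illustration and is not taken from any EDI standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: one root element containing nested elements.
MESSAGE = '<order id="A1"><item qty="2">Widget</item><item qty="5">Gadget</item></order>'

root = ET.fromstring(MESSAGE)  # raises ET.ParseError if the text is not well-formed

# Walk the tree in document order, recording each element name and its attributes.
elements = [(el.tag, dict(el.attrib)) for el in root.iter()]


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

A message with mismatched tags, or with an unquoted attribute value, fails this check because it breaks the XML syntax rules rather than any DTD constraint.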
A document type declaration can be associated with an XML message to allow a more detailed set of constraints to be applied to the structure of the message. These constraints are expressed in the form of a document type definition (DTD) that can be used to:
DTDs consist of a set of declarations, the format of which depends on the type of object being defined. Each declaration starts with a keyword that identifies the type, and therefore format, of the declaration. The permitted set of keywords is:
ELEMENT
to indicate that the declaration contains a model
that constrains the permitted contents of a particular element
ATTLIST
to indicate that the declaration defines the names,
permitted values/types and default values of a list (set) of attributes that
are to be associated with a named element
ENTITY
to indicate that the declaration contains either the
replacement text, or a pointer to a file containing the replacement text,
that is to be used to replace a named entity reference within
messages associated with the DTD
NOTATION
to indicate that the declaration assigns a locally
significant notation name to a process managed by a resource
whose location is defined as part of the notation declaration.
A DTD may also contain any number of comment declarations that contain explanations of the role of the DTD and its components, etc. Comment declarations start with a pair of hyphens in place of a keyword, and have a matching pair of hyphens before the declaration close.
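The four declaration types can be seen together in a short sketch. The DTD below is hypothetical (the names order, item, currency, supplier and gif, and the example.org address, are assumptions made for illustration). It is placed in the internal subset of a message so that Python's xml.parsers.expat module, a non-validating parser, can report each declaration as it is read.

```python
import xml.parsers.expat

# Hypothetical message carrying its DTD in the internal subset.
MESSAGE = """<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- Invented order message, for illustration only -->
  <!ELEMENT order (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item currency (GBP|USD) "GBP">
  <!ENTITY supplier "Example Supplier Ltd">
  <!NOTATION gif SYSTEM "http://www.example.org/viewers/gif">
]>
<order><item>Widget</item></order>
"""

seen = []  # (keyword, declared name) pairs in the order they are reported

parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = lambda name, model: seen.append(("ELEMENT", name))
parser.AttlistDeclHandler = (
    lambda element, attribute, kind, default, required:
        seen.append(("ATTLIST", element + "/" + attribute)))
parser.EntityDeclHandler = (
    lambda name, is_parameter, value, base, system_id, public_id, notation:
        seen.append(("ENTITY", name)))
parser.NotationDeclHandler = (
    lambda name, base, system_id, public_id: seen.append(("NOTATION", name)))
parser.Parse(MESSAGE, True)
```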
When a validating XML document parser encounters a document type declaration
at the start of an XML message it will automatically check that the sequence
in which elements occur within the XML message conforms to the constraints
expressed in the element declaration for the containing element within the
DTD. It will also check that the values of any attributes conform to the
restrictions declared in any attribute list declarations associated with
the element name within the DTD. In addition it will check that any attribute declared
as required in the DTD is present, and that any attribute declared with a fixed
value has that value wherever it is specified. If an attribute is not present, but has been assigned
a default or fixed value in the DTD, the parser will automatically add the attribute
name and default value to the list of attributes associated with the element
in the message. Where an entity reference, consisting of the entity name
preceded by & and followed by a semicolon (;), is encountered
the reference will be replaced by the replacement text specified in the
entity declaration with the specified name.
If this entity has been
associated with a specific notation then the name assigned to the notation
will be searched for in the notation declarations in the DTD, and the data
will be passed to the process(es) identified by the notation declaration
with that name (providing the application has a mechanism for processing data
of that type).
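Two of these behaviours, the supply of default attribute values and the replacement of entity references, can be observed with a short sketch. It assumes Python's xml.etree.ElementTree module, whose underlying parser reads declarations in the internal subset but does not validate, so the content model checks described above are not performed here. The message and all names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: the item element omits its currency attribute and
# refers to the supplier entity declared in the internal subset.
MESSAGE = """<!DOCTYPE order [
  <!ELEMENT order (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item currency (GBP|USD) "GBP">
  <!ENTITY supplier "Example Supplier Ltd">
]>
<order><item>Widget from &supplier;</item></order>
"""

root = ET.fromstring(MESSAGE)
item = root.find("item")

default_currency = item.get("currency")  # supplied from the ATTLIST default
item_text = item.text                    # entity reference already replaced
```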
As part of this validation process an XML document parser can choose to create a document object model (DOM) for the message. This model will contain a description of each element in the document hierarchy, and associate with each node in that tree the set of attribute names and values that the parser has detected in the message or added as the result of checking for default or fixed values within the DTD. This document object model can be made available to other processes that need to be applied to the message via the widely adopted Simple API for XML (SAX) or via an application dependent API.
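The two styles of access can be compared in a brief sketch, assuming Python's xml.dom.minidom and xml.sax modules. The first half builds a tree and walks it; the second receives the same information as a stream of SAX events. The message and the helper names (describe, Collector) are illustrative assumptions.

```python
import xml.dom.minidom
import xml.sax

MESSAGE = '<order id="A1"><item qty="2">Widget</item></order>'  # hypothetical


def describe(node, depth=0):
    """Return (depth, element name, attributes) for node and its descendants."""
    rows = []
    if node.nodeType == node.ELEMENT_NODE:
        rows.append((depth, node.tagName, dict(node.attributes.items())))
        for child in node.childNodes:
            rows.extend(describe(child, depth + 1))
    return rows


# Tree-based access: the whole document is held in memory as nodes.
tree = describe(xml.dom.minidom.parseString(MESSAGE).documentElement)


class Collector(xml.sax.ContentHandler):
    """Record start and end events as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


# Event-based access: the handler is called as each tag is encountered.
collector = Collector()
xml.sax.parseString(MESSAGE.encode("utf-8"), collector)
```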
XML messages can refer to parts of their own document object model, or to components of the document object model of another message, by use of the XML Pointer Language (XPointer). This language allows the component parts of a document to be identified from either the contents of the DOM or by their position within the DOM.
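The difference between addressing by position and addressing by content can be sketched as follows. Python's standard library does not implement XPointer, so this example substitutes the limited XPath subset supported by xml.etree.ElementTree; it illustrates the idea only, not XPointer syntax. The message is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical message with two items that can be told apart by id or by order.
root = ET.fromstring(
    '<order><item id="i1">Widget</item><item id="i2">Gadget</item></order>')

by_position = root.find("./item[2]")         # the second item child of the root
by_content = root.find(".//item[@id='i1']")  # the item whose id attribute is i1
```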
The DOM of an XML message can be transformed into an alternative structure using the XML transformation language that is defined as part of the Extensible Stylesheet Language (XSL) specification. As part of this process the name and order of elements and attributes can be changed, and new sets of default attributes can be assigned to elements.
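The kind of restructuring described here can be sketched in plain Python. This is a stand-in for an XSL transformation, not XSL itself, since the Python standard library has no XSL processor. The renaming table, the unit attribute and the sample message are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to target element names.
RENAME = {"order": "purchase", "item": "line"}


def transform(source):
    """Return a renamed copy of source with its children reordered by name."""
    target = ET.Element(RENAME.get(source.tag, source.tag), dict(source.attrib))
    target.text = source.text
    if target.tag == "line":
        target.attrib.setdefault("unit", "each")  # assign a new default attribute
    for child in sorted(source, key=lambda el: el.tag):
        target.append(transform(child))
    return target


result = transform(ET.fromstring(
    '<order id="A1"><note>urgent</note><item qty="2">Widget</item></order>'))
```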
The XSL specification also defines a set of elements and attributes that an XML message can be transformed into so that the message can be displayed on a computer screen by an XSL-aware tool. It is not, however, a requirement of XSL that documents be transformed to this format. XSL allows documents to be transformed into any format required for subsequent processing.
To make it easier to manage sets of element and attribute names within DTDs that are made up of components taken from more than one source, W3C has recently approved an extension to the XML specification that allows the names of elements and attributes to be assigned a prefix that can be used to associate a name with the Internet uniform resource locator (URL) of the definition of the namespace.
Whilst initial examples of the use of
Namespaces in XML have concentrated on the use of namespaces
to qualify element names, it is anticipated that a more useful role for
namespaces within commercial environments will be through the use of
namespaces to qualify attributes. The reason for this is that a namespace
qualified attribute value represents a tuple that can be directly related to
entries within a repository. To understand this, consider the following
namespace qualified attribute value specification: x:y="z".
The namespace qualifier (x) is linked, via a namespace
declaration attribute in the same or a containing element,
to an Internet resource, which will typically be a database/repository. The
namespace declaration attribute will typically have the form:
xmlns:x="http://www.standardsbody.org/repository-x".
The attribute name associated with this namespace qualifier (e.g. y)
will identify a relevant field or process within the
repository, which will be used to find information or processes
relating to the value defined as the attribute value (e.g. z).
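The tuple can be recovered mechanically from a parsed message. In this sketch, which assumes Python's xml.etree.ElementTree module, the x:y="z" attribute and its namespace declaration are carried on a hypothetical item element; the parser reports the attribute name in the form {namespace URL}local-name, which is then split into its parts.

```python
import xml.etree.ElementTree as ET

# The item element is an assumption; the attribute and declaration follow the text.
MESSAGE = '<item xmlns:x="http://www.standardsbody.org/repository-x" x:y="z"/>'

item = ET.fromstring(MESSAGE)

# Collect (repository URL, field name, value) for each namespace qualified attribute.
tuples = []
for qualified_name, value in item.attrib.items():
    if qualified_name.startswith("{"):
        url, local_name = qualified_name[1:].split("}", 1)
        tuples.append((url, local_name, value))
```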
The following types of repositories are expected to be relevant for the creation of DTDs:
An example of such a repository is the General MIGS Repository (GMR) set up by the CEN/ISSS MIGS project for the storage of EDIFACT message implementation guidelines (MIGS) produced by European and other standards bodies and trade associations.
At present such repositories do not exist, though OASIS are currently discussing the setting up of such a repository.
Examples of such repositories include the ISO Basic Semantic Repository and the UN-EDIFACT Directories (neither of which currently has pointers to sources of XML declarations for their objects, but both of which have plans to extend themselves to provide access to such information) and the XML/EDI Glossary currently under construction for the XML/EDI Group.
It would be possible to build a single repository to serve all these roles, but more practically it can be envisaged that there are likely to be multiple repositories, some of which will only deal with subsets of these functions. What is important is that each of these forms of repository be able to supply the relevant information to application developers in a format that will be relevant for building the DTDs needed to validate messages to be used by specific applications.
For an application developer to be able to identify an existing DTD for a relevant message type a repository must provide the following facilities:
It should be possible to select directories based on criteria such as:
For example, show all messages containing an element or attribute name that includes the word "Deliver".
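A search of this kind might be sketched as follows. The repository is modelled as an in-memory dictionary of hypothetical message types, each holding a few invented declarations. The declared element and attribute names are extracted with Python's xml.parsers.expat module by wrapping each set of declarations in the internal subset of a throwaway document.

```python
import xml.parsers.expat

# Hypothetical repository: message type name -> DTD declarations.
REPOSITORY = {
    "despatch-advice": "<!ELEMENT despatch (DeliveryDate)>"
                       "<!ELEMENT DeliveryDate (#PCDATA)>",
    "invoice": "<!ELEMENT invoice (total)><!ELEMENT total (#PCDATA)>"
               "<!ATTLIST total currency CDATA #IMPLIED>",
    "order": "<!ELEMENT order (#PCDATA)>"
             "<!ATTLIST order DeliverTo CDATA #REQUIRED>",
}


def declared_names(declarations):
    """Return the element and attribute names declared in a set of declarations."""
    names = set()
    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = lambda name, model: names.add(name)
    parser.AttlistDeclHandler = (
        lambda element, attribute, kind, default, required: names.add(attribute))
    parser.Parse("<!DOCTYPE x [%s]><x/>" % declarations, True)
    return names


def find_messages(word):
    """Return the message types with a declared name that includes word."""
    return sorted(
        message for message, declarations in REPOSITORY.items()
        if any(word in name for name in declared_names(declarations)))


matches = find_messages("Deliver")
```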
If the DTD in its unexpanded form contains parameter entity references the file submitted to the developer should be a ZIP file containing all of the component parts of the DTD, rather than the fully expanded DTD as displayed to the developer for review purposes.
A repository that contains storage entities whose contents are XML declarations designed to be used to build new DTDs should provide the following facilities to an application developer:
For example, show all messages containing an element or attribute name that includes the word "Deliver".
The fact that an element is used in many DTDs that define messages similar in type to the one to be developed will help the application developer in the selection of appropriate components. (It might also help him to identify complete DTDs that could be easily tailored to meet his needs.)
Users should be shown not only the declaration for the element(s) selected but also the declarations for any elements that are included in the model group that defines the constraints that are to apply as to permitted contents of the element. If these elements permit nested elements in their contents the declarations for these elements should be added to the display recursively so that the application developer can browse the declarations associated with all permitted subcomponents of the selected element.
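The recursive expansion described above amounts to following each content model until no new element names appear. A minimal sketch, assuming Python's xml.parsers.expat module and a hypothetical pool of declarations, is:

```python
import xml.parsers.expat

# Hypothetical pool of declarations held by a repository.
DECLARATIONS = """
<!ELEMENT order (header, item+)>
<!ELEMENT header (date)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ELEMENT invoice (total)>
<!ELEMENT total (#PCDATA)>
"""


def names_in(model):
    """Collect every element name mentioned in an expat content model tuple."""
    _kind, _quantifier, name, children = model
    found = {name} if name else set()
    for child in children:
        found |= names_in(child)
    return found


def load_models(declarations):
    """Map each declared element name to the names its model permits."""
    models = {}

    def on_element(name, model):
        models[name] = names_in(model)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element
    parser.Parse("<!DOCTYPE x [%s]><x/>" % declarations, True)
    return models


def closure(start, models):
    """Return start plus every element reachable through nested models."""
    seen, pending = set(), [start]
    while pending:
        name = pending.pop()
        if name not in seen:
            seen.add(name)
            pending.extend(models.get(name, ()))
    return seen


MODELS = load_models(DECLARATIONS)
```

Calling closure("order", MODELS) yields the set of declarations a developer would need to browse, or copy, when selecting the order element.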
When an element is selected for inclusion in a new DTD the files containing all the declarations associated with it must be added to the list required to create the new DTD.
The system should be able to remove the name of non-compulsory subelements from the model of an element, together with the associated element declarations. It should also be possible to indicate which optional attributes are to be omitted from the new DTD, and to indicate entries in permitted lists of attribute values that are not required for the new application.
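One way to sketch the tailoring of a content model is shown below. It assumes a deliberately simplified representation, in which a model is a flat sequence of (name, occurrence indicator) pairs; real content models may also contain choices and nested groups, which this sketch does not handle. The function name and sample model are hypothetical.

```python
def tailor(element, model, drop):
    """Return an element declaration with the named optional subelements removed.

    model is a list of (name, occurrence) pairs, where occurrence is one of
    "", "?", "*" or "+". Only "?" and "*" subelements are non-compulsory.
    """
    kept = []
    for name, occurrence in model:
        if name in drop:
            if occurrence not in ("?", "*"):
                raise ValueError("%s is compulsory in %s" % (name, element))
            continue  # optional subelement: leave it out of the new model
        kept.append(name + occurrence)
    if not kept:
        return "<!ELEMENT %s EMPTY>" % element
    return "<!ELEMENT %s (%s)>" % (element, ", ".join(kept))


# Hypothetical model: a compulsory header, an optional note, one or more items.
ORDER_MODEL = [("header", ""), ("note", "?"), ("item", "+")]
tailored = tailor("order", ORDER_MODEL, {"note"})
```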
Once all the elements needed to construct the DTD have been identified, and tailored appropriately, it should be possible to build a ZIP file containing all the files needed to build the new DTD for distribution to the application developer.
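Packaging the component files can be sketched with Python's zipfile module. The file names and contents below are hypothetical; the first file refers to the second through a parameter entity, which is the situation in which the unexpanded parts need to travel together.

```python
import io
import zipfile


def build_dtd_package(files):
    """Return the bytes of a ZIP archive holding the given name -> text files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()


# Hypothetical component files of a new DTD.
FILES = {
    "message-x.dtd": '<!ENTITY % items SYSTEM "items.ent">\n'
                     "%items;\n"
                     "<!ELEMENT message-x (item+)>\n",
    "items.ent": "<!ELEMENT item (#PCDATA)>\n",
}

package = build_dtd_package(FILES)

# Reading the archive back confirms what the developer would receive.
received = zipfile.ZipFile(io.BytesIO(package))
received_names = received.namelist()
received_items = received.read("items.ent").decode("utf-8")
```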
This information is needed to guide future DTD developers to the new DTD, or to determine the fields in which the elements would be most appropriate. Ideally it should also be possible to record which attributes of the element have been used by the new DTD.
For a glossary or other form of semantic definition repository to be of use in the creation of DTDs it must offer the following facilities:
It should be possible to do a free text search of the whole repository for relevant terms.
When a match is found the repository must be able to display the whole contents of the matched record so that the user can determine whether or not the identified component is of the type required. The displayed information may contain links to other data sources, such as the formal XML declarations that have been associated with the component in different DTDs.
It should be possible to easily identify what type of data each link from the repository record takes you to. It should be possible to clearly distinguish between links that take you to complete DTDs that use the component, to declarations that can be used to add the component to a new DTD, or to links that simply take you to semantic definitions that describe the role of the component.
Ideally a semantic repository will also offer the functionality found in a DTD or declarations repository. It may also be linked to a mechanism for generating formal data models, such as UML data-entity diagrams.
Repositories can also be used to provide other forms of information that may be relevant to the development of XML/EDI applications. Possible extensions include:
Entries in a DTD repository may be linked to one or more style-sheets designed to display the message in a specific language/environment.
Entries in a declarations repository may be linked to one or more XSL templates that define the processing to be associated with the elements/attributes defined by that declaration set. Each template may refer to one or more sharable functions whose definitions will need to be recursively called to create an XSL stylesheet that can provide a start point for the application developer.
Entries in a semantic repository/glossary may reference sharable programs designed to process specific classes of message components.
Different stylesheets can be developed for the creation of the message by different language or user communities. By copying and modifying a stylesheet already used in association with the selected message in one environment an application developer can save many hours of work when developing a new application.
By providing a customizable template that can be used to generate a relevant set of database calls, trade associations can save application developers working in their area many hours of work.
DTDs and declaration repositories can include definitions of fixed attributes that can be used to invoke data validation modules using a "standardized" notation name, attribute value, namespace declaration and/or reference to an Internet uniform resource locator.
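As a rough sketch of this idea, a fixed attribute can name a validation routine that the receiving application looks up locally. The attribute name check, the value iso-date and the validator table are all assumptions; no standardized set of such names is implied. Python's xml.etree.ElementTree module is used, whose parser supplies the fixed value even though the message never states it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical message: the DTD fixes a check attribute on the date element.
MESSAGE = """<!DOCTYPE order [
  <!ELEMENT order (date)>
  <!ELEMENT date (#PCDATA)>
  <!ATTLIST date check CDATA #FIXED "iso-date">
]>
<order><date>1999-10-10</date></order>
"""

# Hypothetical local modules, keyed by the agreed attribute value.
VALIDATORS = {
    "iso-date": lambda text: re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None,
}


def validate(root):
    """Run the validator named by each element's check attribute, if known."""
    results = {}
    for element in root.iter():
        check = element.get("check")
        if check in VALIDATORS:
            results[element.tag] = VALIDATORS[check](element.text or "")
    return results


results = validate(ET.fromstring(MESSAGE))
```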
XML notations are defined in terms of references to Internet uniform resource locators. For these to be of use, application developers have to be able to map these URLs to modules that are applicable in their application environments. By providing links to tools that know about specific types of data, trade associations can reduce the time needed to develop new applications in which it is necessary to transfer non-textual data (e.g. for the exchange of X-ray images in healthcare informatics applications, or the exchange of manufacturing data or drawings as part of complex tendering/ordering processes).
Data receivers need to be able to map data within received messages into local databases. Various techniques can be employed for this, including the use of the XSL transformation language to create a set of SQL instructions from an XML message. By providing templates for mapping a particular message to a documented database format, trade associations can make it easier for application developers to associate message contents with their local databases.
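The mapping step can be sketched in Python rather than XSL, using the standard sqlite3 module with an in-memory database. The message layout, table name and column names are hypothetical. Parameterized statements are used in place of generated SQL text so that message contents are never interpreted as SQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical received message.
MESSAGE = '<order id="A1"><item sku="W1" qty="2"/><item sku="G7" qty="5"/></order>'

root = ET.fromstring(MESSAGE)

# One database row per item element, carrying the order id from the root.
rows = [(root.get("id"), item.get("sku"), int(item.get("qty")))
        for item in root.findall("item")]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE order_line (order_id TEXT, sku TEXT, qty INTEGER)")
connection.executemany("INSERT INTO order_line VALUES (?, ?, ?)", rows)

stored = connection.execute(
    "SELECT order_id, sku, qty FROM order_line ORDER BY sku").fetchall()
```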
XML DTDs should only be updated using fully-managed procedures that lead to the creation of uniquely identified versions of DTDs. XML document type declarations should use formal public identifiers to clearly identify specific versions in addition to a system identifier containing a URL that identifies both the source and the version of the document. A typical well-managed document type declaration will have the form:
<!DOCTYPE message-x PUBLIC "-//Standards-body-x//DTD for Message-X, 10th October 1999//EN" "http://www.standards-body.org/DTDs/message-x-19991010.dtd">
When a specific version of a DTD has been identified the receiving system will be able to cache the DTD the first time it is received in the full knowledge that the message structure will not change while this version of the document type declaration is associated with the message.
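A receiving system first has to read the two identifiers out of the document type declaration. A minimal sketch, assuming Python's xml.parsers.expat module, follows. The declaration is written in the form the XML 1.0 grammar defines, in which the PUBLIC keyword is followed by the public identifier and then the system identifier as two quoted literals. The parser does not fetch the DTD; it only reports where it is.

```python
import xml.parsers.expat

MESSAGE = (
    '<!DOCTYPE message-x PUBLIC '
    '"-//Standards-body-x//DTD for Message-X, 10th October 1999//EN" '
    '"http://www.standards-body.org/DTDs/message-x-19991010.dtd">'
    '<message-x/>')


def doctype_identifiers(text):
    """Return the root name, public identifier and system identifier of text."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, public=public_id, system=system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)
    return found


identifiers = doctype_identifiers(MESSAGE)
```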
DTD repositories should provide access to all versions of a particular DTD. They should also provide a mechanism whereby the latest version of the DTD can be identified by using a shortened form of the system identifier, using techniques similar to those employed by W3C for identifying the latest version of an XML-related specification. For example, to reference the latest version of the DTD whose specific version has been entered above the document type declaration could be shortened to read:
<!DOCTYPE message-x PUBLIC "-//Standards-body-x//DTD for Message-X//EN" "http://www.standards-body.org/DTDs/message-x.dtd">
Undated DTDs should not be cached on the receiving system. Instead the application should check each time it receives a message with this document type definition associated with it that the last version it cached is still the latest version. (This can be done using the PEP extension method to HTTP, but requires The XML/EDI Group or a similar body to get agreement between repository creators as to the format this XML message needs to take.)
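The caching rule can be sketched as below. Two assumptions are made. First, a dated version is recognised by an eight-digit date before the .dtd suffix, following the naming used in the example system identifiers. Second, the check for a newer version is simplified to fetching the undated DTD again each time, since the message format for such a check has yet to be agreed. The fetch function is supplied by the caller, so no network access is implied.

```python
import re

# Assumed convention: dated versions end in -YYYYMMDD.dtd.
DATED = re.compile(r"-\d{8}\.dtd$")


class DtdCache:
    """Cache dated DTDs permanently; re-fetch undated ones on every request."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a system identifier, returning text
        self._store = {}

    def get(self, system_id):
        if DATED.search(system_id) and system_id in self._store:
            return self._store[system_id]  # fixed version: safe to reuse
        text = self._fetch(system_id)
        self._store[system_id] = text
        return text


# Demonstration with a stand-in fetch function that records each request.
requests = []


def fake_fetch(system_id):
    requests.append(system_id)
    return "<!-- contents of %s -->" % system_id


DATED_ID = "http://www.standards-body.org/DTDs/message-x-19991010.dtd"
LATEST_ID = "http://www.standards-body.org/DTDs/message-x.dtd"

cache = DtdCache(fake_fetch)
for system_id in (DATED_ID, DATED_ID, LATEST_ID, LATEST_ID):
    cache.get(system_id)
```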
Where DTDs are being designed for use within closed communities it may not be necessary to use formal public identifiers. Providing there are well defined procedures for ensuring that a community updates its DTDs in a controlled manner it may be sufficient to rely on local file naming conventions within a system identifier. This could take the form of, say:
<!DOCTYPE message-x SYSTEM "/DTDs/message-x-v1.dtd">
The association of processing templates, such as XSL stylesheets, with messages, should be done at application level, and will not generally rely on the referencing of repository copies of stylesheets or processing modules. Normally the stylesheet or processing template will be copied from the repository to a local filestore prior to modification to meet local processing needs. Conventions should be set up between the communicating parties that allow for the storage and recall of the relevant processing instructions by reference to relative addresses which require the processing rules to be placed in a directory that is a fixed number of steps away from the directory used to store messages and the applications used to process them. This will, for example, allow XSL stylesheets to be referenced using instructions of the form:
<?xsl-stylesheet href="/processes/message-x-v1-en.xsl" type="text/xsl"?>
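An application that follows such a convention needs to read the stylesheet address from the processing instruction. The sketch below assumes Python's xml.parsers.expat module and uses the instruction name exactly as written above. The parser reports a processing instruction as a target and an unparsed data string, so the href and type values are picked out with a regular expression that handles double-quoted values only.

```python
import re
import xml.parsers.expat

MESSAGE = (
    '<?xsl-stylesheet href="/processes/message-x-v1-en.xsl" type="text/xsl"?>'
    '<message-x/>')

# Matches name="value" pairs inside the processing instruction data.
PSEUDO_ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def stylesheet_links(text, target_name="xsl-stylesheet"):
    """Return one dictionary of pseudo-attributes per stylesheet instruction."""
    links = []

    def on_instruction(target, data):
        if target == target_name:
            links.append(dict(PSEUDO_ATTRIBUTE.findall(data)))

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_instruction
    parser.Parse(text, True)
    return links


links = stylesheet_links(MESSAGE)
```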