[This local archive copy mirrored from the canonical site: http://www.texcel.no/sgml97.htm; text only; links may not have complete integrity, so use the canonical document at this URL if possible.]

XML and Modern Software Architectures

XML in the world of the Internet, JavaBeans, Software Components, and Controls

Text of a paper delivered at SGML/XML '97 by Jonathan Robie of Texcel Ventures, Inc.



Abstract

As XML has brought SGML into mainstream software development, the SGML community has had to change some basic assumptions about editing environments and documents. Recently, a variety of new XML-related standards have been proposed that envision using XML as a core technology for Internet software development. These standards are in early phases, but if they come to be accepted and used, they paint a very interesting future for XML and SGML. SGML tools have generally been designed for single-user use or for use on a LAN. There are now proposed standards that define protocols for joint authoring across the Internet, specifying how documents and document components can be created, traversed, locked, changed, checked in or out, versioned, and protected from unauthorized access.

These programming interfaces may be used by distributed editing tools to allow XML documents, databases, and repositories to be edited or viewed simultaneously by many users working from different locations. By supporting these standards as they come out, SGML and XML tools can become a vital part of new Internet development.

As an open format for structured documents, XML has made structured documents a natural way to define new standards. The SGML community has generally assumed that all documents were created by human beings, and ultimately read by other human beings. Although most XML documents probably will be written and read by humans, many of the new XML standards are for environments in which documents are created or consumed by programs. Some of these use XML as a rich data interchange format; others use XML to define protocols, software installation procedures, financial transactions, document transfer schedules, and system configuration.

Finally, this paper explores the use of software components in SGML systems. Many new applications are based on the concept of reusable software components, and many SGML programs seem to use similar controls, such as tree browsers, list viewers, and SGML browsers. Several suggestions are made for improving SGML software systems to support component-based programming.

Only a year ago, almost all SGML documents were written by human beings, who used SGML editors to create ASCII files, and the SGML community paid relatively little attention to software architectures. XML has caused an explosion of new applications for SGML, bringing SGML into mainstream software development, and introducing SGML to Internet applications, distributed objects, exchange protocols, data interchange, and a wide variety of other software systems. As a result, SGML tool vendors are faced with many new requirements, and tools that were once designed to be used only by humans must now provide services to many different kinds of programs, including Internet programs and programs using distributed objects. The document "author" may be another program, or it may be a workgroup of people located in several countries, working on the same document at the same time with distributed editors.


A sampling of XML-related standards

During the last year, a number of new XML-related standards have been proposed. Most of these standards have not yet been accepted and are still under development, but they are a useful indicator of trends in the use of XML. These standards often suggest or require different software architectures than those traditionally used for SGML. Here is a sampling of XML standards that have been proposed or are in the process of development, and the architectures that they assume:

  • Document Object Model (DOM): A programming interface for traversing, creating, or modifying documents and document structure. Supports XML, HTML, and CSS. Will also allow DTDs to be queried, modified, or used to test document compliance. Assumes operation across the Internet.
  • WWW Distributed Authoring and Versioning (WebDAV): HTTP protocol extensions to allow distributed authoring. The protocol messages are defined in XML. Supports checkin, checkout, versioning, history graph, differencing, difference merging, access control, locking.
  • Extensible Style Language (XSL): A stylesheet language for formatting XML documents. Includes a scripting language and very sophisticated formatting.
  • XML Electronic Data Interchange (XML/EDI): Electronic Data Interchange lets businesses exchange data using standard formats in electronic messaging systems. XML/EDI uses XML to define these formats. Makes no assumptions about how the data will be stored by each business, allowing heterogeneous systems to exchange data. Markup is likely to be generated and interpreted by programs, with no human intervention.
  • Channel Definition Format (CDF): Lets the exchange of documents be specified in XML documents. Used for "push" publishing and "smart pull" publishing. A channel is a set of documents that should be treated as a unit. CDF documents specify publishing operations and schedules for channels.
  • Open Financial Exchange (OFX): Messaging specification for consumer financial services. Supports fund transfers, payments, and downloading of statements. Intended for use with banks, brokerages, merchants, financial advisors, government agencies. Markup is likely to be generated and interpreted by programs, with no human intervention.
  • HL7 Kona Proposal: Specifies architectural forms that can be used in medical attachments. Allows information to be extracted automatically from documents, with specific architectural forms for different target groups. Individual applications can freely add other data to documents.
  • Open Software Description Format (OSD): Software distribution and update via the network, including "push" updates of software and hands-free installation. Markup is likely to be generated by a software packaging program, and used to install the software without ever being read by a human.
  • Resource Description Framework (RDF): Describes the contents of web resources in order to enable automatic processing. May be used to describe the contents of a web site, to provide additional information for search engines or intelligent agents, to declare property rights, etc.

The above list is meant to be a representative sample of some of the more important new XML-related standards, but it is by no means exhaustive. A careful examination of the above standards shows that they are intended for environments that are very different from traditional SGML editing environments:

  • Markup may be generated or interpreted by programs, without human intervention. Several standards use XML for complex data exchange, to define protocols, schedule delivery of content, install software, or describe other system-level tasks. One proposed standard, the HL7 Kona Proposal, uses architectural forms to represent the information that is expected by particular programs, embeds this data in conventional human-readable documents, and assumes that programs which access this information will use the architectural forms to extract the data that is relevant to them.
  • Documents may be authored traditionally by humans, but in a distributed environment, with transaction services, versioning, concurrency control, queries, and other standard repository functionality.
  • The editing operations used to create, navigate, and modify documents or DTDs will be available as part of a programming interface, and accessible across machine boundaries.
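
The kind of programming interface described in the last point can be illustrated with a small sketch. It uses the org.w3c.dom Java binding that later standardized the Document Object Model, which is an assumption on our part since the proposal discussed here was still in draft; the class name DomEditingSketch and the sample PAPER document are hypothetical. The sketch parses a document, adds an element, and traverses the resulting tree, all through calls rather than by editing markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomEditingSketch {

    /** Parses XML text into an in-memory tree, rethrowing failures unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Traversal: lists element names in document order, indented by depth. */
    static String outline(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
            depth++;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            out.append(outline(c, depth));
        }
        return out.toString();
    }

    /** Creation and modification: appends a new child element holding text. */
    static Element appendChild(Element parent, String name, String text) {
        Element child = parent.getOwnerDocument().createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
        return child;
    }

    public static void main(String[] args) {
        // Hypothetical sample document; the element names are illustrative only.
        Document doc = parse(
            "<PAPER><FRONT><TITLE>XML and software architecture</TITLE></FRONT></PAPER>");
        Element front = (Element) doc.getElementsByTagName("FRONT").item(0);
        appendChild(front, "DATE", "22 Oct 97");
        System.out.print(outline(doc, 0));
    }
}
```

Because every operation goes through the interface, the same calls could in principle be served by an editor, a repository, or a remote process.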

Taken together, these standards paint a picture of an Internet with open standards for creating and modifying DTDs and documents, managing documents in repositories, defining and maintaining software systems via documents, and editing documents in workgroups across the Internet. This is a very attractive future — and many of the benefits will be equally relevant in more traditional SGML environments, which will soon have open binary standards to allow different tools to cooperate with each other.

Currently, it is common for each SGML tool to have its own SGML parser and its own programming API. If several SGML tools are to be used together, they are often used sequentially, letting each tool parse the SGML again before processing it, and the parsers they use may not be completely compatible with each other! Alternatively, the tools may be integrated using their programming interfaces, but each SGML editing tool has its own programming interface, each repository has its own programming interface, and each formatting engine has its own programming interface. Now that the Internet community is blessing us with open binary standards, SGML tool vendors will finally be able to write to common interfaces, without writing a new interface for each new tool their users want to use.


Documents for rich data exchange

Many programs need to exchange complex sets of structured data using formats that are both human-readable and easily parsed. Object-oriented programs work with very rich data, and the structural relationships among objects can also be quite complex. Unlike most standard exchange formats, XML has adequate expressive power to model the data and relationships of object-oriented programs. In addition, if these formats are defined using DTDs, existing tools can be used to validate the structure of these files.

The programs that use these formats are generally domain-specific, and manage certain kinds of data with element-specific code tailored to the purpose of those documents. For instance, a medical claims processing program expects only medical claims and other documents related to those claims, looks for specific kinds of information, and has element-specific code that processes claims in a manner that changes as government regulations and company policies change. This kind of information might be exchanged using a database, but there are significant advantages to exchanging it in XML documents instead:

  • Expressive power. Most databases are relational databases, which are not very good at expressing complex or hierarchical relationships naturally. XML represents this structure well.
  • Structural validation. If written with a DTD, the structure of XML documents can easily be validated with existing tools. The equivalent tests for the corresponding relational database structure would be fairly involved.
  • Document format. Doctors are used to writing medical documents in the form of text; XML can be used to embed structured data for automatic processing into these structured text documents.
  • Easy ASCII exchange. XML documents can be exchanged as ASCII text files that can be read by humans, sent with ASCII-based email or messaging systems, and managed with standard SGML/XML tools.
  • Software independence. Exchanging documents in this way allows the sender and receiver to use different software systems, yet still exchange data.
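
The structural validation point above can be sketched with the validating parser in the standard Java XML library. The CLAIM DTD and its element names are hypothetical, invented only for this illustration, and the DTD is placed in the internal subset so that the example needs no external files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {

    // Hypothetical claim DTD, carried in the internal subset of each document.
    static final String DTD =
          "<!DOCTYPE CLAIM [\n"
        + "  <!ELEMENT CLAIM (PATIENT, AMOUNT)>\n"
        + "  <!ELEMENT PATIENT (#PCDATA)>\n"
        + "  <!ELEMENT AMOUNT (#PCDATA)>\n"
        + "]>\n";

    /** Returns the problems reported for the document; an empty list means valid. */
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check structure against the DTD in DOCTYPE
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            errors.add("not well-formed: " + e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        String complete = DTD + "<CLAIM><PATIENT>J. Doe</PATIENT><AMOUNT>120.00</AMOUNT></CLAIM>";
        String missingAmount = DTD + "<CLAIM><PATIENT>J. Doe</PATIENT></CLAIM>";
        System.out.println("complete claim problems: " + validate(complete).size());
        System.out.println("incomplete claim problems: " + validate(missingAmount).size());
    }
}
```

The receiving program writes no structural checks of its own; the content model in the DTD does that work.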

Another alternative would be to define ASCII-based message formats for each kind of information that might be exchanged, without using XML. But using formats based on XML has significant advantages over custom ASCII file formats:

  • XML and SGML parsers already exist, and can make it much easier to write software systems that can parse and process these formats. Designing custom formats that are easily parsed requires some expertise. Existing SGML repositories can parse these messages, store them, and perform queries based on the content and structure of the messages.
  • Physical format independence: New tags can be added to existing formats, but old programs can still find the information they need in the newer formats.
  • If written with a DTD, the structure of messages can be validated using existing XML tools.
  • Messages can be documents, and XML formatting tools can be used to display these documents in a variety of formats.
  • Architectural forms can be used to allow different information consumers to define data that they wish to extract; this information is easily found even when embedded in documents.
  • Individual programs can add tags to these documents to add information specific to a particular office or group. All applications can continue to parse them as before.
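
The claim that old programs keep working when new tags are added can be illustrated as follows. This is a minimal sketch, assuming a hypothetical PAYMENT message and a reader that locates the AMOUNT element by name rather than by position; none of these names come from a real standard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TolerantReaderSketch {

    /** Parses XML text, rethrowing failures unchecked. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /**
     * The "old program": it asks for AMOUNT by element name and ignores
     * every element it does not recognize. Returns null if AMOUNT is absent.
     */
    static String amountOf(String message) {
        NodeList hits = parse(message).getElementsByTagName("AMOUNT");
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // Original format (hypothetical).
        String v1 = "<PAYMENT><PAYEE>Acme</PAYEE><AMOUNT>100.00</AMOUNT></PAYMENT>";
        // Newer format: an office has added BRANCH and APPROVER elements.
        String v2 = "<PAYMENT><BRANCH>Durham</BRANCH><PAYEE>Acme</PAYEE>"
                  + "<AMOUNT>100.00</AMOUNT><APPROVER>J. Smith</APPROVER></PAYMENT>";
        System.out.println(amountOf(v1));
        System.out.println(amountOf(v2));
    }
}
```

A fixed-column or position-based ASCII format would typically need its reader rewritten for the second message; here the reader is unchanged.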

Another possibility would be to exchange this data as distributed objects, but XML documents are easier to create and exchange than distributed objects, and they do not require all applications that use them to use distributed object development systems. And XML documents can also be managed by distributed objects, making it possible to take advantage of replication services.


Document management tools as software components

In a distributed object environment, different tools may cooperate to perform general-purpose functions. Many of these tools must work for any document, regardless of DTD. This requirement is not new — most familiar SGML tools should meet at least this requirement, including parsers, editors, browsers, document repositories, and formatting engines. What is new to the SGML world is the idea of writing small software components that can work together, and that can be reused in many settings.

These components need to manage documents with as little advance knowledge of those documents as possible. This is particularly true in a distributed object environment where compound documents may be created with components from completely different programs, or for general-purpose programs like email systems that must manage very different kinds of documents. In this section we will discuss two different kinds of programs that may be used to offer services for SGML- or XML-based documents. The first kind does not need any kind of access to a document's content, and can manage any SGML or XML document appropriately. We will call these generic document management tools.

The second kind can also process any kind of document, but it does need access to some information contained in the document, and these documents may have completely different DTDs. Due to inherent requirements, programs like these cannot be made completely generic, but the DTD dependencies can be removed by adding a layer of middleware that interprets a document for them.

Generic document management controls

Software components are not always visible, but it is probably easiest to understand the concept for controls that are seen as part of the GUI. For instance, a tree navigator control is found in SGML editors, browsers, and repositories. In modern software development systems like ActiveX, OpenDoc, or JavaBeans, controls like these are written generically, and the same control could be used by all of these systems. Instead of writing single monolithic programs, different functions can be divided up among separate software controls, and controls from different applications can be combined to perform specific tasks. An SGML browser might be the software component that actually displays a document, but the actual document might be located in an SGML repository, which knows how to navigate the document's structure based on its representation in the database. A tree control might be used to allow the user to navigate the structure of a large document and locate the portions that must be displayed, or a query control might be used to find particular document elements, and the results given to a list control that displays them to the user.

This kind of system allows fairly major changes to be made by swapping components. For instance, if you have a system that uses an SGML browser component, an IETM viewer, DSSSL formatting engine, or HTML conversion system might be used instead. If you have a system that accesses a document in a repository, an ASCII document could be used instead by binding it to an SGML navigator that can traverse the tree structure of an ASCII document. Not only can the same controls be used in different programs, they can also be used in different environments. For instance, the same ActiveX control could be used in a C++ program, a Visual Basic program, or embedded in a web page. This gives software developers a great deal of choice and flexibility when constructing programs.
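
Swapping a repository for a plain text document, as described above, can be sketched with one small interface. All of the names here (DocumentSource, TextSource, RepositorySource, TitleViewer) are hypothetical, and the "repository" is only a stand-in that holds an already parsed tree; the point is that the viewing component never learns which source it was given.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ComponentSketch {

    /** What a display component needs from whatever holds the document. */
    interface DocumentSource {
        Document tree();
    }

    /** Source backed by the text of a file, parsed on demand. */
    static class TextSource implements DocumentSource {
        private final String text;
        TextSource(String text) { this.text = text; }
        public Document tree() {
            try {
                return DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new InputSource(new StringReader(text)));
            } catch (Exception e) {
                throw new IllegalStateException("cannot parse document text", e);
            }
        }
    }

    /** Stand-in for a repository that already stores the document as a tree. */
    static class RepositorySource implements DocumentSource {
        private final Document stored;
        RepositorySource(Document stored) { this.stored = stored; }
        public Document tree() { return stored; }
    }

    /** A viewing component: works with any source, knows nothing about storage. */
    static class TitleViewer {
        static String render(DocumentSource source) {
            NodeList titles = source.tree().getElementsByTagName("TITLE");
            return titles.getLength() == 0 ? "(untitled)" : titles.item(0).getTextContent();
        }
    }

    public static void main(String[] args) {
        DocumentSource fromText = new TextSource("<DOC><TITLE>C++ Programming Guide</TITLE></DOC>");
        DocumentSource fromRepository = new RepositorySource(fromText.tree());
        System.out.println(TitleViewer.render(fromText));
        System.out.println(TitleViewer.render(fromRepository));
    }
}
```

In a component system such as JavaBeans or ActiveX the binding would be done by the framework rather than by a constructor call, but the division of responsibilities is the same.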

Of course, current SGML tools typically do work together with other tools, and it is not unusual to combine several tools in the same environment. As I write this, I am using an SGML editor and an SGML viewer, which are two separate programs. But in most cases, tools cooperate by exchanging SGML files that are reparsed by each program. This means not only that each SGML tool must have its own parser, but also that an SGML document processed by several tools is generally parsed several times in sequence (often using the same parser!). With generic controls, an application can be built by combining existing controls, such as a parser, tree control, list control, and SGML browser, and adding the code specific to the new program that is being written. This significantly simplifies software development.

DTD-dependent middleware

Suppose that a list control must display a list of documents sorted by date, displaying the name and author of each document. Somehow, we must provide that list control with the name, author, and date of each document. But these may be stored in completely different places, depending on the DTD of each document, and some documents may not have all three items in the first place. Some DTDs, such as the TEI DTD, may offer a variety of ways to represent dates, and a given document may have several dates, each with a different meaning. Obviously, unless we want to abandon the whole idea of listing documents by name, date, and author, we must find some way to extract this information from any kind of document, but it would be insane to attempt to write one program that would try to find this information in any document, regardless of DTD. There are two reasonable approaches that might help:

  • Require that documents represent the information in a particular way.
  • Introduce simple middleware that knows how to interpret a particular DTD for our program.

We will show that the second approach is preferable.

Industry-standard DTDs and architectural forms

The first approach is to require that documents represent information in a particular way. Actually, there are two variations on this approach. Industry-standard DTDs require that an entire document be represented in a particular way; architectural forms require a document to contain an identifiable component that represents data in a particular way.

Both of these ideas are useful and necessary. However, they are not sufficient to provide a general-purpose solution to our problem. After all, the whole philosophy of SGML and XML is to provide very flexible specification of document structure, and document architects generally feel free to create new document formats or change industry-standard DTDs to accommodate specific needs. Architectural forms provide more flexibility, especially for documents created by programs that shield the user from the actual sequence of elements in the generated file, but they are more awkward for the many users who create their documents directly with SGML editors (or even with ASCII editors!), and they are useful only for those DTDs that actually incorporate architectural forms.
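
To make the architectural-form idea concrete, here is a deliberately simplified sketch. It assumes that an element declares the form it conforms to through an attribute, written out explicitly in each document; in a real SGML system that attribute would normally be fixed in the DTD. The attribute name KONA, the form name patient.name, and both sample documents are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ArchitecturalFormSketch {

    // Hypothetical attribute that names the architectural form of an element.
    static final String ARCH_ATTR = "KONA";

    /**
     * Returns the content of the first element that declares the given form,
     * whatever that element happens to be called, or null if there is none.
     */
    static String extract(String xml, String form) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        NodeList all = doc.getElementsByTagName("*"); // every element, document order
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (form.equals(e.getAttribute(ARCH_ATTR))) {
                return e.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two documents with unrelated element names but the same form.
        String report = "<REPORT><SUBJECT KONA=\"patient.name\">Jonathan Robie</SUBJECT>"
                      + "<FINDINGS>No abnormality.</FINDINGS></REPORT>";
        String chart = "<CHART><WHO KONA=\"patient.name\">Jonathan Robie</WHO></CHART>";
        System.out.println(extract(report, "patient.name"));
        System.out.println(extract(chart, "patient.name"));
    }
}
```

The extraction code is independent of the DTD, but only because both documents were authored with the form in mind, which is exactly the limitation noted above.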

Since most important industry-standard DTDs do not use architectural forms, they do not provide a general-purpose solution to the problem. Instead, we suggest the use of document interface objects.

Document interface objects

An alternative is to provide lightweight document interface objects that can perform specific functions for documents, and let them delegate operations to DTD-specific interface service objects that know how to perform these operations for a particular DTD. We have already given the example of a general-purpose email program that places documents in an inbox, and needs to sort documents by title, author's name, DTD, date, etc. For instance, a set of documents in the inbox might be listed like this:

Joe Spaniel      20 Oct 95   docbook       C++ Programming Guide
John Spinoza     2 Aug 97    surgpath      Surgical Pathology Report
Jonathan Robie   17 Aug 97   gcaproposal   SGML as an Interchange Format
Jonathan Robie   22 Oct 97   gcapaper      XML and software architecture

However, each doctype listed above has a different way of representing the author, the date the document was created, and the title. The DTD used to submit proposals for this conference represented the author's name like this:

<PERSON> <NAME>Jonathan Robie</NAME> </PERSON>

The DTD for the final paper represents it differently:

<AUTHOR> <FNAME>Jonathan</FNAME> <SURNAME>Robie</SURNAME> </AUTHOR>

The patient record is a little more difficult. There is no single author for a patient record; it is compiled by many doctors and other healthcare providers, and these providers work for many different establishments. For some systems, it may be suitable to list the record by the patient's name rather than by the author. In one of the DTDs used for demonstration purposes by the HL7 SGML SIG, the patient's name is represented like this:

<PATIENTINFO>
    <NAME.GRP>
        <FIRSTNAME>Jonathan</FIRSTNAME>
        <LASTNAME>Robie</LASTNAME>
    </NAME.GRP>
</PATIENTINFO>

Instead of parsing a document to find this information, the email program could make calls to a Bibliography Interface Object that knows how to retrieve the necessary information. The Bibliography Interface Object might be defined in Java, CORBA, or COM, and could have methods to return each standard piece of bibliographical information. For instance, here is an excerpt from a simple BibliographyInterface declaration defined in Java. It says that a BibliographyInterface can tell the author, title, and publication date for a document:

interface BibliographyInterface {
    public String Author();
    public String Title();
    public String PublicationDate();
}

Each specific kind of document will implement these functions differently. Here is a class that implements these functions for a GCA proposal:

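public class gcaproposalInterface implements BibliographyInterface {
    // myDoc is a field holding the document; its declaration is omitted from this excerpt.

    public String Author() {
        return myDoc.getElementContent("AUTHOR/NAME");
    }

    public String Title() {
        return myDoc.getElementContent("FRONT/TITLE");
    }

    // The proposal DTD carries no date, so a fixed message is returned.
    public String PublicationDate() {
        return "No date in document";
    }
}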

In the above example, myDoc is a DTD-specific delegation object that represents the document itself. The getElementContent() function retrieves the content of the first element found in the specified element structure. For instance, consider this function call:

myDoc.getElementContent("AUTHOR/NAME");

This returns the contents of the first NAME element found within an AUTHOR element. This interface is quite simple, and could conceivably be implemented in a more general way by using configuration files that specify which element contains the relevant information for each function in the interface:

<INTERFACES>
    <INTERFACE NAME="BIBLIOGRAPHY" DOCTYPE="PROPOSAL">
        <PUBID>"-//GCA//DTD SGML'97 Proposal DTD v1.0 19970324//EN"</PUBID>
        <AUTHOR>PROPOSAL/FRONT/AUTHOR/NAME</AUTHOR>
        <TITLE>PROPOSAL/FRONT/TITLE</TITLE>
    </INTERFACE>
</INTERFACES>

In general, if the purpose of the interface is merely to return information from the document according to generic protocols, configuration files will be helpful. (Interfaces that perform more complex tasks might not be easily implemented using configuration files.) Configuration files like this can save significant amounts of coding, and they also allow new document types to be incorporated at run time. For instance, if a user receives an unknown document type in the inbox, the program can present the user with a tree control and let the user locate the author, title, and date information in the document. Once this information is known, it can be written to the configuration file, and used for other documents that have the same DOCTYPE.
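
A configuration-driven interface object of this kind might look roughly like the following sketch. It is not the implementation behind this paper: the getElementContent() shown here is one possible reading of the path lookup described earlier (take the first match for each step in turn), the configuration is held in a plain Map rather than read from an INTERFACES file, and the class names ConfiguredBibliographySketch and Configured are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConfiguredBibliographySketch {

    /** The three methods of the BibliographyInterface excerpt in the text. */
    interface BibliographyInterface {
        String Author();
        String Title();
        String PublicationDate();
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Follows a path such as "AUTHOR/NAME", taking the first match at each step. */
    static String getElementContent(Document doc, String path) {
        String[] steps = path.split("/");
        NodeList hits = doc.getElementsByTagName(steps[0]);
        for (int i = 1; i < steps.length && hits.getLength() > 0; i++) {
            hits = ((Element) hits.item(0)).getElementsByTagName(steps[i]);
        }
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    /** Generic implementation driven by a field-to-path table, not per-DTD code. */
    static class Configured implements BibliographyInterface {
        private final Document doc;
        private final Map<String, String> paths; // e.g. "AUTHOR" -> "PROPOSAL/FRONT/AUTHOR/NAME"

        Configured(Document doc, Map<String, String> paths) {
            this.doc = doc;
            this.paths = paths;
        }

        private String lookup(String field) {
            String path = paths.get(field);
            String value = path == null ? null : getElementContent(doc, path);
            return value == null ? "No " + field.toLowerCase() + " in document" : value;
        }

        public String Author() { return lookup("AUTHOR"); }
        public String Title() { return lookup("TITLE"); }
        public String PublicationDate() { return lookup("DATE"); }
    }

    public static void main(String[] args) {
        // In-memory equivalent of the AUTHOR and TITLE entries in the configuration file.
        Map<String, String> proposal = new HashMap<>();
        proposal.put("AUTHOR", "PROPOSAL/FRONT/AUTHOR/NAME");
        proposal.put("TITLE", "PROPOSAL/FRONT/TITLE");

        Document doc = parse("<PROPOSAL><FRONT><TITLE>SGML as an Interchange Format</TITLE>"
                + "<AUTHOR><NAME>Jonathan Robie</NAME></AUTHOR></FRONT></PROPOSAL>");
        BibliographyInterface bib = new Configured(doc, proposal);
        System.out.println(bib.Author() + ", " + bib.Title() + ", " + bib.PublicationDate());
    }
}
```

Supporting a new document type at run time then amounts to adding entries to the table, for example after the user has pointed at the relevant elements in a tree control.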

Since these Document Interface Objects are relatively simple to write, and the number of DTDs that an average application will manage is fairly small, it is not necessary to have industry-standard interface objects. Tool vendors can design interfaces that are useful for their programs, and implement them for standard DTDs. However, if there were standards for these objects, then libraries of interface objects could be made available as a standard part of creating and distributing new DTDs.
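
The delegation described in this section can be summarized in one last sketch: a list control that talks only to BibliographyInterface, and a lookup that chooses the DTD-specific object for each document. The author markup follows the proposal and paper examples shown earlier; the root element names PROPOSAL and PAPER, the TITLE element, and the class names are assumptions made for the illustration. Rows are sorted by title because the proposal carries no date.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InboxSketch {

    interface BibliographyInterface {
        String Author();
        String Title();
    }

    /** Content of the first element with the given name, or "" if absent. */
    static String text(Document doc, String tag) {
        NodeList hits = doc.getElementsByTagName(tag);
        return hits.getLength() == 0 ? "" : hits.item(0).getTextContent();
    }

    /** Proposal DTD: the author is a single NAME inside PERSON. */
    static class ProposalInterface implements BibliographyInterface {
        private final Document doc;
        ProposalInterface(Document doc) { this.doc = doc; }
        public String Author() { return text(doc, "NAME"); }
        public String Title() { return text(doc, "TITLE"); }
    }

    /** Paper DTD: the author is split into FNAME and SURNAME. */
    static class PaperInterface implements BibliographyInterface {
        private final Document doc;
        PaperInterface(Document doc) { this.doc = doc; }
        public String Author() { return text(doc, "FNAME") + " " + text(doc, "SURNAME"); }
        public String Title() { return text(doc, "TITLE"); }
    }

    /** Delegation point: picks the DTD-specific object from the root element name. */
    static BibliographyInterface interfaceFor(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        String doctype = doc.getDocumentElement().getNodeName();
        switch (doctype) {
            case "PROPOSAL": return new ProposalInterface(doc);
            case "PAPER":    return new PaperInterface(doc);
            default: throw new IllegalArgumentException("no interface object for " + doctype);
        }
    }

    /** The list control: one row per document, sorted by title. */
    static List<String> listing(List<String> inbox) {
        List<String> rows = new ArrayList<>();
        for (String xml : inbox) {
            BibliographyInterface bib = interfaceFor(xml);
            rows.add(bib.Title() + " | " + bib.Author());
        }
        Collections.sort(rows);
        return rows;
    }

    public static void main(String[] args) {
        String paper = "<PAPER><TITLE>XML and software architecture</TITLE>"
                + "<AUTHOR><FNAME>Jonathan</FNAME><SURNAME>Robie</SURNAME></AUTHOR></PAPER>";
        String proposal = "<PROPOSAL><TITLE>SGML as an Interchange Format</TITLE>"
                + "<PERSON><NAME>Jonathan Robie</NAME></PERSON></PROPOSAL>";
        for (String row : listing(Arrays.asList(paper, proposal))) {
            System.out.println(row);
        }
    }
}
```

Adding a document type means adding one small class and one case in the lookup; the list control itself does not change.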


Summary

This paper has discussed emerging XML standards and the ramifications they may have for using XML in modern software systems. As an open standard for structured documents, XML offers great promise for rich data exchange, specifying protocols and file formats for a wide variety of uses. The new standards that are emerging provide general-purpose interfaces for creating, modifying, navigating, managing, formatting, and displaying XML documents in a multiuser environment across the Internet. This will not only bring XML and SGML into new application domains, it will also create open standards that can be used to help existing SGML and XML tools communicate in standard editing environments.

In the final part of the paper we have discussed how SGML and XML software components can be used to simplify software development and provide general-purpose functionality in a variety of environments. General-purpose components can be used in many different programs, and components from different programs can be combined to create powerful and flexible systems. At times, these controls may need to access information from the document and use it in ways that depend on the semantics of the document type. Document Interface Objects can be used in these cases to provide general interfaces with implementations for specific DTDs. These are generally useful approaches to developing software components for XML and SGML.


About the author

Jonathan Robie has recently joined Texcel Ventures as a Research Consultant. Prior to this he was the SGML Product Manager at POET Software, where he helped design an SGML repository. Mr. Robie has two years' experience with SGML repository design, seven years' experience with object-oriented databases, object-oriented design, and object-oriented languages, and a total of eleven years' post-graduate experience as a computer scientist. He has an M.S. in Computer Science from Michigan State University.

Jonathan Robie
Texcel Ventures, Inc.
3207 Gibson Road
Durham
North Carolina 27703
Telephone 919 598-5728


Copyright © Texcel N.V. All rights reserved.