[This local archive copy mirrored from the canonical site: http://www.bcs.org.uk/publicat/ebull/may98/xml.htm.]


The Computer Bulletin

May 1998

The XML files

XML is being hailed as the answer to information management through a standard way of defining the content of documents. But is it too late for companies already awash with word processor files and information gathered from the Web? Alfie Kirkpatrick looks into the language - and the future of information management

Anyone reading any of the many articles now appearing on XML (Extensible Markup Language) might be forgiven for thinking that XML will solve all our data and information management problems in one fell swoop. However, it is important to understand when and where XML will be useful - and also to appreciate that although it is a simple technology it cannot be used for effective information management without a corresponding organisational change.

XML is a simplified form of SGML (Standard Generalised Markup Language), which has been an international standard for over 10 years. The original aim was to provide long-term storage of information which was independent of suppliers and of changes in hardware and software technology.

The solution was radical and arguably 10 years ahead of its time: a notation which does not describe a particular type of document, either at the application level (such as a Microsoft Word document) or at the concept level (such as a memo or program specification), but which describes information objects: the parts of documents and how they relate to each other.

But if SGML is so wonderful, why is it not more widely used, and where does XML come in?

One main reason for SGML's limited success is that the advent of word processing packages has made it incredibly easy to create documents. We have come to expect information creation to be as easy as opening an application and starting to type. The idea of rigidly defining what will go into our document before we start is very alien, time consuming and costly. It requires us to think hard about the content and how it might be used in the future.

So although information management using SGML adds incredible power and versatility, it also adds to application complexity and often requires process changes in an organisation. Expensive tools and the need for specialist consultants have contributed to SGML's poor image among IT managers, and it has been adopted mainly in large, critical applications.

When organisations wanted to change information management processes in the past, they faced the difficult task of making a business case, based on esoteric arguments about issues such as the availability and duplication of information. It has always been difficult to put hard figures to these arguments.

However, with the advent of the Internet and intranets, the business case hardly needs to be made: effective information management is becoming essential to business success, and everyone intuitively knows it.

XML will help. It will enable organisations to store information in a way that reflects their business needs, rather than the needs of an Internet browser (as is the case with HTML). Mechanisms will become widely available for delivering the same information effectively via different browsers, on paper and in other forms. In the short to medium term XML will be converted to HTML and displayed by the browser as normal.

There are two ways this can happen. Either the server converts XML to HTML on the fly, or a script on the client machine does the work.

The first approach is already used in many non-XML systems, which stream HTML into the browser. The data often comes from relational databases or other proprietary data sources. In this context, XML does not add much that is new.

The second approach is more interesting. XML can be delivered as a single file to the client browser. A script or control can then navigate, format and display the document in a highly interactive and usable way. For example, the script might display a contents list on the left and format sections on demand as the reader selects them. More complex data can be displayed as a table or graph, under the reader's control. And all this can be done without accessing the network again in any way.
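As a rough sketch of the second approach, the code below parses a document once and then builds a contents list and formats a single section on demand, with no further requests. Python stands in here for the client script, and the element names (report, section, title, para) are invented for illustration; they do not come from any real document type.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical document: the element names are invented for illustration.
SOURCE = """<report>
  <section id="s1"><title>Overview</title><para>XML stores content, not layout.</para></section>
  <section id="s2"><title>Delivery</title><para>One file serves many views.</para></section>
</report>"""

def contents_list(root):
    """Return an HTML list linking to every section title."""
    items = [
        f'<li><a href="#{escape(s.get("id"))}">{escape(s.findtext("title"))}</a></li>'
        for s in root.findall("section")
    ]
    return "<ul>" + "".join(items) + "</ul>"

def render_section(root, section_id):
    """Format one section as HTML when the reader selects it."""
    for s in root.findall("section"):
        if s.get("id") == section_id:
            paras = "".join(f"<p>{escape(p.text or '')}</p>" for p in s.findall("para"))
            return f"<h2>{escape(s.findtext('title'))}</h2>{paras}"
    raise KeyError(section_id)

root = ET.fromstring(SOURCE)               # the whole document is parsed once
toc_html = contents_list(root)             # contents list for the left-hand pane
section_html = render_section(root, "s2")  # formatted only when selected
```

Because the complete document is already in memory, choosing a different section is simply another call to render_section.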

Although Microsoft VBScript or JavaScript code can convert from XML to HTML fairly easily, this approach is relatively slow and difficult to maintain. The key is to define a standard mechanism for this conversion, and the World Wide Web Consortium is working on the Extensible Style Language (XSL) for just this purpose. XSL style sheets, which are themselves XML, define how a source document should be rearranged and displayed in HTML. XSL can also define the visual characteristics directly, ready for support by browsers.
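The benefit of moving conversion rules out of script code and into data can be sketched as follows. This is not XSL syntax and makes no claim about how an XSL processor works; the rule table, the element names and the fallback to span are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rule table: source element name -> HTML tag to emit.
RULES = {"memo": "div", "title": "h1", "para": "p", "emph": "em"}

def to_html(elem, rules):
    """Convert an element and its descendants using the rule table."""
    tag = rules.get(elem.tag, "span")   # unknown elements fall back to <span>
    inner = escape(elem.text or "")
    for child in elem:
        # child.tail is the text that follows the child inside this element
        inner += to_html(child, rules) + escape(child.tail or "")
    return f"<{tag}>{inner}</{tag}>"

source = ET.fromstring(
    "<memo><title>Stock</title><para>Levels are <emph>low</emph>.</para></memo>"
)
html_out = to_html(source, RULES)
```

Changing the presentation then means editing the table rather than the conversion code, which is the maintenance gain a standard style sheet mechanism is meant to deliver.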

Microsoft, which is squarely behind the XML effort, already has two XML parsers in Internet Explorer 4. One is written in C++, for performance, and one in Java. The source code is available for the Java parser and Microsoft hopes that it will become a widely used reference implementation. Microsoft also has an XSL processor and demonstration applications available, although unfortunately they only work with Internet Explorer 4.

Another area for XML is application-specific data delivery. A Java or ActiveX applet in a Web page can download the data it needs in XML format. The advantage is that a ready made data format and reader are available, reducing overall code size and speeding development. There are many applications which fall into this category, ranging from a statistical analyser to a chemical compound viewer.

An important initiative from the World Wide Web Consortium is the definition of a Document Object Model (DOM) for XML and HTML documents. This defines a standard set of interfaces for accessing document contents from programs. In this model, the document is represented by a document node, or entry point. This in turn has child nodes representing the sub-elements and text content, and so on. In this way an in-memory tree is built which represents the entire document. Client scripts and applications can then navigate and manipulate the document directly.
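The tree described above can be explored with any DOM implementation. The sketch below uses Python's xml.dom.minidom module purely as a convenient stand-in for a browser's object model; the memo document and the outline helper are invented for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical source document, parsed into an in-memory DOM tree.
doc = parseString("<memo><to>Sales</to><body>Stock is <em>low</em>.</body></memo>")

def outline(node, depth=0):
    """Walk the tree, returning one indented line per element or text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        lines.append("  " * depth + repr(node.data))
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

root = doc.documentElement   # top-level element beneath the document node
tree_lines = outline(root)

# Manipulate the tree directly: append a new child element holding text.
note = doc.createElement("note")
note.appendChild(doc.createTextNode("Reorder today"))
root.appendChild(note)
```

The calls used here (documentElement, childNodes, createElement, appendChild) are part of the DOM interfaces themselves, so the same navigation reads almost identically in other languages that implement them.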

The Document Object Model also has the potential to end the current differences in the object models exposed by different browsers, notably Microsoft's Internet Explorer and Netscape's Navigator. Script writers are often driven to despair by the similar but subtly different approaches taken by the two main suppliers.

The Document Object Model is also important for client/server applications. It will be possible for client software to work efficiently and seamlessly with content from very different sources, provided the content is supplied as DOM objects. Even if network or operating system considerations make this impossible, XML can be delivered and converted to DOM objects on the client machine.

There is one big problem. Where will the information come from? XML is all about automating information delivery and display but we have already automated most of the data we can. Any information held in a database or document management product such as Lotus Notes can already be delivered fairly easily on the Internet. The information is well organised and indexed. In this area, XML will provide standardisation and new tools but will not fundamentally change the way we work.

In contrast, there is a vast amount of information held in documents produced by word processors, spreadsheets, and so on. This is precisely the information that organisations increasingly need to manage, share and re-use, and XML is well suited for this task. However, this process cannot be automated. Automation can only be applied once a document's content is tightly defined and classified, and most office documents fall far short of the mark. Take 10 documents at random and try to imagine how to automatically extract the most basic element, the title, and you will see the problem.

To manage documents properly, an organisation needs to define a strict set of document classes, or types, and enforce their use. SGML applications have done this for some years, often with great success.
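To give a flavour of what enforcing a document class means, the sketch below checks that a document's top-level structure matches a declared class. It is a deliberately crude stand-in for validation against a real SGML or XML document type definition, and the memo class with its three child elements is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document class: a memo must contain exactly these
# child elements, in this order.
MEMO_CLASS = {"memo": ["title", "author", "body"]}

def conforms(xml_text, doc_class):
    """Return True if the root element's children match the declared class."""
    root = ET.fromstring(xml_text)
    expected = doc_class.get(root.tag)
    if expected is None:
        return False                    # unknown document type
    return [child.tag for child in root] == expected

good = "<memo><title>Stock</title><author>A. Writer</author><body>Low.</body></memo>"
bad = "<memo><author>A. Writer</author><body>Low.</body></memo>"

good_ok = conforms(good, MEMO_CLASS)
bad_ok = conforms(bad, MEMO_CLASS)

# Once a document conforms, extracting the title is a one-line lookup.
title = ET.fromstring(good).findtext("title")
```

The contrast with free-form office documents is the point: the title can be extracted automatically only because the class guarantees where it is.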

Once defined in this way, documents can be broken apart, stored, delivered and re-used in many ways. But this is demanding and costly. Converting existing documents and providing a familiar editing method which is as good as the word processors already in use are often the biggest obstacles.

People familiar with SGML welcome XML for its potential to widen the market and provide mainstream editing tools and processors. But many are worried that XML will dilute the discipline SGML demands to such an extent that it will become a major headache.

As a contributor to the SGML Internet newsgroup put it recently, 'The moment a significant player produces a non-validating XML editor that simply lets Joe Homepage create pages visually just like in Word, and export a vaguely formed file plus a style sheet, is the day that end-user XML takes off big time.'

The challenge for many organisations is to change the way information is created at source, to impose processes and structures which will gather and manage the information effectively. Technology can help an organisation only if the organisation does not shun this core responsibility and does not expect the technology to come up with all the answers.

XML in action: Simplifying electronic data interchange
The prospect of electronic commerce over the Internet has put new demands on the electronic data interchange (EDI) messaging standards developed over the last two decades, writes David Webber.

One solution is proposed by the XML/EDI Group, which intends to make EDI more versatile and lower its cost by combining it with the Internet and XML, while maintaining compatibility with existing EDI documents. This will also extend the usefulness of EDI, since XML/EDI will enable EDI documents to be read by any Web browser handling XML, as well as by dedicated XML/EDI applications.

The problem with traditional EDI is that document interchanges are costly and complicated both to set up and to support. The XML/EDI Group, administered by the Graphic Communications Association Research Institute, is working on a new and more flexible approach which combines both data and processing rules.

Since March 1997 the US EDI standards body, ASC X12, has been working with the XML/EDI Group to create an XML-based equivalent of the existing X12 EDI message sets. The international Edifact EDI standards are also being considered.

Users can create a purchase order for a supplier in XML, defining and using descriptive tags to indicate what they want to buy. At the receiving end the supplier's software could automatically extract these tagged elements and pass the information directly into an order entry system.
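The extraction step at the supplier's end might look something like the sketch below. The tag names (purchaseOrder, line, item, quantity, price) are invented for illustration and are not taken from any XML/EDI or X12 specification; the order entry system is represented only by a list of plain records.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical purchase order; every tag name here is an assumption.
ORDER = """<purchaseOrder>
  <line><item>Widget</item><quantity>12</quantity><price>2.50</price></line>
  <line><item>Bracket</item><quantity>4</quantity><price>1.25</price></line>
</purchaseOrder>"""

def extract_lines(xml_text):
    """Pull the tagged values out of each order line as plain records."""
    root = ET.fromstring(xml_text)
    return [
        {
            "item": line.findtext("item"),
            "quantity": int(line.findtext("quantity")),
            "price": Decimal(line.findtext("price")),
        }
        for line in root.findall("line")
    ]

records = extract_lines(ORDER)   # ready to hand to an order entry system
order_total = sum(r["quantity"] * r["price"] for r in records)
```

Because each value is labelled by its tag, the receiving code never has to rely on field positions or separators.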

Merely combining XML syntax with EDI is not enough to solve the problems of traditional EDI. But XML brings other benefits. EDI documents will be able to carry not only transaction data but also routing, work-flow, and processing information. XML/EDI documents will be self-describing: new document types could be sent and interpreted without the need to predefine them through lengthy manual procedures, as in the past.

In addition, XML/EDI documents could even include agent objects, such as Java code, or links to code on a server, which will enable them to perform calculations, validations and data transformations without the need for local programming.

Much work still needs to be done. So far the XML/EDI Group has developed an initial guidelines document, and it is now extending this to a general collection of information for implementing XML/EDI. Efforts are also under way to bring a definitive XML/EDI standard into being in the long term.

*The XML/EDI Group is at www.xmledi.net. David Webber, a BCS Affiliate member, is co-founder of the XML/EDI Group.

XML structure
XML files consist of simple mark-up, or tags, which define an element hierarchy. Elements act as containers for other elements and text, to an unlimited depth. The simplest XML document could be:
<doc>This is a document</doc>
This consists of one element (doc) and two tags (the start and end tags). HTML users will be very familiar with this syntax.
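A minimal sketch of reading such a one-element document, assuming the element is named doc as the text states. Python's standard xml.etree.ElementTree parser is used as an example; the nested document and the is_well_formed helper are illustrative additions.

```python
import xml.etree.ElementTree as ET

# The simplest document: one element with a start tag, text and an end tag.
root = ET.fromstring("<doc>This is a document</doc>")
element_name = root.tag      # name of the single element
text = root.text             # the text it contains
child_count = len(root)      # it has no child elements

# Elements can contain other elements, nested to any depth.
nested = ET.fromstring("<doc><part><para>Deeper text</para></part></doc>")
deep_text = nested.find("part/para").text

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

complete = is_well_formed("<doc>This is a document</doc>")
missing_end_tag = is_well_formed("<doc>This is a document")
```

Unlike many HTML browsers, an XML parser rejects a document whose end tag is missing, which is why missing_end_tag is False here.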

About the Author

Alfie Kirkpatrick, a BCS Student member ('always meaning to upgrade'), is an information architect at IMS International, specialising in document authoring and production software.
