XML Ready for Prime Time?

[Archive copy mirrored from: http://www2.echo.lu/oii/en/xml.html, text only]

London, 22nd April 1997

Following the announcement of the formal specification for the eXtensible Markup Language (XML) at the 6th World Wide Web conference in California on April 8th, Tim Bray, one of the editors of the new specification, chaired the first European conference devoted to XML in London on April 22nd. The presentations at this conference were:

HTML, XML or SGML - what's the difference?, Martin Bryan, The SGML Centre, UK
Delivering XML, SGML and HTML from Databases, John Chelsom, CSW Informatics, UK
XML for client-side applications - benefits and constraints, François Chahuneau, AIS, France
Publishing opportunities using XML, Peter Flynn, Silmaril, Ireland
The Lark XML Processor, Tim Bray, Textuality, Canada
Character set issues in XML, François Chahuneau, AIS, France
Let's go buy XML: how XML could change the marketplace, Pam Genussa, Database Publishing Systems, UK

HTML, XML or SGML - what's the difference?

After a brief introduction to the conference by Tim Bray, Martin Bryan highlighted the differences between the HyperText Markup Language (HTML) currently used on the World Wide Web (WWW), the generic solution provided by ISO's Standard Generalized Markup Language (SGML) and the new XML specification issued by the SGML on the Web working group of W3C.

HTML is defined as an SGML document type definition (DTD) that contains a fixed set of data markup tags that can be used to identify many of the elements found in typical electronic documents. The key to the success of the WWW has been in the pre-defined behaviour associated with some of these elements. In particular Mr. Bryan highlighted five types of behaviour associated with HTML documents:

links between anchors
images and image maps
forms
Java applets
styles and scripts.

HTML provides users with the ability to link from one piece of data to another. It does this by embedding HTML anchors into files which reference other files through a Uniform Resource Locator (URL). It is the ability to reference material stored on other sites that is the key to the success of the WWW.

HTML allows three types of images (GIF, PNG and JPEG) to be referenced within files. In all cases the image data is stored in a separate file, which is referenced, using a URL, from an image element within the text. Images can be associated with image maps which identify hotspots within an image that act as anchors to information stored in other files, or elsewhere in the current file.

Java applets can be used to import compiled, or run-time interpretable, coded applications into a file to perform functions that a standard browsing tool cannot handle. Based on a portable interpreter, Java provides a standardized way of moving programs between applications in an open environment.

Options to allow JavaScript and other types of program control scripts, including specialized scripts for controlling the presentation styles of HTML documents, have recently been added to HTML. At present, however, these elements only provide placeholders for future extensions to the language.

SGML, which has been used in large documentation projects for more than 10 years now, does not have a predefined set of tags. Instead it requires that each document start with, or contain a pointer to, a formal document type definition (DTD) which defines the elements in the document and the order in which they can validly be used. This DTD can reference non-textual data stored in separate (external) entities using any notation processor on the local system that has been formally defined through a notation declaration. One of the problems with this approach is that there is no guarantee that the same set of notation processors will be available on all receiving systems, or that two systems will use the same name to identify a common protocol. Part 9 of ISO/IEC 9573, which should be published later this year, will help to solve this problem by providing an agreed set of naming conventions for standardized notations.

While SGML has no predefined behaviour associated with elements, many of its applications do. One of these applications defines a Hypermedia/Time-based Structuring Language (HyTime). HyTime includes methods for defining complex n-to-n linking structures, areas of images (including moving images) with which behaviour is to be associated, and event schedules that can be used to script presentation scenarios.

A new standard, to be based on HyTime, will define how Topic Navigation Maps can be used to link together files that do not directly reference one another by associating them with a common topic, for which a formal definition has been recorded.

XML is a subset of SGML that is designed to be easily usable on the WWW. It avoids optional features of SGML, especially those that require the document to be processed before its full structure can be determined. XML documents must conform to a set of rules that ensure that they are "well-formed", with no missing tags or attributes.
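The well-formedness rule can be illustrated with a short sketch. It assumes the XML parser in Python's standard library (a much later tool than any discussed at the conference), and the memo element names are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Every start tag has a matching end tag and the attribute value is quoted.
good = '<memo priority="high"><to>Editors</to><body>Draft ready.</body></memo>'

# HTML-style shortcuts: the <to> element is never closed and the
# attribute value is not quoted.
bad = "<memo priority=high><to>Editors<body>Draft ready.</body></memo>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A non-validating check of this kind needs no DTD: the parser only has to confirm that tags nest and close properly and that attributes are complete.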

Like SGML, XML documents can have any elements they require, and contain references to non-textual data coded using any notation defined in the associated DTD. Like HyTime, XML documents can contain 1-to-n links to data in the same or other files. Like HTML, other files must be referenced using URLs.

At present only XML links have pre-defined behaviour associated with them, but this will change soon. A special mechanism for defining presentation behaviour through style specifications that conform to the DSSSL On-Line (DSSSL-O) subset will be defined shortly. Once this has been defined it will be possible to determine what other mechanisms are needed to associate behaviour with XML elements.

Delivering XML, SGML and HTML from Databases

John Chelsom started by explaining how TCP/IP was used to connect machines on the WWW, using protocols such as HTTP and MIME to determine which files they should exchange.

Initially Internet servers relied on sending the whole of the requested file to the requestor, where a "fat client" would do all of the work needed to process the data. Modern systems, however, try to use the server to reduce the amount of data transmitted to a minimum by delivering small documents that just contain the information needed by the user, who can then use a cheaper "thin client" to process the data.

A Common Gateway Interface (CGI) has been defined that allows information to be passed from clients to servers. Often the required information will be captured using an HTML form, which returns the information in a format that conforms to the CGI specification. Typically the response is returned in the form of a specially generated HTML fragment rather than as an existing document.
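As a rough illustration of that round trip, the sketch below encodes one form field the way a browser would for a GET request, then decodes it as a CGI script might and builds a generated HTML fragment. The `QUERY_STRING` variable comes from the CGI convention; the `author` field and the `handle_request` function are hypothetical names chosen for the example.

```python
from html import escape
from urllib.parse import parse_qs, urlencode


def handle_request(environ) -> str:
    """Build a CGI-style response: header, blank line, generated HTML fragment."""
    # CGI hands the encoded form data of a GET request to the script
    # in the QUERY_STRING environment variable.
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    author = fields.get("author", ["unknown"])[0]
    fragment = f"<p>Talks by {escape(author)}</p>"
    return "Content-Type: text/html\r\n\r\n" + fragment


# What the browser sends for a form with a single "author" field.
query = urlencode({"author": "Tim Bray"})

# A plain dict stands in for the environment the server would provide.
response = handle_request({"QUERY_STRING": query})
print(query)
print(response)
```

The fragment is assembled on demand from the submitted values, which is why it exists only as a response and not as a stored document.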

When an HTML fragment is created from the results of a database query generated by a CGI processing script, information about the structure of the data as stored in the database is lost. This makes it difficult for those receiving the data to reuse what they receive. When XML is used in place of HTML, however, all the information needed to reconstruct the original database entries can be interchanged either as part of the document type definition or through the markup that is used to identify the role of each piece of transmitted data.
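One way to picture markup that identifies the role of each piece of data is to name each element after the database column it came from. The sketch below does this with an in-memory SQLite table; the `talks` table, its columns, and the `rows_to_xml` helper are assumptions made for illustration, and the column names are assumed to be valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, table_name: str) -> ET.Element:
    """Wrap each result row in a <row> element whose children are named after the columns."""
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table_name)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            ET.SubElement(record, name).text = str(value)
    return root


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE talks (speaker TEXT, title TEXT)")
db.execute("INSERT INTO talks VALUES ('Tim Bray', 'The Lark XML Processor')")

cursor = db.execute("SELECT speaker, title FROM talks")
xml_result = rows_to_xml(cursor, "talks")
print(ET.tostring(xml_result, encoding="unicode"))
```

A receiver can recover which value was the speaker and which the title from the element names alone, something an HTML paragraph or table cell would not record.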

Java allows database queries to be generated directly from within the client, without having to invoke an intermediate CGI processing script at the server. A Java Database Connectivity (JDBC) standard is currently being developed. Java is currently slow (very slow if you use its interpreted form rather than its compiled form). Whilst it is sufficient for use on Intranets, where speed can be optimized, it does not work well in environments where you cannot be assured of connection to the data source. The Java approach only works when you know what is in the database before requesting it. If you don't know the data structure before you receive the database response Java applets are not the answer.

XML provides a subset of SGML that is ideal for processing using Java. A Java-encoded XML tool can process data from any database without having to be aware of its structure. Even when no formal DTD accompanies the data to define the database schema, the Java program can be sure that the data is well-formed and that, therefore, it can determine the data structures without any chance of ambiguity.
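The same point holds for any XML processor, not only one written in Java. The following sketch, written in Python for brevity, recovers the element structure of a document it has never seen and for which it has no DTD; the order document and the `outline` helper are invented for the example.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and all its descendants, in document order."""
    pairs = [(depth, element.tag)]
    for child in element:
        pairs.extend(outline(child, depth + 1))
    return pairs


# No DTD accompanies this document; the nesting is read from the tags alone.
document = "<order><customer><name>A. Reader</name></customer><item>Handbook</item></order>"

structure = outline(ET.fromstring(document))
for depth, tag in structure:
    print("  " * depth + tag)
```

Because every element is explicitly opened and closed, there is only one possible tree for the processor to build.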

XML uses queries that are a subset of the query language defined for the Text Encoding Initiative (TEI) to allow processing of XML elements to be defined using generic rules. John Chelsom, however, felt that the full power of the SGML Document Query Language (SDQL) defined in DSSSL would be needed for many applications.

XML for client-side applications - benefits and constraints

In the first of his two presentations, François Chahuneau started by discussing the way in which SGML was being used today to prepare databases that could be queried to generate HTML fragments. There are two models, both based on server-side processing of source files:

Preprocessing SGML files to generate HTML fragments that can be delivered when requested.
Creating HTML fragments from the SGML database at the time the request is made.

On-the-fly conversion is sufficient when only a few users at a time need to access the database, such as in Intranet environments. When a system could be queried by hundreds of people at a time, pre-processing of the files is better because of imperfections in the HTTP/HTML session control mechanisms, which can lead to failure of the connection before the relevant fragment can be generated if too many requests are received at the same time.

The client-server model scales because more clients means more processing power in a "fat client" scenario. With network computing "thin clients" this is no longer true. The more clients there are the more work the server needs to do. This model does not scale properly.

Databases are a way of separating general purpose information objects from application-specific presentation mechanisms such as those provided by HTML.

It is not a good idea to store structural information within a Java application because such information cannot be communicated to other applications on the same server. The structural information needs to be retained in the data processed by the application so that it can be passed on to other applications. XML provides a mechanism for doing this.

Another advantage of XML is that you do not need to send all of an XML file from client to server. Because XML documents are well-formed it is possible to extract any sub-tree from an XML file and treat that as an XML fragment which will conform to the same document type definition as its parent document.
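A minimal sketch of sub-tree extraction follows, assuming Python's standard XML library and an invented report document. It shows only that the extracted fragment is well-formed in its own right; no validation against a DTD is attempted.

```python
import xml.etree.ElementTree as ET

report = ET.fromstring(
    "<report>"
    "<session id='s1'><title>Character set issues in XML</title></session>"
    "<session id='s2'><title>The Lark XML Processor</title></session>"
    "</report>"
)

# Select one sub-tree and serialize it on its own.
subtree = report.find("session[@id='s2']")
fragment = ET.tostring(subtree, encoding="unicode")
print(fragment)

# The fragment parses without its parent document.
reparsed = ET.fromstring(fragment)
print(reparsed.find("title").text)
```

Only the selected session needs to travel over the network; the rest of the report stays where it is.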

XML browsers will need to be HTML compatible so that they can accept any document currently pointed to by another document. Typically they will consist of a Java plug-in that can be run within an existing HTML browser. One advantage of this is that the XML application thereby does not need to include all the ancillary things that nowadays go with a web browser, such as handlers for e-mail, FTP, news servers, etc.

M. Chahuneau ended by showing how the AIS Balise SGML engine has been integrated with Internet Explorer to provide a general purpose tool for querying SGML databases over the Internet. When accessing an SGML file Balise is invoked to translate the SGML document structure to an HTML format that can be displayed by the associated browser. This conversion is done at the "fat client", so reducing the load on the server. For performance reasons ActiveX is used to transfer data between the applications.

Publishing opportunities using XML

Peter Flynn, who combines consulting with teaching at University College, Cork, started by outlining the relationship between the IETF HTML committee, of which he is a member, and the W3C XML effort. He highlighted the fact that many publishers are looking for a WYSIWYG approach to electronic delivery. While XML does not give you this, it will allow publishers more control of the way data is presented to users than HTML does.

HTML files are typically delivered in small fragments as most browsers are too slow to deal with the large documents found in typical publishing applications. (Much of the code in Netscape is there simply to deal with markup errors and with elements whose use is now deprecated.)

One of the problems is how to provide navigation in environments where documents are being generated dynamically. What is the electronic equivalent of a table of contents or index on such a system?

Those thinking of developing XML systems should bear in mind that there are still more than 20 million Internet users who only have access to text-based browsers. Many of these are connected to e-mail systems, etc., through company mainframes using simple terminals. In addition, tool developers should bear in mind that the Internet is increasingly being used by visually disadvantaged users.

Publishing involves the application of consistency to varied types of data. This is very hard to do with HTML. For this reason HTML has been a barrier to publishing on the Internet. XML allows consistency of logical structure to be checked prior to publication, and allows house-specific presentation styles to be developed that can be shared by applications that share the same DTD.

Publishers require multicolumn grids for creating professional publications. Such grids are difficult to do with HTML, even using the frames add-ons. Mr. Flynn expressed the need for the XML style language to cope with multicolumn and split layouts. Another area of concern was the provision of facilities to provide both synchronized and independent scrolling mechanisms for viewing multilingual texts.

Before publishers can adopt XML they must convince authors to create XML source documents. For this to be possible authors will need to be able to create documents easily in an environment that represents the document in the form it would be displayed by a browser, i.e. using the browser's style sheet to control display in the editor.

During the lunch break Stilo Associates from Wales and Grif from France showed how they had been able to convert their SGML and HTML tools to produce and display XML files.

The Lark XML Processor

Before talking about his Lark XML processor Tim Bray gave a report on how the XML specification had been received at the 6th World Wide Web conference in Santa Clara, California, at the beginning of the month.

Since the initial presentation of the first draft of the XML specification at the SGML '96 conference in Boston last November, Microsoft has looked seriously at XML, and has used it for its Channel Distribution and Web Connectivity proposals. Microsoft has also announced that it intends to make its SGML-based Open Financial Exchange specification XML compatible. It is expected that the next preview of Microsoft's Internet Explorer 4 will be XML-enabled.

Prior to the WWW conference Netscape had been saying that they saw no need to support XML. During the conference, however, it became clear that users were very interested in it. Whenever there were parallel sessions the XML segment of the session was the best attended. By the end of the week XML was the main talking point, and Netscape had accepted that they needed to look again at their policy.

A key factor in the interest generated by XML was the publication of a draft XML-link specification which will allow a single anchor to point to multiple objects. Long seen as one of the stumbling blocks of HTML, the publication of a concrete proposal for multi-headed links, coupled with demonstrations of a number of XML tools, greatly helped XML to gain acceptance from Internet experts.

Tim also explained some of the factors that had influenced the design of XML, stressing the choice of a simplified tree model to check document validity, and the problems that using the full ISO 10646 character set gives in determining character offsets. He pointed out that in general support for ISO 10646 within Java is weak.
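The offset problem can be seen in a small sketch: once characters occupy more than one byte, the position of a character in the text and its position in the stored bytes no longer agree. The sample word and the choice of the UTF-8 encoding are assumptions made for the illustration, not details taken from the talk.

```python
text = "Łódź"

# Offset counted in characters.
char_offset = text.index("d")

# Offset counted in bytes of the UTF-8 encoding, where the two
# preceding characters each occupy two bytes.
byte_offset = text.encode("utf-8").index(b"d")

print(char_offset)
print(byte_offset)
```

A link or index that records one kind of offset cannot be resolved by software that counts in the other.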

The Lark processor was designed to check the validity of the XML specification. As it was the first program that Tim had written in Java it took him some time to get to grips with basic things like string and error handling. He also found he needed to develop workarounds for Java problems. This meant that it took a few weeks to develop the processor, rather than the stated aim of producing one with a week or two's coding effort.

Lark was tested using an XML test suite developed by Michael Sperberg-McQueen, the other editor for the specification.

One of the questions that still needs to be answered is what form of API should be used to connect XML processors to other applications. An agreement on this needs to be made quickly if we are to avoid having a set of application-specific answers to this question being developed.

Character set issues in XML

François Chahuneau returned to explain the implications of XML's adoption of the ISO 10646 BMP code set. After explaining clearly the difference between a character and a glyph, M. Chahuneau explained the difference between the various character sets used in Asia. While the existing version of ISO 10646, which exactly matches Version 2.0 of Unicode, deals with the majority of Asian characters used in business, it does not cover all such characters. Extensions are, therefore, currently being discussed.

ISO 10646 covers most of the languages currently being used on the Internet. Cyberbit provide a shareware font that covers some 60% of the character set. It is impossible to process ISO 10646 efficiently using single byte software. XML software should be designed from stage one for multi-byte processing.

At present there are no Unicode-enabled lexical or pattern-matching algorithms in the public domain (though Microsoft are looking into providing them for the XML community). Such lexical engines will need to work in multibyte form as there is no way of doing accurate matching based on single-byte analysis.
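A short sketch shows why single-byte analysis gives inaccurate matches. It assumes a two-byte-per-character encoding (UTF-16, big-endian) and a sample word chosen for the purpose; neither is taken from the presentation.

```python
text = "Łódź"  # contains no letter "A"
needle = "A"

# Character-level search: "A" does not occur in the text.
char_match = needle in text

# Byte-level search over the two-byte encoding: U+0141 (Ł) is stored as
# the bytes 01 41, and 41 is also the single-byte code for "A".
byte_match = needle.encode("ascii") in text.encode("utf-16-be")

print(char_match)
print(byte_match)
```

The byte-level search reports a match that does not exist, because it cannot tell half of one character from the whole of another.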

During the tea break, Peter Murray Rust from the Virtual School of Molecular Sciences at the University of Nottingham demonstrated how his JUMBO XML processor could be used to convert Chemical Markup Language (CML) coded data into three-dimensional interactive diagrams which could be rotated to show the shape of proteins and clicked on to take you to details of individual molecules in the protein held within the XML file.

Let's go buy XML: how XML could change the marketplace

In the closing session Pam Genussa started by illustrating the effect of the 80-20 rule on the development of XML tools. XML involved implementing the 80% of the SGML specification that could be implemented in 20% of the time needed to develop a full SGML processor. XML could handle some 80% of documents, compared with the 20% that can be handled by HTML. (Remember that HTML only deals with European character sets at present.)

When looking at who will be able to develop XML browsers it is important to remember that while it is easy for an SGML system vendor to restrict his system to handle only the XML subset, it is much harder for him to add on other aspects of an Internet browser, such as TCP/IP, FTP, e-mail, etc. On the other hand changing an HTML browser so that it is XML compatible often involves the introduction of formalized verification methods for which the original software was not designed. In addition, XML browser developers have to be able to support an extensible set of tags with no predefined presentation specifications.

Pam reviewed the impressive list of companies that had already announced their intentions to support XML. It was noted that the Microsoft Java development kit already contains an XML parser, and that Sun will be releasing one as part of their JavaSoft library.

The initial market for XML products will be for corporate customers rather than consumers. XML will be particularly important in the development of Intranets. XML will need to improve the usability of current SGML tools, to reduce the cost of supporting systems. This will in turn lead to an increasing awareness of SGML and a lower cost of developing general solutions. It is felt that XML will provide a useful "on-ramp" to full SGML document management.

Martin Bryan


File created: April 1997
