[This local archive copy mirrored from the canonical site: http://www.javaworld.com/javaworld/jw-02-1998/jw-02-miko.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

February 1998

XML speeds along in standards land

Java's little brother has gained support from a number of competitors. Find out what this fresh-faced technology holds for you

Summary
XML 1.0 is expected to become a W3C Recommendation next month, which would make it a standard on par with HTML. XML, the so-called baby brother of Java, opens up a whole new arena to Web developers. This month's column discusses this burgeoning standard and explores the significance of XML to Java developers. (2,900 words)
Note: Charles Axel Allen of webMethods and John Tigue of DataChannel contributed to this article.

By Miko Matsumura

The World Wide Web Consortium (W3C) was founded in 1994 to develop common protocols for the evolution of the World Wide Web. W3C is the international standards organization that brought you HTML. Currently, the W3C is reviewing, among other technologies, XML (eXtensible Markup Language) 1.0, a "meta-grammar" that allows for Web automation and data interchange across multiple platforms and applications.

So why should you as a Java developer be concerned with this emerging technology? Well, Java and XML complement each other. Java provides platform-independence, XML provides application-independence; Java gives the consumer a choice of platforms, XML gives the consumer a choice of applications. XML furthers the cause of Java by furthering the cause for consumer freedom.

Java provides a platform-independent coding environment, and XML provides a similar universality in terms of how it expresses and formats data. In essence, XML provides a grammar that can be used to create self-describing data file formats. Thus Java can be viewed as the universal Virtual Machine, and XML can be viewed as the universal Virtual Document. Java is a perfect architecture- and vendor-neutral language for processing these architecture- and vendor-neutral documents.

Now I'm no XML expert, but I have been doing my homework -- reading the XML 1.0 specification, corresponding with members of the Working Group, and mulling over many of the XML FAQs and tutorials that are available. XML is chock-full of bewildering new acronyms and specialized language. My goal for this article is not to teach you XML -- you can study up on your own with one of several good tutorials (see Resources). Rather, my aim is to cut through as much of this language as possible and explain the significance of XML to you, the Java programmer.

Before I begin, I must acknowledge that the connections between Java and XML have been admirably elucidated by Jon Bosak in his seminal paper: "XML, Java, and the future of the Web." I simply aim to add my perspective on this emerging technology.

XML defined: the knit apparel analogy
XML is a simplified dialect of SGML (Standard Generalized Markup Language). For those of you unfamiliar with SGML, it is an international standard (ISO 8879) for describing the structure and content of documents in electronic form. XML simplifies SGML by capturing about 80 percent of SGML's functionality with only 20 percent of the complexity.

HTML, which is a description of the structure and content of a single type of document called a "Web page," is just one instance of what can be created with SGML. In other words, if HTML is a single knit sweater, SGML and XML are how-to books on knitting. By learning XML, you can create sweaters, socks, leg warmers, or any kind of knitted apparel you want!

As I noted earlier, XML currently is working its way through the W3C standards process. For more information on what this means for XML, see the sidebar W3C reviews XML.

Key characteristics
In my own (admittedly simplistic) reduction of the key characteristics of XML, I divide the capabilities offered into four categories. Simply put, XML is:

Structured
Self describing
Extensible, and
Viewer adaptive

Let's look at each of these characteristics in more detail.

Structured
XML is an extremely structured language specification. Good XML can be both well-formed and valid. More on these features in a moment.

Like SGML, XML uses a DTD (document type definition) to define the syntax, grammar, and data structure of your documents. A DTD declares which elements may appear and how they may nest, and whether each declared element is required, optional, or repeatable. For each attribute, it states whether a value is required, implied (optional), or supplied by a default. A DTD can also declare that an element must be empty.

An XML parser checks a document at two levels. A document is well-formed if it obeys XML's basic syntax rules, such as properly matched and nested start and end tags; this check does not require a DTD. A document is valid if it is well-formed and also conforms to its DTD in its entirety. Variance is not allowed: a single well-formedness error is fatal and stops normal processing of the document. A validating parser can work from a DTD embedded in the document itself, from an externally defined DTD referenced by the <!DOCTYPE> declaration, or from a combination of the two, and a non-validating parser can process a well-formed document with no DTD at all. Applications can layer further checks on top using scripted business logic rules or an externally defined set of processing instructions.
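
To make the distinction between well-formed and valid concrete, here is a minimal sketch in Java. It assumes the JAXP parser API (javax.xml.parsers) that ships with current JDKs, which is not one of the parsers listed in Resources. The XmlChecks class, its method names, and the tiny "note" DTD are all hypothetical, invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class XmlChecks {
    // Hypothetical DTD, embedded in the document: a <note> holds one <to> and one <body>.
    static final String DTD =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note (to, body)>\n"
      + "  <!ELEMENT to (#PCDATA)>\n"
      + "  <!ELEMENT body (#PCDATA)>\n"
      + "]>\n";

    // Parses the text; with validate=true the parser also checks it against its DTD.
    private static boolean parses(String xml, boolean validate) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Treat validity errors and fatal (well-formedness) errors as failures.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser could not be set up or read", e);
        }
    }

    static boolean isWellFormed(String xml) { return parses(xml, false); }

    static boolean isValid(String xml) { return parses(xml, true); }

    public static void main(String[] args) {
        String good = DTD + "<note><to>Miko</to><body>Hi</body></note>";
        String missingTo = DTD + "<note><body>Hi</body></note>"; // well-formed, but breaks the DTD
        String broken = "<note><to>Miko</note>";                 // tags do not match
        System.out.println("good: wellFormed=" + isWellFormed(good) + " valid=" + isValid(good));
        System.out.println("missingTo: wellFormed=" + isWellFormed(missingTo)
            + " valid=" + isValid(missingTo));
        System.out.println("broken: wellFormed=" + isWellFormed(broken));
    }
}
```

The same parser does both jobs; the only difference is whether validation is switched on, which is why a document can pass the first check and still fail the second.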

So what do we Java Hackers gain by creating a more rigid data structure? One of the significant benefits of such a structure is the ease with which you can map the document's attributes to database structures or object hierarchies. This gives you a reliable mechanism for passing documents back and forth between a client's viewer and the database, or for fluidly exporting data between two databases using a structured XML document as an intermediary. That is, we gain a reliable means of extracting information from documents (what we familiarly call parsing). Without well-formed documents, we would have to rely on pattern matching to scan a poorly formed document for elements.
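
As a rough illustration of mapping a document onto an object hierarchy, the sketch below reads a small order document into plain Java objects. The order/item vocabulary, the OrderMapper class, and its fields are hypothetical, and the code assumes the DOM and JAXP APIs bundled with current JDKs rather than any tool named in this article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class OrderMapper {
    // One line item of an order (hypothetical structure).
    static final class Item {
        final String sku;
        final int quantity;
        Item(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }
    }

    // The object that mirrors the document's root element.
    static final class Order {
        final String id;
        final List<Item> items = new ArrayList<>();
        Order(String id) { this.id = id; }
    }

    // Each <item> element becomes an Item; each attribute becomes a field.
    static Order fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            Order order = new Order(root.getAttribute("id"));
            NodeList nodes = root.getElementsByTagName("item");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                order.items.add(new Item(item.getAttribute("sku"),
                                         Integer.parseInt(item.getAttribute("quantity"))));
            }
            return order;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not map document to an Order", e);
        }
    }

    public static void main(String[] args) {
        Order order = fromXml("<order id=\"A17\">"
            + "<item sku=\"X1\" quantity=\"2\"/><item sku=\"Y9\" quantity=\"5\"/></order>");
        for (Item item : order.items) {
            System.out.println(order.id + ": " + item.quantity + " x " + item.sku);
        }
    }
}
```

Because the parser rejects a malformed document outright, the mapping code never has to guess where an element begins or ends, which is the practical payoff of the rigid structure.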

Another way of putting it is that the XML structure makes documents machine-readable. Enabling machines to read the Web allows for the automatic sharing of data among different companies through a standard format. Using a DTD, which describes the grammar of novel elements in a document, you can even connect different formats through a common description. For example, a medical document like a patient record might have allergies or blood pressure or other specialized data described as DTD-specified attributes.

This kind of sharing is ideal for EDI applications (Electronic Data Interchange) and supply-chain integration. One company, webMethods, is pioneering a Java example of this technique. Its Web Automation Toolkit is a 100% Pure Java way to integrate and aggregate Web-based data sources into applications of all kinds. You can download and try it free for 30 days (see Resources).

Self-describing
Another important value inherent in XML is the possibility of self-describing information. Although XML documents are not required to be self-describing (they are required only to be well-formed), descriptions add a level of power to Web automation and navigation. These descriptions are known as Metadata (data about data) and can contain such information about the document as security (who gets to read it), popularity, what the document is about, what language the document is in, who wrote it, or anything at all that describes the information. HTML has a facility for adding Metadata (the <META> tag), but the format for interchanging different Metadata attributes is poorly defined. For example, a site that uses the attribute "author" will not be able to share this with a site using the attribute "writer."

Several different Metadata formats have been created in the XML language, including:

RDF (Resource Description Framework) -- A powerful way to automatically describe what information is available.
CDF (Channel Definition Format) -- A format for publishing Web information on desktops.
PICS (Platform for Internet Content Selection) -- A format initially intended to help create a "ratings" system on the Internet to protect children from adult-oriented material.
WIDL (Web Interface Definition Language) -- Enables automation of all interactions with Web documents and forms, providing a general method of representing request/response interactions over standard Web protocols, and allowing the Web to be utilized as a universal integration platform.

Metadata allows for much more powerful navigation systems. An example of this is a powerful form of search engine that could respond to the following query: "Please find me a wooden doghouse for sale in California that costs under $100." Note that I'm illustrating the complexity of the query, not that XML suddenly enables natural language processing!
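
The doghouse query can be sketched as a structured search over Metadata attributes. The example below is only an illustration: the catalog format, attribute names, and the DoghouseSearch class are hypothetical, and it assumes the XPath API (javax.xml.xpath) bundled with current JDKs, a technology not covered in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DoghouseSearch {
    // "A wooden doghouse for sale in California that costs under $100",
    // expressed against hypothetical Metadata attributes on each listing.
    static final String QUERY =
        "/catalog/listing[@category='doghouse' and @material='wood'"
      + " and @state='CA' and number(@price) < 100]";

    // Returns the ids of the matching listings, in document order.
    static List<String> find(String catalogXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                QUERY, new InputSource(new StringReader(catalogXml)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                ids.add(((Element) hits.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Query or catalog could not be processed", e);
        }
    }

    public static void main(String[] args) {
        String catalog = "<catalog>"
            + "<listing id=\"d1\" category=\"doghouse\" material=\"wood\" state=\"CA\" price=\"85\"/>"
            + "<listing id=\"d2\" category=\"doghouse\" material=\"plastic\" state=\"CA\" price=\"40\"/>"
            + "<listing id=\"d3\" category=\"doghouse\" material=\"wood\" state=\"OR\" price=\"60\"/>"
            + "<listing id=\"d4\" category=\"doghouse\" material=\"wood\" state=\"CA\" price=\"140\"/>"
            + "</catalog>";
        System.out.println(find(catalog));
    }
}
```

Every condition in the query tests a described property of the data rather than a word on a page, which is what separates this kind of search from keyword matching.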

Another form of data collection that Metadata enables is the use of software agents. I'm wary of this term because I have seen it used to describe anything from an applet to a Web crawler. For the purposes of this article, I will define an agent as a threaded object that collects information from more than one machine on a network on behalf of a user; in other words, I define an agent as a representative or emissary from a person. Agents are commonly categorized as "intelligent," "mobile," and/or "personal." XML Metadata allows an agent to be more personal in the sense that it has access to descriptions of the data that help it find what it is looking for (an inexpensive doghouse, for example). XML does not make an agent more mobile (though Java does) or more intelligent. Based on my experiences obtaining my master's degree in Neuroscience, I don't place a lot of faith in software "intelligence"; I'd rather my agent collect it all and let me make the decisions, thank you!

Personalization of Web content is not limited to XML alone. Both Affinicast Interaction Manager (AIM) and the Art Technology Group's Dynamo system are wonderful examples of Java-based Web personalization that can give you a sense of how Metadata personalizes the Web. Siphoning information from the Web is, for some of us, an experience akin to drinking from a spewing firehose. Java-based Web personalization reduces the spray of information to a steady, customized trickle.

An amazing (and XML-enabled) example of navigation using Java is the Perspecta Ispace navigator. This Java software allows you to seemingly "fly" through an XML information space, "pivoting" on the Metadata keys. For example, say you're reading a Java news article and suddenly decide you need to see all of the other articles written by that article's author. Perspecta navigation allows you to "fly" up and "zoom down," "pivoting" into new articles and areas of reference at will. This powerful navigation software allows users to go where they like, instead of being constrained by the links supplied by a Web page author.

Extensible
One of the central powers of XML is embedded in its name: extensibility, or the ability to be expanded or customized. Extensibility has been a sticking point with HTML ever since Netscape started inventing new tags like <BLINK>. HTML is defined by a fixed set of tags. It is not possible to add tags without breaking the standard. Despite this fact, the explosive growth of the number of tags indicates the need for tag extensibility: the creation of new tags. Remember, tags are defined by a DTD, which formally defines what applications (such as stylesheets, browsers, crawlers, databases, and print engines) should expect in the structure of a document.

Java could play a big role in helping create dynamically extensible browsers for XML. When such a browser encounters a newly defined tag, like <MOLECULE3D> for example, the viewer could download a Java applet that allows for the rendering of this new element type. Jumbo, a prototype browser/editor/search/renderer for XML, is a Java example of such a browser.
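
One way to picture such an extensible viewer is as a registry that maps element names to rendering code, so that new element types can be plugged in without touching the viewer itself. The sketch below is a simplified, hypothetical design, not how Jumbo or any real browser works. It renders to plain text, registers its renderer in-process instead of downloading an applet, and uses the made-up element name MOLECULE3D because XML names may not begin with a digit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ExtensibleViewer {
    // Element name -> code that knows how to render that element type.
    private final Map<String, Function<Element, String>> renderers = new HashMap<>();

    // Stands in for "download and install a renderer for this new tag".
    void register(String elementName, Function<Element, String> renderer) {
        renderers.put(elementName, renderer);
    }

    String render(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            StringBuilder out = new StringBuilder();
            renderElement(root, out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Document could not be parsed", e);
        }
    }

    private void renderElement(Element element, StringBuilder out) {
        Function<Element, String> renderer = renderers.get(element.getTagName());
        if (renderer != null) {
            out.append(renderer.apply(element)); // a plug-in handles this element type
            return;
        }
        // No plug-in registered: fall back to showing text and descending into children.
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                renderElement((Element) child, out);
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue());
            }
        }
    }

    public static void main(String[] args) {
        ExtensibleViewer viewer = new ExtensibleViewer();
        viewer.register("MOLECULE3D", e -> "[3D model of " + e.getAttribute("formula") + "]");
        System.out.println(viewer.render("<page>Water: <MOLECULE3D formula=\"H2O\"/></page>"));
    }
}
```

The viewer never needs to know in advance which tags exist; an unfamiliar element simply falls through to the default behavior until a renderer for it is registered.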

Viewer adaptive
One of the novel offshoots of well-formed documents is that such data can adapt to a variety of different viewing modes. For example, because XML is a dialect of SGML, the content could potentially go directly into a publishable book format, provided it had the correct DTD for your publisher. But beyond different printing options, XML creates the potential for variances among hardware, software, and human readers of the data.

On the hardware front, one of the applications being developed in XML is a "lite" version of the Web designed for use on a small screen on a cellular telephone. This document type allows for simple navigation (for example, "Press one for FAQ, Press two for Company Info") using the standard telephone interface, as well as a way to effectively translate the text into an audible speech format, if necessary. Similarly, you could use the Web Automation Toolkit (webMethods' pure Java product for integrating and aggregating Web-based data sources into applications of all kinds) to scour the various news sites and send any Java-related news headlines to your pager via e-mail. The structure of XML allows for the deconstruction of Web pages into parts that can be sent to any kind of network attached device.

Software also has an ability to view XML in these discrete pieces. I can easily imagine feeding a stock price into a spreadsheet cell and allowing for a dynamic calculation of my portfolio value -- all without use of a browser. Specialized applications of XML add even more powers to software. For instance, CDF (Channel Definition Format) is a "push" system for content, and OSD (Open Software Description) allows for the updating of software applications, usually by teaming up with the DRP (Distribution and Replication Protocol) XML application and defining a distribution file hierarchy and unique byte identifiers using MD5 checksums.
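
The portfolio idea can be sketched without any browser at all: a program reads prices straight out of an XML document and does the arithmetic. The quote format, ticker symbols, prices, and the PortfolioValue class below are all hypothetical, and the code assumes the DOM and JAXP APIs bundled with current JDKs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PortfolioValue {
    // Sums price * shares for every quoted symbol that appears in the holdings.
    static BigDecimal total(String quotesXml, Map<String, Integer> holdings) {
        try {
            NodeList quotes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(quotesXml)))
                .getElementsByTagName("quote");
            BigDecimal total = BigDecimal.ZERO;
            for (int i = 0; i < quotes.getLength(); i++) {
                Element quote = (Element) quotes.item(i);
                Integer shares = holdings.get(quote.getAttribute("symbol"));
                if (shares != null) {
                    BigDecimal price = new BigDecimal(quote.getAttribute("price"));
                    total = total.add(price.multiply(BigDecimal.valueOf(shares.longValue())));
                }
            }
            return total;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Quote document could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> holdings = new HashMap<>();
        holdings.put("SUNW", 10); // hypothetical holdings
        holdings.put("NSCP", 4);
        String quotes = "<quotes>"
            + "<quote symbol=\"SUNW\" price=\"45.50\"/>"
            + "<quote symbol=\"NSCP\" price=\"20.25\"/>"
            + "<quote symbol=\"MSFT\" price=\"150.00\"/>"
            + "</quotes>";
        System.out.println("Portfolio value: " + total(quotes, holdings));
    }
}
```

A spreadsheet cell would do the same thing with one quote at a time; the point is that the price arrives as labeled data, not as text buried in a page layout.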

Finally, humans also gain the benefit of adaptable views. Currently, to display information originating from an XML file complete with client-side adaptable views, you must generate HTML, HDML (for PDAs), or HTML with CSS (Cascading Style Sheets). CSS2, the next version of CSS, is expected to utilize XML tables to display multiple columns and multidirectional internationalization (for languages such as Japanese that read top to bottom and right to left). With multiple style mechanisms, it is possible to create many views of the same data.

Conclusion
If XML is the universal data formatting grammar, it is only proper that a universal language like Java should be used to process this data. Jon Bosak put it best when he said, "XML gives Java something to do." Much in the same way that Java is the Web's powerful and universal code description language, XML is the Web's powerful and universal data grammar and syntax language, and as such, many fruitful partnerships will develop between the two. If Java is a hungry dog, XML is a meaty bone. The natural synergy of the two should provide lots of programming opportunities for Java developers. The Resources provide an excellent starting place for you to investigate further.

Resources

W3C's XML 1.0 Proposed Recommendation
http://www.w3.org/TR/PR-xml.html
Carl Davis's HTTP Explorers Client and Server
http://homepage.interaccess.com/~cdavis/java/httpexplorer.html
XML Resources at finetuning.com
http://www.finetuning.com/xml.html
Jon Bosak's "XML, Java, and the future of the Web"
http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm
Web Interface Definition Language (WIDL)
http://www.w3.org/TR/NOTE-widl.html
The HTTP Distribution and Replication Protocol
http://www.w3.org/TR/NOTE-drp
The Open Software Description Format (OSD)
http://www.w3.org/TR/NOTE-OSD.html
XML Enabled Mechanisms for Distributed Computing on the Web
http://208.204.84.117/public/presentation/DocumationEast/
webMethods has been doing some serious XML experimentation. Check out its demo page
http://www.webmethods.com/products/toolkit/userguide/demos.html
webMethods' WIDL White Paper, which was published in the W3C Journal
http://webMethods.com/technology/automating.html
LT XML (version 0.9.5; release date: August 21, 1997)
http://www.ltg.ed.ac.uk/software/xml/
DataChannel's XML Viewer Applet
http://208.204.84.117/XMLTreeViewer/deploy/index.html
A Proposal for XSL
http://www.w3.org/TR/NOTE-XSL.html
CDF Submission
http://www.w3.org/TR/NOTE-CDFsubmit.html
CSS2 Specification Release
http://www.w3.org/TR/WD-CSS2/
Mathematical Markup Language
http://www.w3.org/TR/WD-math/
Document Object Model (XML) Level 1
http://www.w3.org/TR/WD-DOM/level-one-xml-971209.html
An MCF Tutorial
http://www.w3.org/TR/NOTE-MCF-XML/MCF-tutorial.html
Synchronized Multimedia Integration Language
http://www.w3.org/TR/WD-smil
Parsers
An Introduction to XML Processing with Lark
http://www.textuality.com/Lark/
Pax Syntactica
http://208.204.84.117/XMLTreeViewer/
NXP - Norbert's XML Parser
http://www.edu.uni-klu.ac.at/~nmikula/NXP/

Miko's previous articles

"The real future of Java" Where are we going with Java, and who is going to take us there?
"Demo or die! The quest for the killer app" An open call for keynote demo submissions for the 1998 JavaOne conference.
"Ultranet, the next network" Sun's Java Evangelist provides his unique perspective on the four stages of the growth of the network. Read this account of how a bleeding-edge federation of startup companies strives to reinvent the next-generation network.

About the author
Miko Matsumura has a master's degree in Neuroscience from Yale University and a B.S. in Psychology from the University of Michigan. Before becoming the Java Evangelist for Sun Microsystems, he worked at HotWired as the director of research and development, at the WELL for Woodstock '94, and at the Branson School. He can now be found at miko.com. Miko's first computer was an Atari 400 with 16 kilobytes of RAM and a cassette tape recorder, which he acquired at the age of 12. He has been pondering questions about human and machine behavior ever since. He holds a first-degree black belt in Shotokan karate. Reach Miko at miko.matsumura@javaworld.com.

URL: http://www.javaworld.com/javaworld/jw-02-1998/jw-02-miko.html
Last modified: Tuesday, January 27, 1998

W3C reviews XML

Late last year the World Wide Web Consortium's (W3C) XML Working Group released the XML 1.0 specification as a Proposed Recommendation. What this means is that the Working Group has determined that the XML 1.0 specification is stable, contributes to Web interoperability, is supported for industry-wide adoption, and is ready to enter the review and voting process by all 229 W3C Member organizations.

The review process is expected to be completed next month (February 1998) with XML receiving Recommendation status, meaning that the W3C itself agrees that the specification meets the criteria described above.

Working Groups are composed of an impressive collection of W3C staff, independent experts, and participants from software companies, research facilities, government institutions, and academia from all over the world. The Working Group that produced XML 1.0 was chaired by Jon Bosak of Sun Microsystems, and included members from Microsoft (Jean Paoli), Netscape (represented by consultant Tim Bray of Textuality), and the University of Illinois at Chicago (Michael Sperberg-McQueen).

Also involved were members of a more informally organized XML special interest group, comprised of both W3C members and non-member experts. The efforts of the official working group, along with the feedback and comments from the slew of member and non-member industry experts that constituted the XML Special Interest Group, were collectively integrated into the XML 1.0 Proposed Recommendation.
