XML: Mastering Information on the Web

[This local archive copy mirrored from the canonical site: http://www.sun.com/980310/xml/, 980316; links may not have complete integrity, so use the canonical document at this URL if possible.]

by Todd Freter

The attention paid to XML (Extensible Markup Language), whose 1.0 standard was published February 10, 1998, is impressive. XML has been heralded as the next important internet technology, the next step following HTML, and the natural and worthy companion to the Java™ programming language itself. Enterprises of all stripes have rapturously embraced XML.

The origins of XML technology reveal much about its intent and its promise. And XML's promise, no small one, also implies some significant challenges for the people and organizations who want to take advantage of the XML phenomenon.

This article is the first in a series about XML, its promises, and its challenges. Four are planned, but more may appear, and their order here may not be the order in which we publish them. Even so, here is a preview:

The XML Idea
A Transition from HTML to XML
XML and True Information Reuse
XML and the Ascent of Documents

These articles are not intended to explain XML to you. There are excellent resources to do that already. Here are a few:

The XML FAQ (Frequently Asked Questions)
The W3C (World Wide Web Consortium) web page about XML
The SGML/XML Page, which points at a wealth of relevant documents

Instead, these articles are about what XML means for people, for enterprises, and perhaps for the future of information itself. As a major developer of technology products that have enabled the internet's explosive growth, Sun Microsystems believes it is important to propagate open perspectives on open standards that make information more available and useful.

Today's article, "The XML Idea," addresses these issues and subjects:

HTML's problems
XML's goals for the internet
"SGML on the web"
XML is born
Coming to grips with XML
How can an XML transition happen?

The XML Idea

"HTML is our data type," Microsoft's Bill Gates said in a February 1996 interview.

That pronouncement was an emblem for the impact that the burgeoning internet and its friendly interface, the World Wide Web, had exerted on corporations, governments, and people. With everyone from billion-dollar corporations and governments to elementary school classes and private individuals publishing websites and web pages, the success of the web and its original means for presenting information, HTML, had been amply demonstrated.

HTML's Problems

However, some people who had been looking at the internet from a different perspective had concluded otherwise. For those observers, who would start developing XML, HTML had problems:

HTML is a presentation technology only. HTML does not necessarily reveal anything about the information to which HTML tags are applied. For example, we know that <h2>Apple</h2> has a definite, predictable appearance in a web browser, but is it a computer company? A fruit? A last name? A recording company? HTML doesn't usually tell. Semantics are not in HTML's bag of tricks.
HTML has a fixed tag set. You can't extend it to create new tags that are meaningful and useful to you and others. Only the W3C (World Wide Web Consortium) can do that, at least properly.
Web browsers were viewed as potential application platforms, but Java technology frankly needed more to chew on than HTML offered in order to fulfill that vision. With HTML as the data standard, web-based applications relied too much on CGI scripts at the server to process the data in web pages. This contributed mightily to internet traffic and made the web slow for many users.
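
The first point in the list above can be made concrete with a minimal sketch. Python is used here only for illustration, and the element names "company" and "fruit" are hypothetical, not drawn from any standard vocabulary. The sketch suggests that a program can tell the two meanings of "Apple" apart only when the tag itself carries the meaning.

```python
import xml.etree.ElementTree as ET

# Presentational markup: the tag says how to display the text, not what it is.
html_fragment = "<h2>Apple</h2>"

# Semantic markup: hypothetical element names that say what the text means.
xml_fragments = ["<company>Apple</company>", "<fruit>Apple</fruit>"]


def describe(fragment):
    """Return (tag, text) for a single well-formed element."""
    element = ET.fromstring(fragment)
    return element.tag, element.text


html_result = describe(html_fragment)
xml_results = [describe(fragment) for fragment in xml_fragments]

print(html_result)
for result in xml_results:
    print(result)
```

In the HTML case the only thing a program learns is "h2", a display hint. In the other two cases the tag name is the information a program would need to route the text to the right application.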

This is not to denigrate HTML, but merely to establish the perspective that XML's developers held. From other valid perspectives, these problematic characterizations of HTML represent uncontestable virtues. But that wasn't the point.

XML's Goals for the Internet

While the world was flocking to the internet and HTML, a group of men and women watched with bemused concern. These were the developers, implementors, and users of HTML's parent technology, SGML (Standard Generalized Markup Language, ISO 8879:1986). These individuals and their companies had already invested heavily in SGML, which governed the semantics of their documents and of the information of which the documents were composed.

SGML, unlike HTML, assures its users an extensible tag set, and it establishes the rules by which documents (or "information products," as one expert persists in calling them) are produced. SGML yields sets of tags, as HTML is a set of tags, for characterizing what pieces of information mean. The people who used SGML and structured information systems were to become XML's developers, and they believed that SGML technology could enrich and revolutionize the web in some key ways:

EDI support
One of the principal uses of structured information is to enable data interchange. Different industries create consortia to specify the content model on which they all agree, and which they use to mark up their information so that they can share it with each other easily and efficiently. In the jargon of structured information, that content model is a DTD (Document Type Definition). Surely the web is an ideal venue for electronic data interchange. The XML developers could envision a broad range of EDI applications for which HTML was an inadequate data format, ill-equipped to express an industry's content model and its semantics.
Java technology- and client-based processing
One of the decade's most important technologies, the Java technology, enables browsers to function as generalized application platforms. True platform independence is the result. But the fixed tag set and semantic poverty of HTML provide precious little for Java applications to process. As one XML developer has said, "XML gives Java something to do." By providing information rich in metadata specified in a standard format, XML and Java technology make it possible for more of an application's work to be processed at a client. This contrasts with the general tendency of HTML pages to rely on a CGI script back at the web server for any programmed functionality. With XML and Java technology, more client-based application processing could reduce network and internet traffic, making the web faster.
Platform-independent information
SGML, the parent technology of HTML and XML, has always offered itself as a platform-independent technology for specifying the structure and semantics of information. While enterprises wrestled with evolving information formats like Microsoft's RTF, Adobe's PostScript and MIF formats, formats from WordPerfect, Lotus, Borland, and so on, SGML represented a rigorously consistent and platform-independent form for representing information. However, during the 1980s, when the SGML standard quietly was emerging, most computer industry observers focused instead on the explosion and excitement of new computer platforms. That industrial and commercial ferment obscured the coming chaos that multiple proprietary information formats assured. Later, in the 1990s, the popular discovery of the internet, and the emergence of the web, web browsers, and Java, revealed that chaos more clearly.
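
The data-interchange and client-processing ideas above can be sketched roughly as follows. Everything named here is an assumption made for illustration: the "order" vocabulary, its element and attribute names, and the REQUIRED_CHILDREN table. That table is a hand-written stand-in for what a DTD's content model would declare, because the parser in Python's standard library checks well-formedness but does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document that two trading partners might exchange.
ORDER_XML = """
<order id="A-100">
  <customer>Example Corp</customer>
  <item sku="X1" quantity="2" unitprice="9.50">Widget</item>
  <item sku="Y7" quantity="1" unitprice="20.00">Gadget</item>
</order>
"""

# Hand-written stand-in for a shared content model (NOT a real DTD):
# for each element name, the child elements that must appear at least once.
REQUIRED_CHILDREN = {"order": ["customer", "item"]}


def missing_children(element):
    """List required child element names that are absent from the element."""
    required = REQUIRED_CHILDREN.get(element.tag, [])
    return [name for name in required if element.find(name) is None]


def order_total(order):
    """Compute the total locally from the metadata carried in the markup."""
    return sum(
        int(item.get("quantity")) * float(item.get("unitprice"))
        for item in order.findall("item")
    )


order = ET.fromstring(ORDER_XML)
problems = missing_children(order)
total = order_total(order)
print(problems, total)
```

Because the quantities and prices travel with the document as labeled data, the receiving program can check and total the order itself rather than asking a server-side script to do it.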

"SGML on the Web"

In August 1996, these concerned SGML experts gathered in Seattle under the auspices of the GCA (Graphic Communications Association) to investigate how SGML could emerge on the web scene and command the interest of the web community. Led by Jon Bosak of Sun Microsystems, their discussions focused on two general areas:

Classes of software applications for which HTML was an inadequate information format
Aspects of the SGML standard itself that impeded SGML's acceptance as a widespread information technology

The first discussion established the need for SGML on the web. By articulating worthwhile, even mission-critical work that could be done on the web if there were a suitable information format, the SGML experts hoped to justify SGML on the web with some compelling business cases.

The second discussion raised the thornier issue of how to "fix" SGML so that it was suitable for the web. After all, if SGML on the web were such an intuitively brilliant idea, it ought to have happened already. But HTML and its specific presentational tag set, not SGML and its multiple semantic tag sets, were on the web in August 1996.

The experts laid out a plan of radical surgery for the SGML standard itself. In order to make SGML palatable to a wider audience, aspects of the standard that made logical sense but were difficult and costly to program had to be modified or even excised. It should be noted that SGML was designed as a rigorous, complete system, but ease of implementation in software applications was not the ruling priority for the SGML standard. The experts quickly established a rough laundry list of "SGML inessentials" for moving structured information onto the web.

XML Is Born

Even before this Seattle conference, Bosak and a small, carefully chosen group of SGML and structured-information experts approached the W3C to propose adding an "SGML on the web" activity to its efforts. The W3C agreed that this was worthwhile and sponsored the effort within its architecture domain. By July 1996, the effort to fit SGML on the web began.

Early in the activity, the W3C representatives who were to develop the XML standard determined that "SGML on the web" would not fly. SGML has its passionate devotees, but it also has its equally passionate detractors. The working group (originally called the "SGML Editorial Review Board") decided to refashion SGML on the web into something new, unburdened with SGML's history. To emphasize its difference from HTML, the working group named it Extensible Markup Language.

The working group members quickly set themselves an aggressive schedule in which to specify the features of XML. They planned the work in three phases:

XML: the syntax itself
XLL (Extensible Link Language): the linking semantics of XML
XSL (Extensible Stylesheet Language): the presentation of XML

The particulars of these efforts are available at the XML resources listed above. As mentioned previously, the XML 1.0 standard was approved and published by the W3C on February 10, 1998. Work on XLL and XSL is proceeding.

Coming to Grips with XML

The world has responded enthusiastically to XML. Go to any trade show or conference associated with publishing, documents, or the internet (and intranets or extranets), and you will see vendor upon vendor pledging or even demonstrating support for XML. The tools with which to create internet content are promising XML. The programs that deliver internet content are embracing XML. The systems that manage internet content are committed to XML.

What does that mean for you, the individual or enterprise who wants to take strategic advantage of XML and all that it promises?

You have to clear some hurdles. These include both familiar and new challenges:

Make your information into XML information
Rework your traditional view of information to XML's object-oriented model
Re-evaluate the applications and tools with which you develop, deliver, and manage XML information

Developing XML Information

If you have ever migrated from one application to another for developing information, say in a word processor or a spreadsheet, you know about changing your information to fit the new tool's data format. Moving your legacy data into XML is a data conversion task, but it is more. It is a strategic operation to add new business value to your information. This requires work.

Converting any information from a display format such as HTML, RTF, MIF, or PostScript to a structured format like XML will require that you understand what your information really contains. This requires a document analysis and the determination of information semantics on which different parts of your enterprise rely. If it sounds daunting, it is. But there is also good news. Many an enterprise like your own has done this already, and many enterprises in different business sectors have established industry standard information models that can be expressed in XML and, more importantly, can be shared.

Once the relevant information models and their expressions in XML are constructed, the effort to convert existing information into the XML format can proceed. It may or may not be painful, depending on the condition of your existing documents. These efforts can be done in house, or they can be completed with the help of qualified consultants.
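
As a rough illustration of the conversion step, the sketch below renames display tags to semantic ones using a lookup table. It rests on several assumptions that are not part of the article: the legacy fragment is well-formed, so that an XML parser can read it (real-world HTML often is not); the TAG_MEANING table stands for the outcome of a document analysis; and the names "product", "name", "sku", and "price" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical legacy fragment, assumed well-formed so an XML parser can read it.
LEGACY = "<div><h2>Widget</h2><b>X1</b><i>9.50</i></div>"

# Hypothetical outcome of a document analysis: what each display tag
# actually means in this particular body of documents.
TAG_MEANING = {"div": "product", "h2": "name", "b": "sku", "i": "price"}


def to_semantic(element):
    """Rebuild the element tree, renaming display tags to semantic ones."""
    new_tag = TAG_MEANING.get(element.tag, element.tag)
    converted = ET.Element(new_tag, element.attrib)
    converted.text = element.text
    converted.tail = element.tail
    for child in element:
        converted.append(to_semantic(child))
    return converted


product = to_semantic(ET.fromstring(LEGACY))
result = ET.tostring(product, encoding="unicode")
print(result)
```

The renaming itself is trivial. The hard part, as the text says, is the analysis that produces the table, since the same display tag may mean different things in different documents.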

Future articles will discuss possible transitions from today's information into XML in greater detail.

Learning to Live with XML Information

Once you have created a body of XML information, you will learn to treat it differently from the information you had before. The applications, file systems, and other software you relied on to elaborate information may not work so well with XML. Those traditional tools may not effectively expose the new value in your XML information. But again there is good news. It is clear that the marketplace is well prepared to deliver XML support in all phases of an enterprise's transition to XML. Already many software vendors are announcing, testing, and even delivering tools to aid in these critical phases of your transition:

Converting your legacy information into XML and structured formats that reflect your information's full value
Developing new, XML-structured information, and evolving your newly converted information
Managing the base of XML-structured information that results from your transition

Again, future articles in this series will discuss the issues and strategies about moving into XML and taking full advantage of structured information.

How Can an XML Transition Happen?

There are as many ways to make the XML transition as the Cartesian product of organizations, legacy data formats, and human moods. The next article in this series will look at some XML transition plans currently underway at Sun.

Todd Freter is a programs manager at Sun and is responsible for developing tools and technology strategies for producing and delivering technical information.
