October '97 Article
XML: The New Wowser For Browsers!
By Dave Trowbridge, marketing manager, Hummingbird Communications
Organic metaphors for the World Wide Web are seductive: they seem to clarify the boiling
confusion of the Internet. And the Web
indeed shares a salient characteristic with organic life: they are both phenomena that
thrive only in the narrow, ever-changing borderlands between order and chaos.
In our bodies we may experience excessive order as, for instance, cancer, a deadly
monotony of identical cells; and chaos may strike as deadly cardiac arrhythmia, where
various parts of the heart lose the ability to communicate and synchronize. On the Web,
standards imposed by technological inertia, monopoly, or government create deadly order,
choking innovation and growth; yet, in the absence of standards, little useful
communication takes place and the Internet remains nothing more than islands of isolated
data and computing power.
Nowhere is this dilemma more poignant than in the world of HTML, a simple markup
language now crushed under the burden of attempting to support the World Wide Web in its
incarnation as all things to all people. The difficulty of changing the HTML standard has
stifled the development of extensions to support specialized data and vertical
applications; yet without the standard, there would be no Web. Palliatives such as
plug-ins, or Java or ActiveX applets, are merely end runs around the fundamental
limitations of HTML.
But what if the browser interface could mutate to support new data types, new ways of
packaging data? What if ISVs and integrators had a standard way of extending what browsers
can do without writing applets? And what if all this were possible without sweeping away
the vast installed base of useful HTML documents and applications?
That's exactly the promise of the Extensible Markup Language (XML) developed under the
aegis of the World Wide Web Consortium (W3C). Unlike HTML, which is basically an
application of the Standard Generalized Markup Language (SGML) that is hard-wired into
browsers, XML is a simplified subset of SGML. It allows content providers, programmers,
and integrators to define their own tags and document types--in effect, it's a kind of
freely-mutating HTML that can be extended to support virtually any kind of data.
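To make the idea of self-defined tags concrete, here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree parser. The patient-record vocabulary below is invented for illustration and comes from no real standard; the point is only that a generic parser can read tags it has never seen before.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these tag and attribute names are made up
# for this example and are not part of any published document type.
DOCUMENT = """\
<patient-record id="p-001">
  <name>Jane Doe</name>
  <allergy severity="high">penicillin</allergy>
  <allergy severity="low">pollen</allergy>
</patient-record>
"""

def summarize(xml_text):
    """Return (tag, attributes, text) for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        (el.tag, dict(el.attrib), (el.text or "").strip())
        for el in root.iter()
    ]

if __name__ == "__main__":
    for tag, attrs, text in summarize(DOCUMENT):
        print(tag, attrs, text)
```

Nothing in the parser knows what an allergy is; the meaning lives in the tags the author chose, which is what lets each industry define its own.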
XML already forms the basis of Microsoft's push technology, the Channel Definition
Format (CDF), and of the Open Software Description (OSD) specification, a new software
delivery format proposed by Marimba and Microsoft. As well, it is serving as inspiration
for the Resource Description Framework (RDF), a standard for Web meta-information under
development by the World Wide Web Consortium. For integrators, XML promises a whole new
frontier by enabling the design of Web-enabled systems that specifically support vertical
applications and enable the effortless exchange of data across intranets, extranets, and
the Internet using simple, browser-based technology.
The Tao of SGML
SGML is not a language but a meta-language: a set of generalized rules used to specify
domain-specific languages, a kind of compiler compiler, in fact, much like the programming
tool yacc (Yet Another Compiler Compiler). Its fundamental principle is a simple one: that the
design of a language should be determined by the structure of the data it describes, not
the output medium it uses.
Most large industries already use SGML to specify a language specific to their
technology to promote cooperation. Two good examples of this are the Telecommunications
Interchange Markup (TIM) language, and the Pinnacles Group, a semiconductor industry
effort to develop a markup language to permit sharing semiconductor design information.
SGML is also the basis of the markup language used in Microsoft's Encarta
Encyclopedia.
SGML is non-proprietary, system- and platform-independent, and promotes the efficient
reuse of data. It is almost infinitely flexible, capable of describing the structure of
virtually any kind of data or information.
It is also almost infinitely complex, making it difficult to create markup languages,
and to write software that can accommodate the demands of SGML. And too many of its
options are only rarely used. In addition, SGML instances (documents) are not truly
portable. Viewing any document written in an SGML-derived markup language requires a
Document Type Definition (DTD), a style sheet, and a catalogue file. (The DTD specifies
the relationships between the various elements of a document, such as the TOC, chapters,
and tables, while the style sheet determines their formatting.) If these three
meta-information sources are not available, the document can only be viewed as raw,
unformatted SGML, which is much harder to read than raw HTML.
Making HTML Portable
In the case of HTML, these limitations were overcome by hardwiring the HTML DTD (i.e.
the HTML standard du jour) and style sheet into the browser. That solved the portability
problem by ceding control of how information is presented to the client, rather than the
server. (HTML style sheets, still not standardized, are an attempt to return some control
to the server originating the data.)
The downside was the near-total elimination of extensibility, resulting in
technological inertia as the familiar Catch-22 of standards asserted itself: people won't
adopt standards without trying them, which they can't do until vendors offer them, which
vendors won't do until people adopt them. Only the largest software vendors can afford to count
on the "if you build it they will come" approach, a reality which rendered HTML
hostage to the ambitions of Microsoft, Netscape, and others. As a result, the current
reality of HTML is a patchwork of solutions attempting to work around the language's
fundamental limitation: it was designed for simple hypertext transmission, and cannot
adequately represent the many kinds of data that people want to transmit across the Web.
Even more important, HTML conversion inevitably destroys information. For instance,
consider an HTML table generated from a database. Without a great deal of hacking, there's
no way to import that table back into another database, for the database schema or
structure has been lost. This limitation makes the Web largely a one-way street, hampering
the exchange of data across corporate intranets and extranets.
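The round trip that an HTML table cannot make can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real interchange format: the record element, the type attribute, and the parts table are all hypothetical, and field names are assumed to be valid XML names. Each cell travels with its field name and type, so the receiving side can rebuild the rows without any hacking.

```python
import xml.etree.ElementTree as ET

# Converters for the hypothetical "type" attribute written by rows_to_xml.
CASTS = {"int": int, "float": float, "str": str}

def rows_to_xml(table, rows):
    """Serialize rows (a list of dicts) with field names and types kept."""
    root = ET.Element(table)
    for row in rows:
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            cell = ET.SubElement(record, field)
            cell.set("type", type(value).__name__)
            cell.text = str(value)
    return ET.tostring(root, encoding="unicode")

def xml_to_rows(xml_text):
    """Rebuild the original rows from the markup alone."""
    root = ET.fromstring(xml_text)
    return [
        {cell.tag: CASTS[cell.get("type")](cell.text or "") for cell in record}
        for record in root
    ]

if __name__ == "__main__":
    parts = [
        {"part": "bolt", "qty": 40, "price": 0.25},
        {"part": "nut", "qty": 75, "price": 0.1},
    ]
    markup = rows_to_xml("parts", parts)
    print(markup)
    print(xml_to_rows(markup) == parts)
```

An HTML rendering of the same rows would keep only the cell text; the column names and types that make re-import possible would be gone.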
XML: Thinning Down the Standard
XML was designed to overcome these limitations by eliminating the infrequently used
parts of SGML and rewriting the remainder for better network citizenship. XML is actually
more than just a meta-language, for it also offers a standardized approach to stylesheets
and a far more powerful hyperlinking model than HTML.
XML stylesheets are written using a subset of the Document Style Semantics and
Specification Language (DSSSL), itself derived from a dialect of LISP, a powerful language
associated with artificial intelligence. They are freely extensible and Turing complete,
allowing designers to arbitrarily extend stylesheet capabilities; they are completely
internationalized; and they possess a sophisticated rendering model that delivers professional
page layout capabilities. XML hyperlinking is a subset of HyTime, an ISO standard for
hypertext, hypermedia, and time-based multimedia, and will offer such improvements as
bidirectional links, links that can be specified and managed outside the document they
belong to, and link attributes.
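The notion of links kept outside the documents they connect can be sketched as follows. The linkset vocabulary, its attribute names, and the file names are invented for illustration and do not reflect the XML linking syntax, which was still being drafted at the time of writing. Because the links live in their own file, a program can follow them in either direction without touching the documents themselves.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical out-of-line link file: element and attribute names are
# assumptions made for this example only.
LINKS = """\
<linkset>
  <link role="cites" from="report.xml#s2" to="spec.xml#intro"/>
  <link role="cites" from="memo.xml#p1" to="spec.xml#intro"/>
</linkset>
"""

def build_index(xml_text):
    """Index every link by both of its ends so it can be followed either way."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for link in ET.fromstring(xml_text).iter("link"):
        source, target = link.get("from"), link.get("to")
        role = link.get("role")  # a link attribute describing the relationship
        outgoing[source].append((role, target))
        incoming[target].append((role, source))
    return outgoing, incoming

if __name__ == "__main__":
    outgoing, incoming = build_index(LINKS)
    print(outgoing["report.xml#s2"])
    print(incoming["spec.xml#intro"])
```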
But the main source of the excitement about XML is its ability to specify
network-friendly languages perfectly adapted to the data they describe. Already the health
care industry has latched on to XML as the solution to making the complex information
found in patient records truly portable. And vendors in the EDI (Electronic Data
Interchange) space are also eyeing XML as a way to bring EDI to the masses. In principle,
any integrator with the talent on board to use yacc can also exploit the power of XML to
write languages for Web-enabled vertical applications. In fact, yacc itself can be used to
specify an XML language, and perl can be used to parse XML instances, although specialized
tools will eventually eclipse these programmer tools.
XML also makes it possible to deliver information in forms that can be manipulated by
the client without further server or network involvement. For instance, instead of
downloading a merely textual table of contents for a document, XML could deliver a
structured TOC object that could be expanded or contracted by the user. Likewise,
spreadsheet or database-type information could be downloaded with its schema intact,
allowing the browser to create different views of the data locally; in addition,
XML's preservation of structure would make possible drag-and-drop transfer of
information from a browser window to a database. Multimedia will benefit enormously from
XML's descriptive ability--in fact, XML may be the only hope for true integration of the
PC with television. How else to describe and rate 500+ channels?
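The structured table of contents mentioned above might look something like the following sketch. The toc and entry elements and their attributes are hypothetical, and Python stands in for whatever scripting a future browser might offer. Once the TOC has arrived as structure rather than text, expanding or collapsing a section is a purely local operation, with no further trip to the server.

```python
import xml.etree.ElementTree as ET

# Hypothetical TOC vocabulary, invented for this example.
TOC = """\
<toc>
  <entry id="1" title="Introduction"/>
  <entry id="2" title="The Tao of SGML">
    <entry id="2.1" title="Strengths"/>
    <entry id="2.2" title="Weaknesses"/>
  </entry>
</toc>
"""

def render(node, expanded, depth=0):
    """Return display lines, descending only into entries the user has opened."""
    lines = []
    for entry in node.findall("entry"):
        has_children = entry.find("entry") is not None
        is_open = entry.get("id") in expanded
        # "-" marks an open branch, "+" a closed one, " " a leaf.
        marker = "-" if has_children and is_open else "+" if has_children else " "
        lines.append("  " * depth + marker + " " + entry.get("title"))
        if has_children and is_open:
            lines.extend(render(entry, expanded, depth + 1))
    return lines

if __name__ == "__main__":
    toc = ET.fromstring(TOC)
    print("\n".join(render(toc, expanded=set())))   # everything collapsed
    print("\n".join(render(toc, expanded={"2"})))   # section 2 opened
```

The same document is rendered twice from one download; only the set of expanded entry ids changes between the two views.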
Despite the fact that there is no XML-capable browser yet, this standard is already
pervasive. Microsoft's CDF push format is written in XML, as is the software delivery
language OSD championed by both Microsoft and Marimba. Netscape touts the Resource
Description Framework (RDF), also based on XML, as a proposed Web standard data model; it
forms the basis of the company's new Aurora technology. Dynamic HTML, as well, draws
on XML. DataChannel is developing Xapi-J, which gives Java and JavaScript programs a way
of extracting data from XML instances. Integrators expecting to capitalize on XML should
get started now, for very soon the Web will be flooded with data accessible to XML tools,
and customers will be clamoring for software that can help them capture, interpret and use
it.