Scientific American: Cyber View: The Web Learns to Read: June 1998

[This local archive copy is from the official and canonical URL, http://www.sciam.com/1998/0698issue/0698cyber.html; please refer to the canonical source document if possible.]


The Web Learns to Read

Imagine lifting up a page arriving from the World Wide Web to watch the computers beneath negotiate their transfers over the Internet. You would see them conversing in four distinct languages. Three of those four tongues are extremely terse and rigid; they are spoken only by machines. It is the fourth one, HyperText Markup Language (HTML), that has made the Web such a phenomenon. HTML is similar enough to English that masses of people have learned to use it to annotate documents so that almost any kind of computer can display them. But HTML is still rigid; a committee must approve every new addition to its narrow vocabulary.

In February a fifth language was approved for use on the Web--one that could have consequences nearly as profound as the development of HTML did seven years ago. Extensible Markup Language (XML), like HTML, is surprisingly easy for humans to read and write, considering that it was developed by an international group of 60 engineers. But XML is much more flexible than HTML; anyone can create words for the language. More than that, devices that can understand XML (within a few years, probably almost all the machines hooked to the Internet) will be able to do more intelligent things than simply display the information on Web pages. XML gives computers the ability to comprehend, in some sense, what they read on the Web.

To understand how, imagine that you want to rent a dacha on the shores of the Black Sea for a vacation, but you cannot read Russian. A friend in Odessa e-mails you the classified rental listings from the local paper. Even if he inserts descriptions of how the ads appeared in the newspaper--there was a line here, this word was in boldface--that hardly helps. But what if he annotated the Russian text to indicate which numbers referred to prices and which to bedrooms? Or if he highlighted each reference to a view and noted that очень хорошо means "very good"? Suddenly the listings would be useful.

To browser programs, Web pages today are typically long stretches of gibberish (what we humans would see as English or Russian), with a few intelligible HTML words that describe how to arrange the chunks of gibberish on the page and what typeface to put them in. Publishers can use HTML to make Web pages pretty, but getting the pages to do anything semi-intelligent--to reorder a list of properties according to price, for example--requires a separate program. Then readers must wait while that program generates a whole new page on some distant, overburdened server and sends it to them. That costly, inefficient process is what makes the Web so clumsy and slow at providing services such as travel reservations, customized news or useful searches--and why so few companies offer such services on-line.

XML should fix those problems. It allows authors to annotate their pages with labels that describe what pieces of text are, rather than simply how they should appear. The Odessa Tribune, for example, could mark up its classifieds so that Web browsers can distinguish ads for vodka from those for dachas and can identify within each dacha listing the price, size and view of the property.
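To make the idea concrete, here is a minimal sketch of what such a marked-up listing might look like and how a program could read it. The tag and attribute names (classifieds, ad, kind, price, bedrooms, view) are invented for illustration and do not come from any published vocabulary; the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every tag and attribute name here is an assumption
# made for illustration, not part of any real newspaper's format.
LISTINGS = """
<classifieds>
  <ad kind="dacha">
    <price currency="RUB">1800</price>
    <bedrooms>3</bedrooms>
    <view>sea</view>
  </ad>
  <ad kind="vodka">
    <price currency="RUB">40</price>
  </ad>
</classifieds>
"""

def dacha_ads(xml_text):
    """Return price, size and view for each ad labeled as a dacha."""
    root = ET.fromstring(xml_text)
    result = []
    for ad in root.findall("ad"):
        # The label says what the ad is, so vodka ads are easy to skip.
        if ad.get("kind") != "dacha":
            continue
        result.append({
            "price": int(ad.findtext("price")),
            "bedrooms": int(ad.findtext("bedrooms")),
            "view": ad.findtext("view"),
        })
    return result

dachas = dacha_ads(LISTINGS)
print(dachas)
```

The program never needs to understand the text of an ad; the labels alone tell it which number is a price and which is a bedroom count.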

Now that XML has been certified as a Web standard, both Microsoft and Netscape have announced that the next major releases of their browsers will understand the new language. Using so-called style sheets, the programs will be able simply to display XML documents much as they format HTML pages now. But if snippets of code, known as scripts and applets, are embedded in an XML page, the browsers could also act on the information it contains. The Odessa listings could be culled to remove properties costing over 2,000 rubles or even combined with dacha listings from five other on-line newspapers.
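The culling and combining described above might be sketched as follows. This is an illustration in Python rather than a browser script, and it rests on assumptions: the tag names are invented, both papers are presumed to share the same vocabulary, and the second paper ("Yalta Gazette") is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical feeds. Tag names and the second newspaper are assumptions.
ODESSA = """<classifieds source="Odessa Tribune">
  <dacha><price>1800</price><bedrooms>3</bedrooms></dacha>
  <dacha><price>2500</price><bedrooms>4</bedrooms></dacha>
</classifieds>"""

YALTA = """<classifieds source="Yalta Gazette">
  <dacha><price>1200</price><bedrooms>2</bedrooms></dacha>
  <dacha><price>2000</price><bedrooms>3</bedrooms></dacha>
</classifieds>"""

MAX_RUBLES = 2000

def affordable(feeds, limit):
    """Merge listings from several feeds, dropping any priced over the limit."""
    kept = []
    for text in feeds:
        root = ET.fromstring(text)
        source = root.get("source")
        for dacha in root.iter("dacha"):
            price = int(dacha.findtext("price"))
            if price <= limit:
                kept.append((price, int(dacha.findtext("bedrooms")), source))
    # Tuples sort by their first field, so the result is ordered by price.
    return sorted(kept)

results = affordable([ODESSA, YALTA], MAX_RUBLES)
for price, bedrooms, source in results:
    print(f"{price} rubles, {bedrooms} bedrooms ({source})")
```

Because the work happens on the reader's own machine, reordering or filtering the list requires no new page from the server.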

In essence, XML offers the first universal database translator, a way to convert information in virtually any repository into a form that almost any other computer can manipulate. As such, it should eventually make Internet searches dramatically more useful, in two ways. First, surfers could limit their searches to specific kinds of Web pages: recipes, say, or news stories or product descriptions. Second, many of the most useful bits of information on the Web remain tucked inside databases that are hidden from the search robots traversing the Net in search of text for their indexes. With XML, Medline could open up its database of medical journal abstracts so that any program could search them. General Motors could do the same for its spare-parts catalogue.

XML is universal because authors are free to define new words to describe the structure of their data. Such liberty could lead to chaos were everyone to make up a new lingo. But people and companies with a common interest have a strong incentive to settle on a few choice terms. Chemists, for example, have already used XML to redefine their Chemical Markup Language, which now enables browsers to display the spectra of a molecule and its chemical structure given only a straightforward text describing the compound. Mathematicians used the standard to create a Math Markup Language, which makes it easy to display equations without converting them to images. More important, MathML formulas can be dropped directly into algebra software for computation.
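As a rough sketch of what "dropping a formula into software" could mean, the following Python fragment evaluates a small expression written with MathML-style content elements (apply, plus, times, cn for numbers, ci for identifiers). It is a simplification under stated assumptions: the XML namespace is omitted, only two operators are handled, and no real algebra package works exactly this way.

```python
import operator
import xml.etree.ElementTree as ET
from functools import reduce

# The expression 2 + 3*x in content-style markup. The namespace is left out
# to keep the sketch short, which is a simplification.
FORMULA = """
<apply><plus/>
  <cn>2</cn>
  <apply><times/><cn>3</cn><ci>x</ci></apply>
</apply>
"""

# Only the two operators used above are supported in this sketch.
OPERATORS = {"plus": operator.add, "times": operator.mul}

def evaluate(node, variables):
    """Recursively compute the value of a marked-up expression."""
    if node.tag == "cn":          # a literal number
        return float(node.text)
    if node.tag == "ci":          # a named variable
        return variables[node.text.strip()]
    if node.tag == "apply":       # first child is the operator, rest are operands
        op, *args = list(node)
        values = [evaluate(arg, variables) for arg in args]
        return reduce(OPERATORS[op.tag], values)
    raise ValueError(f"unsupported element: {node.tag}")

value = evaluate(ET.fromstring(FORMULA), {"x": 4})
print(value)
```

The same text that a browser could typeset as an equation is here treated as something to compute with, which an image of the equation would not allow.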

Perhaps the most impressive demonstration so far of XML's flexibility is MusicML, a simple set of labels for notes, beats and rests that allows compositions to be stored as text but displayed by XML-enabled Web browsers as sheet music. With a little more programming, the browsers could probably play MusicML on synthesized instruments as well. After all, now that the Web can read data, it may as well learn to read music.

--W. Wayt Gibbs in San Francisco