[This local archive copy mirrored from the canonical site: http://www.newscientist.com/ns/980530/xml.html; links may not have complete integrity, so use the canonical document at this URL if possible.]


A new dawn

If the Web is to meet all our needs next century,
it'll take nothing short of a revolution.
So a smart way to translate anything from MnO2
to the Moonlight Sonata into a universal language
is a fine start, argues Glyn Moody

ONCE, Web search engines were a wonder. Just by entering a keyword you could sift through millions of pages in seconds to hunt down the information you needed. Nowadays, it's a wonder anyone bothers. The Internet has grown so vast and varied that it's not unusual for a search to throw up 10 000 hits. You almost need another search engine to discover anything useful.

Worse still, even though thousands of pages contain your keyword, the chances are that many do not use the word in the sense you want. Search for "gates", and you'll receive links to pages about the tiny silicon gates on transistors, the five-barred variety found on farms and, of course, Bill Gates. The fact is that the Web is pretty stupid when it comes to "understanding" the information it conveys.

The problem lies with the language that weaves the Web, HyperText Markup Language (HTML). It began life at CERN, the European laboratory for particle physics, where Tim Berners-Lee wanted an easy way for his colleagues to post their papers online. The simplicity of his solution unexpectedly triggered the revolution that has made the Net so popular today. Yet, at the same time, growing competition between software companies to make the Net ever more seductive and powerful has pushed HTML far beyond what it was designed to cope with.

Today, however, a second revolution is under way--only this time it's being planned. A new Net language, called the Extensible Markup Language (XML), promises to make the Web smarter by including machine-readable information about the structure and content of Web pages. Search engines, then, will be able to home in on just one meaning of a word.

But that's not all. XML opens the door to new languages that will allow musical notation and mathematical and chemical symbols to be sent across the Web as easily as text. Documents written in these languages will be interactive in ways we can only dream about today. Readers will be able to treat the information in these documents just like raw data--analysing or manipulating it any way they see fit.

Wildest dreams

XML should be good for industry too. As the Web grows smarter, that Cinderella of electronic communications, e-commerce, will finally get to the ball. "There are loads of start-up companies with pockets bulging with venture capital hoping to make this work," says Tim Bray, a founder of the XML working group at the World Wide Web Consortium, the nearest thing the Web has to a controlling body.

The consortium published XML 1.0 last December (http://www.w3c.org/XML/). Since then, software companies around the world have raced to apply it in almost every area of computing. "None of us could have predicted the events of the past few months in our wildest dreams," says Bray. Amid all this turmoil, the general shape of the XML Web is slowly starting to emerge.

The language that has brought the Web this far, HTML, is a "markup" language, consisting of text interspersed with tags, normally in pairs and contained within angled brackets. These tags mark out the underlying structure of the document. So, for example, a Web browser will interpret whatever is between the <H1> and </H1> tags as a major heading, and will usually display it in large, bold type.

Berners-Lee kept the tags basic and their number low. As a result, HTML conveys only a tiny amount of information about documents it marks. This is not the case with XML, which is a "metalanguage": it provides a set of rules for constructing other markup languages. As such, it lets people make up their own tags (hence the "extensible" in the title), and so can provide much richer information about the data held in documents.

Smart search

This "metadata"--data about data--might include information about the subject matter of Web pages. With XML, attaching metadata to a document is easy, at least in theory. Back with our "gates" query, the range of subject matter would include electronics, farming and billionaires. The appropriate choice could be placed between a new pair of tags, <SUBJECT> and </SUBJECT>, for example, which would allow a search engine to "understand" it--at least to the extent of knowing which kind of gates it is dealing with. So when you searched, hits would be presented grouped by subject rather than a random mix.
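As a rough illustration of the idea, here is a minimal Python sketch that groups search hits by the contents of a <SUBJECT> tag. The pages, and the PAGE and TITLE tag names, are invented for the example; only <SUBJECT> comes from the text above, and a real search engine would of course be far more elaborate.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical pages: PAGE and TITLE are invented tag names for this sketch.
PAGES = [
    "<PAGE><TITLE>Logic gates explained</TITLE><SUBJECT>electronics</SUBJECT></PAGE>",
    "<PAGE><TITLE>Hanging a five-barred gate</TITLE><SUBJECT>farming</SUBJECT></PAGE>",
    "<PAGE><TITLE>Bill Gates profile</TITLE><SUBJECT>billionaires</SUBJECT></PAGE>",
    "<PAGE><TITLE>Gates in silicon</TITLE><SUBJECT>electronics</SUBJECT></PAGE>",
]


def search_grouped(keyword, pages):
    """Return {subject: [titles]} for pages whose title contains the keyword."""
    groups = defaultdict(list)
    for source in pages:
        page = ET.fromstring(source)
        title = page.findtext("TITLE", default="")
        subject = page.findtext("SUBJECT", default="unknown")
        if keyword.lower() in title.lower():
            groups[subject].append(title)
    return dict(groups)


results = search_grouped("gate", PAGES)
for subject, titles in sorted(results.items()):
    print(subject, "->", titles)
```

The keyword match is still as crude as before; what changes is that each hit now arrives labelled with the kind of gate it concerns.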

But this raises two big questions. What will all the extra tags on an XML page look like when viewed through a Web browser which can cope only with presentational tags such as title and heading? And if everyone is making up their own tags, how will search engines know what they refer to? The answers to these questions highlight fundamental differences between HTML and XML.

It is a basic rule of XML that content and presentation are separate. So XML tags contain no hint about how they should be displayed. This means that in future, before you can read an XML page it will have to pass through a program that will format it for you. These programs generally use "stylesheets". One candidate for creating stylesheets is an XML language called the eXtensible Style Language (http://www.w3c.org/Style/XSL/).

Stylesheets are equivalent to the templates already used in wordprocessors to give documents the same look, says P. G. Bartlett, vice-president of marketing at the Michigan-based software company ArborText, which has done much of the work defining XSL. These stylesheets consist of formatting rules for how particular XML tags, such as <SUBJECT>, should be dealt with on screen or on page. One obvious approach would be to use a stylesheet to convert XML tags into HTML tags, so that a document could be viewed with a Web browser.

But XSL promises much more. "Different stylesheets can be applied to the same data," says Bartlett. Parts of a document, separated by different tags, could be hidden or displayed by different stylesheets. A page could contain three versions of the same product information, for example: one for a company's managers, another for its engineers and the third for its customers. Each version would be revealed by a different stylesheet, says Bartlett.
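The principle can be sketched in a few lines of Python. This is not XSL: the "stylesheets" here are plain Python dictionaries, and the product page and all its tag names are invented for illustration. The sketch shows only the idea that one set of data can be turned into different HTML views by swapping the rules applied to it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical product page; every tag name is invented for illustration.
PRODUCT = """<PRODUCT>
  <NAME>Widget 3000</NAME>
  <MANAGERS>Margin is 40 per cent.</MANAGERS>
  <ENGINEERS>Tolerance is 0.1 mm.</ENGINEERS>
  <CUSTOMERS>Now in three colours.</CUSTOMERS>
</PRODUCT>"""

# Each toy "stylesheet" maps an XML tag to the HTML tag used to display it.
# Tags that are missing from a stylesheet are hidden.
MANAGER_SHEET = {"NAME": "h1", "MANAGERS": "p"}
CUSTOMER_SHEET = {"NAME": "h1", "CUSTOMERS": "p"}


def render(xml_source, stylesheet):
    """Convert the children of the root element to HTML using the stylesheet."""
    root = ET.fromstring(xml_source)
    lines = []
    for child in root:
        html_tag = stylesheet.get(child.tag)
        if html_tag is None:
            continue  # no rule for this tag, so it is not displayed
        lines.append(f"<{html_tag}>{escape(child.text or '')}</{html_tag}>")
    return "\n".join(lines)


print(render(PRODUCT, MANAGER_SHEET))
print(render(PRODUCT, CUSTOMER_SHEET))
```

The XML itself never changes; managers and customers simply see it through different rules.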

While XML and XSL revolutionise the way the Web deals with the content and look of information, another XML application, called XLink, offers an invigorating update to that other key feature of the Web--hyperlinks (http://www.w3c.org/TR/1998/WD-xlink-19980303). XLink will introduce a number of novelties, says Eve Maler of ArborText, co-editor of XLink. "You could target a particular chunk of content such as a section, rather than pointing to a whole document." This is possible because XLink allows extra information to be added to Web addresses. You could create a hyperlink, for example, that would take you to the third speech of the second scene of the second act of a play (see Diagram on page 36).
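To give a flavour of addressing a chunk inside a document, the Python sketch below picks out the third speech of the second scene of the second act. It does not use XLink syntax; it borrows the small path language built into Python's standard xml.etree.ElementTree module as a stand-in, and the PLAY, ACT, SCENE and SPEECH tag names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_play():
    """Build a toy play of two acts, each with two scenes of three speeches."""
    play = ET.Element("PLAY")
    for act_no in (1, 2):
        act = ET.SubElement(play, "ACT")
        for scene_no in (1, 2):
            scene = ET.SubElement(act, "SCENE")
            for speech_no in (1, 2, 3):
                speech = ET.SubElement(scene, "SPEECH")
                speech.text = f"Act {act_no}, scene {scene_no}, speech {speech_no}"
    return play


play = build_play()

# Positions in the path count from 1: second ACT, its second SCENE, third SPEECH.
target = play.find("ACT[2]/SCENE[2]/SPEECH[3]")
print(target.text)
```

The path plays the role of the extra information tacked on to a Web address: it names a place within the document rather than the document as a whole.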

You can also make your links "behave" in strange ways on screen, says Maler. Say you're reading a guide to hotels and the hyperlinks come in two varieties--contact details for the hotels and maps showing their locations. Place the cursor over the hotel name and a line of XLink code will turn it red if it takes you to a map and green for the contact details. And all before you click the mouse. Better still, XLink could give you a pulldown menu of the options.

For all this array of possibilities, our second big question remains: if Website designers can invent XML tags, how will search engines know what they all "mean"? Fortunately, XML comes complete with its own solution to this problem, called the Resource Description Framework (http://www.w3.org/RDF/Overview.html).

RDF allows information about a Web page to be stored as if in a structured database. Using an XML tag such as <SUBJECT> gives you a simple "blob" of metadata, says Bray. Our gates example might include things such as farming, agriculture, fencing, wood, and so on. While a human can understand this kind of list, a machine cannot. By contrast, RDF could be employed to divide up metadata into fields such as "main subject" and "secondary subject"--and you could use the secondary subject tag more than once. Other tags might include "document author" and "date of creation". This allows search engines to get smarter in future. Asked to find all documents written by Joe Bloggs about bananas before December 1997, a smart search engine would check the author and subject and then ask if the date is less than or equal to 30/11/97.
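The Joe Bloggs query might look something like the following Python sketch. It is not RDF syntax: the Record class and its field names are invented here to stand in for metadata that has already been split into fields, and the records themselves are made up.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Record:
    """Hypothetical fielded metadata for one document."""
    author: str
    main_subject: str
    created: date
    secondary_subjects: list = field(default_factory=list)


RECORDS = [
    Record("Joe Bloggs", "bananas", date(1997, 6, 2), ["fruit", "trade"]),
    Record("Joe Bloggs", "bananas", date(1998, 1, 15)),
    Record("Joe Bloggs", "gates", date(1997, 3, 9), ["farming"]),
    Record("Ann Other", "bananas", date(1996, 11, 30)),
]


def find_documents(records, author, subject, on_or_before):
    """Match on author and subject, then compare dates as dates, not as text."""
    return [
        r for r in records
        if r.author == author
        and subject in [r.main_subject, *r.secondary_subjects]
        and r.created <= on_or_before
    ]


hits = find_documents(RECORDS, "Joe Bloggs", "bananas", on_or_before=date(1997, 11, 30))
for hit in hits:
    print(hit.author, hit.main_subject, hit.created.isoformat())
```

Because each field has a known meaning, the date test is a genuine comparison rather than a hunt for the characters "1997" somewhere on the page.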

RDF itself does not specify names for the fields--it merely sets out the syntax for how different fields relate to Web pages and to one another. It is up to different groups of users to name the fields and decide which collections of fields--or schemata--are best for them. "Some communities will define 'official' schemata," says Ora Lassila of the Nokia Research Center in Boston and co-editor of the RDF standard. "For example, the library community is working on its Dublin Core schema." The Dublin Core schema consists of 15 fields that give all the basic information about electronic documents. They include title, subject, creator, publisher and date of creation.

"The problem with official schemata," says Lassila, "is that it takes a long time to get enough representatives from any community to agree on anything." He expects that rough and ready schemata will become standards by default simply because people will start using them.

The benefits of RDF will come not only from smarter searching, but also from making it easier to transfer and pool data. Using the Dublin Core, for example, it will be possible to amalgamate bibliographies from different institutions, and so create a kind of virtual, global library catalogue.

Even where no formal schema has been drawn up, XML can still help information interchange. Take two companies that hold similar information in their databases, but use different programs running on incompatible computers and use different names for their database fields--one uses SecondName, for example, while the other uses Surname. If these databases are converted to XML, then the fields become pairs of tags and the data within the fields are placed between the tags. So the first company's tags become <SECONDNAME>, </SECONDNAME> and the second firm's <SURNAME>, </SURNAME>. It is then possible to write a simple text-processing program to translate one set of tags into the other.
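Such a translation program really can be short. The Python sketch below renames one company's tags to the other's using the standard xml.etree.ElementTree module; the CUSTOMER and FIRSTNAME tags and the sample record are invented for the example.

```python
import xml.etree.ElementTree as ET

# Mapping from the first company's tag names to the second company's.
TAG_MAP = {"SECONDNAME": "SURNAME"}


def translate(xml_source, tag_map):
    """Rename every tag found in tag_map; leave all other tags and data alone."""
    root = ET.fromstring(xml_source)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record exported by the first company.
first_company = (
    "<CUSTOMER><FIRSTNAME>Joe</FIRSTNAME>"
    "<SECONDNAME>Bloggs</SECONDNAME></CUSTOMER>"
)
translated = translate(first_company, TAG_MAP)
print(translated)
```

Adding further pairs to TAG_MAP is all it would take to cover more fields, provided the two companies mean the same thing by them.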

One area that is likely to be transformed by XML's interchange capabilities is e-commerce. In particular, electronic data interchange (EDI)--an attempt to define standard ways for companies to exchange orders electronically--may well be re-energised. "Many of us feel that XML is the best way to move EDI on to the Internet," says Patrick Drummond, a member of the EDI working group of the American forum CommerceNet. In the past, companies have been loath to switch to EDI because it needs expensive software. But XML programs are likely to be freeware, shareware or low cost.

It's not just commerce that will benefit, however. "I believe that XML will have a breakthrough impact on electronic record-keeping in healthcare," says Tom Lincoln, research professor of medical informatics at the University of Illinois at Chicago. The ability to pool medical information from many hospitals and search for, say, patterns of disease or successful treatments among the records of tens or even hundreds of thousands of people could transform epidemiology.

One other intriguing aspect of XML is that it's not confined to words and numbers. With the right tags, it can be used to convey just about anything, including mathematical, musical and chemical symbols. Already, this has led to XML applications such as the Mathematics Markup Language (http://www.w3c.org/Math/), Music Markup Language (http://www.tcf.nl/trends/trends6-en.html), and the Chemical Markup Language (http://www.venus.co.uk/omf/cml/intro.html).

The last of these, CML, can manage any existing molecular information on the Web, says its creator Peter Murray-Rust, director of the Virtual School of Molecular Sciences at Nottingham University. One of the advantages of XML is that it is designed to support different disciplines working together, he continues: "So it's very straightforward to mix text, maths and chemistry in the same document."

Automatic access

Perhaps the most astonishing feature of CML and other XML applications is that it is possible to write software that will select data held between pairs of tags and then manipulate them automatically. This makes the data not just searchable but manipulable in ways that are impossible today, says Murray-Rust. You could, say, write a program that grabbed some numbers from an XML chemistry paper, modified them in some way and sent the results to a computer-controlled chemical production process--all without human intervention.
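A toy version of such a program is sketched below in Python. The tags are not real CML: PAPER, TEXT and MEASUREMENT, the units attribute and the figures are all invented for illustration. The sketch pulls the numbers out from between the tags and converts them from Celsius to kelvin, leaving out the final step of passing them to any machinery.

```python
import xml.etree.ElementTree as ET

# Hypothetical chemistry paper; the tag and attribute names are not real CML.
PAPER = """<PAPER>
  <TEXT>Yields were measured at three temperatures.</TEXT>
  <MEASUREMENT units="celsius">20</MEASUREMENT>
  <MEASUREMENT units="celsius">37.5</MEASUREMENT>
  <MEASUREMENT units="celsius">100</MEASUREMENT>
</PAPER>"""


def temperatures_in_kelvin(xml_source):
    """Extract every Celsius measurement and convert it to kelvin."""
    root = ET.fromstring(xml_source)
    return [
        float(m.text) + 273.15
        for m in root.iter("MEASUREMENT")
        if m.get("units") == "celsius"
    ]


setpoints = temperatures_in_kelvin(PAPER)
for value in setpoints:
    print(round(value, 2))
```

The prose in the <TEXT> tag is ignored entirely; the program needs no understanding of the paper, only of where the numbers live.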

Radical it may be, but not everyone is so sure that XML will bring unalloyed benefits. Tim Brady, vice-president of production at Yahoo, the Web's most popular destination, is unequivocal about its negative effects. "XML will make it easier for business to spam search engines," he says. That is, XML will allow the unscrupulous to add spurious metadata that will put their Web pages higher up in search lists, or even in completely inappropriate ones. Still, as Lassila says, "any technology can be used for fraud and deception".

More serious is a warning sounded by Mark Pesce, co-inventor of the Virtual Reality Modelling Language (VRML) used for creating three-dimensional worlds. His fear is that if different groups of Web page designers stick to their own tags it could lead to "Balkanisation" of the Web. Lassila admits this is a frightening idea but argues that RDF is designed specifically to stop it happening.

And Bray thinks the notion of the Web splitting up is nonsense. "There's no point in creating my own tags unless I want other people to use them," he says. "My tags will need to be included in a stylesheet and offer some software that does something interesting." These constraints will make it difficult to get tags widely accepted.

Like Bray, most people are upbeat about XML. Even those traditional rivals Microsoft and Netscape are united in supporting the XML revolution. Bartlett at ArborText has no doubts about what's going on. "XML," he says, "will prove to be one of the top ten technological innovations of the first century of computing."

Glyn Moody is a writer and consultant specialising in the Internet. His e-mail address is glyn_moody@cix.co.uk

© Copyright New Scientist, RBI Limited 1998