The hypertext markup language, as we are all well aware, was an experiment that got out of the lab too soon. It was, and to a certain extent still is, a very simple way to describe a limited set of information for transmission and display on the Web. In the few short years it's been around, we've seen that various political and commercial forces have stretched the language almost to the point of breaking. So what's the next step?
Well, what if you could merge the simplicity of HTML with the unparalleled flexibility of standard generalized markup language, or SGML? That's the idea behind the extensible markup language, or XML.
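To make the idea concrete, here's a small sketch (the element names are invented for illustration, not from any spec): where HTML gives you a fixed set of presentation tags, an XML document can use tags that describe the data itself. Any conforming XML parser can read it; Python's standard library parser is used here as a stand-in.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML fragment: the tags (<recipe>, <ingredient>) describe
# the data itself, instead of HTML's fixed vocabulary (<p>, <b>, ...).
doc = """<recipe name="pancakes">
  <ingredient amount="2 cups">flour</ingredient>
  <ingredient amount="1">egg</ingredient>
</recipe>"""

root = ET.fromstring(doc)
print(root.tag)  # recipe
for ing in root.findall("ingredient"):
    print(ing.get("amount"), ing.text)
```

Because the markup carries meaning, a program can pull out "all the ingredients" without any knowledge of how the page is supposed to look.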
I've asked Tim Bray, co-editor of the XML spec, to give us some background on the project. Tim spent three years working on one of the largest electronic publishing initiatives in history - the New Oxford English Dictionary project. He then co-founded Open Text Corp., which created one of the first large search engines on the Web. He currently has an independent consulting practice called Textuality, and is representing Netscape in the XML standards process, including their Meta Content Framework proposal.
This week, we'll take a look at the motivation behind SGML on the Web, and how that resulted in the XML project. Next week, we'll dig into some practical applications of the technology.
JEFF: Can you tell us how the XML project came about?
TIM: Going back several years, some prominent techies in the SGML community had been saying that SGML was a good idea, but it was just too hairy for real people to get into; you could crack great big problems, but sometimes not do the simple things simply. Then the Web came along and showed the power of doing simple things simply, with the Internet providing the horsepower. Anyhow, in the summer of '96, Jon Bosak, a Sun guy and longtime SGML user (he did the Novell docs site), badgered the W3C about doing something for SGML on the Web, and they said he could form a committee and see what could be done. The people he picked for the committee were the same ones from SGML-land who had been talking simplification for years. The committee is pretty heavy - almost everyone on it is a chief scientist or Internet IPO architect or standards editor or some such.
The ostensible agenda was (a) better stylesheets than CSS, (b) better hyperlinking than <a href=....>, and (c) a simpler form of the language. Once we got together, it took about 15 seconds to decide to do it in the order (c), (b), and (a). Furthermore, there were, I think, no fewer than five of us who had already cooked up designs for an SGML simplification. The premise was, put in everything that's proven to work and easy to implement, throw the rest out. The work was mostly done between August and November '96 - it was pretty intense. When we first trotted it out, the SGML community mostly leapt on board instantly; getting our nose into the Web-grunts' tent has been a bit tougher, but it sounds like we're making good progress on that front. Interestingly, there were a couple of places where SGML had features that were going to be a *total* pain in the ass in network deployments; the SGML gang is impressed enough with XML that they have cooked up a "technical corrigendum" to SGML to iron out these wrinkles and keep XML Net-capable without losing ISO-SGML compatibility.
JEFF: We've already seen Microsoft using XML for their Channel Definition Format (CDF) for scheduling and delivering Web-based content. Apple's work on the Meta Content Framework is now being embraced by Netscape as another XML application.
TIM: The difference between a library and a pile of books on the floor of a big room is the card catalog (which is now computerized, of course). The card catalog uses an agreed-on format and an agreed-on vocabulary to let you find books by author, title, subject, and some other things. Of course, the Web has no librarians (aside from the guys at Yahoo and so on, who are way outnumbered), but even if you could get people to put cards in the catalog for their own pages, there's no agreed-on format or vocabulary. That's what we're trying to provide with MCF and XML. Once we have this, the people who publish on the Web and have their act together absolutely will make the effort to keep their metadata up to scratch. Then I'll be able to go to a search engine and do things like pull up resources on limnology of polluted waters hosted by US universities and updated since January '97 - or entertainment magazines with articles about Beck prior to July '96 that aren't talking about Jeff Beck - or mailing lists that discuss dual-citizenship issues.
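A sketch of the kind of query Bray describes, assuming every Web resource carried a small, agreed-on metadata record. The field names here are invented for illustration; the point is that once the fields are shared, the query becomes a simple filter rather than a full-text guess.

```python
from datetime import date

# Hypothetical metadata records a search engine might harvest from sites.
resources = [
    {"url": "http://limno.example.edu/", "subject": "limnology",
     "host_type": "us-university", "updated": date(1997, 3, 1)},
    {"url": "http://zine.example.com/beck", "subject": "entertainment",
     "host_type": "commercial", "updated": date(1996, 5, 10)},
]

# "limnology resources hosted by US universities, updated since January '97"
hits = [r["url"] for r in resources
        if r["subject"] == "limnology"
        and r["host_type"] == "us-university"
        and r["updated"] >= date(1997, 1, 1)]
print(hits)  # ['http://limno.example.edu/']
```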
Historically, the Net has no metadata to speak of. But all of a sudden in recent times there have been a lot of proposals for doing metadata. The idea behind MCF is that if all the different sorts of metadata in the world share something by way of vocabulary and data model, you get quite a bit of interoperability and the ability to ask questions about all sorts of different metadata in the same framework. For example, if Wired were to define an "Internet hipness index" and start assigning it to things out there, you'd define your own property, called IHI, and even if I didn't know exactly what the semantics were, in an MCF environment I would be able to find out that the property exists, that its domain is Web sites and its range is numeric values, that it comes from Wired, and that it was last updated whenever.
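A sketch of that property-description idea, with the element names invented for illustration (MCF's actual syntax differed): even a client that has no idea what "IHI" means can discover from the metadata itself that the property exists, what its domain and range are, and who defined it.

```python
import xml.etree.ElementTree as ET

# Hypothetical property definition; vocabulary invented for illustration.
schema = """<propertyType name="IHI">
  <domain>WebSite</domain>
  <range>number</range>
  <definedBy>Wired</definedBy>
  <lastUpdated>1997-06-01</lastUpdated>
</propertyType>"""

prop = ET.fromstring(schema)
print(prop.get("name"), "applies to", prop.findtext("domain"),
      "with values of type", prop.findtext("range"),
      "- defined by", prop.findtext("definedBy"))
```

The semantics stay opaque, but the shape of the data is self-describing - which is exactly the interoperability Bray is after.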
It's a richer world. The Web has meant less data locked away in proprietary formats, and metadata deserves the same treatment - it's just as important to get it out in the open.
Next week: Practical applications of XML.
Jeffrey Veen writes a weekly column on tools and related Web technologies for Webmonkey.