[This local archive copy mirrored from the canonical site: http://www.adtmag.com/Pub/Aug98/fe804.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
Internet Tools & Technologies
Searching for the Holy Grail, XML style
August 1998/By Jason J. Meserve
As the cult "X-Files" television show proclaims: "The truth is out there, somewhere." UFO and government conspiracy-theory buffs had their fondest wishes come true when the FBI placed a litany of documents relating to alleged UFO sightings and other investigations on its Web site. The agency's move was the result of complying with the enormous number of requests it receives under the Freedom of Information Act.
But to find that nugget of truth, Web users must wade unassisted through the vast resources available to them on the FBI site, as well as on thousands of other sites. An Unidentified Flying Object gaining some identity and proclaimed as the solution to such data handling problems is eXtensible Markup Language (XML). Various vendors from diverse segments are driving the youthful XML technology. These include tools, database, document management and other vendors. XML promises a meta language for describing data, separates data from presentation and offers useful parsing capabilities. As it is an offshoot of the Standard Generalized Markup Language (SGML), a fairly esoteric standard that gave birth to the Hyper Text Markup Language (HTML), it is still unclear if industry players can reach agreement on XML models.
The biggest public collectors of information are search engines such as Yahoo, Altavista, Excite and Infoseek. Virtually any search phrase entered into one of these engines returns thousands of results with limited sorting and descriptions. For example, a search of "application development" on Altavista returns 180,620 possible Web pages with this topic. Which pages are about tools? Services? Application Development Trends? Never mind the estimate that less than half of all Web pages are currently indexed.
A new spin on an old technology may help the dual problem of cataloging and searching for data on the Web. The XML Specification, approved by the Cambridge, Mass.-based World Wide Web Consortium (W3C) in February 1998, provides a means for data elements within a Web page to be described in a method readable by both machines and humans. XML developers can now create their own tags for describing data elements on a page, such as name, book title or stock price.
Search engines catalog a Web page by parsing virtually every word, including meta tags, on a given page to determine the topic. The reason more than half of the world's pages are not indexed is that it takes too long for the engines to process all of the necessary data. And when they do, their "understanding" of the page topic may not be the same as what the author intended. XML can help solve these parsing and comprehension issues.
By embedding XML that describes what the topic is for each page or paragraph, a search engine would have to look in only one place in each HTML document to get a synopsis -- the rest could be ignored. This would speed a search engine's ability to capture page data and give developers more control over how the search engine interprets the information.
Visionary developers see real promise here. However, the use of XML in search engines only scratches the surface of the specification's flexibility. Database, publishing, document management, application integration and even middleware vendors are now looking at ways of using XML to make the exchange of data between applications and entities easier.
"XML adds intelligence to otherwise unintelligible, unstructured data," said Peter Jordan, president of Microstar Software Ltd., Ottawa. "Look at the data in HTML, and the only way to identify it is visually. But with XML, this says 'It's a purchase order.'"
What makes XML great?
XML provides a universal means for describing and structuring data and sending it over HTTP for use anywhere, by virtually any application. Currently, HTML describes only how elements of a page should be interpreted visually by the browser; it does not describe any data that may be contained in the page. HTML is also limited to a finite set of tags to describe data. Web developers cannot create a new "headline" tag, for example. Also, different browsers interpret the same set of HTML tags differently. With XML, developers are not limited in the number of tags they can use. XML also throws out any notion of presentation; its only concern is data.
"The physical implementation of data is being done differently on all [Web] sites and all databases," said Steve Sklepowich, product manager of platform marketing at Microsoft Corp. "The goal of XML is to give one standard way to describe all data. It is going to be the meta language for describing data, regardless of what the data is."
XML is derived from SGML, a more complicated data description specification used in high-end publishing. While XML and SGML are similar in functionality, XML is missing all of the bells and whistles that make SGML complicated. For instance, SGML lets programmers declare whether a specific tag needs just an opening or closing tag around a block of data, whereas XML always requires both. XML is also optimized for use over the Web. Today, SGML-based applications can use XML-described data.
HTML also began as an SGML derivative, but the link was broken between Versions 1.0 and 2.0. The technology has also suffered from fragmentation, with Microsoft and Netscape Communications Corp., Mountain View, Calif., both adding proprietary tags that can only be parsed by their respective Web browsers. XML may be less likely to fragment since it has no pre-defined tags. In addition, the W3C XML Working Group that developed the specification was led by such luminaries as Adobe Systems Inc., San Jose, Calif.; ArborText Inc., Ann Arbor, Mich.; DataChannel Inc., Bellevue, Wash.; Hewlett-Packard Co., Cupertino, Calif.; Microsoft; NCSA; Netscape; SoftQuad Inc., Toronto; Sun Microsystems Inc., Mountain View, Calif.; and Vignette Corp., Austin, Texas. With such widespread input, one would hope that no single player will be able to compromise the specification.
The specification's strength is also derived from its history. "HTML didn't have any heritage when it was first released," explained Rita Knox, vice president and research director at Gartner Group, Stamford, Conn. "XML has two sets of heritage: everything learned from SGML and everything learned from the Web."
Nate Zelnick, project manager for industry relations at Allaire Corp., Cambridge, Mass., agrees. "The key thing [about XML] is that it follows a Web-centric development model," he said. "You're not locked into a proprietary data model or component model. Because XML is open, it can move between systems pretty easily."
The two major rules to follow when tagging information with XML are that every element must have both an opening and a closing tag (for example, <name>...</name>), and no two elements may overlap. However, elements may be nested inside one another (for example, <name>...<firstname>...</firstname> ...<lastname>...</lastname>... </name>). Note that tag names may not contain spaces.
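The two rules are easy to see in action with any XML parser. The following is a minimal sketch in Python, assuming only the standard library's xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative and do not come from any vendor mentioned in this article.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for unclosed tags, overlapping tags and other syntax errors.
        return False


# Illustrative sample data (hypothetical values).
nested = "<name><first>Ada</first><last>Lovelace</last></name>"
unclosed = "<name><first>Ada</first>"            # <name> is never closed
overlapping = "<name><first>Ada</name></first>"  # tags cross each other

results = {
    "nested": is_well_formed(nested),
    "unclosed": is_well_formed(unclosed),
    "overlapping": is_well_formed(overlapping),
}
print(results)
```

Only the properly nested string is accepted; the other two break the closing-tag rule and the no-overlap rule, respectively.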
The implementation of XML tags to mark text is similar to the implementation of HTML tags. This leads many to believe that XML was designed to replace HTML. However, this is not the case. Instead, XML allows the data to be separated from the presentation layer, making XML complementary to HTML.
"Because there's a hard wall between presentation and data, you get the general availability the Web was supposed to have," Zelnick said. "You get cross-platform, cross-application, arbitrary developer to arbitrary user things -- that's been Holy Grail computing."
SGML requires a rigid system of identifying what elements mean. To do this it uses a Document Type Definition (DTD). XML does not require a DTD in order for data to be exchanged. Data can be "discovered" if it does not have an accompanying DTD. XML data that obeys the syntax rules but is not checked against a DTD is called "well-formed," while XML that also adheres to a DTD is "valid." A DTD defines the rules by which tags are defined and interpreted. It also tells an XML parser which data elements are present in the document and what the relationship is between elements.
DTDs also ensure that any incoming data is correct and complete by encapsulating business rules into the requirements document. The most cited example given by vendors is that of a purchase order. The DTD can describe unit pricing, stock numbers, name, address, shipping information and more. When XML-based data is submitted, the DTD can validate the incoming information to ensure completeness.
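The completeness check described here can be approximated in a few lines. Python's standard library parsers do not validate against a DTD, so the sketch below is a hand-written stand-in rather than real DTD validation: it simply confirms that each required element is present and non-empty. The element names in REQUIRED_FIELDS and the sample orders are hypothetical, loosely modeled on the purchase-order fields listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields; a real DTD would declare these formally.
REQUIRED_FIELDS = ("name", "address", "stock_number", "unit_price", "shipping")


def missing_fields(order_xml):
    """Return the required child elements that are absent or empty."""
    root = ET.fromstring(order_xml.strip())
    missing = []
    for field in REQUIRED_FIELDS:
        child = root.find(field)
        if child is None or not (child.text or "").strip():
            missing.append(field)
    return missing


complete_order = """
<purchase_order>
  <name>Example Buyer</name>
  <address>1 Example Street</address>
  <stock_number>A-100</stock_number>
  <unit_price>19.95</unit_price>
  <shipping>ground</shipping>
</purchase_order>
"""

# Well-formed, but address, stock number and shipping are absent
# and the unit price is empty.
incomplete_order = (
    "<purchase_order><name>Example Buyer</name><unit_price/></purchase_order>"
)

print(missing_fields(complete_order))
print(missing_fields(incomplete_order))
```

Both orders are well-formed XML, but only the first would pass this check. A validating parser working from a real DTD would apply the same kind of rule without custom code.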
Unlike most standards that require vendor support before being useful for development, XML can be used today despite a lack of tool support. Because it is text-based, developers can now start encapsulating documents and other data streams with XML tags. All that is needed is a simple XML parser that can be hand-written by a good code warrior or downloaded from the Web. IBM offers a free Java-based version of its XML parser on its AlphaWorks site (www.alphaworks.ibm.com).
When used internally to pass information between departmental applications, all that is needed is an informal agreement on what individual tags mean so the data is properly processed at each end of the transaction. Given the fact that departmental applications are usually written by a handful of people, deciding on tag meanings is not a major hurdle. According to Microsoft's Sklepowich, a good example of well-formed XML is data stored in a database. The database provides structure; therefore a DTD is not needed.
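As a rough illustration of the database point, the sketch below dumps a table to well-formed XML with no DTD involved; the column names supply the tags. It assumes Python's built-in sqlite3 module and an in-memory database. The table name, columns and rows are invented for the example, and rows_to_xml is a hypothetical helper, not part of any product discussed here.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table):
    """Serialize every row of a table as XML, one element per column."""
    # The table name is interpolated directly, so it must come from
    # trusted code, never from user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical departmental data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (part_number TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("P-1", "valve"), ("P-2", "gasket")],
)

xml_text = rows_to_xml(conn, "parts")
print(xml_text)
```

The receiving application only needs to know, by informal agreement, what part_number and description mean.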
External use will most likely require some formal agreement on tag definitions -- a DTD. Anyone can create a DTD, but getting others to agree is not always easy. "There is nothing impeding the use of XML today," said Gartner's Knox. "The biggest thing is that people need to reach agreements on what the models will be."
If we return to the purchase order example, an extranet application that collects purchase orders from suppliers and customers via the Internet would likely require a DTD. This is because each party involved may use slightly different data structures when building an order. Validated XML can bridge this gap.
Many vertical sectors are in the process of developing industry-standard DTDs. For instance, Electronic Data Interchange (EDI) vendors are working toward the creation of a set of universal tags to be used for data interchange between applications. Other verticals include health-care, document publishing, financial services and middleware vendors.
In addition, a group of diverse vendors has formed the XML Active Content Technologies (X-Act) Council, which promotes the use of active content. X-Act (www.x-act.org) is focusing its attention on applying XML technology to real-world applications.
Because XML can validate data using more than one DTD, there may be some duplicate use of tag names leading to parser confusion. For instance, if three different DTDs use a <DATE> tag, are they referring to a fruit, a night on the town or an ISO-standard calendar day? The Namespaces specification, currently a W3C working draft, will solve the problem of multiple schemas. Namespaces allow a prefix to be attached to each tag that directs the parser to the correct schema. "[Namespaces is] a simple concept, but an important technology," said Microsoft's Sklepowich. "The syntax used to describe different schemas is really simple stuff, but very powerful."
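A short sketch shows how prefixes keep two DATE tags apart. It uses the xmlns:prefix syntax as the Namespaces specification was eventually standardized, which may differ in detail from the working draft current at the time of this article. The two namespace URLs are made-up placeholders, and the parser is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Two vocabularies that both define a DATE tag (placeholder URLs).
document = """
<order xmlns:cal="http://example.com/calendar"
       xmlns:food="http://example.com/grocery">
  <cal:DATE>1998-08-01</cal:DATE>
  <food:DATE>Medjool</food:DATE>
</order>
""".strip()

root = ET.fromstring(document)

# Map each prefix to its namespace so lookups are unambiguous.
namespaces = {
    "cal": "http://example.com/calendar",
    "food": "http://example.com/grocery",
}

calendar_date = root.find("cal:DATE", namespaces).text
fruit_date = root.find("food:DATE", namespaces).text

# Internally the parser expands each prefix to its full namespace name.
expanded_tags = [child.tag for child in root]

print(calendar_date, fruit_date)
print(expanded_tags)
```

The parser treats the two DATE elements as entirely different tags because the prefixes resolve to different namespaces.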
The W3C also has a number of other specifications they are working with that support and further enhance XML's capabilities:
What kind of animal is it?
Allaire Corp.'s Zelnick said that explaining XML is similar to describing what a particular animal looks like to someone who has never seen one. "Take someone who has never left the city, and try and explain gazelles to him or her," he said. "But sit down and show them a picture and they say, 'Oh that's all it is.'"
Milwaukee-based Siemens Power Corp. recently completed rolling out an SGML-based document management solution that keeps track of manuals for thousands of ongoing projects at the power plant manufacturer. According to Sean Sullivan, multimedia manager at Siemens Power, the SGML development and rollout took almost three years, with 10 to 25 developers working on the project at any one time. While XML represents a cheaper and quicker solution, the option was not available back in 1995 when Siemens undertook its SGML initiative, Sullivan said.
Sullivan and company are now investigating XML as a means of extending Siemens' SGML capabilities to suppliers, vendors and other divisions within the parent organization. "We're looking at how we can open [our SGML solution] up to other companies that have not implemented SGML, which is quite a few," Sullivan said. "We can't go to other companies and say go back and implement SGML over a number of years, and at a cost of millions, just so we can exchange documents more easily with them."
The combination of XML and SGML allows Siemens to "radically transform" information and to integrate documents with existing back-end data, according to Sullivan. This implementation also allows Siemens to more easily validate information coming in from external sources.
"Currently we use inline components for displaying part numbers," Sullivan said. "When we move online, we want to retain the fact that this element is a part number. All of our corporate information is driven off of part numbers."
Siemens is using Interleaf, a content management system from Interleaf Inc., Waltham, Mass., to author its documents and manage its massive array of documentation. Sullivan said the firm plans to use Interleaf's upcoming XML product, code-named BladeRunner, for visually managing and producing XML data. The product is currently in development by Interleaf and Microstar, and is expected to ship in the fourth quarter.
Part of BladeRunner's benefit depends on the XSL specification being debated by the W3C. The product will use XSL to apply different styles to the data depending on the type of output desired.
"The Composer piece of BladeRunner lets users reuse and apply style to XML data," explained John Pavlov, vice president of engineering at Interleaf. "Users can have a different style for Web publishing than would be used in traditional printing -- you don't use headers and footers on a Web page. The same information is printed, but, based on the medium being outputted, the data is published differently."
Microstar's contribution helps users work with XML without knowing they are using the technology. Developers can create document templates in Microsoft Word that adhere to an XML Document Type Definition, then mail the template to external customers and suppliers. The customers, in turn, use the template as they would any Word document, unaware they are entering XML-compliant data.
This technology allows the data coming back into Interleaf and BladeRunner to be machine-validated. "Without DTD, you cannot machine validate," said Microstar's Jordan. "DTD is the magic trick."
Documentum Inc., Pleasanton, Calif., another company from the document management arena, is also planning to integrate XML into its product line. "Most people think it will be the basis for EDI and a means of accurate data exchange," said Matthew Shanahan, Documentum's director of product management and marketing. "Primarily, we see it as a tool for business-to-business exchange right now as opposed to business-to-consumer exchange."
Thompson Editorial Asset Management Solutions, or Teams, Rockville, Md., uses XML as an option for exporting data out of its high-end "content expression" server. "Coming out of Teams, products take different forms," explained Dipto Chakravarty, director, systems development for Teams. "For the online space, XML is most desirable, though HTML is most used."
Chakravarty believes XML is the weave needed to logically tie complex documents containing text, images and hyperlinks together into a single context. "This is really the Holy Grail we've been waiting for."
All over the map
Beyond publishing and document management, there are a number of other initiatives to build XML into product lines in a wide variety of areas. Among the more interesting recent announcements, webMethods Inc., Fairfax, Va., has released its B2B Integration Server. B2B uses XML to encapsulate business services so customers, partners and suppliers may access them in a uniform manner.
"We're implementing EDI for the rest of us," said Philip Merrick, president and CEO of webMethods. "XML lets us do the same kind of application, but it's better, faster, cheaper and much more flexible. Plus, it rides on top of the existing Web infrastructure."
Merrick added that implementers of his company's technology do not have to wait for their partners and other external users to convert to XML. B2B also has the ability to read in an existing HTML-based E-commerce site. "B2B leverages pure XML, as well as brings in existing E-commerce Web sites that partners have already implemented," Merrick said. "That's a win all the way."
Junglee Corp., a "virtual database company" located in Sunnyvale, Calif., has developed a system for mechanically integrating data from multiple sources, in most cases Web sites, and presenting the output in a single format. While the company began working on its initiative before the advent of XML, it has turned to the up-and-coming technology as a source for both collecting and integrating data.
On the input end, Junglee's technology can be used to collect comparative data from multiple Web sites, such as booksellers. For example, a person looking for a specific book can enter its title and Junglee will reference a multitude of sites, returning a comparative list of pricing, said Anand Rajaraman, Junglee's CTO. Junglee uses a DTD to deliver uniform data to the browser, which formats and presents it to the user. Junglee can present a range of services beyond comparative shopping, including unified Web searches and job hunting.
"When Web sites begin using XML more, and there are more XML data sources available, it will make it easier for us to integrate those sources into our system," Rajaraman said. "There are hundreds of bookstores that are not part of our system because they only expose HTML interfaces, which makes a lot of work for us. If they begin using XML, they can be added easily."
SilkNet Software Inc., Manchester, N.H., uses XML in its Web-based self-service customer applications. According to company CTO Eric Carlson, every request directed into SilkNet's application server arrives via an XML request block sent over standard HTTP. Results are sent directly to the browser, bypassing HTML page generation using Dynamic HTML (DHTML) and scripting. For browsers that do not support DHTML, the XML is dumbed down into HTML, Carlson said.
"This allows us to provide logical services and lets the application server look up which components actually service this, which can change over time," Carlson explained. "We can also access several back-end stores and bring them into a unified view."
In the tool area, Allaire Corp. uses XML as the basis for its Visual Tool Markup Language (VTML) found in its HomeSite product. VTML allows developers to use new tags to create wizards by outlining their functionality in VTML. "We can extend our tool functionality by using XML," said the firm's Zelnick.
Data Director for Web, from Menlo Park, Calif.-based Informix Software Inc., enables developers to build applications that run using the Informix Web Integration Option. This option provides a set of tags that are used to construct template Web pages. At runtime, the Web Integration Option dynamically combines data from an Informix database with these templates to create Web pages. While the template editor does not provide explicit support for XML technology, company officials claim it can be used to build XML templates in addition to HTML.
The crystal ball
Even as it is hailed as the Holy Grail of Web technology, XML is still in its infancy. However, it is growing quickly. The general consensus among those interviewed for this article is that we will begin to see mainstream adoption of XML technology in the next 12 to 18 months as tools and user perception mature.
"The most important thing XML has done is revitalized interest [in universal data exchange]," said Siemens' Sullivan. "I was doing SGML in 1987 with the military and it held the same promise, but no one came through. HTML took away from SGML until experience said this is not a good way to [do] document management, a stronger solution is needed. XML is absolutely that solution."
The maturation process will begin with the browser. Today, only Microsoft's Internet Explorer supports XML, and only through the Data Source Object that binds data to HTML. Netscape has announced plans and floated trial versions of its XML-supporting browser on its Mozilla developer site. Gartner's Knox believes the public needs XML-native browsers (that interpret XML directly and do not reduce XML streams to HTML) for more mainstream adoption.
Actual use of the technology is not difficult. "Our customers are finding they can get up fairly quickly, and it is not long before they realize benefits," webMethods' Merrick said. "One does not have to commit to a whole lot more than they are already committed to."
Tools vendors such as Microsoft and Sybase Inc., Emeryville, Calif., are also ramping up support for XML in their product lines.
According to Richard Pledereder, Sybase's technical director, his company sees XML use in its enterprise Web computing initiatives to help extend databases; in application-to-application data interchange, such as in its new Financial Server offering; and as a way of providing uniform information exchange between a data warehouse infrastructure and standard databases. "With XML, we will see an effect similar to Java on all three tiers of computing: client, middle and back end," Pledereder said. "Today, we are with XML where Java was three years ago -- talking about it, but no one is using it."
XML will have a broad impact on the way information is exchanged and processed via the Internet. The question is whether vertical industries can agree on standard schemas to describe their data. If so, XML should be everywhere, though not everyone will be aware they are using it, especially when it comes to the Web.
But they just might have an easier time searching for the truth.