[Archive copy mirrored from: http://www.csclub.uwaterloo.ca/u/relander/Academic/XML/xml_mw.html]
Written by Richard Lander
This report investigates the eXtensible Markup Language, which is a subset of the Standard Generalized Markup Language.
The Standard Generalized Markup Language and the HyperText Markup Language are discussed and then used to introduce the eXtensible Markup Language. The power, limitations and current usage of each are used as points of comparison and to stress the need for a new standard.
It is concluded that the eXtensible Markup Language will be the most flexible meta-language, will meet industry requirements if coupled with DSSSL and Java, and will become the basis of the World Wide Web.
It is recommended that information providers use XML in some manner, especially XML with Java, to capture the power of markup languages and to go beyond it. It is also recommended that information processing software vendors make their products XML-savvy.
Information and computers have long been technological partners. Computers were needed to process information in a fast and effective manner. Software was written so that computers could process information in meaningful ways. The software industry needed standardized information formats and structures as information processing became problematic due to complex and diverse formats and structures. Generic markup languages were created as part of the solution to these problems.
Markup languages have evolved considerably since their inception. Even so, computers employing markup languages have not completely met industry requirements to process and deliver complex information. Markup languages need to be a good store of data, extensible, highly structurable, and easily delivered over a network. No current markup language embodies all these features. Still, markup language use is extensive. Markup languages are used in both the public and private sectors, including the U.S. Military, Microsoft and Newbridge Networks.
XML (eXtensible Markup Language) has been touted by markup advocates as a new markup phenomenon. It possesses the extensibility and structuring abilities of SGML (Standard Generalized Markup Language) but lacks the complex optional features. XML is generic SGML for the World Wide Web.
This report is an analysis of current markup technology, and its ability to serve industry. SGML and HTML (HyperText Markup Language), current markup technology, are used to introduce XML: The New Markup Wave.
This report was written for a markup-language-savvy audience. The terms and concepts used within this report are of a high level, but they are necessary in order to fully explain XML to this audience.
Standard Generalized Markup Language, commonly known as SGML, is actually a meta-language. A meta-language is a set of generalized rules used to write other, more specific languages, called markup languages. HTML, the de facto markup language of the World Wide Web, is a markup language created by using the SGML meta-language. SGML is used primarily for textual interchange, but can call upon other data types. It is an excellent store of data and is completely extensible, but it is hard to adapt for network delivery.
SGML markup languages, such as HTML (HyperText Markup Language) and TIM (Telecommunications Interchange Markup), are used in a variety of areas and ways. TIM is used for information storage, is highly structured, and is specific to the telecommunications industry, whereas HTML is used for information display, has very little structure, and is for mass use. Many of the world's biggest technology companies use SGML in some way. Most of the world's technological industries have established a standard markup language to promote industry cooperation and growth.
Newbridge Networks uses SGML markup languages to store and display electronic documentation. Microsoft uses an SGML markup language to store and display information on its Encarta Encyclopedia CD. Intel, National Semiconductor, Philips, Texas Instruments and Hitachi, all semiconductor manufacturers, formed the Pinnacles group to design an industry-specific SGML markup language (Bosak). Companies that use SGML, especially industry-specific markup languages, will not be confronted with information compatibility problems when trading information with partners.
SGML's power is its flexibility. It is an open standard (ISO 8879:1986) that can be applied to almost any information structure. SGML can be as simple as HTML, or as complex as TIM. It is extensible, can be used with any degree of structure, is well established, and is an excellent store of data. SGML has many complex options that, although esoteric in nature, can solve some of the most complex information processing problems.
SGML is non-proprietary and is system- and platform-independent. SGML documents created on obsolete computer systems can be transferred to other systems, preventing information loss. The same flexibility can be used between otherwise incompatible computer platforms and applications.
Information re-usability, yet another of SGML's strengths, promotes efficiency and economy in document creation through the re-use of content modules. Content that exists within a document can be used more than once or used in another document without having to reproduce it.
Software cannot be easily programmed to accommodate SGML. This problem is mainly due to the complexity of SGML's options, which are only used by a small percentage of its users. If the number of options used in high-level SGML markup languages were significantly decreased, or even dropped to zero, writing SGML-compliant software would become much easier.
SGML markup languages cannot be delivered over the World Wide Web easily, partly because of software constraints and partly because SGML instances are not portable. Viewing a high-level markup language instance requires a DTD (document type definition), a stylesheet and a catalogue file. A DTD contains the relationships between all the structures within an SGML markup language, such as chapters and tables. A stylesheet contains the formatting of those structures. If a high-level markup language instance loses its DTD, stylesheet or catalogue file, the instance can only be viewed as raw SGML and not in a formatted, legible form. The SGML instance must also be valid when parsed against its DTD. If it is not valid, it will not be displayed.
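To make the DTD's role concrete, the following is a minimal Python sketch and not a real SGML parser. A hypothetical dictionary, CONTENT_MODEL, stands in for a DTD by listing which child elements each element may contain, and is_valid checks an instance against it. The element names are invented for illustration, and the sample instances use XML syntax only because Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: each element maps to the set of
# child elements it is allowed to contain.
CONTENT_MODEL = {
    "book": {"chapter"},
    "chapter": {"title", "para", "table"},
    "title": set(),
    "para": set(),
    "table": set(),
}

def is_valid(element):
    """Return True if every element is declared and every child is allowed."""
    allowed = CONTENT_MODEL.get(element.tag)
    if allowed is None:
        return False  # the element is not declared at all
    return all(child.tag in allowed and is_valid(child) for child in element)

good = ET.fromstring(
    "<book><chapter><title>Intro</title><para>Text</para></chapter></book>"
)
bad = ET.fromstring("<book><para>Loose paragraph</para></book>")

print(is_valid(good))  # True
print(is_valid(bad))   # False: a para may not sit directly inside a book
```

A real DTD also constrains the order and number of children, attributes and entities; this sketch checks only parent-child relationships.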
Industry support has always been a problem for SGML. Microsoft and Netscape control the markup language display industry and tend to promote HTML. Writing software for high-level SGML markup languages is difficult for the same reasons that inhibit network delivery. Good SGML editors are available, but SGML browsers are not as prevalent. SoftQuad's Panorama is an easy-to-use SGML browser, although given the industry clout of Netscape, it is not known or understood by the wider web of users.
While HyperText Markup Language is the best known of SGML markup languages, it is considerably different from other prominent languages. It has very little structure, and is not extensible. Many people who are familiar with HTML are unaware of SGML. The two are often thought of as separate entities, and in some ways are so. For the purpose of this report, HTML is thought of as a valid SGML markup language, although low-level and somewhat separate from other SGML markup languages.
HTML's strengths are page layout and network delivery. Web browsers display HTML instances in accordance with stylesheets that are hard-coded into them. Since Netscape Navigator and Microsoft Internet Explorer understand HTML, the processing and display of each page is fast and generally seamless. Although tables are used for page layout instead of another mechanism designed specifically for layout, HTML is a veteran at displaying content, text, graphics and other information types. Given that most Web authors are only concerned with pages as opposed to any higher-level structures like chapters or entire documents, HTML's page layout strength has been a large part of what has led to its widespread use.
HTML's allure is its portability and ease of use. Since HTML is a small, fixed and simple element set, all instances are capable of using the same technology. HTML instances do not have to be valid. They need only be well-formed. HTML instances are not parsed against a DTD but are displayed in adherence to a stylesheet hard-coded into a browser. An HTML instance can be easily downloaded from a Web site anywhere in the world, viewed, understood and even applied to other instances. The huge number of personal home pages on the WWW is evidence of that. My home page is part of that phenomenon at: http://www.csclub.uwaterloo.ca/u/relander/.
HTML may be suited to network delivery and page layout, but it lacks any higher level capability. It does not understand the concept of documents, only pages. It completely lacks any structuring ability, which makes it a very poor store of information. HTML instances can only be displayed and cannot be effectively searched, re-used or validated. In most industry applications, technical information is stored in SGML, and then converted to HTML solely for the purpose of display.
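The difference between markup that describes appearance and markup that describes content can be illustrated with a short Python sketch. The catalogue, its element names and the prices helper are all invented for this example; the point is only that a program can pull data out of the structured version and finds nothing to hold on to in the presentational one.

```python
import xml.etree.ElementTree as ET

# Presentational markup: the tags say how the text looks, not what it is.
html_like = "<p><b>Widget</b> <i>19.95</i> <b>Gadget</b> <i>4.50</i></p>"

# Descriptive markup with invented element names: the tags say what the text is.
structured = (
    "<catalogue>"
    "<product><name>Widget</name><price>19.95</price></product>"
    "<product><name>Gadget</name><price>4.50</price></product>"
    "</catalogue>"
)

def prices(text):
    """Map product names to prices, relying only on element names."""
    root = ET.fromstring(text)
    return {
        product.findtext("name"): float(product.findtext("price"))
        for product in root.findall("product")
    }

print(prices(structured))  # {'Widget': 19.95, 'Gadget': 4.5}
print(prices(html_like))   # {}: nothing identifies a product or a price
```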
The eXtensible Markup Language is a subset of SGML. XML is not a markup language, as HTML is, but a meta-language that is capable of containing markup languages in the same way as SGML. The XML specification is 26 pages long whereas the SGML specification is 500. XML only uses the parts of SGML that are used by the majority of SGML users. The pieces of the SGML specification that were used were rewritten, making XML a better-defined meta-language. This selection makes XML flexible and easily deliverable over a network.
The parts of SGML that XML uses are structure, extensibility, validation, information reusability and modularization of information. XML is non-proprietary and system- and platform-independent, making it an excellent store of information.
XML was created as a response to the limitations of current markup languages. It was designed to embody the power and flexibility of SGML and the network delivery of HTML. XML is more than a meta-language, though. Its syntax is a subset of SGML, its hyperlinking a subset of HyTime, and its stylesheets a subset of DSSSL (Prescod). An SGML Working Group developed XML with several design goals:
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It should be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
The main difference between XML and SGML is not in information storage, where they are very similar, but in network delivery. SGML markup languages cannot be practically delivered over the WWW because SGML instances must be accompanied by a DTD and a stylesheet and need to be valid. XML documents will be much easier to deliver because they will only need to be accompanied by a stylesheet and will not need to be valid. XML instances will only need to be well-formed, as opposed to valid, to be interpreted by a browser. XML-savvy browsers will have to download and analyze an instance's stylesheet before displaying the instance. XML DTDs and stylesheets will be readily available for use and modification by users who are unable or unwilling to create their own. Information providers may still require that XML browsers parse specific, possibly highly structured instances against a DTD before displaying them.
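The well-formedness requirement described above can be checked without any DTD. The sketch below uses the XML parser in Python's standard library, which tests well-formedness only and does not validate; the memo element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every element is closed and the elements nest properly.
nested = "<memo><to>Staff</to><body>Meeting at noon</body></memo>"
# The to and body elements overlap instead of nesting.
overlapping = "<memo><to>Staff<body>Meeting at noon</to></body></memo>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

Neither instance is valid or invalid in the SGML sense, because no DTD is involved; the check only confirms that the tags are balanced and properly nested.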
XML will allow for structured Web delivery. XML Web sites will employ fast, complex searches, better navigation, and dynamic table of contents. With XML, an entire Web site will be able to be downloaded, and then displayed in any way, off-line. Once the desired information is downloaded, all information processing will be done client-side and will be much faster. Web sites will become more efficient because of the shift from server-dependence to client-independence.
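As a rough illustration of client-side processing, the Python sketch below builds a table of contents and runs a simple search over a structured document once it has been downloaded. The site, chapter, title and para element names are assumptions made for this example, not part of any XML specification.

```python
import xml.etree.ElementTree as ET

# A small downloaded "site" with invented element names.
SITE = """
<site>
  <chapter><title>SGML</title><para>A meta-language for markup.</para></chapter>
  <chapter><title>HTML</title><para>Page layout and network delivery.</para></chapter>
  <chapter><title>XML</title><para>Structure plus network delivery.</para></chapter>
</site>
"""

root = ET.fromstring(SITE.strip())

def table_of_contents(site):
    """Collect the chapter titles in document order."""
    return [chapter.findtext("title") for chapter in site.findall("chapter")]

def search(site, word):
    """Return the titles of chapters whose paragraphs mention the word."""
    word = word.lower()
    return [
        chapter.findtext("title")
        for chapter in site.findall("chapter")
        if any(word in (para.text or "").lower() for para in chapter.findall("para"))
    ]

print(table_of_contents(root))   # ['SGML', 'HTML', 'XML']
print(search(root, "delivery"))  # ['HTML', 'XML']
```

Both operations run entirely on the client after a single download, which is the shift away from server dependence that the paragraph above anticipates.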
HTML has needed Java implementations to deal with its information processing and display inadequacies. XML will use Java for information processing and display implementations that go beyond the scope of markup languages. XML will use Java to make the unimaginable happen, not just the already expected; however, Java may also have to be used to regain the power of the SGML options that XML lacks. Jon Bosak, the chair of the World Wide Web Consortium editorial review board, believes that Java will be used extensively and in new ways once XML is adopted by industry and writes, "XML gives Java something to do" (Bosak).
SGML only allows for unidirectional linking (conventional hyperlinking) within logical documents. HTML breaks the convention and allows for unidirectional links outside documents as well. XML employs several types of linking that are not used by either SGML or HTML. Jon Bosak and many other markup advocates have been disconcerted by the popularity of a markup language as simple as HTML. They see XML as a major improvement to the information systems used on the World Wide Web.
Bosak considers the simple linking of today's Web a far cry from the systems that were built and proven in the 1970s and 1980s. In a true hypertext system of the kind envisioned for the XML effort, there will be standardized syntax for all of the classic hypertext linking mechanisms:
- Location-independent naming
- Bidirectional links
- Links that can be specified and managed outside of documents to which they apply
- N-ary hyperlinks (e.g., rings, multiple windows)
- Aggregate links (multiple sources)
- Transclusion (the link target document appears to be part of the link source document)
- Attributes on links (link types)
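Two of the mechanisms listed above, bidirectional links and links managed outside the documents they apply to, can be sketched as a plain data structure. The Python below is not XML linking syntax; the LINKS table, the document names and the link types are all hypothetical.

```python
from collections import defaultdict

# Hypothetical out-of-line link table: (source, target, link type).
# None of these documents needs to contain the links that refer to it.
LINKS = [
    ("report.xml", "glossary.xml", "definition"),
    ("report.xml", "bosak.xml", "citation"),
    ("press.xml", "bosak.xml", "citation"),
]

def build_index(links):
    """Index every link from both ends so it can be followed in either direction."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for source, target, link_type in links:
        outgoing[source].append((target, link_type))
        incoming[target].append((source, link_type))
    return outgoing, incoming

outgoing, incoming = build_index(LINKS)

print(outgoing["report.xml"])  # [('glossary.xml', 'definition'), ('bosak.xml', 'citation')]
print(incoming["bosak.xml"])   # [('report.xml', 'citation'), ('press.xml', 'citation')]
```

Because the table lives apart from the documents, a link can be added or retyped without editing either end, and a reader of bosak.xml can ask which documents cite it.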
XML lacks many of SGML's complex features that have become indispensable to some information providers. The purpose of XML is to avoid those features to attain network delivery; however, for providers that require the full power of SGML, XML may not be a practical alternative. Some providers may still need to stay with a two-tiered process: information will still be stored in a high-level SGML markup language, but instead of being converted to HTML for display, it will be converted to a high-level XML markup language. Although the ability to use a high-level markup language for network delivery with XML will be a significant advancement from the simplicity of HTML, losing the complexity of SGML will still be a loss to information providers.
Web sites created with XML markup languages will further test current networking hardware. An XML instance will be the same size as its HTML counterpart. The difference is in XML use. Since XML is capable of much more than HTML, XML will be able to accomplish much more, and as such, be accompanied by much larger files. Given the greater capabilities of XML, it will also draw more heavily on multimedia components, such as specialized font sets, graphics, video, sounds, and third-party plug-ins. In many instances, the bandwidth required to view Web documents will be much greater than it is now because information providers will be able to take advantage of XML. Information providers will be able to create Web sites that will be more like television and highly customizable. Networking hardware and software will have to improve to support the Web that XML will allow.
Dynamically changing the representation of information is one of many applications of the Java-XML duo. Information products will become completely customizable and dynamic. Users will be able to choose languages, levels of comprehension, graphical user interfaces, and even content.
WWW users are currently familiar with the pull technology paradigm. Pull requires a user to click on a hyperlink to retrieve desired information. Push technology, the next big Web and media phenomenon, is the opposite. Information providers will begin to be able to deliver information to users. Users will configure their browsers to receive information from push information providers at user-defined intervals. Push technology exists now, PointCast being the most popular example. XML will take push technology from its present state and make it completely customizable. Users will receive only the information that they want. Push-enabled browsers will soon be able to be customized to, for example, download only Wired Magazine articles about developments in 128-bit cryptography, or retrieve ski conditions at the 15 biggest ski resorts in North America in the winter and European gardening tips in the summer.
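A customizable push client of the kind described above might filter incoming items against a user profile. The Python sketch below is a guess at how that could look: the channel and item elements, the source and topic attributes, and the PROFILE set are all invented for illustration and do not correspond to any existing push format.

```python
import xml.etree.ElementTree as ET

# A pushed channel with invented element and attribute names.
FEED = """
<channel>
  <item source="Wired" topic="cryptography"><title>128-bit keys</title></item>
  <item source="Wired" topic="hardware"><title>New chips</title></item>
  <item source="SkiReport" topic="ski-conditions"><title>Fresh powder</title></item>
</channel>
"""

# Hypothetical user profile: the (source, topic) pairs the user wants.
PROFILE = {("Wired", "cryptography"), ("SkiReport", "ski-conditions")}

def select_items(feed_text, profile):
    """Keep only the titles of items whose source and topic match the profile."""
    channel = ET.fromstring(feed_text.strip())
    return [
        item.findtext("title")
        for item in channel.findall("item")
        if (item.get("source"), item.get("topic")) in profile
    ]

print(select_items(FEED, PROFILE))  # ['128-bit keys', 'Fresh powder']
```

The filtering works only because each item carries descriptive markup; the same selection could not be made reliably from purely presentational markup.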
The World Wide Web is sometimes called the World Wide Wait, partially because of bandwidth-heavy Web broadcasts that occur every day. Unlike a television station, a server has to send out a separate broadcast to every recipient. When files are big, as they are in video broadcasts, huge amounts of bandwidth become unavailable to other Web traffic. If single copies of a broadcast were sent to Toronto, Vancouver, Chicago, Los Angeles, Paris, Auckland, and London addresses before being split into, for example, 500 copies of a 2.5Mb file to be distributed within those cities, large amounts of precious bandwidth would be saved, which would result in faster broadcasts. An XML markup language instance could be used to store the e-mail or IP addresses of those recipients. It would provide the addresses when the broadcast split. XML may become a mechanism that makes the Web more efficient.
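The saving can be estimated with simple arithmetic. If the 500 copies are read as 500 per city (an assumption, since the text does not say), sending every copy individually moves 7 x 500 x 2.5 = 8750 units over the long-haul links, whereas sending one copy per city moves only 7 x 2.5 = 17.5. The Python sketch below performs the same calculation on a small, hypothetical recipient list; the recipients, city and address element names are invented, and the addresses come from a range reserved for documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipient list, grouped by city.
RECIPIENTS = """
<recipients>
  <city name="Toronto">
    <address>192.0.2.1</address>
    <address>192.0.2.2</address>
  </city>
  <city name="Paris">
    <address>192.0.2.101</address>
    <address>192.0.2.102</address>
    <address>192.0.2.103</address>
  </city>
</recipients>
"""

FILE_SIZE = 2.5  # size of one copy, in the units used in the text

def long_haul_traffic(recipients_text, file_size):
    """Compare one long-haul copy per recipient with one copy per city."""
    root = ET.fromstring(recipients_text.strip())
    cities = root.findall("city")
    recipient_count = sum(len(city.findall("address")) for city in cities)
    per_recipient = recipient_count * file_size
    per_city = len(cities) * file_size
    return per_recipient, per_city

print(long_haul_traffic(RECIPIENTS, FILE_SIZE))  # (12.5, 5.0)
```

The sketch counts only the traffic between the server and each city; distribution within a city is the same in both cases.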
Much of Web document formatting is hard-coded into Web browsers and cannot be changed. Netscape Navigator always formats the HTML <H1> element in the same way. With stylesheets, the <H1> element could be made to be a certain size, font and colour in some documents and have different values in others.
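The contrast between hard-coded formatting and stylesheets can be sketched with ordinary lookup tables. In the Python below, the two stylesheets, the BROWSER_DEFAULT values and the format_for function are hypothetical; the code uses neither CSS nor DSSSL syntax and only shows the same element receiving different formatting in different documents.

```python
# Hypothetical stylesheets: element name -> formatting properties.
REPORT_STYLE = {"H1": {"size": 24, "font": "Times", "colour": "black"}}
NEWSLETTER_STYLE = {"H1": {"size": 36, "font": "Helvetica", "colour": "blue"}}

# Stand-in for the formatting a browser hard-codes when no stylesheet applies.
BROWSER_DEFAULT = {"size": 18, "font": "Times", "colour": "black"}

def format_for(element, stylesheet=None):
    """Look up an element's formatting, falling back to the browser default."""
    if stylesheet and element in stylesheet:
        return stylesheet[element]
    return BROWSER_DEFAULT

print(format_for("H1")["size"])                    # 18
print(format_for("H1", REPORT_STYLE)["size"])      # 24
print(format_for("H1", NEWSLETTER_STYLE)["size"])  # 36
```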
Much of the rationale behind markup languages is separating content and format. The use of stylesheet languages such as CSS (Cascading Style Sheets) and DSSSL (Document Style Semantics and Specification Language) will make the divide between content and format even greater. DSSSL will take document formatting to a new level.
Jade, a DSSSL engine, will make SGML and XML documents more portable. Jade, accompanied by a DSSSL stylesheet, can convert an SGML or XML instance into an RTF (Rich Text Format) file. Many word processors, including Word, can read RTF, making the huge world-wide collection of SGML documents much more accessible than before.
Given the power and flexibility of XML, especially compared to SGML, vendors will most probably adopt XML. Many high-technology companies and universities were involved in the creation of XML. The involvement of these leading companies and institutions almost guarantees the acceptance of XML.
The XML working group includes members from SoftQuad, Adobe, IBM, HP, Microsoft, Lockheed Martin, NCSA, Novell, Sun, Boston University, Oxford University, and the Universities of Illinois and Waterloo. (Maloney)
Vendor support for SGML has been good but has not taken it to the level that HTML has experienced. SGML is not a profitable standard to support because of its complexity. Writing software for all SGML options is difficult. XML lacks those options, making software development much easier. XML's structure, extensibility, and validation abilities, combined with its lack of options, will make for powerful yet relatively easy-to-program tools.
XML versions of many SGML markup languages will be written, to gain Web delivery.
XML coupled with DSSSL and Java will meet industry requirements.
XML coupled with DSSSL and Java will revolutionize the Internet and specifically the Web.
XML may be used as an on-ramp to SGML.
XML will become the basis of new digital media phenomena, such as push technology.
Information providers who currently store their product in SGML but convert to HTML for Web delivery should rely on XML for both.
Information providers who currently take advantage of SGML options should either use XML with Java to replace that power or should employ XML markup languages for the purpose of Web delivery.
Information providers who do not want to be confused by the complexity of SGML should use XML as an on-ramp before harnessing the full power of SGML.
Information processing software vendors should make their products XML-aware so that information providers will be able to write in XML markup languages, to take advantage of XML: the new markup wave, as it crests.
Alschuler, Liora. "Netscape, On Second Thought, Warms to XML Spec." WebWeek. 8 April 1997.
Bosak, Jon. "XML, Java, and the future of the Web." Sun Microsystems: 10 March 1997. http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html (April 1).
Maloney, Murray. "XML Press Release." San Diego, California: SoftQuad, 11 March 1997.
Prescod, Paul. E-mail interview. 18 April 1997.
World Wide Web Consortium (W3C). "Extensible Markup Language (XML): Part I. Syntax W3C Working Draft 31-Mar-97." W3C: 31 March 1997. http://www.w3.org/pub/WWW/TR/WD-xml-lang.html (April 1).
SoftQuad and Other Web Technology Leaders Microsoft, Sun, NCSA and Dow Jones Interactive Publishing Hail XML as the Basis of a New Generation of Web-based Knowledge Publishing Applications.
Tuesday, March 11, 1997 -- San Diego, California. Today, at the Graphic Communications Association's XML conference, SoftQuad, a leading provider of content publishing tools for corporate intranets and the Internet, announced its support of a broad industry initiative to entrench eXtensible Markup Language (XML) in Web-based Knowledge Publishing tools.
XML, developed by online publishing experts from industry and academia under the auspices of a World Wide Web Consortium (W3C) working group, promises to become the next significant enabling technology for the Web. XML will provide Web publishers, information consumers, and knowledge workers with unprecedented power, flexibility and control over the creation of and access to internet and intranet content. The XML working group includes members from SoftQuad, Adobe, IBM, HP, Microsoft, Lockheed Martin, NCSA, Novell, Sun, Boston University, Oxford University, and the Universities of Illinois and Waterloo.
eXtensible Markup Language Will Allow For New Features Inside Smarter Client Applications
XML provides a standard way of adding custom markup to information-rich documents, so that complex documents can be rendered and published in any way. Today, HTML provides a very limited way of representing and publishing structured information. While HTML is easy to use, its inherent simplicity places serious constraints on the degree to which publishers and users can utilize business-critical documents and databases. XML provides the means to publish and receive any information, regardless of format or origin, in any way that they wish.
"With XML and web-based Knowledge Publishing tools, Web users will have even greater power to create, manage and access dynamic, personalized and customized content on the Web and on corporate intranets," said Murray Maloney, SoftQuad Technical Director and a member of the W3C working group responsible for the XML specification. "And with XML, end users will be able to dynamically manipulate documents in the browser, creating a whole new paradigm for presenting highly targeted and dynamic web-based information."
XML makes this possible by allowing information providers to encode data using a markup language that is designed to meet their business requirements. In effect, businesses will be able to distribute structured databases that consumers can manipulate at will. Dynamic processing on a Web browser will dramatically reduce the network delays that the current breed of server solutions have caused.
"XML is poised to solve the problem of integrating structured information, such as that in SQL databases, into the fabric of Web pages," said John Ludwig, VP of Internet Client & Collaboration Division, Microsoft. "Knowledge consumers will now have immediate, personalized access to information that customizes itself on the fly to suit their specific needs."
For example, a single mouse click might transform overall sales figures from a database into an annual report, or into a pie chart showing regional breakdown or a bar graph showing a breakdown of products sold. XML, together with dynamic HTML, CSS, Java and other Web technologies, make almost anything possible.
Industry Leaders SoftQuad, Microsoft, NCSA, Sun, and Dow Jones Interactive Publishing Share Vision of Dynamic Content for the Web.
"We're very excited about this new Web initiative," said David Gurney, Chief Executive Officer, SoftQuad International. "By implementing XML technology in SoftQuad's Knowledge Publishing tools, we broaden our ability to offer customers strong, integrated publishing solutions that can tap the corporate knowledge base in ways that no other solutions can, making corporate intranets richer, more active sources of information."
"As a leader in the effort to bring the advantages of structured data to Web publishing, Sun applauds SoftQuad's commitment to XML-based product development," said Jon Bosak, SunSoft's Online Information Technology Architect and Chairman of the W3C working group responsible for the XML specification. "Standardized extensible markup makes possible a new generation of Web applications that use richly structured data to exchange information and drive Java programs. SoftQuad's initiative represents the first wave of what we expect to become a large and important body of high-level applications based on the XML model of data representation."
"Microsoft is excited by SoftQuad's initiative in promoting the development of eXtensible Markup Language to enable dynamic content on the Web," said John Ludwig, VP of Internet Client & Collaboration Division, Microsoft. "XML is a powerful means to provide data awareness, giving users the ability to manipulate and input data efficiently, with minimal load on the server. The result will be faster, richer and more interactive information on the Web."
"As developers of the NCSA Mosaic(tm) web browser, and long-time supporters of research on the World Wide Web, NCSA is very excited that SoftQuad has committed to providing support for the eXtensible Markup Language," said Larry Smarr, Director. "XML is a simple, truly extensible method for creating arbitrary structured markup, yielding an enormous potential for businesses and institutions to supply more sophisticated and usable structured content on the World Wide Web."
"At The Wall Street Journal Interactive Edition, we've had to deal with a host of proprietary extensions to HTML in order to make use of new features in Web browsing and casting software," said Alan Karben, Manager, Multimedia, The Wall Street Journal Interactive Edition. "With the advent of XML I hope will come an acknowledgment in the industry that one doesn't have to shoehorn new tags into a quasi-HTML framework to create useful and novel Web client applications. What XML stands for, eXtensible Markup Language, can actually mean more for Internet users than new ways to make information look pretty. Maintaining the intelligence of documents all the way to the point of delivery allows for new features inside smarter client applications. And that's exciting news for Web publishers and surfers alike."
SoftQuad is the leading provider of multi-platform, standards-based Knowledge Publishing applications that enhance business processes. SoftQuad is recognized worldwide for its pioneering work in structured document publishing and, through its newly-acquired Alpha Software unit, structured databases. SoftQuad is a founding member and active participant in the World Wide Web Consortium, the Internet Engineering Task Force and Editorial Review Boards. Based in Toronto, Canada, SoftQuad International employs more than 200 people with additional sales offices across North America, and European operations based in London, with offices in Paris and Munich.
Electronic access to SoftQuad's press materials can be accessed at http://www.softquad.com/press/ and through Newsdesk International at http://www.newsdesk.com or 1-800-636-6092.
Forward-looking statements in this release are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. Investors are cautioned that all forward-looking statements involve risk and uncertainties, including without limitation risks of intellectual property rights and litigation, risks in technology development and commercialization, risks in product development and market acceptance of and demand for the Company's products, risks of downturns in economic conditions generally, and in the software application development tools and business intelligence tools markets specifically, risks associated with competition and competitive pricing pressures, risks associated with foreign sales and high customer concentration and other risks detailed in the Company's filings with the Securities and Exchange Commission.
Murray Maloney Technical Director SoftQuad Inc.
By Liora Alschuler
Saying that so far Netscape sees no technical barriers to implementation of a draft standard for the eXtensible Markup Language (XML), the browser giant's program manager confirmed that it is now examining the draft standard with interest.
This is a complete reversal of the company's position two weeks ago, which was essentially "Not now, not ever." ["XML Could Sidestep HTML Split," March 24, page 28.] The XML standard is being advanced by the World Wide Web Consortium (W3C).
Carl Cargill, Netscape's standards program manager, said the decision to implement the new data type and to what degree it might be implemented will be based on the company's view of XML's value to its installed base of customers. No time frame was given for any degree of implementation.
With a tip of the hat to the W3C Editorial Review Board, which he characterized as having fought a "quiet little war," Cargill said a key to the changeover is the realization that XML can live alongside HTML and is not intended as a replacement.
A new draft XML specification, which now includes XML-Link, will be presented at this week's WWW6 Conference by Jon Bosak of Sun Microsystems, chair of the Editorial Review Board.
The board expects to complete the specification, including a standard for application of styles, within the next six months. Microsoft, which is active on the review board, has gone further than Netscape in its public support of the initiative, although neither browser vendor has indicated precisely how or when it will field XML products. A first crop of XML products will be shown at WWW6, including parsers, editors, browsers, style engines, converters, processors, and servers from established Web companies, SGML vendors, research facilities, and individuals.
An Internet publishing specialist at a large high-tech firm, on hearing that Netscape is taking a serious look at the new standard, said, "We are heaving a sigh of relief. It's obvious that XML is going to be used for the large-scale integration of data into the Web.
We were looking at a situation where Microsoft was going to walk away with it. As Netscape partners, that would have put us in a very uncomfortable position." Microsoft is basing two emerging data types--its Channel Definition Format for push services and the Open Financial Exchange for banking and brokerage services--on XML.