[This local archive copy mirrored from the canonical site: http://www.zilker.net/business/pci/sun/dec/documents.html (December 1997); links may not have complete integrity, so use the canonical document at this URL if possible.]

Solving the problem of publishing online documents

By Stuart Culshaw

This article attempts to clarify the relationships among SGML, HTML and XML, and focuses on the advantages of XML as the future of online document publishing. If you already publish online documents - or are planning to but are having problems putting all the pieces together - this article is for you.

SGML (Standard Generalized Markup Language) is based on the idea that every document has a structure, however simple or complex, and that this structure is independent of the content of the document. SGML is not just another file format (such as Word, WordPerfect, or PDF), nor is it a page description language (such as troff, runoff, or TeX). It can best be described as a "meta-language," a language for describing other languages. SGML defines a standardized method and syntax for describing the structure of documents.

The rules that describe the structural elements for a particular class of documents are defined in a document type definition (DTD). SGML authoring tools use the DTD to check that the consistency rules imposed by the document designer are strictly observed. SGML-based authoring systems thus ensure that all documents comply with a predefined model for the type of document in question, whether it be a memo, a sales contract, a maintenance manual or an illustrated parts catalog.
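To make the idea of a document model concrete, here is a minimal sketch in Python. It is not a DTD parser: the dictionary MEMO_MODEL is a hypothetical, much simplified stand-in for a DTD, and the "memo" element names are invented for illustration. It only shows the kind of check an authoring tool might perform against a content model.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model for a "memo" document type. In a real DTD this
# would be written roughly as:
#   <!ELEMENT memo (to, from, subject, body)>
#   <!ELEMENT to (#PCDATA)>   ... and so on for the other elements
# Here each element name maps to the exact sequence of children it requires.
MEMO_MODEL = {
    "memo": ["to", "from", "subject", "body"],
    "to": [],
    "from": [],
    "subject": [],
    "body": [],
}


def conforms(element, model):
    """Return True if element and all its descendants match the model."""
    if element.tag not in model:
        return False
    children = [child.tag for child in element]
    if children != model[element.tag]:
        return False
    return all(conforms(child, model) for child in element)


good = ET.fromstring(
    "<memo><to>Staff</to><from>Editor</from>"
    "<subject>Style guide</subject><body>Please review.</body></memo>"
)
bad = ET.fromstring("<memo><subject>No recipient</subject></memo>")

print(conforms(good, MEMO_MODEL))  # the complete memo matches the model
print(conforms(bad, MEMO_MODEL))   # the memo missing required parts does not
```

A real DTD can also express optional and repeated elements, alternatives, and attributes; this sketch covers only fixed sequences.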

Furthermore, information on a network that connects many different types of computers has to be usable on all of them. Corporations cannot afford to be restricted to one make, model, or manufacturer, or to cede control of their data format to private hands. SGML-based document systems meet all these requirements and can greatly improve the flow of information among the various authors, reviewers, and users of these documents through their ability to create, edit, exchange and manage all types of documents using the same set of tools, formats and protocols.

So if SGML is so great, why aren't we all using it? Well, actually, more people are using it than you might think. In fact, if you've ever created a document for the Web, you've probably used SGML yourself without even realizing it because HTML (the format used for Web documents) is an application of SGML. The SGML standard has been around for more than 10 years, but during most of this time SGML-based systems have remained the preserve of large corporations. They can afford the high initial investment required to implement an SGML-based documentation process or have been forced to make the investment to comply with government or industry policy.

However, as SGML authoring and document management tools have improved, and their advantages have become clear, SGML has become the format of choice for large, mission-critical applications in many sectors of industry and commerce. Automobile manufacturers are adopting SGML for maintenance documentation, insurance companies for policy proposals, and computer companies for software documentation.


HTML for documents

HTML (HyperText Markup Language) is just one of the many thousands of different document types that have been created using SGML. It was designed specifically to enable documents to be published on the World Wide Web. It defines a fixed set of document elements with markup that lets you describe simple (i.e., not highly structured) documents containing headings, paragraphs, lists, illustrations, and so on. One of the principal advantages of HTML as a document type for the Web is its built-in support for hypertext and multimedia, enabling the construction of easy and intuitive user interfaces for accessing published information. This user interface is implemented in browsers, which are now the universal application on desktops.

The second major contribution of HTML, in light of the rapidly increasing adoption of intranets for producing and managing corporate information and documentation, has been to educate a very large audience as to the main advantages of distributed information systems. These advantages include the ability to exchange information between different computer systems and applications through the use of standard formats and protocols, and the power of hypertext to organize a set of documents to be searched, accessed and consulted interactively.

However, HTML seems to have reached the limit of its usefulness as a way of describing information. Because HTML has a limited set of tags and very loose document structure rules, it offers only one way of describing documents. HTML is based on the idea that by making the document structure rules as generic as possible, it should be possible to describe any kind of document using a single set of markup elements. The inherent simplicity of HTML, while it has obviously been important to the success of this format, is a severe limitation to the representation of professional documents.

Furthermore, the rapid (and often hectic) development of HTML, from the initial version 1.0 to the current version 3.2 (HTML 4.0 is on its way), has led to its becoming overburdened with dozens of interesting but often incompatible inventions from different manufacturers. As an increasing number of companies deploy intranets within their organizations and look to the Web for their mass publishing requirements, it has become evident that many new applications require a more robust and flexible infrastructure than HTML can offer.

Conversely, the inherent complexity of SGML limits its adoption in full-scale applications by a large number of non-expert users on the Internet or on corporate intranets. What is required is a way of combining the richness of SGML with the simplicity of HTML for publishing and accessing documents online. Enter XML.


XML: The missing piece

XML (Extensible Markup Language) aims to bridge the gap between SGML and HTML. Essentially a simplified and modernized remake of SGML that removes many of SGML's more complex and less-used features, XML is based on a flexible model that will make writing programs to handle XML much easier than writing programs for full SGML. XML will also make it easier for authors to produce documents for many different output media (such as paper, online help, or the World Wide Web) from a single source.

The term "extensible" refers to the fact that XML is not a fixed format like HTML. Instead, like SGML, XML is a "meta-language" that can be used to define an unlimited number of customized markup languages. The XML standard provides all the advantages of the powerful content-based markup and scalability of SGML while enabling business-critical documents to be published as easily as is currently possible with HTML. (Readers with programming experience may find it useful to think of XML as SGML-- rather than HTML++.)

XML is particularly suitable for Web applications because of its ability to handle documents without the need to know the structure model that was used to create them. With SGML, a document must be distributed in its complete form, and it must fully comply with the DTD that was used to create it. It is impossible to read or process an SGML document without knowing the DTD. This is not a constraint in industrial projects where SGML documents are created, edited and managed through well-defined procedures on secure intranets. But this characteristic of SGML prevents it from being freely used to publish documents on the Internet at large, where users on the client side have no way of knowing which DTD was used to create the document.

By enabling documents to be published without the precise description of the markup used to create them, XML removes the main obstacle to the direct exchange, editing, publishing and accessing of SGML documents on the Web and on corporate intranets. The only requirement is that documents be "well formed"; that is, document elements must be properly nested, with no overlapping tags, so that they form a tree structure that the browser can parse correctly (so that it can apply a style sheet, enable linking, etc.).
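The well-formedness requirement can be illustrated with a short Python sketch that uses the standard library's xml.etree.ElementTree parser, which reads a document without consulting any DTD. The element names (manual, chapter, title) and the helper functions is_well_formed and outline are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def outline(element, depth=0):
    """List every element in the tree as (depth, tag name) pairs."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


# Properly nested: each element closes before its parent does.
nested = "<manual><chapter><title>Setup</title></chapter></manual>"
# Overlapping: <chapter> is closed while <title> is still open.
overlapping = "<manual><chapter><title>Setup</chapter></title></manual>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
print(outline(ET.fromstring(nested)))
```

The parser knows nothing about what a manual or a chapter is, yet it can still recover the tree from the nested version. The overlapping version is rejected because no tree can be built from it.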

Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve the desired layout should become a thing of the past: Authors and designers will be free to invent their own markup elements. Unlike some current ways of using HTML, XML provides a standardized framework. For software to make sense of your documents, you can't just make it up as you go along and hope that any old tag will do; you need to follow a pattern or model.

Information will be more accessible and reusable because the more flexible markup of XML can be used by any XML software instead of being partly restricted to specific manufacturers as has become the case with HTML. XML will allow organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, finance, linguistics, history, engineering, etc.).

You might be asking yourself how the Web browser or document processing system can possibly know what everything means if you have defined it all yourself. One answer lies in the fact that XML enables you to specify the meaning or purpose of each document element in the markup. You can identify a part reference with a <PARTREF> tag or an example of program code using a <PROGCODE> tag, and you can combine explicit tag names with explicit attributes: <PROGCODE LANGUAGE = "Java"> or <PARTREF STOCKITEM = "Yes"> to add further meaning.
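As a rough illustration of how software might use such descriptive tags and attributes, the following Python sketch parses a small fragment and selects elements by name and attribute value. The PARTREF and PROGCODE names come from the paragraph above; the enclosing PROCEDURE element and the part identifiers are hypothetical and exist only to make the fragment complete.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: PROCEDURE and the part identifiers are invented.
SAMPLE = (
    "<PROCEDURE>"
    '<PARTREF STOCKITEM="Yes">widget-12</PARTREF>'
    '<PARTREF STOCKITEM="No">bracket-7</PARTREF>'
    '<PROGCODE LANGUAGE="Java">int count = 0;</PROGCODE>'
    "</PROCEDURE>"
)

root = ET.fromstring(SAMPLE)

# Select part references by the meaning carried in the STOCKITEM attribute.
in_stock = [p.text for p in root.iter("PARTREF") if p.get("STOCKITEM") == "Yes"]

# Collect the programming language declared on each code sample.
code_languages = [c.get("LANGUAGE") for c in root.iter("PROGCODE")]

print(in_stock)
print(code_languages)
```

Because the markup says what each piece of content is, a query such as "all part references that are stock items" needs no guesswork about layout or typography.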

Another answer lies in the use of style sheets. One of the characteristics of the original HTML specification was that each markup element had an associated, implied format (H1 elements should be formatted as large headings, LI elements as list items, etc.).

This characteristic made it possible to publish HTML documents directly to the Web without waiting for the industry to agree on a common style sheet standard. It has proved much too restrictive for the needs of more sophisticated HTML documents, however, and it is an obviously impossible approach for XML, where there is no predefined set of document elements.

The Cascading Style Sheets (CSS) standard was developed by the W3C (World Wide Web Consortium) to address the problem of Web document formatting. Originally designed for HTML, CSS is fully compatible with XML and makes it possible to specify the appearance of custom markup elements using any of a wide range of possibilities. XML will also work with other style sheet standards designed for SGML.
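The principle behind a style sheet, that presentation rules are kept outside the document and keyed on element names, can be sketched in a few lines of Python. This is a toy model and not an implementation of CSS: the STYLE_SHEET dictionary, the render function, and the NOTICE, TITLE, WARNING and PARA elements are all hypothetical. In CSS itself, the corresponding rule for a custom element would look something like WARNING { font-weight: bold }.

```python
import xml.etree.ElementTree as ET

# Hypothetical style sheet: element name -> function that formats its text.
STYLE_SHEET = {
    "TITLE": lambda text: text.upper(),
    "WARNING": lambda text: "!! " + text + " !!",
    "PARA": lambda text: text,
}


def render(root, styles):
    """Format each child of root with the rule registered for its tag."""
    lines = []
    for child in root:
        # Elements with no rule fall back to their plain text.
        rule = styles.get(child.tag, lambda text: text)
        lines.append(rule(child.text or ""))
    return "\n".join(lines)


doc = ET.fromstring(
    "<NOTICE><TITLE>Pump maintenance</TITLE>"
    "<WARNING>Disconnect power first</WARNING>"
    "<PARA>Remove the four cover screws.</PARA></NOTICE>"
)
print(render(doc, STYLE_SHEET))
```

Passing a different dictionary to render changes the output without touching the document, which is the separation of content and presentation that style sheets are meant to provide.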

Other advantages of XML include greatly improved hypertext linking capabilities and provision for multilingual document encoding through built-in support for the Unicode standard. Both of these features offer significant advantages for large-scale hypertext document publishing to a worldwide audience.


A look ahead

While HTML will continue to play an important role for the content it currently represents, XML opens up a whole range of new possibilities for online document publishing. The XML standard is still in the draft stage and not due for official publication until this December, but all the major Web tool vendors - including Microsoft and Netscape - have already announced their support for XML in future versions of their software, and more uses for XML are being found every day.

Standards or protocols that are currently under discussion or development include: Microsoft's Channel Definition Format, an XML-based metadata specification for push publishing applications; MathML, an XML application for describing the structure and content of math expressions; and even an XML interface for accessing information in databases. For more information about any of these applications, read some of the online reference documents listed below.

If, like many people, you've never quite been able to get your brain around SGML but have been impressed by the possibilities for information publishing and retrieval offered by the World Wide Web, you should take a serious look at XML. It could be the final piece of the puzzle that you've been looking for.

Where to find out more


Stuart Culshaw is technical communication manager and Webmaster at Grif S.A., an SGML, HTML, and XML software development company. He can be contacted at Stuart.Culshaw@grif.fr, and the company's Web site is at http://www.grif.com. This article is reprinted with permission of Intercom Magazine, a publication of the Society for Technical Communication based in Arlington, VA.
