Using SGML on the Web

Hans C. Arents, Hypermedia project coordinator
K.U.Leuven, dept. MTM

W. de Croylaan 2, B-3001 Leuven, Belgium

Hans.Arents@mtm.kuleuven.ac.be

Abstract

The amazing success of the World-Wide Web (the Web for short) as a hypermedia electronic document delivery system on top of the Internet has had a profound effect on the visibility of SGML (Standard Generalized Markup Language). Based on the use of HTML (HyperText Markup Language), the Web has become the world's largest and most successful SGML application. However, opinion remains strongly divided on whether we have to start using full-blown SGML to put electronic documents on the Web, or whether we can stick to using simple HTML. In this article I will argue that the conflict between SGML and HTML is unnecessary, since both have an important role to fulfil on the Web. At present, the most appropriate use of SGML on the Web appears to be as a "back-end" content markup language, while HTML appears to be best suited as the "front-end" presentation markup language. Only for those applications that need special functionalities not yet supported by HTML (such as intelligent search or user-specific presentation) does it make sense to use full-blown SGML on the Web.

Introduction

In the past two years, the HyperText Markup Language (HTML) has done more to popularize the notion of Standard Generalized Markup Language than any single preceding use of SGML. As a result, SGML has now turned from a little-known document interchange standard into a notion familiar to almost anyone developing electronic document solutions. Used on the Web through a graphical client such as Netscape Navigator or NCSA Mosaic, HTML documents and their associated image, sound, and digital video files allow companies and organisations to deliver sophisticated Internet publications and services. The success of the Web has resulted in a growing awareness of the possibilities of electronic document engineering, and has led to an impressive increase in demand for (SGML-based) electronic document software tools and solutions. However, this sudden interest in the Web and HTML has also been something of a mixed blessing for the SGML world: increasingly, people are wondering whether one still needs SGML when one can simply use HTML and all of its associated tools. In this article I will argue that one does indeed need SGML if one wants to overcome the present limitations of HTML. The best way to do this appears to be using SGML as a "back-end" content markup language, and HTML as the corresponding "front-end" presentation markup language for document delivery over the Internet.

The limitations of HTML

As the popularity of the Web increases, a growing number of companies and organizations are distributing their electronic documents over the Internet (or internally over TCP/IP-based local area networks, so-called "intranets") using HTML. Now that HTML version 3.0 (an enhanced version of HTML, with support for tables, maths, and style sheets) is being finalized, it appears that for the foreseeable future HTML will remain the dominant markup language on the Web. However, HTML does have a number of critical shortcomings, which cannot be rectified simply by adding a few new tags and a standardized style sheet mechanism.

Lack of support for document structure

HTML markup, as it is being used today and as it is defined in the HTML DTD, is flat. There is no concept of hierarchy or other document structure, and HTML tags are basically being applied in a linear manner. E.g. the choice of heading level in an HTML document (which implicitly refers to a certain nesting depth by way of a number: <H1>, <H2>, and so on) is guided primarily by layout concerns. In reality, complex documents are hierarchical documents, with a document structure which can be used to better support navigation (e.g. through collapsible document outlines, Fig. 1) and search (e.g. a search limited to particular document elements). Until HTML truly supports document hierarchy and Web browsers let the reader explore and use this hierarchy, HTML remains unsuitable for technical or lengthy documents, whereas SGML is a prime choice for this task.

Figure 1. Supporting better navigation through collapsible document outlines
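
The difference between flat heading tags and a real document hierarchy can be made concrete with a small sketch. The Python fragment below is only an illustration, not part of any HTML or SGML tool: it assumes that the heading levels and titles have already been extracted from a document (the sample headings are invented), and it rebuilds the nested outline that a collapsible navigator would need.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the document outline."""
    level: int
    title: str
    children: list = field(default_factory=list)


def build_outline(headings):
    """Turn a flat list of (level, title) pairs into a nested outline."""
    root = Section(level=0, title="(document)")
    stack = [root]  # path from the root to the most recently opened section
    for level, title in headings:
        # Close every open section at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(level, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(section, depth=0):
    """Return the outline as indented lines, one line per section."""
    lines = []
    for child in section.children:
        lines.append("  " * depth + child.title)
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical heading sequence, as it might be read from <H1>..<H3> tags.
flat_headings = [
    (1, "Corrosion"),
    (2, "Causes"),
    (3, "Galvanic corrosion"),
    (2, "Prevention"),
    (1, "Materials"),
]

outline = build_outline(flat_headings)
print("\n".join(render(outline)))
```

The sketch only works when authors choose heading levels for structural reasons. If levels are picked for their visual appearance, the reconstructed hierarchy is meaningless, and nothing in the flat markup reveals that.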

Lack of emphasis on content markup

The HTML of today resembles early SGML DTDs, which tried to capture presentation more than content. Years of experience with real-life documents and their possible applications have taught the SGML community to focus on better information modelling, by using content tagging rather than presentation tagging. Content tagging or descriptive markup is essential to capturing the meaning of the different components of a document. Having access to this "meaning" of the documents allows one to re-format and re-purpose the electronic documents for totally new uses. From the very beginning, HTML has had difficulties endorsing this notion of descriptive markup, and has combined presentation tags with content tags. What we see happening now is that slowly but relentlessly HTML is turning into a document presentation language, as more and more Web sites are focusing on looks rather than on contents. This has been accelerated by the fact that popular Web browser companies such as Netscape and Microsoft seem to spend more effort on introducing new presentation tags than on supporting new meaningful content tags. As a result, HTML will probably become less and less suitable for content-oriented markup, while descriptive markup remains a key concept of SGML.

The lack of document validation

Most, if not all, Web browsers do not perform any kind of validation of an HTML document against the HTML DTD. This lack of validation is increasingly leading to chaos on the Web because people assume that if a document works under a particular browser, it is a well-formed document and will work under any browser, when in fact it may not. Most browsers take HTML documents as they get them, process those elements which they recognize and ignore the elements which they do not recognize. This means that browsers can unilaterally provide support for elements that may not be in the official DTD (Netscape's Navigator and Microsoft's Internet Explorer being only two examples). This undoes the very benefits of data interchange standardization by allowing Web browser creators, rather than Web document owners, to control the data standard. Having HTML conform to the official SGML standard means that Web information providers and users can use existing SGML software tools, in addition to those designed specifically for HTML. It also enables HTML documents to be readily integrated into existing SGML-based electronic document management systems.
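
A very rough impression of what even a minimal check could catch is given by the Python sketch below. It is not DTD validation: it merely compares element names against a small, illustrative list (the real HTML DTD defines many more elements and also constrains how they may nest), using the tolerant HTML parser from the Python standard library. A true validation would require an SGML parser and the DTD itself.

```python
from html.parser import HTMLParser

# Illustrative subset of element names only; an assumption for this sketch,
# not the content of the official HTML DTD.
KNOWN_ELEMENTS = {"html", "head", "title", "body", "h1", "h2", "p", "a", "em"}


class ElementChecker(HTMLParser):
    """Collect start tags whose names are not in KNOWN_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.unknown = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag not in KNOWN_ELEMENTS:
            self.unknown.append(tag)


def unknown_elements(markup):
    """Return the names of all unrecognized elements, in document order."""
    checker = ElementChecker()
    checker.feed(markup)
    checker.close()
    return checker.unknown


page = "<html><body><h1>News</h1><p><blink>Hot!</blink> <em>Read on</em></p></body></html>"
print(unknown_elements(page))
```

A browser would silently skip or interpret the unrecognized element. A check of this kind, run before publication, at least reports it to the document owner.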

The lack of robust linking functionality

HTML has a very weak linking mechanism: it lets you link to another document somewhere on the Internet, using a simple addressing mechanism (URLs, or Uniform Resource Locators). If you are lucky, the document you are referring to still exists, and has not moved (or disappeared altogether) the next time somebody wants to follow your link. There is no mechanism built into the Web's HyperText Transfer Protocol (HTTP) that will notify you that some links in your documents are pointing nowhere, or that somebody else's links are pointing to your own documents, so that you should not change their location. As a result, the Web's linking functionality is very brittle and limited in scope. SGML on the other hand is complemented by the ISO standard HyTime (ISO 10744), which is itself an application of SGML. HyTime lets one express complex linking relationships and address any kind of document. Using techniques such as location ladders, which specify links as a sequence of steps, HyTime links will not break easily, even if the target SGML document is edited.

The lack of system independence

One of the main goals of SGML has always been to abstract away the file system and other system dependencies. The concept of entities is a key point in making documents interchangeable across computer platforms. SGML has a powerful apparatus for entity classification, notation handling, public identifiers, and description of system-dependent entities, that is still sorely lacking in HTML. On the contrary, HTML even builds system dependencies into each and every link. For example, whether a URL (Uniform Resource Locator) referring to a document on the Web is case-sensitive depends on whether that document resides on a UNIX platform or on a non-UNIX platform.

Arguments in favour of HTML

In the face of these serious shortcomings, some have suggested doing away with HTML altogether as the underlying markup language of the Web, or using it only for very simple documents. In their view, Web documents should be based on general SGML - that is, using an unlimited variety of DTDs at each information provider's discretion. Rather than try to standardize on a single version of HTML which tries to address all possible uses of on-line electronic documents, they propose to give information providers the full power of SGML and let them design their own uses. However, one can argue very convincingly that HTML should remain the primary data format that forms the backbone of the Web, for three important reasons:

HTML is the de facto Web markup language: millions of pages are already available in HTML format, and trying to introduce a new standard would be confusing, counter-productive, and certain to fail.
HTML is adequate for real-life use: most documents on the Web are relatively simple in contents, so forcing people to use full-blown SGML (i.e. requiring them to use application-specific DTDs) would be overkill.
HTML provides a common set of semantics: the use of HTML assures that similar functionalities (links, bookmarks, ...) are present and supported in most Web browsers. Without these common semantics, it would not be possible to easily navigate, search, format, or reuse Web documents in a standardized way.

A better approach therefore is to consider the Web as an interface to SGML-based electronic documents, delivered over the Internet, and to use SGML as a "back-end" content markup language and HTML as the corresponding "front-end" presentation markup language. This corresponds best to the original ideas behind the design of HTML: it was not designed as a universal DTD for all possible types of electronic documents, but as a simple and effective language for making linkable texts available on a variety of computer platforms.

Using SGML to overcome the shortcomings of HTML

The overall quality of successful sites on the Web can often be traced directly back to the use of SGML as the "back-end" content markup language. And since SGML is a general solution that can be applied to other document problems, using SGML for Web projects allows one to gain valuable experience that can be effectively applied to other electronic document management projects. The use of SGML on the Web is focused in four areas: document down-conversion, document parsing and validation, hyperlink generation and maintenance, and document maintenance, each capable of addressing some of the shortcomings of HTML which I outlined above.

Document down-conversion

When SGML is used as the primary document format, a process of "down-conversion" or "down-translation" is required (Fig. 2). This means that the document to be published is held in a very richly marked-up form (using a company-specific DTD), rich enough to support conversion to HTML as well as to proprietary formats (such as for CD-ROM publishing) or as yet unknown formats (such as the Blackbird format for the Microsoft Network). Conversion filters can produce electronic documents for all these various formats, starting from the same SGML source format. This allows a company to protect its initial investments in the rapidly evolving world of on-line electronic publishing, where the dominant document format may sometimes change overnight.

Figure 2. Using SGML as the "back-end" language for different applications.

Note that two important issues need to be addressed when down-converting SGML documents to HTML documents: partitioning and transformation. Partitioning is the process of optimizing electronic document size for on-line delivery. Many Web browsers currently available work best for documents, or document fragments, which are only a few "pages" in length. However, many real-life SGML documents are much longer than this. One possible solution is to break SGML documents into "components" that are managed and retrieved as units. This allows users to retrieve only as large or small a component as they require. Transformation is the process of mapping arbitrary SGML elements to the appropriate HTML elements. The complexity of this process is directly related to the design of the original application's DTD. For example, if the DTD is designed to simply emulate the markup inserted by a word processing system, then the transformation to HTML can be very simple. However, if a more structural content-based markup is used, then the transformation may be very difficult, since the elements identify the semantics of the data rather than the presentation. It is important not to underestimate the complexities involved in generating HTML from SGML. Just as using page layout software does not guarantee good typography, using an SGML transformer like Omnimark or Balise to create HTML does not guarantee good hypertext. Good hypertext has to be carefully designed, and remains very hard to generate automatically.
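
The two steps can be sketched in a few lines of Python. Everything in this fragment is an assumption made for illustration: the element names (manual, chapter, para, material, warning), the mapping table and the choice of one page per chapter are invented, and the source is written with every end tag present so that the XML parser of the Python standard library can read it. Real SGML with tag minimization would need a true SGML parser, and a production conversion would be done with a dedicated SGML transformer.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical content-oriented source document.
SOURCE = """
<manual>
  <chapter id="intro">
    <title>Introduction</title>
    <para>Steel corrodes in <material>sea water</material>.</para>
  </chapter>
  <chapter id="care">
    <title>Maintenance</title>
    <warning>Wear gloves.</warning>
  </chapter>
</manual>
"""

# Transformation table (assumed): content element -> HTML presentation element.
ELEMENT_MAP = {"title": "h1", "para": "p", "material": "em", "warning": "strong"}


def to_html(element):
    """Recursively map one source element and its content to HTML text."""
    inner = escape(element.text or "")
    for child in element:
        inner += to_html(child) + escape(child.tail or "")
    tag = ELEMENT_MAP.get(element.tag)
    if tag is None:
        return inner  # unmapped elements contribute only their content
    return f"<{tag}>{inner.strip()}</{tag}>"


def partition(manual):
    """Return one small HTML page per chapter, keyed by file name."""
    pages = {}
    for chapter in manual.findall("chapter"):
        body = "".join(to_html(child) for child in chapter)
        pages[chapter.get("id") + ".html"] = "<html><body>" + body + "</body></html>"
    return pages


pages = partition(ET.fromstring(SOURCE))
for name, page in pages.items():
    print(name, page)
```

Note what is lost on the way down: both a material name and any other emphasized phrase would end up as the same HTML element, so the HTML pages can no longer be searched for materials. This is why the richly marked-up source, not the generated HTML, has to remain the master copy.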

Document parsing and validation

When SGML is used as the source format for the HTML documents, SGML-based editors can be used during the authoring process to validate the structure and the tagging of the documents. Only a few HTML editors (e.g. HoTMetaL PRO from SoftQuad or Spider from InContext) support this at present. This validation ensures markup consistency and makes further processing of the electronic documents much easier: during down-conversion, for example, markup errors can be detected and automatically corrected. Using a translating parser also opens up other possibilities. For example, document management tags (author information, creation date, etc.) can be added automatically and then dropped before the HTML documents are installed on the Web server.

Hyperlink generation and maintenance

One of the most important benefits of using SGML when generating HTML documents is in assuring a higher degree of link reliability. It is clear that cross references within a document and to other documents must be given very close attention. They should be made as robust (and as plentiful) as possible, and they should be implemented in line with accepted Web practice (after all, it is hyperlinking that makes the Web what it is). SGML conversion programs can turn logical document names into full URL hyperlinks, and SGML filter programs can automatically generate hyperlinked tables of contents and other navigational constructs, or split large documents into manageably sized chunks. A primary advantage of this approach to link generation and maintenance is that the value of a URL is changed in only one place, i.e. the original SGML source. If a link needs to be changed, the documents that reference it can be reparsed, and this whole process can be automated so that the network of links between all the HTML documents remains up to date. Hyperlink maintenance might not seem a critical issue when there are only a few documents on a Web server, but it already becomes a big problem with just a few dozen documents. Manual hyperlink maintenance is both very time-consuming and virtually impossible to do without errors. The result is inevitably user frustration caused by dangling hyperlinks.
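
As a minimal sketch of this idea, the Python fragment below resolves logical document names into URLs at conversion time. The xref notation, the link table and the example.org addresses are all hypothetical. A real conversion program would read the names from the parsed SGML source rather than match them with a regular expression.

```python
import re

# Hypothetical link table: the only place where a document's URL is recorded.
LINK_TABLE = {
    "corrosion-guide": "http://www.example.org/docs/corrosion.html",
    "alloy-index": "http://www.example.org/docs/alloys.html",
}

# Hypothetical source notation for a cross reference:
#   <xref doc="logical-name">link text</xref>
XREF = re.compile(r'<xref doc="([^"]+)">(.*?)</xref>')


def resolve_links(source, table):
    """Replace each xref by an HTML anchor; report names missing from the table."""
    dangling = []

    def replace(match):
        name, text = match.group(1), match.group(2)
        url = table.get(name)
        if url is None:
            dangling.append(name)
            return text  # keep the text, drop the broken link
        return f'<a href="{url}">{text}</a>'

    return XREF.sub(replace, source), dangling


source = (
    'See the <xref doc="corrosion-guide">guide</xref> '
    'and the <xref doc="old-report">report</xref>.'
)
converted, dangling = resolve_links(source, LINK_TABLE)
print(converted)
print(dangling)
```

When a document moves, only its entry in the table changes. Regenerating the HTML then updates every reference, and the list of unresolved names shows dangling links before any reader runs into them.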

Document maintenance

The biggest problem with most new Web servers is that, because they are so easy to set up and to feed HTML documents into, many are started as part-time ad hoc projects. Initial system installation and the creation of HTML documents are so deceptively simple that serious planning is often overlooked. However, the number of documents on the server grows so fast that by the time one realizes it needs to be managed, there is already too much information stored to go back and do it right. Using SGML from the very beginning forces you to think beforehand about scalability, document storage and document quality, which will save you a lot of time and effort later on. An HTML-only approach to managing documents also locks users into a specific version of HTML. Migration to future HTML versions is questionable because of the lack of markup consistency in documents created in this kind of ad hoc environment. In short, the HTML-only approach is a short-term and short-sighted solution that limits the long-term value of the electronic documents you worked so hard to create and make available on-line. Following the SGML approach adds value to your document data, and prepares you for future uses of your HTML documents that you cannot yet foresee.

Using SGML directly on the Web

Clearly, a well-managed use of SGML as the "back-end" language for Web publications and services allows us to overcome some of the limitations of HTML. However, sometimes it is desirable to have the richest possible mark-up available, for example when one wants to process the electronic document further with SGML tools after delivery over the Internet. The most general way of doing this involves bypassing the normal Web browser and using an external viewer which allows users to view SGML documents directly. An example of such an external SGML browser is Panorama PRO from SoftQuad, which is bundled with Enhanced NCSA Mosaic and Spyglass Mosaic, but which can also be used as a helper application of Netscape Navigator. Panorama PRO provides broader presentation capabilities, more powerful context-sensitive searching (e.g. restricting the area of search by SGML element, see Fig. 3) and richer hypertext linking than an ordinary Web browser.

Figure 3. Searching for a word in a level 1 header of an SGML-based Web document.
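
The principle behind a search restricted by element can be illustrated with the Python sketch below. It says nothing about how an SGML browser actually implements searching. The sample document and the search function are invented for the illustration, and the markup is again written so that the XML parser of the Python standard library can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two level 1 headers.
DOCUMENT = """
<doc>
  <h1>Corrosion of steel</h1>
  <p>Steel is discussed below.</p>
  <h1>Aluminium alloys</h1>
  <p>Unlike steel, aluminium forms a protective oxide layer.</p>
</doc>
"""


def search(root, word, element=None):
    """Return (tag, text) for elements containing word.

    If element is given, only elements of that type are searched;
    otherwise every element below the root is searched.
    """
    hits = []
    for node in root.iter(element):  # element=None visits every element
        if node is root:
            continue
        text = "".join(node.itertext()).strip()
        if word.lower() in text.lower():
            hits.append((node.tag, text))
    return hits


root = ET.fromstring(DOCUMENT)
everywhere = search(root, "steel")
in_headers = search(root, "steel", element="h1")
print(len(everywhere), in_headers)
```

The unrestricted search finds the word in three elements, while the search limited to level 1 headers returns only the section that is actually about steel. A plain full-text search over the same document cannot make this distinction.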

Panorama PRO supports stylesheets (which give you control over display attributes such as font, size, weight, color, spacing and automatic numbering), navigators (different interactive displays of a document's table of contents), personal link layers (annotations and bookmarks saved as HyTime-conforming SGML), tables and maths, etc. Another example of this trend towards full SGML support on the Web is EBT's newly released DynaText version 3.0, which includes support for HTML 2.0, as well as an abstract document interface capable of supporting CD-ROM, LAN/WAN and Web platforms. Both SoftQuad and EBT are established corporate electronic publishing solution providers who did not have a Web browser and publishing solution when the Web took off as a medium last year. In many cases, users had already started using other browsers and Web publishing tools. By adding Web browser functionality to their SGML tools, these companies hope to convince corporate users of the superiority of SGML over HTML as the basis of an electronic document delivery tool. In the future, some SGML software developers might even try to integrate SGML support into the Web browser itself, rather than delegating it to an external viewer. This would give the user more functionality and better performance through a single unified interface. But a Web browser is unlikely to provide SGML support on a par with a stand-alone SGML system, just as it is unlikely to provide graphics facilities equal to those of specialized graphics programs.

Conclusions

Many companies and organizations are using SGML because it represents a format-neutral method for marking up their data. This is especially important in cases where the data may have an extremely long life and be subjected to a range of possible uses. One of the real strengths of SGML is that it can be transformed into the format needed at the time that it is used. HTML on the other hand represents a publicly available and widely accepted method for the delivery of documents over the Internet. It therefore makes sense to maintain an HTML-centric Web, where HTML and SGML coexist and are used purposefully and intelligently together, primarily by using SGML as a "back-end" content markup language and HTML as the corresponding "front-end" presentation markup language. In the short term HTML (version 2.0 or 3.0) will remain the primary data format for the Web, with specialized SGML applications using this HTML as their Web presentation format. In the long term, HTML itself will become more complex (incorporating functionalities like tables, frames and stylesheets), and authoring and producing HTML-based documents will require more and more support from full-fledged SGML tools.

About Hans C. Arents

Hans C. Arents is the hypermedia project coordinator of the Materials Information Processing Systems (MIPS) group of the department MTM (K.U.Leuven), where he has been conducting advanced research in the field of hypermedia systems and electronic document engineering since 1988. His research interests include the development of off-line and on-line hypermedia systems for materials engineering information, using artificial intelligence and object-oriented programming techniques. He is now working on the design and implementation of MatWeb, an on-line hypermedia information service based on World-Wide Web technology, which should become a virtual center of competence and documentation for corrosion and material experts.