[Archive copy mirrored from: http://www.passage.com/pubs/white/pracsgml.htm, text only]

Passage Systems Incorporated

How Practical is Practical SGML?

Robert J. Glushko, Passage Systems, Inc.


Journal of Computer Documentation, 20(2), May 1996, 39-43.


Five years ago the first edition of Eric van Herwijnen's Practical SGML was for me and many other people a Rosetta Stone that enabled us to understand SGML syntax. I eagerly looked forward to the second edition because today my business is "practical SGML" - helping organizations use SGML as a means for managing information in non-proprietary ways so that the same content can be delivered in multiple formats or channels, such as CD-ROM, the World Wide Web, print, or Braille. But there is a paradox to my review - the second edition is significantly better than the first in many respects, but feels much less practical than the first edition did. The world has changed more dramatically than the author's perspective on SGML has, and Practical SGML, with its strong focus on SGML syntax, is no longer the best book to start with for someone new to SGML.

An Introduction to SGML

SGML, the Standard Generalized Markup Language, is an international standard (ISO 8879) that defines a syntax for the character representation of structured information, in contrast to the proprietary binary formats of word processors. An SGML document contains the textual content (the words) along with syntactic markers (the "markup" or "tags") that indicate the boundaries and relationships among the information structures (the "elements") that contain the content (chapters, titles, paragraphs, glossary terms, cross references, and so on).
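The distinction between content, markup, and elements can be made concrete with a minimal sketch in Python. Python's standard library has no full SGML parser, so the fragment below is written with every tag explicit, which lets the xml.etree.ElementTree module read it. The element names (chapter, title, para, term) are illustrative assumptions, not taken from any real document type.

```python
import xml.etree.ElementTree as ET

# A small document: the words are the content, the <...> markers are the markup.
# Element names are hypothetical, chosen only for illustration.
DOCUMENT = """\
<chapter>
<title>Installing the Widget</title>
<para>Attach the <term>flange</term> before applying power.</para>
</chapter>"""


def outline(element, depth=0):
    """Return one indented line per element, showing the nesting structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)
structure = outline(root)
# The textual content, recovered by discarding the markup.
content = " ".join("".join(root.itertext()).split())

print("\n".join(structure))
print(content)
```

The outline shows the element structure (a chapter containing a title and a paragraph, which in turn contains a glossary term), while the second print shows that the words themselves survive unchanged once the markup is set aside.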

Because different organizations produce different types of publications, no single set of elements can describe what is important in all of them. Product manuals, catalogs, policies and procedures, directories, and textbooks are examples of publication types; each has distinctive kinds of information. SGML defines the syntax for specifying whatever set of elements and their relationships is needed to meet any requirements for describing information. SGML is thus not a specific markup language, but a Standard and Generalized way to create Markup Languages.

Each markup language defined using SGML is called a document type, whose formal definition in SGML syntax is the Document Type Definition or DTD. One of the best known document types is HyperText Markup Language or HTML, the set of elements used for creating documents for the Internet's World Wide Web. Other well known SGML document types are DocBook, used by many computer hardware and software companies for technical documentation, and Pinnacles, used by semiconductor firms for component datasheets. The formal definition of document types in SGML allows documents to be interpreted and validated according to the formal definition; thus, the specification of a document type not only tells you how to put together a document, it also allows you to verify that you have done so correctly.
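The idea of verifying a document against a formal definition can be sketched in a few lines of Python. This is a toy stand-in, not a DTD parser: the CONTENT_MODEL table and its element names are hypothetical, and it checks only which elements may appear inside which, whereas a real DTD also constrains order, repetition, and attributes.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a DTD: each element maps to the set of child elements it
# may contain. All names are hypothetical.
CONTENT_MODEL = {
    "manual": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": {"term"},
    "term": set(),
}


def validate(element, model=CONTENT_MODEL):
    """Return a list of error messages; an empty list means the tree conforms."""
    if element.tag not in model:
        return [f"undeclared element: {element.tag}"]
    errors = []
    for child in element:
        if child.tag not in model[element.tag]:
            errors.append(f"{child.tag} not allowed inside {element.tag}")
        errors.extend(validate(child, model))
    return errors


good = ET.fromstring(
    "<manual><title>T</title>"
    "<chapter><title>C</title><para>Text</para></chapter></manual>"
)
bad = ET.fromstring("<manual><para>Stray text</para></manual>")

good_errors = validate(good)
bad_errors = validate(bad)
print(good_errors)
print(bad_errors)
```

The first document conforms and produces no errors; the second places a paragraph directly inside the manual, which the toy model does not permit, so the checker reports it.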

SGML is a powerful way of thinking about information management and publishing. SGML information can be read by people, sent to any computer, and used by any application that can use a DTD to identify the elements. As a result, SGML is being widely adopted by companies seeking more flexible use and reuse of the information they create.

A Look Back at Practical SGML, 1st Edition

In 1990 the "SGML business" was far different and much less mature than it is today. Compliance with the Defense Department's CALS initiative, which required the use of a specific SGML document type for information interchange, was a major driver for SGML adoption. HTML and the World Wide Web did not exist, and no computer hardware or software vendor had yet delivered SGML-based online documentation. Pioneers in these early days tried to learn SGML from the SGML Handbook (Goldfarb, 1990).

From the minute it was published, the Handbook was the authoritative book on the SGML standard. It is largely the text of the SGML standard (ISO 8879) annotated by its editor, Charles Goldfarb. Every serious SGML project must have someone who is comfortable with the Handbook. But just as the Oxford English Dictionary is the definitive dictionary of the English language and may not be very useful as a guide for new speakers of English, the Handbook is an intimidating starting point for most people trying to learn SGML.

In this context, when the first edition of Eric van Herwijnen's Practical SGML (1990) appeared, it seemed that I'd found the Rosetta Stone for SGML. The book was primarily a feature-by-feature exposition of SGML syntax, but its presentation was far less formal and abstract than the Handbook's. Practical SGML also briefly addressed authoring, document analysis, developing Document Type Definitions (DTDs), and other topics that lie outside the scope of the ISO 8879 standard but that I needed to understand to make sense of SGML. Thousands of other people must have felt the same way about the first edition. Within a few years there were 10,000 copies in print, the book had been translated into several languages, and it had become the most recommended "starter text" for SGML.

In 1992 I co-founded Passage Systems, a consulting, software, and data conversion services company that helps companies make the "passage" from print to online publishing. Passage Systems specializes in "practical" SGML - using SGML, not as an end in itself, but as a means for managing information in non-proprietary ways so that the same content can be delivered in multiple formats or channels, such as CD-ROM, the World Wide Web, print, or Braille. Our trainers and consultants have introduced SGML to hundreds, maybe thousands of people, including most of our own employees. We own many copies of the SGML Handbook and many copies of Practical SGML.

Review of the 2nd Edition

In 1995 I eagerly purchased the second edition of Practical SGML, expecting it to be as important a book for the new era of SGML as the first edition had been in the initial era. In the preface van Herwijnen sets the unequivocal target for Practical SGML "to withstand the test of time and make it the definitive introductory book about SGML" (p. xvi).

The second edition has many strengths. It represents a substantial rewriting of the first edition. While it remains largely a simplified introduction to SGML syntax, the second edition is kinder and gentler than the first edition. It contains dozens of exercises that help readers verify their understanding as they work through the book. Icons in the margin, labeled as "opinions", "advice", or "warnings", highlight information to which the author wants the reader to pay special attention.

Some topics critical for newcomers to SGML are covered more thoroughly, especially document analysis and DTD construction. Some advanced topics from the first edition have been treated more lightly or eliminated. The book's thorough Glossary and Index make it an excellent quick reference.

The book, of course, is not flawless. Practical SGML still has a distinct bias toward programmers, which the author intends, but which makes it less readable for writers and managers. One manifestation of this bias is the recommendation to explain the meaning of each element as comments in the computer-readable part of the DTD (pp. 108-109). This puts important information where it would be found only by SGML syntax experts. It would be more useful to explain the document type in a separate document written for authors, trainers, managers, or others with a stake in the analysis. Mixing this design rationale and usage information with its syntactic encoding makes it less accessible.

Practical SGML's assertion that "document analysis and writing DTDs are very akin to programming" (p. xvi) overemphasizes the last stage of translating a document type into SGML syntax. It glosses over the effort and iteration involved in eliciting information requirements, reaching a consensus among the stakeholders, and making the DTD compatible with the capabilities and processes of the organization that will use it. This more complex perspective is comprehensively treated in Maler and El Andaloussi's Developing SGML DTDs (1996), which is likely to become the standard reference.

Another significant weakness of the book is its lack of discussion of how to use SGML to represent tables. Tables occur frequently in publications and there are numerous tables in Practical SGML, but there is absolutely no discussion of how to use SGML to model tables. SGML does not prescribe any particular way to represent tables, and while the CALS table model is often used, its two-dimensional geometric approach is too weak to represent many types of commonly occurring tables (Thompson, 1995). It would be useful pedagogically in an introductory book to contrast geometric markup for tables with more content-based alternatives.
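The contrast between content-based and geometric table markup can be illustrated with a small Python sketch. The element names are hypothetical: partlist, part, number, and voltage stand for a content-based model, while table, row, and entry stand for a geometric one in the general style of row-and-cell table models. The markup is fully tagged so that xml.etree.ElementTree can read it.

```python
import xml.etree.ElementTree as ET

# Content-based markup: element names say what each value means.
# All element names here are hypothetical.
CONTENT_BASED = """\
<partlist>
<part><number>A-100</number><voltage>5</voltage></part>
<part><number>A-200</number><voltage>12</voltage></part>
</partlist>"""


def to_geometric(partlist):
    """Derive row/entry (geometric) markup from content-based markup."""
    table = ET.Element("table")
    for part in partlist:
        row = ET.SubElement(table, "row")
        for field in part:
            entry = ET.SubElement(row, "entry")
            entry.text = field.text
    return table


geometric = ET.tostring(
    to_geometric(ET.fromstring(CONTENT_BASED)), encoding="unicode"
)
print(geometric)
```

The geometric form can be generated mechanically from the content-based form, but not the other way around: once a value is only an entry in a row, the markup no longer records that it is a voltage rather than a part number.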

The author has an occasional bad habit of letting his expert knowledge of SGML syntax overcome his responsibility not to lead SGML beginners into temptation:

  • "Although SGML technically allows you to make these changes [to the reserved names used in the reference concrete syntax], I have personally never had the need to do so. But if you want to..." (p. 142)
  • "My recommendation is not to use minimization, but if you absolutely want to, this section tells you how to do it" (p. 152).

Practical SGML has lengthy descriptions of SGML minimization features (OMITTAG, SHORTTAG, and DATATAG), perhaps because their cleverness appeals to the programmer in van Herwijnen. These features were invented to reduce typing when SGML markup was entered "by hand", but SHORTTAG and DATATAG are almost never used today. OMITTAG is used by SGML experts (so they can quickly modify SGML documents with text editors like vi, for example), but OMITTAG should never be used by SGML beginners. Beginners should create SGML using a "native" SGML editor or by running a program that converts word processor source files.

A Paradox

When I itemize the strengths and weaknesses of the second edition and compare it to the first, I can convince myself that the second is significantly better. But the second edition feels much less practical than the first one did for me five years ago, and it does not achieve the ambitious goals the author sets for it. While I recommend Practical SGML for someone who needs to learn SGML syntax, I can no longer recommend it as the book from which to learn about SGML.

If Practical SGML does not fully live up to its title, it is because the context in which any treatment of SGML must be evaluated today has profoundly changed. The first edition appeared when many people were learning SGML because they had to, in order to comply with CALS or a similar contractual mandate to create SGML documents for interchange, not for online delivery. Many of these people were programmers or had a programming or text processing background. In this context, learning SGML meant learning its syntax, so Practical SGML's more accessible treatment of SGML syntax could be heralded as a practical contribution.

Today, most people who hear about SGML have more discretion about whether to use it, most of them are not programmers, and the ubiquitous hype surrounding the World Wide Web has already exposed them to text markup via HTML. For a book about SGML to be practical today, it is no longer enough that it be easier to read than the SGML Handbook. A practical book must situate and justify SGML with respect to other formats for information management and delivery, especially HTML on the World Wide Web. It must do so for writers and production personnel rather than for programmers, and it must do so in an end-to-end organizational context. These are harder challenges and they impose a higher threshold for making the claim of "practical" in a book's title.

SGML {and, vs} HTML

The first chapter of Practical SGML makes the case for SGML, but I would have made a stronger one. The relentless marketing pace of new releases of word processing software, adding new features at the cost of format incompatibility, poses a continuous threat to the longevity of information. Many alternatives exist for delivering information electronically on CD-ROM, on the Internet, and in embedded applications, but each may require different markup formats. Even if you don't author using SGML, "thinking SGML" by authoring in a structure-aware manner to facilitate conversion of your most important information to SGML is a good defense against these proprietary pressures. It is with respect to HTML and the World Wide Web that these pressures are most intense. It is a measure of the explosive growth of the Web in the last year that Practical SGML, published late in 1994, completely fails to mention it. (Of course, this means that Practical SGML was being written even earlier, so I do not intend it as a criticism of the author that his crystal ball didn't predict the Web phenomenon.)

Much of the increased attention to SGML in the past few years is directly due to the hype surrounding HTML and the Web, but the relationship between HTML and SGML is often misunderstood. HTML is properly described as an "application" of SGML, not a "subset" or "descendant" and certainly not a "replacement" for it, although I've heard all three claims. Much of the great appeal of HTML is its simplicity as a small set of elements for headings, lists, and links, which has made it easy for many software firms to create programs that convert word processing files to HTML or "browsers" that interpret the HTML elements as formatting instructions. The HTML elements and the rules that specify their valid use are documented in a DTD maintained by the World Wide Web Consortium.

Unfortunately, with the exception of programs customized for handling HTML developed by vendors of SGML technology, most HTML technology doesn't treat it as an application of SGML. These tools do not parse their HTML input or output to ensure it is valid with respect to the HTML DTD. Worse yet, some vendors have unilaterally added elements to HTML that only their browsers understand. This is a well-motivated but short-sighted attempt to give Web publishers more flexibility and control over document appearance. It would be far better if all Web browsing technology made use of arbitrary SGML so that any document could use any DTD and the browser could interpret the elements appropriately.

Until this happens, however, the net result of unvalidated HTML and proprietary extensions is to limit the interoperability and technology independence of Web documents. Users can get locked into whichever "dialect" of HTML is most successfully promoted by a browser vendor.

It is easy to get on the Web using HTML in a way that misses the point of its SGML foundations. I believe, however, that except for individuals who publish a small number of Web pages, only an SGML-centric strategy is sustainable after the initial excitement and heroic "hand-crafting" of the first Web publication. An organization publishing large amounts of information on the Web, and for whom timeliness, consistency, and reuse are important goals, needs an SGML-centric approach. Such an approach treats HTML, in whatever proprietary dialects are necessary to exploit the formatting characteristics of different browsers, as output formats created by "down-translation" from SGML. In the future revealed by my crystal ball, when all Web browsers can make full use of arbitrary SGML, no huge legacy of HTML-tagged information would have to be converted.

The SGML can be created using an SGML editor or converted from word processing source files, but in either case it should be SGML according to a document model that better meets the needs of your organization, not HTML, which is too simple and format-oriented to capture the complete structural and content model of your information. The HTML generated from SGML can also be automatically modularized, more consistently and completely linked, indexed by content type, and more efficiently maintained than HTML created by hand or from word processing files without an intermediate SGML representation (Suttor, 1996).
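A minimal sketch of "down-translation" in Python shows why HTML works better as an output format than as a source format. The HTML_MAP table and the source element names are hypothetical, and the source is written fully tagged so that xml.etree.ElementTree can stand in for an SGML parser; a production system would use SGML-aware transformation tools instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from content-oriented source elements to HTML elements.
HTML_MAP = {
    "chapter": "div",
    "title": "h1",
    "para": "p",
    "term": "em",
}


def down_translate(element):
    """Build an HTML tree that mirrors the source tree, renaming each element."""
    html = ET.Element(HTML_MAP.get(element.tag, "span"))
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(down_translate(child))
    return html


source = ET.fromstring(
    "<chapter><title>Setup</title>"
    "<para>Attach the <term>flange</term> first.</para></chapter>"
)
html_text = ET.tostring(down_translate(source), encoding="unicode", method="html")
print(html_text)
```

The translation loses information: the glossary term becomes generic emphasis, so the richer source cannot be recovered from the HTML. Keeping the source in the richer model and regenerating the HTML, with a different mapping table for each browser dialect if need be, avoids that loss.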

An End-to-end Organizational Perspective on SGML

In addition to emphasizing the syntax of SGML over its implementation, Practical SGML is aimed at individuals, which is unfortunate because individuals generally adopt SGML when their organizations do. The ideal result of an SGML implementation is a coherent end-to-end system: an integrated set of authoring, conversion, validation, indexing, delivery, and database management software that enables an organization to meet all its requirements for creating, managing, and disseminating information. In such a system information from different authors and sources might be created with different tools, but after coming together as SGML it can be reassembled or linked to create the completed document for delivery to a customer. A technological perspective on SGML emphasizes the mechanisms for validation at tool boundaries, for shared development, for configuration management, and for version control that enable the numerous pieces of information to come together. But this technological perspective is insufficient to ensure a successful adoption of SGML.

Adopting SGML may require learning new syntax and acquiring new software, but more important to successful adoption is institutionalizing a conceptual shift in focus from fixed printed publications to "information repositories" or "document databases". This new perspective on information management and delivery always involves paying greater attention to the information requirements of customers and users. Changes to authoring and production processes are often required when an organization recognizes the benefits of structural encoding and reuse, partly to overcome years of using proprietary WYSIWYG word processing software to make publications look a particular way for a single purpose. (A case study of SGML adoption written from this organizational and process perspective is described by Glushko and Kershner, 1993.)

This broader perspective requires that the organization carefully assess the impact of the end-to-end system on the people who work with it. What are the costs and benefits? Are these costs one time, incurred only during the transition to the new system, or are they recurring? Are the existing skills of the people involved sufficient for them to carry out their new tasks, or do they need additional training? These questions have nothing to do with SGML syntax, but if they are not answered an SGML project will fail.

Conclusion

SGML enables an attractive and sustainable vision of information management, but achieving it takes effort and a long-term perspective. It can be hard to resist the seduction of HTML and the World Wide Web, especially when SGML and HTML share the same syntax and the latter seems so much easier to use than the former. Thus it is essential that someone who wants to learn about SGML does so from a book that is less focused on syntax than Practical SGML. A book that fits this requirement is ABCD...SGML by Liora Alschuler (1995), which is an excellent first book for managers and writers because it deals with issues and case studies. Practical SGML makes a good second book on SGML from which to learn syntax after the idea of SGML is firmly understood. The ideal book would explain the ideas of SGML and its syntax at the same time, but this book does not yet exist.

Two very useful sources of information about SGML on the World Wide Web are the "SGML Web Page" (http://www.sil.org/sgml) and the site developed by SGML Open, a consortium of SGML vendors (http://www.sgmlopen.org).

References

Alschuler, L. (1995). ABCD...SGML. London: International Thomson Computer Press.

Glushko, R., and Kershner, K. (1993). Silicon Graphics' IRIS InSight: An SGML success story. Technical Communication, 40 (3), 394-402.

Goldfarb, C. (1990). The SGML Handbook. Oxford: Clarendon Press.

Maler, E., and El Andaloussi, J. (1996). Developing SGML DTDs. Upper Saddle River, NJ: Prentice Hall PTR.

Suttor, J. (1996). HTML and Internet Publishing. Course Notes published by the Graphics Communication Association, Alexandria VA.

Thompson, M. (1995). A tables manifesto. Proceedings of SGML'95, 411-414.

Van Herwijnen, E. (1990). Practical SGML (1st Edition). Boston: Kluwer Academic Publishers.

Van Herwijnen, E. (1994). Practical SGML (2nd Edition). Boston: Kluwer Academic Publishers.



Copyright © 1997 Passage Systems Inc.