[Mirrored from: http://www.arbortext.com/wp.html]

Getting Started with SGML

A Guide to the Standard Generalized Markup Language and Its Role in Information Management

An ArborText White Paper

(c) 1992, 1995, ArborText, Inc. This file may be redistributed electronically as long as it remains wholly intact, including this notice and copyright. This file must not be redistributed in hard-copy form. ArborText will freely distribute this document in its original published form on request.

Table of Contents

The Business Challenge
Unleashing the Power of Information
Getting to Know SGML
What Does SGML Give Me?
Is SGML Right for Me?
What is a Good SGML System?
Who Uses SGML Now?
What is CALS?

The Business Challenge

The explosive success of the Internet is an obvious example of an information revolution that's well under way. Companies that appreciate the tremendous cost and value of information are reengineering their processes for creating, distributing and accessing information. The opportunities in each of these areas can be enormous:

Creation: By some estimates, 20% of our GNP is spent on generating new information. And over 90% of that information is in documents, not databases. Do you really know how much your organization invests in the creation of information?

In conventional word processing and desktop publishing systems, your authors spend up to 30% of their time searching for information, and another 30% of their time applying styles and squeezing paragraphs so that each printed page looks nice. And every 18 months, technology changes completely, so you're continually paying for data conversions as software and hardware become obsolete.

Distribution: A few years ago, you could provide your information on paper alone. Then CD-ROM technology became low-cost and widespread, so you've either already faced or soon expect to face the massive re-publishing effort needed to make all your information available electronically. And in just the last year, the World Wide Web has thundered out of nowhere, creating yet another new format for your information.

At the same time, your customers want your information tuned to their needs: they don't want to wade through huge technical manuals that describe all system variations and all possible uses for all possible users -- they want information tailored to their own needs, so they can get it and use it fast.

Access: In the U.S. alone, businesses produce 92 billion documents every year -- and that number is skyrocketing. Can your people easily access the information you create in your own company? How about the information you receive from other companies?

An organization's future can depend on how effectively it identifies, manages, and uses its information. The latest thinking in information management takes an enterprise-wide approach to the creation, distribution and maintenance of information. Organizations that have undertaken this approach have realized enormous improvements in the cost, accuracy, timeliness and variety of the information they create and use.

As part of this movement, companies in some industries are joining together to develop standards for exchanging information with each other and with their customers. Companies that keep up-to-date with these standards will be able to do business more efficiently and compete more effectively in global markets. This white paper describes how one such standard, the Standard Generalized Markup Language (SGML), works as part of an overall information management strategy.

Unleashing the Power of Information

Traditional documents and the methods for handling them suffer many limitations. The printed document is often the result of a sophisticated information process. Once it's printed, however, the document represents a dead-end in the information flow because it has no link to the electronic information base.

Raw data may start in the form of technical specifications or engineering data. This information must be gathered, sorted, organized, and then manually assembled into hard copy documents. At each step in the documentation process, the information may be changed by mistake. The further removed the result is from the original source of information, the greater the risk of erroneous data. The problem can become so large that a majority of documents go out of date as soon as they are printed.

A systematic approach to information management treats text and graphic data as part of an organization's electronic information base. This gives everyone access to the information. By taking a broad view of the information creation and delivery process, you can see documents as including any kind of information -- the output from a database query, a printed document, an online diagnostic manual, an illustrated parts catalog, a collection of video clips, or a home page on the World Wide Web (Internet).

SGML allows you to manage information as data objects instead of as characters on a page. Rather than a stream of indistinguishable bits and bytes, the data is "chunked" into discrete elements of information. This technology enables you to store and reuse the information efficiently, share it with many users, and maintain it in a database.

Getting to Know SGML

This white paper provides an introduction to existing SGML technology, its advantages and benefits, as well as an overview of some related standards and how they fit into an overall approach to managing information. We also define some of the terminology and acronyms to familiarize you with the language associated with SGML.

While SGML is a fairly recent technology, the use of "markup" in computer-based documents has existed for a while. Let's first look at earlier markup schemes that led to SGML.

What is markup?

Markup is everything in a document that is not content. Markup originally referred to the handwritten notations that a designer would add to typewritten text; these notations contained instructions to a typesetter about how to lay out the copy and what typeface to use. This kind of markup is known as "procedural markup."

Procedural markup: Most electronic publishing systems today, such as word processing software and desktop publishing software, use procedural markup. Procedural markup is typically unique to a specific software package such as Microsoft Word or QuarkXPress. Each package has its own set of markup codes that make sense only to itself. This markup usually takes the form of formatting codes that are mixed in with the text of the document. Procedural markup codes apply to a single way of presenting the information, such as a printed page, and provide no capability to define appearance for other media, such as CD-ROM and the Internet.

Descriptive markup: Descriptive markup, also known as "generic markup," describes the purpose of the text in a document, rather than its physical appearance on the page. The basic concept of descriptive markup is that the content of a document should remain separate from its style. Descriptive markup is based on the structure of a document and identifies elements within that structure -- such as a chapter, a section, or a table of contents -- using notations that describe what the element is, not how it appears. By separating presentation information (i.e., style) from the structure, descriptive markup allows for multiple presentations of the same information. For example, you can publish on paper, on-line, on CD-ROM and on the World Wide Web (Internet), all from one set of source files.

Drawbacks of procedural markup: Producers of technical documentation increasingly prefer descriptive markup over procedural markup. Procedural markup is tedious and expensive; authors can spend 15% to 50% of their time on the appearance of each page. If style guidelines change, or if you need to present the same information in a different format, massive re-formatting can be required. When a company changes software or hardware systems, enormous data translation tasks arise, often resulting in errors. Because procedural markup is tied to one final printed product, you cannot change formats easily. Interchanging documents based on procedural markup works easily only if both parties have the same system.

What is SGML?

The Standard Generalized Markup Language, or SGML, is an international standard (ISO 8879) published in 1986. SGML prescribes a standard format for embedding descriptive markup within a document. More importantly, and crucial to its real value and power, SGML also specifies a standard method for describing the structure of a document.

In other words, SGML allows you to set up hierarchical models for each type of document you produce. SGML forces each element in the structure, which is labeled with descriptive markup such as "chapter," "title" and "paragraph," to fit in the logical, predictable structure of your document.

SGML supports an infinite variety of document structures. Users typically design a different document structure for each category of information they produce: information bulletins, technical manuals, parts catalogs, design specifications, reports, letters and memos.

SGML allows you to create documents that are independent of any specific hardware or software. Since SGML documents conform to an international standard, they are portable. You can exchange them seamlessly with users who have different systems.

The world of photography demonstrates the power of standards: SGML is to documents as standardized film speed is to cameras. Today you can purchase a roll of film marked "ISO 100," put the film in your camera, set the camera's film speed to 100 (which many cameras do automatically), and you're ready to shoot. You don't have to worry that the brand of film is not compatible with your particular make of camera. The film and camera manufacturing industries -- through the International Organization for Standardization (ISO) and American Standards Association (ASA) -- have agreed on standards for film speeds. Many industries plan to use SGML so that documents work as easily on different computers as film works in different cameras.

How does SGML work?

You can break a typical document into three layers: structure, content, and style. SGML separates these three aspects, but deals mainly with the relationship between structure and content.

Structure: At the heart of an SGML application is a file called the DTD, or Document Type Definition. The DTD describes the structure of a document, much like a database schema describes the types of information it handles and the relationships between fields. A DTD provides a framework for the elements (such as chapters and chapter headings, sections, and topics) that constitute a document.

A DTD also specifies rules for the relationships between elements. For example: "a chapter heading must be the first element after the start of a chapter," or "each list must contain at least two items." These rules, which the DTD defines, help ensure that documents have a consistent, logical structure. A DTD accompanies a document wherever it goes. A "document instance" is a document whose content has been tagged in conformance with a particular DTD.
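As a rough illustration of how rules like these might be written, the following is a minimal sketch of a DTD followed by a conforming document instance. The element names (chapter, heading, section, subhead, par, list, item) are assumptions chosen for this example and are not taken from any particular published DTD.

```sgml
<!-- Hypothetical DTD sketch; element names are illustrative only. -->
<!DOCTYPE chapter [
<!-- A chapter starts with exactly one heading, then one or more sections. -->
<!ELEMENT chapter - - (heading, section+)>
<!-- A section has a subhead followed by paragraphs and lists in any order. -->
<!ELEMENT section - - (subhead, (par | list)+)>
<!-- "item, item+" means a list must contain at least two items. -->
<!ELEMENT list    - - (item, item+)>
<!-- These elements hold plain character data. -->
<!ELEMENT (heading | subhead | par | item) - - (#PCDATA)>
]>
<chapter>
<heading>How does SGML work?</heading>
<section><subhead>Content</subhead>
<par>Content is the information itself.</par>
<list><item>titles</item><item>paragraphs</item></list>
</section>
</chapter>
```

Under these assumed declarations, a validating parser would reject a chapter whose first element is not a heading, and would also reject a list that contains only one item.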

Content: Content is the information itself: content includes titles, paragraphs, lists, tables, graphics, and audio. The method for identifying the content's position within the DTD structure is called "tagging." Creating an SGML document involves inserting tags around content. These tags mark the beginning and end of each part of the structure. In the following example, "<par>" indicates the start of a paragraph, and "</par>" indicates the end:

<par>Content is the information itself.</par>

You can nest elements within other elements; in the following example, the paragraph ("<par>") is an element within the topic ("<topic>"):

<topic><par>Content is the information itself.</par></topic>

The structure of a particular document is revealed by the nesting of tags:

<section><subhead>Content</subhead><par>Content is the information itself.</par></section>

Fortunately, human beings usually don't have to deal with manually typing in tags and checking to make sure all the tags are there. Some SGML-based authoring software programs make it easy to enter tags by clicking on pull-down menus that list only those tags that are valid at the cursor's current position in the document. These programs rely on a software module called a "parser" that verifies that the document follows the rules of the DTD. (The parser also verifies that the DTD itself is structurally correct.) The following illustration shows how an SGML-based authoring program would display the previous example:

Style: SGML itself has nothing to do with setting standards for style, so most systems still rely on proprietary methods. Two efforts to develop standards-based style sheets have resulted in the mature OS and the still unreleased DSSSL.

The U.S. Department of Defense CALS initiative developed its own standard, known as the Output Specification (OS). The OS is in the form of a particular DTD that allows the user to create a Formatting Output Specification Instance, or FOSI (usually pronounced "fossy"), that is well suited to both printed and electronic output.

A FOSI is essentially a powerful style sheet that specifies the formatting for each tag in a DTD. With the FOSI, the document, and the DTD, you have a complete interchange package for printed documents.

In early 1995, an ISO committee released a draft of the Document Style Semantics and Specification Language (DSSSL), which will eventually become an international standard for presenting SGML-based documents. Official release is expected later this year.

The complete DSSSL standard covers a broad scope, so subsets are being developed to handle varying levels of functionality. A subset whose functionality is approximately equivalent to FOSIs is expected, and work on tools to convert FOSIs to and from DSSSL is under way.

Many military contracts currently require FOSIs, and many non-defense firms have also embraced the OS because it's a mature and supported standard. It is expected that both DSSSL and FOSIs will remain important standards for the foreseeable future.

What Does SGML Give Me?

SGML has become mainstream technology that you can use with confidence. Your adoption of SGML will allow your organization to gain the maximum value from your generation and use of information:

Increased productivity

A structured approach to documents reminds writers how to organize information, and keeps content separate from style. This separation enables you to set up centrally-controlled style guidelines, so authors can focus on content rather than appearance. That alone can as much as double your authors' productivity.

You can also improve efficiency by keeping only one copy of information that's used by many so that authors don't re-create the same information.


A printed document is just one of many possible products from SGML-based information. For example, a technical publications group can use tags to identify a procedure with a sequence of tasks. In this case, you identify the beginning and end of the procedure, and each step in the procedure. The procedure can now appear in several forms: maintenance and operational manuals, online technical manuals, training guides, etc. More importantly, since the tags are machine-readable, a computer can manage and maintain the different uses of the task from a single source.
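A tagged procedure of the kind described above might look something like the following fragment. This is only a sketch: the tag names (procedure, title, step) and the sample text are invented for illustration, and a real DTD would define its own names.

```sgml
<!-- Hypothetical tags and sample text, for illustration only. -->
<procedure>
<title>Replacing the air filter</title>
<step>Open the access panel.</step>
<step>Remove the old filter.</step>
<step>Insert the new filter and close the panel.</step>
</procedure>
```

Because each step is marked explicitly, the same source could be numbered in a printed manual, shown one step at a time in an online manual, or extracted into a training guide, without any change to the tagged text.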

Information longevity

Because SGML is a simple, standard file format, you'll never again have to convert your documents when a hardware or software system becomes obsolete. Once you define documents, the information will always be available. The information carries with it everything needed to create a document. So even when your hardware or software becomes obsolete, your information remains usable and available.

Improved data integrity

Document structure helps ensure that the right information is in the right place, bringing more organization to your information. Because SGML eliminates data translation, you also avoid the risk of losing information that comes with filtering data from one format to another.

Better data control

With SGML, you can define and manipulate information elements at any level of detail. Tagged elements can have attributes that provide characteristics or properties about the element. This attribute information is not intended for printing but can help with managing the data elements. For example, an ID (identifier) attribute can uniquely identify a single paragraph, a whole section, a legal notice, an illustration, a task, or any element, as seen in the following example:

     <par id="431">Content is the information itself.</par>
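For an attribute like this to be checked by a parser, it would be declared in the DTD. The fragment below is a minimal sketch of such declarations together with two tagged paragraphs. The xref element and the refid attribute are assumptions introduced for this example. Note also that a value declared with the SGML type ID must be an SGML name, which in the reference concrete syntax begins with a letter, so the sketch writes the identifier as p431 rather than 431.

```sgml
<!-- Hypothetical declarations; names are illustrative only. -->
<!ELEMENT par  - - (#PCDATA | xref)*>
<!ELEMENT xref - O EMPTY>
<!-- id is optional on par, but each value must be unique in the document. -->
<!ATTLIST par  id    ID    #IMPLIED>
<!-- refid must match the id of some element in the same document. -->
<!ATTLIST xref refid IDREF #REQUIRED>

<par id="p431">Content is the information itself.</par>
<par>Tagging is described in <xref refid="p431">.</par>
```

With declarations along these lines, a validating parser can report duplicate identifiers and references that point to nothing, which is part of what makes IDs useful for management controls.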

Because IDs are machine-readable, they can link related information and be used for a variety of information management controls. These controls can help you to:


Because SGML works with structured document components, you can build entire documents out of information from various parts of the organization. This feature enables users to share the latest information without duplicating it. An example of this might be a standard legal notice or copyright statement appearing in documents throughout a company. The legal department maintains this module of information, updating it on occasion. A single tag in a document can pull in the current notice and you can print it on demand in any number of publications, eliminating needless duplication of information.
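In SGML, this kind of sharing is commonly achieved with an entity reference. The sketch below assumes a hypothetical DTD file (manual.dtd), a hypothetical shared file (legal.sgm) maintained by the legal department, and illustrative element names.

```sgml
<!-- Hypothetical file, entity, and element names, for illustration only. -->
<!DOCTYPE manual SYSTEM "manual.dtd" [
<!-- The shared notice lives in one file; every document points to it. -->
<!ENTITY legal SYSTEM "legal.sgm">
]>
<manual>
<title>Installation Guide</title>
&legal;
</manual>
```

When the document is processed, the reference &legal; is replaced by the current contents of the shared file, so an update made once by the legal department would appear in every publication that refers to it.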

Portability of information

Today, information networks proliferate where different computers, operating systems, and applications must share information. In networks of this sort, portability becomes the key in making sure all who need it can access the information. Because SGML is hardware and software independent, you can exchange documents easily among different systems.

Flexibility beyond traditional publishing

The information you create today may be used a year from now in ways you haven't yet anticipated. (When we first wrote that sentence, the need to publish on the World Wide Web did not even exist! The spectacular growth of the Web serves as dramatic proof that we simply cannot anticipate all the purposes for which our information may eventually be used.)

SGML permits you to use your information for applications beyond traditional publishing. For example:

Is SGML Right for Me?

In the life cycle of a product, the cost of gathering, producing, and maintaining the necessary technical information can exceed the initial hardware and equipment cost. For many industries, technical information is part of a deliverable product, or a product in itself that must be rigorously maintained. Any industry whose product line is heavily dependent on information can benefit from SGML.

In evaluating how SGML can help your organization, you may wish to consider some strategic business issues to help in your information management plan. A strategic approach should prompt you to examine your current information needs and your current document management methodology. Some questions to consider include:

By examining your requirements, you can evaluate how SGML fits into your information management strategy. Standardizing on SGML doesn't mean you need to use it for all documents; SGML is most useful for documents with a definable structure. Since SGML handles documents as collections of distinguishable data elements, it is useful to think in terms of modules of information, rather than complete printed documents.

SGML is most useful as a tool in an integrated information management strategy. A company's high-level management should make this strategic choice and plan the implementation. There will be initial implementation costs in moving to SGML, but the payback comes from benefits that accrue over time and enhance your information investment. Any organization that exchanges information between systems, applications, departments, and companies will realize these benefits.

What is a Good SGML System?

By design, SGML is meant to be customized. Just as there's no single database that can serve all the needs of every organization, there are no one-size-fits-all SGML applications. Since each organization's information requirements are different, there are many DTDs. More organizations are looking at industry-wide information needs and developing standards for handling that information.

A number of products on the market handle SGML to some degree. But not all products handle all the features of the standard. The sections below describe some basic requirements.

Provides real-time interactive parsing

An invaluable feature in a system is real-time, interactive SGML validation. This feature allows the software to provide context-sensitive editing assistance based on the cursor's current position in the document. For example, if the cursor is immediately after the beginning tag for a section, and all sections must have a section heading, the software allows you to insert only a section heading tag. This feature ensures that the author's tagging is correct at all times. By contrast, systems that use batch parsing allow authors to insert tags and text without checking each action against the DTD. Batch parsing makes tagging and parsing a repetitive, trial-and-error process.

Uses real SGML

If your authoring software merely produces SGML as output, then you're still tied to a proprietary format, and still at the mercy of software and hardware obsolescence. A publishing system that uses SGML as its native file format is superior to one that filters the data into SGML. In the latter approach, authors create documents in one format, then filter parts of the document into SGML, and then run the SGML through a validating parser. When the parser finds errors, the author must correct the original document, then filter and parse the changes again. The author must repeat this cycle until the entire document parses successfully. This approach can add tedious steps to the publishing process. A system that creates native SGML eliminates the costly, time-consuming, and error-prone process of retrofitting documents into valid SGML.

Supports any DTD

Some SGML systems lock you into a fixed set of DTDs supplied with the software. To be fully usable, a good SGML product allows you to create a variety of document types. This feature is sometimes called the ability to handle "arbitrary" or user-defined DTDs.

Supports SGML features

The developers of SGML built into the standard a number of features that facilitate automated publishing and document re-use. A fully-featured SGML publishing package should support this functionality. Some of the basic features to look for include:

Who Uses SGML Now?

Early in its history, the primary adopters of SGML were defense contractors. In the last two years, however, the trickle has turned into a torrent. Many leading organizations have recognized the benefits SGML offers and have adopted it for information management.

Several industry groups exist to standardize information exchange among their members and between members and their vendors and customers:

Many SGML applications are in commercial use. Other industries moving to SGML include pharmaceuticals, automotive, and manufacturing.

Overseas, SGML is gaining wide acceptance. Airbus, a European consortium of companies in the commercial aircraft industry, has adopted SGML. Telecommunications, aerospace, manufacturing, and other commercial and military interests throughout Europe are also using SGML.

What is CALS?

CALS stands for Continuous Acquisition and Life-Cycle Support (recently renamed from Computer-aided Acquisition and Logistic Support). It is a large-scale, long-term information management project initiated by the U.S. Department of Defense (DoD). Since the DoD receives goods and services from a wide range of suppliers, contractors and subcontractors, it constantly handles massive quantities of technical information. Today's weapon systems are technologically complex and can have a life span of 20 years or more. As a result, the amount of technical data needed to support and maintain these systems is overwhelming.

The aim of CALS is to reduce the cost of supporting and maintaining military equipment. Through CALS, the government also hopes to reduce costs in the initial design and engineering stages. SGML is a part of the overall CALS program which includes a comprehensive array of standards.

The CALS standards that apply to maintaining technical information include:


Here are a few resources for more information; the SGML Resources page contains links to many on-line resources as well.

Conferences, tutorials, and training

The Graphic Communications Association (GCA) was instrumental in the development of SGML. The GCA provides conferences, tutorials, newsletters, and publication sales for both members and non-members.

Graphic Communications Association
100 Dangerfield Rd.
Alexandria, VA 22314-2804

SGML Open is a non-profit, international consortium of providers of SGML products and services, dedicated to accelerating the further adoption, application, and implementation of SGML.

SGML Open
218 Parliament Drive
Coraopolis, PA 15108

ArborText also offers a range of introductory to advanced level SGML training courses, including DTD and FOSI training.

Books on SGML

Bryan, Martin. SGML: An Author's Guide to the Standard Generalized Markup Language, Addison-Wesley (1988) ISBN 0-201-17537-5

Goldfarb, Charles. The SGML Handbook, Oxford University Press (1990) ISBN 0-19-863737-9

Van Herwijnen, Eric. Practical SGML, Second Edition, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061, (1994) ISBN 0-7923-9434-8


ASCII: (American Standard Code for Information Interchange) This standard character encoding scheme is used extensively in data transmission.

ANSI: (American National Standards Institute) This group is the U.S. member organization that belongs to the ISO, the International Organization for Standardization.

attribute: An attribute provides more information about an element such as classification level, unique reference identifiers, or formatting information.

CCITT Group 4: (International Telegraph and Telephone Consultative Committee) This CALS standard for raster graphics incorporates tiling, which divides a large image into smaller tiles. You can exchange graphic files in CCITT/4 format in a compressed state so they take up much less file space.

CITIS: (Contractor Integrated Technical Information Service) As part of CALS Phase II, CITIS is a draft functional specification for services. DoD acquisition managers designed CITIS as a plan to gain access to product-related digital technical information.

CGM: (Computer Graphics Metafile) CGM is one of the CALS standard formats for representing 2-D technical illustrations. CGM is an object-oriented graphic format.

DSSSL: (Document Style Semantics and Specification Language) This draft international standard (DIS 10179) applies to the specification of processing information for SGML documents. DSSSL is expected to become an international standard this year.

DTD: (Document Type Definition) A DTD is the formal definition of the elements, structures, and rules for marking up a given type of SGML document. You can store a DTD at the beginning of the document or externally in a separate file.

EDI: (Electronic Data Interchange) This is a set of computer interchange standards for business documents such as invoices, bills, and purchase orders.

element: An element is a piece of data within a document, such as a paragraph or a chapter, that may contain text or other subelements.

element declaration: A statement in the DTD defining an element and declaring the order in which it may appear in the document and what other elements it may include.

entity: An entity is a self-contained piece of data that can be referenced as a unit. You can refer to an entity by a symbolic name in the DTD or the document. An entity can be a string of characters, a symbol character (unavailable on a standard keyboard), a separate text file, or a separate graphic file.

entity declaration: A statement in the DTD or document that assigns an SGML name to an entity so you can reference it.

FOSI: (Formatting Output Specification Instance) A FOSI is used for formatting SGML documents for printing and other outputs. It is a separate file that contains formatting information for each element in a document.

HTML: (HyperText Markup Language) This is the format of files published on the World Wide Web. HTML is an application of SGML; to author in HTML using SGML-based authoring software, you simply need the HTML DTD.

IGES: (Initial Graphics Exchange Specification) The IGES standard for engineering, product design, and manufacturing drawings is one of the CALS standard graphics formats.

Internet: The Internet is a worldwide communications network originally developed by the U.S. Department of Defense as a distributed system with no single point of failure. The net was long the province of scientists and academics, but the development of easy-to-use software for accessing it has generated an explosion in commercial use.

ISO: (International Organization for Standardization) The ISO is an industry-supported organization that establishes world-wide standards for everything from data interchange formats to film speed specifications.

markup: Markup is anything added to the content of the document that describes the text.

parser: A parser is a specialized software program that recognizes SGML markup in a document. A parser that reads a DTD and checks and reports on markup errors is a validating SGML parser. A parser can be built into an SGML editor to prevent incorrect tagging and to check whether a document contains all the required elements.

PDES/STEP: (Product Data Exchange Standard/Standard for the Exchange of Product Model Data) PDES/STEP are standards under development for communicating a complete product model with sufficient information content that advanced CAD/CAM applications can interpret. PDES is under development as a national standard and STEP is under development as its international counterpart.

tag: In the world of SGML, a tag is a marker embedded in a document that indicates the purpose or function of the element. Each element has a beginning tag and an end tag.

World Wide Web: Often referred to as WWW or the Web, this usually refers to information available on the Internet that can be easily accessed with access software usually called a "browser." Organizations publish their information on the Web in a format known as HTML; this information is usually referred to as their "home page."
