On the Road to SGML

by Ludo Van Vooren

Permission to redistribute this whitepaper is granted, provided that no changes are made, and that this notice and the above attributions are included in all copies.

Introduction

For companies trying to improve their information management systems, the options are few and widely diverse. Some choose to improve the current system by upgrading to the latest version of the software, adding memory or getting more disk space. However, these kinds of improvements yield relatively small benefits and have the same effect as applying a small 'Band-Aid' to a wound. The wound in this case might be a lack of software features or a system too inefficient to keep up with ever more complex information. The 'Band-Aid' will stop the bleeding for a while, but it won't heal the wound. Statistics show that there is no significant increase in white-collar productivity when new technology is applied without process re-engineering.

More and more companies are successfully re-engineering their information management systems because they are applying new strategies and building open systems that will keep them successful over the long run. Most of these new systems rely on the Standard Generalized Markup Language (SGML). This radical approach has been paying off for its implementors. Most of them believe they have seen only the tip of the iceberg of this technology's potential. Many applications are expanded beyond their original scope once they are in use, because so much more can be done very easily. Rarely has there been more excitement over an ISO computer standard. In fact, SGML (ISO 8879) is ISO's most widely accepted standard specification ever! Yet after looking more closely at SGML, one might wonder how a system currently in place could ever be modified to take advantage of the technology. The approach suggested by SGML is so different and profound that it seems at first that you can't get there from here, or at least that it would take a long time before benefits become apparent.

Despite that first impression, SGML does not have to be an all-or-nothing investment. There are specific steps you can take today to get your implementation underway. Most importantly, substantial benefits can be obtained in the short term. This paper was written to share with you some of the lessons learned over the years building SGML systems. We will start by reviewing the true benefits of SGML. These are very important for focusing the implementation on tangible goals. The paper then explains the kind of system you eventually want to build, pointing out its essential requirements. By pointing out common problems found in most current systems, it helps you identify the areas that need immediate attention. We will review the transitional system architecture that can be implemented right away to carry out the tasks previously identified. Finally, we will outline a generic implementation plan for the transitional system.

Benefits of SGML

When SGML was first developed in the early 1980s, the main problem targeted was the interchange of documents between various systems. For a long time, SGML was believed to be simply another coding standard that many software packages would understand, thus making document transfer easy. Seven years after its official release by ISO in 1986, many SGML applications have been built without interchange in mind. These systems have exploited SGML's reusability, searchability and automatic management benefits. That is really where the big pay-offs have been.

SGML is a universal document modeling language allowing organizations to structure and manage information in a neutral way so that it can be reused in many different forms, by many people and across many divergent systems. Because the information is stored in a form independent of its usage, content can be assembled in many different ways and given a form only at the last minute. For example, procedures can be stored individually, assembled to describe a specific job and given a different format if they are used in a reference or training manual.

Because SGML identifies what the information is instead of what it looks like, documents stored in SGML are real text databases. These 'textbases' provide extremely powerful searching capabilities not available in normal text searching systems. For example, in a traditional text system all words are equal. In SGML, a word in a chapter title or the value of a part number can be distinguished from the same information anywhere else in the document. Moreover, because SGML organizes documents in the form of a tree structure, more sophisticated queries can be resolved. For example, an SGML system can return the titles of all the sections containing a reference to a specific figure.
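The figure-reference query mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the element names (manual, section, title, para, figref) and the target attribute are hypothetical, and well-formed XML is used as a stand-in for SGML because the Python standard library does not include an SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup. XML is used here as a stand-in for SGML.
DOC = """
<manual>
  <section><title>Removing the pump</title>
    <para>Disconnect the hose as shown in <figref target="fig-7"/>.</para>
  </section>
  <section><title>Cleaning the filter</title>
    <para>Rinse the filter with water.</para>
  </section>
  <section><title>Installing the pump</title>
    <para>Reconnect the hose (<figref target="fig-7"/>) and tighten.</para>
  </section>
</manual>
"""


def sections_referencing(root, figure_id):
    """Return the titles of all sections that contain a reference to figure_id."""
    titles = []
    for section in root.iter("section"):
        # Search the whole subtree of the section for matching references.
        if section.findall(f".//figref[@target='{figure_id}']"):
            titles.append(section.findtext("title"))
    return titles


root = ET.fromstring(DOC)
print(sections_referencing(root, "fig-7"))
```

A plain word search could find the string "fig-7", but it could not tell which section each hit belongs to, nor return that section's title. The tree structure is what makes the query possible.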

A lesser-known benefit of SGML is its ability to automate management tasks that until now required manual processing. Because the information in the document is unmistakably distinguishable by a computer program, applications relying on information contained in the document can be created. For example, document management systems can automatically route a document based on its type, the procedure it describes or its language identification attribute. Furthermore, programs can be written to automatically generate reports or change information contained in documents. Reports showing unresolved references, the location of large unstructured elements or lists of all part numbers have helped tremendously in increasing the accuracy of documents. Programs capable of automatically and reliably changing catalog prices, document effectivity or figure references save valuable time and cost in large production systems.
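As a rough sketch of two such programs, the example below reports unresolved figure references and applies a catalog price change. The markup (figure, figref, part, price) is hypothetical, and XML again stands in for SGML; a production system would work from the real document type.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Hypothetical catalog markup. XML is used here as a stand-in for SGML.
DOC = """
<manual>
  <figure id="fig-1"/>
  <figure id="fig-2"/>
  <para>See <figref target="fig-1"/> and <figref target="fig-3"/>.</para>
  <part number="A-100"><price>12.00</price></part>
  <part number="B-200"><price>40.00</price></part>
</manual>
"""


def unresolved_references(root):
    """Report figure references that point to no existing figure."""
    known = {figure.get("id") for figure in root.iter("figure")}
    referenced = {ref.get("target") for ref in root.iter("figref")}
    return sorted(referenced - known)


def raise_prices(root, percent):
    """Increase every price element in place by the given percentage."""
    factor = (Decimal(100) + Decimal(percent)) / Decimal(100)
    for price in root.iter("price"):
        new_price = (Decimal(price.text) * factor).quantize(Decimal("0.01"))
        price.text = str(new_price)


root = ET.fromstring(DOC)
print(unresolved_references(root))
raise_prices(root, 10)
print([price.text for price in root.iter("price")])
```

Both functions rely only on the fact that a price or a figure reference is identified as such in the document. Neither needs to know how the document is formatted.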

When planning your SGML system, it is important to be aware of these benefits because they are crucial to the development of a successful implementation.

Requirements of an SGML System

To achieve maximum benefits, you will need a system that totally embraces SGML. That system will be composed of a true SGML authoring tool, an information repository (that includes access control, searching and routing), and an SGML fragment server. The delivery part of the system will probably have a composition and/or viewing system. The architecture of an ultimate SGML system is described in Figure 1.

Figure 1 - Ultimate SGML system architecture [Figure not available. Components shown: Authoring, Viewing, Composition, Access / Routing / Searching, SGML Fragment Server, Non-SGML Storage (e.g. Graphics)]

In this system, the information is organized in SGML elements. These elements are logically combined in multiple structures used for authoring, viewing and storing. Documents do not exist as such. They become a series of SGML pointers connected to other SGML elements in the repository. Every non-SGML element such as graphics, sound and video has a corresponding SGML element in the Fragment Server. Access information can be attached to any SGML element.

In this scenario, a user would query the Fragment Server for specific information. The information retrieved can be viewed or composed for printing in a format chosen by the user. Elements included in the documents retrieved can be gathered into a new document. The information can be copied or linked to its original source. Of course, copyright and access can be enforced. New information can be authored to complete the document. Finally, a query could provide a list of appropriate illustrations that can be linked into the new 'virtual' document. The user can print a copy by applying a personal format to the content. The new document is then checked into the repository and routed to field offices, where it will be automatically connected to an on-line help system.
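To make the idea of a document as a series of pointers more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the Element class, the identifiers, the restricted flag and the in-memory dictionary standing in for the repository do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: str
    kind: str              # what the element is, e.g. "procedure" or "graphic"
    content: str
    restricted: bool = False   # access information attached to the element


# Hypothetical repository: a dictionary standing in for the Fragment Server.
REPOSITORY = {
    element.element_id: element
    for element in [
        Element("proc-12", "procedure", "Drain the tank before removing the pump."),
        Element("warn-3", "warning", "Internal test limits: 40 bar.", restricted=True),
        Element("fig-7", "graphic", "pump-exploded-view.tif"),
    ]
}


def resolve(pointers, repository, privileged=False):
    """Follow each pointer and return the elements this user is allowed to see."""
    elements = (repository[pointer] for pointer in pointers)
    return [e for e in elements if privileged or not e.restricted]


# A 'virtual' document is nothing more than an ordered list of pointers.
field_manual = ["proc-12", "warn-3", "fig-7"]
print([e.element_id for e in resolve(field_manual, REPOSITORY)])
print([e.element_id for e in resolve(field_manual, REPOSITORY, privileged=True)])
```

Because access information is attached to each element rather than to the whole document, an unprivileged user still receives everything except the restricted element.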

The two key enabling factors are object-oriented technology and content identification. Object-oriented technology allows the management and assembly of information elements at an extremely small granularity. This gives you the greatest flexibility possible. The content mark-up allows the system to keep track of what the object is, not what it is used for. This means that all the information about each object is self-contained and accessible to the management system.

Common current system

A common current system will consist of one or more authoring tools also used for viewing and composition. The documents produced and accessed by this tool are stored in a repository managing access, routing and searching. The simplest implementation of this repository is the computer file system. The problem in transforming this kind of system into an ultimate SGML system is that its two key enabling factors are not present.

Most systems are document-based. This means that the repository system stores objects that are complete documents. Information is not shared between documents. For example, if an illustration is used in multiple documents, it is physically copied into each document. The searchable information about each document is very poor. Systems might store the document author, date of last revision and title. If a single paragraph of the document is classified, the document cannot be accessed by unprivileged users even though the rest of the document might be very useful to them. A document-based approach limits access control, searching and routing to entire documents.

Documents in the system are stored in the form in which they have been authored. The information they contain is locked in a specific look that can be modified only through a laborious manual process. Multi-author efforts cause consistency problems because not all authors adhere to the same formatting conventions for the parts of the document they are in charge of completing.

It is very important to notice that the key enabling factors are related to the documents, not to the system architecture or technology. The more you increase the object-orientation and content identification of your documents, the more benefits you will be able to get out of your system. Your current system is architected the way it is because the documents do not allow it to provide any other features. The ultimate system described in Figure 1 provides all those wonderful functions only because the information stored in the system allows it to do so.

Figure 2 - Transition to SGML system [Figure not available. Labels shown: Content Identification, Object-Oriented, Current System, Ultimate System]

Transitional system architecture

The key to starting your transition to SGML is to slightly improve your existing system so that it will allow you to start increasing your document value. You don't necessarily need to change the tools you are currently using, but you must start using them differently and with more rigor.

Increasing the content identification in your documents will mostly affect your authoring environment. Your system needs to be well organized by types of documents. There should be a limited number of types of documents you are producing. There should be clear rules (not guidelines!) about the structure and content of each of them. You should enforce the use of standard style sheets for each of the document types. The style sheets should be written with the idea of content identification in mind. For example, they should use styles like 'Chapter Title' instead of 'Bold Centered Title'. Most authoring tools have customizable interfaces today. Use this feature to prevent deviation from the style sheet by removing style creation and override from the standard interface. Table editing should be limited to the table manipulation tool included in most authoring tools. These tools capture the richest information about tables. You should also use any information sharing tool available. For example, many software packages allow you to include graphics by reference. Some even allow you to share standard 'boilerplate' text among documents. Because these mechanisms will be present in the ultimate system, it is good to start using them as early as possible. Finally, installing a batch structure checking process is also very valuable for enforcing the structure and content of the documents being created.
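A batch structure checking process can start out very small. The sketch below assumes that each document can be exported from the authoring tool as a list of (style name, text) pairs. That export format, the style names and the two rules are assumptions made for illustration, not features of any specific tool.

```python
# Hypothetical rule set for one document type.
ALLOWED_STYLES = {"Chapter Title", "Section Title", "Body Text", "Warning", "Part Number"}


def check_structure(paragraphs):
    """Return (paragraph number, problem) pairs for a list of (style, text) pairs."""
    problems = []
    # Rule 1: every document of this type must open with a chapter title.
    if not paragraphs or paragraphs[0][0] != "Chapter Title":
        problems.append((1, "document must start with a 'Chapter Title'"))
    for number, (style, text) in enumerate(paragraphs, start=1):
        # Rule 2: only styles from the standard style sheet are allowed.
        if style not in ALLOWED_STYLES:
            problems.append((number, f"unknown style '{style}'"))
        if not text.strip():
            problems.append((number, "empty paragraph"))
    return problems


document = [
    ("Chapter Title", "Pump Maintenance"),
    ("Bold Centered Title", "Removal"),   # a format-oriented style that slipped in
    ("Body Text", ""),
]
for number, problem in check_structure(document):
    print(f"paragraph {number}: {problem}")
```

Run nightly over the whole repository, a checker of this kind turns the rules into something that is actually enforced, and its reports show where existing documents need cleaning up.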

All these measures are intended to increase the consistency and content identification in your documents. These are absolutely indispensable when you convert these documents to SGML. The information you discover while implementing these measures should be recorded, as it will be extremely valuable when you start writing the Document Type Definitions (DTDs) necessary for the SGML system. In the meantime, you will benefit from more consistent documents. You will also gain the ability to implement programs capable of accessing the specific information identified by the styles used in the document. Finally, the format of the documents can easily be changed by applying different style sheet definitions. For example, a style sheet based on 12 point type can be applied for printed composition. When the document is used on-line, a 14 point type style can be applied.
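The two gains described above, reformatting by swapping style sheets and programs that read information identified by style, can be sketched as follows. The 12 and 14 point body sizes come from the example in the text. The title sizes, the dictionary layout and the function names are assumptions made for illustration.

```python
# Hypothetical style sheets: content-oriented style names mapped to formatting.
PRINT_SHEET = {
    "Chapter Title": {"point_size": 18, "bold": True},
    "Body Text": {"point_size": 12, "bold": False},
}
ONLINE_SHEET = {
    "Chapter Title": {"point_size": 20, "bold": True},
    "Body Text": {"point_size": 14, "bold": False},
}

DOCUMENT = [
    ("Chapter Title", "Pump Removal"),
    ("Body Text", "Drain the tank first."),
]


def apply_style_sheet(paragraphs, sheet):
    """Pair each paragraph's text with the formatting its style has in this sheet."""
    return [(text, sheet[style]) for style, text in paragraphs]


def extract(paragraphs, style):
    """Return the text of every paragraph tagged with the given style."""
    return [text for paragraph_style, text in paragraphs if paragraph_style == style]


print(apply_style_sheet(DOCUMENT, PRINT_SHEET))
print(apply_style_sheet(DOCUMENT, ONLINE_SHEET))
print(extract(DOCUMENT, "Chapter Title"))
```

The document itself never changes. Only the sheet applied to it does, which is why a style named 'Chapter Title' is worth more than one named 'Bold Centered Title'.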

You can also apply an object-oriented approach in steps. Although your system is organized in 'document' objects and the ultimate system is organized in tiny information elements, there is a very interesting middle ground that can be reached as a satisfactory first step. The idea is to organize your repository in 'information modules'. An information module is a set of information that makes sense on its own. A paragraph is too small to be an information module, as it does not make sense outside a larger context. Examples of information modules are specific procedures, equipment descriptions, etc. As shown in Figure 3, these information modules are authored individually and then 'assembled' for various uses.

Documents become virtual structures composed of pointers to the various information modules stored in the repository. Because the information modules are managed individually, they improve the access, routing and searching functions of the system. Some information modules will be used in multiple documents; others might be created for specific uses. For example, a cover page information module might be created for a printed document, while a graphical navigation aid information module might be created for an on-line document.
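The cover page and navigation aid example can be sketched as two virtual documents sharing the same modules. The module identifiers, the document names and the dictionaries standing in for the repository are hypothetical.

```python
# Hypothetical repository of information modules, keyed by module identifier.
MODULES = {
    "cover-page": "Pump Maintenance Manual",
    "nav-aid": "[graphical navigation aid]",
    "desc-pump": "Description: the pump assembly.",
    "proc-remove-pump": "Procedure: removing the pump.",
}

# Each virtual document is only an ordered list of pointers to modules.
VIRTUAL_DOCUMENTS = {
    "printed-manual": ["cover-page", "desc-pump", "proc-remove-pump"],
    "online-help": ["nav-aid", "desc-pump", "proc-remove-pump"],
}


def assemble(document_name):
    """Build a document by resolving its pointers against the repository."""
    return [MODULES[module_id] for module_id in VIRTUAL_DOCUMENTS[document_name]]


def where_used(module_id):
    """List the documents affected if this module is revised."""
    return sorted(
        name for name, pointers in VIRTUAL_DOCUMENTS.items() if module_id in pointers
    )


print(assemble("online-help"))
print(where_used("desc-pump"))
```

Revising the pump description once updates both documents, and a where-used query tells you which documents to route for review.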

By starting to break up some of your document types into information modules, you are preparing your system for a more object-oriented approach. When you convert the information modules into SGML, each of the elements will be accessible. The information modules will remain the most frequent 'anchor' points for searching and assembling documents in the ultimate system. In the meantime, this approach permits you to manage your information in smaller pieces than documents, allowing you to implement more efficient access and routing mechanisms.

Figure 3 - Transitional system architecture [Figure not available. Labels shown: Repository, Information Modules]

By combining the idea of 'assembling' documents out of information modules with the suggestions for content identification, you start to obtain value from the reusability, searchability and automatic management benefits of the SGML approach. It is important to remember that this is a transitional system, not a substitute for the ultimate SGML system. By eventually converting your documents and system to SGML (once they are ready!), you will achieve even bigger benefits!

Implementation plan

Although the implementation details will depend on each case, the following steps typically apply:

1. Install a 'real' document repository system. Controlling your documents is essential to the implementation of the transitional system. A document repository system will help you manage, classify and maintain all the documents you are required to control. Pick a system that is capable of managing both SGML and non-SGML objects, as you will want to be able to upgrade it when you are ready to manage SGML objects at the element level.

2. Catalog the documents in the repository to identify the various document types. Identify the minimum searchable information for each document and enter it in the document object description.

3. Implement document viewing. Now that the documents are stored in a secured repository, you will no longer be able to open them with an authoring tool just to view them. You need a more reliable and efficient way to access the information stored in the repository. A document viewer will also give greater flexibility in searching document content, as viewers usually provide hypertext navigation and complex word searches.

4. Increase your document value. This is where you start implementing content identification and object-oriented management. Depending on your environment, you can choose to apply them differently. Many companies prefer to do it one document type at a time. Others will implement information modules across all document types and apply content identification to specific information modules. Regardless of the implementation sequence, the following tasks are usually required:

   a. Build/modify style sheet. This combines a document design and document analysis effort. Content-oriented styles have to be built for all the information found in a particular document type. This can also reveal the various information modules contained in a document type. Style sheet usage and document structuring rules should be documented.

   b. Modify authoring tool interface. This helps enforce the style sheet by removing style overriding functions from the interface.

   c. Implement batch structure checking function. This function will be used to enforce the style sheet usage and document structuring rules. It will also be used to 'clean up' the existing documents by pointing out inconsistencies.

   d. Break documents into information modules. Recreate the original documents by building virtual documents from the information modules created.

5. Write Document Type Definitions (DTDs). For the document types that have reached a satisfactory level of consistency and content identification, it is time to convert them to SGML. A DTD is absolutely required to move to an SGML environment. This step could also be taken earlier if you have SGML delivery requirements or if you start authoring a new type of document directly in SGML.

6. Implement SGML authoring environment. For the document types for which you have developed DTDs, you need a reliable SGML authoring and revision environment.

7. Convert documents to SGML. For the document types that have a DTD and are ready to be moved to SGML, a simple conversion process can be implemented to convert from the proprietary coding system to SGML.
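Once styles identify content consistently, the conversion in the last step can be little more than a table lookup. The sketch below assumes the document is available as a list of (style name, text) pairs. The style-to-element mapping, the element names and the chapter.dtd file name are hypothetical, and a real conversion would also have to handle nesting, tables and graphics.

```python
# Hypothetical mapping from content-oriented style names to SGML element names.
STYLE_TO_ELEMENT = {
    "Chapter Title": "title",
    "Body Text": "para",
    "Part Number": "partno",
}


def escape(text):
    """Replace characters that would otherwise be read as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def to_sgml(paragraphs, doctype="chapter"):
    """Convert a list of (style, text) pairs into a flat SGML document instance."""
    lines = [f'<!DOCTYPE {doctype} SYSTEM "{doctype}.dtd">', f"<{doctype}>"]
    for style, text in paragraphs:
        if style not in STYLE_TO_ELEMENT:
            # An unmapped style means the document is not ready for conversion.
            raise ValueError(f"no SGML element defined for style '{style}'")
        element = STYLE_TO_ELEMENT[style]
        lines.append(f"<{element}>{escape(text)}</{element}>")
    lines.append(f"</{doctype}>")
    return "\n".join(lines)


print(to_sgml([("Chapter Title", "Pumps & Valves"), ("Body Text", "Drain the tank.")]))
```

The conversion fails loudly on any style that has no mapping. This is why the earlier steps matter: a document full of ad hoc styles cannot be converted by a simple process.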

Moral of the approach

This document describes a sensible migration to SGML. It shows a way to start implementing the SGML approach by working on your current documents, not on your current system. Once your document value is increased, the system can evolve painlessly to take advantage of the new information available. The benefits will increase along the way, and eventually you will be able to operate the ultimate SGML system to fully leverage your information investment.