SGML Document Management

[Mirrored from: http://www.passage.com/pubs/white/eliot/sgmldocm/sgmldocm.htm [June 24, 1997], and originally from "SGML Document Management" (W. Eliot Kimber)]

Author: W. Eliot Kimber

Copyright © 1995, Passage Systems, Inc.

If you are using Panorama or PanoramaPRO, you can view the SGML version of this paper.


As SGML moves into the mainstream as a preferred information representation method, enterprises are faced with the problem of managing their SGML data. The management of documents, especially large documents or documents critical to an enterprise's functioning, has always presented a significant challenge. Most document management schemes have either depended on a proprietary document format or treated documents as undifferentiated blobs.

Attempts to apply information management techniques borrowed from relational databases and program source code management have largely failed. Relational databases are inappropriate because they are intended for data that breaks down well into small, discrete units that organize into tables, which documents largely do not do. Program code management systems fail because documents are generally not record-oriented, complicating the problems of change tracking and management that largely depend on the record-oriented nature of most programming languages. In addition, the wide variety of proprietary and non-standard data representations for documents makes it difficult to design a general-purpose document management system. All of these problems are compounded by the difficulty of sharing documents among disparate operating systems with different ways of representing storage objects (files) and references to them.

SGML holds the potential to solve some of these problems. SGML has the following properties that make its management much more tractable:

These properties of SGML make it possible to create SGML-based document management systems that can manage any kind of data while presenting a simple and consistent view of the data to any application or processing system that wants to get at it.

This paper describes in general terms how these properties of SGML determine the design of generalized SGML document management systems. It goes on to show how generalized SGML document management systems can be used to manage any kind of data, including proprietary word processing and desktop publishing formats.

Documents and Storage Abstraction in SGML

Most SGML applications focus on SGML's power to describe the logical structure of documents. It is SGML's facilities for describing logical structure that enable high-function retrieval and presentation of complex documents. Online presentation tools like Panorama and DynaText and information retrieval tools like OpenText's PAT system derive their power in large part from the structured SGML documents they process. However, the logical structure of a document is only one view of the data.

Documents also have a storage structure, which determines how the document is broken into pieces for the purposes of storage and access management, separate from any logical structure the document may have. An SGML document may consist of a single storage object or thousands. An SGML document may include both SGML data and non-SGML data (such as graphics and multimedia objects). A fundamental principle of SGML is that the storage structure of a document need have no direct relationship to its logical structure.

Toward this end, SGML defines an abstract storage mechanism for SGML documents called entities.

Another central tenet of SGML is that the only form of an SGML document you can trust is the character data stream that is parsed by an SGML parser. This is because it is only the SGML data stream that can be parsed and validated according to the requirements of ISO 8879. All other representations of an SGML document are, by definition, non-standard, and therefore suspect. This paranoid view of SGML data means that a true SGML system must always make the SGML data stream available to processing applications, regardless of whatever other forms of the same document it may maintain for optimization purposes. It also means that an SGML storage system must be able to store SGML data without changing the character string that represents the document when it stores it. In other words, to be a true SGML storage system, it must be possible to store an SGML document in the system then do a byte-level compare of the stored version and the original with a difference of zero.
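The byte-level comparison described above can be sketched as a simple round-trip check. This is a minimal illustration, not a real storage system: ByteStore, NormalizingStore, and round_trips are hypothetical names introduced here, and the in-memory dictionary merely stands in for whatever storage a real system would use.

```python
class ByteStore:
    """Hypothetical store that keeps every object as the exact bytes it was given."""

    def __init__(self):
        self._objects = {}

    def put(self, name: str, data: bytes) -> None:
        self._objects[name] = bytes(data)

    def get(self, name: str) -> bytes:
        return self._objects[name]


class NormalizingStore(ByteStore):
    """Hypothetical store that 'helpfully' rewrites line endings on the way in."""

    def put(self, name: str, data: bytes) -> None:
        super().put(name, data.replace(b"\r\n", b"\n"))


def round_trips(store: ByteStore, name: str, original: bytes) -> bool:
    """Store the data, read it back, and report whether the difference is zero."""
    store.put(name, original)
    return store.get(name) == original


# An illustrative document entity with CR/LF record ends.
original = b'<!DOCTYPE book SYSTEM "book.dtd">\r\n<book>\r\n<title>Example</title>\r\n</book>\r\n'

faithful = round_trips(ByteStore(), "book.sgm", original)
altered = round_trips(NormalizingStore(), "book.sgm", original)
```

Under these assumptions the first store passes the check and the second does not, even though the second store's change looks harmless.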

SGML document management systems may offer storage options that require changing the original character string, but these must be options, not simply the cost of using the system.

These core principles of SGML suggest that any storage system that "decomposes" SGML for storage purposes into some form other than its tree of entities is either not a true SGML system or will be highly inefficient for certain tasks. The inefficiency arises because the overhead of maintaining complete knowledge of the original character stream alongside the decomposed form is probably greater than that of using some other indexing scheme on the character data itself. An example of the former is a system like BasisPlus, which essentially loads an SGML document into a hierarchical database of elements, losing all knowledge of the original entities. By contrast, the PAT retrieval system from OpenText simply reads an SGML document and indexes the boundaries of all relevant constructs, independent of how the data is actually stored. The PAT retrieval index does not interfere with storage access of the SGML document, yet retrieval of any specific portion of the SGML document is highly efficient. In the PAT case, because the SGML data has not been modified by PAT, any other SGML application can also access the document using normal SGML access means, whereas in the BasisPlus case, processing applications can only access the document through the BasisPlus server and will only see that part of the original SGML data stream that BasisPlus maintains.
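To make the idea of indexing boundaries without touching the data concrete, here is a small sketch. It is not PAT's actual algorithm, which this paper does not describe. It assumes fully tagged elements with simple names and ignores markup minimization, comments, marked sections, and everything else a real SGML parser handles. The function name index_elements is hypothetical.

```python
import re

# Matches simple start tags and end tags only (an assumption of this sketch).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")


def index_elements(text: str) -> dict:
    """Map element name -> list of (start, end) offsets; the text is never modified."""
    index = {}
    open_tags = []  # stack of (name, start offset) for elements not yet closed
    for match in TAG.finditer(text):
        is_end_tag, name = match.group(1), match.group(2).lower()
        if not is_end_tag:
            open_tags.append((name, match.start()))
            continue
        # Walk back to the nearest open element with this name and close it.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == name:
                start = open_tags[i][1]
                del open_tags[i:]
                index.setdefault(name, []).append((start, match.end()))
                break
    return index


doc = "<book><title>SGML</title><chapter><title>One</title></chapter></book>"
idx = index_elements(doc)

# Retrieval is a slice of the original, unmodified character data.
first_title = doc[slice(*idx["title"][0])]
```

Because the index holds only offsets, the document itself remains available to any other application through ordinary character access.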

Finally, SGML's paranoid view of SGML says that because SGML data is simply character data in the character set of the system it is stored on, you should always be able to access it with whatever the minimal character data access methods are on the system, e.g., whatever the system editor is for your operating system. This suggests that any SGML document management system that stores the SGML data somewhere other than the file system itself should provide some form of API or device driver that provides access to the SGML data as though it were on the file system. However, this is not a hard and fast requirement of the SGML standard.

SGML's Abstract Storage Objects: Entities

In SGML, the storage objects a document is organized into are called entities and an SGML document consists of one or more entities. Each entity represents a single storage object, typically (but not necessarily) represented by a file in the local file system. Entities are either SGML entities, which contain data that is to be directly processed by an SGML parser, or data entities, which contain data that is not to be processed by an SGML parser, but is still part of an SGML document. Typical data entities are graphics or other audio-visual objects included in documents. Entities form a tree, rooted at the highest-level entity (the document entity).

Because SGML is a system and application-independent data representation, it must provide a way to avoid tying entities to specific operating systems or data management systems or databases. SGML does this by making entities abstract storage objects that are mapped onto real storage objects by an "entity manager". This indirect method of defining the storage location of entities separates the storage view seen by processing applications from the details of how the data is actually stored, making it easier to move SGML documents from one storage system to another (such as from one operating system to another or from one document management system to another).

This indirect mapping of entities to real storage objects also makes it possible for different storage systems to use different storage and retrieval optimizations without the need to rework the abstract storage model. For example, an interactive document management system might use a storage optimization appropriate for high-volume changes, while a CD-ROM-based retrieval system might use a very different optimization scheme tuned for retrieving from CD-ROM. In both cases, the abstract storage structure is the same for a given document. Document users and processing systems are quite possibly unaware that different storage optimizations are being used.
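The separation between abstract entities and real storage objects might be sketched as follows. All class names here (FileStorage, MemoryStorage, EntityManager) are hypothetical, and the dictionary-backed store is only a stand-in for a database or other optimized storage system.

```python
import os
import tempfile


class FileStorage:
    """Real storage objects are files under a root directory."""

    def __init__(self, root: str):
        self.root = root

    def read(self, location: str) -> bytes:
        with open(os.path.join(self.root, location), "rb") as handle:
            return handle.read()


class MemoryStorage:
    """Real storage objects are dictionary values (stand-in for a database)."""

    def __init__(self, objects: dict):
        self.objects = dict(objects)

    def read(self, location: str) -> bytes:
        return self.objects[location]


class EntityManager:
    """Maps abstract entity names onto locations in one particular storage system."""

    def __init__(self, storage, locations: dict):
        self.storage = storage
        self.locations = locations  # entity name -> storage-specific location

    def open_entity(self, name: str) -> bytes:
        return self.storage.read(self.locations[name])


DATA = b"<chapter><title>One</title></chapter>\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "chap1.sgm"), "wb") as handle:
        handle.write(DATA)
    on_disk = EntityManager(FileStorage(root), {"Chapter1": "chap1.sgm"})
    in_database = EntityManager(MemoryStorage({"obj-17": DATA}), {"Chapter1": "obj-17"})
    # The caller asks for the same entity name in both cases.
    same = on_disk.open_entity("Chapter1") == in_database.open_entity("Chapter1")
```

The application code that asks for Chapter1 is identical in both cases; only the mapping held by the entity manager differs.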

Storage Object Names: Public and System Identifiers

To further separate SGML entities from specific storage systems, SGML provides a system-independent naming system for entities called public identifiers. Public identifiers are intended to be the "public" name of entities. On a given system, public identifiers are mapped to system-specific locations. System-specific locations could be filenames in a file system or object IDs in some document management system. A single entity could be represented by multiple storage objects or it could be a portion of a larger storage object. The mapping of entities, which are an abstract storage representation, to actual storage is entirely determined by the system that manages the storage of the data.

Within an SGML document, entities are declared at the beginning of the document (so that processing systems need not process the entire document in order to know what storage objects it is composed of). An entity declaration associates a local, symbolic name (the entity name) with either a system-specific location, a public identifier, or both. A typical entity declaration looks like this:

<!ENTITY Chapter1 SYSTEM "chap1.sgm" >

The parts of the declaration are:

<!ENTITY: An SGML-defined keyword indicating that this is an entity declaration.
Chapter1: The entity name. The entity name is the name used within the document to refer to the entity. It is completely arbitrary and has no meaning outside the scope of the document in which the entity is declared.
SYSTEM: An SGML-defined keyword indicating that what follows is a system-specific storage location, in this case, a filename.
"chap1.sgm": The system-specific storage location for this entity (the system identifier). In most processing systems, this value is a filename, but the SGML standard imposes no restrictions on what the system identifier can refer to. For example, it could refer to a member of a Bento container, an object ID in an object-oriented database, a URL for a network-accessible object, or something else entirely.
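The parts of a declaration listed above can be picked apart mechanically. The following sketch assumes a deliberately narrow subset of the declaration syntax: a general entity declared with SYSTEM or PUBLIC and quoted literals, with no NDATA, parameter entities, or comments. The names EntityDecl and parse_entity_decl are hypothetical, and the public identifier used in the example is invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class EntityDecl(NamedTuple):
    name: str
    public_id: Optional[str]
    system_id: Optional[str]


# Narrow pattern for this sketch only; a real SGML parser accepts far more.
DECL = re.compile(
    r"""<!ENTITY\s+(?P<name>[A-Za-z][A-Za-z0-9.-]*)\s+
        (?:SYSTEM\s+(?P<q1>["'])(?P<sys1>.*?)(?P=q1)
          |PUBLIC\s+(?P<q2>["'])(?P<pub>.*?)(?P=q2)
             (?:\s+(?P<q3>["'])(?P<sys2>.*?)(?P=q3))?)
        \s*>""",
    re.VERBOSE | re.DOTALL,
)


def parse_entity_decl(text: str) -> EntityDecl:
    match = DECL.search(text)
    if match is None:
        raise ValueError("not an entity declaration this sketch understands")
    system_id = match.group("sys1")
    if system_id is None:
        system_id = match.group("sys2")  # optional system identifier after PUBLIC
    return EntityDecl(match.group("name"), match.group("pub"), system_id)


by_system = parse_entity_decl('<!ENTITY Chapter1 SYSTEM "chap1.sgm" >')

# Hypothetical public identifier, not taken from the paper.
by_public = parse_entity_decl(
    '<!ENTITY Chapter1 PUBLIC "-//Example Corp//TEXT Chapter One//EN" >'
)
```

In both cases the entity name is the only part with meaning inside the document; the identifiers are handed to the entity manager to interpret.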

An SGML entity can also be declared with a public identifier in place of a system identifier, like so:

<!ENTITY Chapter1 PUBLIC "... Chapter One//EN" >

Here the system identifier has been replaced by a public identifier, as indicated by the PUBLIC keyword. The quoted string following it is the public identifier (in this case a formal public identifier). When an entity is declared in this way, it is up to the processing system to map the public identifier to a system identifier. The SGML Open consortium has defined a simple public identifier mapping scheme that a number of tools have implemented.

Through public identifier maps, documents can be moved from one system to another without the need to change the declarations of their component entities. Different systems or applications may have different mappings for the same public identifier.
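A public identifier map can be as simple as a lookup table loaded from a text file. The sketch below is loosely modeled on the PUBLIC entries of the SGML Open catalog format, but it handles only that one entry type, only with both strings quoted, and should not be read as an implementation of that specification. The function names, the public identifier, and both system identifiers are invented for illustration.

```python
import re

# One entry per match: PUBLIC "public identifier" "system identifier"
ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def normalize(public_id: str) -> str:
    """Collapse runs of white space so equivalent identifiers compare equal."""
    return " ".join(public_id.split())


def load_catalog(text: str) -> dict:
    """Return {public identifier: system identifier} for the PUBLIC entries found."""
    return {normalize(pub): sysid for pub, sysid in ENTRY.findall(text)}


def resolve(public_id: str, catalog: dict) -> str:
    try:
        return catalog[normalize(public_id)]
    except KeyError:
        raise LookupError(f"no mapping for {public_id!r}") from None


# Two hypothetical systems that map the same public identifier differently.
FILE_SYSTEM_CATALOG = """
PUBLIC "-//Example Corp//TEXT Chapter One//EN" "/docs/book/chap1.sgm"
"""
REPOSITORY_CATALOG = """
PUBLIC "-//Example Corp//TEXT Chapter One//EN" "repository-object-48213"
"""

pubid = "-//Example Corp//TEXT Chapter One//EN"
on_files = resolve(pubid, load_catalog(FILE_SYSTEM_CATALOG))
in_repository = resolve(pubid, load_catalog(REPOSITORY_CATALOG))
```

The document's declarations stay the same on both systems; only the catalog changes.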

The use of public identifiers is crucial to the task of managing SGML documents because it provides a way to refer to entities without needing to know their system-specific locations. This means that different document management systems can manage the same SGML entities simply by maintaining their own mapping of public identifiers to system identifiers.

Public identifiers also have the property that they can be made universally unique through the use of formal public identifiers. Formal public identifiers are public identifiers that conform to the syntax of another ISO standard, ISO 9070, which defines the rules for forming public identifiers. Formal public identifiers consist of two main fields: the owner name and the object descriptor. When the owner name is a registered owner name, it is guaranteed unique within the scope of all possible formal public identifiers. Similar schemes can be used within the scope of a given owner name to ensure the uniqueness of names within that scope (the presumption being that name owners are capable of controlling the definition of their own object descriptors). For example, an enterprise might use its existing document control numbers as part of the public identifiers for the documents it produces.
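The two main fields of a formal public identifier can be separated with a few lines of code. This sketch assumes the common written form in which a "+//" prefix marks a registered owner, a "-//" prefix marks an unregistered owner, and "//" separates the owner name from the object descriptor; it does not validate the descriptor itself. The names FormalPublicId, split_fpi, and make_fpi, the owner "Example Corp", and the document control number are all invented for illustration.

```python
from typing import NamedTuple


class FormalPublicId(NamedTuple):
    owner_kind: str   # "registered", "unregistered", or "other"
    owner: str
    descriptor: str   # everything after the owner name, left unparsed here


def split_fpi(fpi: str) -> FormalPublicId:
    text = " ".join(fpi.split())
    if text.startswith("+//"):
        kind, rest = "registered", text[3:]
    elif text.startswith("-//"):
        kind, rest = "unregistered", text[3:]
    else:
        # For example, identifiers whose owner is written without a prefix.
        kind, rest = "other", text
    owner, separator, descriptor = rest.partition("//")
    if not separator or not owner or not descriptor:
        raise ValueError(f"not a formal public identifier: {fpi!r}")
    return FormalPublicId(kind, owner, descriptor)


def make_fpi(owner: str, text_class: str, description: str, language: str = "EN") -> str:
    """Build an unregistered-owner identifier from an enterprise's own naming scheme."""
    return f"-//{owner}//{text_class} {description}//{language}"


parts = split_fpi("-//Example Corp//TEXT Chapter One//EN")

# Reusing a hypothetical document control number inside the object descriptor.
controlled = make_fpi("Example Corp", "TEXT", "DCN-1995-0042 Chapter One")
```

Uniqueness within the owner's scope then rests on the owner's existing numbering discipline rather than on anything the document management system invents.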

Because formal public identifiers can be made reliably unique, document management systems can use public identifiers as the unique names of the objects they manage, rather than depending on things like object identifiers or other implementation-dependent naming schemes. At a minimum, an entity's public identifier can be used as a synonym for some system-specific unique name, removing the need to expose that name to users or systems that interact with the document management system.

If a document management system can access objects based on their public identifiers, then it can provide a simple and natural access API to any other SGML application by discussing objects in terms of their public identifiers, at least initially. For example, when a system first requests an object from the document management system, it could do so by requesting an entity with a particular public identifier. The document management system finds the entity and either returns it directly or returns a pointer to it, which will be some system and implementation-specific location, such as an object identifier, a memory location, or the location of a file on the local file system.

The Functions of an SGML Document Management System

The functions expected from a document management system vary from group to group, but in general, the following functions are expected:

Because SGML documents are trees of entities, SGML document management systems often provide functions for working with the entire tree of entities as a unit or working with any component entity individually.

You can think of an SGML document as layers of abstractions, each built on the one below. At the lowest level is the SGML data stream itself. The SGML data stream is just a stream of characters that are interpreted by an SGML parser into the next two layers of abstraction: the entity layer, and above that, the element layer. Above the element layer there may be another layer representing hyperlink relationships beyond the hierarchy of the SGML elements.

Each of these layers of abstraction naturally translates into a layer in a document management system, as shown in Figure 1. The data stream layer is managed by the actual storage system. It is organized into one or more real storage objects.

Figure 1. The layers of an SGML Document Management System

The entity layer is managed by the entity manager. Starting with the SGML document entity, the entity manager passes the SGML data stream to the parser, which interprets it and recognizes entity declarations and references. The entity manager then communicates with the storage system to resolve the entity references into real storage objects, getting their data and providing it to the parser. The parser communicates entity declarations and references to the entity manager as it finds them.
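The conversation between parser and entity manager can be illustrated with a toy model. This is a sketch under heavy assumptions: the "parser" recognizes only SYSTEM entity declarations with double-quoted literals and general entity references of the form &name;, and the storage system is a plain dictionary. The names EntityManager and parse are hypothetical.

```python
import re

ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][A-Za-z0-9.-]*)\s+SYSTEM\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


class EntityManager:
    def __init__(self, storage: dict):
        self.storage = storage  # system identifier -> text (stand-in for real storage)
        self.declared = {}      # entity name -> system identifier

    def declare(self, name: str, system_id: str) -> None:
        # Keep the first declaration of a name; later ones are ignored.
        self.declared.setdefault(name, system_id)

    def fetch(self, name: str) -> str:
        return self.storage[self.declared[name]]


def parse(text: str, manager: EntityManager, _active: tuple = ()) -> str:
    """Report declarations to the manager, then replace references with entity data."""
    for name, system_id in ENTITY_DECL.findall(text):
        manager.declare(name, system_id)

    def expand(match):
        name = match.group(1)
        if name in _active:
            raise ValueError(f"entity {name!r} refers to itself")
        return parse(manager.fetch(name), manager, _active + (name,))

    return ENTITY_REF.sub(expand, text)


storage = {
    "book.sgm": (
        "<!DOCTYPE book [\n"
        '<!ENTITY Chapter1 SYSTEM "chap1.sgm">\n'
        '<!ENTITY Chapter2 SYSTEM "chap2.sgm">\n'
        "]>\n"
        "<book>&Chapter1;&Chapter2;</book>"
    ),
    "chap1.sgm": "<chapter><title>One</title></chapter>",
    "chap2.sgm": "<chapter><title>Two</title></chapter>",
}

manager = EntityManager(storage)
stream = parse(storage["book.sgm"], manager)
```

The toy parser never touches the storage dictionary directly; everything it reads beyond the document entity arrives through the entity manager.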

The parser also recognizes the SGML element markup, communicating the element and data constructs to whatever processing application is communicating with it (such as an editor, browser, or retrieval system). The processing application then interprets the elements and their data to build up the relationship layer, which it then uses for its own purposes.

Note that the parser and entity manager can be almost completely independent. The entity manager has to provide a way for the parser to communicate entity declarations and references to it and the parser has to accept the SGML data stream provided by the entity manager. In addition, processing applications may need to communicate with the entity manager to process data entities, which are not parsed by the SGML parser but are left for the processing application to handle.

These layers provide natural places for management APIs. The API between any two layers is relatively simple, but taken in total, the layers create a well generalized, robust, and flexible system. For the purposes of document management, most of the intelligence is in the entity manager layer, as it is the entity manager that has to manage all the data access functions, including mapping public identifiers to storage objects, implementing access control, and the like.

SGML Document Management APIs

The boundaries between the layers shown in Figure 1 provide natural points for public APIs.

The entity manager communicates with whatever the storage system is, whether it's the file system provided by the operating system or some sort of database, such as an object oriented database. This API will be determined by the storage system itself.

The SGML parser communicates with the entity manager by passing it entity declarations and references. The entity manager returns either the data stream of characters in the SGML entities or a system-specific pointer to non-SGML data entities. There may be other communications between the parser and entity manager, for example to control the flow of data to the parser. Regardless of how this communication is defined and implemented, the parser must get a character data stream from the entity manager and the entity manager must provide a way to map entity references and declarations to system objects.

Processing applications communicate with both the SGML parser and the entity manager. From the parser, applications get a representation of the elements, data content, and other SGML-defined constructs in the SGML document. There is no standard for this representation (nor should there be) as different forms of communication will be appropriate for different applications and environments. In addition, the parser and application may be tightly bound, as in the case of most SGML editors and browsers, or completely separate, as in the case of the SGMLS SGML parser, which simply emits a new file in a format that is easy to process with the Perl programming language.

Processing applications will also need to communicate with the entity manager to access data entities. At a minimum, the processing application should be able to pass a local entity name or public identifier to the entity manager and get back either the data in the entity or a system-specific pointer to the object by which the application can then access the data.

Note that the only API that is at all standardized is the one between the entity manager and the SGML parser and processing application. The communication to the entity manager is SGML entity declarations, public identifiers, or entity names (within the context of the document being processed). The communication from the entity manager to the parser is the SGML data stream, the details of which are defined by the SGML standard. The communication to the processing application is a system-specific pointer to the entity data, which either the entity manager or the underlying storage system can resolve (e.g., filenames, database object IDs, shared memory locations, etc.).

This singular level of standardization emphasizes that an "SGML document management system" is first and foremost an SGML entity manager. Given a generalized entity management system, a wide variety of SGML parsers, processing applications, retrieval systems, and access user interfaces can be built on top of the entity management layer without worrying too much about the entity manager changing.

This layering also emphasizes that other features of document management systems, such as search and retrieval, workflow management, and so on, are all really SGML applications built on top of a generalized entity management layer. If these parts of a total document management system are designed as SGML applications that communicate with a generalized entity manager, then they can be implemented on top of any entity manager with which they can communicate using the more or less standard entity access API.

Managing Non-SGML Documents

Given a generalized SGML entity manager, it is possible to manage non-SGML documents (such as word processor and desktop publisher documents) with the same system, and with the same level of functionality, by using SGML to model the structure of the non-SGML documents. These SGML documents used to model the structure of non-SGML documents are sometimes called shell documents.

For each different document format, you define a simple SGML document type that reflects the possible storage organizations of the document. For example, Frame documents can be single, standalone files or books, which then contain one or more chapter files. Files can then contain graphics by reference. The SGML document contains elements that represent these different storage constructs (book, chapter, graphic) and then refer to the actual Frame files as data entities. The non-SGML documents are declared as data entities within the shell document.

The entity declarations provide all the information the SGML entity manager needs to have in order to manage the non-SGML documents. A simple SGML processing application (possibly part of the document management user interface) then resolves references to the data entities just as it would for any other document type (e.g., to resolve references to graphics or audio-visual objects). Just as you can launch a browser or editor for a graphic, you can launch an editor for a Frame or WordPerfect document. And because data entities must have a declared data content notation, the entity manager always knows what the data type of the object is. Support for new data types merely requires adding a new notation-to-processor mapping--the rest of the functionality is inherent in the entity manager itself.
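The notation-to-processor mapping described above amounts to a small dispatch table. In this sketch the handlers only build the command line they would run rather than launching anything. The names DataEntity and NotationRegistry, the notation names, the file names, and the commands "frame-editor" and "image-viewer" are all hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class DataEntity(NamedTuple):
    name: str        # entity name from the shell document
    system_id: str   # where the non-SGML data lives
    notation: str    # declared data content notation


class NotationRegistry:
    """Maps a notation name to the processor that knows how to handle it."""

    def __init__(self):
        self._handlers: Dict[str, Callable[[DataEntity], List[str]]] = {}

    def register(self, notation: str, handler: Callable[[DataEntity], List[str]]) -> None:
        self._handlers[notation.lower()] = handler

    def open(self, entity: DataEntity) -> List[str]:
        try:
            handler = self._handlers[entity.notation.lower()]
        except KeyError:
            raise LookupError(
                f"no processor registered for notation {entity.notation!r}"
            ) from None
        return handler(entity)


registry = NotationRegistry()
# Supporting a new data type is one more register() call.
registry.register("frame", lambda entity: ["frame-editor", entity.system_id])
registry.register("cgm", lambda entity: ["image-viewer", entity.system_id])

chapter = DataEntity("Chapter1", "chap1.doc", "FRAME")
command = registry.open(chapter)
```

Nothing else in the entity manager changes when a notation is added, which is the point the paragraph above makes.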


In short, SGML document management is primarily SGML entity management. Through the various data abstractions defined by the SGML standard, flexible, modular, and robust document management systems can be built that work with a wide variety of systems and tools, fully support the interchange of documents among different systems, and provide the appropriate balance of standardization and optimization to let system designers distinguish their products by providing different sets of features and optimizations for different specific tasks, environments, and user types. The only requirements are that the management system preserve access to the SGML data stream as a data stream, not deprive the owners of SGML documents of the right to determine what their data looks like, and ensure that any SGML processing system can access the data using the simplest possible API, ideally through standard character file I/O.

In addition to providing a robust storage management system, true SGML document management systems also provide the potential for building very powerful information management applications by making the powerful abstractions of SGML available in a general processing environment. For example, given a generalized SGML entity management system, a variety of powerful retrieval systems, such as DynaText and PAT, can be applied to the data with a minimum of extra work. Once this level of retrieval is available, applications that depend on efficient retrieval of SGML documents, such as high-function composition systems and sophisticated hypermedia authoring systems, can be built quickly and easily on top of the existing layers and APIs.