[Mirrored from: http://www.3-cities.com/~conrad/delivery.htm]

Tools for Implementing SGML-Based Information Systems: Viewers and Browsers, Text Retrieval Engines, and CD-ROMs

By Kurt Conrad, The Sagebrush Group

Abstract

This paper/presentation is intended to be a general introduction to the issues and concepts involved in the selection of software tools for the electronic delivery and retrieval of SGML (Standard Generalized Markup Language) documents. In addition, some of the issues unique to CD-ROM publishing will be explored.

Abstract
Introduction
Basic Concepts
Viewers and Browsers
Text Retrieval Engines
CD-ROM Publishing
Conclusion
Stemma

Introduction

The electronic document delivery market is one of the most complex and volatile ones in the entire computing industry. An incredibly vast array of products and fundamentally different technology approaches dominate this industry segment. As computers are beginning to be viewed as tools for producing more than just paper documents, software vendors are almost tripping over themselves to integrate electronic delivery capabilities into their product offerings. For some, SGML is just one of the available options. For others, SGML is a fundamental enabling technology.

This paper provides those who are new to SGML with an introduction to some of the core concepts that can influence the selection of electronic delivery and retrieval software. It describes the capabilities of a wide range of tools by focusing more on product categories than specific vendors. This is because 1) the market place is changing rapidly, 2) almost any viewer or browser can be used to deliver SGML documents, and 3) time and space do not allow for a complete survey of specific products.

This paper describes many of the things that I keep in mind when I'm looking at SGML browsing technologies. It starts by introducing a number of basic concepts that relate to electronic delivery and describing some of the major implementation choices. This is followed by a series of sections that describe categories of tools: a schema for classifying viewers and browsers, an overview of some of the query methods which are used in text retrieval systems, and an examination of a few of the issues and trends which are unique to CD-ROM publishing.

Basic Concepts

A number of economic, organizational, and technical factors influence the design choices and investment decisions that are necessary to implement SGML. Many of these issues have direct bearing on which electronic delivery technologies should be deployed and the returns on investment which can be expected. Some of the more important issues include: the ways that SGML changes cost-benefit profiles within the document lifecycle; how divergent stakeholder interests and metadata requirements can impact electronic delivery; and how technology trends are blurring the distinction between information producers and consumers.

Changes to Cost-Benefit Profiles

There are many different ways to view a document lifecycle. When dealing with electronic delivery technologies, I prefer this one:

Research	The acquisition of information, including the interpretation of information contained within documents
Authoring	The creation of new documents
Editing	The revision of documents to make them conform to various structural and content standards
Formatting	The revision of documents to make them conform to various appearance or encoding standards
Publishing	The transformation of documents to a specific published form (e.g., paper or CD-ROM)
Delivery	The distribution of documents
Storage	The holding of documents
Retrieval	The locating and accessing of documents
Viewing	The reading of documents

Unlike other possible views of the document lifecycle, this one helps to differentiate the steps that involve the mechanical processing of data from those that focus on the way that humans interact with the information contained within documents. This distinction is important because of the fundamental economics of the transition to an SGML-based document management process: the use of SGML generally shifts the cost burden upstream and shifts the realization of benefits downstream.

Up front costs are increased in a variety of ways. Document analysis, Document Type Definition (DTD) development, new tools and training requirements, and conversions of legacy data are significant expenses. The imposition of new quality control requirements often increases costs during the authoring and editing phases. If authors and editors continue to work with unstructured tools, additional conversion costs are added during the formatting phase of the lifecycle.

In return, SGML gives individuals and organizations better ways to publish, deliver, store, retrieve, view, and interact with their documents. Some of these potential benefits are concerned primarily with mechanical efficiency, others with human interaction and performance. The choices that an organization or project team makes when balancing these potentially competing measures of value have tremendous impact on how (and even whether) potential and intended benefits are fully realized.

Stakeholder Interests and Metadata Requirements

Metadata (data about data) is at the core of these choices. Information, by itself, is not terribly valuable anymore. There is simply too much of it. Metadata, by contrast, is increasing in importance because it provides the "handles" needed by computers to determine how to process the data and the "hooks" needed by humans to help identify which pieces of information are relevant to their interests.

What is metadata? The SGML tags within a document instance are metadata. They describe the role of each element within the context of the document's structure. Attributes are metadata, as they further describe important characteristics of the data within the SGML instance. Titles, authors, publication dates, and index numbers are metadata, as are annotations, bookmarks, and other navigational aids.

TV Guide is one of the best examples of metadata and its increasing importance. With the exception of the horoscopes and advertisements, TV Guide is almost entirely metadata, and not so long ago, Wired magazine reported that TV Guide makes more money than the four major networks combined.

When SGML is used to develop a vendor and processing-neutral markup language, the resulting DTD is a formalized framework for capturing and storing metadata. As such, this metadata framework represents a negotiated balance between the divergent stakeholder interests that exist at different points in the information lifecycle.

It is not uncommon, for example, for authors and editors to desire a simple markup language that is easy to use. In some cases, however, the interests of authors and editors may diverge, with authors desiring greater flexibility and editors wanting a greater emphasis on rigorous structures and automated validation.

A similar divergence can exist among those stakeholders who are primarily concerned with the mechanical efficiency of the document lifecycle (the formatting, publishing, delivery, and storage phases) and those stakeholders whose interests are targeted at the retrieval, viewing, and research phases.

Stakeholders, such as publishers, that are mostly concerned with mechanical efficiency will usually express their interests in terms of cost savings and by using phrases like "create once, publish many." They will often focus on the structural aspects of the document, as these will usually support the range of publishing variations that are expected. In addition, structure-oriented DTDs will usually support a wide variety of document instances and require little maintenance (thus yielding additional cost savings).

Information consumers, on the other hand, usually desire richer, more complex sets of metadata. Instead of being satisfied with a DTD that reflects the generic structures of the document (e.g., chapter and title), they prefer tags that capture the meaning of the data (e.g., purpose, scope, rationale, part number, voltage, person, software package, company).

Rich metadata allows documents to better function as databases and can have important benefits when using retrieval tools that support context-sensitive searches. By making retrieval easier and more cost effective, this human-centered approach to SGML can enhance the way that people interact with documents to enrich collaboration, learning, decision making, innovation, and the acquisition and development of knowledge.

These benefits cannot be measured easily in strictly financial terms, and since such an approach is usually more expensive than the development and use of structure-oriented DTDs, many organizations find it difficult to justify the additional cost. At the same time, these organic measures of value can be central to the SGML implementation effort and a major source of strategic value. As the information density of business transactions continues to increase, organizations that deliver richer, more useful information products to their customers are likely to realize competitive advantages.

Blurring the Distinction Between Producers and Consumers

In traditional paper-based publishing, the various steps in the document lifecycle were finite and discrete, and each phase produced a paper artifact which required human involvement. These fundamental dynamics are so firmly entrenched that even where computers are used, human labor is often required to integrate and interpret individual pieces of information throughout the document lifecycle. Although vast amounts of paper have been replaced with electronic deliverables, different proprietary encodings often act as barriers to exchange and reuse.

SGML-based document management approaches, on the other hand, have been proven to reduce the need for humans to perform mechanical transformation of data and allow them to focus on more creative, knowledge-intensive activities. Because of this, traditional divisions of labor are being de-emphasized and the distinction between information producers and consumers is being blurred.

The very term browsing is an example of this trend. As will be seen in the next section, high-end SGML browsers integrate a wide variety of viewing, retrieval, navigation, and data collection tools. These help to close the gap between viewing and authoring and make the document lifecycle truly a cycle. While none of the tools have the same powerful authoring capabilities found in dedicated editors, the functional trends are fairly clear.

Viewers and Browsers

A wide variety of tools can be used for displaying SGML data. Generally, they fall into three classes: Readers, Viewers, and Browsers. Readers are used to display the contents of files without any interpretation or rendering. Viewers add interpretation and rendering capabilities but base most of their rendering on formatting codes (metadata) which were designed to support the printing of paper hard copy. Browsers abandon the page metaphor to provide an electronic delivery environment that is more in tune with the capabilities and constraints of computer displays. In addition, they are generally more powerful and better able to exploit the information content of an SGML-encoded document to offer improved navigation and retrieval.

This paper uses the following schema to distinguish individual categories of Readers, Viewers, and Browsers:

Text Readers
Native File Viewers
Raster Viewers
Page Viewers
Binary Browsers
Fixed DTD Browsers
Arbitrary DTD Browsers

These categories of tools are differentiated primarily by the way the information is encoded for delivery. This delivery encoding is closely related to the richness of the metadata that the software can make use of, and this relationship has important implications for the document lifecycle. It is not uncommon for SGML DTDs to be designed around the strengths or weaknesses of a particular viewer or browser. Because of this, the metadata that the delivery tool supports can limit not only the options for user interaction and the potential returns on investment, but even the long-term value of SGML documents.

Please note, where specific products are mentioned, they are only used as examples and do not comprise an exhaustive listing of the products in each category. In addition, most vendors are aggressively improving their products and are working to incorporate more robust support for SGML. It is possible that I may have missed an important product announcement or even that some of the vendors mentioned in this document may announce new product offerings at SGML'95 and change their placement in this schema.

Text Readers

Text Readers simply display the contents of the file. They give you a WYSIWOD (What You See Is What's On Disk) view of your data. If the file only contains text, it usually looks pretty good. If it contains binary data, the non-ASCII characters are displayed in place and the file can be hard to read. In the vast majority of cases, when SGML data is displayed using a text reader, the tags are displayed as ASCII character streams. Most people don't like Text Readers because of their inability to provide a richly formatted visual representation of the document.

Although I don't know of many SGML implementations that use Text Readers as a primary delivery tool, it is not out of the question. A fairly simple filter could be used to convert SGML into an untagged ASCII representation, using carriage returns, line feeds, spaces, tabs, and perhaps even punctuation for visual formatting. You would probably end up with something that looked an awful lot like UNIX man pages: text files that contain fairly consistent formatting and an implicit structure. Vernon Buerg's List program is my favorite tool in this product category.
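The kind of filter described above might look something like the following minimal Python sketch. It is only an illustration under stated assumptions: the element names (title, para, item) are hypothetical, every tag is assumed to be written out in full, and entities, comments, marked sections, and tag minimization are ignored.

```python
import re
import textwrap

# Hypothetical element names that should start a new block of text.
BLOCK_ELEMENTS = {"title", "para", "item"}

# Matches simple start and end tags; attribute values containing ">" are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def sgml_to_text(sgml, width=72):
    """Strip tags and use blank lines and line wrapping for visual formatting."""
    blocks = []
    current = []
    pos = 0
    for match in TAG.finditer(sgml):
        current.append(sgml[pos:match.start()])
        pos = match.end()
        name = match.group(2).lower()
        if name in BLOCK_ELEMENTS:
            # A block boundary (start or end tag): flush the accumulated text.
            text = " ".join("".join(current).split())
            if text:
                if name == "title" and match.group(1) == "/":
                    text = text.upper()  # titles become upper-case headings
                blocks.append(textwrap.fill(text, width))
            current = []
    current.append(sgml[pos:])
    tail = " ".join("".join(current).split())
    if tail:
        blocks.append(textwrap.fill(tail, width))
    return "\n\n".join(blocks)


sample = ("<doc><title>Name</title>"
          "<para>list - display <emph>files</emph> on screen</para></doc>")
print(sgml_to_text(sample))
```

The output keeps an implicit structure (an upper-case heading followed by wrapped paragraphs), but all of the explicit markup is gone, which is the trade-off this category of tool implies.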

Native File Viewers

This class of viewing software is used to view word processing and desktop publishing files in their native format. cc:Mail, for example, uses Native File Viewers (Outside In and Quickview Plus) to display attached files. Microsoft Windows 95 includes a utility called Quickview, which can also be used to view a variety of native file formats. In some cases, Native File Viewers do not exist as separate products but are only available as functions within other software products. A few companies, like Mastersoft, do sell Native File Viewers in both the OEM and consumer markets.

The ability to view native word processing and desktop publishing files means that there is virtually no publishing process, to speak of. This is another class of viewer that isn't used very often to deliver SGML data, but if it were, the publishing process would involve conversion of the SGML data into some proprietary editing format.

Generally, the quality of the rendering is fairly limited. In some cases, interpretation of the proprietary formatting codes is imperfect and doesn't match the formatting of the native editing environment. In addition, support for embedded graphics tends to be a problem. For most implementations, these aren't major concerns, as the primary goal is to provide low-cost access to legacy documents. For example, many OpenText text retrieval implementations use the Mastersoft Word for Word software to support indexing and viewing of a wide variety of native file formats.

Raster Viewers

Raster Viewers are designed to display bitmapped images (usually TIFF and CCITT Group 4). This allows them to provide a good representation of the printed page, preserving its layout, typography, illustrations, and other visual elements. It is not unusual to find Native File Viewers and Raster Viewers combined into the same product (e.g., AutoVue Professional).

Production costs are fairly limited, usually involving the simple scanning of paper documents. Raster Viewers are very popular in the insurance industry, where imaging systems provide a fairly low-cost alternative to the routing of paper. Raster Viewers are also often used in conjunction with more robust SGML delivery tools to render and display the graphics images which may be referenced in an SGML document instance.

Raster Viewers aren't a very attractive alternative for displaying textual data, however. To a computer, paper is dead and scanned pages aren't a whole lot better. Because raster images are just a collection of dots, they are not very useful for searching or retrieving. As a result, some hybrid systems use a combined image/text approach, where OCR is used to convert the scanned images to text files, but the record copy of the data is still the bitmap.

Not many implementations use Raster Viewers to deliver SGML data and those that do are probably based on the scanning of paper documents. Raster Viewers can provide some degree of interactivity, allowing highlighting and annotations to the page images. Normally, these notations are stored in a separate file and displayed as overlays at the time of presentation.

Page Viewers

Adobe Acrobat, WordPerfect Envoy, and No Hands Common Ground are examples of the products that fit into this category. All of these products use proprietary file formats that store page images. In most cases, these files are produced, not by scanning paper documents, but by printing through a special filter or series of filters. This makes the publishing process fairly easy. It also provides a better visual rendering than is normally seen with Native File Viewers because the native application's print engine is involved in the rendering.

Page Viewers have many important advantages over the viewing of raster images. The biggest single advantage of Page Viewers is that they capture textual data in a searchable form (not just as a series of dots). Because the file formats are proprietary, however, the range of tools that can be used for search and retrieval can be rather limited and, in some cases, constrained to the vendor's own product offerings. Another advantage over raster images is that Page Viewers usually have better support for color.

Besides the ability to support highlighting and annotations, some Page Viewers also provide mechanisms for embedding hyperlinks in the deliverables. These are normally used for such things as linking Table of Contents and list of figures entries to their locations in the document and linking key terms to their glossary entries.

Because Page Viewers use proprietary file formats, however, attention should be paid to where labor costs are incurred during the production process. Labor which is expended after conversion (i.e., printing) to the vendor's proprietary file format is usually lost when publishing the next version of the document. This is especially true when dealing with hyperlinks. In most cases, hyperlinks are inserted manually using the vendor's publishing tools and must be re-entered when a revised document is imported.

Page Viewers are the first category of software in this schema that are given serious consideration as vehicles for delivering SGML data and as serious alternatives to SGML browsers. Many individuals consider Page Viewers pretty but dumb, however, especially where the publishing of SGML data is concerned. The publishing process is simple, but a lot of important metadata is stripped out and lost. At one time, Adobe claimed that a future version of Acrobat would support bidirectional conversions from and to SGML. This has not happened yet.

Binary Browsers

Binary Browsers also use proprietary, binary file formats (like Page Viewers), but they aren't tied to a page image. While the majority of electronic delivery products seemed to fit this category a couple of years ago, more and more of them seem to be shifting to the Fixed DTD and Arbitrary DTD categories and becoming more "SGML-like." At this time, products like Folio VIEWS, Lotus SmarText, HyperWriter, and the Microsoft Help browser still appear to fit in this category.

The products in this category exhibit a wide range of functionality. Although most of them were originally developed to work with word processing files, they can be used to deliver SGML data. To do this, the SGML data stream must be converted into the non-SGML binary files used by the browser. While filters can be used to perform some of the conversion, most of these tools are more like authoring environments than publishing environments.

This authoring process usually requires significant interaction with a vendor-supplied styles editor to design screens, format documents, and add navigation aids and hypermedia links. As is the case with the Page Viewers described above, labor which is expended after the conversion from SGML will probably be lost if a revised document is imported into the publishing environment. Accordingly, vendors are beginning to introduce richer sets of importation filters which allow SGML document structures to be mapped directly to the capabilities of the delivery tool and reduce the level of interactive authoring.

Fixed DTD Browsers

A Fixed DTD Browser is a tool that uses SGML as part of the product architecture but only works with a small number of vendor-selected DTDs. Oracle Book, InfoAccess Guide, and Day and Zimmerman's Interactive Presentation Manager (DZIS-IPM) are examples of products in this category. NCSA Mosaic and Netscape also fit within this category, in that they operate against a finite set of "HTML DTDs" (ignoring, for the moment, whether the software actually uses a DTD or just a formalized tag set that could be represented as a DTD).

NCSA Mosaic is relatively unique as a Fixed DTD Browser, however, because the version of HTML supported by Mosaic is not proprietary to NCSA. Most of the other tools in this category use proprietary DTDs. This is the case with Netscape, which uses a proprietary version of HTML that includes the popular Netscape extensions. While a proprietary markup language blurs the distinction between Binary Browsers and Fixed DTD Browsers, some important differences remain.

First, when used with SGML documents, these tools support a publishing process which can be best described as mapping, where the elements in the source DTD are mapped to elements (and thus indirectly to functions) in the delivery environment. Some Fixed DTD Browsers use a markup language which is very structurally oriented. Some use a markup language which is more visually oriented. When evaluating products in this category, it is important to determine whether the target markup language is a good fit for the source data that you will be producing and the ways that you wish to render it.

Second, there are more options for automating SGML to SGML transformations than for converting SGML to a proprietary binary format. Not only do many of the vendors in this category provide customized filters, but other popular filter tools can also be used (e.g., OmniMark and Perl).
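To make the idea of mapping more concrete, here is a minimal Python sketch of an SGML to SGML filter. Everything in it is an assumption made for illustration: the source element names, the HTML-like target tag set, and the simplification that all tags are written out in full. It does not represent any vendor's filter, nor OmniMark or Perl scripts used in practice.

```python
import re

# Hypothetical mapping from a source DTD's element names to a fixed delivery tag set.
ELEMENT_MAP = {
    "chapter": "div",
    "chaptitle": "h1",
    "para": "p",
    "partnum": "b",
}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^>]*)>")


def map_elements(sgml, element_map):
    """Rename mapped elements; drop the tags (not the content) of unmapped ones."""
    def rename(match):
        slash, name = match.group(1), match.group(2).lower()
        target = element_map.get(name)
        if target is None:
            return ""  # unmapped element: its markup is lost
        return "<" + slash + target + ">"  # attributes are discarded in this sketch
    return TAG.sub(rename, sgml)


source = ("<chapter><chaptitle>Pumps</chaptitle>"
          "<para>Use <partnum>1978</partnum> with <voltage>12</voltage> volts.</para>"
          "</chapter>")
print(map_elements(source, ELEMENT_MAP))
```

Note what happens to the content-oriented elements in this example: the part number survives only as bold text and the voltage tag disappears entirely. That is the fit question raised above, since a visually oriented target language cannot carry the meaning that the source DTD recorded.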

Third, the stronger SGML bias shows in a better separation between content and styles. This allows sets of formatting commands to be designed once for a given DTD and used for both multiple document instances and multiple versions of the same document. While rules-based formatting may be possible with the Binary Browsers, the products in this category tend to have more mature and robust support.

Although most of the tools in the previous categories provide ways to capture input from readers (e.g., bookmarks and annotations), many of the products in this category extend those capabilities to provide low-level data collection and authoring functions that are designed to be integrated back into the document lifecycle.

Most of the HTML browsers, for example, include forms capabilities and some even integrate email functions. Interactive Electronic Technical Manual (IETM) browsers, like DZIS-IPM, are designed to not only display technical procedures, but to collect operator input and data from diagnostic systems and use that information to customize the logical flow of the procedure's steps. IETMs can also be used to route operator feedback in real-time to other information management and reporting systems.

Arbitrary DTD Browsers

These browsers are designed from the ground up to render SGML data and are truest to the philosophy of SGML. By accepting arbitrary DTDs, these products do not require that a document instance be restructured, converted, or mapped into a vendor-specified tag set.

These tools not only retain all the metadata in the SGML document instance, but they also maintain clear separation among a document's structure, content, and visual rendering. Electronic Book Technologies DynaText and SoftQuad Panorama are examples of the products in this category.

The publishing process is focused around the DTD that was used to structure the incoming document instance. Styles are defined for each element type in the DTD and stored in a separate file, which is normally called a style sheet. Multiple style sheets can be defined for the same DTD.

One of the primary functions of the browser is to merge data and styles at the time of rendering (often based on the decisions of the reader). This preserves flexibility in a way that is normally not possible with other viewers or browsers. It is also common for multiple style sheets to be used at the same time (e.g., DynaText normally uses one style sheet for the Table of Contents and another for the full text of the document). Some of these browsers even allow the reader to define their own custom styles.
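The merging of data and styles at rendering time can be illustrated with a small Python sketch. The style sheet format, the element names, and the plain-text "rendering" are all assumptions invented for this example; they are not the formats used by DynaText, Panorama, or any other product.

```python
# Hypothetical style sheets: each maps element types of one DTD to display properties.
FULL_TEXT_STYLES = {
    "title": {"prefix": "\n== ", "suffix": " ==\n"},
    "para": {"prefix": "", "suffix": "\n"},
    "warning": {"prefix": "!! ", "suffix": " !!\n"},
}
TOC_STYLES = {
    "title": {"prefix": "* ", "suffix": "\n"},
    # Element types without an entry are hidden in this view.
}

# The document instance carries no formatting: just (element type, content) pairs.
DOCUMENT = [
    ("title", "Pump Maintenance"),
    ("para", "Check the seals monthly."),
    ("warning", "Disconnect power first."),
]


def render(document, styles):
    """Merge content and styles at display time; the content is never modified."""
    out = []
    for element, content in document:
        style = styles.get(element)
        if style is None:
            continue  # no style defined: the element is not shown
        out.append(style["prefix"] + content + style["suffix"])
    return "".join(out)


print(render(DOCUMENT, TOC_STYLES))
print(render(DOCUMENT, FULL_TEXT_STYLES))
```

Because the same document instance feeds both views, a reader-defined style sheet would simply be a third dictionary passed to the same function.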

Style sheets and other browser-specific sets of metadata are usually stored as SGML files and can be created or revised without using the vendor-provided editing tools. At this time, the DTDs used to structure style sheet files are proprietary, but in time, they are almost certain to become compliant with DSSSL (the Document Style Semantics and Specification Language) or a well-accepted dialect of DSSSL (e.g., DSSSL-Lite).

While many, if not most, electronic delivery tools support hyperlinks so that Tables of Contents and other navigational aids can be built or coded, the tools in this category usually have the most sophisticated methods for handling such navigation. In Panorama, for example, Tables of Contents are supported through a feature called navigators. Multiple element types can be flagged for inclusion in a navigator, thus creating a hierarchical Table of Contents which can be expanded and collapsed. Multiple navigators can be applied to the same document to speed access to different sets of elements (e.g., figures, tables, code fragments, etc.).
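The general idea of deriving a hierarchical Table of Contents from flagged element types can be sketched in a few lines of Python. This is not how Panorama implements navigators; the tree representation, the element names, and the function name are hypothetical and exist only to show the principle.

```python
# A parsed document as nested (element type, title, children) tuples.
# The element names are hypothetical.
DOC = ("book", "Pump Manual", [
    ("chapter", "Installation", [
        ("section", "Site Preparation", []),
        ("figure", "Mounting Diagram", []),
    ]),
    ("chapter", "Maintenance", [
        ("section", "Seals", [
            ("figure", "Seal Cross-Section", []),
        ]),
    ]),
])


def build_navigator(node, flagged, depth=0):
    """Return (depth, title) entries for every element whose type is flagged."""
    element, title, children = node
    entries = []
    if element in flagged:
        entries.append((depth, title))
        depth += 1  # descendants of a flagged element nest one level deeper
    for child in children:
        entries.extend(build_navigator(child, flagged, depth))
    return entries


# Two navigators over the same document: one for headings, one for figures.
for depth, title in build_navigator(DOC, {"chapter", "section"}):
    print("  " * depth + title)
for depth, title in build_navigator(DOC, {"figure"}):
    print("  " * depth + title)
```

The depth values are what a browser would need in order to expand and collapse entries; the second call shows how a different set of flagged element types yields a different navigator for the same document.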

Some of the most exciting features in this category are those that rely on HyTime (or similar addressing approaches) and can't easily be described in an introductory overview. Generally, though, by allowing a variety of user interactions to be captured as SGML data streams, these tools allow important sets of metadata to be collected, not during authoring, but during browsing. This allows subject matter experts to easily codify their understanding of the relationships inside complex documents in a form that can be readily published to other users of the system. Such consumer-defined metadata collections have the potential to function as alternatives to complex, content-oriented DTDs.

Text Retrieval Engines

At its simplest, text retrieval involves searching and string matching. It's a rare electronic delivery tool that doesn't support simple text searches within a single, onscreen document. For most applications, however, this is far from adequate. To be useful, text searching must be done against a body of documents. Full text indexing and retrieval systems address these needs.

When discussing text retrieval, the terms precision and recall are fairly important. Precision refers to the ability to retrieve only what is desired, and not a lot of extraneous (noisy) data. Recall is the ability to retrieve everything that is of interest. Ideally, a query returns everything that you are looking for (recall) and nothing that you aren't (precision). Query results are never ideal.
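Both measures are usually expressed as ratios, which the following short Python sketch computes. The document identifiers are made up for the example; the definitions follow the standard usage of the two terms.

```python
def precision_and_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: set of document ids returned by the query
    relevant:  set of document ids the user actually wanted
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three were actually of interest.
retrieved = {"a", "b", "c", "d"}
relevant = {"a", "b", "e"}
precision, recall = precision_and_recall(retrieved, relevant)
print(precision)  # 0.5: half of what came back was wanted
print(recall)     # about 0.67: two of the three wanted documents were found
```

The two numbers tend to pull against each other: broadening a query usually raises recall at the expense of precision, and narrowing it does the reverse.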

Mechanically, full text retrieval is almost always a two-step process. The first step involves an indexing function. Although vendors use different indexing approaches, this step usually occurs somewhere within the publishing phase of the document lifecycle. An exception to this is some of the indexing which is being done on the World Wide Web, where software tools are indexing documents after they have been published on the Internet.

The second step is the specification and processing of a user-defined query. The way that queries can be constructed and processed is the greatest single differentiating factor among retrieval tools. The more common query approaches are boolean searches, weighted thesauruses, vector searches, and context-sensitive searches.

Boolean Searches

The simplest query model is the boolean search. In addition to searching for a specific string, systems that support this approach (virtually all systems) allow multiple strings to be searched for. AND, OR, and NOT operators can be combined with the specified strings to influence precision and recall (e.g., get me the documents that have "SGML" AND "pasta" in them).
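The two-step process (index first, then query) and the boolean operators can be sketched in Python as follows. The document collection is invented, and the index is a bare-bones inverted index held in memory; real retrieval engines use far more elaborate structures.

```python
import re
from collections import defaultdict

DOCS = {  # hypothetical document collection
    "doc1": "SGML browsers and pasta recipes",
    "doc2": "SGML style sheets",
    "doc3": "Pasta for beginners",
}


def build_index(docs):
    """Step one: map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


INDEX = build_index(DOCS)


def term(word):
    """Step two starts here: look up the documents that contain one word."""
    return INDEX.get(word.lower(), set())


# Boolean operators become set operations on the lookup results.
print(sorted(term("SGML") & term("pasta")))  # AND -> ['doc1']
print(sorted(term("SGML") | term("pasta")))  # OR  -> ['doc1', 'doc2', 'doc3']
print(sorted(term("SGML") - term("pasta")))  # NOT -> ['doc2']
```

AND narrows the result (favoring precision), OR widens it (favoring recall), and NOT removes a known source of noise.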

Weighted Thesauruses

Verity, with their Topic product, has done most of the pioneering work with weighted thesauruses. This approach was developed originally for the CIA to help process large amounts of incoming data to determine which information deserved further attention.

It works something like this. Imagine being interested in information about outer space. A fairly large vocabulary of words could be used to identify documents about outer space: space, rockets, Boeing, moon, NASA, stars, shuttle, Hubble, etc. Some of these words are likely to be strong indicators of relevance (e.g., NASA) and others are less likely (e.g., movie "stars"). The weighted thesaurus allows a hierarchy of terms to be constructed, where each node and branch can be given a number to indicate its probable relevancy.

When queries are formed by referencing these key terms, a complex set of string matching and statistical calculations is used to rank target documents for relevancy. When the thesauruses are well-designed and maintained, this is a very effective retrieval approach. It may not be as good a choice for performing predominantly ad-hoc queries, though, where the cost of crafting well-designed vocabularies of search terms is hard to justify and much of the power of the query tool remains unused.
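A greatly simplified Python sketch of the idea follows. It is not Verity's algorithm: the weights are invented, the hierarchy of terms is flattened into a single list, and the "statistical calculation" is reduced to adding up the weights of the terms that appear in each document.

```python
import re

# Hypothetical topic: each term carries a weight for its probable relevancy.
OUTER_SPACE = {"nasa": 0.9, "shuttle": 0.8, "hubble": 0.8, "moon": 0.5, "stars": 0.2}

DOCS = {  # hypothetical incoming documents
    "launch": "NASA confirmed the shuttle will service Hubble",
    "gossip": "Movie stars attended the premiere",
    "recipe": "A simple pasta sauce",
}


def score(text, topic):
    """Sum the weights of the topic terms that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for term, weight in topic.items() if term in words)


def rank(docs, topic):
    """Return (doc id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(text, topic)) for doc_id, text in docs.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


for doc_id, value in rank(DOCS, OUTER_SPACE):
    print(doc_id, round(value, 2))
```

The article about movie stars still matches, but its low weight pushes it well below the article that mentions NASA, the shuttle, and Hubble. The effort lies in choosing and maintaining the weights, which is why the approach pays off mainly for recurring queries.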

Vector Searches

From what I can tell, Gerard Salton of Cornell University developed this method, and only a few retrieval engines support it at this time.

Imagine taking every article in Byte magazine and counting the number of times that the words "hardware" and "software" appear in each article. Plot each article on a grid with the number of occurrences of "hardware" on the vertical axis and the number of occurrences of "software" on the horizontal axis. Next, take an article that is a good example of what you are looking for and plot its location on the grid. Vector math can then be used to find the nearest article, which will have a similar combination of "hardware" and "software."

Admittedly, the above example is simplistic and a bit stupid. This approach becomes much more useful, however, when the index contains thousands of keywords that have been carefully chosen. Some researchers are using vector searches to replace hard-coded hyperlinks and integrating these search engines with graphical displays, where dot clustering and color changes are used to communicate proximity.
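The two-keyword example translates almost directly into Python. The article texts below are invented, and the sketch uses straight-line (Euclidean) distance because that matches the grid picture described above; production systems built on the vector model often compare the angle between vectors instead.

```python
import math
import re

KEYWORDS = ("hardware", "software")  # the two axes of the grid

ARTICLES = {  # hypothetical article texts
    "chips": "hardware hardware hardware software",
    "compilers": "software software software hardware",
    "balanced": "hardware software hardware software",
}


def to_vector(text):
    """Count the occurrences of each keyword, giving a point on the grid."""
    words = re.findall(r"[a-z]+", text.lower())
    return tuple(words.count(keyword) for keyword in KEYWORDS)


def nearest(example_text, articles):
    """Return the name of the article whose point lies closest to the example's."""
    target = to_vector(example_text)
    return min(
        articles,
        key=lambda name: math.dist(to_vector(articles[name]), target),
    )


example = "software software software software hardware"
print(to_vector(example))          # (1, 4)
print(nearest(example, ARTICLES))  # compilers
```

Extending KEYWORDS to thousands of carefully chosen terms changes nothing in the code; each article simply becomes a point in a space with thousands of dimensions.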

Context-Sensitive Searches

The preceding search methods can be used with either structured or unstructured data and can be found in a wide variety of products. Context-sensitive searches, on the other hand, require structured data and are usually only found in those products that have a solid SGML foundation. OpenText, DynaText, and Explorer are some of the products that support this searching method.

Context-sensitive searches are performed by specifying both the text which is to be found and the element that it is to be found in. This approach can significantly improve precision. Searching for words only in document titles, for example, lowers the absolute number of hits and generally raises precision.

The desire to perform context-sensitive searches can often have a significant effect on tool selection and markup strategies. Page Viewers, Binary Browsers, and Fixed DTD Browsers can have trouble supporting context-sensitive searches because the original SGML markup is likely to have been stripped out or converted to another metadata framework.

DTDs that are designed to enhance the effectiveness of context-sensitive searches are likely to have more elements than those designed mostly to support visual formatting. Being able to search for part number elements that contain "1978" is very different from just searching for "1978" or even "part number 1978." It is possible for DTDs to become incredibly large and complex as information consumers strive to have every possible context formalized in the SGML DTD to support potential search strategies.
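
The difference between a plain search and a context-sensitive one can be sketched as follows. This is a minimal illustration under several assumptions: the sample document is written as well-formed XML so that Python's standard xml.etree.ElementTree module can parse it (a real SGML instance would need an SGML-aware parser), the element names (manual, chapter, title, para, partnum) are hypothetical, and the function is my own and not the query interface of any of the products named above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document instance; element names are illustrative only.
SAMPLE = """<manual>
  <chapter>
    <title>Pump maintenance</title>
    <para>Order replacement seals as part <partnum>1978</partnum>.</para>
  </chapter>
  <chapter>
    <title>History</title>
    <para>The pump line was introduced in 1978.</para>
  </chapter>
</manual>"""

def chapters_matching(root, term, element=None):
    """Return titles of chapters containing the term.

    With element=None the whole chapter text is searched; otherwise only
    the text inside elements of that name counts as a hit.
    """
    hits = []
    for chapter in root.iter("chapter"):
        scopes = chapter.iter(element) if element else [chapter]
        if any(term.lower() in "".join(s.itertext()).lower() for s in scopes):
            hits.append(chapter.findtext("title"))
    return hits

root = ET.fromstring(SAMPLE)
print(chapters_matching(root, "1978"))                     # ['Pump maintenance', 'History']
print(chapters_matching(root, "1978", element="partnum"))  # ['Pump maintenance']
print(chapters_matching(root, "pump", element="title"))    # ['Pump maintenance']
```

The scoped queries return fewer hits, and every one of them is relevant. That is only possible because the markup that identifies part numbers and titles is still present at search time.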

CD-ROM Publishing

Most of what has been described in this paper applies directly to CD-ROM publishing. Virtually all of the tools that have been described can be used for CD-ROMs, just as they can be used with data that resides on local hard disks, networked fileservers, and other media. Licensing arrangements may vary from vendor to vendor, however, making some tools more or less attractive than others.

One of the few issues that are largely specific to CD-ROM publishing involves how the documents are encoded on the disk. While some CD-ROMs, like the SGML World Tour, contain native SGML files, intellectual property interests may dictate that the documents be stored in a binary representation that cannot be converted back to revisable text.

An emerging trend in CD-ROM publishing is to integrate CD-ROMs with the World Wide Web and other online services. CompuServe and Encyclopedia Britannica appear to be among the first to be doing this. I can envision two popular approaches: 1) using the CD-ROM to distribute pieces of information that are heavily used or fairly static (like graphics), thereby cutting bandwidth requirements, reducing cost, and improving performance; and 2) keeping the data on CD-ROMs more current by integrating fragments of dynamic data that are accessed online.
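
The two approaches can be combined in a single lookup step, sketched below. Everything in it is an assumption made for illustration: the catalog layout, the "dynamic" flag, the disc paths, and the example.com address are hypothetical and are not taken from any existing product.

```python
def resolve(resource, catalog, base_url, online):
    """Pick a source for one resource: ('online', url) or ('disc', path)."""
    entry = catalog[resource]
    if entry["dynamic"] and online:
        # Approach 2: refresh volatile fragments from the online service.
        return ("online", base_url + resource)
    # Approach 1 (and the offline fallback): read the copy stored on the disc.
    return ("disc", entry["disc_path"])

# Hypothetical catalog shipped on the CD-ROM.
catalog = {
    "logo.gif": {"dynamic": False, "disc_path": "/cdrom/graphics/logo.gif"},
    "prices.sgm": {"dynamic": True, "disc_path": "/cdrom/data/prices.sgm"},
}

print(resolve("logo.gif", catalog, "http://example.com/", online=True))
print(resolve("prices.sgm", catalog, "http://example.com/", online=True))
print(resolve("prices.sgm", catalog, "http://example.com/", online=False))
```

Static graphics always come from the disc, while the dynamic fragment is fetched online when a connection exists and falls back to the copy on the disc otherwise.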

Most of the early efforts at integrating CD-ROM and online publishing appear to be based on custom software. I am aware of only one commercial tool that has been designed to support CD-ROM/World Wide Web integration: Electronic Book Technologies' Matterhorn product, which is currently in beta. Matterhorn will allow URLs to be embedded on the CD and used to retrieve Web pages when an appropriate hyperlink or icon is activated.

Conclusion

Document delivery and retrieval play important, sometimes critical, roles in the document lifecycle. Because SGML tends to shift costs upstream and benefits downstream, the selection of electronic delivery and retrieval tools can dramatically influence 1) the cost-benefit ratios for the entire SGML project, and 2) where in the lifecycle benefits are realized.

SGML DTDs represent a negotiated balance among the divergent stakeholder interests that exist at different points in the document lifecycle. The way that the interests of authors (simplicity), publishers (structure), and consumers (richness) are balanced will drive many, if not most, of the major DTD design choices. These metadata choices will, in turn, influence the appropriateness of individual display and retrieval tools.

While a very broad range of readers, viewers, browsers, and text indexing tools can be used to deliver SGML documents, browsers and retrieval engines that are built to support SGML generally offer superior performance. As a rule, they better preserve the metadata richness of the original document instance, provide more flexibility during display, and support context-sensitive searching methods that can enhance the precision of query results.

The shift from paper-based to SGML-based document lifecycles is blurring the distinction between information producers and consumers. Many of the high-end delivery tools include capabilities for capturing valuable information during the browsing session. These changes have the potential to shorten cycle times, speed individual learning, improve collaboration, and enhance organizational adaptation.

Stemma

This document is based on a paper that was presented at SGML'95, December 4-7, 1995, and published in the conference proceedings.