The Transformation of SGML Documents for Presentation on the World Wide Web

Eric D. Freese
Principal Software Developer
Information Dimensions, Inc.


Introduction

As the popularity of the World Wide Web (WWW) increases, a growing number of organizations are interested in distributing their data over the Internet. Some of this data is marked up using the Standard Generalized Markup Language (SGML). The challenge of distributing SGML over the WWW involves converting structurally marked data into a less structured format for the presentation of the data.

Two major issues must be addressed when developing a system for publishing SGML information over the Internet: the partitioning of large SGML documents and the transformation of elements.

Many WWW browsers currently available seem to work best for documents, or document fragments, which are only a few "pages" in length. However, many SGML documents are much longer than this. One possible solution is to break SGML documents into "components" that are managed and retrieved as units. This allows users to retrieve only as large or small a component as they require.

Hypertext presentation on the WWW generally requires that markup within a document be transformed into HyperText Markup Language (HTML), which is an application of SGML. At first glance, this may seem to be a trivial task, since the conversion is essentially SGML to SGML. However, a document marked up in a highly structural markup scheme may not have elements that map directly to the elements found in HTML. The mapping can be accomplished by matching the markup in the original document with HTML elements, which allows start tags, end tags, attributes and other forms of SGML markup to be used in the transformation to HTML. An extension to the process of component presentation is the dynamic mapping of elements based on the context of the element within a specific component.

This paper discusses some of the issues involved in partitioning large SGML documents and in mapping elements in a dynamic retrieval and presentation environment. It also discusses possible extensions to the proposed HTML 3.0 standard that would enable WWW browsers to dynamically present arbitrary SGML-encoded data.

Document Partitioning

SGML, by its nature, provides a very rich environment for text retrieval, authoring, editing, and document management systems to take advantage of the structure inherent to documents. The ability to utilize document structure during retrieval and presentation allows users to search and view only those parts of documents that are most likely to satisfy their query. Furthermore, when the retrieval system can inform the user of the exact location of a hit, the user can make a better decision regarding the relevancy of query results and, consequently, which items to view.

SGML's structural nature allows documents to be broken down into "components". For example, a document may have several chapters, which are broken into sections, which are broken into paragraphs, etc. A database system can then manage the components as separate units, while still maintaining the context in which they originally occurred. In other words, components can be searched, retrieved, updated and viewed as separate units, yet their position in the original document hierarchy is maintained. Several kinds of SGML products, including authoring systems, database or repository managers, and viewers, use the concept of components to assist in managing the data.

Component Location and Definition

Several factors can be taken into consideration when determining what elements within a document can be treated as components. These typically are based on an element's location within the document or its "context". For example, consider the following simple, generic Document Type Definition (DTD):

<!DOCTYPE doc [

<!ELEMENT doc  - -  (title, chapter+) >
<!ELEMENT chapter  - -  (title, section+) >
<!ELEMENT section  - -  (title, para+) >
<!ELEMENT para  - -  (title?, text) >
<!ELEMENT (title | text)  - -  (#PCDATA) >

]>

In this example, a document is divided into one or more chapters, chapters into one or more sections, and sections into one or more paragraphs. Each paragraph is a block of text. Titles are attached to the document, chapters, and sections, and optionally to the paragraphs.

In this case, the DOC, CHAPTER, SECTION, and PARA elements are good candidates for being treated as components. The main factor in this decision is that users of the information may want to perform searches and receive the information in differing levels of granularity. One user may only be looking for a specific bit of information in a paragraph while another may want to read the entire chapter on a given subject. In this application, it has also been determined that the document title and chapter titles should also be managed as components to aid in the searching of the data. An example grammar to define this component storage scheme may be as follows:

component: #GI=doc
component: #GI=chapter
component: #GI=section
component: #GI=para
component: #GI=title [#PARENT=doc]
component: #GI=title [#PARENT=chapter [#PARENT=doc]]

In this example, the #GI keyword identifies a generic identifier for an SGML element. The #PARENT keyword identifies an element that immediately contains the named generic identifier within the document hierarchy. An #ANCESTOR keyword may also be used to identify an element that contains the named generic identifier at any level within the document hierarchy. These keywords help to define the context of the elements that are to be treated as components. The square brackets ("[" and "]") are used to demonstrate groupings and associations within the declaration. Notice that the last two declarations help to exclude section titles and paragraph titles from being treated as components by giving very specific contexts.
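The way such declarations select components can be illustrated with a minimal Python sketch. It is not part of any product described here; the function names (parse_rule, matches, is_component) and the representation of an element's position as a list of generic identifiers from the document root down to the element are assumptions made for illustration. Because the bracket groupings in these declarations nest linearly, the sketch reads the keywords left to right and does not parse the brackets themselves.

```python
import re

# Keyword=name pairs; brackets are skipped because nesting is linear here.
TOKEN = re.compile(r"#(GI|PARENT|ANCESTOR)=([\w.-]+)")

def parse_rule(text):
    """Turn '#GI=title [#PARENT=chapter [#PARENT=doc]]' into
    ('title', [('PARENT', 'chapter'), ('PARENT', 'doc')])."""
    tokens = [(kw, name.lower()) for kw, name in TOKEN.findall(text)]
    if not tokens or tokens[0][0] != "GI":
        raise ValueError("declaration must start with #GI: %r" % text)
    return tokens[0][1], tokens[1:]

def context_ok(context, path, pos):
    """Check the remaining context constraints above path[pos]."""
    if not context:
        return True
    (relation, name), rest = context[0], context[1:]
    if relation == "PARENT":
        return pos >= 1 and path[pos - 1] == name and context_ok(rest, path, pos - 1)
    # ANCESTOR: any enclosing element with that name may satisfy the rule.
    return any(path[i] == name and context_ok(rest, path, i) for i in range(pos))

def matches(rule, path):
    """path lists generic identifiers from the document root to the element."""
    gi, context = rule
    return bool(path) and path[-1] == gi and context_ok(context, path, len(path) - 1)

COMPONENT_RULES = [parse_rule(r) for r in (
    "#GI=doc", "#GI=chapter", "#GI=section", "#GI=para",
    "#GI=title [#PARENT=doc]",
    "#GI=title [#PARENT=chapter [#PARENT=doc]]",
)]

def is_component(path):
    return any(matches(rule, path) for rule in COMPONENT_RULES)
```

Under these assumptions, is_component(["doc", "chapter", "title"]) is true, while is_component(["doc", "chapter", "section", "title"]) is false, which mirrors the exclusion of section and paragraph titles described above.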

A search and retrieval system that can take advantage of the component structure defined above can greatly enhance the usability of the information stored within documents by making the specific items being searched more accessible. This still raises the issue of presenting the data once it is retrieved. We will now discuss what effect the use of components has on the presentation of data, specifically in terms of presentation on the WWW.

Element Transformation

The complexity of the task of transforming elements from an arbitrary SGML application to HTML can be directly related to the design of the original application's DTD. For example, if the DTD is designed to simply emulate the markup inserted by a word processing system, then the transformation to HTML can be very simple. However, if a more structural content-based markup is used, then the transformation may be more difficult, since the elements identify the semantics of the data rather than the presentation. Although the elements identify the content, in many cases they can also be used to help identify the structure. Some very complex applications, such as the Interactive Electronic Technical Manual (IETM) project within the Department of Defense, identify and store elements with no dependence on the possible uses of the data. The order and placement of the elements are not known until the information is gathered for final presentation.

Currently, any transformation from SGML to HTML must be done outside the domain of the WWW browser. A preliminary DTD for HTML 3.0 (HTML+) is being developed by David Raggett from the Hewlett Packard Laboratories in the United Kingdom. In it, a new element has been proposed which could be used to handle simple element transformations. The element is declared as follows:

<!ELEMENT render 	- o 	EMPTY>
<!ATTLIST render	id	ID	#IMPLIED
			tag	CDATA	#IMPLIED
			equiv	CDATA	#IMPLIED
			style	NAMES	#IMPLIED >

The HTML 3.0 DTD is being designed so that it can be extended to add elements from a source document type by redefining HTML entities. The RENDER element specifies how these additional elements should be formatted. The element name from the source document is provided as the value for the tag attribute. The HTML element that the source element mimics is given as the value of the equiv attribute. Any additional formatting is done by providing a space delimited list of standard HTML emphasis element names such as I for italic, B for bold, U for underlined, S for strike-through, and TT for a fixed-pitch font as the value of the style attribute. Using the DTD declared previously, the rendering for paragraphs may be:

<render tag="para" equiv="p">

In cases where the equiv attribute is used, it is assumed that the end tag for the equivalent HTML element occurs at the same point where the end tag for the source element occurred.
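A one-to-one mapping of this kind can be sketched in a few lines of Python. The sketch is only an illustration of the idea, not a description of how any browser implements RENDER: the names RENDER and apply_render are hypothetical, the mapping table is assumed to have been collected from the RENDER declarations already, attributes on mapped tags are simply dropped, and omitted (minimized) end tags are not handled.

```python
import re

# Assumed to be collected from declarations such as
# <render tag="para" equiv="p">
RENDER = {"para": "p"}

# Matches a start tag or end tag; attributes are consumed by [^>]*.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def apply_render(text, render=RENDER):
    """Replace each mapped source tag with its HTML equivalent.
    The end tag of the equivalent is emitted where the source end tag
    occurred; tags without a mapping are left untouched."""
    def swap(match):
        slash, name = match.group(1), match.group(2).lower()
        equiv = render.get(name)
        if equiv is None:
            return match.group(0)
        return "<%s%s>" % (slash, equiv)
    return TAG.sub(swap, text)
```

With the table above, apply_render("<para>Some text.</para>") yields "<p>Some text.</p>".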

In many of the current WWW browsers, any markup not recognized as a supported HTML element is ignored. The addition of the RENDER element, as proposed, allows a simple element transformation to take place within the browser so that original SGML markup can be used to drive the presentation of the data. This also alleviates the need to develop a transformation process that must occur before the data is presented to the browser. However, this is only a simple one-to-one element transformation. In many instances, a generic element, such as the TITLE in the example DTD, is used throughout a document and the formatting applied to it depends on the context in which it occurs. For example, a chapter title may be formatted differently than a section title. In such cases, the RENDER element, as currently proposed, would not be able to format the instances of TITLE differently based on their context.

In order to be able to include the context of an element within the RENDER element, a location grammar is needed. Such a grammar can more specifically define the location or context of an element so that it can be identified and transformed correctly. This could be easily done within the tag attribute since it is declared as character data. For example, if a grammar language similar to that used above for the component declaration was implemented for the RENDER element, the following would be possible:

<render tag="#GI=title [#PARENT=doc]" equiv="h1">
<render tag="#GI=title [#PARENT=chapter]" equiv="h2">
<render tag="#GI=title [#PARENT=section]" equiv="h3">
<render tag="#GI=title [#PARENT=para]" equiv="h4">

This extension accomplishes two things: more specific identification of elements and more accurate mapping for document presentation. Even though a TITLE element also appears within the HTML 3.0 DTD, the additional specificity provided above ensures that the HTML TITLE is still treated as a TITLE within HTML. The additional context information ensures that the other titles found within the source document are not mistaken for the HTML document title. This also provides the capability for a single, generic element to be presented in any number of ways based on where and how it is used.

There are other items within the SGML markup that may affect the presentation of the data including attributes, end tags, marked sections, and processing instructions. The location grammar can be extended to accommodate all of these items as shown below.

Element attributes can affect the presentation in a number of ways. Typically these include:

Element end tags may also serve as triggers for some types of presentation. In the example below, element FOOBAR occurs within a paragraph that has been presented in bold. FOOBAR should be rendered in italics only (no bold), and bold should be turned back on when FOOBAR ends.

<render tag="#GI=foobar [#PARENT=para]" equiv="/B">
<render tag="#GI=/foobar [#PARENT=para]" equiv="B">

Marked sections, especially those to be ignored, can have an effect on the presentation of data:

<render tag="#MS=IGNORE" equiv="hide">

Processing instructions can also be interpreted as presentation commands. In the source document the following processing instruction may occur:

<?BOLD>

The location syntax can be extended to handle processing instructions by using the following:

<render tag="#PI=BOLD" equiv="B">

Dynamic Element Transformation

Previously we discussed dividing documents into components. The ability to retrieve documents as components rather than whole documents raises an interesting challenge where presentation is concerned. In the RENDER declarations for the TITLE elements above, it is assumed that DOC is the highest level in the hierarchy followed by CHAPTER, SECTION, and PARA. In an environment where components can exist, this may not always be the case. If the user only wishes to view a section or paragraph at the highest level, then the titles associated within the new hierarchy should be displayed as such. In other words, if a section is the highest level in a particular view of the data, then the section title should be viewed as if it were a first level heading (H1). This can be done using the following declarations:

<render tag="#GI=title [#PARENT=doc]" equiv="h1">
<render tag="#GI=title [#PARENT=chapter [#PARENT=doc]]" equiv="h2">
<render tag="#GI=title [#PARENT=chapter]" equiv="h1">
<render tag="#GI=title [#PARENT=section [#ANCESTOR=doc]]" equiv="h3">
<render tag="#GI=title [#PARENT=section [#PARENT=chapter]]" equiv="h2">
<render tag="#GI=title [#PARENT=section]" equiv="h1">
<render tag="#GI=title [#PARENT=para [#ANCESTOR=doc]]" equiv="h4">
<render tag="#GI=title [#PARENT=para [#ANCESTOR=chapter]]" equiv="h3">
<render tag="#GI=title [#PARENT=para [#PARENT=section]]" equiv="h2">
<render tag="#GI=title [#PARENT=para]" equiv="h1">

In this case, the more specific declarations are examined first and, if they fail to match the context of the element, the subsequent declarations are checked until one succeeds. Once a successful match is found, the element is presented and matching ends. If a successful match is not found, the source element is ignored, much as unrecognized markup is ignored currently.
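The first-match behavior can be sketched in Python as follows. This is an illustrative reading of the declarations above, not an implementation taken from any browser: the names RENDER_RULES, context_ok and equiv_for are hypothetical, element names are assumed to be lower case, and an element's position is assumed to be given as the list of element names from the root of the retrieved component (not necessarily DOC) down to the element.

```python
import re

TOKEN = re.compile(r"#(GI|PARENT|ANCESTOR)=([\w.-]+)")

# (tag, equiv) pairs in the order of the RENDER declarations above.
RENDER_RULES = [
    ("#GI=title [#PARENT=doc]", "h1"),
    ("#GI=title [#PARENT=chapter [#PARENT=doc]]", "h2"),
    ("#GI=title [#PARENT=chapter]", "h1"),
    ("#GI=title [#PARENT=section [#ANCESTOR=doc]]", "h3"),
    ("#GI=title [#PARENT=section [#PARENT=chapter]]", "h2"),
    ("#GI=title [#PARENT=section]", "h1"),
    ("#GI=title [#PARENT=para [#ANCESTOR=doc]]", "h4"),
    ("#GI=title [#PARENT=para [#ANCESTOR=chapter]]", "h3"),
    ("#GI=title [#PARENT=para [#PARENT=section]]", "h2"),
    ("#GI=title [#PARENT=para]", "h1"),
]

def context_ok(context, path, pos):
    """Check the nested #PARENT/#ANCESTOR constraints above path[pos]."""
    if not context:
        return True
    (relation, name), rest = context[0], context[1:]
    if relation == "PARENT":
        return pos >= 1 and path[pos - 1] == name and context_ok(rest, path, pos - 1)
    # ANCESTOR: any enclosing element with that name may satisfy the rule.
    return any(path[i] == name and context_ok(rest, path, i) for i in range(pos))

def equiv_for(path, rules=RENDER_RULES):
    """Return the equiv of the first matching rule, or None when no
    rule matches (the source element is then ignored)."""
    for tag, equiv in rules:
        tokens = TOKEN.findall(tag)
        (_, gi), context = tokens[0], tokens[1:]
        if path and path[-1] == gi and context_ok(context, path, len(path) - 1):
            return equiv
    return None
```

Under these assumptions, a section title inside a full document, equiv_for(["doc", "chapter", "section", "title"]), maps to "h3", while the same title retrieved as part of a section component, equiv_for(["section", "title"]), maps to "h1".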

A mechanism such as this allows data to be presented in a dynamic form based on the amount of data being presented. This also allows document designers to develop entire documents as one large piece, but still present it in several smaller, more manageable and readable pieces.

Other Considerations

The ability to define the formatting of a document based on an element's context or its attribute value would require the development of a specification grammar. This grammar would need to allow the user to specify an element's context as well as any of the element's attributes that might have an effect on how it is presented. The addition of such a capability raises several items for consideration.

  1. The addition of advanced location declarations would require that browsers be able to track the location of an element within the document hierarchy. This would probably require the embedding of an SGML parser that can interpret the source DTD as well as understand the markup of the original document. The parser would also need to be able to incorporate the source information into an HTML document. The knowledge of the original markup is especially important in cases where markup can be minimized.
  2. Since the tag attribute within RENDER is defined as character data, an SGML parser would not be able to validate the contents of the attribute to ensure that the specifications are defined correctly.
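Because the SGML parser cannot check the tag attribute, a processing application would have to do so itself. The following Python sketch shows one way such a check might look. The grammar it accepts is only inferred from the examples in this paper (a plain element name, or a #GI, #MS or #PI head followed by nested #PARENT or #ANCESTOR groups); the function name validate and the exact name syntax are assumptions, not part of the HTML 3.0 proposal.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# Head of a specification; a leading "/" denotes an end tag, as in #GI=/foobar.
HEAD = re.compile(r"\s*#(GI|MS|PI)=/?%s\s*" % NAME)
# One opening context group, e.g. "[#PARENT=chapter".
CTX = re.compile(r"\s*\[\s*#(PARENT|ANCESTOR)=%s\s*" % NAME)
PLAIN = re.compile(r"\s*%s\s*" % NAME)

def validate(spec):
    """Return True if spec follows the assumed location grammar."""
    head = HEAD.match(spec)
    if head is None:
        # A bare element name such as tag="para" is also accepted.
        return PLAIN.fullmatch(spec) is not None
    pos, depth = head.end(), 0
    while True:
        group = CTX.match(spec, pos)
        if group is None:
            break
        pos, depth = group.end(), depth + 1
    # Every opened group must be closed, and nothing else may follow.
    return spec[pos:].replace(" ", "") == "]" * depth
```

A specification with an unbalanced bracket or a misspelled keyword, such as "#GI=title [#PARNT=doc]", is rejected by this check, whereas an SGML parser would accept it as ordinary character data.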

The inclusion of a full-blown SGML parser within a browser adds to the processing overhead as well as the maintenance complexity of the browser. It may also cause difficulties when older HTML documents are parsed by the browser. These documents, which may or may not conform to a given DTD, may result in a number of errors. This will occur especially in cases where the documents were tagged so that they "looked good in Mosaic".

The combination of markup from different SGML document types always raises the possibility of "tag collisions". This is a case where more than one element declaration for the same tag name exists in the combination scheme. This is an SGML error that is handled differently by different SGML parsers. Some may identify the error and quit. Others may identify the error and ignore the second and any subsequent occurrences of the element declaration. The ISO standard does not specify what action should be taken when errors occur during a parse.
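Potential collisions can be detected before the document types are combined. The Python sketch below scans two DTD fragments for element declarations and reports names declared in both. It is a rough illustration only: the function names are hypothetical, names are compared without regard to case, and parameter entity references and comments inside declarations are not handled, so it is no substitute for a real SGML parser.

```python
import re

# Captures either a single name or a parenthesized name group
# that follows "<!ELEMENT".
DECL = re.compile(r"<!ELEMENT\s+(\([^)]*\)|[^\s>]+)", re.IGNORECASE)

def declared_names(dtd_text):
    """List the element names declared in a DTD fragment."""
    names = []
    for match in DECL.finditer(dtd_text):
        group = match.group(1).strip("()")
        # Names in a group are separated by a connector (|, & or ,).
        names.extend(n.strip().lower() for n in re.split(r"[|&,]", group))
    return names

def collisions(first_dtd, second_dtd):
    """Return the sorted element names declared in both fragments."""
    return sorted(set(declared_names(first_dtd)) & set(declared_names(second_dtd)))
```

Applied to the example DTD from this paper and any HTML DTD fragment that declares TITLE, the function reports "title", which is the collision that the context-specific RENDER declarations are intended to work around.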

Although this paper has focused on the transformation and presentation of SGML documents, it is conceivable that the RENDER element can be declared such that any token, be it an SGML tag or RTF markup, can be mapped and transformed by the browser.

Summary

Many organizations are using SGML because it represents a format-neutral method for marking up their data. This is especially important in cases where the data may have an extremely long life and be subjected to a wide range of uses. One of the real strengths of SGML is that it can be transformed into the format needed at the time that it is used.

The ability to partition large SGML documents into components at retrieval allows users to find needed information with greater speed and accuracy. This also reduces the volume of data flowing across the Internet.

HTML represents a publicly available and accepted method for the presentation of data. As HTML matures, more and more of the strengths of SGML are being incorporated into this standard. At the same time, the tools which use HTML, including WWW browsers, are also maturing. The addition of dynamic transformation capabilities allows these tools to work with a wider range of information and gain an increased level of acceptance and use.

The extension of the HTML 3.0 DTD to include a mechanism that allows arbitrary SGML documents to be transformed at presentation accomplishes two very important objectives. First, it provides a consistent model for document transformations across the WWW. Second, it increases the capabilities of WWW browsers in a manner that can be extended even further in the future. In theory, if HTML continues to be extended to enable the transformation of additional data formats, present-day browsers could evolve into virtually universal information presentation systems.

Eric Freese

Eric Freese is a Principal Software Developer within the SGML Solution Center of Information Dimensions, Inc. His responsibilities include the design and implementation of SGML technologies for IDI's document database products as well as helping to set the direction for future development of new SGML products.

His past SGML experience includes the design and development of a set of methodologies for the migration of over 200 million documents from a proprietary format to SGML. This included the development of a model for the segmentation of the documents to allow the maximum usage from the data. Mr. Freese also has experience in the CALS environment designing Document Type Definitions (DTDs) for a variety of technical manuals in addition to serving on a number of CALS-related committees.

He has a Master of Science degree in Computer Information Systems from Bentley College in Waltham, Massachusetts.

E-mail: efreese@idi.oclc.org