[Mirrored from: http://www.arbortext.com/natifilt.html]

Native SGML vs. Filtered SGML

An ArborText White Paper

SGML (Standard Generalized Markup Language) has become the world standard for exchanging information. As a result of the significant benefits of adopting SGML, many organizations are currently planning to introduce SGML in their document authoring and publishing systems. These organizations must choose between two fundamentally different approaches: “native SGML” and “filtered SGML.”

(c) 1995, ArborText, Inc. This file may be redistributed electronically as long as it remains wholly intact, including this notice and copyright. This file must not be redistributed in hard-copy form. ArborText will freely distribute this document in its original published form on request.

Native or Filtered?
Tradeoffs Between Cost and Results
Comparing the Processes
Challenges to Successful Conversion
Additional Benefits of Native SGML

Native or Filtered?

SGML has become the worldwide standard for storing document information because of its considerable benefits to both information providers and information consumers:

Multiple outputs from a single document
Vastly improved productivity
Reusability of information
Flexibility beyond traditional publishing
Vendor independence

This paper examines two fundamentally different approaches to adopting SGML within an organization: native and filtered.

Native Editors vs. Proprietary Editors

There are significant differences between native and filtered approaches to SGML. An organization must choose between these two approaches based on many factors, including:

The organization's objectives for adopting SGML;
The investment required to meet those objectives; and
Internal resistance to changing familiar tools and processes.

For the filtered approach to SGML, you use a traditional editor: either a word processing program or a desktop publishing package. Because these editors store information in different and proprietary file formats, we refer to these as proprietary editors. To create SGML from a proprietary editor, you must use software that converts between the proprietary format and SGML, a process we refer to as filtering.

With the native approach, you create SGML directly by means of a new desktop tool called a native SGML editor. A native editor looks and feels much like a typical word processing or desktop publishing package, but it does not use a proprietary file format; its storage format is SGML. Native editors are different in other ways that will become more apparent later in this white paper.

A native editor works directly with SGML documents, both manipulating and storing document information in “pure” SGML. Native SGML editors read SGML-formatted files, maintain the SGML structure throughout all manipulations, and write SGML back out (Figure 1). Because SGML is an established standard, any native editor that fully complies with the standard can read and write the same files. (You must exercise some caution here, since native editors vary in the breadth of their support for SGML.)

Figure 1 -- Native Editor

Proprietary editors only work directly with their own proprietary formats. To handle SGML, proprietary editors convert SGML documents to their proprietary format, manipulate those documents in that format, and convert back to SGML before writing the result (Figure 2). As this paper later shows, these conversion steps impose significant costs and limitations; setting up each conversion is a lot of work, and even the best conversion process still requires considerable manual intervention.

Figure 2 -- Proprietary Editor

The native and filtering approaches each raise different challenges and offer distinctly different benefits to the organization implementing SGML:

Native Editor	Filtered SGML
High initial investment	Moderate initial investment
Significantly lower operating costs	Moderate to significantly higher operating costs
Streamlined process	Adds steps to existing process
Benefits include all that SGML offers	SGML benefits limited primarily to interchange

Why Don't Proprietary Editors Work Directly in SGML?

Even though proprietary editors can automatically convert from other proprietary formats to their own format, none can do the same with SGML. Their inability to automatically convert arises because SGML embodies a fundamentally different approach to information. Unlike proprietary formats, which assign style codes to blocks of text, SGML defines containers of information and users assign text to those containers. Style codes are assigned to the information containers as well, which permits users to easily assign different styles for different output media.

The information containers are defined by an SGML structure called a “Document Type Definition” or “DTD” that remains the same for a single type of document but varies widely for different types of documents. For example, the same DTD could apply to an entire collection of legal briefs, but that DTD would differ considerably from the DTD for an automobile service manual. The only way to convert a document from a proprietary format to SGML is to set up a conversion program for that document's DTD. Each different DTD requires a different conversion setup.

In a recently published book, The SGML Implementation Guide, authors Brian Travis and Dale Waldt describe “The Holy Grail” of SGML:

“There is a strong desire to get the benefits of SGML without sacrificing the ease-of-use and comfort level users have with graphical word processors. We have been asked many times for a system that would allow users to keep using their word processors and page-makeup programs, but be able to load the files into an SGML database for the purpose of creating by-product applications and generally getting control of their data. In short, they want a two-way, production-oriented conversion facility. We call this the Holy Grail because, if it does exist, no mortal has yet seen it.”

As the authors go on to explain, converting from SGML to a proprietary format is relatively easy: it's like taking a picture of a building. Converting back to SGML is much harder: it's like trying to construct a building based on a photograph. Much knowledge about the structure is hidden from view; one must infer its structure based on its appearance, and much will be lost in the process (examples follow later in this paper).
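
The photograph analogy can be made concrete with a small sketch. The Python below is illustrative only: the tag names and the formatting assigned to each are assumptions chosen for the example, not taken from any particular DTD or product. It shows that mapping tags to formatting is a simple lookup, while the reverse lookup can return several candidate tags for the same appearance.

```python
from collections import defaultdict

# Hypothetical forward map: SGML tag -> visual formatting.
# The tags and formats are illustrative assumptions.
TAG_TO_FORMAT = {
    "emphasis": "italic",
    "foreign": "italic",
    "partnum": "italic",
    "warning": "bold",
}


def invert(mapping):
    """Group tags by the formatting they produce."""
    inverse = defaultdict(list)
    for tag, fmt in mapping.items():
        inverse[fmt].append(tag)
    return dict(inverse)


FORMAT_TO_TAGS = invert(TAG_TO_FORMAT)


def candidates(fmt):
    """Return every tag that could have produced this formatting."""
    return FORMAT_TO_TAGS.get(fmt, [])


print(candidates("italic"))  # ['emphasis', 'foreign', 'partnum']
print(candidates("bold"))    # ['warning']
```

Going from tag to format always yields one answer; going from "italic" back to a tag yields three, and nothing in the formatted text says which one the author meant.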

Tradeoffs Between Cost and Results

The goal of an SGML project can vary in complexity from producing SGML-formatted documents for interchange to improving the entire process of creating, maintaining, and distributing document information. Of course, that means the budget for an SGML project can vary from relatively modest to very expensive.

Deciding between native and filtering involves tradeoffs in both the goals and the costs of the project. Native SGML projects involve ambitious goals and require a significant investment. If the goal of a project is simply to produce SGML for interchange, the investment can be relatively modest.

The following table contrasts the typical application requirements that affect the choice of native vs. proprietary editors:

	Native Editors	Proprietary Editors
Reason for SGML	SGML documents needed to support advanced applications	SGML documents needed only for interchange; document storage uses proprietary format
Typical Document Size	Large amounts of information	Small to medium amounts of information
Typical Size of Document Repository	Large amounts of information	Small to medium amounts of information
Document Structure	Documents with regular or complex structures	Documents with little or no structure
Data Storage	Database; cannot tolerate deviations from defined structure	Flat files; structure deviations tolerable
Size of Team	Medium and large teams	Small teams or individuals
Process Changes	Streamlined process	No plan to reengineer

Filtering applications usually cost more to operate but less to implement; however, implementation costs can still be substantial. Ambitious SGML projects that call for native editors can double or triple productivity, which vastly reduces operating costs, but implementation costs can be considerably higher.

The following table illustrates the differences in costs between implementing native editors and proprietary editors:

	Native Editors	Proprietary Editors
Purchased	Purchase native editors; publishing software for paper and CD-ROM; optionally document management and workflow software	Purchase filtering software; because of the more complex workflow and larger number of documents, the purchase of document management and workflow software is often required.
Document	Adopt industry standard structures (DTDs) or adapt to meet unique needs.	Adopt industry standard structures (DTDs) or adapt to meet unique needs.
Integration	Develop and test style sheets.	Develop and test the conversion maps for each type of document.
Training	Train authors, editors, and reviewers.	Train authors to create structured documents and fix conversion problems.

Comparing the Processes

Native editors operate more efficiently; proprietary editors require additional steps that add costs (often hidden or not apparent at the start) without adding value.

To use a native editor to change an SGML document, the user performs the following steps:

Native SGML editing process

Load the document
Edit it
Save it

To use a proprietary editor to change an SGML document, the user performs the following steps:

Filtered SGML editing process

1. Locate or create the conversion filter
2. Load the document
3. Convert the document from SGML to the product's internal proprietary format
4. Edit the document
5. Convert the document back to SGML
6. Locate and correct any semantic errors reported by the converter, either by changing the source document or by changing the conversion filter
7. Repeat steps 5 and 6 until no errors remain
8. For sophisticated documents, a trainer, author, or editor must manually validate for additional semantic errors that the converter cannot detect
9. Save the document

Steps 5-8 in the process outlined above occur during the latter stages of a project, when all the slack time has been consumed and time is at a premium.

Challenges to Successful Conversion

This section describes some of the barriers to automatic conversion of SGML to proprietary formats and back to SGML. Though these barriers are very significant, converting from SGML to proprietary formats poses fewer problems than the reverse.

Mapping

To perform a conversion, the user must set up a “map” that describes the relationship between SGML structures and internal styles. Creating a map is a complicated and highly iterative task that native editors don't require.

To set up a map, the user must create a one-to-one association between SGML tags and the proprietary editor's style sheet. For example, the user might map the tag <emphasis> to the character style italic, and the tag <warning> to the paragraph style warning. From these two examples, you might infer that mapping is simple, but that impression disappears upon closer examination. Severe complications arise because of the dramatic differences between SGML and proprietary editors:

One-to-many and Many-to-one

SGML prescribes a hierarchical description of a document while proprietary systems represent only a linear structure. This means that the same SGML tag can change meanings depending on its position in the hierarchy of the document (i.e., depending on its “context”). On the other hand, a style always performs the same function (i.e., prescribes the same formatting) regardless of its position.

For example, a title tag could be a chapter title, section title, or sub-section title, depending on whether it's located within a chapter, section, or sub-section. In a native editor, the author simply inserts a title, which is convenient, since the author can promote or demote an element that includes the title without changing the title itself. (Changing a section to a chapter is “promoting” an element; changing a section to a subsection is “demoting” an element.) In a proprietary editor, the author applies a style of chapter-title, section-title, or subsection-title.

Tags	Styles
<chapter> <title>	chapter-title
<section> <title>	section-title
<subsection> <title>	subsection-title

These conversions are called “one-to-many” and “many-to-one,” and they add enormous complications to the process of setting up a conversion map. The user must carefully analyze the possibilities and set up the map to deal with every single one.
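
As a rough illustration of what such a map has to express, the sketch below keys each style on a tag together with its parent, following the table above. The data structure and the function names (`CONTEXT_MAP`, `style_for`, `tag_path_for`) are hypothetical and do not describe any actual filtering product.

```python
# Hypothetical context-sensitive map: (parent tag, tag) -> style name.
CONTEXT_MAP = {
    ("chapter", "title"): "chapter-title",
    ("section", "title"): "section-title",
    ("subsection", "title"): "subsection-title",
}


def style_for(path):
    """Pick a style from the last two tags on the path to an element."""
    if len(path) < 2:
        raise ValueError("path must include the element and its parent")
    key = (path[-2], path[-1])
    if key not in CONTEXT_MAP:
        raise KeyError(f"no style mapped for <{key[1]}> inside <{key[0]}>")
    return CONTEXT_MAP[key]


def tag_path_for(style):
    """Reverse direction: recover the (parent, tag) pair from a style."""
    for pair, name in CONTEXT_MAP.items():
        if name == style:
            return pair
    raise KeyError(f"no tag mapped for style {style!r}")


print(style_for(["book", "chapter", "title"]))             # chapter-title
print(style_for(["book", "chapter", "section", "title"]))  # section-title
print(tag_path_for("subsection-title"))                    # ('subsection', 'title')
```

Promoting a section to a chapter changes only the parent in the path, so the same title tag resolves to a different style. In the proprietary document, the author would have to reapply the style by hand.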

Nesting

SGML's context-sensitive tagging permits an author to create complex structures with ease, but causes headaches for conversion.

Tags	Styles
<list> <list-item>	list1
<list> <list-item> <list> <list-item>	list2
<list> <list-item> <list> <list-item> <list> <list-item>	list3

For example, an author writing in a native SGML editor needs only two tags (such as <list> and <list-item>) to create lists within lists to a nearly limitless number of levels. To set up a conversion map to handle this, the user must create a different style for each list level that can occur in practice; for example, if the user expects a maximum of nine levels in a list, then he must create and map nine different styles (such as list1, list2, . . . list9).
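
A minimal sketch of this depth-based mapping follows. It assumes the nine-level limit from the example above and a hypothetical `list_style` helper that receives the chain of open tags leading to a list item. Any list nested deeper than the prepared styles simply cannot be converted.

```python
MAX_LIST_LEVELS = 9  # assumed limit, matching the nine-style example


def list_style(path):
    """Return the style name for a list item, given its chain of open tags."""
    depth = path.count("list")  # counts <list> only, not <list-item>
    if depth < 1:
        raise ValueError("element is not inside a list")
    if depth > MAX_LIST_LEVELS:
        raise ValueError(f"no style prepared for list level {depth}")
    return f"list{depth}"


print(list_style(["list", "list-item"]))                       # list1
print(list_style(["list", "list-item", "list", "list-item"]))  # list2
```

The SGML side needs no such limit: the same two tags describe any depth, and only the conversion map has to enumerate the levels in advance.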

Attributes

Attributes, which represent additional information that applies to a tag, have no equivalent in proprietary systems.

Tags	Styles
<para id=xyz11>	paragraph
<para lang=French>	paragraph or French-paragraph
<para lang=English>	paragraph or English-paragraph

For example, SGML supports a unique identifier (“ID”) attribute that allows one element to refer to another; no straightforward method exists to preserve ID attributes when they are converted from SGML. In some cases, converting tags with attributes (e.g., language, skill level, or security clearance) from SGML may require many different styles in order to provide a one-to-one conversion.
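
The sketch below illustrates both problems under stated assumptions: a hypothetical converter that encodes the `lang` attribute in the style name and has nowhere to put anything else, and a simple count of how many styles a one-to-one conversion would need. The attribute values shown are invented for the example.

```python
from math import prod


def style_for_para(attrs):
    """Choose a paragraph style; report the attributes that cannot be kept."""
    lang = attrs.get("lang")
    style = f"{lang}-paragraph" if lang else "paragraph"
    dropped = sorted(name for name in attrs if name != "lang")
    return style, dropped


def styles_needed(attribute_values):
    """Count the styles required for one style per attribute combination."""
    return prod(len(values) for values in attribute_values.values())


print(style_for_para({"lang": "French"}))  # ('French-paragraph', [])
print(style_for_para({"id": "xyz11"}))     # ('paragraph', ['id'])

# Illustrative attribute values, not taken from any real DTD.
print(styles_needed({
    "lang": ["English", "French"],
    "skill": ["novice", "expert"],
    "clearance": ["public", "secret"],
}))  # 8
```

Three two-valued attributes already call for eight paragraph styles, and the ID attribute is lost regardless.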

Container tags

SGML uses “container” tags that have no content but simply indicate the start or end of a group of information elements; proprietary systems have no analogous feature. As a result, converting from proprietary formats to SGML requires the user to set up logic to generate container tags based on an element's relative position to other elements.

Tags	Styles
<list> <list-item>	list1
<list-item>	list1
</list> <para>	paragraph

Consider, for example, converting a list from a proprietary format to SGML. Converting the first paragraph with a list style involves creating both the container tag (which indicates the start of a list) and the item tag (which indicates the list item itself). Similarly, when the converter encounters the first paragraph with a style other than list, it must first generate the tag that signals the end of the container.
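
The logic just described might look something like the following sketch. It handles only the single-level case from the table above, and the function name and the (style, text) input format are assumptions made for illustration.

```python
def styles_to_sgml(paragraphs):
    """Convert (style, text) pairs to SGML lines, inferring <list> containers."""
    out = []
    in_list = False
    for style, text in paragraphs:
        if style == "list1":
            if not in_list:
                out.append("<list>")  # first list paragraph opens the container
                in_list = True
            out.append(f"<list-item>{text}</list-item>")
        else:
            if in_list:
                out.append("</list>")  # first non-list paragraph closes it
                in_list = False
            out.append(f"<para>{text}</para>")
    if in_list:
        out.append("</list>")  # document ended while still inside a list
    return out


doc = [("list1", "First"), ("list1", "Second"), ("paragraph", "After")]
for line in styles_to_sgml(doc):
    print(line)
```

Even this simple version must guess: two separate lists that happen to be adjacent in the source are indistinguishable from one long list, because the proprietary format records no boundary between them.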

Because of the challenges involved in setting up a map, a leading desktop publishing vendor admits to supporting only 20% of the more than 300 tags in the DocBook DTD (an industry standard for computer hardware and software documentation); increasing its support to 50-60% would be extremely expensive, and supporting 100% would be impossible.

Authoring for Eventual Conversion to SGML

Authoring for eventual conversion to SGML places far more demands on the author than either conventional authoring or native SGML authoring:

No direct formatting: To support conversion to SGML, authors must apply formatting through styles and never directly. In other words, if the author wishes to make a word italic, the author must apply a character style such as emphasis and not highlight the word and italicize it.
If an author ever applies formatting directly instead of using styles, conversion fails. For example, if the author changes a title from a section title to a subsection title by changing its font directly, its style remains section-title and it will convert improperly. Many authors find it very difficult to apply styles with absolute consistency.
Multiple styles for the same appearance: Different styles may have the same appearance, which makes it difficult or impossible for authors to validate their text. For example, italicized text might indicate a foreign word, part number, or emphasized word. To set up for conversion to SGML, the author would have to apply a different character style (e.g., foreign, partnum, emphasis). The only way for an author to verify later that the styles are correct is to move the cursor through the entire document and inspect each style. Or as an alternative, an expert can review the SGML version of the document and check each tag.
Multiple styles for similar tags: SGML allows the same tag to serve multiple purposes when it's used in different contexts. For example, DocBook allows <Replaceable> inside <Command>. If you use <Replaceable> by itself, it means something different. To handle the equivalent in a word processor, you would need the following character styles: Command, Replaceable, and ReplaceableInCommand. Now consider that DocBook has many dozens of inline elements, with very complex interrelationships, and you can see that creating even a partial conversion map could take weeks of work.
Vastly increased number of styles: The number of different styles needed to support a typical SGML document structure can exceed the normal number of styles by a factor of 5 to 10. For example, each of DocBook's 300 tags requires a corresponding style, while a similar technical manual might have only 40 or 50 styles. Because of the challenges with mapping styles to tags, supporting conversion of a technical manual to DocBook could as much as double the number of styles needed, raising the total to 600 or more.
No enforcement of structure: Typical documents virtually never convert to SGML without errors because proprietary systems don't enforce valid structure. On the other hand, most native editors continuously enforce valid structure, which prevents all these problems:
- Inserting invalid structures: Proprietary editors cannot prevent the author from inserting a style that makes no sense, such as a chapter head within a subsection; most native editors, on the other hand, simply prevent the author from inserting invalid tags.
- Omitting required structures: Proprietary systems cannot enforce simple structural rules, such as ensuring that the author inserts a title at the start of each section. Most native editors, on the other hand, not only insert a title tag automatically when the author inserts a section, but they also prevent the author from later deleting the title tag, because that would create an invalid structure.
- Constructing invalid structures: Authors follow many common sense rules, such as “a list must contain at least two elements,” but proprietary systems cannot enforce such rules. On the other hand, most native editors flag this sort of error before saving and provide an easy way to navigate to the problem.
No guidance: Most native editors not only enforce valid structure, they also guide authors by continuously displaying a list of valid tags at the current insertion point. For example, at any point while editing a DocBook-based document, perhaps only 10-20 tags are valid; if the author is editing a similar technical manual on a proprietary system, all 300+ styles are available at any point in the document.
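
The enforcement and guidance described in the last two items can be pictured with a toy content model. The sketch below is a deliberately simplified assumption: real DTD content models are far richer, and the element names and rules here are invented for illustration. It shows how a structure-aware editor could list the tags valid inside an element and flag the three kinds of errors named above.

```python
# Toy content model (hypothetical): parent tag -> tags allowed inside it.
CONTENT_MODEL = {
    "chapter": ["title", "para", "section"],
    "section": ["title", "para", "list", "subsection"],
    "subsection": ["title", "para", "list"],
    "list": ["list-item"],
}
REQUIRED_FIRST = {"chapter": "title", "section": "title", "subsection": "title"}
MIN_CHILDREN = {"list": 2}  # "a list must contain at least two elements"


def valid_tags(parent):
    """Tags an editor would offer inside the given parent element."""
    return CONTENT_MODEL.get(parent, [])


def check(parent, children):
    """Return a list of structure errors for one element's children."""
    errors = []
    allowed = valid_tags(parent)
    for child in children:
        if child not in allowed:
            errors.append(f"<{child}> is not allowed inside <{parent}>")
    required = REQUIRED_FIRST.get(parent)
    if required and (not children or children[0] != required):
        errors.append(f"<{parent}> must start with <{required}>")
    minimum = MIN_CHILDREN.get(parent, 0)
    if len(children) < minimum:
        errors.append(f"<{parent}> needs at least {minimum} children")
    return errors


print(valid_tags("list"))                      # ['list-item']
print(check("subsection", ["title", "chapter"]))
print(check("section", ["para"]))
print(check("list", ["list-item"]))
```

A proprietary editor has no equivalent of this table to consult, so the same mistakes surface only later, as conversion errors.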

No Round Trip Conversions

Typical documents cannot survive a “round trip” conversion (i.e., from SGML to proprietary back to SGML) because SGML supports data structures that have no equivalent in proprietary systems. For example:

In an SGML document, a “marked section” is a section of the document that is “included” or “excluded” based on the status of a variable. Marked sections always remain part of the document file, but are included or omitted when sent to a publishing system. Marked sections can be nested within each other or placed in parallel to each other, and can accommodate intricate logic and functions. Because of these potential complexities, marked sections cannot be converted to a proprietary system through any practical means. To handle marked sections on import, a converter simply includes any section marked for inclusion as if it were unconditional and ignores any section marked for exclusion. The excluded sections no longer appear in the document when it is converted back to SGML.
Although most proprietary systems provide functionality similar to SGML text and file entities, converters don't handle these properly because the required programming is too complex. Instead, the converters simply include entities in the resulting document as if they were an indivisible part of the document.
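
The marked-section behavior described above can be simulated in a few lines. This is a sketch under simplifying assumptions: the document is modeled as a list of segments rather than parsed SGML, and the function names are hypothetical. INCLUDE and IGNORE are the SGML status keywords for marked sections.

```python
def import_to_proprietary(segments):
    """Flatten segments the way the converter described above would."""
    flat = []
    for seg in segments:
        if seg["kind"] == "text":
            flat.append(seg["content"])
        elif seg["status"] == "INCLUDE":
            flat.append(seg["content"])  # kept, but no longer conditional
        # IGNORE sections are dropped entirely
    return flat


def export_to_sgml(flat):
    """Convert back: everything is now plain, unconditional text."""
    return [{"kind": "text", "content": content} for content in flat]


original = [
    {"kind": "text", "content": "Common introduction."},
    {"kind": "marked", "status": "INCLUDE", "content": "Model A procedure."},
    {"kind": "marked", "status": "IGNORE", "content": "Model B procedure."},
]

round_trip = export_to_sgml(import_to_proprietary(original))
for seg in round_trip:
    print(seg)
```

After the round trip, the Model B text is gone and the Model A text has lost its conditional status, so the result no longer equals the original document.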

Maintaining Duplicate Information

Because of the myriad complexities of conversion and the difficulties of round trip conversions, users who endure the pain of conversion often end up maintaining both the SGML form of their document and the proprietary form. For example, after making relatively small changes to the original source documents, an author typically modifies the SGML documents directly in order to avoid the pain of another conversion. This approach doubles the cost of maintaining documents, an activity that often represents a large portion of a publication department's work.

Additional Benefits of Native SGML

In addition to removing the costs, complexities, and inaccuracies of converting between SGML and proprietary formats, native SGML editors provide many additional significant benefits:

Disciplined editing environment: Native editors continuously display the tags that are valid at the current insertion point, which assures continuously valid data and enforces corporate standards.
Consistently valid data for databases: Some native editors facilitate the use of databases. Most large SGML-based installations store documents in databases for several compelling reasons:
- Ability to handle huge amounts of data efficiently
- Reuse of the same data in different documents, which eliminates redundancy and simplifies processes
- Infinitely variable publishing -- even query-driven publishing
- Access controls
- Potential for additional automation. For example, electronic parts books could connect to an order entry system for automatic parts ordering; service guides could connect to the equipment being repaired to provide automatic diagnostic assistance; and user manuals could adapt to each user's past experience or current equipment configuration.
Access to high-tech SGML constructs:
- marked sections
- proper entity handling
- conref attributes
- subdoc
Context-sensitive element promotion/demotion: Because native editors are inherently aware of a document's structure, they can automatically promote or demote subordinate elements when a superior element is promoted or demoted. For example, a native editor can automatically convert a subsection into a section when it is cut out of a section and pasted into a chapter.
Automatic/batch multiple outputs: SGML is a data format, not a document format, so it lends itself to applications where new documents are generated automatically, even interactively, as the data changes. For example, in an application that publishes SGML documents in HTML for the World Wide Web, a batch program could automatically convert any new or changed SGML files to HTML every night. As another example, a batch program that produces a monthly CD-ROM update could automatically capture and convert new information (if conversion is necessary).
Superior electronic review capabilities: In many organizations, document reviews remain a manual process. Multiple reviewers either mark up multiple printed copies, leaving the author to collate the comments later, or mark up the same copy in different colors of ink. This process provides no easy way for reviewers to track the disposition of their comments, and no way at all to maintain a secure audit trail of changes.
Some publishing systems support on-line document review, but these systems typically lack strong facilities for adapting to each organization's process; sorting and viewing comments in multiple ways; permitting comments on comments; or indefinitely maintaining a record of comments that were accepted and comments that were rejected.

Because SGML-formatted information lends itself to various types of automation (such as database storage), it's also superior for powerful “electronic review” applications that support multiple reviewers, multiple versions, and a complex review process.
