[Mirrored from: http://www.arbortext.com/natifilt.html]
SGML (Standard Generalized Markup Language) has become the world standard for exchanging information. As a result of the significant benefits of adopting SGML, many organizations are currently planning to introduce SGML in their document authoring and publishing systems. These organizations must choose between two fundamentally different approaches: native SGML and filtered SGML.
(c) 1995, ArborText, Inc. This file may be redistributed electronically as long as it remains wholly intact, including this notice and copyright. This file must not be redistributed in hard-copy form. ArborText will freely distribute this document in its original published form on request.
SGML has become the worldwide standard for storing document information because of its considerable benefits to both information providers and information consumers:
This paper examines two fundamentally different approaches to adopting SGML within an organization: native and filtered.
There are significant differences between native and filtered approaches to SGML. An organization must choose between these two approaches based on many factors, including:
For the filtered approach to SGML, you use a traditional editor: either a word processing program or a desktop publishing package. Because these editors store information in different and proprietary file formats, we refer to these as proprietary editors. To create SGML from a proprietary editor, you must use software that converts between the proprietary format and SGML, a process we refer to as filtering.
With the native approach, you create SGML directly by means of a new desktop tool called a native SGML editor. A native editor looks and feels much like a typical word processing or desktop publishing package, but it does not use a proprietary file format; its storage format is SGML. Native editors are different in other ways that will become more apparent later in this white paper.
A native editor works directly with SGML documents, both manipulating and storing document information in pure SGML. Native SGML editors read SGML-formatted files, maintain the SGML structure throughout all manipulations, and write SGML back out (Figure 1). Because SGML is an established standard, any native editor that fully complies with the standard can read and write the same files. (You must exercise some caution here, since native editors vary in the breadth of their support for SGML.)
Proprietary editors only work directly with their own proprietary formats. To handle SGML, proprietary editors convert SGML documents to their proprietary format, manipulate those documents in that format, and convert back to SGML before writing the result (Figure 2). As this paper later shows, these conversion steps impose significant costs and limitations; setting up each conversion is a lot of work, and even the best conversion process still requires considerable manual intervention.
The native and filtering approaches each raise different challenges and offer distinctly different benefits to the organization implementing SGML:
Native Editor | Filtered SGML |
High initial investment | Moderate initial investment |
Significantly lower operating costs | Moderate to significantly higher operating costs |
Streamlined process | Adds steps to existing process |
Benefits include all that SGML offers | SGML benefits limited primarily to interchange |
Even though proprietary editors can automatically convert from other proprietary formats to their own format, none can do the same with SGML. Their inability to automatically convert arises because SGML embodies a fundamentally different approach to information. Unlike proprietary formats, which assign style codes to blocks of text, SGML defines containers of information and users assign text to those containers. Style codes are assigned to the information containers as well, which permits users to easily assign different styles for different output media.
The information containers are defined by an SGML structure called a Document Type Definition or DTD that remains the same for a single type of document but varies widely for different types of documents. For example, the same DTD could apply to an entire collection of legal briefs, but that DTD would differ considerably from the DTD for an automobile service manual. The only way to convert a document from a proprietary format to SGML is to set up a conversion program for that document's DTD. Each different DTD requires a different conversion setup.
In a recently published book, The SGML Implementation Guide, authors Brian Travis and Dale Waldt describe The Holy Grail of SGML:
There is a strong desire to get the benefits of SGML without sacrificing the ease-of-use and comfort level users have with graphical word processors. We have been asked many times for a system that would allow users to keep using their word processors and page-makeup programs, but be able to load the files into an SGML database for the purpose of creating by-product applications and generally getting control of their data. In short, they want a two-way, production-oriented conversion facility. We call this the Holy Grail because, if it does exist, no mortal has yet seen it.
As the authors go on to explain, converting from SGML to a proprietary format is relatively easy: it's like taking a picture of a building. Converting back to SGML is much harder: it's like trying to construct a building based on a photograph. Much knowledge about the structure is hidden from view; one must infer its structure based on its appearance, and much will be lost in the process (examples follow later in this paper).
The goal of an SGML project can vary in complexity from producing SGML-formatted documents for interchange to improving the entire process of creating, maintaining, and distributing document information. Of course, that means the budget for an SGML project can vary from relatively modest to very expensive.
Deciding between native and filtering involves tradeoffs in both the goals and the costs of the project. Native SGML projects involve ambitious goals and require a significant investment. If the goal of a project is simply to produce SGML for interchange, the investment can be relatively modest.
The following table contrasts the typical application requirements that affect the choice of native vs. proprietary editors:
| Native Editors | Proprietary Editors |
Reason for SGML | SGML documents needed to support advanced applications | SGML documents needed only for interchange; document storage uses proprietary format |
Typical Document Size | Large amounts of information | Small to medium amounts of information |
Typical Size of Document Repository | Large amounts of information | Small to medium amounts of information |
Document Structure | Documents with regular or complex structures | Documents with little or no structure |
Data Storage | Database; cannot tolerate deviations from defined structure | Flat files; structure deviations tolerable |
Size of Team | Medium and large teams | Small teams or individuals |
Process Changes | Streamlined process | No plan to reengineer |
Filtering applications usually cost more to operate but less to implement; however, implementation costs can still be substantial. Ambitious SGML projects that call for native editors can double or triple productivity, which vastly reduces operating costs, but implementation costs can be considerably higher.
The following table illustrates the differences in costs between implementing native editors and proprietary editors:
| Native Editors | Proprietary Editors |
Purchased | Purchase native editors; publishing software for paper and CD-ROM; optionally document management and workflow software | Purchase filtering software; because of the more complex workflow and larger number of documents, the purchase of document management and workflow software is often required. |
Document | Adopt industry standard structures (DTDs) or adapt to meet unique needs. | Adopt industry standard structures (DTDs) or adapt to meet unique needs. |
Integration | Develop and test style sheets. | Develop and test the conversion maps for each type of document. |
Training | Train authors, editors, and reviewers. | Train authors to create structured documents and fix conversion problems. |
Native editors operate more efficiently; proprietary editors require additional steps that add costs (often hidden or not apparent at the start) without adding value.
To use a native editor to change an SGML document, the user performs the following steps:
To use a proprietary editor to change an SGML document, the user performs the following steps:
Steps 5-8 in the process outlined above occur during the latter stages of a project, when all the slack time has been consumed and time is at a premium.
This section describes some of the barriers to automatic conversion of SGML to proprietary formats and back to SGML. Though these barriers are very significant, converting from SGML to proprietary formats poses fewer problems than the reverse.
To perform a conversion, the user must set up a map that describes the relationship between SGML structures and internal styles. Creating a map is a complicated and highly iterative task that native editors don't require.
To set up a map, the user must create a one-to-one association between SGML tags and the proprietary editor's style sheet. For example, the user might map the tag <emphasis> to the character style italic, and the tag <warning> to the paragraph style warning. From these two examples, you might infer that mapping is simple, but that impression disappears upon closer examination. Severe complications arise because of the dramatic differences between SGML and proprietary editors:
SGML prescribes a hierarchical description of a document while proprietary systems represent only a linear structure. This means that the same SGML tag can change meanings depending on its position in the hierarchy of the document (i.e., depending on its context). On the other hand, a style always performs the same function (i.e., prescribes the same formatting) regardless of its position.
For example, a title tag could be a chapter title, section title, or sub-section title, depending on whether it's located within a chapter, section, or sub-section. In a native editor, the author simply inserts a title, which is convenient, since the author can promote or demote an element that includes the title without changing the title itself. (Changing a section to a chapter is promoting an element; changing a section to a subsection is demoting an element.) In a proprietary editor, the author applies a style of chapter-title, section-title, or subsection-title.
Tags | Styles |
<chapter> <title> | chapter-title |
<section> <title> | section-title |
<subsection> <title> | subsection-title |
These conversions are called one-to-many and many-to-one, and they add enormous complications to the process of setting up a conversion map. The user must carefully analyze the possibilities and set up the map to deal with every single one.
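The context dependence described above can be sketched in a few lines of Python. This is only an illustration, not the behavior of any particular filtering product: the names TITLE_STYLES, style_for, and tags_for are hypothetical, and the map covers only the three title styles from the table.

```python
# Hypothetical conversion map: the style for a <title> depends on its parent.
TITLE_STYLES = {
    "chapter": "chapter-title",
    "section": "section-title",
    "subsection": "subsection-title",
}


def style_for(path):
    """SGML to style: pick a style from the element's ancestor path.

    `path` lists element names from the root down, e.g. ["chapter", "title"].
    The same tag ("title") yields different styles in different contexts.
    """
    if len(path) >= 2 and path[-1] == "title" and path[-2] in TITLE_STYLES:
        return TITLE_STYLES[path[-2]]
    raise KeyError("no style mapped for " + "/".join(path))


def tags_for(style):
    """Style to SGML: one style expands into a container tag plus a title tag."""
    for parent, mapped_style in TITLE_STYLES.items():
        if mapped_style == style:
            return [parent, "title"]
    raise KeyError("no tags mapped for " + style)
```

For instance, style_for(["chapter", "section", "title"]) yields "section-title", while tags_for("subsection-title") yields ["subsection", "title"]. Every context that the map does not list raises an error, which mirrors the need to anticipate every possibility when setting up the map.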
SGML's context-sensitive tagging permits an author to create complex structures with ease, but causes headaches for conversion.
Tags | Styles |
<list> <list-item> | list1 |
<list> <list-item> <list> <list-item> | list2 |
<list> <list-item> <list> <list-item> <list> <list-item> | list3 |
For example, an author writing in a native SGML editor needs only two tags (such as <list> and <list-item>) to create lists within lists to a nearly limitless number of levels. To set up a conversion map to handle this, the user must create a different style for each list level that can occur in practice; for example, if the user expects a maximum of nine levels in a list, then he must create and map nine different styles (such as list1, list2, . . . list9).
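As a rough sketch of the mapping just described, the following Python function derives a list style from the nesting depth of a list item. The function name list_style and the max_levels parameter are assumptions made for this example; the style names list1 through list9 follow the paragraph above.

```python
def list_style(path, max_levels=9):
    """Return the style name for a list item given its ancestor path.

    `path` lists element names from the outermost list down to the item,
    e.g. ["list", "list-item", "list", "list-item"] for a second-level item.
    Only levels up to `max_levels` have a style; deeper nesting cannot convert.
    """
    depth = path.count("list")  # counts only "list", not "list-item"
    if depth < 1:
        raise ValueError("path contains no <list> element")
    if depth > max_levels:
        raise ValueError("no style defined for list level %d" % depth)
    return "list%d" % depth
```

Two SGML tags cover any depth, but the style side needs one entry per level, and a document nested one level deeper than the map anticipates cannot be converted.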
Attributes, which represent additional information that applies to a tag, have no equivalent in proprietary systems.
Tags | Styles |
<para id=xyz11> | paragraph |
<para lang=French> | paragraph or French-paragraph |
<para lang=English> | paragraph or English-paragraph |
For example, SGML supports a unique identifier (ID) attribute that allows one element to refer to another; no straightforward method exists to preserve ID attributes when they are converted from SGML. In some cases, converting tags with attributes (e.g., language, skill level, security clearance, etc.) to SGML may require many different styles in order to provide a one-to-one conversion.
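To see how quickly the number of styles grows, consider this minimal Python sketch. It assumes that a one-to-one conversion needs one style per combination of attribute values; the function name styles_for_attributes, the style naming scheme, and the security values are hypothetical.

```python
from itertools import product


def styles_for_attributes(base, attributes):
    """List the styles needed to encode every attribute combination.

    `attributes` maps an attribute name to its allowed values. Each
    combination of values becomes one style name built on `base`.
    """
    names = sorted(attributes)
    combos = product(*(attributes[name] for name in names))
    return ["-".join(combo + (base,)) for combo in combos]


# One attribute with two values needs two styles.
one = styles_for_attributes("paragraph", {"lang": ["French", "English"]})

# Adding a second attribute with three values multiplies the count to six.
two = styles_for_attributes(
    "paragraph",
    {"lang": ["French", "English"],
     "security": ["public", "internal", "secret"]},
)
```

The count is the product of the number of values for each attribute, so it grows multiplicatively as attributes are added. An ID attribute cannot be handled this way at all, because its value is unique to each element.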
SGML uses container tags that have no content but simply indicate the start or end of a group of information elements; proprietary systems have no analogous feature. As a result, converting from proprietary formats to SGML requires the user to set up logic to generate container tags based on an element's relative position to other elements.
Tags | Styles |
<list> <list-item> | list1 |
<list-item> | list1 |
</list> <para> | paragraph |
Consider, for example, converting a list from a proprietary format to SGML. Converting the first paragraph with a list style involves creating both the container tag (which indicates the start of a list) and the item tag (which indicates the list item itself). Similarly, when the converter encounters the first paragraph with a style other than list, it must first generate the tag that signals the end of the container.
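The logic for generating container tags might look something like the following Python sketch. It is deliberately simplified and is not taken from any real filter: it handles a single list level, it treats every style other than list1 as a plain paragraph, and the function name to_sgml and the input format (a list of style and text pairs) are assumptions.

```python
def to_sgml(paragraphs):
    """Convert a linear run of (style, text) paragraphs to SGML lines.

    The <list> and </list> container tags do not exist in the input; they
    are inferred from each paragraph's position relative to its neighbors.
    """
    output = []
    in_list = False
    for style, text in paragraphs:
        if style == "list1":
            if not in_list:
                # First list paragraph: open the container before the item.
                output.append("<list>")
                in_list = True
            output.append("<list-item>%s</list-item>" % text)
        else:
            if in_list:
                # First non-list paragraph: close the container first.
                output.append("</list>")
                in_list = False
            output.append("<para>%s</para>" % text)
    if in_list:
        # The document ended while still inside a list.
        output.append("</list>")
    return output
```

Even this single-level case needs state that carries from one paragraph to the next. Nested lists, or two separate lists with no paragraph between them, would need further rules that the styles alone do not supply.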
Because of the challenges involved in setting up a map, a leading desktop publishing vendor admits to supporting only 20% of the over 300 tags in the DocBook DTD (an industry standard for computer hardware and software documentation); to increase their support to 50-60% would be extremely expensive; to support 100% would be impossible.
Authoring for eventual conversion to SGML places far more demands on the author than either conventional authoring or native SGML authoring:
If an author ever applies formatting directly instead of using styles, conversion fails. For example, if the author changes a title from a section title to a subsection title by changing its font directly, its style remains section-title and it will convert improperly. Many authors find it very difficult to apply styles with absolute consistency.
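The failure described above follows from the fact that a converter can only see style names. The Python sketch below illustrates this under stated assumptions: the paragraph records, the font_override field, and the names TAGS_FOR_STYLE, convert, and find_overrides are all hypothetical.

```python
# Hypothetical map from style names to SGML tag paths.
TAGS_FOR_STYLE = {
    "section-title": ("section", "title"),
    "subsection-title": ("subsection", "title"),
}


def convert(paragraph):
    """Choose SGML tags from the style name alone; direct formatting is ignored."""
    return TAGS_FOR_STYLE[paragraph["style"]]


def find_overrides(paragraphs):
    """Return the text of paragraphs that carry direct formatting.

    A pre-conversion check like this can flag likely trouble spots, but it
    cannot tell what structure the author intended.
    """
    return [p["text"] for p in paragraphs if p.get("font_override")]


# The author shrank the font to make this look like a subsection title,
# but the style is still section-title.
demoted = {"style": "section-title", "text": "Setup", "font_override": "10pt"}
normal = {"style": "subsection-title", "text": "Options"}
```

Here convert(demoted) still returns ("section", "title"), so the visual demotion is lost in the SGML output.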
Typical documents cannot survive a round trip conversion (i.e., from SGML to proprietary back to SGML) because SGML supports data structures that have no equivalent in proprietary systems. For example:
Because of the myriad complexities of conversion and the difficulties of round trip conversions, users who endure the pain of conversion often end up maintaining both the SGML form of their document and the proprietary form. For example, after making relatively small changes to the original source documents, an author typically modifies the SGML documents directly in order to avoid the pain of another conversion. This approach doubles the cost of maintaining documents, an activity that often represents a large portion of a publication department's work.
In addition to removing the costs, complexities, and inaccuracies of converting between SGML and proprietary formats, native SGML editors provide many additional significant benefits:
Some publishing systems support on-line document review, but these systems typically lack strong facilities for adapting to each organization's process; sorting and viewing comments in multiple ways; permitting comments on comments; or indefinitely maintaining a record of comments that were accepted and comments that were rejected.
Because SGML-formatted information lends itself to various types of automation (such as database storage), it's also superior for powerful electronic review applications that support multiple reviewers, multiple versions, and a complex review process.