[Mirrored from: http://www.ileaf.com/avhome/veggies.html]

Whitepaper

The Art of SGML Conversion:
Eating Your Vegetables and Enjoying Dessert

By Eric Severson
eric@avalanche.com
(303) 449-5032
© Avalanche Development Company/Interleaf Inc.
January 1995

Permission to redistribute this whitepaper is granted, provided that no changes are made, and that this notice and the above attributions are included in all copies.

Introduction

If you want to enjoy the benefits of SGML, your data needs to be in SGML format. That seems logical enough. However, for most applications, getting to SGML is not just a simple translation process. Because the goal is a much richer information environment, conversion to SGML requires a process of document analysis and information refinement in addition to migrating the data itself. This white paper explores the issues involved in moving to SGML, offers techniques for making this process as effective and painless as possible, and shows how the steps in the SGML conversion process are directly related to the benefits you get once conversion is complete.

Eating Your Vegetables

SGML conversions have a reputation for being worthwhile but not necessarily lots of fun, much like having to eat your vegetables before you get dessert.

When I was young, I worked hard not to eat my vegetables. My goal was dessert, and I couldn't understand why vegetables had to stand in my way. You might reasonably ask the same kind of questions about SGML and SGML conversion. Since we're just talking about moving data around, can't this process be easy? Can't we just move right to the dessert -- information access, reuse, interchange, electronic distribution, and so forth? Actually, with the right kinds of software and expertise, conversion pain can be minimized. However, like understanding that vegetables are critical to your health, understanding the nature of SGML conversion is the first step to success.

So What's for Dessert?

While initial SGML efforts revolved around standards compliance, people are focusing on a lot more today. SGML is being seen as the key to leveraging information resources, freeing the tremendous amount of strategic business information that is locked away in documents. This information is locked away because, even though it may be stored in electronic form, it usually resides in individual proprietary systems whose sole purpose is to produce composed hardcopy for people to read. You can access the information in these documents, but only if you separately pick up and browse each one.

Simple document management systems -- like catalog cards in a library -- can help provide a roadmap to find the documents you might want to read. However, your success at finding them is limited by how well the titles, and other predefined catalog card classifications, actually represent what's inside. Furthermore, this representation needs to be consistent with your particular point of view given your particular task at hand. And you can't really avoid poring through each document turned up by the search.

Full text retrieval -- searching for specific words or phrases across the entire pool of text in a document collection -- helps you see inside the documents, but is limited by authors having actually picked your assumed words and phrases. And unlike your natural impulse when you read a document, full text retrieval has no notion of skimming through headings and section titles to narrow down the search. It gives equal weight to phrases found in main titles and phrases found in footnotes, because it doesn't know which is which.

Suppose you find the documents you want: what do you do next? Probably, you want to reuse the information in some way, combining it with other data and republishing it in a new format, or passing it along to a separate process or other areas of the organization. Once again, the proprietary, page-oriented formats stand in the way: I can't use your information unless I also use your composition system and your page layout specifications.

Unlocking document information requires an open systems approach, which is what SGML is all about. Using SGML, open systems encode the logical structure of objects below the document level: the various levels of headings, lists and so forth that people pay attention to when actually reading the information. Furthermore, SGML encodes this information in a neutral format, keeping logical content separate from specific page layout decisions, and allowing flexible interchange across different platforms and applications.

By taking this approach, SGML maximizes both the current and future usefulness of your information resources, facilitating all of the following:

Greatly enhanced information access and retrieval.
Electronic distribution and sharing of information online.
Flexible information reuse, including republishing with new formats and new media, portability across platforms and systems, and readiness for unanticipated uses in the future.
Powerful document management and assembly for 'just in time' information.

Why Isn't This Easy?

Moving to SGML is challenging because truly getting a handle on your information is challenging; making it accessible and reusable and interchangeable takes some doing. But that's exactly why this process is so important.

People tend to think of conversions as a filtering or translation process, like moving a WordPerfect document into Microsoft Word. This kind of conversion takes one set of information and transforms it to some equivalent set of information. Only the underlying file format is different.

By contrast, SGML conversion typically involves building a bridge between the world of hardcopy and word processing documents (where logical structure is perceived visually by the reader), and 'intelligent' documents (where logical structure is explicitly encoded). The whole point of SGML conversions, and the reason they enable document reuse, electronic document delivery, and other benefits, is that they necessarily involve information enrichment, adding more than was originally there:

SGML encodes logical structure, but in most documents structure is expressed visually, not through well-defined markup. To get to SGML, implicit structure must be made explicit. By using SGML tags to encode each document object, electronic delivery tools, document management software, and other document processing software can have access to the full richness of document structure.
SGML objects, having been explicitly identified, must be placed into a strict hierarchical context. In most documents, the object structure is flat. The hierarchy, expressed in SGML, enables the outline views, tables of contents, and other structure browsing capabilities of electronic delivery systems, giving you something other than a mere 'page turner.'
SGML objects have other properties in addition to content. To express this data, 'attributes' must be added. The attributes contain information that enables 'linking' to different parts of the collection, conditional display of parts of the text, enhanced retrieval, and other important features.
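
The three kinds of enrichment listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a flat run of already-recognized objects, and the element names (doc, section, title, para) and the generated id attribute are hypothetical, not drawn from any particular DTD.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_hierarchy(flat_objects):
    """Nest a flat run of (kind, level, text) objects into sections.

    `level` is only meaningful for headings (1 = top level).
    """
    root = Node("doc")
    stack = [(0, root)]  # (heading level, currently open container)
    count = 0
    for kind, level, text in flat_objects:
        if kind == "heading":
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            count += 1
            # The id attribute is the hook that later linking relies on.
            section = Node("section", attrs={"id": f"s{count}"})
            section.children.append(Node("title", text))
            stack[-1][1].children.append(section)
            stack.append((level, section))
        else:
            stack[-1][1].children.append(Node(kind, text))
    return root


def to_sgml(node, indent=0):
    """Serialize the tree with explicit start and end tags (no escaping)."""
    pad = "  " * indent
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    if not node.children:
        return f"{pad}<{node.tag}{attrs}>{node.text}</{node.tag}>"
    inner = "\n".join(to_sgml(child, indent + 1) for child in node.children)
    return f"{pad}<{node.tag}{attrs}>\n{inner}\n{pad}</{node.tag}>"
```

Calling build_hierarchy on a list such as [("heading", 1, "Intro"), ("para", 0, "Hello"), ("heading", 2, "Detail")] yields a tree in which the second section sits inside the first, which is exactly the information a flat input does not carry.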

The Lay of the Land

As a basepoint for exploring the ins and outs of SGML conversion, let's take a quick look at what makes up the process. At a high level, there are four major components:

The SGML application to which you are converting, centered around an SGML Document Type Definition (DTD).
The documents to be converted.
The conversion itself, which results in SGML-tagged documents.
A quality control process to ensure the SGML output complies with the target application.

In general, any large-scale conversion effort is a manufacturing process (in this case, what we're 'making' is SGML documents), and should be set up as a production line. As we'll see, the great bulk of this line can be automated with the proper strategy and the right software tools.

The SGML Application

The important thing to understand about SGML applications is that they're not all the same. There's actually no such thing as 'converting to SGML,' only converting to a specific SGML application. SGML is not a set of tags, nor even a specific way of looking at your information, but rather a language for defining the way you want to look at your information. Or in some cases, how someone else wants you to look at your information.

SGML applications are always based on a specific Document Type Definition or DTD. SGML DTDs catalog the information objects ('elements') of interest, including their names ('generic identifiers'), properties ('attributes'), and allowed content ('content models'). Through the content models, the DTD also specifies all the allowed relationships between elements, including hierarchical document structure, specific order in which elements must appear, how many times each element can appear, what's required vs. optional, and -- through a slightly different mechanism -- the possible links between elements.
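
To make the idea of a content model concrete, here is a minimal Python sketch. It assumes a tiny, invented application with two elements, and it approximates each content model as a regular expression over child element names. A real DTD expresses these rules in SGML declaration syntax and a real parser does far more; the element names and rules below are hypothetical.

```python
import re

# Hypothetical content models, written as regular expressions over
# space-terminated child element names. In DTD syntax the same rules
# might read roughly:
#   <!ELEMENT section - - (title, (para | list | section)*)>
#   <!ELEMENT list    - - (item, item+)>
CONTENT_MODELS = {
    "section": r"(title )((para|list|section) )*",
    "list": r"(item ){2,}",  # order is trivial here, but at least two items
}


def content_is_valid(element, child_names):
    """Check the sequence of child element names against the element's model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(model, sequence) is not None
```

With these rules, a section whose children are title, para, list is accepted, while one that starts with a para is rejected because the title is required and must come first. A list with a single item is also rejected, which is the kind of restriction worth weighing carefully when the input documents were never written with it in mind.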

One of the first steps in building a new SGML application and set of DTDs is document analysis. This process answers questions like:

What kinds of documents exist, and what common classes of documents can be identified?
What are the basic structural components and other logical objects that occur within each document type?
In addition to text content, what other information or properties might be assigned to each object type?
What are the logical relationships between each of the objects?
What do you want to do with your information? What kinds of structures and relationships do you want to encode in SGML in order to drive electronic document delivery, document management, and the other benefits that are motivating SGML conversion in the first place?

Document analysis is also required for SGML conversion. That's because conversion involves matching up your input to the kinds of information elements and relationships assumed in the SGML application. In this case, the questions are:

What form(s) of input am I dealing with (hardcopy or electronic, etc.)? What are the graphics, equations and other formats that may need additional conversion?
How 'structured' is the input? That is, how much explicit encoding of structural objects, row/column tables, links, etc. is already available? How consistently have these conventions been followed?
How do the 'objects' (visually or explicitly encoded) in the input document relate to the objects in the target SGML application? How do these objects sort themselves out into element begin and end points, attributes, embedded content, etc.?

This process should be performed with the help of experts who have been there before (Avalanche and some others routinely provide these types of services). Answering these questions up front ensures that the data you're converting will be optimized to support the target applications -- electronic delivery, search and retrieval, document management / assembly, and so forth.

Input Documents: A Look at the Issues

Before describing how conversions can be made easier, let's take a closer look at the kinds of issues that can come up in practice. These are the things that experienced consultants can quickly sort through, but which might otherwise cause difficulty.

First of all, the form of the input can make a big difference in the recommended conversion approach. Hardcopy is of course the hardest place to start, but don't be fooled by documents in electronic form. As we'll see, not all electronic forms are equal.

The first step in converting hardcopy to SGML is getting the text into digital form. This can be done by manual rekeying, or automated by OCR scanning. Either way, the result is generally an ASCII file containing text and whitespace -- but not much else.

These facts lead to several issues unique to hardcopy conversions:

With hardcopy documents, all structure is visual -- by definition. There is no underlying encoding to translate to SGML.
OCR errors can multiply the effect of errors in the conversion process. Thus it's usually critical to clean up the document before running it further through the process.
Graphics and equations are by definition available only in raster form, and have their own set of parallel conversion issues. These parallel conversions must be planned as part of the process.

Furthermore, the sheer problem of paper handling cannot be taken lightly. Making sure scanned images and OCR'd text stay synchronized with original hardcopy requires careful planning and workflow management. Again, here's where experience can make a big difference.

Conversion to SGML is a lot easier when input has been authored in electronic form. However, there are still things to consider:

The input must be in a format the conversion software can automatically read. Or it must be easily convertible to a format the software can read.
Electronic form is deceptive. In fact, most electronic documents, since they were authored with only paper presentation in mind, contain no more explicit structure than scanned hardcopy. There still may be very little to directly translate; graphics and equations may still need further conversion.

Having reviewed the forms of input, you will want to look inside the documents to ensure your conversion takes maximum advantage of the underlying information. The issues that arise vary with the type of content:

Standard Text

The bread-and-butter of documents, standard body text is the easiest part to convert. Its headings, lists, and paragraphs provide the backbone navigation structure for retrieval and document management applications. But even 'plain text' comes in a variety of flavors, some vanilla and some relatively exotic.

The very first thing in most documents -- the title page and other specialized 'front matter' -- can be difficult to convert using standard methods. Title pages, although short, can often contain twenty or more individual SGML objects, arranged in complex visual layouts. A few pages later, the document suddenly becomes much more straightforward. Yet some of the information, such as the document title, author's name, date and place of publication, revision date, and so on can be very important to get just right, since this content is often turned into tagged fields that are important in document management and retrieval. Also, at both the beginning and end of the document, there are objects that may be explicit in the input but are assumed to be automatically generated from the SGML. These items, which include such things as tables of contents, lists of tables/figures, and indexes, are usually encoded by a single SGML tag or are ignored entirely.

Given these facts, it often makes sense to apply automated methods to the main body of the document, but treat the first few pages (and certain additional sections) as exceptions to the mainstream process. These exceptions are often best handled manually or through specialized automated processes. For example, title page information can be entered into a template, much like filling out a library catalog card. SGML encoding can be produced by software configured to understand the template. Glossaries, which might otherwise appear to be tables, can be separately processed by software configured to process tables as glossaries.
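
The title page template idea can be sketched as follows. This is a minimal illustration, not a description of any product: the field names, the front element, and the choice of which fields are required are all assumptions, and the escaping step assumes the target DTD declares the amp and lt entities.

```python
# Hypothetical template fields, in the order they should appear in the output.
TEMPLATE_FIELDS = ["title", "author", "pubdate", "revdate"]
REQUIRED_FIELDS = ["title", "author"]  # assumed to be mandatory in the target DTD


def escape(text):
    """Protect markup characters (assumes amp and lt entities are declared)."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def front_matter_to_sgml(record):
    """Turn a filled-in template (a dict) into tagged front matter."""
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    lines = ["<front>"]
    for name in TEMPLATE_FIELDS:
        value = record.get(name)
        if value:  # optional fields that were left blank are skipped
            lines.append(f"  <{name}>{escape(value)}</{name}>")
    lines.append("</front>")
    return "\n".join(lines)
```

Because the operator fills in named fields rather than reproducing a visual layout, the complex arrangement of the title page never has to be recognized automatically.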

Tabular Information

In many documents, a tremendous amount of information is contained in row/column tables. You will usually want to capture the structure of these tables, since having the structure allows you to handle the tables in spreadsheets, reformat them, and make other uses of this dense, rich information. In general, converting simple, well-structured electronic tables into SGML tables is a straightforward process. However, this can suddenly become a lot harder when tables are complex, and the application demands a faithful preservation of all nuances of original format. When dealing with complex vertical and horizontal spanning, solid lines, and other typographical 'niceties,' a one-to-one match to the target application may not be possible.

When input is in hardcopy form, or when tables have been constructed with tabs and spaces rather than using a table editor, conversion can be very difficult. In these situations, visual recognition (described a bit later) may be the only automated conversion technique available. Having gone through this effort, however, you will see tables that were once inaccessible and awkward to work with turn into tables that can be intelligently searched and flexibly reformatted for display.
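
To see why tabbed tables are awkward, consider the simplest possible heuristic, sketched here in Python. It assumes columns are separated by a tab or by two or more whitespace characters, and it uses hypothetical table, row, and cell element names. Real tables with spanning, wrapped cells, or single-space column gaps defeat a rule this simple, which is why the sketch also reports rows it could not fit.

```python
import re


def parse_tabbed_table(lines):
    """Split each non-blank line into cells on a tab or a run of 2+ whitespace.

    Returns (rows, ragged), where ragged lists the indexes of rows whose
    cell count differs from the widest row and so need a human look.
    """
    rows = [re.split(r"\s{2,}|\t", line.strip()) for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    ragged = [i for i, row in enumerate(rows) if len(row) != width]
    return rows, ragged


def table_to_sgml(rows):
    """Emit the rows using hypothetical table/row/cell elements (no escaping)."""
    out = ["<table>"]
    for row in rows:
        cells = "".join(f"<cell>{cell}</cell>" for cell in row)
        out.append(f"  <row>{cells}</row>")
    out.append("</table>")
    return "\n".join(out)
```

The list of ragged rows is the useful part in practice: it turns a silent mis-conversion into a short worklist for manual clean-up.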

Figures and Graphics

Graphics pose another set of conversion issues. When dealing with hardcopy documents, graphics must be scanned in, separated from text for OCR purposes, then coordinated with text so they can be brought in by reference at their original positions. If embedded in electronic formats, they also need to be separated and referenced. All this separation and referencing means that agreeing on a robust set of naming conventions is a high priority.

Whether originally in hardcopy or electronic form, graphics may need to be converted into a different format for inclusion with SGML encoded text. For example, TIFF may be used on the input side, but CGM required in the target application. In some cases, such a conversion could be difficult or impossible to perform automatically (e.g. raster to 3-D IGES format), and graphics will need to be re-created in the target format.

Math and Equations

Equations are similar to graphics, in that they may need to be separated from text and converted to another format, or manually re-entered if conversion isn't possible. From an SGML standpoint, math can be treated in at least three ways: as special typesetting characters, as graphics, or as logical markup. The latter is certainly preferable since it allows you to continue to operate on equations with special, equation editing tools, but options will be limited by the input you have to work with.

Links and Cross-References

Finally, you will want to capture and retain information about links and cross-references, since they are so central to hypertext delivery of documents. Generally, these links and references are implicit (buried within standard text) in the input, but must be recognized and tagged explicitly in the output. Issues that must be considered include:

Naming conventions for building IDs and ID references for links within each document.
Naming conventions for building IDs and ID references for links between documents.
How links between documents will be recognized and tied together.
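
As an illustration of the first two items, the following Python sketch recognizes simple textual references and tags them using one possible naming convention. Everything specific here is an assumption made for the example: the phrasing the pattern looks for, the xref element and idref attribute, and the ID scheme of document id, object kind, and number. It also assumes the application's SGML declaration permits names of this length.

```python
import re

# Matches references such as "Section 3.2", "Figure 4", or "Table 10.1".
XREF = re.compile(r"\b(Section|Figure|Table) (\d+(?:\.\d+)*)")


def make_id(doc_id, kind, number):
    """Build an ID such as 'man1-sec-3-2' from a hypothetical naming convention."""
    return f"{doc_id}-{kind[:3].lower()}-{number.replace('.', '-')}"


def tag_xrefs(text, doc_id):
    """Wrap each recognized reference in a hypothetical <xref> element."""
    def replace(match):
        target = make_id(doc_id, match.group(1), match.group(2))
        return f'<xref idref="{target}">{match.group(0)}</xref>'
    return XREF.sub(replace, text)
```

Including the document identifier in every ID is one way to keep links between documents from colliding; the same make_id function would be used when assigning IDs to the sections, figures, and tables themselves, so that references and targets agree.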

The Conversion Itself

After sorting through the issues and matching up your input to the target application, the conversion process can be designed and implemented. In general, conversion can be either manual or software-assisted. Manual conversion, however, requires highly skilled operators and can be very expensive. Automated conversion software, such as Avalanche's FastTAG, is available to make this job a lot easier.

When choosing automated conversion tools, differences between software packages are important. In particular, most real-world applications demand use of software based on visual recognition rather than standard 'code-for-code' translation techniques. The translation approach is generally inadequate because it depends on absolute consistency at a detailed code level. For example, consider a word-processor document containing section headings that are bold and centered. A typical translation rule would look for a 'bold' code followed by a 'center' code as the indicator that a heading tag should be generated. In the real world, of course, the 'center' code might be entered first. That calls for a second translation rule. But what if the centering was done with tabs, or the space bar, or some combination of both? What if no 'bold' code is present because bold had already been turned on from the previous object? One can keep adding new translation rules as each new situation is discovered, but ultimately there can never be enough. Translation rules are too fragile.

How do documents get so inconsistent? Easily -- because from the author's point of view, they are consistent. All the different coding patterns result in the same visual presentation, and consistency is checked by viewing the printed page or WYSIWYG screen. Therefore, authors don't care about the underlying codes; in fact, the whole idea of WYSIWYG is that they shouldn't have to. Only the person who wrote the translation rules cares.

Visual recognition looks at documents the way people do. Unique to Avalanche's FastTAG, this technique analyzes the net effect of underlying codes, discovers the visual objects that would 'meet the eye,' and assigns them names based on user-defined rules. These objects then serve as the raw material for translation into SGML.
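
The contrast between the two approaches can be shown with a toy example in Python. This is not how FastTAG or any other product is implemented; it is a sketch under simplifying assumptions, with an invented token stream, invented code names (bold_on, bold_off, center), and an assumed line width of 40 characters.

```python
LINE_WIDTH = 40  # assumed page width in characters, for this toy example only


def translate_by_code(tokens):
    """Fragile rule: a heading is text immediately preceded by bold_on, center."""
    out = []
    for i, (kind, value) in enumerate(tokens):
        if kind != "text":
            continue
        preceding = [v for k, v in tokens[max(0, i - 2):i] if k == "code"]
        tag = "heading" if preceding == ["bold_on", "center"] else "para"
        out.append((tag, value.strip()))
    return out


def looks_centered(raw):
    """True if leading whitespace leaves roughly equal left and right margins."""
    expanded = raw.expandtabs(8)
    body = expanded.strip()
    left = len(expanded) - len(expanded.lstrip())
    right = LINE_WIDTH - left - len(body)
    return left > 0 and abs(left - right) <= 1


def recognize_by_net_effect(tokens):
    """Replay the codes, then classify each text run by how it would look."""
    bold = False
    center_pending = False
    out = []
    for kind, value in tokens:
        if kind == "code":
            if value == "bold_on":
                bold = True
            elif value == "bold_off":
                bold = False
            elif value == "center":
                center_pending = True
        else:
            centered = center_pending or looks_centered(value)
            tag = "heading" if bold and centered else "para"
            out.append((tag, value.strip()))
            center_pending = False  # assume a center code affects one line only
    return out
```

Given the codes in the expected order, both functions find the heading. Swap the two codes, center the text with tabs, or leave bold switched on from an earlier object, and only the second function still does, because it asks what the reader would see rather than which codes happened to produce it.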

After choosing the conversion software, it's important to invest the time to install and configure it properly. This is the point at which the up front analysis is translated into an automated process that correctly matches your input to the target application. The pay-off for this effort will be tremendous.

When designing the conversion process, another issue to consider is whether output should be produced as whole documents or in smaller fragments. Often, it is much more manageable to convert information a chapter or a section at a time. It may also be advantageous to separate the easy parts of the conversion from the hard parts. For example, this may involve having body text go through one process, but having a separate process for front matter, graphics, complex tables, and other special cases. Of course, such decisions imply the need for a process to reassemble the document after conversion is complete. This can be accomplished using Interleaf's SGML products or tools like SoftQuad's Author/Editor.

Quality Control: The Untold Story

A common myth is that quality control is synonymous with SGML parsing. Actually, there are at least three critical dimensions of quality control, only one of which can be checked with a parser:

Syntactic correctness (making sure it parses).
Semantic correctness (making sure tags actually correspond to the correct objects, with beginnings and ends at the correct spots).
Tagging completeness (making sure we didn't miss anything important).

The difference between these measures of correctness is subtle but important to understand. Consider the sentence 'Joel have brown hair.' Clearly this is incorrect, and that fact could be verified by an automated grammar checker (a kind of parser). But after being rewritten to read 'Joel has brown hair,' is the sentence now correct? What if Joel actually has blonde hair? In that case, the sentence would be syntactically but not semantically correct. The grammar checker couldn't catch this error: it would have no way of knowing. Similarly, an SGML parser alone can't tell that a paragraph is really a list item, or that a table really started two lines earlier than shown. Additional quality control techniques are required.

The need for completeness adds another twist. If we're being asked about hair color for a pair of siblings, 'Joel has blonde hair' isn't enough: we need to also mention that Susan has red hair. A grammar checker couldn't catch that mistake, just as an SGML parser couldn't know that all hypertext links have actually been tagged.

Automated Aids to Quality Control

Happily, each of these dimensions of quality control can be significantly automated. Here's a quick rundown of suggested techniques:

Syntactic correctness: Use an SGML parser to ensure the SGML is valid and parses against the target DTD.
Semantic correctness: View the SGML document in a WYSIWYG system (e.g. Interleaf or SoftQuad's Author/Editor) using a special, exaggerated style sheet. For example, make each higher-level heading at least 10 points larger than the previous level. Objects that are improperly tagged will stick out like a sore thumb.
Tagging completeness: Generate statistical reports on the tagged document. For example, a count of the number of figure references per figure may isolate a possible completeness problem if the ratio is low.
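
The statistical report suggested for tagging completeness might look something like the following Python sketch. It assumes hypothetical figure and figref elements with id and idref attributes, and it scans the tagged text with regular expressions rather than a real SGML parser, so it should be read as an illustration of the idea only.

```python
import re
from collections import Counter


def figure_reference_report(sgml):
    """Compare tagged figures with tagged figure references."""
    figure_ids = re.findall(r'<figure\s+id="([^"]+)"', sgml)
    ref_counts = Counter(re.findall(r'<figref\s+idref="([^"]+)"', sgml))
    total_refs = sum(ref_counts.values())
    return {
        "figures": len(figure_ids),
        "references": total_refs,
        # A low ratio hints that references in the text went untagged.
        "refs_per_figure": total_refs / len(figure_ids) if figure_ids else 0.0,
        # Figures nobody points at: possibly missed references.
        "unreferenced": [fid for fid in figure_ids if ref_counts[fid] == 0],
        # References to figures that do not exist: possibly missed figures.
        "dangling": sorted(set(ref_counts) - set(figure_ids)),
    }
```

None of these numbers proves an error, since some figures are legitimately never referenced. What the report provides is a short list of places worth a manual look.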

Each of these techniques can be implemented with the help of a software tool like Avalanche's SGML Hammer, which combines an SGML parser with a flexible output engine (the same one used in FastTAG). Of course, no matter which automated tools are being used to aid the quality control process, it's essential that all critical data be manually double-checked.

What If It Doesn't Parse?

SGML parsing ensures that SGML tags have been applied properly according to the target DTD. There are two reasons why this is important:

To ensure that the SGML files you've produced can be processed by your target application.
To ensure that the structure of your documents is consistent with the rules set up in your target application.

Similarly, there are two reasons why a converted output document may not parse:

The conversion process failed (perhaps the conversion software is inadequate or improperly configured, or manual operators are inadequately trained).
The document doesn't actually fit the target DTD.

The first problem requires fine-tuning the conversion process; the second requires some expertise in designing DTDs and conversion strategies. Consider some of the inconsistencies that may occur:

Having only one item in a list, when the DTD requires at least two (the reasoning was 'otherwise, we wouldn't have a list').
Having no embedded paragraph title when a title is always required.
Not having the information to fill in required SGML attributes (such as the last author to modify a particular paragraph).

If you have control over the DTD that will be used in the target application, you may not want to make it this restrictive. It may seem like a good idea to require that all lists have at least two items, but forcing that to be true may require a lot of fixing of input documents. You have to ask yourself if it's really worth it. If you can't control the DTD, you will probably want to use conversion software (like Avalanche's FastTAG) that can be configured to produce parsable SGML even if the input document is incomplete or contains some inconsistencies.

Also, given the potential for input documents not matching the DTD, it's important to plan the process by which information will be added or changed. In particular, it is critical to determine who has proper authority to determine exactly how such changes will be made.

Preparing the Way for Easy Conversions

The issues we've considered all have one thing in common: they result from a mismatch between the level of explicit structure in your current data vs. the level of explicit structure you need to get maximum benefit from your target applications. All the techniques we've talked about -- up front analysis, implementing the proper software, getting the right kind of expertise -- are designed to close this gap. While this takes some effort, moving to SGML can be easy once the gap is closed.

One of the best strategies for success is to adopt a methodology that builds structure into your data before SGML conversion is actually performed (this is another one of Avalanche's specialties). Good authoring practices that will increase the 'SGML-readiness' of your data include:

Consistent use of style sheet components, rather than direct formatting.
Consistent use of the native table editor, rather than tabbed tables.
Consistent use of explicit cross-referencing.
Consistent use of graphic frames, with inclusion of graphic files by reference.
Use of vector rather than raster graphics, so that automatic conversion to either vector or raster formats is possible in the future.
Consistent use of the native equations editor, rather than using special typesetting characters to express equations.

Designing these rules to fit your environment is best performed with expert help. Furthermore, it's important to realize that rules in themselves are not enough. After the methodology has been put in place, use a batch structure checking tool (also available from Avalanche) to ensure that rules are being followed and to provide feedback to authors in the native authoring environment. By performing what are essentially 'what if' conversions to SGML, such tools can catch structural problems early and help authors learn to improve the quality of their data before the actual conversion takes place.

This idea can also be applied to 'one-shot' conversions to SGML. If input data starts out relatively messy and unstructured, it may be best to break the conversion process into three stages:

Conversion from original form into a structured form in the native authoring environment. With visual recognition technology, this can be done by an automated tool. Essentially, it involves taking unstructured, inconsistently coded data and automatically applying as many of the above 'good authoring practices' as possible.
Manual clean-up (with the help of automated verification tools) to ensure good authoring practices have been consistently followed in the native authoring environment.
Automated conversion from the structured native form into SGML.

Once again, this highlights the point that converting to SGML is primarily a matter of enhancing the explicit structure in your information, and secondarily applying the SGML tags themselves.

Conclusion

Your desire to convert to SGML is being driven by the need to get more value from your document assets, and increasing the value of your document repository requires some investment and planning. Where there's no pain, there's no gain. However, a few basic strategies can get you where you want to go as quickly and painlessly as possible:

Realize the task at hand is sufficiently complex -- and important -- to justify careful planning and the use of experienced consultants. Plan ahead in enough detail to be sure all major issues have been identified and your conversion strategy makes sense. Perform a document analysis to understand how to best get from here to there.
Get the right software tools to automate as much of the conversion and quality control process as possible. Don't just look for a simple filter or quick-and-dirty translator tool that doesn't incorporate sophisticated techniques like visual recognition. Invest the effort to install and configure the software correctly.
Explore the possibility of refining information and migrating data as more than a single-stage conversion process.

Above all, focus on refining and maintaining the quality of your information, not just on conversion per se. Look at what you're trying to achieve by adopting SGML -- accessibility, reusability, interchange, etc. -- and make sure the conversion is optimized to set up the target application for success. And please...don't forget to eat your vegetables.