[Mirrored from: http://www.ileaf.com/avhome/veggies.html]
By Eric Severson
© Avalanche Development Company/Interleaf Inc.
Permission to redistribute this whitepaper is granted, provided that no changes are made, and that this notice and the above attributions are included in all copies.
If you want to enjoy the benefits of SGML, your data needs to be in SGML format. That seems logical enough. However, for most applications, getting to SGML is not just a simple translation process. Because the goal is a much richer information environment, conversion to SGML requires a process of document analysis and information refinement in addition to migrating the data itself. This white paper explores the issues involved in moving to SGML, offers techniques for making this process as effective and painless as possible, and shows how the steps in the SGML conversion process are directly related to the benefits you get once conversion is complete.
SGML conversions have a reputation for being worthwhile but not necessarily lots of fun -- much like the problem of having to eat your vegetables before you get dessert.
When I was young, I worked hard not to eat my vegetables. My goal was dessert, and I couldn't understand why vegetables had to stand in my way. You might reasonably ask the same kind of questions about SGML and SGML conversion. Since we're just talking about moving data around, can't this process be easy? Can't we just move right to the dessert -- information access, reuse, interchange, electronic distribution, and so forth? Actually, with the right kinds of software and expertise, conversion pain can be minimized. However, like understanding that vegetables are critical to your health, understanding the nature of SGML conversion is the first step to success.
While initial SGML efforts revolved around standards compliance, people are focusing on a lot more today. SGML is being seen as the key to leveraging information resources, freeing the tremendous amount of strategic business information that is locked away in documents. This information is locked away because, even though it may be stored in electronic form, it usually resides in individual proprietary systems whose sole purpose is to produce composed hardcopy for people to read. You can access the information in these documents, but only if you separately pick up and browse each one.
Simple document management systems -- like catalog cards in a library -- can help provide a roadmap to find the documents you might want to read. However, your success at finding them is limited by how well the titles, and other predefined catalog card classifications, actually represent what's inside. Furthermore, this representation needs to be consistent with your particular point of view given your particular task at hand. And you can't really avoid poring through each document turned up by the search.
Full text retrieval -- searching for specific words or phrases across the entire pool of text in a document collection -- helps you see inside the documents, but it works only if the authors happened to use the exact words and phrases you search for. And unlike your natural impulse when you read a document, full text retrieval has no notion of skimming through headings and section titles to narrow down the search. It gives equal weight to phrases found in main titles and phrases found in footnotes, because it doesn't know which is which.
Suppose you find the documents you want: what do you do next? Probably, you want to reuse the information in some way, combining it with other data and republishing it in a new format, or passing it along to a separate process or other areas of the organization. Once again, the proprietary, page-oriented formats stand in the way: I can't use your information unless I also use your composition system and your page layout specifications.
Unlocking document information requires an open systems approach, which is what SGML is all about. Using SGML, open systems encode the logical structure of objects below the document level: the various levels of headings, lists, and so forth that people pay attention to when actually reading the information. Furthermore, SGML encodes this information in a neutral format, keeping logical content separate from specific page layout decisions and allowing flexible interchange across different platforms and applications.
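To make this concrete, here is a hypothetical fragment of SGML-encoded text. The element names are invented for illustration; a real application would use whatever its DTD defines:

```sgml
<chapter><title>Pump Maintenance</title>
<section><title>Replacing the Filter</title>
<para>Remove the housing and discard the old filter.</para>
<list>
<item>Use only type-A replacement filters.</item>
<item>Check the seal before reassembly.</item>
</list>
</section></chapter>
```

Notice that nothing here says how a title or a list item should look on the page; those decisions are deferred to each application that consumes the data.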
By taking this approach, SGML maximizes both the current and future usefulness of your information resources, facilitating all of the following:
Moving to SGML is challenging because truly getting a handle on your information is challenging; making it accessible and reusable and interchangeable takes some doing. But that's exactly why this process is so important.
People tend to think of conversions as a filtering or translation process, like moving a WordPerfect document into Microsoft Word. This kind of conversion takes one set of information and transforms it to some equivalent set of information. Only the underlying file format is different.
By contrast, SGML conversion typically involves building a bridge between the world of hardcopy and word processing documents (where logical structure is perceived visually by the reader) and the world of 'intelligent' documents (where logical structure is explicitly encoded). The whole point of SGML conversions, and the reason they enable document reuse, electronic document delivery, and other benefits, is that they necessarily involve information enrichment -- adding more than was originally there:
As a starting point for exploring the ins and outs of SGML conversion, let's take a quick look at what makes up the process. At a high level, there are four major components:
In general, any large-scale conversion effort is a manufacturing process (in this case, what we're 'making' is SGML documents), and should be set up as a production line. As we'll see, the great bulk of this line can be automated with the proper strategy and the right software tools.
The important thing to understand about SGML applications is that they're not all the same. There's actually no such thing as 'converting to SGML,' only converting to a specific SGML application. SGML is not a set of tags, nor even a specific way of looking at your information, but rather a language for defining the way you want to look at your information -- or, in some cases, how someone else wants you to look at your information.
SGML applications are always based on a specific Document Type Definition or DTD. SGML DTDs catalog the information objects ('elements') of interest, including their names ('generic identifiers'), properties ('attributes'), and allowed content ('content models'). Through the content models, the DTD also specifies all the allowed relationships between elements, including hierarchical document structure, specific order in which elements must appear, how many times each element can appear, what's required vs. optional, and -- through a slightly different mechanism -- the possible links between elements.
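For illustration, here is a tiny hypothetical DTD fragment (the element and attribute names are invented for this example):

```sgml
<!ELEMENT report  - - (title, author?, section+) >
<!ELEMENT section - - (title, (para | list)+)    >
<!ELEMENT list    - - (item, item+)              >
<!ELEMENT item    - - (#PCDATA)                  >
<!ATTLIST report  security  (public | internal)  public >
```

Read aloud, this says: a report is a title, an optional author, and one or more sections; a section is a title followed by paragraphs and lists; a list must contain at least two items; and every report carries a security attribute that defaults to 'public.'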
One of the first steps in building a new SGML application and set of DTDs is document analysis. This process answers questions like:
Document analysis is also required for SGML conversion. That's because conversion involves matching up your input to the kinds of information elements and relationships assumed in the SGML application. In this case, the questions are:
This process should be performed with the help of experts who have been there before (Avalanche and some others routinely provide these types of services). Answering these questions up front ensures that the data you're converting will be optimized to support the target applications -- electronic delivery, search and retrieval, document management / assembly, and so forth.
Before describing how conversions can be made easier, let's take a closer look at the kinds of issues that can come up in practice. These are the things that experienced consultants can quickly sort through, but which might otherwise cause difficulty.
First of all, the form of the input can make a big difference in the recommended conversion approach. Hardcopy is of course the hardest place to start, but don't be fooled by documents in electronic form. As we'll see, all electronic forms are not equal.
The first step in converting hardcopy to SGML is getting the text into digital form. This can be done by manual rekeying, or automated by OCR scanning. Either way, the result is generally an ASCII file containing text and whitespace -- but not much else.
These facts lead to several issues unique to hardcopy conversions:
Furthermore, the sheer problem of paper handling cannot be taken lightly. Making sure scanned images and OCR'd text stay synchronized with original hardcopy requires careful planning and workflow management. Again, here's where experience can make a big difference.
Conversion to SGML is a lot easier when input has been authored in electronic form. However, there are still things to consider:
Having reviewed the forms of input, you will want to look inside the documents to ensure your conversion takes maximum advantage of the underlying information. The issues that arise vary with the type of content:
The bread-and-butter of documents, standard body text is the easiest part to convert. Its headings, lists, and paragraphs provide the backbone navigation structure for retrieval and document management applications. But even 'plain text' comes in a variety of flavors, some vanilla and some relatively exotic.
The very first thing in most documents -- the title page and other specialized 'front matter' -- can be difficult to convert using standard methods. Title pages, although short, can often contain twenty or more individual SGML objects, arranged in complex visual layouts. A few pages later, the document suddenly becomes much more straightforward. Yet some of the information, such as the document title, author's name, date and place of publication, revision date, and so on, can be very important to get just right, since this content is often turned into tagged fields that are important in document management and retrieval. Also, at both the beginning and end of the document, there are objects that may be explicit in the input but are assumed to be automatically generated from the SGML. These items, which include such things as tables of contents, lists of tables/figures, and indexes, are usually encoded by a single SGML tag or are ignored entirely.
Given these facts, it often makes sense to apply automated methods to the main body of the document, but treat the first few pages (and certain additional sections) as exceptions to the mainstream process. These exceptions are often best handled manually or through specialized automated processes. For example, title page information can be entered into a template, much like filling out a library catalog card. SGML encoding can be produced by software configured to understand the template. Glossaries, which might otherwise appear to be tables, can be separately processed by software configured to process tables as glossaries.
In many documents, a tremendous amount of information is contained in row/column tables. You will usually want to capture the structure of these tables, since having the structure allows you to load them into spreadsheets, reformat them, and make other uses of this dense, rich information. In general, converting simple, well-structured electronic tables into SGML tables is a straightforward process. However, this can suddenly become a lot harder when tables are complex and the application demands faithful preservation of all the nuances of the original format. When dealing with complex vertical and horizontal spanning, solid lines, and other typographical 'niceties,' a one-to-one match to the target application may not be possible.
When input is in hardcopy form, or when tables have been constructed with tabs and spaces rather than using a table editor, conversion can be very difficult. In these situations, visual recognition (described a bit later) may be the only automated conversion technique available. Having gone through this effort, however, you will see tables that were once inaccessible and awkward to work with turn into tables that can be intelligently searched and flexibly reformatted for display.
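As a sketch of what 'capturing the structure' means, here is a simple two-column table marked up in a CALS-style table model -- one common choice, though your target DTD may use a different model:

```sgml
<table>
<tgroup cols="2">
<thead><row><entry>Part</entry><entry>Quantity</entry></row></thead>
<tbody>
<row><entry>Filter</entry><entry>2</entry></row>
<row><entry>Gasket</entry><entry>1</entry></row>
</tbody>
</tgroup>
</table>
```

Because the rows and entries are explicit, an application can search a single column or reflow the table for a new page size -- impossible when the 'table' is just tab-separated text.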
Graphics pose another set of conversion issues. When dealing with hardcopy documents, graphics must be scanned in, separated from text for OCR purposes, then coordinated with text so they can be brought in by reference at their original positions. If embedded in electronic formats, they also need to be separated and referenced. All this separation and referencing means that agreeing on a robust set of naming conventions is a high priority.
Whether originally in hardcopy or electronic form, graphics may need to be converted into a different format for inclusion with SGML encoded text. For example, TIFF may be used on the input side, but CGM required in the target application. In some cases, such a conversion could be difficult or impossible to perform automatically (e.g. raster to 3-D IGES format), and graphics will need to be re-created in the target format.
Equations are similar to graphics, in that they may need to be separated from text and converted to another format, or manually re-entered if conversion isn't possible. From an SGML standpoint, math can be treated in at least three ways: as special typesetting characters, as graphics, or as logical markup. The last is certainly preferable, since it allows you to continue to operate on equations with specialized equation-editing tools, but your options will be limited by the input you have to work with.
Finally, you will want to capture and retain information about links and cross-references, since they are so central to hypertext delivery of documents. Generally, these links and references are implicit (buried within standard text) in the input, but must be recognized and tagged explicitly in the output. Issues that must be considered include:
After sorting through the issues and matching up your input to the target application, the conversion process can be designed and implemented. In general, conversion can be either manual or software-assisted. Manual conversion, however, requires highly skilled operators and can be very expensive. Automated conversion software, such as Avalanche's FastTAG, is available to make this job a lot easier.
When choosing automated conversion tools, differences between software packages are important. In particular, most real-world applications demand use of software based on visual recognition rather than standard 'code-for-code' translation techniques. The translation approach is generally inadequate because it depends on absolute consistency at a detailed code level. For example, consider a word-processor document containing section headings that are bold and centered. A typical translation rule would look for a 'bold' code followed by a 'center' code as the indicator that a heading tag should be generated. In the real world, of course, the 'center' code might be entered first. That calls for a second translation rule. But what if the centering was done with tabs, or the space bar, or some combination of both? What if no 'bold' code is present because bold had already been turned on from the previous object? One can keep adding new translation rules as each new situation is discovered, but ultimately there can never be enough. Translation rules are too fragile.
How do documents get so inconsistent? Easily -- because from the author's point of view, they are consistent. All the different coding patterns result in the same visual presentation, and consistency is checked by viewing the printed page or WYSIWYG screen. Therefore, authors don't care about the underlying codes; in fact, the whole idea of WYSIWYG is that they shouldn't have to. Only the person who wrote the translation rules cares.
Visual recognition looks at documents the way people do. Unique to Avalanche's FastTAG, this technique analyzes the net effect of underlying codes, discovers the visual objects that would 'meet the eye,' and assigns them names based on user-defined rules. These objects then serve as the raw material for translation into SGML.
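The principle can be sketched in a few lines of Python. This is purely an illustration of the idea, not how FastTAG is implemented; the code names and the classification rule are invented for the example:

```python
# Classify a paragraph by the NET visual effect of its formatting codes,
# rather than by matching a specific code-for-code sequence.

def visual_properties(codes, text):
    """Reduce a sequence of formatting codes to their net visual effect."""
    props = {"bold": False, "centered": False}
    for code in codes:
        if code == "bold-on":
            props["bold"] = True
        elif code == "center":
            props["centered"] = True
    # Centering faked with tabs or runs of spaces still *looks* centered.
    if text.startswith(("\t", "   ")):
        props["centered"] = True
    return props

def classify(codes, text):
    props = visual_properties(codes, text)
    if props["bold"] and props["centered"]:
        return "heading"
    return "paragraph"

# All three inputs render identically, so all classify the same way,
# regardless of code order or how the centering was achieved:
assert classify(["bold-on", "center"], "Overview") == "heading"
assert classify(["center", "bold-on"], "Overview") == "heading"
assert classify(["bold-on"], "\t\tOverview") == "heading"
assert classify([], "A plain body sentence.") == "paragraph"
```

A code-for-code translator would need a separate rule for each of those input variants; a visual rule needs only one.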
After choosing the conversion software, it's important to invest the time to install and configure it properly. This is the point at which the up-front analysis is translated into an automated process that correctly matches your input to the target application. The pay-off for this effort will be tremendous.
When designing the conversion process, another issue to consider is whether output should be produced as whole documents or in smaller fragments. Often, it is much more manageable to convert information a chapter or a section at a time. It may also be advantageous to separate the easy parts of the conversion from the hard parts. For example, this may involve having body text go through one process, but having a separate process for front matter, graphics, complex tables, and other special cases. Of course, such decisions imply the need for a process to reassemble the document after conversion is complete. This can be accomplished using Interleaf's SGML products or tools like SoftQuad's Author/Editor.
A common myth is that quality control is synonymous with SGML parsing. Actually, there are at least three critical dimensions of quality control, only one of which can be checked with a parser:
The difference between these measures of correctness is subtle but important to understand. Consider the sentence 'Joel have brown hair.' Clearly this is incorrect, and that fact could be verified by an automated grammar checker (a kind of parser). But after being rewritten to read 'Joel has brown hair,' is the sentence now correct? What if Joel actually has blonde hair? In that case, the sentence would be syntactically but not semantically correct. The grammar checker couldn't catch this error: it would have no way of knowing. Similarly, an SGML parser alone can't tell that a paragraph is really a list item, or that a table really started two lines earlier than shown. Additional quality control techniques are required.
The need for completeness adds another twist. If we're being asked about hair color for a pair of siblings, 'Joel has blonde hair' isn't enough: we need to also mention that Susan has red hair. A grammar checker couldn't catch that mistake, just as an SGML parser couldn't know that all hypertext links have actually been tagged.
Happily, each of these dimensions of quality control can be significantly automated. Here's a quick rundown of suggested techniques:
Each of these techniques can be implemented with the help of a software tool like Avalanche's SGML Hammer, which combines an SGML parser with a flexible output engine (the same one used in FastTAG). Of course, no matter which automated tools are being used to aid the quality control process, it's essential that all critical data be manually double-checked.
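As one concrete example, a completeness check for cross-references -- something a parser alone cannot do -- can be automated in a few lines of Python. The attribute names id and refid below are placeholders; use whatever your DTD actually defines:

```python
import re

def unresolved_xrefs(sgml_text):
    """Report cross-references whose target IDs don't exist anywhere
    in the document -- a completeness error, not a syntax error."""
    ids = set(re.findall(r'\bid="([^"]+)"', sgml_text))
    refs = re.findall(r'\brefid="([^"]+)"', sgml_text)
    return [ref for ref in refs if ref not in ids]

doc = ('<section id="intro">See <xref refid="intro"> '
       'and <xref refid="specs">.</section>')
assert unresolved_xrefs(doc) == ["specs"]
```

A document like this could pass an SGML parser cleanly -- every tag is legal -- yet still ship with a dangling reference that only a check like this will surface.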
SGML parsing ensures that SGML tags have been applied properly according to the target DTD. There are two reasons why this is important:
Similarly, there are two reasons why a converted output document may not parse:
The first problem requires fine-tuning the conversion process; the second requires some expertise in designing DTDs and conversion strategies. Consider some of the inconsistencies that may occur:
If you have control over the DTD that will be used in the target application, you may not want to make it this restrictive. It may seem like a good idea to require that all lists have at least two items, but forcing that to be true may require a lot of fixing of input documents. You have to ask yourself if it's really worth it. If you can't control the DTD, you will probably want to use conversion software (like Avalanche's FastTAG) that can be configured to produce parsable SGML even if the input document is incomplete or contains some inconsistencies.
Also, given the potential for input documents not matching the DTD, it's important to plan the process by which information will be added or changed. In particular, it is critical to establish who has the authority to decide exactly how such changes will be made.
The issues we've considered all have one thing in common: they result from a mismatch between the level of explicit structure in your current data vs. the level of explicit structure you need to get maximum benefit from your target applications. All the techniques we've talked about -- up front analysis, implementing the proper software, getting the right kind of expertise -- are designed to close this gap. While this takes some effort, moving to SGML can be easy once the gap is closed.
One of the best strategies for success is to adopt a methodology that builds structure into your data before SGML conversion is actually performed (this is another one of Avalanche's specialties). Good authoring practices that will increase the 'SGML-readiness' of your data include:
Designing these rules to fit your environment is best performed with expert help. Furthermore, it's important to realize that rules in themselves are not enough. After the methodology has been put in place, use a batch structure checking tool (also available from Avalanche) to ensure that rules are being followed and to provide feedback to authors in the native authoring environment. By performing what are essentially 'what if' conversions to SGML, such tools can catch structural problems early and help authors learn to improve the quality of their data before the actual conversion takes place.
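A 'what if' structure check of this kind can be as simple as verifying that authors use only the named styles the conversion knows how to map. The style names below are assumptions for illustration, not any particular tool's rule set:

```python
# Pre-conversion structure check (illustrative sketch, not Avalanche's
# actual tool): flag paragraphs whose style won't map cleanly to SGML.

ALLOWED_STYLES = {"Title", "Heading1", "Heading2", "Body", "ListItem"}

def check_styles(paragraphs):
    """paragraphs: (style_name, text) pairs exported from the authoring tool."""
    problems = []
    for number, (style, text) in enumerate(paragraphs, start=1):
        if style not in ALLOWED_STYLES:
            problems.append("para %d: unmapped style '%s'" % (number, style))
    return problems

report = check_styles([("Heading1", "Overview"),
                       ("Body", "This system requires..."),
                       ("Normal+Bold", "Looks like a heading")])
assert report == ["para 3: unmapped style 'Normal+Bold'"]
```

Run regularly, a check like this gives authors feedback in their native environment long before conversion day, when problems are cheapest to fix.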
This idea can also be applied to 'one-shot' conversions to SGML. If input data starts out relatively messy and unstructured, it may be best to break the conversion process into three stages:
Once again, this highlights the point that converting to SGML is primarily a matter of enhancing the explicit structure in your information, and secondarily applying the SGML tags themselves.
Your desire to convert to SGML is being driven by the need to get more value from your document assets, and increasing the value of your document repository requires some investment and planning. Where there's no pain, there's no gain. However, a few basic strategies can get you where you want to go as quickly and painlessly as possible:
Above all, focus on refining and maintaining the quality of your information, not just on conversion per se. Look at what you're trying to achieve by adopting SGML -- accessibility, reusability, interchange, etc. -- and make sure the conversion is optimized to set up the target application for success. And please...don't forget to eat your vegetables.