This presentation outlines various approaches to SGML "up-translation", i.e., the transformation of text data from arbitrary encoding formats to valid SGML instances. Visual recognition techniques, pattern matching techniques and two-step approaches with early conversion to low-level SGML structures are analyzed with respect to various data sources: text processor files, OCR data and phototypesetting files. This presentation also explains why "up-translation" is in no way symmetrical to "down-translation", i.e., the transformation of SGML data to arbitrary formats, and why different tools and programming paradigms are required for each problem.
SGML "up-translation" is defined as the conversion of text data from some arbitrary encoding format into a valid SGML instance of some "target" DTD, using (at least partly) automated processing techniques
This challenge is faced by any project or organization adopting SGML as its standard document representation format and expecting to avoid throwing away megabytes of legacy data, especially data already available in electronic form. Developing better methods and techniques to automate the SGML up-translation process and increase its reliability is, therefore, of obvious economic significance.
However, the underlying problems seem to have been frequently misunderstood. One of the reasons is an obvious lack of generally applicable processing models. As a consequence, it is very common to see unreasonable expectations, based upon immoderate confidence in some "magic" tool, in situations where a reasoned feasibility study could have shown that, for quasi-theoretical reasons, these expectations were much too high.
This paper is an attempt to characterize existing approaches to SGML up-translation, with a special focus on the multi-step processing approaches which underlie the latest generation of tools.
Most documents so far have been designed to be printed. For the vast majority of legacy documents available in electronic form (typesetting code, word-processor formats, output of OCR processes), the existing encoded information is primarily related to visual presentation. Therefore, we will focus on the problem of up-translating documents initially created in a word processing or typesetting environment, although some of the concepts outlined here also apply to structured data created by data management tools.
In such an approach, the only DTD considered is the target DTD, and an attempt is made to "jump directly" from the information level available in the source file to that of a valid instance of the target DTD. This is what we call a "single-step" approach. Three phases can nevertheless be distinguished in this process, even if they are intermixed in a single program:
1. identification of "visual presentation classes" in the source document, that is, groups of text objects sharing a common set of formatting properties. Here, "formatting properties" means both typographic characteristics and typical text patterns. For example, an indented paragraph preceded by a dash can be identified as some kind of list item.
2. mapping of these "visual presentation classes" onto SGML element types from the target DTD, thus "putting SGML names" upon the identified object types. The output of this phase is essentially a flat array of typed objects (some of them with attributes). In the example, the dashed list item could be characteristic of a level 2 list but, if the target DTD allows nested lists, the associated <ITEM> element would be the same as for a bulleted item, characteristic of level 1.
3. target structure generation, possibly involving reorganization of data and addition of missing structures (both elements and attributes) to comply with the rules imposed by the target DTD. Generation of the <LIST> element which surrounds the series of items would be an example (see the sketch after this list), but this process can be much more complex in real-world situations.
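To make the difference between the mapping and generation phases concrete, here is a minimal sketch in SGML terms, using the <ITEM> and <LIST> names from the example (the text content is invented for the illustration). After the mapping phase, the recognized dashed paragraphs form a flat sequence of typed objects:

    <ITEM>first point of the enumeration</ITEM>
    <ITEM>second point of the enumeration</ITEM>

The target structure generation phase then supplies the container required by the target DTD:

    <LIST>
    <ITEM>first point of the enumeration</ITEM>
    <ITEM>second point of the enumeration</ITEM>
    </LIST>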
Many attempts have been made to use the various tools currently available with this single-step approach in mind. Existing tools can be classified according to the technique used to identify the "visual presentation classes": pattern matching or visual recognition.
Pattern matching can be used to identify visual classes, as long as the coding system used for the source file is documented and understood. This is applicable to formatting codes (whether local or referring to a style sheet) in word processor files, as well as to typesetting languages.
The idea of using text patterns to imply SGML structures is suggested by the SGML standard itself through the SHORTREF mechanism, which allows context-sensitive mapping of characters or strings to entities.
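For reference, the standard mechanism looks like the following sketch (the entity, map and element names are invented here, and it assumes the hyphen is available as a short reference delimiter in the concrete syntax in use). Within a LIST element, each hyphen is replaced by an entity which expands to an <ITEM> start-tag:

    <!ENTITY dash-item "<ITEM>">
    <!SHORTREF in-list "-" dash-item>
    <!USEMAP in-list LIST>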
This idea was elaborated upon in the Mark-It parser (Yard Software) which, in a nonstandard extension, allows text patterns (regular expressions) to be used as short reference delimiters. Text patterns can be named, thus allowing complex nested constructions.
In such a system, the mapping process described previously is handled through the definition of entities appearing in short reference maps, combined with the USEMAP rules. Because the data manipulation language offered by Mark-It is very weak, however, no real data reorganization can be applied (except for element surrounding obtained through automatic "tag inference" by the parser, in cases when the target DTD makes provision for tag omission).
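The tag inference alluded to above relies on omitted-tag minimization in the target DTD. A minimal illustrative pair of declarations (names again borrowed from the running example) would be:

    <!ELEMENT LIST O O (ITEM+)>
    <!ELEMENT ITEM - O (#PCDATA)>

With such declarations, a parser encountering an <ITEM> tag at a point where only a LIST is allowed can infer the omitted <LIST> start-tag and, later, the omitted end-tags.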
While the "extended SHORTREF' concept implemented by Mark-It can appear as an elegant trick at first sight, it is very heavy to use in practice. It also contributes to the general impression about Mark-It in the SGML community: that of a tool which uses several proprietary extensions (or distortions?) of SGML concepts, with the danger that non-specialized users might have difficulty distinguishing what is standard from what is not.
Exoterica's OmniMark avoids the preceding pitfalls by offering an explicit pattern matching mechanism which supports event-driven programming based on lexical events. Though similar facilities are offered by several tools (such as lex or perl), only OmniMark provides tight coupling with the embedded SGML parser, so that pattern recognition can be made dependent on the SGML context. Patterns in OmniMark can be named, thus allowing complex constructions.
This part of the process covers the visual class identification and mapping phases.
The conversion process is taken further in what is called "context translation": the powerful error recovery capabilities of the embedded parser are used to "correct" the generated SGML file on the fly and make it a valid instance of the target DTD. This involves automatic generation of inferred elements to satisfy the structural constraints, even when the corresponding tags are not declared omissible. This part of the process corresponds to the target structure generation phase in our model.
Unfortunately, while this mechanism works well for slight distortions with respect to the target structure, it can prove very unstable when intensive data shuffling and generation of additional structures would be necessary to comply with the target DTD. The limits are those of the single-step model, with its assumption that the source file is a valid instance of the target DTD in which structure has only been "hidden" in some way and has to be "uncovered".
Zandar's TagWrite is a sophisticated pattern matching engine with specialized features to process Microsoft Word files. TagWrite is devoid of any parser and generation of a valid instance is the responsibility of the programmer.
WordPerfect's recent product, Intellitag, is yet another implementation of the pattern-matching approach, based on contextual search and replace forms. Though less powerful than the previously described tools, it is also less intimidating for the WordPerfect user and provides real-time display and validation of the conversion result.
All the tools based on pattern recognition stumble on the same difficulty: in most cases, there are a number of ways to encode the same "visual structure".
This is true for word processor files: the visual appearance of a document cannot tell whether a style sheet was used without any modification, with modifications, or not at all. Even when corporate guidelines impose exclusive use of well-defined style sheets and prohibit the use of "local" styles or local style sheet overrides, experience shows that compliance is rarely guaranteed, since word processors themselves were not designed to impose such a discipline.
In typesetting files, even sets of files coming from the same typesetting shop for the same document exhibit variations in coding, since each operator tends to use his own tricks and to modify the basic "formats" drawn from a common library. These problems reach their climax in the case of complex structures, such as tables.
As a result, the definition of a complete grammar for recognizable text patterns is often an endless process, with time-consuming successive adjustments.
The visual recognition approach was designed to circumvent the code variability problem. In this case, the input codes are "literally understood" to build some internal representation of the formatted document, preserving all typographic and layout information.
Programmable recognition is directly applied to this "visual abstraction", thus bypassing the need to deal directly with the input code grammar.
FastTag (Avalanche) is the only commercialized tool implementing such an approach. This tool uses AI techniques to automate some aspects of the visual recognition process, while allowing programmers to refine visual class identification using the Inspec language. The Louise language is used to define output code generation, thus combining the mapping phase and the target structure generation phase.
Since no SGML parser is embedded in FastTag, no DTD-driven structure inference occurs (as opposed to what happens with Mark-It or OmniMark), and the generation of a valid instance of the target DTD is under the sole responsibility of the programmer. Furthermore, specific programming is required to track the SGML context in the generated instance, which is needed when the mapping process is context-dependent (see the nested <ITEM> example used before).
Also, although Louise is a rather general-purpose procedural language able to handle non-trivial data structures, FastTag's event-driven programming paradigm (where an event is generated each time a visual object is identified) does not facilitate complex structure reorganization during the output process.
By merging the identification, mapping and target structure generation phases into a single, non-interruptible process, the single-step approach described so far tends to concentrate several independent kinds of difficulties in a single piece of code. This code is written and maintained by a single programmer, who has to combine knowledge of the idiosyncrasies of some input format with the particulars of some target DTD.
Furthermore, tools built around event-driven programming paradigms tend to restrict the programmer to a linear vision of the conversion process, and make it very difficult to address cases where structure disambiguation requires look-ahead, look-behind and look-aside.
Finally, by directly referring to the target DTD and attempting to make a one-step jump to the associated information level, one silently makes the assumption that the source document fits the target DTD, which is not always true.
Processing model

The main idea behind multi-step approaches is to relax some of the previous constraints by introducing one or several intermediate DTDs. These intermediate DTDs explicitly model intermediate "information states" in the up-translation process, such as the output of the identification or mapping phases described before. But it is also possible to break down each phase further and to introduce still more DTDs.
This "pipelining" process intends to meet several goals:
1. separate the problems associated with decoding of the input format, identification of visual presentation classes, mapping to the target structure and "raising" the information level;
2. allow different programmers, with different skills, to develop and maintain the multiple pieces of the up-translation "pipeline", thus facilitating work in parallel;
3. enter the SGML domain as soon as possible; this brings the benefits of explicit information content modelling through DTDs, and allows the use of sophisticated tools to process SGML structures, especially those which allow unconstrained examination of the ESIS tree in a non-sequential way, such as Balise developed by AIS.
This leads to a new processing model, in which the up-translation process is decomposed into a series of jumps through intermediate information states, each of them being modelled as an SGML instance. Transformation from one state to the next can be seen as a tree transformation process, even though some of these trees are rather "flat".
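As a purely illustrative example (the element names below are invented, not taken from any particular project), an intermediate DTD modelling the output of the identification phase could be almost flat, recording visual presentation classes rather than logical structure:

    <!ELEMENT visdoc - - (head1 | head2 | plainpar | dashpar)*>
    <!ELEMENT (head1 | head2 | plainpar | dashpar) - O (#PCDATA)>

A subsequent transformation step would then map dashpar onto <ITEM>, generate the surrounding <LIST>, and so on, each intermediate state being validated against its own DTD by a standard SGML parser.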
There is nothing to prevent using the tools mentioned so far in a multi-step setting, but new tools recently introduced explicitly call for such an approach.
Electronic Book Technologies recently introduced several new concepts and products which fit well in our multi-step up-translation model.
The Rainbow DTD is a "wide-spectrum" DTD, which attempts to model the data structures usually found in files generated by modern word-processors and technical documentation tools (Word, WordPerfect, Interleaf, FrameMaker, etc.). Conversion from proprietary word-processor formats to Rainbow -- the work done by "Rainbow Makers" -- is a fairly mechanical process which makes no attempt to raise the information level, and thus should become quite deterministic and reliable once the Rainbow Makers are fully debugged.
DynaTag is a new-generation up-translation tool which effectively supports the multi-step approach. DynaTag's input format is Rainbow: the original decoding step thus actually occurs outside DynaTag itself. The DynaTag engine automatically determines the set of "visual presentation classes" used in the document, and allows the user to browse the results of this recognition process in "WYSIWYG" mode.
Concerning output, DynaTag automatically generates a "de-facto" DTD based on an "OTD" (Output Type Definition) provided by the user, which is essentially a catalogue of element and attribute names.
The generated DTD is made flexible enough to ensure that the generated instance is valid. In our processing model, the corresponding information state would be the output of the mapping phase.
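One simple way to obtain this kind of flexibility -- given here only as an illustration of the principle, not as a description of DynaTag's actual output -- is to declare every catalogued element with a maximally permissive content model:

    <!ELEMENT (chapter | title | para | item) - - ANY>

Such a DTD accepts the catalogued elements and character data in almost any combination, which is enough to guarantee that the intermediate instance parses.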
The rest of the up-translation process -- converting the intermediate instance to a valid instance of some predefined target DTD -- is outside the scope of DynaTag, and is handed to specialized SGML transformation tools.
In 1993, AIS had to handle a difficult up-translation job for Bureau Veritas (Marine Branch), in which the Rules and Regulations for the Classification of Ships (a thousand-page document) had to be converted from Autologic MOPAS typesetting codes into SGML. One of the difficulties of this job was the high proportion of math equations (more than 5000 in all) to be converted to the ISO TR 9573 DTD.
This job was done using a multi-step approach, involving five intermediate stages. Each stage was described by an SGML DTD and made explicit some information which was not explicit in the previous stage, thus raising the information level captured by the SGML structure.
The first DTD in the pipeline, called the "magma DTD", was simply an SGML version of a catalogue of typesetting codes and format calls, which defined each code as an EMPTY element. Such a low-level use of SGML could be considered an abuse. However, it allowed us to use an SGML parser to check the completeness of the code catalogue against instances, and to discover that the documentation provided to us was far from complete!
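The actual code catalogue was specific to the Bureau Veritas data, but the flavour of such a "magma DTD" can be suggested with a few invented declarations -- one EMPTY element per typesetting code or format call, embedded in a single stream of character data:

    <!ELEMENT magma O O (#PCDATA | f1 | f2 | qc | eqn.start | eqn.end)*>
    <!ELEMENT (f1 | f2 | qc | eqn.start | eqn.end) - O EMPTY>

Parsing the converted data against such a DTD immediately reveals any code that is absent from the catalogue.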
The only tool used throughout the pipeline was Balise, used as an SGML transformation tool, except for the initial step handled by a very simple lex program.
This processing model is reminiscent of the SGML document formatting model which emerged during work on the DSSSL standard. However, the resemblance stops there.
Even though the resulting formatted document can be thought of as "visually richer" than the original SGML instance, it is important to realize that, during this process, only information loss occurs. For this reason, an SGML formatting process (down-translation) may be quite complex, but it is always deterministic.
By contrast, an SGML up-translation process -- though it can certainly be very complex as well -- generally attempts to raise the information level in the document representation and, for this reason, cannot be guaranteed to be deterministic. This would be true even in the (theoretical) limit case where one would "up-translate" the result of some SGML down-translation process, that is, try to recover some lost information (or filter out some "noise"). In real situations, however, one major additional problem occurs: the source document can never be guaranteed to fit the target model (DTD), since the model did not even exist (or was not explicit) when the document was created!
This is why "uptranslation" cannot be seen as symmetrical to "downtranslation" and, in the general case, will always be harder, however powerful are the tools at hand. In extreme cases, source data have to be modified to make uptranslation feasible at all.
François Chahuneau is currently General Manager of Advanced Information Systems (A.I.S. S.A.), a thirty-person, wholly owned subsidiary of the French Group Berger-Levrault, specializing in SGML systems integration, consulting, application and software product development. After several years spent as a research scientist in applied mathematics and computer science, he joined Group Berger-Levrault in 1988 to create AIS. M. Chahuneau spends most of his time developing AIS activities in the reference publishing, banking, insurance and technical documentation markets. Recent achievements include SGML-based editorial systems for several French law publishers, the Historical Dictionary of Switzerland and Groupe Bull, innovative electronic documentation systems for PSA and Aerospatiale, a prototype of an electronic reading workstation for the Bibliothèque de France, and the launching of the Balise SGML application language and the SGML/store SGML database. M. Chahuneau studied at the Ecole Normale Supérieure in Paris.