Event Driven or Tree Manipulation Approaches to SGML Transformation -

[Mirrored from: http://www.balise.com/current/articles/lecluse.htm]

Event Driven or Tree Manipulation Approaches to SGML Transformation -
You should not have to Choose

Christophe Lécluse

Technical Director

A I S
35 Rue du Pont
F-92200 Neuilly sur Seine

clec@ais.berger-levrault.fr

Abstract

Two approaches are available for specifying transformation processes on SGML documents: a declarative approach, based on context-sensitive rules triggered on SGML parsing events, and a procedural approach, based on explicit manipulation of the document tree.

This paper shows that each approach is optimal for a certain class of problems, but that both are actually needed and that maximum expressive power is achieved when both can be combined in a same program.

1. SGML Transformation

SGML Transformation is the process that takes SGML instances as input and manipulates them for the various purposes of publishing activities: generation of new instances, content validation, data or table of contents generation, composition, etc.

This process is key in most SGML applications. SGML documents are instances of a model defined by the Document Type Definition (DTD) and SGML transformation is the process for manipulating the information encoded into the document.

Application areas for SGML transformation are numerous:

Validation and Quality Assurance. The structure model of documents is expressed through the DTD and SGML parsing already provides a significant level of validation. Most applications however, require more in-depth validation of documents. Examples of such validations are content validation (check that there is no empty paragraph and that a title has less than 30 words), specific structure validation (check that there is more than 3 chapters and less than 20), etc.

Information Extraction from SGML Documents: Typical example is the extraction of meta information (author, date etc.) from an incoming document for management purposes. This area of application is very large, ranging from simple field extraction up to automatic derivation of sub documents.

SGML documents enhancements: This covers the implementation of live links in documents (the so-called active documents) and, more generally, merging various sources of information (documents, databases etc.) in an SGML framework.

Translation from non SGML to SGML: This area, also called ``up-translation'', can largely be considered also as an application of SGML transformation. Up translation applications can always be separated in two phases. The first phase (the decoding phase) simply deals with the analysis/decoding of the input format and generates a first level of SGML. The second phase (the interpretation phase) deals with the interpretation of the decoded objects and tries to generate objects that conform to the output DTD. The second phase is clearly an application of SGML transformation. It is in most cases the largest and most complex phase of up translation projects.

SGML to SGML transformations (DTD changes): This area is a very classical issue in SGML applications that involve information exchange with other environments: conversions between legacy DTDs and exchange DTDs, update of SGML documents to new versions of DTDs etc.

SGML document publishing: This area covers publishing on various possible media (paper, CD-ROM, Web, etc.). Publishing SGML documents means transforming the SGML information into a form that can be presented to consumers. This form can be the input of a composition system, a Web of HTML pages, the input format for a CD-ROM engine, etc. In all cases, this process involves a transformation of the original SGML documents towards a target (sometimes SGML or pseudo SGML) format.

This list of application areas is not exhaustive and some applications can also spread different areas. They however reflect the most current cases of SGML transformation that arise when implementing SGML systems.

2. Event driven and tree manipulation approaches

Several tools and products have been proposed to perform such transformations. They are based on SGML parsers coupled with application languages.

2.1 The event-driven approach

Most of these tools use the event driven approach where each SGML ``event'' (ESIS components in most cases) is associated to some action written in the application language. Such a language is therefore essentially rules-based, where rules associate element names (or elements in context) to lists of instructions.

The handling of context information by this kind of tools is of primary importance for writing simple applications. Rules must be able to express conditions based on contextual information. The simplest context expressions only involve the stack of open elements. More elaborated context expressions may involve the left sibling of the current node, for example.

In this approach, the context information is gathered synchronously with SGML parsing. This means that the system knows at most all the SGML information that the parser has already decoded, but all information beyond the current element is unreachable. Figure 1 above illustrates this notion of left context (before the current element) and right context (after the current element).

Moreover, we can simply figure out that keeping all the left context at hand when parsing very large SGML documents is too costly. Most systems thus limit the available context to a limited part of the left context, mainly the stack of open elements and some immediate left siblings.

Event driven SGML transformation programs will therefore restrict the context expressions to such a subset as illustrated in the following example written in the Balise language:
element ITEM [ in LIST and first ] {

  //Rename first items in lists to FITEM 

     on start { 

          changeGI("FITEM"); 

          cout << echo(); 

     } 

     on end cout << echo(); 

} 

element ITEM [ not in LIST ] { 

		     //For non List items do something else ... 

} 
The context management and the corresponding context expressions are an essential characteristic of the event driven approach. Only a limited notion of context is available and, especially, it is not possible to use information located after the current (or immediate follower) parsed element.

The consequence on transformation programs is that specific hooks have to be developed to overcome this limitation, mainly:

buffers and, more generally, complex data structures that store relevant information for further use;

multiple passes processes, with intermediate files or data structures to exchange global information

etc.

This is a strong limitation when complex SGML structure reorganization is needed.

In conclusion, the rule-based programming style, while it is very well adapted to simple/limited transformations, does not scale well to complex transformations, and programs quickly become more and more complex and difficult to maintain.

2.2 The Tree Manipulation Approach

The tree manipulation approach consists in providing the programmer with a tree manipulation abstraction rather than an event abstraction for SGML instance manipulation. The manipulation is done through an API that allows navigating in trees, modifying trees, creating new trees, etc.

This paradigm corresponds to transformations that require some global view of the manipulated documents. HyTime validation or DSSSL transformation processes specifically refer to such a tree manipulation paradigm. For instance, the product HyMinder (HyTime instance validation and manipulation from TechnoTeacher) specifically rely on a persistent form of the instance tree and a tree manipulation API for the programmer.

Transformation programs will typically look like the following example:
function transformDocument ( doc ) {

 // Rename first ITEM elements to FITEM 

 for elem in searchElemNodes(doc, "ITEM") { 

    var leftS = leftSibling(elem); 

    if leftS == nothing or GI(leftS) != "ITEM" 

                      changeGI(elem, "FITEM"); 

  } 

 // Move all FIG elements to the end of the document 

 for elem in searchElemNodes(doc, "FIG") { 

     cutSubTree(elem); 

     insertSubTree(doc,elem, AT_END); 

 } 

 // Dump the resulting tree as SGML markup 

 SdumpSubTree(doc); 

} 
The clear advantage of this approach is that the programmer has a global view of the document at hand. He can use the navigation/search primitives to check left or right contexts, call update operations on the tree, create new trees etc. Complex structural operations such as ``programmed cut and paste'' are very intuitive to express as tree transformations and they will be programmed directly using tree manipulation primitives.

This approach, however, also has some clear limitations. First, resources (especially memory) are an issue for manipulating large documents. Handling documents on secondary storage is possible but this leads to more complex systems and impede performances.

Second, while the API paradigm is best suited to complex transformations, especially those involving structural transformations, the event-driven approach leads to simpler programs for simpler transformations such as element renaming, content validation, title extraction etc.

3. Combination of both approaches

AT first glance, this looks like an alternative where the programmer has to choose one or the other paradigm (through the same or different tools), depending on the complexity of the problem to solve or the size of the documents to process. What we want to show in this presentation is that these two approaches can and should be merged within a dual programming paradigm that takes benefit from both.

3.1 The table DTD translation example

Translation between the various existing table DTDs is a classical example of a transformation that obviously takes benefit from direct access to the SGML elements. The translation between CALS and ISO or AAP DTD for instance requires subtle re-organization of rows and cells structures, and generation of attributes whose value strongly depend on the element organization.

Event driven and tree manipulation approaches can be efficiently combined in this case. Table instances are, in most cases, rather small pieces of SGML and memory-based manipulation of tables is possible. We thus have potentially large/huge instances that contain some elements (tables) that can be loaded and manipulated as trees whereas the rest of the instance is processed in event-driven mode.

This example illustrates the first possible combination of the two modes where a document is processed in event-driven mode and some specific subtrees are loaded and manipulated as trees.

Table transformation can also illustrate another alternative: in most event-driven SGML transformation, the output instance is most often generated by outputting a correct sequence of SGML markup. For the same reasons as above, it can be very useful to have some tree abstraction of the output document, that is to generate a tree representation of the instance and just dump this representation when the instance is complete.

The interest of this method for generating SGML output is typically the ability to explore parts of the tree that have already been built, and potentially insert or update elements at any position in the tree. In the case of table transformation, programs can be considerably simplified if one is able to add or modify attributes in column or row descriptors at any time.

3.2 Full context event-driven transformation

The restriction of context information to (a limited part of) the left context is the strongest constraint of event-driven transformations. Intermediate data structures can be used to ``remember'' part of the left context that may be useful for further use but, by construction, there is no way to get knowledge of elements that are not yet parsed.

This is especially frustrating when applying a transformation that can be intuitively expressed as an event-driven program but requires some right context information.

Addressing this case means extending the notion of context to the whole instance tree. More specifically, the instance tree is first built as for a tree manipulation transformation, but instead of calling a tree manipulation function on it, a ``parsing function'' is called that simply traverses the tree and generates events in the same way as a parser would do. The only difference for the application programmer is that he is able to use context expressions and navigation primitives that address the whole instance.

3.3 The Cell content issue in table translation

Consider the table translation again as an example of structure-oriented transformation. Tables have cells that may contain arbitrary elements, depending on the ``host'' DTD; typically paragraphs, text, lists, images, etc. When table translation is included into a more general transformation process, it may be necessary to also transform the cell paragraphs just like any other paragraph outside of tables.

A possible approach is to duplicate the transformation code within the table manipulation function. A more modular way is simply to invoke the event-driven actions corresponding to cell elements.

This is another example of the interest of invoking event-driven actions from pre-constructed subtrees.

4. Key interfaces

From the examples above, we can identify several key functionalities and interfaces that are necessary to successfully integrate event-driven and tree-manipulation programming in a complete SGML transformation framework. For each aspect, a short example will be given in the Balise language to illustrate the idea.

4.1 Scalable context management

The provision of a scalable context management is a major point. It means the ability to dynamically select these parts of the trees that will be kept in memory for direct manipulation in a very flexible way. It also means a complete integration between context management and tree manipulation.
 . . . element TGOUP { // From an event-driven program 

 on start holdSubTree();	/ tell the system to hold this subtree 

 on end processTable ( currentNode() );	 // call a processing function 

}	

// The subtree is automatically destroyed . . . 
4.2 Navigation/search/update interface for document trees

A full set of navigation/search primitives must be provided to allow efficient programming of tree transformations. Update interface must include local updates (insertion of new elements, update of attributes etc.) but also global structural updates such as cut and paste of subtrees.

It is also helpful, in many transformations, to be able to manipulate any number of document trees at the same time, to use temporary documents as ``tree clipboards'' etc. Documents and tree nodes should be true objects in the host language.
var next = rightSibling(node); 

var isLast = True; 

if next != nothing /* there is a right sibling */ 

          and GI(next) != GI(node) then isLast = False;

... 

var xanchor = enclosed("sec1"); 

var titleNode = child(enclosing(xanchor, "CHAP"), "TITLE"); 
4.3 Generating markup from trees for output

As a special case of the navigation/update interface described above, it should be possible to build a tree representation of an SGML output instance in a first step, before dumping it as markup. This is especially useful when some ``late'' updates have to be performed on the document.
var titleNode = Node (doc, "TITLE", Node(doc, "#CDATA", 

                                               "The title...")); 

insertSubTree(chap, 0, titleNode); 

... 

dumpSubTree(doc, cout); 
4.4 Direct update of input/context trees

A special case of SGML transformation is the case where the output DTD is very close to the input DTD and only minor changes have to be done on the input instance. In that case, the most efficient way to apply them is to directly update the input instance tree. This must be possible at any time using the tree update primitives.
var n = 0; 

element SEC { 

  on start { addAttr("ID",format("REF-%1",n++)); 

                                        // add a new SGML attribute 

  cout << echo(); // dump the modified startTag value 

  } 

  on end cout << echo; // end tag is dumped as-is 

} 
4.5 Invoking rules on arbitrary subtrees

We have seen examples where event-driven transformations may benefit from a global context information. The corresponding functionality is the ability to generate events (i.e. invoke a set of rules) from any identified subtree.
main { 

   var ret = parseDocument(List("expl.sgm")); 

   scanSubTree(ret.document,ThisProcess); 

} 

// now follows a standard transformation process. 

before { ... } 

default { 

   on start cout << echo;

   on end cout << echo; 

} 
5. Conclusion

We think that a complete integration of event-driven and tree manipulation approaches for SGML transformation is a key aspect of a powerful SGML transformation framework. None of the two approaches alone provides enough flexibility and power to address the variety of problems in the field.

This integration means more than just choosing between the two modes when writing an application. It means intimate interaction between the two modes, in both ways.

We have shown some of the key interfaces that are implemented in the Balise product and allow such an efficient integration.

Paper presented a the SGML'96 GCA conference, 18-24 Nov 96, Boston

Last Modified: 01:53pm , October 18, 1996

Event Driven or Tree Manipulation Approaches to SGML Transformation - You should not have to Choose

Abstract

1. SGML Transformation

2. Event driven and tree manipulation approaches

2.1 The event-driven approach

2.2 The Tree Manipulation Approach

3. Combination of both approaches

3.1 The table DTD translation example

3.2 Full context event-driven transformation

3.3 The Cell content issue in table translation

4. Key interfaces

4.1 Scalable context management

4.2 Navigation/search/update interface for document trees

4.3 Generating markup from trees for output

4.4 Direct update of input/context trees

4.5 Invoking rules on arbitrary subtrees

5. Conclusion

Last Modified: 01:53pm , October 18, 1996

Event Driven or Tree Manipulation Approaches to SGML Transformation -
You should not have to Choose