The SGML Tree Transformation Process

[Mirrored from: http://www.sgmlbelux.be/96/deseyne.htm]

Processing SGML documents in an absolutely standardised way

Jacques Deseyne

Sema Group Belgium
Stallestraat 96
B-1180 Brussel

E-mail: Jacques.Deseyne@sema.be

Abstract

In many SGML-related applications, cross-conversion of documents between different DTDs is necessary. In principle, SGML has given the ownership of the data back to the user and has relieved him of the dependence on particular document data formats. Alas, real-world SGML applications tie him back again to the particularities of one or another vendor's tool and scripting language. The processing model defined by DSSSL holds the promise that this yoke, too, will be cast off.

Keywords: SGML Conversion, DSSSL, Groves, Architectural Forms, Property sets

Introduction

In the sections below, we will use the following terms, which have found widespread acceptance in the SGML user community.

By SGML transformation is meant the conversion of a document instance according to a given DTD into one according to a different DTD, whereby content can be generated, deleted or re-arranged. Elements can be merged or split; attribute values can be converted to element content or vice-versa.

SGML up-conversion is an often-used term indicating the conversion of a document not coded in SGML into an SGML document. Down-conversion is used for the opposite process.

The need for SGML Transformations

Getting your documents into SGML may prove to be painful. Authoring can be hard; converting existing documents from particular formats into generic SGML "for eternity" is in general even harder. Once the parser doesn't complain anymore and semantic control gives reasonable satisfaction, SGML implementation teams have the kind of feeling many farmers must have after they succeed in securing their harvest in the barns just before the thunderstorms of the autumn arrive.

Alas, even the SGML encoding which took all this effort is less perennial than we may think. In practice, conversion of SGML instances from a mark-up according to one DTD into a mark-up according to a different DTD is quite common. One can think of the following situations.

State-of-the-art DTDs vs. Application Environment

An SGML application may be based on optimal DTDs, which reflect the logical structure to the maximum, avoid any formatting-oriented elements and attributes and do not force the document instances to contain text which can better be auto-generated, such as Section, Paragraph or Footnote numbers.

Yet, it is sometimes desirable (or necessary) to use modified DTDs to accommodate the application tools one happens to have:

authoring tends to be more efficient (in terms of keying speed) when DTDs are not too hierarchic and do not have an abundance of nested container elements;
authoring tools cannot always cope with very complex DTDs or large documents (for instance, SGML authoring extensions to word processors);
authoring tools do not always support SGML features which are used in the optimal DTD (such as SUBDOC or CONCUR);
document databases and repositories may require that attribute information be stored as character data within specific elements;
publishing systems for paper or on-line media may have problems auto-generating text if complex context-sensitive algorithms have to be applied.

Compromising DTD quality in the document creation process is not recommended, unless a post-conversion utility can add the intelligence the authoring tool doesn't have. Such post-conversion is an SGML transformation.

For publishing systems, it may be necessary to convert the instance to a different encoding which is more presentation oriented, e.g. where bullets and heading numberings are added as textual content. This is also an SGML transformation.

Up-conversion of legacy documents in several steps

Converting an existing set of documents can be painful. Often, the creation and maintenance of those documents in the dark ages before the organisation decided to use SGML did not take logical structure into account. Such documents were created with word processors, by typesetting systems or obtained through scanning and OCR. In the best case, the structure they reflect is presentation-oriented.

A useful approach when up-converting huge sets of documents in proprietary word-processor formats and plain text files is to do it in several steps, isolating the different methods for structural recognition.

A first step can take the original document and convert it into an SGML instance structured according to a low-level DTD, reflecting the visual presentation features and relevant word processor codes. The Rainbow DTD from EBT is an example of such a formatting-oriented encoding. With text files, the first step filters out spurious content, takes care of character code sets and tries to take maximal advantage of structural hints, e.g. through SHORTREF and DATATAG mechanisms.

Next steps can then tackle specific sub-tasks of interpreting the document in order to gradually introduce a "richer", more content-oriented structure tagging.

Except for the first step, all other processes are SGML transformations.

On-the-fly generation of HTML documents from an SGML repository

As explained by Lou Burnard earlier today and in many other contributions on Internet discussion groups and SGML Conferences, HTML is probably not an appropriate choice for structuring and maintaining living documents in a repository.

Converting an SGML instance into "low-level" HTML can also be considered as an SGML transformation although current practices, such as the increasing use of frames and tables for mere layout purposes, rather suggest the terms down-conversion or formatting.

Updating the structure of existing SGML documents

The environment in which SGML is used is evolving quickly. Whereas ISO 8879 defines an (abstract) syntax for document structuring, related standards have defined mechanics for attaching semantics to structural elements. The concept of architectural forms was defined in HyTime (ISO 10744:1992) and is further enhanced in the HyTime Technical Corrigendum, soon to be published.

The best known example is probably the HyTime addressing conventions between separate documents, such as clink and ilink. The use of these conventions will ensure that a HyTime-aware environment will be able to interpret the semantics, which would not be the case with proprietary constructs, even if defined and expressed in SGML.

Current transformation tools

Today's users have a choice among several commercial and public domain SGML transformation tools. Some conversion engines don't rely on a DTD and are in fact not SGML aware. We cannot call them transformation tools.

The SGML standard doesn't tell very much about document processing in general and document transformation in particular:

tag minimisation features can be used to imply structural indications;
LINK set declarations allow contexts for semantic processing to be defined;
several recognition modes are defined for parsing documents.

The notion of ESIS (Element Structure Information Set) is only implicitly present in ISO 8879. The ISO working document Recommendations for a Possible Revision of ISO 8879, ISO/IEC JTC1/SC18/WG8/N1035, defines ESIS explicitly (Note 1).

ESIS defines the set of information that is acted upon by structure-controlled applications, for active document types, active link types, start and end of an element, processing instructions, data, attribute information, references to internal and external entities and link set information. For instance, the ESIS at the start of an element holds information about:

the generic identifier,
attribute information for the start-tag,
for each link rule to which the element is associated, link attribute information,
applicable link rule selected,
for the chosen link rule, result generic identifier and attribute information,
link set (if any) information.
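
As a rough illustration, the items listed above could be held in a record such as the one below. This is only a Python sketch: the class and field names (StartElementInfo, LinkRuleInfo, selected_rule and so on) are invented for the example and are not defined by ISO 8879 or by N1035.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LinkRuleInfo:
    """Hypothetical record for one link rule associated with an element."""
    link_attributes: Dict[str, str] = field(default_factory=dict)
    result_gi: Optional[str] = None
    result_attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class StartElementInfo:
    """Hypothetical container for the ESIS items available at a start-tag."""
    gi: str                                        # generic identifier
    attributes: Dict[str, str] = field(default_factory=dict)
    link_rules: List[LinkRuleInfo] = field(default_factory=list)
    selected_rule: Optional[int] = None            # index into link_rules
    link_set: Optional[str] = None                 # link set information, if any

    def result_element(self):
        """Return (gi, attributes) of the result element for the chosen rule."""
        if self.selected_rule is None:
            return None
        rule = self.link_rules[self.selected_rule]
        return rule.result_gi, rule.result_attributes


start = StartElementInfo(
    gi="para",
    attributes={"id": "p1"},
    link_rules=[LinkRuleInfo(result_gi="P", result_attributes={"class": "body"})],
    selected_rule=0,
)
```

Here start.result_element() gives the result generic identifier and attributes for the selected rule, or None when no link rule applies.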

The EXPLICIT LINK sets are the only feature with which ISO 8879 provides a mechanism to define a correspondence between elements from different DTDs. Unfortunately, the one-to-one nature of this mapping imposes severe limits on the processing which can be done: element boundaries have to coincide.

Event-driven tools

Most tools perform an event-driven process. They are configured for a specific conversion by specifying useful events and defining the actions which have to be taken. Examples of such useful events are:

occurrence of specific elements
occurrence of specific content portions or patterns
occurrence of markup (tags, declarations, entity references, processing instructions, marked sections, ...)

This kind of process handles the document as a stream: it is read from start to end and actions are triggered by the recognition of an event. Information or content can be kept in storage locations (entities) and put out in another sequence than the one present in the processed document.

Events can be "qualified", i.e. they occur in specific contexts, such as determined by the value of an attribute, the parent element or even an elaborate ancestry of the current element.

Context-specific handling is in general not really based on extra knowledge about the location in the document. Rather, the scripts specifying the actions to be taken are used to build (sometimes extensive) state machines where events will cause transitions.
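
The stream-oriented approach might be sketched as follows. This Python fragment is an assumption-laden illustration, not the scripting language of any actual tool: the event names ("start", "end", "data") and the mapping to H1/H2 are invented. The only location knowledge the script has is the stack of open elements it maintains itself.

```python
# Each event is a (kind, value) pair, in the order a parser might report them.
EVENTS = [
    ("start", "chapter"), ("start", "title"), ("data", "Groves"), ("end", "title"),
    ("start", "section"), ("start", "title"), ("data", "Nodes"), ("end", "title"),
    ("end", "section"), ("end", "chapter"),
]


def convert(events):
    """Read the stream once, front to back, keeping only a stack as state."""
    stack = []     # currently open elements
    output = []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif kind == "data":
            # Qualified event: data inside a title, qualified by the
            # parent element of that title.
            if len(stack) >= 2 and stack[-1] == "title":
                level = {"chapter": "H1", "section": "H2"}.get(stack[-2])
                if level:
                    output.append("<%s>%s</%s>" % (level, value, level))
    return output


result = convert(EVENTS)
```

Anything that depends on content appearing later in the stream would have to be buffered by the script itself, which is where such state machines tend to grow.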

Tree manipulation-based tools

More recent tools do more than just recognise markup and patterns. They build an SGML object tree from the sequence of information the parser provides, which makes it possible to navigate through the complete document structure.

Classical operations such as pruning, duplicating and grafting of sub-trees make more powerful operations possible than event-driven transformation does.
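
A minimal sketch of pruning, duplicating and grafting, using Python's standard xml.etree.ElementTree module. XML syntax is used here only because the Python standard library has no SGML parser; the element names are made up for the example.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<report><front><title>Groves</title></front>"
    "<body><note>draft</note><para>Text</para></body></report>"
)

body = doc.find("body")

# Pruning: remove the <note> sub-tree from <body>.
body.remove(body.find("note"))

# Duplicating and grafting: copy the <title> sub-tree and attach the copy
# as the first child of <body>.
title_copy = copy.deepcopy(doc.find("front/title"))
body.insert(0, title_copy)

result = ET.tostring(doc, encoding="unicode")
```

Because the whole tree is in memory, the title can be copied into the body regardless of where it occurred in the input stream.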

Tools supporting HyTime location facilities generally build and store one or another kind of document tree.

The tool's Application Language

The actions related to a particular event or the operations to be performed on a tree representation are specified in a procedural or declarative programming language. This programming or scripting language is sometimes called a Data Manipulation Language, maybe in an attempt to address administrators of transactional systems in terms they are familiar with.

This application language is what makes users dependent on a specific tool. With every tool, it takes a learning curve to master the programming concepts and the idiosyncrasies. Developing, testing and debugging conversion programs takes a fair amount of time for all but the most trivial tasks. Once an application gives satisfactory results, the user sticks with it and with the tool, the only change being maybe a new release, promising even better functionality, fewer bugs and other miracles.

Even tools with an application language of a more public nature, such as CoST using Tcl or the MIPS HyTime engine using C++, rely on specific data structures, which makes it hard to exchange applications between different environments.

To conclude, the emancipation from proprietary and application-dependent formats by using SGML has not freed the user from the particularities of the tools he is using. The finalisation of DSSSL, the Document Style Semantics and Specification Language, holds a promise that further emancipation will be possible.

DSSSL - ISO 10179:1996

The Document Style Semantics and Specification Language (DSSSL) was published this year as ISO/IEC standard 10179.

The standard as a whole is presented by other speakers, so only the general outline is shown in the following graphic.

[Figure: A global view on DSSSL]

The transformation process

The transformation process as defined in DSSSL is a much more powerful concept than the foundations of current tools. The following graphic shows the concept:

The SGML Tree Transformation Process (STTP) consists of three components.

The grove builder accepts one or more SGML documents, parses them and builds a source grove.
The transformer takes a transformation specification, applies it to the source grove and thus builds one or more result groves.
The SGML generator accepts the result grove(s) as input and generates one or more valid SGML documents out of them.

All three components are under the authority of the transformation specification. A transformation specification is an SGML document corresponding to the DSSSL document architecture.
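
The division of labour between the three components can be sketched as a simple pipeline. The Python below is a toy stand-in under strong assumptions: an ElementTree tree plays the role of the grove, the "specification" is a plain renaming table, and the function names build_grove, transform and generate are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def build_grove(document_text):
    """Stand-in grove builder: parse the document into a tree."""
    return ET.fromstring(document_text)


def transform(source, spec):
    """Stand-in transformer: build a new result tree, renaming elements."""
    result = ET.Element(spec.get(source.tag, source.tag), dict(source.attrib))
    result.text = source.text
    for child in source:
        new_child = transform(child, spec)
        new_child.tail = child.tail
        result.append(new_child)
    return result


def generate(result):
    """Stand-in generator: serialise the result tree back to markup."""
    return ET.tostring(result, encoding="unicode")


spec = {"sect": "section", "ttl": "title"}     # hypothetical specification
source = build_grove("<sect><ttl>Groves</ttl></sect>")
output = generate(transform(source, spec))
```

Note that the transformer leaves the source tree untouched and constructs a separate result tree, which mirrors the source grove / result grove distinction.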

The Transformation Specification

It can contain the following components:

supported DSSSL features
specification of the neutral representation of the characters present in the document
the SGML grove plan
the transformation specification body, consisting of a sequence of DSSSL transformation language expressions.

These components will be discussed in more detail hereafter.

Groves

The transformation process as defined by DSSSL specifies a particular internal representation of an SGML document to work on: the grove.

Clause 6.7.1 of the HyTime Technical Corrigendum, to be published early in 1997, will explain the acronym "grove" as "graph representation of object/property value environments". The DSSSL standard itself defines grove as "a set of nodes connected into a graph by their nodal properties."

A grove may be considered as a set of objects ("nodes") which are characterised by properties to which values are assigned. The properties behave as connectors between nodes. There are other connections than just parent-sibling relationships, so that the grove is actually a directed acyclic graph.
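
To make the idea of nodes connected by property values more tangible, here is a small Python sketch. The Node class and the property names used (gi, content, referent) are chosen for the illustration only and are not taken from the SGML Property Set.

```python
class Node:
    """A grove node: an object of some class with named property values."""

    def __init__(self, node_class, **properties):
        self.node_class = node_class
        self.properties = dict(properties)

    def __getitem__(self, name):
        return self.properties[name]


title = Node("element", gi="title", content=[])

# A property can connect nodes outside the containment hierarchy, for
# example from a cross-reference to the element it refers to.
xref = Node("element", gi="xref", content=[], referent=title)

section = Node("element", gi="section", content=[title, xref])


def children_gis(node):
    """Follow the 'content' property and collect generic identifiers."""
    return [child["gi"] for child in node["content"]]
```

The title node is reachable from section along two different paths (directly through content, and through the referent property of xref), so the structure is a graph rather than a plain tree.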

A grove is typically generated by parsing an SGML document or as the result of a transformation process. A partial grove could be represented by the following graphic.

The nodes in a grove consist of much more than just the logical elements in the SGML document. A grove for even the simplest SGML document will already contain hundreds of nodes.

SGML Property Set

Of what objects and properties is a grove made up? These are defined in the SGML Property Set, an exhaustive listing of the components of an SGML document, with their possible characteristics.

The Property Set is a concept which appeared in the HyTime standard (ISO 10744:1992). In that standard, the HyTime Property Set is defined, listing the components which can make up a HyTime application.

The SGML Property Set contains several modules defining object classes, properties, data types and enumerated values. These modules are related to:

SGML declaration
Document prolog (base DTD, possible other DTDs, LINK process declarations)
Document instance
Support of optional features

Not all modules are required. There can be several levels of granularity with respect to the amount of information which is retained within the grove.

As an example, take the following part of an SGML declaration:

CAPACITY PUBLIC "ISO 8879-1986//CAPACITY
		 Reference//EN"

According to the SGML standard, the sequence of RE, RS and separator characters within the literal is interpreted as a single SPACE character. Now, an environment can consider it useful to keep these "ignored" characters inside the grove, such that the literal is represented by the following nodelist:

...
datachar Value 'A'
datachar Value 'C'
datachar Value 'I'
datachar Value 'T'
datachar Value 'Y'
datachar Value SPACE    property intrplch Value RE
intignch Value RS
intignch Value TAB
intignch Value TAB
intignch Value SPACE
datachar Value 'R'
datachar Value 'e'
datachar Value 'f'
datachar Value 'e'
...

An optional module defines, among other things, the intrplch ("interpretation replaced character") property, which holds the value of the original character that was replaced, and the intignch object class ("interpretation ignored character"). An environment wishing to take these ignored data from the input document into account will have to support the Base SGML document string level 1 module (the level 0 module is always required).
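
The nodelist above can be reproduced with a few lines of Python. The rule implemented here is a deliberate simplification made for this sketch (the first character of a run of RE, RS, TAB or SPACE becomes the SPACE data character, the rest of the run is kept as ignored characters); it is not the full literal interpretation procedure of ISO 8879. The tuple layout is likewise an assumption.

```python
RE, RS, TAB, SPACE = "\r", "\n", "\t", " "
NAMES = {RE: "RE", RS: "RS", TAB: "TAB", SPACE: "SPACE"}


def literal_nodes(literal):
    """Return (class, value, intrplch) tuples for each character of a literal."""
    nodes = []
    in_run = False
    for ch in literal:
        if ch in NAMES:
            if in_run:
                # Later characters of the run are ignored, but retained.
                nodes.append(("intignch", NAMES[ch], None))
            else:
                # First character of the run is replaced by a SPACE; remember
                # the original unless it already was a SPACE.
                replaced = NAMES[ch] if ch != SPACE else None
                nodes.append(("datachar", "SPACE", replaced))
                in_run = True
        else:
            nodes.append(("datachar", ch, None))
            in_run = False
    return nodes


nodes = literal_nodes("CAPACITY" + RE + RS + TAB + TAB + SPACE + "Reference")
```

An environment supporting only the level 0 module would simply drop the intignch entries and the third tuple member.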

Wanting to retain such information may be considered as hairsplitting or of purely academic interest. Yet, in the light of the possible future evolution of SGML and related standards, it might indeed be wise not to exclude a priori any information from the SGML property set.

An application doesn't have to take into account the complete set of objects and properties. A grove is built following a grove plan, i.e. the set of classes and properties an application is interested in. A grove plan is a subset of the complete property set.
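
Seen as data, a grove plan is just a selection from the property set. The Python sketch below uses a heavily abridged and partly invented property set (the class and property names are illustrative assumptions) to show the subset relation and its effect on what is retained.

```python
# Hypothetical, abridged property set: class name -> property names.
PROPERTY_SET = {
    "element": {"gi", "attributes", "content", "id"},
    "datachar": {"char", "intrplch"},
    "intignch": {"char"},
    "pi": {"sysdata"},
}

# A grove plan selects the classes and properties the application wants.
GROVE_PLAN = {
    "element": {"gi", "content"},
    "datachar": {"char"},
}


def is_subset(plan, property_set):
    """A plan may only name classes and properties the property set defines."""
    return all(cls in property_set and props <= property_set[cls]
               for cls, props in plan.items())


def apply_plan(nodes, plan):
    """Keep only planned classes, and only planned properties on them."""
    kept = []
    for cls, props in nodes:
        if cls in plan:
            kept.append((cls, {k: v for k, v in props.items() if k in plan[cls]}))
    return kept


raw = [
    ("element", {"gi": "para", "id": "p1"}),
    ("datachar", {"char": " ", "intrplch": "RE"}),
    ("intignch", {"char": "RS"}),
]
planned = apply_plan(raw, GROVE_PLAN)
```

With this plan the ignored character and the intrplch property disappear from the grove, which is what an application not interested in them would ask for.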

The Grove Builder

The grove builder takes one or more SGML documents, parses them and creates a source grove. The grove plan which is part of the transformation specification tells which object classes and properties have to be considered.

All characters in the input document are represented in a neutral way. For instance, the source document can represent the character é in different ways:

as a single character (e.g., encoded in ISO 8859-1);
as a non-spacing diacritic and the character e (two characters);
as the character e and a combining diacritic (two characters);
as the entity reference &eacute;.

Input characters are normalised into a single character repertoire, where all the same characters have the same representation. This may imply conversion between different character sets. The transformation specification contains the necessary information to guide this process.
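
The effect of such a normalisation can be imitated with Python's standard library, taking Unicode NFC as the "neutral" repertoire. That choice, and the use of the HTML entity table to resolve the entity reference, are assumptions of this sketch; DSSSL itself does not prescribe them. The variant with the diacritic preceding the base letter is not covered.

```python
import html
import unicodedata

variants = [
    b"\xe9".decode("iso-8859-1"),   # single character encoded in ISO 8859-1
    "e\u0301",                      # e followed by a combining acute accent
    html.unescape("&eacute;"),      # entity reference, resolved via the HTML table
]


def neutral(text):
    """Map every variant onto one representation (Unicode NFC here)."""
    return unicodedata.normalize("NFC", text)


normalised = {neutral(v) for v in variants}
```

After normalisation the three input forms collapse into a single representation, so later queries need not care how the character was originally encoded.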

The source grove is an internal representation, of which the user is not necessarily aware.

The Transformer and the Transformation Language

The transformation engine accepts the transformation specification body and applies the expressions to the source grove.

The transformation specification body is a collection of associations which have the form:

query expression transform expression [priority expression]

These statements do the following:

the query expression queries the source grove and returns a node list as the result;
the transform expression is evaluated for each node of the node list;
the priority expression is optional and indicates the priority with which the association will be applied; if none is specified, the priority is 0, the lowest level.

If several transformation associations apply to one node, the one with the highest priority is applied.

The transform expression creates a node or several nodes in the result grove.
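
The selection mechanism can be mimicked in a few lines of Python. Everything here (the dictionary nodes, the lambda queries, the string results) is a simplification invented for the illustration; only the idea of query, transform and priority, with the highest priority winning, comes from the description above.

```python
nodes = [
    {"gi": "title", "parent": "section"},
    {"gi": "title", "parent": "chapter"},
    {"gi": "para", "parent": "section"},
]

# (query, transform, priority) triples; a query returns the matching nodes.
associations = [
    (lambda ns: [n for n in ns if n["gi"] == "title"],
     lambda n: "heading", 0),
    (lambda ns: [n for n in ns
                 if n["gi"] == "title" and n["parent"] == "section"],
     lambda n: "section-heading", 1),
    (lambda ns: [n for n in ns if n["gi"] == "para"],
     lambda n: "paragraph", 0),
]


def run(nodes, associations):
    """Apply to each node the matching association with the highest priority."""
    best = {}   # id(node) -> (priority, transform)
    for query, transform, priority in associations:
        for node in query(nodes):
            current = best.get(id(node))
            if current is None or priority > current[0]:
                best[id(node)] = (priority, transform)
    return [best[id(n)][1](n) for n in nodes if id(n) in best]


result = run(nodes, associations)
```

The title inside a section matches two associations; the one with priority 1 is applied, while the chapter title falls back to the general rule.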

An example of a statement (with no priority indicated) is one which inserts legal style numbering (such as "1.1", "1.3.2", ...) in section titles:

((match-element? '(section title) nd)
 (format-number-list
  (hierarchical-number '("chapter" "section") nd)
  "1"
  "."))

The first line constitutes the query expression (checks if the generic identifier of the current node is title and if the generic identifier of the parent is section).

The following lines contain the transform expression. It creates a string representation of the numbering, using format-number-list, a standard procedure of the DSSSL expression language, and calling the hierarchical-number procedure from the DSSSL query language.
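
For readers unfamiliar with these procedures, the numbering logic can be approximated in Python as follows. The functions borrow the DSSSL names but only approximate their behaviour (for instance, only the decimal format "1" is handled), and XML syntax with ElementTree stands in for an SGML grove.

```python
import xml.etree.ElementTree as ET

BOOK = ET.fromstring(
    "<book>"
    "<chapter><title>A</title>"
    "<section><title>A1</title></section>"
    "<section><title>A2</title></section>"
    "</chapter>"
    "<chapter><title>B</title>"
    "<section><title>B1</title></section>"
    "</chapter>"
    "</book>"
)

# ElementTree has no parent pointers, so build a child -> parent map.
PARENTS = {child: parent for parent in BOOK.iter() for child in parent}


def child_number(elem):
    """1-based position of elem among its siblings of the same type."""
    siblings = [e for e in PARENTS[elem] if e.tag == elem.tag]
    return siblings.index(elem) + 1


def hierarchical_number(gis, elem):
    """Child numbers of the nearest ancestors of elem named in gis."""
    ancestors = []
    node = elem
    while node in PARENTS:
        node = PARENTS[node]
        ancestors.append(node)
    numbers = []
    for gi in gis:
        match = next((a for a in ancestors if a.tag == gi), None)
        if match is not None and match in PARENTS:
            numbers.append(child_number(match))
    return numbers


def format_number_list(numbers, fmt, sep):
    """Join the numbers with sep; this sketch supports decimal format only."""
    if fmt != "1":
        raise ValueError("only the decimal format is supported in this sketch")
    return sep.join(str(n) for n in numbers)


# Query: titles whose parent is a section. Transform: compute the number.
numbering = {
    title.text: format_number_list(
        hierarchical_number(["chapter", "section"], title), "1", ".")
    for title in BOOK.iter("title")
    if PARENTS[title].tag == "section"
}
```

The dictionary comprehension plays the role of the association: its filter corresponds to the query expression and its value to the transform expression.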

DSSSL defines an expression language based on a side-effect-free subset of the well-known programming language Scheme, a dialect of LISP. The Query Language defines specific procedures for querying SGML groves.

Possible operations

The transformation makes it possible to:

combine structures from the source document into a single structure in the result document (e.g., combine elements forename and surname into an element name, separating their contents by a space);
create new elements (e.g., a table of contents);
create new nodes associated to sequences of content (e.g., the first item in a bulleted list may be mapped into a specific item to receive a different presentation afterwards);
create new nodes associated to content portions (e.g., extending the string "SGML Users' Group" with an image containing the logo or making a specific element out of it).
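
The first operation in this list (merging forename and surname into name) could look like this in Python with ElementTree. It is a sketch only: it modifies the tree in place for brevity, whereas the Transformation Process builds a separate result grove, and XML syntax stands in for SGML.

```python
import xml.etree.ElementTree as ET


def merge_names(author):
    """Replace <forename> and <surname> children by a single <name> element."""
    forename = author.find("forename")
    surname = author.find("surname")
    name = ET.Element("name")
    # Contents of both elements, separated by a space.
    name.text = "%s %s" % (forename.text, surname.text)
    position = list(author).index(forename)
    author.remove(forename)
    author.remove(surname)
    author.insert(position, name)
    return author


author = ET.fromstring(
    "<author><forename>Jacques</forename><surname>Deseyne</surname></author>"
)
result = ET.tostring(merge_names(author), encoding="unicode")
```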

The possibilities go well beyond what any SGML conversion tool allows at this moment. For instance, the resulting destination of a transformation expression can be calculated as a function of other transformations that have occurred. In addition, the expression language allows an application to define its own specific procedures or functions, which can be re-used.

DSSSL allows regular expressions to be used to identify content portions. This is a very powerful extension of the short reference facilities which were provided by the SGML standard.
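
As an analogy, the following Python fragment uses the standard re module to turn a recognised content portion into an element. It illustrates the idea only; it does not use DSSSL's own regular expression facility, and the element name orgname is invented.

```python
import re
from xml.sax.saxutils import escape

PATTERN = re.compile(r"SGML Users' Group")


def mark_up(text, gi="orgname"):
    """Wrap every match of PATTERN in an element with generic identifier gi."""
    parts = []
    last = 0
    for match in PATTERN.finditer(text):
        parts.append(escape(text[last:match.start()]))
        parts.append("<%s>%s</%s>" % (gi, escape(match.group(0)), gi))
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


result = mark_up("Join the SGML Users' Group today.")
```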

The SGML Generator

This module takes the result grove(s), collects the information and builds one or more valid SGML documents out of them.

Advantages offered by the Transformation Process

DSSSL is a new, really complex standard and the Transformation Process can still only be described on paper. Yet, its advantages compared to existing tools are manifold.

The normalisation of the character repertoire allows characters to be handled independently of their encoding.
The SGML Property Set defines a maximal view (with the highest level of granularity) on the information which can be obtained by parsing an SGML document and thus extends the original ESIS in a significant way.
DSSSL defines a real high-level programming language, much better suited for expressing an application than the scripting or macro languages of existing conversion tools.
DSSSL defines an SGML query language which will be common with the (corrected) HyTime standard.
A regular expression facility allows content to be identified and recognised as a structural element.
An SGML transformation based on DSSSL does not have to take into account vendor-specific approaches. Such a transformation can be really independent from the actual tool which is used.
Users can become owners of their application as well as of their data.

The present and future of the Transformation Process

STTP is part of DSSSL, a standard which was published at the beginning of this year. Although the text of ISO 10179 is well written, the subject covered is very complex and it takes some hours of reading before a minimal insight into some topics starts to appear. There is not much documentation available and even on Internet newsgroups, only a few dare to pose questions, often getting no answers to them. At the moment, only a small knowledge base exists for DSSSL and, a fortiori, for the Transformation Process.

Some tools supporting DSSSL are starting to appear, such as Jade from James Clark and YADE, covered elsewhere in these proceedings. I believe most of these tools must be considered as beta versions. None of them supports the core features of the Transformation Process so far.

Contributors to the standardisation work of DSSSL itself acknowledge that a "formidable" effort will be needed to implement a transformation engine which supports the Transformation Process.

Software development among existing SGML conversion tool vendors concentrates on different areas, such as database connectivity, with an eye on where the big money is, i.e. the World Wide Web servers for Intranets. Their attitude is rationalised as a way "to protect the customers' investment in infrastructure", but the reality behind it looks more like protecting their own investment.

Vendors will not readily give away the advantage that users are tied to their application environment. Users will have to require support of DSSSL's transformation process before vendors really will start to support it.

Yet, support of the base level of STTP doesn't differ that much from current support of SGML features; only the Application Language part has to be drastically changed. Some not-for-profit projects have started implementation and it may be assumed that the base level will also be supported by commercial tools in a not too distant future.

Given the current speed at which companies and tools evolve, it should be clear within one year from now whether STTP stands a chance of finding practical use or not.

October 1996

Note

1. Reproduced in Charles F. Goldfarb, The SGML Handbook, Oxford, 1990, pp. 571-593.