Section 2.3: Transformation tools

Transformation tools take an input stream, or document, and apply a set of transforms to generate the output stream, or document. The transformation may be to or from SGML-tagged documents, but not necessarily. Some tools use SGML as an `internal' representation of the information structure. They can also be used to transform from one SGML tagging to another. They are not `autotaggers', in that they provide a high-level programming environment for the definition of the transformations to be undertaken. They may or may not include an SGML parsing capability.


Product:
Balise (v2.2)
Associated Products:
SGML Store (see below), DynaText/Balise Toolkit (DT/BL)
Developer:
A.I.S. S.A. (France)
UK Supplier(s):
SGML Systems Engineering, Wells
Price:
Single user: DOS: £1,900, Unix: £3,900, VMS: £3,900. Development licence: DOS/Unix: £8,150, VMS: £11,500
Platforms:
MS-Windows, Windows NT, Unix, VAX/VMS
Description:
Balise provides both SGML event-programming and explicit ESIS-tree manipulation. Balise can represent parts of the SGML instance as memory-based trees, thus freeing the programmer from the limitations of sequential access to SGML information, and providing unconstrained transformation power. Balise lets the user program in a rules-based declarative mode to process SGML events, or in an imperative, procedural mode to explicitly create, access and modify ESIS trees.

The Balise language offers all the power of a fully-fledged programming language and all the flexibility the user can expect from an interpreted language.

Balise is able to handle any SGML transformation problem. Major application areas are: content level instance validation, instance enrichment, database loading, instance formatting, and SGML to SGML transformation.

The DynaText/Balise toolkit will provide (due July 1994) a direct connection between the Balise SGML processing language and DynaText databases. DT/BL Toolkit will give structured access to the SIT navigation and query capabilities through Balise functions, as well as full access to DynaText stylesheets.

Assessment:
Balise is a high-level SGML application programming environment comprising an SGML parser (SGMLS) linked to an interpreted programming language. It can be used for down-translation (SGML documents to X), cross-translation (SGML to SGML), and up-translation (X to SGML). The previous version (2.1) could not be used for up-translation, as the only form of input was a valid SGML file. Version 2.2, however, adds enhanced regular expression and pattern matching functions, sufficient to create complex lexical analysers, so that Balise can now read arbitrary input files.

Balise operates on the SGML structure of a document, rather than on the occurrence of start- and end-tags. This means that Balise recognises the start or end of an element irrespective of whether the start- or end-tag is explicit or whether it is implied because of the usage of minimisation features.

Balise can be used in three different modes:

  1. as a general procedural programming language (for example, when performing up-translations);
  2. in parser-driven SGML event-recognition mode (for example, when performing down-translations). Here the inbuilt SGMLS parser notifies the Balise kernel of SGML "events", such as the start or end of an element, an SDATA entity reference, or a processing instruction. Each event may have actions associated with it, written as statements in the Balise programming language;
  3. in ESIS-tree manipulation mode (for example, when performing cross-translations, where an existing SGML file must be restructured to conform to a different DTD). A common requirement of this type is converting tables created to one structure into tables conforming to another. The document is parsed and data structures conforming to the ESIS (Element Structure Information Set) are built in memory. These data structures may then be accessed randomly, rather than sequentially as in parser-driven mode.
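
The difference between the event-driven and tree-manipulation modes can be sketched in Python. This is not Balise syntax. It assumes a simplified subset of the line-oriented ESIS output written by sgmls, in which `(` opens an element, `)` closes it and `-` carries character data. The function names `run_events` and `build_tree`, the sample document and the output markup are all hypothetical.

```python
# Illustrative Python, not Balise. Input is a simplified subset of the
# line-oriented ESIS output of sgmls: "(GI" start, ")GI" end, "-text" data.
ESIS = """(DOC
(TITLE
-Tools
)TITLE
(PARA
-Hello
)PARA
)DOC"""

def run_events(esis, handlers):
    """Event mode: call a handler as each event arrives, keeping no tree."""
    out = []
    for line in esis.splitlines():
        kind, arg = line[0], line[1:]
        # Prefer a handler for this exact element, else a catch-all for the event kind.
        handler = handlers.get((kind, arg)) or handlers.get((kind, None))
        if handler:
            out.append(handler(arg))
    return "".join(out)

def build_tree(esis):
    """Tree mode: build nested [name, children] nodes for random access."""
    root = ["#ROOT", []]
    stack = [root]
    for line in esis.splitlines():
        kind, arg = line[0], line[1:]
        if kind == "(":
            node = [arg, []]
            stack[-1][1].append(node)
            stack.append(node)
        elif kind == ")":
            stack.pop()
        elif kind == "-":
            stack[-1][1].append(arg)
    return root[1][0]

# Down-translation in event mode: emit a made-up line-oriented markup.
handlers = {
    ("(", "TITLE"): lambda gi: ".H1 ",
    ("(", "PARA"): lambda gi: ".P ",
    (")", None): lambda gi: "\n" if gi != "DOC" else "",
    ("-", None): lambda text: text,
}
text = run_events(ESIS, handlers)

# Cross-translation in tree mode: reorder the children of DOC, which needs
# the whole element in memory rather than a sequential stream.
doc = build_tree(ESIS)
doc[1].reverse()
order = [child[0] for child in doc[1]]
```

The event-mode function never holds more than the current line, whereas the tree-mode function must read the whole document before any restructuring can begin; that is the trade-off the two modes expose.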

The Balise language contains programming components such as:

In SGML event-driven mode, Balise recognises the following events:

Balise keeps track of ancestors and of the attribute values for those ancestor nodes, enabling context-sensitive decisions to be made within the program. Balise also supports `applicative attributes': user-defined variables attached to the current element, which remain accessible until the end tag of that element is encountered.
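
One way to picture ancestor tracking and element-scoped variables is a stack of frames, one per open element. The following Python sketch is an illustration only, not Balise code. It assumes sgmls-style ESIS lines, where an `Aname TYPE value` line gives an attribute of the next start tag; IMPLIED attributes are not handled. The frame layout, the `children` variable and the sample data are hypothetical.

```python
# Illustrative Python, not Balise. Each open element gets a frame holding its
# name, its attributes, and user variables that are discarded at the end tag.
ESIS = """ALANG CDATA fr
(DOC
(SEC
(PARA
-Bonjour
)PARA
)SEC
(PARA
-Salut
)PARA
)DOC"""

def process(esis):
    stack, pending, data, closed = [], {}, [], []
    for line in esis.splitlines():
        kind, arg = line[0], line[1:]
        if kind == "A":
            # Attribute lines precede the start tag they belong to.
            name, _type, value = arg.split(" ", 2)
            pending[name] = value
        elif kind == "(":
            if stack:
                # Element-scoped variable: count children on the parent's frame.
                parent_vars = stack[-1]["vars"]
                parent_vars["children"] = parent_vars.get("children", 0) + 1
            stack.append({"gi": arg, "attrs": pending, "vars": {}})
            pending = {}
        elif kind == ")":
            # The frame, and every variable attached to it, goes away here.
            frame = stack.pop()
            closed.append((frame["gi"], frame["vars"].get("children", 0)))
        elif kind == "-":
            ancestors = [f["gi"] for f in stack[:-1]]
            # Context-sensitive decision: take LANG from the nearest element that has it.
            lang = next((f["attrs"]["LANG"] for f in reversed(stack)
                         if "LANG" in f["attrs"]), None)
            data.append((arg, "SEC" in ancestors, lang))
    return data, closed

data, closed = process(ESIS)
```

Here `data` records, for each run of text, whether it sits inside a SEC and which LANG value it inherits, while `closed` shows the per-element counter as it stood when each end tag arrived.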

Balise is command line driven and allows the user to pass arguments to the program that can be used to set variable values within the program (this is similar to the usage of the C `argv' variable). The documentation is packaged in a single manual with a very usable Quick Reference section.

Information available to Balise is ESIS, with two extensions: information is retained from the DTD about whether an element was declared EMPTY, and SDATA entity names are recorded. In most cases this set of information is sufficient for complex processing of SGML documents, but some applications need other, non-ESIS information. For example, in Balise it is not possible to test the declared type of an attribute (e.g. whether it is declared as a NAME, ID, NUTOKEN, etc.), nor to access information contained in comments.

Balise is easy to use, especially for experienced C or C++ programmers, and provides a great deal of help in the processing of SGML files through a comprehensive set of predefined functions. Any functionality not provided by those functions can be incorporated by the use of user-defined functions, the shell spawning program and a special development toolkit which allows extension through user-provided C or C++ libraries. The Development licence includes a compiler, enabling the creation of distributable runtime applications.

As with all SGML transformation tools, the level of SGML conformance of these products is determined by the parser which they incorporate. In the case of Balise this is SGMLS. SGMLS supports only the Reference Delimiter Set of short reference delimiters and does not support LINK (in fact, none of the commercial transformation tools support LINK).


Product:
OmniMark
Associated Products:
SGML Kernel
Developer:
Exoterica Corp, Inc (Ontario, Canada)
UK Supplier(s):
Exoterica Corp, European Operations (Paris, France)
Price:
PC: 2,495 ECU for 1st to 5th copy, 495 ECU further copies
Workstation: 6,495 ECU for first copy, 2,495 ECU for further copies
Platforms:
MS/DOS, Mac System 7, and most Unix implementations
Description:
OmniMark provides a programming language for the manipulation and conversion of structured text. It is tightly integrated with Exoterica's SGML parser which provides uniquely powerful fault tolerance and error recovery facilities. OmniMark provides an expressive programming language that can be applied to a wide variety of data-manipulation tasks. It has many features targeted at the kind of complex problems encountered in text document analysis and translation.
Assessment:
OmniMark is a high level SGML application programming environment comprising an SGML parser linked to an interpreted programming language. It supports four kinds of document conversion (where `X' is an arbitrary document format which may be SGML):

OmniMark offers an explicit pattern matching mechanism that supports event-driven processing, based on lexical events. It provides a tight coupling with an embedded SGML parser so that pattern matching can be made dependent on the SGML context. It has its own rule-based programming language that is `English-like'. Although the language is clearly aimed at the non-programmer, people who use it will still need to know the basic concepts of writing a program, such as loops and macros.

OmniMark has three data types (which can be declared as global or local to an element):

All OmniMark objects are associative arrays called shelves, which may be variable or fixed in size.

The control structures that the OmniMark language contains are:

do when ... {else when ...} {else ...} done
do ... done
repeat ... again
repeat over (shelf) ... again

There is no ‘for loop’ or ‘case’ structure, although these can be created from the other structures. In addition there are control structures for text scanning (do scan/repeat scan) and text skipping (do skip).

OmniMark provides a general-purpose macro capability that allows a user-defined name to abbreviate a more complicated expression. Macros can also be assigned to delimiter characters so that a special character substitutes for a longer expression. Macros can be parameterized so that a repeating but variable pattern can also be shortened.

An OmniMark program consists of a set of rules and a set of actions associated with each rule. In up-translations and cross-translations the basic rule is the FIND rule, which describes a pattern of interest in a document and the actions to take whenever it is encountered. In down-translations the basic rule is the ELEMENT rule, which describes the actions to be performed whenever an element of a particular type is encountered in an SGML document. Both types of rule may be qualified by the SGML context, e.g. position in the tree hierarchy.
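
The idea of a pattern rule with attached actions can be illustrated in Python. This is not OmniMark syntax, and it shows only the FIND-style half of the picture (up-translation from untagged text). The input conventions (`# ` for a title, `* ` for a list item), the element names and the function `up_translate` are hypothetical, and the sketch does not escape markup characters in the data.

```python
import re

# Illustrative Python, not OmniMark syntax. Each rule pairs a pattern with an
# action, loosely mirroring a FIND rule; the first matching rule wins.
RULES = [
    (re.compile(r"^# (.+)$"), lambda m: f"<TITLE>{m.group(1)}</TITLE>"),
    (re.compile(r"^\* (.+)$"), lambda m: f"<ITEM>{m.group(1)}</ITEM>"),
    (re.compile(r"^(.+)$"), lambda m: f"<PARA>{m.group(1)}</PARA>"),
]

def up_translate(text):
    """Apply the rules to each input line; lines matching no rule are dropped."""
    out = []
    for line in text.splitlines():
        for pattern, action in RULES:
            match = pattern.match(line)
            if match:
                out.append(action(match))
                break
    return "\n".join(out)

tagged = up_translate("# Tools\n* Balise\n\nplain text")
```

Rule order matters in this sketch: the catch-all paragraph rule is listed last so that the more specific patterns are tried first.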

OmniMark can work on any SGML document instance and its associated DTD. It can access comments and processing instructions contained within the SGML document instance and act upon those instructions. OmniMark keeps track of ancestors and attribute values for those ancestor nodes enabling context sensitive decisions to be made within the program. It also keeps track of entity values (both user defined and ISO Standard) and can act upon the values of those entities.

The documentation that comes with OmniMark is both comprehensive and useful, but in some places it could do with more examples and a more complete explanation of those examples.

Overall, OmniMark V2.4 is a very powerful package that handles all types of translation and is straightforward to use. The main features of the product are:

Whether this last feature is a benefit or a drawback is a moot point. The idea is presumably that the language be as easy as possible for a non-programmer to learn. However, in order to write OmniMark programs the user must still be able to think as a programmer, in planning program flow, making use of the inbuilt functions and data types, etc. For experienced programmers (especially those used to C or C++) the terminology and syntax of the language can be confusing to learn and a hindrance to program development rather than an advantage. The lack of user-definable functions and the restricted range of data types can also be frustrating at times, though there are usually ways to work around such restrictions using the existing data types and macros.


Product:
SGML Hammer
Associated Products:
FastTAG
Developer:
Avalanche Development Company, Boulder, CO, USA
UK Supplier(s):
Interleaf UK Ltd
Price:
$1,950 (PC versions), $2,730 (others) plus 20% per annum maintenance
Platforms:
PC/DOS, PC/Windows 3.1, Sun/SPARC & Solaris, popular Unix
Description:
SGML Hammer is a tool to convert SGML-tagged files and documents into other applications, including other SGML applications. It can be used to convert into word-processor formats, into hypertext applications, and into publishing systems.

SGML Hammer consists of a parser coupled to a LOUISE application language processor (the same language used by FastTAG). SGML Hammer applications fall into three general categories:

Assessment:
SGML Hammer is a tool for processing SGML documents. It consists of an SGML parser linked to an interpreted block-structured programming language (LOUISE). The parser generates ESIS "events" (such as the start or end of an element, an entity reference, etc.) to which actions can be associated in LOUISE.

The parser is based on SoftQuad's parser, and it supports OMITTAG and SHORTTAG but not SHORTREF or LINK. The LOUISE language is the same as is found in the FastTAG product. LOUISE is based upon procedures, or `action blocks', rather than functions, with all variables being global. User-defined procedures can be created, which are simply named code segments without arguments. The only data types are strings (which may be contextually understood as numbers), and dynamic indexed arrays of strings. The control structures available are: if ... else; while; for; foreach.

Information exchange between the parser and LOUISE is at an extended ESIS level. As well as ESIS information, it is known whether or not an element was declared as EMPTY, and the names of SDATA and CDATA entities. SGML Hammer maintains element type information for all ancestor nodes, but attribute information only for the current element. Information on sibling elements is not maintained, though one useful feature of SGML Hammer is its ability to look ahead to test the name of the next element.

SGML Hammer is only able to have one input file and one output file open at any one time, which can be a big disadvantage in certain situations. For example, for subsequent loading into a database it may be desired to despatch the content of different SGML elements into different files. Whilst this is a simple matter with Balise or OmniMark, it would require multiple passes of SGML Hammer. It can also be frustrating not being able to read from external control files or to initiate external processes on the fly during processing. This means, for example, that it would not be possible to use SGML Hammer to create a FOSI-based document processor or any other document processor which requires the reading of a specification from an external file at the same time as the document is being processed.
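
The single-pass dispatch described above, which needs several outputs open at once, can be sketched in Python. This is an illustration only and is not written in any of the three products' languages. It assumes simplified sgmls-style ESIS lines, and the function `dispatch`, its `open_output` parameter and the sample data are hypothetical. In-memory streams stand in for the output files.

```python
import io

# Simplified sgmls-style ESIS: "(GI" start, ")GI" end, "-text" data.
ESIS = """(DB
(REC
(NAME
-Balise
)NAME
(PRICE
-1900
)PRICE
)REC
(REC
(NAME
-OmniMark
)NAME
(PRICE
-2495
)PRICE
)REC
)DB"""

def dispatch(esis, targets, open_output):
    """Route the data of each element named in targets to its own stream, in one pass."""
    # All outputs are open simultaneously; this is the capability at issue.
    outputs = {gi: open_output(gi) for gi in targets}
    stack = []
    for line in esis.splitlines():
        kind, arg = line[0], line[1:]
        if kind == "(":
            stack.append(arg)
        elif kind == ")":
            stack.pop()
        elif kind == "-" and stack and stack[-1] in outputs:
            outputs[stack[-1]].write(arg + "\n")
    return outputs

outputs = dispatch(ESIS, ["NAME", "PRICE"], lambda gi: io.StringIO())
names = outputs["NAME"].getvalue()
prices = outputs["PRICE"].getvalue()
```

With only one output available at a time, the same result would need one pass over the document per target element.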

Resolution of cross-references can also be difficult with SGML Hammer, as the extended string data type has a maximum length of 2K, so it cannot be used for buffering large amounts of data.

Optional output procedure libraries are also available as costed extras. These contain procedures for creating output files in (currently) FrameMaker MIF, Interleaf ASCII, Microsoft RTF and WordPerfect 5.1 formats.

Generally, SGML Hammer is adequate for performing fairly simple SGML transformations. For this type of task it is possible to quickly create effective programs with a minimal learning curve. However, whilst it is possible to use SGML Hammer for more complex tasks, the limited functionality of the LOUISE language can make such procedures laborious and difficult to plan, requiring an experienced programmer to achieve the most from the product. Anything which can be written in SGML Hammer can generally be written more concisely and efficiently in either Balise or OmniMark. But it is not possible to rewrite all Balise or OmniMark programs using SGML Hammer, as much of the functionality of these other languages is not available within LOUISE, such as user-defined functions and complex data types.

The user documentation is clear and concise, allowing novice users to quickly create working transformation programs. More complex transformations require an experienced programmer in order to make the optimal use of the restricted programming environment.


Product:
ICA (Integrated Chameleon Architecture)
Associated Products:
-
Developer:
Ohio State University
UK Supplier(s):
available from `ftp.ex.ac.uk'
Price:
Public Domain
Platforms:
Unix
Description:
ICA was developed by the Chameleon Research Project at the Ohio State University. ICA is an architecture for developing translators from one electronic form to another. To be able to generalise the translation process, rather than develop a matrix of translators from each possible format into every other possible format, the Project chose to use an `intermediate' form. This intermediate form uses the tagging structure defined by SGML. Hence any translator into or from SGML contains only a single stage, whereas general translations would involve two stages: one into SGML and one from SGML into the target format.
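
The saving from an intermediate form can be made concrete with a little arithmetic and a toy two-stage translation, sketched here in Python. This is not ICA code. The counting assumes one translator per ordered pair of formats in the direct case and one translator in each direction per format in the intermediate case. The two toy formats, the element names and all function names are hypothetical.

```python
import re

def direct_count(n):
    """Translators needed when each of n formats converts straight to every other."""
    return n * (n - 1)

def hub_count(n):
    """Translators needed with one shared intermediate form: one in and one out per format."""
    return 2 * n

# Two toy formats and a toy SGML-style intermediate form.
def equals_to_sgml(text):
    """Stage one: 'key=value' lines into tagged items."""
    pairs = (line.split("=", 1) for line in text.splitlines())
    return "".join(f"<ITEM><KEY>{k}</KEY><VAL>{v}</VAL></ITEM>" for k, v in pairs)

def sgml_to_colon(sgml):
    """Stage two: tagged items into 'key: value' lines."""
    items = re.findall(r"<ITEM><KEY>(.*?)</KEY><VAL>(.*?)</VAL></ITEM>", sgml)
    return "\n".join(f"{k}: {v}" for k, v in items)

def translate(text):
    # A general translation is two stages: source -> SGML -> target.
    return sgml_to_colon(equals_to_sgml(text))

result = translate("tool=ICA\nhost=Unix")
ten_formats = (direct_count(10), hub_count(10))
```

Under these assumptions ten formats would need 90 direct translators but only 20 through the intermediate form; the two approaches break even at three formats.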


Product:
CoST (Copenhagen SGML Tool)
Associated Products:
requires sgmls
Developer:
Euromath Center, University of Copenhagen
UK Supplier(s):
available from `ftp.ex.ac.uk'
Price:
Public Domain
Platforms:
Description:
CoST is a public domain tool that allows a user to write SGML processing programs in an SGML-aware programming environment. CoST is built on top of `sgmls'.