[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/reuse.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

Reusability in SGML with Focus on Software Engineering

Author

Stéphane Bidoul

Keywords

SGML
LINK
ESIS
Application
Process
Event-Driven Programming
Object-Orientation
OO
Reuse
Reusability
Scripting Languages
Python

Abstract

There are many opportunities for reuse in SGML. Among the reasons considered are:

SGML is an ISO standard;
the information is structured, so it may be reused for output on a variety of media;
document type definitions (DTDs) are modular, allowing parts of them to be reused in DTDs for other applications.

Whilst all the above are considered, this paper concentrates on the area least discussed to date, that of reuse of code.

One recently introduced technique enabling SGML application code reuse is the concept of SGML architecture. This technique is discussed, together with a second technique for writing reusable code by binding it to the DTD through the LINK feature.

Biographical Note

Stéphane Bidoul is a project manager at ACSE sa/nv, Brussels, a Member of the SGML Technologies Group. He is a software engineer and systems architect specializing in object-oriented distributed applications and complex automated document workflow systems; all these systems use SGML either as a document storage and exchange medium or as a formal message specification tool for communications between distributed application processes. A graduate in Electromechanical Engineering of the Free University of Brussels, he may be contacted at sbi@acse.be.

Reuse of a Standard Method

SGML is considered here both as an international standard (ISO 8879:1986) and as a method of structuring information. It has become widely used internationally, and was ISO's fastest selling standard when first published in 1986. People share the same DTDs; for example, international organizations for which sharing is important prepare their information in accordance with the same structure.

Consider the telecommunications authorities, where information on telecommunications hardware and message handling facilities has to be available to authorities in the countries to which messages are sent, by voice or other medium. The overall authority can mandate that information shall be provided in SGML (an international standard) in accordance with an agreed DTD; it could not have demanded that its members from different countries use specific software that may be popular today but gone tomorrow.

Equally, SGML can be mandated by an oil company for documenting all the equipment that goes to make up an off-shore oil production platform or rig. This equipment includes engines, which have to be documented, as well as all the drilling equipment and the rig structure itself. The information must be usable on-site, be in a common format, and be easily updatable to comply with safety regulations. SGML, as an appropriate international standard, may be mandated for compliance by all contractors and subcontractors when submitting information.

SGML lends itself naturally to reuse at all levels. Indeed, that is what it was designed for in the first place. In the next sections, we shall discuss reusability of SGML structured information, and reusability of SGML DTDs or parts of DTDs. We shall then describe techniques that one can use to improve reusability of the application software written to process SGML data.

Structured Information

With SGML, information is structured, and it is essentially because it is structured that it may be reused in a variety of ways and output on a variety of media. A document may be published on paper in whatever format the publisher chooses, and republished later, perhaps by another publisher and in another format (SGML does not address format, merely the neutral representation of the content). The same document could be published in large print for those who are partially sighted. It may also be published on CD-ROM, on the Internet, or on any other electronic medium, and transmitted anywhere.

So the same information that goes to make up a document can be reused or republished on different media in different formats.

Recently, the adoption of the DSSSL standard (ISO/IEC 10179:1996) extended the reusability further, allowing a style specification associated with a DTD to be defined, in a system-independent manner. This means that one can describe how instances of a particular DTD must be output, independently of the particular tool used for final rendering. For example, to some extent the same style specification could be used for both paper rendering using a typesetter and online display in a browser.

Document Type Definition

There is widespread reuse of DTDs themselves, parts of DTDs, and element names by a diversity of applications, not just the one for which they were originally designed.

A DTD for technical documents in the aerospace industry, for example, may be used as a whole or in part by other industries having a similar complexity of technical documents. When DTDs are constructed in a modular way, with a well-reasoned structure for the content of paragraphs, say, the same definition can be reused with ease for paragraphs of the same construction. This principle applies equally to other modules of DTDs, the well-written DTD being able to be reused in different ways. There can also be reuse of well-chosen element names for similar applications.

Document declarations can be common to many applications, being reused many times, to an even greater extent than DTDs and their component parts.

SGML Processing Applications

Most SGML applications rely on the generic identifiers to decide which processing to trigger when parsing SGML data. By this very fact they are tied to a particular DTD. Often, however, these applications implement tasks that are more general. Unfortunately, they cannot be reused because of their dependency on element and attribute names. Such applications are said to be markup-dependent.
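To make this dependency concrete, here is a minimal Python sketch of a markup-dependent routine. The event representation (a list of tuples), the element name XREF, the attribute name REFID and the function name are all assumptions made for illustration; they do not come from any particular parser or DTD.

```python
# Hypothetical ESIS-like events: (kind, generic identifier or data, attributes).
EVENTS = [
    ("start", "XREF", {"REFID": "sec1"}),
    ("data", "see section 1", None),
    ("end", "XREF", None),
]

def collect_targets(events):
    """Collect cross-reference targets from a stream of events."""
    targets = []
    for kind, value, attrs in events:
        # The names XREF and REFID are wired into the code, so the routine
        # only works for DTDs that use exactly these names.
        if kind == "start" and value == "XREF":
            targets.append(attrs["REFID"])
    return targets

# The same information marked up with other names is silently ignored.
OTHER_EVENTS = [
    ("start", "MYREF", {"TARGET": "sec1"}),
    ("end", "MYREF", None),
]

print(collect_targets(EVENTS))        # ['sec1']
print(collect_targets(OTHER_EVENTS))  # []
```

The task itself (collecting the targets of cross-references) is general, yet the routine has to be edited before it can serve a second DTD.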

In the previous section we discussed the reuse of DTDs and/or DTD fragments. Recently, however, it has been recognized that it is not always practical to reuse the same element and attribute names to express the same semantic constructs in various DTDs. In other words, it is sometimes desirable to use different syntaxes to express the same semantics.

A mechanism must therefore be introduced to specify semantics independently from element and attribute names, allowing easier DTD and code reuse.

The HyTime standard (ISO/IEC 10744:1992) was the first to introduce this concept, under the name of architectural form. An architecture or architectural form is a meta-DTD, or a fragment of a meta-DTD. Semantics are associated with an architecture in the same way that semantics are associated with element types and attribute lists in a normal DTD. The benefit of the architecture concept is that many different DTDs can be mapped to a given architecture using a simple renaming mechanism, provided they have a compatible structure. Thus, using architectures, one can apply standard semantics to existing DTDs, with little or no change to those DTDs.

In this section, we briefly present the mechanism of architectural forms, and we explain how this technique dramatically improves the reusability of SGML processing applications. We also describe a technique based on the SGML LINK feature which provides a markup-independent mechanism to bind application code to a DTD.

Introduction to SGML Architectures

In its simplest form, an architectural form is an element type declaration, with or without an attribute list. This declaration has well-known semantics. The SGML architecture concept relies on a general name mapping mechanism, which allows element type declarations from existing DTDs to be related to the declarations in the architecture.

Let us explain this mechanism with a simple example from HyTime. Here is the clink architectural form:

<!-- Contextual Link -->
<!element clink    -- Contextual link --
                   - O      (%HyBrid;)* >
<!attlist clink    HyTime   NAME     clink
                   id       ID       #IMPLIED  -- Default: none --
                   linkend  -- Link end --
                            -- Constraint: No HyTime reftype constraints,
                               but application designers can constrain
                               element types with reftype attribute --
                   IDREF    #REQUIRED
>

This is a normal element type declaration with an attlist and some comments describing constraints and semantics.

Now suppose you already have a DTD named mydtd, say, with some kind of local cross-referencing mechanism, in the form of an element named myref with a target attribute indicating the referenced element:

<!DOCTYPE mydtd [
<!-- ... -->
<!ELEMENT myref - O (#PCDATA)>
<!ATTLIST myref
          id      ID    #IMPLIED
          target  IDREF #REQUIRED
>
]>

To make it work as a HyTime clink, you must slightly modify your DTD, to establish the name mapping between your DTD and the HyTime architecture:

<!DOCTYPE mydtd [
<!-- ... -->
<!ELEMENT myref - O (#PCDATA)>
<!ATTLIST myref
          id      ID    #IMPLIED
          target  IDREF #REQUIRED
          -- HyTime AFO mapping --
          HyTime  NAME  #FIXED clink
          HyNames NAMES #FIXED "linkend target"
>
]>

The HyTime attribute specifies the mapping for the generic identifier. In this case, it means the myref element is to be treated as a clink HyTime element. The HyNames attribute specifies the name mapping between the attributes in the source DTD and the HyTime specification. In this case, it means the target attribute in mydtd is playing the role of the linkend HyTime attribute.
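The effect of the two attributes can be sketched in a few lines of Python. This is only an illustration of the renaming, not the behaviour of a real HyTime engine. It assumes that the parser reports an element's attributes as a dictionary with names folded to upper case, and that the HyNames value lists pairs with the architectural name first and the local name second, as in the example above. The function names parse_names and to_architectural are hypothetical.

```python
def parse_names(value):
    """Turn a HyNames-style value into {local name: architectural name}."""
    tokens = value.split()
    if len(tokens) % 2:
        raise ValueError("expected pairs of names")
    # Tokens come in pairs: architectural name first, local name second.
    return {local: arch for arch, local in zip(tokens[::2], tokens[1::2])}

def to_architectural(attrs):
    """Return (architectural form, renamed attributes), or None if unmapped."""
    form = attrs.get("HYTIME")
    if form is None:
        return None
    renames = parse_names(attrs.get("HYNAMES", ""))
    mapped = {}
    for name, value in attrs.items():
        if name in ("HYTIME", "HYNAMES"):
            continue  # the mapping attributes themselves are not passed on
        mapped[renames.get(name, name)] = value
    return form, mapped

# Attributes of one myref start tag, including the two #FIXED attributes.
myref_attrs = {
    "ID": "r1",
    "TARGET": "sec1",
    "HYTIME": "CLINK",
    "HYNAMES": "LINKEND TARGET",
}
print(to_architectural(myref_attrs))
# ('CLINK', {'ID': 'r1', 'LINKEND': 'sec1'})
```

An element that carries no HyTime attribute is reported as None, meaning that it plays no architectural role in this simplified model.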

Alternatively, you can avoid modifying your DTD by adding a Link Process Definition (LPD), which is a somewhat cleaner process whereby you avoid cluttering your DTD with AFO references:

<!LINKTYPE mylink mydtd #IMPLIED [
<!ATTLIST myref
          HyTime  NAME  #FIXED clink
          HyNames NAMES #FIXED "linkend target"
>

<!LINK #INITIAL myref>
]>

This IMPLICIT link declaration defines two fixed LINK attributes. If the mylink LINKTYPE is activated, the application will have access to the two link attributes HyTime and HyNames playing the same role as if they had been declared in the DTD.

SGML Architecture and Code Reuse

Associated with the concept of architecture is the notion of an architecture engine, a software component working closely with the SGML parser:

(1) it validates the parsed instance relative to the linked architectural forms by interpreting the architecture renaming attributes (e.g. HyTime, HyNames);
(2) it produces its output relative to the architecture meta-DTD.

While (1) is most important for formal specification of semantics, (2) is equally important from the point of view of code reuse. Indeed, if you process the output of the architecture engine, your application becomes markup-independent, as element and attribute names are reported to your application as if they came from an instance of the meta-DTD.

If an application is written to an architecture, the programmer can concentrate on the semantics of the parsed data, being confident that his program will be reusable for compatible DTDs expressing the same semantics.
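Task (2) can be pictured as a filter placed between the parser and the application. The Python sketch below is a deliberately reduced model and not a real architecture engine: it performs no validation (task (1)), it assumes events arrive as (kind, value, attributes) tuples, and it takes the renaming table as a ready-made dictionary. The function name architectural_events and the policy of dropping the tags of unmapped elements while keeping their data are assumptions made for illustration.

```python
def architectural_events(events, mapping):
    """Yield events renamed to the meta-DTD.

    mapping: {local GI: (architectural GI, {local attr: architectural attr})}
    """
    open_forms = []  # architectural name (or None) of each open element
    for kind, value, attrs in events:
        if kind == "start":
            form, renames = mapping.get(value, (None, {}))
            open_forms.append(form)
            if form is not None:
                renamed = {renames.get(k, k): v for k, v in attrs.items()}
                yield ("start", form, renamed)
        elif kind == "end":
            form = open_forms.pop()
            if form is not None:
                yield ("end", form, None)
        else:
            yield (kind, value, attrs)  # data is passed through unchanged

MAPPING = {"MYREF": ("CLINK", {"TARGET": "LINKEND"})}

SOURCE = [
    ("start", "MYDTD", {}),
    ("start", "MYREF", {"TARGET": "sec1"}),
    ("data", "see section 1", None),
    ("end", "MYREF", None),
    ("end", "MYDTD", None),
]

for event in architectural_events(SOURCE, MAPPING):
    print(event)
```

An application that consumes the output of such a filter sees only CLINK and LINKEND, whatever names the source DTD uses.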

The LINK Feature and Code Reuse

Working with architectures sometimes implies an overhead that small applications cannot afford. Although it is sometimes necessary to define semantic concepts formally and independently of any specific DTD, as in HyTime, the process of defining an architecture may be overkill for many simple to medium-scale applications.

The following describes a simple technique which allows you to write markup-independent application code using an IMPLICIT Link Process Definition, effectively retaining the reusability potential of architectural form processing.

In [sbi96], an object-oriented mechanism has been proposed to pass ESIS events to the application in a context-sensitive manner.

In a nutshell, we can summarize the principle by saying that an application object can be associated with an SGML element. Such application objects expose an interface through which they receive events such as onStart(), onData(), onEnd(), and onPI(). Using LINK rules, we can specify the type of objects to associate with the different element types, depending on context.

The following figure shows SGML application objects associated with SGML elements. The application objects receive events from the corresponding SGML elements.
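The principle can be sketched in Python as follows. This is not the implementation described in [sbi96]; it is a minimal model which assumes that events arrive as (kind, value, attributes) tuples and that a caller-supplied function, here called make_handler, decides which application object (if any) to associate with each element. The Recorder class and the dispatch function are hypothetical names, and onPI() is omitted for brevity.

```python
class Recorder:
    """Hypothetical application object that records the events it receives."""
    def __init__(self, label, log):
        self.label = label
        self.log = log
    def onStart(self):
        self.log.append((self.label, "start"))
    def onData(self, data):
        self.log.append((self.label, "data", data))
    def onEnd(self):
        self.log.append((self.label, "end"))

def dispatch(events, make_handler):
    """Route each event to the object associated with the current element."""
    stack = []  # one entry per open element; None if no object is attached
    for kind, value, attrs in events:
        if kind == "start":
            handler = make_handler(value, attrs)
            stack.append(handler)
            if handler is not None:
                handler.onStart()
        elif kind == "data":
            # Data goes to the object of the innermost open element.
            if stack and stack[-1] is not None:
                stack[-1].onData(value)
        elif kind == "end":
            handler = stack.pop()
            if handler is not None:
                handler.onEnd()

log = []
events = [
    ("start", "DOC", {}),
    ("start", "MYREF", {"TARGET": "sec1"}),
    ("data", "see section 1", None),
    ("end", "MYREF", None),
    ("end", "DOC", None),
]
# Attach an object to MYREF elements only.
dispatch(events, lambda gi, attrs: Recorder(gi, log) if gi == "MYREF" else None)
print(log)
```

Because each element gets its own object, a handler can keep per-element state between its start and end events without any global bookkeeping.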

In [sbi96], we proposed the use of the generic identifier to choose which class to instantiate. The same mechanism can be used in a fully markup-independent manner by exploiting a reserved attribute instead of the generic identifier to decide which class to instantiate.

We shall not expand here the full details of the implementation of such a mechanism. We shall, however, describe a simple example showing how the various components are tied together.

Let us say we have to write an application in which we must handle clink elements and carry out some very complex but nevertheless useful work with them. In this example, let us say the application is written in the Python language [lutz96].

We write a class to handle events for the clink element: the TCLinkHandler class. Then we tie it to the DTD, using an LPD.

Here is the LPD we write to bind the application code to the DTD:

<!LINKTYPE mylink mydtd #IMPLIED [
<!ATTLIST clink
          py CDATA #IMPLIED  -- PY is the reserved attribute
                                used to map application objects
                                to SGML element
                             --
>

<!LINK #INITIAL
-- this rule maps clink elements to instances of the
   TCLinkHandler class, passing the LINKEND attribute
   to the constructor of the class. --

clink [ py="TCLinkHandler(a['LINKEND'])" ]
>
]>

Now let us look at the TCLinkHandler class:

class TCLinkHandler:
    """Handles the events of one element that acts as a contextual link."""
    def __init__(self, linkend):
        # some initialization code
        self.linkend = linkend
        # . . .
    def onStart(self):
        # some work on start tag
        # . . .
        pass
    def onData(self, data):
        # some work on data
        # . . .
        pass
    def onEnd(self):
        # some work on end tag
        # . . .
        pass

This class has a constructor which receives the value of the link end, and methods to deal with start tag, data, and end tag events for the corresponding clink SGML element.

From the point of view of reusability, two things must be noted:

the class makes no reference to the clink generic identifier;
the class makes no reference to the linkend attribute name, but receives its value as a formal parameter of the constructor.

Now let us examine how we can reuse this code with another DTD. Here is mydtd again:

<!DOCTYPE mydtd [
<!-- ... -->
<!ELEMENT myref - O (#PCDATA)>
<!ATTLIST myref
          id      ID    #IMPLIED
          target  IDREF #REQUIRED
>
]>

Let us say we want to perform the same processing on myref elements. Since the TCLinkHandler class is markup-independent, it is very easy to reuse it for the myref element, which is architecturally compatible with clink:

<!LINKTYPE mylink mydtd #IMPLIED [
<!ATTLIST myref
          py CDATA #IMPLIED  -- PY is the reserved attribute
                                used to map application objects
                                to SGML element
                             --
>
<!LINK #INITIAL
-- this rule maps myref elements to instances of the
   TCLinkHandler class, passing the TARGET attribute
   to the constructor of the class. --
myref [ py="TCLinkHandler(a['TARGET'])" ]
>
]>

Note that the value of the TARGET attribute is given as the formal parameter linkend of the TCLinkHandler constructor. This effectively implements attribute renaming.
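One plausible way for the application framework to honour the py link attribute is to evaluate its value as a Python expression, with the name a bound to the attributes of the current element. The sketch below rests on that assumption; the paper does not detail the actual mechanism. The function name instantiate is hypothetical, link attribute names are assumed to be reported in upper case, and the handler class is reduced to its constructor.

```python
class TCLinkHandler:
    """Reduced version of the handler: only the constructor is shown."""
    def __init__(self, linkend):
        self.linkend = linkend

def instantiate(link_attrs, element_attrs):
    """Build the application object named by the reserved PY link attribute."""
    expression = link_attrs.get("PY")
    if expression is None:
        return None  # no application object for this element
    # eval() executes arbitrary code, so this is only acceptable when the
    # LPD comes from a trusted source (an assumption of this sketch).
    namespace = {"TCLinkHandler": TCLinkHandler, "a": element_attrs}
    return eval(expression, namespace)

# Link rule for clink elements.
h1 = instantiate({"PY": "TCLinkHandler(a['LINKEND'])"},
                 {"LINKEND": "sec1"})
# Link rule for myref elements: only the expression in the LPD differs.
h2 = instantiate({"PY": "TCLinkHandler(a['TARGET'])"},
                 {"ID": "r1", "TARGET": "sec1"})

print(h1.linkend, h2.linkend)  # sec1 sec1
```

The class is identical in both cases; all the DTD-specific knowledge lives in the expression held by the link attribute.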

Other Aspects of Code Reusability

In addition to these SGML aspects of code reusability, traditional software engineering techniques of modularity should be kept in mind while designing SGML systems. Designing markup-independent applications is nevertheless one step in the right direction.

Summary

It has been shown that SGML lends itself to reuse in many ways, and that this reuse can make the use of an SGML system highly efficient. Careful initial design, in which modularity is the key word, can bring about the reuse of many of the ingredients that go to make up an SGML application.

These include the reuse of the standard method, reuse of DTDs and parts thereof as well as element names, and reuse by the design of generic applications as opposed to specific ones.

SGML architectures have been shown to be an enabling technology for writing reusable applications. The reusability benefits of markup-independent applications, however, can be achieved with alternative techniques. Such a technique was presented, using the SGML LINK feature to bind an application language to a DTD.

Bibliography

Charles F Goldfarb; The SGML Handbook. Clarendon Press, 1990
Mark Lutz; Programming Python. O'Reilly & Associates, 1996
Stéphane Bidoul; 'Object Orientation and SGML: Link Revealed'. In Conference Proceedings of SGML'96, pp 485-98
ISO/IEC 10179:1996, Information Technology - Processing Languages - Document Style Semantics and Specification Language
ISO/IEC 10744:1992, Information Technology - Hypermedia/Time-based Structuring Language

Please email your comments to Stéphane Bidoul at sbi@acse.be.

This paper was first published in the International SGML Users' Group Newsletter, volume 3 Issue 3 (July 1997), pp 16-19.