[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/oosgml.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
Stéphane Bidoul
Several studies have tried to address the topic of object orientation around SGML.
The question asked was too simple and too dichotomous, and the answer given was a far too simple 'yes' or 'no'. The SGML application aspect, which is not covered by the standard, was not considered when searching for commonalities.
This paper intends to show that some application architectures coupled with an SGML parser offer an object mechanism with embedded SGML.
The relationship between the parsed tokens and the application methods shows that application objects are connected to parsing objects in a simple and efficient paradigm which fully conforms to the LINK feature of the SGML language.
Adopting this view of an SGML application makes all the facilities offered by the LINK feature suddenly self-evident and useful.
Stéphane Bidoul has been working for four years at ACSE as a developer and systems architect for object-oriented distributed applications and complex documentary workflow automation systems (automation of the editorial process for the European Community budget, automation of the legislative procedures for the Belgian French Community Parliament, etc).
All these applications have in common their use of SGML, either as a document storage and exchange medium, or as a formal message specification tool for communications between distributed application processes.
He obtained an electromechanical engineering degree from the Free University of Brussels in 1992, and may be contacted at sbi@acse.be.
This paper presents an event-driven object-oriented computing model that is capable of processing an SGML data stream in real-time.
The nature of the events processed is similar to those defined in ESIS. However, the application program that handles those events can be tied contextually to element instances. This relieves the programmer from managing context information himself.
The next section introduces the basic principles of this model.
The 'Implementation strategies' section discusses an implementation where an application language integrated with the parser is provided as a way to bridge the gap between the parser (where the processing is tied to the parsed DTD) and more generic data processing packages that are not tied to a specific representation of the data structure in a particular DTD.
The 'Applications' section gives a few practical examples.
One output of the parsing of an SGML instance can be seen as a succession of events that correspond to the parsed tokens. Such an event-oriented view of the parser output has been formalized in ESIS. Hooking application code onto these events gives a powerful technique for processing SGML documents.
However, ESIS gives no contextual information to the application. It leaves the whole burden of maintaining this important information (ie the parse tree) on the programmer's shoulders, to perform simple tasks such as distinguishing two instances of the same element in different contexts of the model.
The real power of the paradigm presented here resides in the fact that the events are automatically routed to application objects that correspond to the elements of the SGML instance. These objects have inherent information about the SGML context where they are created.
To program such a system, the developer writes a set of classes that all expose a common interface through which they receive notification of the events from the parser. The instances of such classes correspond on a one-to-one basis to the SGML element instances in the document.
The connection between the parser and the classes is made in a standard fashion through the three flavours of the LINK feature of SGML. Link rules are used to tell the parser which class to choose when it creates the objects associated with the SGML elements. This mechanism unleashes the full power of the LINK feature, by associating classes (with code and data) to elements in a particular context with #USELINK and #POSTLINK.
In the next subsections, we successively examine the following topics:
identifying the events;
defining the classes, the lifetime of objects, etc;
linking the classes to the DTD in a context-sensitive fashion by using LINK.
The computing model is event-driven, because the application code is triggered by events resulting from the parsing of the SGML instance. These events are generated by the parser and fed into the application code in real-time through a mechanism which we shall come back to later.
We can identify a simple (yet non-exhaustive) set of events that is of interest from an SGML programmer's point of view.
The list below includes a brief description of each event, as well as the basic information it carries.
onStart, triggered on element start (start-tag).
The basic information carried by this event is the element name and the list of attributes. It also receives the doctype to which the element belongs.
This event is fired even when the start-tag is implied by OMITTAG.
onEnd, triggered on element end (end-tag).
This event carries no additional information.
onData, triggered for all the data chunks that make up the data content of the element.
The basic information carried by this event is the character data.
This event is fired each time a data token is recognized by the parser (eg the data flow of the element is interrupted by a sub-element). Moreover, if the data content is large, it could be tokenized arbitrarily by the parser.
The point here is that the application which receives such events should be prepared to deal with the data content in multiple chunks. Only when receiving the onEnd event can the application be sure that all the data content of the element has been passed.
onEntity, triggered on general entities.
This event carries all the information that is necessary to identify the entity, its nature, and its content. We can identify its name, its type (CDATA, SDATA, NDATA, PI, SUBDOC, etc), and also the replacement text, possibly as a system identifier.
If there is a notation attached to the entity, the event also carries the notation identifier so that the application can process the data accordingly.
The bottom line is that the parser must pass enough information so that the entity manager can be fully or partly replaced by the application.
onPI, triggered on processing instructions.
The basic information carried by this event is the data content of the PI.
onRegExp, triggered on regular expression matches.
The regular expressions can take the form of Lex-like [Lesk 75] regular expressions. Those regular expressions can have embedded 'tagged parts' whose matched content will be mapped to variables at run-time.
This mechanism is useful to let the application have some lexical capabilities without modifying the DTD. It should be used when SHORTREF and DATATAG mechanisms either are not powerful enough, or when the programmer does not want to change the DTD or SGML declaration to process the document in a fashion not foreseen by the DTD designer.
This powerful mechanism can even be used to parse non-SGML documents (eg RTF and TEX), the SGML parser becoming a multi-purpose data processing tool. But that is another story....
The basic information carried by this event is the regular expression name, as well as the data that triggered the match.
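As a rough illustration of the 'tagged parts' idea, the sketch below uses Python's `re` module, whose named groups play the role of the tagged parts. The expression name `DATE`, its pattern, and the `fire_regexp_events` helper are assumptions made for this example only; they do not describe any existing SGML parser.

```python
import re

# Hypothetical table of named regular expressions; named groups stand in
# for the 'tagged parts' whose matched content is mapped to variables.
REGEXPS = {
    "DATE": re.compile(r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})"),
}

def fire_regexp_events(data, on_regexp):
    """Scan a data chunk and call on_regexp(name, matched_data, variables)
    for every match of every named expression."""
    for name, pattern in REGEXPS.items():
        for match in pattern.finditer(data):
            on_regexp(name, match.group(0), match.groupdict())

events = []
fire_regexp_events(
    "Delivered on 3/11/1996.",
    lambda name, data, variables: events.append((name, data, variables)),
)
```

The callback receives the expression name, the data that triggered the match, and the tagged parts as a dictionary, which is the information the onRegExp event is described as carrying.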
Although this simple set of events is enough for developing sophisticated SGML applications, it is possible to define other kinds of events. Some of the events presented are defined in ESIS; others are new.
Our goal here is not to define an exhaustive list of such events, but to highlight the basic principles of this event-driven SGML computing model.
Event-driven programming lends itself very well to object-oriented programming. The basic principle is to map events to methods of (or messages to) objects.
To achieve this, a common interface is defined for all classes that will handle the events. Objects (instances of such classes) are associated with each of the SGML elements.
The events we defined previously will guide us in defining the interface the objects will need to expose in order to interact with the parser.
Here is such an interface, defined in CORBA IDL.
    interface ISGMLEventSink {
        void onStart( in sgmlname_t LinkSetName,
                      in sgmlname_t DocTypeName,
                      in sgmlname_t ElementName,
                      in attlist_t SourceAttributes,
                      in attlist_t TargetAttributes );
        void onEnd( );
        void onData( in sgmlstring_t Data );
        // the inout parameter leaves room
        // for a replaceable entity manager.
        void onEntity( in sgmlname_t EntityName,
                       inout entity_t kind );
        void onPI( in sgmlstring_t PIData );
        // regular expression gives the matched data,
        // and the matched expression variables;
        // the matched data can be replaced by the application
        void onRegExp( in sgmlname_t RegExpName,
                       inout sgmlstring_t MatchedData,
                       in attlist_t MatchedVariables );
        // called when the active link set changes
        void onLinkChange( in sgmlname_t FromLinkSet,
                           in sgmlname_t ToLinkSet );
    };
The methods used in this interface correspond to the events described previously. The LinkSetName and TargetAttributes arguments will be explained in the next section ('LINKing the classes to the DTD').
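To make the interface more concrete, here is one possible rendering of it as a Python base class, together with a small sink that buffers its data content. This is only a sketch: the snake_case method names, the convention of returning a value for the IDL `inout` parameters, and the `TextCollector` class are assumptions of this example.

```python
class SGMLEventSink:
    """Python rendering of ISGMLEventSink; every method defaults to a no-op."""

    def on_start(self, link_set_name, doctype_name, element_name,
                 source_attributes, target_attributes):
        pass

    def on_end(self):
        pass

    def on_data(self, data):
        pass

    def on_entity(self, entity_name, kind):
        # IDL 'inout': return the (possibly replaced) entity description.
        return kind

    def on_pi(self, pi_data):
        pass

    def on_regexp(self, regexp_name, matched_data, matched_variables):
        # IDL 'inout': return the (possibly replaced) matched data.
        return matched_data

    def on_link_change(self, from_link_set, to_link_set):
        pass


class TextCollector(SGMLEventSink):
    """Example sink: data may arrive in several chunks, so it is buffered
    and only considered complete when on_end is received."""

    def __init__(self):
        self.chunks = []
        self.text = None

    def on_data(self, data):
        self.chunks.append(data)

    def on_end(self):
        self.text = "".join(self.chunks)


sink = TextCollector()
sink.on_start("#INITIAL", "GCAPAPER", "TITLE", {}, {})
sink.on_data("Object-oriented ")
sink.on_data("SGML")
sink.on_end()
```

Giving every method a default body lets a concrete class override only the events it cares about.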
What are the instances of such classes?
In a nutshell, let us say they correspond to SGML elements. Each time an SGML element is encountered, an instance of a class that exposes the ISGMLEventSink interface is created (see next subsection for details about the lifetime of such objects).
Once the object is created, it will receive all the events for the corresponding element. This means that its ISGMLEventSink methods will be called upon at the element start and end. The data content of the element will be dispatched to this object too (as well as entity replacements, etc).
If a sub-element is encountered, an instance of another or possibly the same class (see 'LINKing the classes to the DTD', below) will be created to handle the corresponding events.
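The routing just described can be sketched with a stack of objects that mirrors the stack of open elements. The `Dispatcher` and `Recorder` classes are hypothetical, and `on_start` is reduced to the element name to keep the example short.

```python
class Recorder:
    """Hypothetical sink that records the events routed to it."""
    log = []  # shared log, for demonstration only

    def on_start(self, element_name):
        self.name = element_name
        Recorder.log.append(("start", self.name))

    def on_data(self, data):
        Recorder.log.append(("data", self.name, data))

    def on_end(self):
        Recorder.log.append(("end", self.name))


class Dispatcher:
    """Keeps one sink object per open element and routes each event to
    the object of the innermost open element."""

    def __init__(self, choose_class):
        self.choose_class = choose_class  # element name -> class
        self.open_objects = []            # mirrors the open-element stack

    def start_element(self, name):
        sink = self.choose_class(name)()
        self.open_objects.append(sink)
        sink.on_start(name)

    def data(self, text):
        self.open_objects[-1].on_data(text)

    def end_element(self):
        # the object's lifetime ends with its element
        self.open_objects.pop().on_end()


d = Dispatcher(lambda name: Recorder)
d.start_element("PARA")
d.data("See ")
d.start_element("EMPH")
d.data("this")
d.end_element()
d.data(" example.")
d.end_element()
```

Here the data before and after the EMPH sub-element reaches the PARA object in two separate chunks, while the data inside EMPH goes to the EMPH object.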
Basically the instances of application classes have the same lifetime as the elements to which they are mapped.
This is obviously true when we consider the events that are passed from the parser to the objects: the first event occurs with the start of the element, and the last occurs with the end of the element.
In certain cases, however, the code that executes on the onEnd event may still be active after the element is terminated. This occurs when the application code lets the parsing continue while executing the onEnd() method.
One must be cautious of such issues when implementing the object manager part of the parser.
In object-oriented systems, one often confuses inheritance and object hierarchies (ie parent-child relationships).
In the system we describe here, the parser will establish a hierarchy of objects that correspond on a one-to-one basis to the parse tree. This is not inheritance. The only inheritance in effect here is the interface inheritance mechanism that makes all classes expose the same ISGMLEventSink interface.
What can be extremely useful is to have an implementation inheritance mechanism which eases the coding of the classes. However, this is purely a programming language issue, and the parser does not need to be aware of such relationships between classes.
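The distinction can be shown in a few lines. In this sketch (class names and the `path` helper are illustrative assumptions), `Section` and `Title` inherit only from the common interface class, while the parent-child relationship is held in the objects themselves.

```python
class EventSink:
    """Interface inheritance: every class exposes the same methods."""

    def __init__(self, parent=None):
        # object hierarchy: link to the enclosing element's object
        self.parent = parent

    def path(self):
        """Element context, read from the object hierarchy."""
        names = []
        node = self
        while node is not None:
            names.append(type(node).__name__)
            node = node.parent
        return "/".join(reversed(names))


class Section(EventSink):   # inherits from EventSink, not from any element class
    pass


class Title(EventSink):
    pass


section = Section()
title = Title(parent=section)
```

The `Title` object is a child of the `Section` object in the hierarchy, yet `Title` is not a subclass of `Section`.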
In this section we shall answer the question 'from which class should we create the objects?'. In other words, we shall define a standard-based mechanism that allows the programmer to link elements of the DTD(s) to the classes he (or some other programmer) has coded according to the ISGMLEventSink interface.
This powerful mechanism is based on the LINK feature of ISO 8879, and provides great flexibility in the level of detail the programmer has to provide to realize the mapping. This flexibility comes from the three kinds of links defined by the standard.
This flexibility ranges from very simple (where all elements are mapped to instances of the same class) to very elaborate (where we tend towards an ideal situation where the application can be written independently of the actual structure of the DTD that models the data being processed).
As defined in ISO 8879, the LINK feature is a mechanism which enables the dynamic association of additional information to elements at parse time.
This additional information is provided in the form of link attributes, which are new attributes (not defined in the DTD, and thus not present in the source document). These are dynamically added to the elements at parse time.
The feature is named LINK because it also provides a way of linking elements between a source and a result document type.
ISO 8879 defines three kinds of links. They differ in the amount of additional information one can provide in the process of linking source and result elements.
Explicit link provides a mechanism to map elements from the source document to elements in the result document.
Link attributes can be specified for the source element. Attributes that are declared in the result DOCTYPE can be set for the result element.
Implicit link is similar to explicit link, except that the result DOCTYPE cannot be specified. One cannot thus give result elements; implicit link boils down to a way of associating process-specific attributes with the element in the source document.
Simple link provides a way to associate one or more link attributes with #FIXED values to the base element of the document.
The association between a source element and a result element is specified in a link rule. Link rules are grouped in link sets. Link sets are grouped in link type declarations which are the formal definition part of link process definitions (LPDs).
Multiple link types can be active at the same moment. However, inside a link type there is only one link set active at a time, and thus there is only one active link rule per link type.
Note here that the simultaneous availability of multiple link types allows for simultaneous 'concurrent' processing on the same document instance.
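One way to picture these relationships is as a small data model in which every link type keeps track of its own current link set. The sketch below is an informal reading of the text above, not a transcription of ISO 8879; the link type names `TOHTML` and `TOINDEX` and the result elements are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LinkRule:
    source_element: str
    result_element: str = None       # only meaningful for explicit link
    link_attributes: dict = field(default_factory=dict)


@dataclass
class LinkType:
    name: str
    link_sets: dict                  # link set name -> {source element: LinkRule}
    current: str = "#INITIAL"        # exactly one link set is current

    def active_rule(self, element):
        return self.link_sets[self.current].get(element)


def active_rules(link_types, element):
    """One active rule (or None) per active link type."""
    return {lt.name: lt.active_rule(element) for lt in link_types}


to_html = LinkType("TOHTML", {"#INITIAL": {"TITLE": LinkRule("TITLE", "H1")}})
to_index = LinkType("TOINDEX", {"#INITIAL": {"TITLE": LinkRule("TITLE", "ENTRY")}})
rules = active_rules([to_html, to_index], "TITLE")
```

With two link types active, the same TITLE element yields two rules, one per process, which is what makes concurrent processing of one instance possible.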
For a summary of LINK, see also [Kimber 95].
It is important to note here that an LPD is merely a vehicle for specifying processing ... . It is up to the application to determine what the specification means. In particular, although the nominal result of processing is a document instance that conforms to the result DTD, this is merely a conceptual device to permit specification of the processing in terms of the result document instead of, or in addition to, the source document, if that should be desirable. The actual result could be anything at all. [Goldfarb 90].
The point here is that ISO 8879 sees LINK as the way of hooking application processing to an SGML parser.
In the next subsections we shall examine how we can use the three kinds of link to express the mapping between SGML elements and the classes defined above in a way that conforms to the standard.
We introduce here the notion of class set which is simply a group of named classes that have in common the fact that they expose the ISGMLEventSink interface.
We assume that the class set comes with a mechanism for creating an instance of a class, given its name.
Moreover, we assume there is a mapping (implicit or explicit) mechanism between linktypes and class sets. An implicit mapping (based on the name) can be sufficient in most cases, but an explicit mapping would be useful to re-use the same class set in different conditions (ie on different DTDs).
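A class set with create-by-name can be sketched as a simple registry. The `ClassSet` class, its `register` and `create` methods, and the link type name `GCA2HTML-DRAFT` are assumptions of this example.

```python
class ClassSet:
    """A named group of classes with a create-by-name mechanism."""

    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register(self, cls):
        self._classes[cls.__name__] = cls
        return cls                    # usable as a decorator

    def create(self, class_name):
        """Return a new instance, or None if the class set has no such class."""
        cls = self._classes.get(class_name)
        return cls() if cls is not None else None


gca2html = ClassSet("GCA2HTML")


@gca2html.register
class MAINTITLE:
    def on_end(self):
        pass


# Explicit mapping from link type names to class sets; an implicit
# mapping would simply look the class set up under the link type's name.
LINKTYPE_TO_CLASSSET = {"GCA2HTML": gca2html, "GCA2HTML-DRAFT": gca2html}
```

The explicit table lets two link types share one class set, which is the re-use case mentioned above.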
Since the only information that can be added by a simple LINK is a set of fixed attributes on the base element, we can only deduce an implicit mapping between all the elements of the DTD and one class. For the sake of simplicity, we assume this class has the same name as the class set to which the link type is mapped.
For instance, if we assume a class set named APPLICATION, and a link type named THELINK, both tied together, the elements of the document will all have an associated object of class APPLICATION.
We have a situation where each element of the application will trigger uniform processing. If the application wants to specialize processing based on context or element type, it can do it, but it must keep track of context or element type by itself, with no help from the parser.
With simple link, most of the dispatching work is done by the application, and the application is tied to the DTD for which it is initially written. In this configuration all the events are passed to the program which receives no context information from the parser. This is very similar to ESIS.
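The following sketch shows what this means in practice: a single class handles every element, and any context-sensitive behaviour depends on a stack that the application maintains by hand. The event tuples and the `run` driver are hypothetical stand-ins for the parser, and `on_start` is reduced to the element name.

```python
class APPLICATION:
    """The single class used for every element under simple link."""
    open_elements = []   # context the application must maintain itself
    seen = []

    def on_start(self, element_name):
        self.name = element_name
        APPLICATION.open_elements.append(element_name)
        # Specialised processing needs a hand-written context test.
        if element_name == "TITLE" and APPLICATION.open_elements[-2:-1] == ["AUTHOR"]:
            APPLICATION.seen.append("job title")
        elif element_name == "TITLE":
            APPLICATION.seen.append("heading")

    def on_end(self):
        APPLICATION.open_elements.pop()


def run(events):
    """Feed ('start', name) / ('end',) events; one APPLICATION object per element."""
    objects = []
    for event in events:
        if event[0] == "start":
            obj = APPLICATION()
            objects.append(obj)
            obj.on_start(event[1])
        else:
            objects.pop().on_end()


run([("start", "FRONT"), ("start", "TITLE"), ("end",),
     ("start", "AUTHOR"), ("start", "TITLE"), ("end",), ("end",), ("end",)])
```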
Implicit link allows for the specification of additional attributes for each element of the source DTD. The introduction of the source DTD suggests that we can have a processing that is specified on a doctype basis.
When a link rule is activated, the parser looks in the class set for a class that has the same name as the element. If one is found, an object is created from that class, otherwise nothing is done.
In this situation the programmer receives more information from the parser since a different class is activated for each element. The #USELINK and #POSTLINK facilities allow for activation of the classes on a contextual basis.
However, the application still remains tied to a particular DTD, since the classes are named according to the SGML element names. Moreover, if the same element appears in a different context, the contextual information can be passed to the application as the link set name, but it is up to the programmer to select which code to execute in each context.
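A minimal sketch of this behaviour, with a hypothetical `start_element` function standing in for the parser and `on_start` reduced to two arguments:

```python
class TITLE:
    """Class named after the element it handles (implicit link)."""
    output = []

    def on_start(self, link_set_name, element_name):
        # The parser supplies the context as the link set name, but
        # choosing the code path is still the programmer's job.
        if link_set_name == "AUTHORLK":
            TITLE.output.append("job title")
        else:
            TITLE.output.append("heading")


CLASS_SET = {"TITLE": TITLE}


def start_element(link_set_name, element_name):
    """Create an object if the class set has a class named like the
    element; otherwise do nothing."""
    cls = CLASS_SET.get(element_name)
    if cls is None:
        return None
    obj = cls()
    obj.on_start(link_set_name, element_name)
    return obj


start_element("#INITIAL", "TITLE")
start_element("AUTHORLK", "TITLE")
ignored = start_element("#INITIAL", "PARA")
```

The class name `TITLE` is what ties this code to the DTD, and the `if` on the link set name is the part that explicit link removes.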
With explicit link one can specify a source and a result document, and for each link rule a source and result element, both with additional attributes. The nominal output of the parser is a concurrent document structure, with elements sent from the link rules.
The trick here is to define a system result document. Let us name it PROCESS for the purpose of the discussion.
The result element in a link rule will then correspond to the class name of the application object to be created. The additional attributes for the result element will be passed to the application as the TargetAttributes parameters of the onStart() method.
This is the most elaborate mapping scheme. It allows for the design of classes that are used for different element types. But most importantly, the classes can be tailored for a particular context of parsing, since the parser will use them for link rules that are to be activated only in a particular context with #USELINK or #POSTLINK.
Link attributes and target attributes are also important features that let us create a simple mapping between the DTD and the application classes. We can pass a lot of interesting information, including the name of the DTD attributes, so that the application code can be as independent as possible of the original DTD.
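As an illustration of passing the name of a DTD attribute, consider a generic class that is told through a target attribute which source attribute to read. Everything named here (the `HYPERLINK` class, the `HREFATT` target attribute, the elements `XREF` and `POINTER`, the doctypes `DTDA` and `DTDB`) is invented for the example.

```python
class HYPERLINK:
    """Generic class: it learns from a target attribute which source
    attribute holds the link destination, so it names no DTD attribute."""

    def on_start(self, link_set_name, doctype_name, element_name,
                 source_attributes, target_attributes):
        attr_name = target_attributes["HREFATT"]
        self.href = source_attributes.get(attr_name)


# Outcome of two hypothetical link rules, one per source DTD:
#   XREF    -> HYPERLINK with HREFATT set to REFID
#   POINTER -> HYPERLINK with HREFATT set to TARGET
a = HYPERLINK()
a.on_start("#INITIAL", "DTDA", "XREF", {"REFID": "fig1"}, {"HREFATT": "REFID"})
b = HYPERLINK()
b.on_start("#INITIAL", "DTDB", "POINTER", {"TARGET": "sec2"}, {"HREFATT": "TARGET"})
```

The same class serves both DTDs; only the link rules differ.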
Following the paradigm expressed in this paper, it is possible to foresee many implementations.
We shall present here a three component structure which we believe to be capable of solving a large number of information management and transformation problems in an elegant manner.
The extended SGML parser program includes two components:
the SGML parser itself;
the application language module which handles classes, creates the objects according to the parsing, forwards parser events to the objects, destroys objects, etc.
The third component is built of external information handling packages, which can be called by the parser's application language module.
What we expect is for the user to develop simple SGML processing applications with the embedded language, while complex data processing applications will be developed in the form of specialized packages that are not tied to a specific DTD. In that case, the embedded language will be used as the glue that maps the structure of a particular DTD to the structure expected by the generic external package, reflecting the well-known fact that there are multiple ways to model a non-trivial data structure in SGML.
The ISGMLEventSink interface presented above has the potential to interface the parser with any programming language capable of generating objects that comply with this interface.
However, some additional features on top of the minimal support of that basic interface would be very desirable.
The language module should be available on a wide range of platforms, and follow the SGML philosophy of universal exchangeability.
The language should be very open to external packages. It should be able to access a wide range of libraries (Windows DLLs or OLE objects, CORBA objects, etc). Such openness is a must to let users transform their SGML data in any way they see fit (transforming it, putting it in databases, feeding it to an interactive subsystem, etc), unleashing the true promises of electronic data interchange.
For instance, one could imagine an Invoice Processing Package that is generic in scope and is used to input the incoming invoices into the invoice tracking system of a company. Invoices coming from various sources could all be fed to this package, whatever their format, provided a stub is written to map the input format to the application package.
Moreover, a language that would be embedded in the parser would gain additional benefits.
Easier access to the parser data structures (SGML declaration, detailed contextual information, etc) for more advanced applications.
The handling of the attributes could also become more straightforward, really mixing the application and SGML objects.
Indeed, DOCTYPEs and ELEMENT declarations are often considered as classes from which documents and elements in documents are instances. In particular, element attributes are declared in classes (ATTLIST declarations) and instantiated within the objects (the element instances). For a detailed account on the subject of classes and instances in SGML documents see also [DuCharme 95].
It is really interesting to mix the SGML element class, element instance paradigm with the application class, application object paradigm described in this paper.
From this viewpoint, we can consider that we have different kinds of classes:
the element class, with its attribute declaration in the DTD;
the link source element class, which inherits the attributes from the element class and adds the link attributes;
the link target element class, which inherits from the link source element class and adds the attributes of the target element;
on the other side, there is the application class, which is written by the programmer.
So, when a link rule is activated the parser can mix the link target element class and the application class and create an object that receives the attributes of the SGML part as normal attributes in the object-oriented application world!
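In a dynamic language this mixing can be approximated by copying the SGML attributes onto the freshly created object. The sketch below assumes that later layers override earlier ones (element, then link source, then link target); that precedence, the `create_object` helper, and the attribute names are assumptions of the example.

```python
class SECTITLE:
    """Application class written by the programmer."""

    def heading_tag(self):
        # LEVEL arrives as an ordinary object attribute
        return "h" + self.LEVEL


def create_object(app_class, element_attrs, link_attrs, target_attrs):
    """Create an application object and expose the SGML attributes as
    normal Python attributes. Later layers override earlier ones
    (an assumption of this sketch)."""
    obj = app_class()
    for layer in (element_attrs, link_attrs, target_attrs):
        for name, value in layer.items():
            setattr(obj, name, value)
    return obj


obj = create_object(SECTITLE, {"ID": "s1"}, {"USAGE": "draft"}, {"LEVEL": "2"})
```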
The first application ('Changing the delimiters') shows an example of uniform processing of all the elements of the application, using a SIMPLE link.
The second example ('Formatting') shows how #USELINK can be exploited to trigger different processing of the same element, the decision being taken by the parser, not the programmer.
Both examples are written with SGML-conformant LINK specifications, and the classes are written with a C++ like syntax. They are only meant as demonstrations of possible applications of the paradigm presented here, and not as complete working applications!
In this example we show how to use SIMPLE link to change some delimiters from the reference concrete syntax to another:
stago (<) will change to { ;
etago (</) will change to {/ ;
tagc (>) will change to } ;
lit (") will change to | .
We begin by declaring a SIMPLE linktype to tell the system that a link process has to be activated.
<!LINKTYPE DELIMCHG #SIMPLE #IMPLIED>
Then we write the class set that executes the transformation. Since we use the SIMPLE link there is only one class for all the elements of the application.
    classset DELIMCHG {
        // global variables
        ofstream file;
        const string target_stago = "{";
        const string target_etago = "{/";
        const string target_tagc = "}";
        const string target_lit = "|";

        // code that runs on program start
        onStart() { file.open("output.sgm"); }

        // code that runs on program end
        onEnd() { file.close(); }

        // Only one class
        class DELIMCHG : implements ISGMLEventSink {
            // class local data
            sgmlname_t LocalDocTypeName, LocalElementName;

            // on element start
            void onStart( sgmlname_t LinkSetName,
                          sgmlname_t DocTypeName,
                          sgmlname_t ElementName,
                          attlist_t SourceAttributes,
                          attlist_t TargetAttributes ) {
                // save element and doctype names for use on the end event
                LocalDocTypeName = DocTypeName;
                LocalElementName = ElementName;
                // write the beginning of the element
                file << target_stago << "(" << DocTypeName << ")" << ElementName;
                // write the attributes (the value indicator "=" is unchanged)
                for (int i=0; i<SourceAttributes.count(); i++) {
                    file << " " << SourceAttributes[i].name << "="
                         << target_lit << SourceAttributes[i].value << target_lit;
                }
                // write the tagc
                file << target_tagc;
            }

            // on data tokens
            void onData( sgmlstring_t Data ) {
                file << Data;
            }

            // on element end
            void onEnd( ) {
                // write end tag using saved element and doctype name
                file << target_etago << "(" << LocalDocTypeName << ")"
                     << LocalElementName << target_tagc;
            }
        }
    }
The above class set works according to the following scenario.
When the program starts (the associated linktype is activated), the onStart method of the class set is called and the output file is opened.
For each element instance an instance of the DELIMCHG class is created, this object receiving all the events for the corresponding element. For each event the (obvious) processing is done to write the newly delimited tags to the output file.
When the program finishes (the associated linktype is deactivated), the onEnd method of the class set is called and the output file is closed.
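For readers who want to experiment, the same scenario can be approximated in Python. This is a sketch only: an in-memory `io.StringIO` stands in for the output file, the events are fired by hand instead of by a parser, the value indicator `=` is written between attribute name and value, and the doctype name `DOC` is invented.

```python
import io

TARGET_STAGO, TARGET_ETAGO, TARGET_TAGC, TARGET_LIT = "{", "{/", "}", "|"


class DELIMCHG:
    """One instance per element; all instances share the output stream."""
    out = io.StringIO()   # stands in for output.sgm

    def on_start(self, link_set_name, doctype_name, element_name,
                 source_attributes, target_attributes):
        # remember the names for the end-tag
        self.doctype_name = doctype_name
        self.element_name = element_name
        write = DELIMCHG.out.write
        write(f"{TARGET_STAGO}({doctype_name}){element_name}")
        for name, value in source_attributes.items():
            write(f" {name}={TARGET_LIT}{value}{TARGET_LIT}")
        write(TARGET_TAGC)

    def on_data(self, data):
        DELIMCHG.out.write(data)

    def on_end(self):
        DELIMCHG.out.write(
            f"{TARGET_ETAGO}({self.doctype_name}){self.element_name}{TARGET_TAGC}")


# Events for a single element, fired by hand.
para = DELIMCHG()
para.on_start("#INITIAL", "DOC", "PARA", {"ID": "p1"}, {})
para.on_data("Hello")
para.on_end()
result = DELIMCHG.out.getvalue()
```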
In this example we show how #USELINK can be exploited to trigger different processing on the same element, the decision being taken by the parser, not the programmer.
We shall take a simple application: converting a document marked according to the GCAPAPER DTD to HTML. We shall not present the whole code here; however, we shall examine a typical example: handling the TITLE element.
Here are some excerpts of the GCAPAPER DTD which show the use of the TITLE element.
    <!-- The TITLE tag is used for: headings within the paper,
         figure captions, and to give the author's job title.
         Each type of title is formatted differently. -->
    <!ELEMENT title   - o (#PCDATA) +(%emphs;|ftnote|fnref) >

    <!-- FRONT MATTER ELEMENTS -->
    <!ELEMENT front   - o (title, subt?, author+, keywords?,
                           abstract, biography) >

    <!-- AUTHOR INFORMATION -->
    <!ELEMENT author  - o (fname, surname, title?, address ) >

    <!-- SECTION MODEL -->
    <!ELEMENT section - o (nbr?, title, para*, subsec1*) >
    <!ELEMENT subsec1 - o (nbr?, title, para*, subsec2*) >
    <!ELEMENT subsec2 - o (nbr?, title, para*, subsec3*) >
    <!ELEMENT subsec3 - o (nbr?, title, para+) >
This element has the peculiarity of appearing in different contexts: the main title of the document, headings within the paper, the author's job title, and so on. In each context it needs to be formatted differently.
Note that this is a real world example. While writing this, I periodically ran such a converter to preview and let others review my document in a more enjoyable format.
We shall show how this TITLE element is output in HTML in each different context (except for figure captions). So let us write the LINKTYPE.
    <!-- link type for GCAPAPER source dtd,                        --
      -- and PROCESS target (ie the application language module)   -->
    <!LINKTYPE GCA2HTML GCAPAPER PROCESS [

      <!-- initial link set for the author element -->
      <!LINK #INITIAL
        -- TITLE will be handled by the MAINTITLE class --
        TITLE MAINTITLE
        -- Switch to AUTHORLK for AUTHOR element --
        AUTHOR #USELINK AUTHORLK #IMPLIED
        -- Tell the app. there is a heading level change then --
        -- switch to SECTIONLK for SECTION element --
        SECTION #USELINK SECTIONLK LEVEL
        -- other link rules not shown here... --
      >

      <!-- one link set for the author element -->
      <!LINK AUTHORLK
        -- Author's job title will be handled by JOBTITLE class --
        TITLE JOBTITLE
        -- other link rules not shown here... --
      >

      <!-- one link set for the section element -->
      <!LINK SECTIONLK
        -- Tell the app. there is a heading level change --
        SUBSEC1 LEVEL
        SUBSEC2 LEVEL
        SUBSEC3 LEVEL
        -- All sections and subsections titles are handled --
        -- by the same class (SECTITLE) --
        TITLE SECTITLE
        -- other link rules not shown here... --
      >
    ]>
The point of this linktype is that each title element instance will be handled by a different class, based on its context. All section headings will be handled by the same class, and the LEVEL class is used to tell the application that there is a heading level change.
Now let us look at the classes.
    classset GCA2HTML {
        int level = 1; // top level (for main title)

        // The datawriter class is used as an ancestor
        // for all classes that need to output all the
        // data content of their element to the file.
        class DATAWRITER : implements ISGMLEventSink {
            void onData( sgmlstring_t Data ) {
                // write element data on output file
                file << Data;
            }
        }

        // Class to handle the main title.
        class MAINTITLE : implements ISGMLEventSink {
            // Local variable to store the title text
            string title = "";

            // The title has to be saved, because it
            // will not be written directly to the file
            // but used twice at the end (see onEnd()).
            void onData( sgmlstring_t Data ) {
                title += Data;
            }

            void onEnd( ) {
                // Write the HTML header and
                // the first heading, centered.
                file << "<head>" << "<title>" << title << "</title>" << "</head>"
                     << "<center><h1>" << title << "</h1></center>";
            }
        }

        class JOBTITLE : inherits DATAWRITER {
            // Title data is written by DATAWRITER ancestor class
            void onEnd( ) {
                // finish with a hard break
                file << "<br>";
            }
        }

        class LEVEL : implements ISGMLEventSink {
            void onStart( ,,,, ) {
                // increment global heading level
                level = level + 1;
            }
            void onEnd( ) {
                // decrement global heading level
                level = level - 1;
            }
        }

        class SECTITLE : inherits DATAWRITER {
            void onStart( ,,,, ) {
                // write HTML heading start tag
                file << "<h" << level << ">";
            }
            // Data is written by DATAWRITER ancestor class
            void onEnd( ) {
                // write HTML heading end tag
                file << "</h" << level << ">";
            }
        }
    }
What makes this application so simple is that the code does not need to take particular actions to keep track of the context: the parser does the hard work and selects the right portion of code (ie class), according to the link rules.
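To see the selection at work, the sketch below imitates the parser side in Python: a table of link sets chooses the class for each TITLE, and a #USELINK-style switch changes the current link set for the content of AUTHOR and SECTION. The `run` driver, the event tuples, and the reduced argument-less `on_start` are assumptions of this example; the link set table is a simplified reading of the link rules.

```python
import io

out = io.StringIO()
state = {"level": 1}   # global heading level, as in the class set

class DataWriter:
    def on_start(self): pass
    def on_data(self, data): out.write(data)
    def on_end(self): pass

class MAINTITLE(DataWriter):
    def on_start(self): self.title = ""
    def on_data(self, data): self.title += data
    def on_end(self):
        out.write(f"<head><title>{self.title}</title></head>"
                  f"<center><h1>{self.title}</h1></center>")

class JOBTITLE(DataWriter):
    def on_end(self): out.write("<br>")

class LEVEL:
    def on_start(self): state["level"] += 1
    def on_data(self, data): pass
    def on_end(self): state["level"] -= 1

class SECTITLE(DataWriter):
    def on_start(self): out.write(f"<h{state['level']}>")
    def on_end(self): out.write(f"</h{state['level']}>")

# link set -> element -> (class or None, link set to use inside the element)
LINK_SETS = {
    "#INITIAL": {"TITLE": (MAINTITLE, None), "AUTHOR": (None, "AUTHORLK"),
                 "SECTION": (LEVEL, "SECTIONLK")},
    "AUTHORLK": {"TITLE": (JOBTITLE, None)},
    "SECTIONLK": {"TITLE": (SECTITLE, None), "SUBSEC1": (LEVEL, None)},
}

def run(events):
    stack = [(None, "#INITIAL")]          # (object, current link set)
    for kind, value in events:
        current = stack[-1][1]
        if kind == "start":
            cls, uselink = LINK_SETS[current].get(value, (None, None))
            obj = cls() if cls else None
            stack.append((obj, uselink or current))
            if obj: obj.on_start()
        elif kind == "data":
            if stack[-1][0]: stack[-1][0].on_data(value)
        else:
            obj, _ = stack.pop()
            if obj: obj.on_end()

run([("start", "FRONT"), ("start", "TITLE"), ("data", "Paper"), ("end", None),
     ("start", "AUTHOR"), ("start", "TITLE"), ("data", "Engineer"), ("end", None),
     ("end", None), ("end", None),
     ("start", "SECTION"), ("start", "TITLE"), ("data", "Intro"), ("end", None),
     ("end", None)])
html = out.getvalue()
```

None of the classes tests its context: the three TITLE elements are formatted differently only because the table hands each one to a different class.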
A structure-controlled SGML application operates on the structure that is described by SGML markup. The usual way to communicate such structure information is through the ESIS interface.
ESIS defines a set of events that are of interest to an SGML application. However, most tools only deliver this information in a linear fashion, leaving to the programmer the chore of maintaining contextual information. This contextual information, which is very important in all but the simplest SGML applications, is naturally known by the parser. And SGML defines a standard mechanism to pass this information to the application: the LINK feature. However, even when full link information is available in ESIS, the programmer is often left alone to implement the dispatching of a linear sequence of events.
What we have presented here is an object-oriented computing model that offers the ability to write context-sensitive event-driven SGML applications. This model is based extensively on the standard LINK feature to map the elements of the DTD to application objects. We have also described an extended SGML parser that automatically dispatches ESIS events to application objects written by the programmer.
With this programming model one writes code for a particular context only, letting the parser select the right code fragment for the right context. This opens the whole world of reusability to SGML programming.
Please email your comments to Stéphane Bidoul at sbi@acse.be.
This paper was first published in the Conference Proceedings of SGML'96, pp 485-98.
Copyright © 1997 The SGML Technologies Group