[This local archive copy mirrored from the canonical site: http://www.cgl.uwaterloo.ca/meta/sgml97/mmccool/index.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
Author: Michael McCool
Assistant Professor
University of Waterloo, Department of Computer Science, Computer Graphics Lab, Waterloo, Ontario, Canada, N2L 3G1
Bio:
Michael McCool graduated in 1989 from the University of Waterloo with a B.A.Sc. in Computer Engineering, and from the University of Toronto in 1995 with a Ph.D. in Computer Science. During his time in Toronto Michael served on the executive of the ACM Toronto SIGGRAPH.
He is currently an Assistant Professor in the Computer Graphics Lab within the Department of Computer Science at the University of Waterloo. Published research includes algorithms for random sampling, analytic antialiasing, volume rendering, and representation of multivariate probability distributions. Currently, his research is focussed on the application of formal language theory to the representation and manipulation of symbolic representations of three-dimensional scenes and software engineering metainformation.
Author: Paul Prescod
University of Waterloo, Department of Computer Science, Computer Graphics Lab, Waterloo, Ontario, Canada, N2L 3G1
Bio:
Paul Prescod is currently a research assistant in the Computer Graphics Lab at the University of Waterloo. Paul has extensive SGML experience and has previously implemented SGML solutions to academic and banking problems. Paul helped to develop XML as part of the World Wide Web Consortium XML Special Interest Group. His current research involves structured specification of software component interfaces and the automatic generation of ``glue code'' and human readable documentation.
Keywords: CORBA, Interface description language, Software engineering, Object-oriented programming, Software components, Design patterns, Protocols, DTD Components
Abstract: SGML (and XML) documents consist of a grammar-constrained tree of typed, attributed nodes with ordered children. This structure can be used to represent almost any kind of information.
We are using SGML to represent software engineering metainformation, specifically language-independent, formal class library interface descriptions. Standard SGML document transformation tools can then be used to translate the interface description into support code or into human-readable documentation. We use DSSSL to generate the documentation and Perl in conjunction with SGMLSPL to generate code.
Unlike other systems designed to do similar things (e.g. CORBA's IDL), the SGML metadocument approach is extensible, and extra-language constraints such as protocols and design patterns can be represented. Furthermore, the transformation can be annotated to specialize it for a particular target programming language.
We have used this system to formally document the interface to a 3D graphics class library and automatically generate multiple language interfaces to it. Our METADOC DTD and transformation system is formed from a set of reusable DTD/DSSSL/Perl components which we have used to build other document types, and we will discuss our strategy in this regard.
SGML documents can represent formal information in a flexible and extensible way. We have been experimenting with the application of SGML to formal documentation for software components.
In this application, an SGML document instance serves as a language-independent description of object interfaces, similar to the CORBA (Common Object Request Broker Architecture) IDL (Interface Definition Language) [2]. The concept of an object is understood abstractly relative to a ``core object model''. In this model object classes have construction parameters, attributes, and methods, and may also have interface and implementation inheritance. Attributes are values which can be set and retrieved from the object, and methods are operations which can be performed on or by the object. An abstract type system and exceptions are also supported. This is an abstract model only; different bindings (e.g. C++, Tcl, Java) of the abstract definition of a class can implement the features of a class differently.
Standard publicly available SGML document transformation tools can be used to implement the generation of code to build the actual interface to the component in a specific target language, or multiple variants of the interface in a single language. In addition to such multiple language bindings, code generated from such an interface description can include proxy objects for distributed object access, script language interfaces to objects, and generic attribute interfaces enabling object database access and SGML externalizations of objects. The same source SGML document can also be used to generate formatted online or printable documentation that is guaranteed to be consistent with the actual interface of the objects.
In the following sections we describe our approach and indicate how we model object interfaces in SGML. We also use a parameterized component structure for our DTD. We will briefly describe our technique to accomplish this, then focus on the specific DTD component used to represent object interface information. SGML itself can be used as a target language for externalization of objects and scripting, and we will present some ideas on this. Finally, we will compare our approach with the use of structured comments and CORBA IDL.
Examples of the use of our system and the software can be found at our web site [5].
It is intended that formal object specifications will be embedded in other rather complex documents containing user reference material and tutorials. Rather than defining the whole of such document types, we have opted to use a DTD component system to maximize flexibility. The format of object definitions is defined in a relatively small DTD fragment which follows a specific structure which allows it to be reused in many different full DTDs.
Specifically, DTD component fragments in our system all expect specific parameter entities to be defined, and in turn define new parameter entities. Mutually recursive definitions are enabled by breaking the component DTD into two parts, a .ent file and a .comp file. The .ent file defines a set of parameter entities for the high-level exported elements in the component. A full DTD reads in these .ent files first, defines any new content models needed in terms of these parameter entities, then reads in the .comp model.
In particular, the DTD fragment defining object interface definitions expects parameter entities defining content models for textual annotation, and defines parameter entities for elements that refer to classes and types from elsewhere in the document.
Lack of name scope in SGML is a serious problem, which we currently avoid by following naming conventions to force globally unique names. However, in the future we intend to apply the same technique described here for object definitions to the description and integration of DTD components. We propose to create a document type to represent DTD components and then map these components onto a combined SGML/XML transformation to implement namespaces. Such an approach can gain the advantages of having name scope using current publicly available tools.
Basically, such a process would proceed as follows:
An SGML document with hierarchically-organized namespaces would first be processed as an unstructured [FN 1] XML document (using nsgmls in XML mode, for instance). Elements that create new scopes would require closing tags and would have to be properly nested. Simple document processing with a name-stack (using Perl/SGMLSPL, for instance) would add component name prefixes to all tags, converting a hierarchically-named document into one with a flat namespace [FN 2].
The structure of the resulting flat namespace document would then be verified against a DTD with prefixed names. This DTD would be generated from the relevant component DTDs.
Conforming documents would finally be processed with transformations that would internally use the fully expanded names.
Processing of the METADTD specification (as an SGML document) would produce as output a transformation process to implement the initial XML processing, a full DTD with fully expanded names, and style sheets with fully expanded names by combining component specifications appropriately.
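The name-stack step in this process can be sketched as follows. This is a minimal illustration in Python rather than the Perl/SGMLSPL processing proposed in the text; the set of scope-creating elements, the `.` separator, and the function names are assumptions made for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical: elements that open a new name scope.
SCOPE_ELEMENTS = {"image", "bitvector"}
SEPARATOR = "."  # assumed prefix separator


def flatten(elem, stack=()):
    """Rewrite elem and its descendants in place with scope-prefixed names."""
    local_name = elem.tag
    elem.tag = SEPARATOR.join(stack + (local_name,))
    # Only scope-creating elements push their name for their descendants.
    if local_name in SCOPE_ELEMENTS:
        child_stack = stack + (local_name,)
    else:
        child_stack = stack
    for child in elem:
        flatten(child, child_stack)


def flatten_document(text):
    """Parse well-formed markup and return the root with flat, prefixed names."""
    root = ET.fromstring(text)
    flatten(root)
    return root


doc = "<scene><image><shift n='1'/></image><bitvector><shift n='2'/></bitvector></scene>"
flat = flatten_document(doc)
print([e.tag for e in flat.iter()])
```

The two `shift` elements come out with distinct flat names, which is the property the prefixed DTD would then verify.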
Using the above scheme, representation of object interfaces in our system is restricted to a single DTD component called SPEC.
A SPEC consists of an optional list of type definitions followed by a list of object class interface definitions. Each class interface definition includes a list of construction parameters, attributes, and methods, and may also define new anonymous types for use in typing arguments and return values of methods.
Types are defined using a set of basic types common to all languages: characters, strings, finite length integers of various lengths, and IEEE floating point numbers. Composite type constructors are also supported to build more complex types: arrays, lists, enumerated types, and records.
Names of classes and types are used as SGML identifiers for cross-referencing. References to external object classes or types use the formal public identifier of the document containing the class' interface definition along with the name of the object class as defined in that document.
The overall specification is also given a package name and a short package name prefix. If necessary, in particular language bindings without package namespaces the prefix can be used to build globally unique but reasonably short class names. The transformation process can also make additional modifications to the names of classes if necessary to support multiple implementations of the same component interface in a particular language.
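The structure of a SPEC described above could be modelled in memory roughly as follows. This is an illustrative sketch only: the Python class names, the `flat_name` helper, the abstract type name `float32`, and the `void` default are assumptions, not part of the SPEC DTD.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Param:
    name: str
    type: str  # name of a basic or composite abstract type


@dataclass
class Method:
    name: str
    params: List[Param] = field(default_factory=list)
    returns: str = "void"  # assumed marker for "no return value"


@dataclass
class ClassSpec:
    name: str
    construction_params: List[Param] = field(default_factory=list)
    attributes: List[Param] = field(default_factory=list)
    methods: List[Method] = field(default_factory=list)


@dataclass
class Spec:
    package: str
    prefix: str  # short package name prefix
    classes: List[ClassSpec] = field(default_factory=list)

    def flat_name(self, cls):
        """Class name for language bindings without package namespaces."""
        return self.prefix + cls.name


spec = Spec(
    package="graphics",
    prefix="Gn",
    classes=[ClassSpec("Camera", attributes=[Param("fov", "float32")])],
)
print(spec.flat_name(spec.classes[0]))
```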
In most languages, methods must also be generated to access and set attributes. Attributes cannot be implemented using direct access data members because this does not make remote access implementations possible using the same interface.
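As a rough illustration of accessor generation, the sketch below emits C++ accessor declarations from a list of attributes using a parameterized block of code. It is written in Python for brevity, not in the Perl used by the system, and the `get_`/`set_` naming scheme and the type mapping table are hypothetical.

```python
# Parameterized block of target code, filled in once per attribute.
ACCESSOR_TEMPLATE = """\
    {ctype} get_{name}() const;
    void set_{name}({ctype} value);
"""

# Hypothetical mapping from abstract type names to C++ types.
CPP_TYPES = {"int32": "int", "float32": "float", "string": "const char*"}


def cpp_accessors(attributes):
    """Return C++ accessor declarations for (name, abstract_type) pairs."""
    return "".join(
        ACCESSOR_TEMPLATE.format(name=name, ctype=CPP_TYPES[abstract_type])
        for name, abstract_type in attributes
    )


print(cpp_accessors([("width", "int32"), ("label", "string")]))
```

Because clients only ever see the accessor pair, a remote proxy can supply the same two methods without exposing a data member.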
Not all information contained in the class specification can always be converted into code. There are two cases:
Information may be provided which is specific to particular language bindings.
For instance, element attributes are currently provided in the SPEC DTD component to indicate whether a value to a method should be passed by reference or not, whether a method modifies the object (useful in C++ and functional languages), whether multiple overloaded methods should be provided to pass vectors as individual elements, and so forth. Reasonable defaults are provided for these attributes, but ``nice'' object interfaces often require customization of these attributes.
Additional architectural constraints may be provided which currently are not enforced by any programming language.
Examples include protocols and design patterns. Protocols are permissible sequences of method invocation and attribute access, possibly with additional temporal constraints. Design patterns are specifications of a set of roles in a pattern and identification of the mapping of specific classes and methods in the current definitions onto these roles.
It is intended that the second kind of information be used to generate more useful documentation and testing. Protocols may in fact be supported by future programming languages, and code can always be generated to do run-time testing (e.g. by assert statements). Unfortunately, static protocol checking basically involves testing whether one language is included in another, a problem that is decidable for regular languages but undecidable for context-free languages in general.
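A run-time protocol check of the kind mentioned above can be sketched as a small finite-state machine over method names. The protocol shown (open, any number of reads, close) and all identifiers are hypothetical; generated code could embed an equivalent check in each method.

```python
class ProtocolError(Exception):
    pass


class ProtocolChecker:
    """Run-time check of a call sequence against a finite-state protocol."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # {(state, method_name): next_state}
        self.state = start
        self.accepting = accepting

    def call(self, method_name):
        key = (self.state, method_name)
        if key not in self.transitions:
            raise ProtocolError(
                f"{method_name!r} not allowed in state {self.state!r}"
            )
        self.state = self.transitions[key]

    def finished(self):
        """True if the sequence seen so far is a complete, permitted one."""
        return self.state in self.accepting


# Hypothetical protocol: open, then any number of reads, then close.
FILE_PROTOCOL = {
    ("closed", "open"): "opened",
    ("opened", "read"): "opened",
    ("opened", "close"): "closed",
}

checker = ProtocolChecker(FILE_PROTOCOL, start="closed", accepting={"closed"})
for name in ["open", "read", "read", "close"]:
    checker.call(name)
print(checker.finished())
```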
Protocols are interesting for another reason, as they are similar to content models in DTDs. We discuss this point further below; a class definition can in fact serve as a meta-DTD describing an SGML externalization of a data structure, or a script used to build a data structure composed of a collection of objects.
Design patterns enhance documentation and also enable semantic cross-checking of object definitions. By explicitly defining the role of an object or method in a pattern, chances are decreased that code maintainers will extend the object interfaces in ways that defeat or subvert the pattern. In order to reuse a pattern in multiple components, it is necessary to provide a mechanism to refer to an external pattern definition by name.
As an example of an especially useful and common design pattern, consider the Composite pattern [1]. The Composite pattern is used to describe hierarchically decomposable structures. This pattern (in its ordered child form) is of course fundamental to SGML, as SGML documents consist of nested elements. A software component conforming to this pattern can be used to build data structures that can be externalized to an SGML document. There are several roles in this pattern: Component, Leaf, and Composite. The pattern is attached to the component by specifying which classes perform which roles, and which methods are used for which roles within each of those classes.
Pattern metainformation can be used in several interesting ways. Firstly, formal verification can match a formal specification of the pattern against the actual classes specified to determine if the requirements of the pattern are satisfied. This can help prevent accidental subversion of the intent of a design in team development environments. Documentation can be improved; simply stating that such-and-such a collection of classes and methods is an instance of a particular design pattern can help experienced programmers understand a complex component design quickly. Likewise, automatically generated class diagrams can be formatted to emphasize a design pattern. Coding idioms are typically associated with a pattern, and support code or default implementations for these idioms can be automatically generated. As any given component can satisfy several patterns simultaneously, such code generation needs to be orthogonal to avoid interference between multiple patterns.
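The verification idea can be illustrated with a deliberately simplified check that each role of a pattern is bound to a class declaring the methods that role requires. The requirement table and the method names (`operation`, `add_child`, `get_child`) are assumptions for this sketch; a fuller version would also map methods onto roles by name, as described above.

```python
# Hypothetical requirement table: methods each Composite-pattern role must supply.
COMPOSITE_REQUIREMENTS = {
    "Component": {"operation"},
    "Leaf": {"operation"},
    "Composite": {"operation", "add_child", "get_child"},
}


def check_pattern(requirements, role_bindings, class_methods):
    """Return a list of (role, class, missing_method) violations.

    role_bindings maps role name -> class name;
    class_methods maps class name -> set of method names it declares.
    """
    violations = []
    for role, required in requirements.items():
        cls = role_bindings.get(role)
        if cls is None:
            violations.append((role, None, None))  # role left unbound
            continue
        for method in sorted(required - class_methods.get(cls, set())):
            violations.append((role, cls, method))
    return violations


bindings = {"Component": "Node", "Leaf": "Sphere", "Composite": "Group"}
methods = {
    "Node": {"operation"},
    "Sphere": {"operation"},
    "Group": {"operation", "add_child"},
}
print(check_pattern(COMPOSITE_REQUIREMENTS, bindings, methods))
```

Here the class bound to the Composite role lacks a child accessor, so the check reports a single violation.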
We have implemented several transformations of class specifications, using an embedding of the SPEC component into a reference manual document type we call METADOC. We use public-domain tools to implement these transformations. The tools we use are James Clark's nsgmls SGML parser, the Jade implementation of DSSSL, and SGMLSPL, a Perl transformation toolkit implemented by David Megginson of the University of Ottawa.
The Perl 5 SGMLSPL module is used to intercept SGML parse events, and we have built additional Perl modules to implement code generation for various tasks. Syntactically, Perl is very useful in this respect, as large blocks of parameterized code can be inlined without requiring multiple print statements. Also, the built-in hash tables facilitate name management.
Currently, we have implemented code generators to build C++ interfaces to objects (in both tightly and loosely bound forms), generic named attribute interfaces to objects satisfying the Composite pattern to permit SGML externalization of objects, and OTcl scripting interfaces to objects. As described in the next section, we are testing these implementations using a 3D graphics library consisting of approximately 30 classes (the system, called Gn, is functionally similar but not identical to VRML). In the future we would like to expand the list of interfaces with Scheme, Perl, Python, Smalltalk, and Java bindings to objects. We have not yet attempted to build distributed object bindings but this could be accomplished by generating CORBA IDL or a native interface description for Xerox PARC's ILU system.
We use the Jade implementation of DSSSL to generate human-readable documentation. Whereas the Perl transformation basically ignores everything but the SPEC, the DSSSL transformation processes the entire document and uses (pseudo) flow objects to generate TeX, RTF, and Jade-specific HTML views of the document. How our DSSSL implementation deals with code reuse and multiple components in this and other DTDs is described in another paper presented at this conference [4].
A medium-sized class library implementing 3D scene graphs was used as a test case. The original documentation for this class library was implemented in highly structured, hyperlink enabled TeX using a Modula/Ada syntax for language-independent definition of the interface. Printed, the documentation was about 200 pages in length [6].
The system itself was already implemented and running, using hand-generated C, C++, and OTcl bindings of all objects. We wanted to see if we could generate these bindings automatically as maintaining the multiple language bindings manually was proving to be a nuisance, the system was constantly evolving, and we wanted to add yet more bindings (to Java in particular). Already, this ``boring'' code was more than 3/4 of the implementation effort; the total amount of code was approximately 50,000 lines.
We first translated the existing documentation into SGML using a simple version of the SPEC DTD, and then proceeded to implement the code generation and processing. Translation from TeX to SGML took about two person-weeks, although the person involved was also learning SGML at the time. Implementation of the code generator to build the C++ and Tcl interfaces took about two person-months, but again the person involved (Michael van Altena) was learning Perl. The generation of the DSSSL transformations for the human-readable version of the documentation took about three person-weeks by an experienced SGML and DSSSL practitioner (Paul Prescod), although we also had to implement a large number of other DTD components besides SPEC: lists, author identification blocks, emphasis, hyperlinks, footnotes, math, images and figures, bibliographies and citations, code listings, etc.
The system has been reasonably successful, although it is still under continual evolution. We have regenerated a fairly sophisticated Tcl scripting interface to our 3D system, which is approximately 24,000 lines of repetitive C code that no one has to deal with anymore. It is interesting to note that implementing the generic SGML transformations to generate this code took about as long as writing the code by hand did.
One difficulty with the C++ binding is that C++ does not support a clean separation of interface and implementation, but the code generator needs to be able to rebuild the interface separately from the implementation. We wanted to avoid the need to read and parse hand-modified C++ header files. The problem can be attacked in several different ways. Each approach has tradeoffs. Rather than focussing on one ``Right Way'', we decided to exploit the capability of generating multiple implementations from the same interface description.
Our first approach to the problem ignores it. We just build a class interface definition directly, and use a preprocessor statement to include the private part of the class interface (built manually by the implementor) from a separate file. In this case the interface and the implementation are the same, and typedefs are used to combine the two concepts syntactically.
While somewhat inelegant, this results in a directly compilable implementation with no need for extraneous virtual functions, abstract interface classes, or multiple inheritance. This approach is consequently suitable when high performance is required. However, the approach limits the system to a single implementation of the component, although different systems could compile versions with different implementation files.
A second, more flexible approach uses parallel implementation and interface classes. With care, a component implemented in this fashion can be invoked using precisely the same names and types as the simpler implementation above. Both the implementation and interface classes have the same inheritance hierarchy, but the implementation inherits its corresponding interface class as a virtual base class to avoid multiple copies of base interfaces. Unfortunately, all methods in the interface classes must be virtual, and the virtual base class inheritance exacts a runtime performance and space penalty. This approach does, however, permit multiple implementations, so local and remote objects (for instance) can be mixed in a single system, and share the same interface.
There are other possibilities, such as delegation to implementations in other languages, that could be useful in some contexts.
It would be nice if the code generator could reconcile modified class interfaces with the documentation, but this would require a more complex system that could parse modified C++ class definitions and perhaps read structured comments (as in JavaDoc or Docify++). There are several difficulties with this. The particular language in question may not be able to express some of the constraints available in the SGML metadocument. These constraints could be expressed using structured comments, but this would require a complex secondary notation, unless parts of the SGML document itself were embedded in the code. This approach ultimately results in the programmer doing the code generator's job, since the programmer must modify a description of the interface in both the native language's syntax and the structured comment syntax, and make sure they are consistent. Large amounts of documentation also clutter the language-specific presentation of the interface, and slow down compilation unless precompiled headers are used. It is much simpler just to put the interface description in one place, the SGML metadocument, and leave it at that.
Both in-band structured comments and out-of-band SGML documentation are useful in different contexts. The simplest way to apply these tools is to apply them to different tasks. Structured comments can be used to document the implementation in a particular language. SGML markup should be used to document the external interface to a component in a language-independent fashion. IDE (Integrated Development Environment) support that permits simultaneous editing of both the implementation and the interface description, combined with online incremental code generation, would be the ideal solution.
The differences between internal structured comments and external metadocumentation are similar to the differences between hyperlinks as implemented in HTML and external hyperlink databases as permitted in XML. The first is fast and easy to use, but hard to extend to complex circumstances. The second approach requires computer support to be as easy to use as the first approach, but is ultimately more flexible and robust.
SGML itself can be used as a target language. This is useful for externalization of simple data structures satisfying certain patterns, and also for using SGML as a simple scripting language consisting of sequences of messages to objects. A ``flow object'' interpretation of an SGML document is useful in these contexts.
Consider first the implementation of the Composite design pattern. Such a pattern permits a hierarchical, ordered tree to be built, and such a tree is a direct translation of the structure of an SGML document.
Our implementation builds the SGML binding of a class by subclassing (in C++) from a generic interface to attributed graph nodes, then generating code to map the generic, string-based interface used by the SGML interface to the native methods and attributes. The class can then be output to or read from an SGML externalization using a method in this abstract base class. If the Composite pattern is specified, then the nodes must also implement methods to access children, and the generic interface uses these methods to recursively descend or build the data structure.
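The shape of such a generic interface can be sketched as below. The sketch is in Python rather than C++, and the names `GenericNode`, `get_attr`, `set_attr`, and `externalize` are hypothetical; it only illustrates a string-based attribute interface plus recursive output over children.

```python
from xml.sax.saxutils import quoteattr


class GenericNode:
    """String-based attribute interface plus recursive externalization."""

    attribute_names = ()  # subclasses list their externalizable attributes

    def __init__(self):
        self.children = []

    def get_attr(self, name):
        return str(getattr(self, name))

    def set_attr(self, name, text):
        # Convert using the type of the current value (assumes one is set).
        current = getattr(self, name)
        setattr(self, name, type(current)(text))

    def externalize(self):
        tag = type(self).__name__.lower()
        attrs = "".join(
            f" {n}={quoteattr(self.get_attr(n))}" for n in self.attribute_names
        )
        body = "".join(child.externalize() for child in self.children)
        return f"<{tag}{attrs}>{body}</{tag}>"


class Group(GenericNode):
    pass


class Sphere(GenericNode):
    attribute_names = ("radius",)

    def __init__(self, radius=1.0):
        super().__init__()
        self.radius = radius


scene = Group()
scene.children.append(Sphere(2.5))
print(scene.externalize())
```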
Attributes in the class definition can be converted to and from CDATA in SGML, and bound to element attributes. Construction parameters for an object can also appear as attributes or content. A DTD must also be generated, and can be used to verify an SGML externalization. A basic SGML system cannot by itself do all the type checking implied by the METADOC, but it can add structural constraints specified in a protocol declaration to the DTD to perform protocol checking, and a language-independent type checker can be implemented as an SGML application.
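DTD generation from class descriptions might look roughly like the following sketch. The input dictionary layout is an assumption, every attribute is declared as CDATA and implied, and the content model is a plain repeatable choice; a protocol declaration would replace that choice with a stricter model.

```python
def generate_dtd(classes):
    """Emit ELEMENT and ATTLIST declarations for a list of class descriptions.

    Each description is a dict with 'name', optional 'attributes'
    (list of attribute names), and optional 'children' (element names
    allowed as content).
    """
    lines = []
    for cls in classes:
        name = cls["name"].upper()
        children = [c.upper() for c in cls.get("children", [])]
        if children:
            lines.append(f"<!ELEMENT {name} - - ({' | '.join(children)})*>")
        else:
            lines.append(f"<!ELEMENT {name} - O EMPTY>")
        for attr in cls.get("attributes", []):
            lines.append(f"<!ATTLIST {name} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


print(generate_dtd([
    {"name": "Group", "children": ["Group", "Sphere"]},
    {"name": "Sphere", "attributes": ["radius"]},
]))
```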
The implementation of methods in SGML documents is more interesting than that of simple children in a composite. If we ignore the namespace issue for a moment, we can name SGML elements after every method and permit these elements to appear in the content model for an element. The appearance of an element named after a class then means that that class should be constructed and attached as a child of its parent; the appearance of a method element means that that method should apply to the object in which it appears as content, thereby updating its state. The parameters of a method can appear as attributes. Default parameters can be implemented using implied parameters, and return values can be bound to names using SGML IDs, with ID references used later in place of other parameters.
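This interpretation of class and method elements can be sketched as a small tree walk. The `Image` class, its `shift` method, and the `CLASSES` registry are hypothetical, and return-value binding through IDs is left out; parameters arrive as strings, as they would from CDATA attributes.

```python
import xml.etree.ElementTree as ET


class Image:
    def __init__(self):
        self.offset = 0
        self.children = []

    def shift(self, n):
        self.offset += int(n)  # parameters arrive as attribute strings


CLASSES = {"image": Image}  # hypothetical registry: element name -> class


def interpret(elem, parent=None):
    """Class elements construct objects; method elements call the enclosing object."""
    if elem.tag in CLASSES:
        # Element attributes act as construction parameters.
        obj = CLASSES[elem.tag](**elem.attrib)
        if parent is not None:
            parent.children.append(obj)
        for child in elem:
            interpret(child, obj)
        return obj
    # Otherwise treat the element as a method applied to the enclosing object.
    getattr(parent, elem.tag)(**elem.attrib)
    return parent


script = '<image><shift n="3"/><image><shift n="5"/></image><shift n="-1"/></image>'
root = interpret(ET.fromstring(script))
print(root.offset, root.children[0].offset)
```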
Using this simple approach with standard SGML, we quickly run into a nasty problem: a single flat namespace must be used for all classes and all methods on all classes. This is awkward, and runs into difficulty if methods are overloaded. For instance, suppose we had an Image and a BitVector class in the same document, both with a ``shift'' method taking an integer. If we try to declare two SHIFT elements we will get a name conflict in the DTD.
A solution is to use prefixes on all method names, but this is rather verbose. However, we can use the transformation process described above to add name scopes to the SGML document, treating each class element as a namespace declaration object.
Such an SGML-based ``scripting language'' is limited in the sense that procedures and object classes cannot be defined dynamically, since new elements cannot be defined dynamically. This limitation can be overcome at the price of additional verbosity and loss of structural verification by changing the way SGML ``scripts'' are represented. Generic METHOD and CLASS elements that refer to classes by name can be defined, and used within a document that also provides elements for the definitions of functions and procedures. Alternatively, we could use the SGML syntax but process the document as XML and again forgo validation.
There is another simple capability that can be added to SGML systems at the same time that namespaces are added: parsing of LL(1), SLR, or LALR(1) languages embedded as the content of elements. Such parsing can expand concise formal language syntax into SGML if necessary. At the level of data structures in an implementation, there really is no fundamental difference between the parse tree for a formal language and the element structure of an SGML document, and this permits a potentially interesting unification of scripting capabilities and document definitions.
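As a small illustration of expanding embedded formal-language content into element structure, the sketch below parses a toy LL(1) expression grammar by recursive descent and returns the equivalent markup. The grammar and the element names `plus` and `number` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"\s*(\d+|[()+])")


def tokenize(text):
    text = text.strip()
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expr(tokens):
    """expr := term ('+' term)*  (hypothetical toy grammar)."""
    node = parse_term(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        plus = ET.Element("plus")
        plus.append(node)
        plus.append(parse_term(tokens))
        node = plus
    return node


def parse_term(tokens):
    """term := NUMBER | '(' expr ')'."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        node = parse_expr(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return node
    if token.isdigit():
        return ET.Element("number", value=token)
    raise SyntaxError(f"unexpected token {token!r}")


def expand(text):
    """Expand concise expression syntax into element markup."""
    tokens = tokenize(text)
    tree = parse_expr(tokens)
    if tokens:
        raise SyntaxError(f"unexpected token {tokens[0]!r}")
    return ET.tostring(tree, encoding="unicode")


print(expand("1 + (2 + 3)"))
```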
We have presented a prototype implementation of a system for expressing software object metainformation in SGML. We have used this metainformation to automatically generate code to build scripting interfaces and generic access to objects. This approach has much more flexibility than the CORBA IDL approach, and can be tuned to produce better interfaces in specific target languages from generic object descriptions. SGML can also combine formal and informal documentation in a clean way, and can itself be a target language.
The essential feature of the SGML approach is its extensibility, unlike the fixed and restrictive format of the CORBA IDL. This flexibility arises because there is an additional layer of metainformation, the SGML DTDs and the transformations on them. Unlike CORBA IDL, a good SGML system permits small incremental change without making it impossible to recover the necessary information needed by an application.
As stated above, we intend to extend these ideas and the lessons we have learned to METADTDs that can support multiple components. Benefits of such an approach include better documentation and reuse of document designs. Namespaces can also be added to current SGML systems without internal modification of existing SGML tools.
This research is part of the Metamedia Project within the Computer Graphics Lab at the University of Waterloo, and is supported by a grant from the Information Technology Research Council of Ontario. Additional funding for our lab is provided by the Natural Sciences and Engineering Research Council of Canada. Michael van Altena, now a graduate student at the University of British Columbia, did the initial implementation of the C++ and OTcl code generators, while Paul Prescod implemented the DSSSL documentation generator.