[Mirrored from: http://www.balise.com/current/articles/chahun.htm; text only in this archival mirror copy]

SGML Template Driven Database Extraction:
A New Approach to Report Generation

François Chahuneau

General Manager

A I S
35 Rue du Pont
F-92200 Neuilly sur Seine

fcha@ais.berger-levrault.fr

Sylvain Guennou

Research Engineer

Informatique CDC
113, rue J.M. Naudin
F-92220Bagneux

guennou@caissedesdepots.fr

André Blavier

Research Engineer

Informatique CDC
113, rue J.M. Naudin
F-92220Bagneux

blavier@caissedesdepots.fr

Abstract

Generation of SGML-coded documents as a result of database query processes is a commonly used practice. In most cases, however, the contents of such documents are entirely built from scratch as an SGML-formatted image of the query results. We present an extension to this practice, in cases when documents are made of a combination of human-generated parts and database originated parts. When such documents are updated, human-generated parts should remain untouched, while database originated parts (text, tables and graphics) should be regenerated or updated.

The method used here is that of SGML templates, which embed links targeted to a database. Such a technique can be used in many application fields, ranging from Web applications to industrial catalog publishing, where complex, human-generated document structures coexist with database extracts.

1. Introduction

Since its inception, SGML fostered many new ideas about possible relationships between the two dominant forms of structured information available for computer storage: structured documents and databases.

Putting aside the question of how to best store SGML structures into databases, the other (and equally important) issue concerns the generation of SGML structures from databases.

Generating an SGML stream as a special form of report generation from a database is today a classical technique which has been in use for several years: this is especially useful when one wants to preserve the neutrality and the generality of the database export format, to be able to reuse it (through post-processing) for publishing on multiple targets. This approach has recently been generalized through WWW applications, with the current trend of dynamic web pages, where HTML fragments are generated on-the-fly as replies to database queries.

However, this ``data pushing'' approach suffers one main limitation: the SGML structure is reconstructed from scratch each time, which assumes that the database is the only information source for producing SGML documents. For more complex cases where information is merged from several sources, including human input, another approach based on ``data pulling'' from the database seems more adequate. The application described in this paper falls into this category.

We have used this approach in the design of a publishing system for a financial analysis department, which will be taken as a case study. Reports are written by analysts based on rapidly evolving financial data stored in relational databases. Major productivity gains can be obtained by automating the database-document links (for text, table and graphics). Automated link activation is required both at the initialization step, which provides the analyst with a pre-filled document template, and during the ``refreshing step'' (automated update), after the analysis has been written and prior to the final document release.

2. A Publishing System for Financial Analysts

The concrete purpose of the work described in this paper is to provide a financial analysis department in a stock broking house with a semi-automated document production system that fits closely to analyst's working requirements.

The system we have designed replaces an older, non-SGML-based system that suffered a number of drawbacks. Among the weaknesses of that system was the fact that the documents it produced lacked processability: since the documents were not (or weakly) structured, it was hard to automate a number of transformations that are desirable in the overall publishing process (e.g.: creating compound documents using existing ones, or translating to HTML).

The goals of the new system, from the users' point of view, were:

to minimize time-consuming manual tasks in routine document manipulations (document initialization, document formatting, bilingual French-English publishing...);
to allow different views of the same document for internal or external purposes;
to enforce editorial consistency and achieve high-quality standards for paper and Web publishing.

2.1 Financial Analysis Documents

The role of the financial analysis department is to publish a variety of reports about financial markets and the financial health of major companies. Target readers of these documents are chiefly corporate investors who expect to find here both reliable quantitative data and well-informed strategic advice.

The department's production encompasses a wide range of documents, from short memos about single companies up to lengthy, comprehensive studies about companies or business sectors, as well as recapitulatory monthly booklets.

All these documents share a common feature: they intermix database-originated parts and human-written parts. Database-originated parts display financial numerical data, whose database values are updated on a daily basis (and, for some of them, in a continuous way). They are presented as tables and graphics in documents. Human-originated parts are mostly text fragments.

2.2 Document Lifecycle

From the point of view of an analyst, most documents are not created from scratch. Instead, the analyst asks for the information system to generate an initial document skeleton which already contains standard tables and graphics depicting key data about the business under discussion.

The document is then interactively edited and completed: the analyst adds text (expressing his or her analysis about available data) along with custom tables and graphics if needed. Since the document preparation period may take up to several days or even weeks, the analyst will, from time to time, asks for the system to update the key data it holds. This updating step for database-originated data should not alter or discard those parts of the document that were manually edited. However, automatically generated parts should be updated even if they have been manually altered.

Finally, the document is released, both on paper and in HTML format.

Figure 1 outlines these steps.

3. Design Goals and Principles

The lifecycle described above, which involves automated link activation both at the initialization and updating steps, is common practice in financial publishing, at least conceptually. A major goal in the technical design of the system was to create a generic framework that could be used in various applications involving database publishing. The use of SGML was then an obvious choice: among all of its well-known advantages, document processability was the key factor.

In order to maximize user-level control of the system and to minimize software development efforts when new document types are to be handled, we have adopted the technique of declarative documents: documents embed all the necessary information used to update them, i.e. they hold links targeted to the database.

The initialization step described above, where the user requests the creation of a new document containing key data, actually involves the activation of links embedded in a template document. The key point in link activation is that, although it fills the document with source data, it preserves the link itself while other elements (user-created ones) remain untouched. This allows for repeated, non-destructive updating steps: only database-linked elements get updated.

It should be noticed that the initialization step (activating links in a template document) and the updating step (activating links in an edited document) are actually the same process. This unified approach has a number of advantages: besides the gains it provides in software development, a template document and the edited documents that are derived from it are instances of the same DTD; they can be manipulated exactly in the same way and the template serves both as a specification and as an example.

(Since we have identified the initialization and updating steps as the same process, from now on we will simply refer to it as the ``updating'' process.)

4. System Description

The updating process applies a transformation to a source document that contains references, or links, to a data source. The process retrieves the data associated with each link, transforms them into an SGML structure which it then inserts appropriately into the document. The insertion may actually be the substitution of an obsolete SGML sub-tree. Note that the performed transformation is isomorphic with respect to the DTD.

The updating process only modifies so-called ``updatable elements''. An updatable element has a characteristic pattern of attributes (in much the same way as HyTime elements have attributes described by architectural forms [HyTime 94]) that specify how it is to be updated by the process.

It was decided not to let SQL (ODBC) queries appear directly in attribute values. Such queries are, of course, dependent on the physical database structure. A catalog of stored procedures (some of which reduce to simple SQL queries) provides an abstraction layer which insulates document-stored information from such details.

The global architecture of the system builds upon three main components: at the back end stands a relational database with stored procedures that ``frontalize'' queries; at the front end is the user agent: an SGML editor; between both, a middleware layer contains the document updating engine and communicates with the database on one side and with the SGML editor on the other side.

4.1 Updatable Elements

Analysis of our financial documents has shown that three structural categories are well suited for automatic generation: simple textual elements, graphic elements and tables. Fig. 2 shows the front page of a typical report with various links, template text and edited text.

We have chosen to implement updatable elements in the spirit of HyTime: an element is updatable if it has a characteristic pattern of attributes that describe how to update it, i.e. the pattern describes a link. An alternative approach is to use special elements rather than special attributes. We preferred the HyTime-like approach because it involves lighter DTD management and it is more transparent to end users.

4.1.1 Simple textual and graphic elements

In the case of updatable textual elements, the source data is expected to be a single value which is inserted as the element content.

Updatable graphic element attributes contain a command used by a graphic generator. The updating engine assumes that the element has a particular attribute used to reference the generated graphic file and updates it appropriately.

4.1.2 Tables

We use CALS tables because they are commonly handled by SGML authoring tools and because they fulfill our needs, with some adaptations as we will see below.

Our end-user requirements concerning tables are somewhat complex. Some tables contain both automatically generated parts and static parts (that sometimes stand not only in the table head or foot but also inside the body), and possibly human-generated parts. On the other hand, resorting to the extreme solution of having each table cell individually contain a simple textual link anchor would lead to unacceptable update performance, because of very fragmented database queries. These constraints led us to design a flexible updating scheme for tables.

In the simplest case, the root table element is updatable: the link, when activated, is expected to return a relational table, i.e. a list of tuples, that will be transformed into a CALS table structure. A dedicated attribute specifies if the tuple list should be unrolled vertically (1 tuple - 1 row) or horizontally (1 tuple - 1 column).

In more complex cases, individual table rows can contain links. Each such link should refer to a single tuple which is mapped to a CALS table row. It can also be helpful to attach links to individual table columns. But since CALS tables do not have an explicit notion of columns (there is no ``column'' element in the CALS table DTD), we had to associate updatable table elements with so called tcols elements whose sole purpose is to be link anchors. Like row links, column links should refer to a single tuple which is unrolled vertically.

Finally, individual table cells can be link anchors. They are then handled just like simple textual or graphic updatable elements.

Figure 3 Illustrates the four possible link types that tables may contain.

The updating process has an implicit knowledge of the CALS model and maps the source tuples appropriately, depending on the anchor's element type, and creates sub-elements as needed. Consistency is checked by the process, prior to link activation, to ensure that link-covered table areas do not overlap.

4.1.3 Document variables

As we will see, links targeted to the database contain query expressions that use a number of arguments. To enhance parametrization of links, we have introduced document variables, or docvar elements. These document variables are set by the analyst just before document initialization (by filling dialog boxes in the system's user interface), and are injected into the document structure during this step. A docvar is an empty element with two attributes: name and value. For example: <docvar name=''company'' value=''Saint Gobain''>.

Document variables can be referenced in links in two ways:

as the link source: this applies to simple textual elements. The element content is substituted by the referenced docvar value when updating occurs.
as query parameters: prior to invocation, variable references are substituted by their value. Example: getPrice $company, `11/18/1996'

4.2 Link Implementation

Updatable element links carry information of two kinds:

information about the data source: if the link refers to the database, it contains a query expression (stored procedure invocation); if it refers to a graphic, it contains a command for the graphic generator; if it refers to a document variable, it contains the variable name.
information about data formatting. This is especially useful for numeric data. As an example, for a query returning a price: 37.19, this data item is typically formatted as: 37,2 $ in French documents and as $37.2 in English documents

This information is described by the following attribute set:

genclass (db | graph | var) #IMPLIED
The link class characterizes the type of the data source (db: database, graph: graphic, var: document variable).
genget CDATA #IMPLIED
This attribute contains the expression used to retrieve the referenced data. If genclass=db, it contains the invocation of a database stored procedure, e.g.: getPrice $company, $docDate
If genClass=graph, it contains a command to the graphic generator.
If genClass=var, it contains a docvar name, e.g.: company
genmapping (tupleisrow | tupleiscol) tupleisrow
This describes how tuple lists are to be unrolled in tables, vertically or horizontally.
gentype (integer | numeric | money | date) #IMPLIED
This datatype information is used for formatting purposes: each type has default formatting properties. For example, money data are formatted with 1 decimal number by default.
printfn CDATA #IMPLIED
These print functions are expressions interpreted by the updating program. They are useful when default formatting is not appropriate.

4.3 Software Architecture

The software architecture is illustrated in Figure 4.

The database along with access procedures provide a data access and computing service that isolates clients from the internal data model and performs all financial calculations. The other components of the architecture are thus relieved of any further computations on raw data.

We use the Sybase database management system which facilitates the implementation of such a service through the use of stored procedures and of the Tabular Data Stream protocol (TDS) for communications between the server and clients.

Graphics are created by a graphic generator upon request from the updating engine. It communicates with the data service and delivers graphic files in the CGM format. We use the Gsharp product from Uniras for this purpose.

The document updating engine is an SGML transformation program which communicates with the data service through an ODBC (Microsoft's Open Database Connectivity) interface and delivers SGML documents. This program is written in the Balise language [Balise 96] which integrates an SGML parser and SGML-oriented facilities. Besides its main document updating task, it controls and ensures the consistency of document links.

Two things should be noted:

A clear division of labor was established between database stored procedures and Balise scripts: all computational and data aggregation tasks are handled by the stored procedures layer, while all formatting tasks are handled by the Balise script layer.
The updating engine is not tied to one DTD. All is needed is that the DTD uses the link attribute pattern defined above.

The front-end, user-level tool, is FrameMaker+SGML. Although not a native SGML editor, it can be considered as such since it integrates an SGML-FrameMaker document converter and is actually an SGML-conformant authoring tool. FrameMaker+SGML is also a powerful composition system that we use for paper publishing.

HTML conversion for Web publishing is handled by a generic Balise program we have developed and which is parameterized by SGML stylesheets.

5. Solutions we did not choose...

In this business area of financial analysis, using SGML is not a very natural approach, and is certainly not mandated by any legal obligation or tradition: the only rationale for using an SGML-based solution is to prove it works better than others! The design choices described above were the result of thorough thinking and carefully balanced decisions. Several other solutions or implementation variants were actually considered and finally abandoned, for various reasons. Two of them deserve being mentioned, since they seem reasonable alternatives.

5.1 OLE automation

To fill our requirements involving active composite documents, some components of which are ``wired'' to a relational database, Microsoft's tool suite (MS-Word, MS-Graph and Excel) with OLE automation seems a likely candidate.

However, OLE offers an application integration framework much more than a document integration architecture. Several applications cooperate in an organized manner to fill parts of the screen display and to send pieces of a PostScript stream to the printer driver, but at no point in the process is there any global and consistent data structure, describing the whole document, which could be exported and post-processed: the association between components is ephemeral. There is no simple solution, for instance, to convert the resulting compound document to HTML for Web publishing: this would require that each application be individually able to export HTML data, which is not yet the case.

Another reason for not adopting this solution is that the correspondence between relational tables in the database and document tables can be quite complex, which would not be an easy thing to implement with a tool like Microsoft's Excel. In addition, data volumes should be taken into consideration. Some of the documents produced are quite large: a monthly report represents more than 150 pages and 20,000 distinct numerical values extracted from the database. For such documents at least, batch update in asynchronous mode is certainly preferable to interactive solutions.

Finally, architectural flexibility was an important and permanent concern in the design of this application. For instance, document initialization and updating processes could in principle be run as well on client PC workstations as on the Unix server, since Balise and its ODBC gateway are quite portable software components. It turns out that, for organizational reasons, we decided for now to run all updates on the server centrally, essentially to accommodate batch updates of large documents in an efficient way (such resource-hungry tasks are scheduled at night). This would have been hardly feasible using an OLE approach. However, it remains possible, in the future, to modify the process distribution architecture and to run update tasks for small documents on client workstations.

5.2 Direct connection from FM + SGML to the database

The FrameMaker Development toolkit (FDK) is a C-level API which allows, among other things, the integration of FrameMaker with external software.

A possible, ``FrameMaker-centric'' approach, which we did not follow, would have been to used the FDK to set-up an ODBC link with the database and update the document from within the editor. In such a setting, HTML generation would have to be done either from FrameMaker itself or after SGML export from FrameMaker.

The main reason for not adopting such as solution was, again, the need for architectural flexibility, and the desire to remain as independent as possible from individual software products (which is, after all, what SGML has been invented for). In our systems integration approach, software products fulfill generic roles, and care is taken not to use them beyond or outside of this role, even when some tools would allow this because of their sophistication.

6. Conclusion

This application describes an interesting use of SGML structures to manage information ``threads'' within a compound document merging several information sources.

Such an approach is potentially applicable to all business areas which use compound documents made of dynamic, database-originated data combined with more static, human-originated data. Industrial catalogs are such an example.

Another typical application area is the Web. Currently, most Web pages dynamically generated from databases are the result of a the ``data pushing'' process. Introducing SGML-driven or HTML-driven ``data pulling'' processes in Web server back-ends would open the door to more sophisticated applications, where multiple information sources (including several databases) could be merged seamlessly.

An interesting extension to the approach described here would be to allow for conditional structure generation, depending on an external context. For example, financial reports about insurance companies do not show exactly the same kind of data as reports about oil companies; however, the global structure of these reports is similar. In our system, reports about insurance and oil companies share the same DTD, but need different templates. To avoid such a partial redundancy, the technique of context-controlled documents, as described in [Pasquier 92], might be appropriate.

Acknowledgmenents

We wish to express our thanks to Aurélie Bridault and Sonia Lopez-Fuentetaja for their active contribution to the design and implementation phases of this project.

Bibliography

[Balise 96] Balise 3 Reference Manual. AIS S.A. 1996.

[HyTime 96] DeRose, Steven J. and Durand, David G. Making Hypermedia Work - A User's Guide to HyTime. Kluwer Academic Publishers. 1994.

[Pasquier 92] Bible: a system for Design and Management of Context-Controlled Documents. PODP'92, Proceeding of the first international conference on Principles of Document Processing, Washington, October 1992.

SGML Template Driven Database Extraction: A New Approach to Report Generation