[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/seibudsgml.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

The European Union's Budget

SGML Used to its Full Potential

Authors

Tom Catteau

Abstract

The editorial process of the Budget of the European Union provides a good example of a production environment that is entirely SGML-based, and meets severe constraints in terms of production time, quality, and costs.

As such, it illustrates the fact that SGML realizes its full potential when used as a means of manipulating structured documents. It also highlights certain aspects of SGML, usually considered as advanced, making their significance apparent through a concrete example of their use.

Introduction

For a newcomer, SGML is often considered to be yet another way of formatting documents. Part of the strength of SGML indeed resides in its capability to structure complex documents. However, the full potential of SGML is realized when it is used as a means to manipulate structured documents, where manipulation implies processing through a program or application.

Some aspects of SGML, although fundamental to the standard, are frequently regarded as advanced and not pertaining to an introduction to the subject. But their significance becomes apparent as soon as their use is illustrated in a concrete example.

The editorial process of the budget of the European Union is an annual, on-going process in which different players such as authors, translators, and editors all operate in a common environment to enter, translate, and correct data needed to produce the budget. The system, designed to fulfil requirements in terms of the timely delivery of high-quality documents, together with short production times, and hence minimized costs, is entirely SGML-based. It has evolved to a complete and mature production environment.

This paper gives an overview of the architecture of the system and describes the rationale behind the key technical choices that were made. It highlights certain aspects of SGML, such as concurrency and links, which are clearly explained by illustrating their use in the budget. It also outlines the principles on which the project's SGML modules are based, where these include conversions to and from popular document formats.

The need for reliability and stability is shown to have led to a client-server system in which SGML acts as the backbone of the modules which govern the production workflow. These modules communicate with each other through SGML-formatted messages.

The implementation has been made possible through the use of a fully-featured SGML parser and an associated application language that combine to make a powerful SGML engine.

Situation

The creation of the European Union's budget takes place in three phases, to which all the institutions that make up the EU's administration contribute. The three phases are:

the preliminary draft,
the draft,
the budget.

The institutions are:

the Commission,
the Parliament,
the Council,
the Court of Justice,
the Court of Auditors,
smaller institutions.

For the preliminary draft, each institution makes up its own budget. This phase takes place from February to June. That version is then revised by the Council to produce the draft budget. This takes place from July to August. Finally, that draft is revised by Parliament. This takes place from September to December.

Each version of the budget is published in the 11 official languages of the EU and consists of seven volumes. Each of these volumes can be considered a large SGML instance, on which many people work simultaneously in different ways, progressing from one version to another.

System architecture

The system used to produce the budget is called SEI-BUD, which stands for Système d'édition informatisé du Budget. The system has a client-server architecture. The server is physically located in Luxembourg. Clients are distributed over several places: Luxembourg, Brussels, and Nancy (France).

Server architecture

The server consists of two modules which run continuously in the background: a scheduler and a process-server.

The scheduler processes requests originating from clients. These requests are messages placed on a queue that serves as the scheduler's input. Each message belongs to a certain class of messages, and for each class a workflow has to be completed in order to process the request successfully. The workflow is managed by the scheduler. Processing a message results in the posting of a command onto the queue of the process-server.

The background process-server (BPSRV) manages the execution of all the commands which are put on its queue by the scheduler. The scheduler assigns a class and a subclass to each command. According to a configuration file, BPSRV limits the number of processes of the same subclass that can be executed at a time. It does the same at the level of the classes.

With this scheme, BPSRV:

manages the overall load of the machine,
ensures that processes can only get blocked by the resources they need,
optimizes the response time by allowing a reduced number of processes to execute rapidly rather than a large number executing slowly,
prohibits the concurrent execution of some processes, which might harm the integrity of the system in the multitasking environment,
implements technical constraints, for example when only one occurrence of a particular type of process can be executed at a time.

When a command to be executed is added to BPSRV's queue, it is accompanied by two callback messages. After the execution of the command, depending on the status of the result (success or failure), one of the two messages is sent back to the queue of the scheduler, together with contextual information given by the scheduler.

Since the number of processes being executed is limited, the input queue might not be empty, and in fact, under peaks of load, might be well filled.

The distinction between scheduler and process-server allows for stability and reliability. The system is stable because failure of a process (which might be caused by wrong inputs, bugs, or external causes) is caught by the system. The system is reliable because of its modular concept and because the principle on which it is based (asynchronous communication between scheduler and process-server through the use of input queues) is simple and consistently used throughout the system.
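The interplay between the two queues can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the SEI-BUD implementation: the names Command, ProcessServer and run_round are hypothetical, the model is single-threaded (one "round" stands for the set of processes that would be allowed to run concurrently), and a missing limit is assumed to default to 1.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Command:
    name: str
    cls: str                 # class assigned by the scheduler
    subclass: str            # subclass assigned by the scheduler
    run: Callable[[], bool]  # the process itself; True means success
    on_success: str          # callback message posted on success
    on_failure: str          # callback message posted on failure


class ProcessServer:
    """Single-threaded toy model of BPSRV (hypothetical, for illustration)."""

    def __init__(self, class_limits, subclass_limits, scheduler_queue):
        self.class_limits = class_limits
        self.subclass_limits = subclass_limits
        self.scheduler_queue = scheduler_queue
        self.queue = deque()

    def submit(self, command):
        self.queue.append(command)

    def _fits(self, command, admitted):
        # Count already admitted commands of the same class and subclass.
        same_class = sum(c.cls == command.cls for c in admitted)
        same_subclass = sum(c.subclass == command.subclass for c in admitted)
        return (same_class < self.class_limits.get(command.cls, 1)
                and same_subclass < self.subclass_limits.get(command.subclass, 1))

    def run_round(self):
        # Admit queued commands in order while the limits allow it;
        # the others stay on the queue for a later round.
        admitted, waiting = [], deque()
        for command in self.queue:
            (admitted if self._fits(command, admitted) else waiting).append(command)
        self.queue = waiting
        for command in admitted:
            try:
                ok = command.run()
            except Exception:
                ok = False  # a crashing process is caught and reported as a failure
            self.scheduler_queue.append(command.on_success if ok else command.on_failure)
        return [command.name for command in admitted]
```

With a class limit of 2 and a subclass limit of 1, two commands of the same subclass are never admitted in the same round, and a command that raises an exception still produces its failure callback instead of stopping the server.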

Clients

A client consists of two parts: a control station and an editing station.

The control station is the place where requests are sent to the server, and the server's answer analysed. The different requests are summarized below.

There are two kinds of editing stations: stations specialized in one type of editing, related to the nature of the project, and customized versions of popular text-editors.

Communication between client and server

The communication protocols used are FTP and X.400, but except for the lowest level the implementation is protocol-independent.

The typical workflow in a phase of the budget

For the creation of a new version of the budget, the previous version is taken as a starting point. Several steps have to be performed to come up with a new version. These steps are

authoring: authoring is done in the master language, French;
translating into all the other languages;
reviewing of each document in each language;
publishing.

In all of these steps the same operations can be performed:

request for nomenclature
consultation of an editorial object
reservation of an editorial object
editing of an editorial object
update of an editorial object
release of an editorial object
the printing of the differences of two versions of the same editorial object.

There are two levels of workflow. The first one is the system-wide workflow, which involves the sequence of authoring, translation, review, and publication. The second level is at the server where each user request triggers the start of an appropriate workflow. Both levels are managed at the scheduler, but differently.

The system-wide workflow is an enabling workflow. From status to status (authoring, translating, editing and publishing), it allows certain clients to work, but it does not do their work. They have to consult, reserve, and so on. Between the statuses of authoring, translating, editing, publishing, and again authoring, there are other statuses, which are END_AUTHOR, END_TRANSLATOR, END_CORRECTOR and END_PRINTER. To move to such a status, certain integrity checks are performed (for example the synoptism between the languages, that is structure comparison between all languages).
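As an illustration of such a gate, the following Python sketch models the cycle of statuses and a very simplified synoptism check. It rests on several assumptions: the names of the working statuses (AUTHOR, TRANSLATOR, CORRECTOR, PRINTER) are inferred from the END_ statuses and are hypothetical, the check is applied on entry to every END_ status, and "same structure" is reduced to "same sequence of tags".

```python
import re

# Working statuses alternate with the END_ statuses named in the text.
STATUSES = ["AUTHOR", "END_AUTHOR", "TRANSLATOR", "END_TRANSLATOR",
            "CORRECTOR", "END_CORRECTOR", "PRINTER", "END_PRINTER"]


def structure(text):
    """Return the sequence of tag openers, ignoring attributes and text."""
    return re.findall(r"</?[A-Za-z][A-Za-z0-9.]*", text)


def is_synoptic(versions):
    """True if all language versions share the same tag structure."""
    structures = [structure(text) for text in versions.values()]
    return all(s == structures[0] for s in structures)


def advance(status, versions):
    """Move to the next status; entering an END_ status requires synoptism."""
    nxt = STATUSES[(STATUSES.index(status) + 1) % len(STATUSES)]
    if nxt.startswith("END_") and not is_synoptic(versions):
        raise ValueError(f"cannot enter {nxt}: language versions differ in structure")
    return nxt
```

The function only enables or refuses the transition; as in the system described here, the actual work (consulting, reserving, editing) remains with the clients.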

Implementation

In this section, we discuss some aspects of the implementation of the system described above. We first describe the SGML engine, which serves as the basis for most of the modules developed for SEI-BUD. Then we take a look at DTDs, with the emphasis on process-orientation. Finally, we discuss some SGML modules which are used in SEI-BUD.

SGML Engine

The SGML engine used in this project, and in all applications at the SGML Technologies Group, is called SIT (SGML Integrated Toolkit). It consists of a fully-featured SGML parser combined with an application language.

Programs created with SIT are stand-alone applications that accept one type of SGML instance as input, determined by the SGML declaration and the DTD or DTDs used when the program is compiled. A SIT application module consists of several parts:

an optional SGML declaration,
one or more DTDs,
one or more link process definitions, which contain the application code.

The execution flow of a SIT application is tied to the DTD using LINKs. Even without prior knowledge of LINK, an example makes its use self-evident. Consider the following DTD for documents that contain paragraphs and lists.

<!DOCTYPE DOC[
<!ELEMENT DOC	- -	(P|LIST)*>
<!ELEMENT P	- -	(#PCDATA)>
<!ELEMENT LIST	- -	(P+)>
]>

If we want to associate with that DTD an application that processes instances of that DTD, we need the following skeleton:

<!LINKTYPE app DOC #IMPLIED [
<!LINK #INITIAL
	DOC 
	-- code for processing DOC starttag and endtag --
P
	-- code for processing P starttag, endtag, and data --
LIST
	-- code for processing LIST starttag and endtag --
>
]>

This is called a Link Process Definition (LPD) and it is the way provided by the standard to add processing capabilities to a parser.

Apart from a containing envelope, we see that there is one LINK. This link contains the code to be executed when certain tags are encountered in the input. #INITIAL informs the parser that this link should be used when starting to parse an input document. An #INITIAL link is required. Suppose now we want to process P within a list differently from the other Ps (for example, to insert bullets or to indent a P within a list). To do so, we create a new link, containing the execution code for a P within a LIST, and we inform the parser that within the context of a LIST, Ps are to be processed differently, by telling it to use the newly defined LINK when in a list.

<!LINKTYPE app DOC #IMPLIED [
<!LINK #INITIAL
DOC 
	-- code for processing DOC starttag and endtag --
P
	-- code for processing P starttag, endtag, and data --
LIST
	#USELINK ListLink
	-- code for processing LIST starttag and endtag --
>
<!LINK ListLink
P
	-- code for processing P within LIST --
>
]>

This example makes it clear that extensive use of LINKs lifts the burden of the implementation of the execution flow off the developer's shoulders. The developer can concentrate on the management of the execution flow through USELINK.
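For readers more familiar with general-purpose languages, the effect of #USELINK can be imitated in Python. The sketch below is an analogy only: it is not SIT code and does not use an SGML parser. The table LINK_SETS, the "prefix" rule and the function render are hypothetical, and the input is assumed to carry all its tags explicitly.

```python
import re

# Each link set maps an element name to a rule. A rule may switch to another
# link set for the content of that element, in the spirit of #USELINK.
LINK_SETS = {
    "#INITIAL": {
        "DOC": {},
        "P": {"prefix": ""},
        "LIST": {"uselink": "ListLink"},
    },
    "ListLink": {
        "P": {"prefix": "* "},  # a P inside a LIST gets a bullet
    },
}

TOKEN = re.compile(r"<(/?)([A-Z]+)>|([^<]+)")


def render(instance):
    out = []
    active = ["#INITIAL"]  # stack of link sets, one entry per open element
    prefixes = []          # prefix chosen for each open element
    for match in TOKEN.finditer(instance):
        closing, name, text = match.groups()
        if text is not None:
            if text.strip() and prefixes:
                out.append(prefixes[-1] + text.strip())
        elif closing:
            active.pop()
            prefixes.pop()
        else:
            # The rule is looked up in the link set that is current right now.
            rule = LINK_SETS[active[-1]].get(name, {})
            prefixes.append(rule.get("prefix", ""))
            active.append(rule.get("uselink", active[-1]))
    return out
```

The code that handles P never tests whether it is inside a LIST; the choice between the two rules is made entirely by the link set that is active when the starttag is met.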

So much for the coupling of a document instance to the application code. Up to now, everything follows the standard. The application language itself, which implements the application code, is not standardized. The application language used in SIT was developed in-house, with facilities to deal with common tasks in the processing of SGML instances.

Experience shows that this approach, where applications are inherently tied to the input documents they can handle, proves to be an excellent way of processing SGML instances in terms of flexibility and rapid application development.

Applications created with SIT are used in a wide range of areas. Some test whether semantic rules not expressed in the DTD are fulfilled. Others extract a table of contents or convert a document to another format. Still others, as is the case in SEI-BUD and as we shall see shortly, are used to drive a complete client-server system.

A processing-oriented DTD

A DTD can be designed from several points of view: from the point of view of presentation, from the point of view of document processing, or from no view at all (which is probably often the case). In an SGML-based system, the quality of a DTD corresponds to the ease with which its instances can be handled. We talk of processing-oriented DTDs. Processing-oriented means two things:

the DTD ideally should reflect the possibilities of the system,
the DTD should be designed to facilitate the processing of its instances.

Capabilities of the system

As stated above, the DTD ideally reflects the possibilities of the system. For example, it makes no sense to allow tables within tables in a DTD if the conversion filters cannot process them and the printer cannot print them. Of course, a program can be developed to manage more complex DTDs, but as the modules work in a chain, the DTD should reflect the capabilities of the overall system, hence the capability of the weakest link of the chain. This is a simple principle, but when it is strictly applied it prevents unnecessary development, and it confines and defines the area in which those applications are tested.

Processing of instances

A DTD should also be designed to facilitate processing. To illustrate that, two simple examples will be given.

First consider the following definition of a cell which can contain either plain #PCDATA or #PCDATA within a P:

<!ELEMENT CELL - - (#PCDATA|P)>
<!ELEMENT P - - (#PCDATA)>

In this case, all conversion filters must duplicate the code for processing the content of a cell. The following alternative is hardly better:

<!ELEMENT CELL - - (P?)>
<!ELEMENT P - - (#PCDATA)>

Here, the aim is to avoid empty Ps within a cell. But in this case each conversion to that DTD should check whether the cell content is empty in order to know whether or not to include tags for P. To avoid code duplication and unnecessary checks, the following definition is better:

<!ELEMENT CELL - - (P)>
<!ELEMENT P - - (#PCDATA)>

As another example, let us compare the two following constructions:

<P>Introduction to the list<LIST><ITEM></ITEM></LIST></P>

and

<P><LIST><INT.LI>Introduction to the list</INT.LI><ITEM></ITEM></LIST></P>

In the first construction we do not know we are in an introduction to a list until we encounter a LIST starttag, whereas the second solution leaves no doubt from the beginning. In other words, the second solution is more processing-oriented than the first.

These two simple examples make clear the fact that a DTD designed from a process-oriented point of view will significantly alleviate and streamline the work to be done in associated applications.
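The cost of the three CELL content models can be made tangible with a small Python sketch. It is an analogy under stated assumptions: the instances are written as well-formed XML so that the standard library parser can read them, and the three function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_mixed(cell):
    """CELL - - (#PCDATA|P): two code paths, one per alternative."""
    p = cell.find("P")
    return (p.text if p is not None else cell.text) or ""


def text_optional(cell):
    """CELL - - (P?): every consumer must test whether P is present."""
    p = cell.find("P")
    return "" if p is None else (p.text or "")


def text_required(cell):
    """CELL - - (P): a single path, since P is always there."""
    return cell.find("P").text or ""
```

Every filter that reads or writes cells has to repeat the extra branch of the first two variants, whereas the third variant needs none.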

Conversion filters

Conversion filters are used to convert an editorial object between different DTDs and to and from popular document formats. Document formats used in SEI-BUD are Word RTF, WordPerfect, and Interleaf ASCII. Whereas conversion filters from DTDs to presentation formats are not that hard to imagine, the converse is less obvious. Nevertheless, documents formatted in any format can be considered to be instances of a particular DTD. Let us explain this with Interleaf documents.

An Interleaf document contains a declaration part and a content part. The content consists of components and tables. The DTD for an Interleaf document thus looks like

#<!ELEMENT DOC	O O	(DECL,CONTENT)>
#<!ELEMENT DECL	O O	(#PCDATA)>
#<!ELEMENT CONTENT	O O	((COMP|TABLE)*)>

Of course, the level of detail used in the DTD depends on the information that one expects to be present in the document and that one wants to extract. In Interleaf, for example, and in the case of SEI-BUD, it is unnecessary to take graphics into account since they do not occur in budget documents.

Shortrefs

Once a DTD is defined, the markup in the document has to be coupled to starttags and endtags of that DTD. To do so, short references are used. Several steps have to be performed. First, the markup that is to be recognized is added as a short reference to the SGML declaration. Here we want to recognize the end of the declarations, the start of a component, and the start of a table. So we add to the SGML declaration in the SHORTREF section the strings

'<End Declaration>'
'<"'
'<!Table'

Then we link them to starttags and endtags in two steps: entity definitions and a shortref map.

#<!ENTITY edecl ENDTAG DECL>
#<!ENTITY scomp STARTTAG COMP>
#<!ENTITY stable STARTTAG TABLE>

#<!SHORTREF docmap 
	'<End Declaration>'	edecl
	'<"'			scomp
	'<!Table'		stable
>

This means that when docmap is active, the string

<End Declaration>

will be interpreted by the parser, via the entity edecl, as the endtag of the element DECL. To make docmap active for the complete instance, we map docmap to the element DOC using the following markup:

#<!USEMAP docmap	DOC>

This ensures that within the element DOC, the parser will recognize and interpret the strings as declared in the SHORTREF map docmap.

Character transformations are implemented using the same mechanism of SHORTREF and USEMAP.

As an extension to the standard, the SGML Integrated Toolkit (SIT) allows short references to be regular expressions. A broad range of markup can thus be expressed in a more concise way than it would be without regular expressions.
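The substitution that the short reference map brings about can be imitated with a regular expression in Python. This is a rough analogy only: a real SGML parser does more (it also infers the omitted starttags and endtags allowed by the O O minimization), the function expand_shortrefs is hypothetical, and the sample input in the usage note is invented rather than real Interleaf markup.

```python
import re

# Delimiter string -> tag it stands for, mirroring the docmap declaration.
SHORTREF_MAP = {
    "<End Declaration>": "</DECL>",
    '<"': "<COMP>",
    "<!Table": "<TABLE>",
}

# Longest delimiter first, so that a longer string is never shadowed
# by a shorter one that happens to be its prefix.
PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(SHORTREF_MAP, key=len, reverse=True))
)


def expand_shortrefs(text):
    """Replace every recognized delimiter string by its tag."""
    return PATTERN.sub(lambda match: SHORTREF_MAP[match.group(0)], text)
```

For example, expand_shortrefs('decls<End Declaration><"para text') yields 'decls</DECL><COMP>para text'.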

Although it is conceptually simple to consider any document format as a particular DTD, in practice many details make this a non-trivial task. The hardest part in converting a text-editing document to an SGML instance resides in the design of an appropriate DTD. The application itself is straightforward, and usually consists of rearranging elements and their content.

Tables

In addition to text and figures, the budget also contains tables. Tables are divided into table classes, each table class being implemented as a subdocument. This means that each table class has its own DTD and its own associated application code. In SIT, using the SUBDOC feature, a seamless integration of subdocuments and their applications with the base document and its application code is possible.

SGML declaration

The only changes made to the SGML declaration for conversion filters are those meant to prevent some characters, such as <, from being wrongly interpreted by the parser. Therefore the starttag opener (STAGO) is redefined to #<, the endtag opener (ETAGO) to #</, and the markup declaration opener (MDO) to #<!. This explains the #<! markup found above.

Printing of the differences

The printing of the differences occurs in two steps. First there is a non-SGML module that compares the two versions of the same document. The result of this module is the input to an SGML module which converts its input to an Interleaf ASCII document which visualizes the differences. Let us have a look at the intermediate result. The intermediate result is a document that combines the old and the new version of the document. If for example the first version shows

<P>First paragraph.</P><P>Second paragraph.</P>

and the second

<P>First paragraph. Second paragraph. Third <HT>highlighted</HT> paragraph.</P>

then the intermediate result might be

<(old)P><(new)P>First paragraph.<(old)ins> Second paragraph. Third <(new)HT>highlighted</(new)HT> paragraph.</(old)ins></(old)P></(new)P><(new)sup><(old)P>Second paragraph.</(old)P></(new)sup>

To understand this output, we should first notice that this document contains two concurrent grammars, old and new. The old grammar is the original one, to which the element

<!ELEMENT ins	- -	(#PCDATA)>

has been added. The new grammar is the original one, to which the element

<!ELEMENT sup	- -	(#PCDATA)>

has been added.

For each tag we say to which grammar it belongs by mentioning it between parentheses. When we suppress all the tags belonging to the new grammar, as well as the element ins (insertion) and its #PCDATA, we get

<(old)P>First paragraph.</(old)P><(old)P>Second paragraph.</(old)P>,

that is, the old document. Conversely, when we suppress all the tags pertaining to the old grammar, as well as the element sup (suppression), we get the new version of the document:

<(new)P>First paragraph. Second paragraph. Third <(new)HT>highlighted</(new)HT> paragraph.</(new)P>

The meaning of the ins and sup elements is now straightforward. ins indicates to the old grammar that what lies within its tags has been inserted, whereas sup indicates to the new grammar that what lies within its tags belongs to the deleted part of the old instance.

This document is processed as follows. In a concurrent grammar document, tags belong to one grammar, but the #PCDATA is common to all grammars. So for the common elements, one LPD (for example the one associated with new) treats the document, while the other LPD skips all processing. When the old grammar encounters an ins, it sets a flag that tells the new LPD that what follows is inserted. The flag is unset at the endtag of the ins. Between the two ins tags there are only new tags, so from the point of view of the old grammar ins contains only #PCDATA. The processing of that #PCDATA is skipped in the old LPD. The opposite happens when a sup element is encountered: a flag is set to inform the old LPD that it should treat what follows.

This application shows how natural the use of concurrent DTDs can be.
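The two suppression rules can be checked mechanically. The Python sketch below is not the SEI-BUD module (which relies on two cooperating LPDs); it is a plain string projection, written under the assumption that tags have the form <(grammar)NAME> and </(grammar)NAME> with no attributes. The function name project is hypothetical.

```python
import re

TAG = re.compile(r"<(/?)\((old|new)\)([A-Za-z]+)>")


def project(combined, keep, marker):
    """Keep the tags of grammar `keep`, drop the tags of the other grammar,
    and drop the `marker` element of `keep` together with its content."""
    out, pos, skipping = [], 0, False
    for match in TAG.finditer(combined):
        if not skipping:
            out.append(combined[pos:match.start()])  # shared #PCDATA
        closing, grammar, name = match.groups()
        if grammar == keep:
            if name == marker:
                skipping = not closing  # starttag begins skipping, endtag ends it
            elif not skipping:
                out.append(match.group(0))
        pos = match.end()
    if not skipping:
        out.append(combined[pos:])
    return "".join(out)


COMBINED = (
    "<(old)P><(new)P>First paragraph.<(old)ins> Second paragraph. Third "
    "<(new)HT>highlighted</(new)HT> paragraph.</(old)ins></(old)P></(new)P>"
    "<(new)sup><(old)P>Second paragraph.</(old)P></(new)sup>"
)
```

Calling project(COMBINED, "old", "ins") returns the old document, and project(COMBINED, "new", "sup") returns the new one, which is exactly the property the combined document is built to have.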

Scheduler

The scheduler is the server module that processes users' requests. It is completely implemented as an SGML module.

A first DTD is used to retrieve messages from the input queue.

<!DOCTYPE QUEUE[
<!ELEMENT	QUEUE	(READMSG*)>
<!ELEMENT READMSG	- - (#PCDATA)>
]>

The application code attached to the QUEUE starttag passes READMSG starttags and endtags to the parser in a loop. Each time the parser gets a READMSG starttag, it executes the associated application code, which retrieves the first message in the scheduler's input queue and passes it to the parser, where the application associated with that message is executed before the READMSG endtag is parsed.

The workflow associated with a request consists of states and events. A message is an event in the context of some state. A state represents either the initial state (in that case the event will be a request), or the execution of a process, the event being the result of the process, which is failure or success. The event E_PROC_OK is used to announce the success of the process PROC associated with the state S_PROC; E_PROC_NOK is used to announce failure of PROC.

The user sends his request in the form of an SGML-formatted message:

<(REQTYPE)S_REQ> <(REQTYPE)E_REQ ATT1="VAL1" ATTN="VALN"></(REQTYPE)S_REQ>

That message, being the input to the scheduler, will be parsed by the SGML parser and the appropriate code will be executed according to the LPD of the DTD REQTYPE. Typically, a process will be scheduled and sent to the process-server, together with the messages that have to be sent back to the queue of the scheduler in case of success and in case of failure. These messages will look like:

<(REQTYPE)S_PROC1><(REQTYPE)E_PROC1_OK ATT1="VAL1" ATTN="VALN"></(REQTYPE)S_PROC1>

(where ATT1 to ATTN are arbitrary attributes, and VAL1 to VALN their values) for success and

<(REQTYPE)S_PROC1> <(REQTYPE)E_PROC1_NOK ATT1="VAL1" ATTN="VALN"></(REQTYPE)S_PROC1>

for failure.

Typically the attributes contain information needed by the scheduler (username, language, etc.). While the process PROC1 is being executed, other messages from any type of request can be handled by the scheduler. Then when BPSRV puts one of the messages onto the scheduler's queue, the scheduler will process that message, first by restoring its context, and then taking the appropriate action.

To exemplify this, let us sketch the outline of the implementation of the graph for consultation of an editorial object. The tasks that have to be performed are:

extraction of the data from the repository,
conversion of the data into the appropriate format.

The execution of this graph will be triggered by a message, posted by a user. The format of the message might tions to be performed at the same time.

The definitions for the states are:

<!ELEMENT S_REQ	O O	((E_REQ,S_EXTRACT)?)>
<!ELEMENT S_EXTRACT	O O ((E_EXTRACT_OK,S_CONVERT)| (E_EXTRACT_NOK,S_ERR))?>
<!ELEMENT S_CONVERT	O O ((E_CONVERT_OK,S_END) | (E_CONVERT_NOK, S_ERR))?>
<!ELEMENT (S_ERR, S_END)	O O	(E_SEND)>

Finally, the definitions for the events:

<!ELEMENT E_REQ	- O EMPTY>
<!ELEMENT E_EXTRACT_OK	- O EMPTY>
<!ELEMENT E_EXTRACT_NOK	- O EMPTY>
<!ELEMENT E_CONVERT_OK	- O EMPTY>
<!ELEMENT E_CONVERT_NOK	- O EMPTY>
<!ATTLIST (E_REQ, E_EXTRACT_OK, E_EXTRACT_NOK, E_CONVERT_OK, E_CONVERT_NOK)
	USER	CDATA #REQUIRED
	LG	CDATA #REQUIRED
	ID	CDATA #REQUIRED
	FORMAT	CDATA #REQUIRED>

The skeleton for the link process definition for this graph will be:

<!LINKTYPE lconsult CONSULT #IMPLIED [
<!LINK #INITIAL
S_REQ #USELINK S_DEM_LINK
S_EXTRACT #USELINK S_EXTRACT_LINK
S_CONVERT #USELINK S_CONVERT_LINK
S_ERR #USELINK S_ERR_LINK
S_END #USELINK S_END_LINK
>
<!LINK S_DEM_LINK
E_REQ
	-- application code for event E_REQ --
>
<!LINK S_EXTRACT_LINK
E_EXTRACT_OK
	-- application code for event E_EXTRACT_OK --
E_EXTRACT_NOK
	-- application code for event E_EXTRACT_NOK --
>
<!LINK S_CONVERT_LINK
E_CONVERT_OK
	-- application code for event E_CONVERT_OK --
E_CONVERT_NOK
	-- application code for event E_CONVERT_NOK --
>
<!LINK S_END_LINK
E_SEND
	-- application code for event E_SEND --
>
<!LINK S_ERR_LINK
E_SEND
	-- application code for event E_SEND --
>
]>

Summarizing, each type of request will be implemented using one DTD and an associated application. There will be as many DTDs as there are types of requests. These are implemented as concurrent DTDs each with their associated LPD. Since messages do not contain #PCDATA, the different messages will not interfere with each other.

The use of concurrent DTDs for the different types of requests ensures a strict separation of each request (each DTD has its own application code), while permitting different requests to be put in any order, since the parser will direct every message (which is an element) to one LPD according to the DTD to which the message belongs.
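The way a message is routed by its DTD name and validated against the state and event definitions can be sketched in Python. The sketch is an approximation under stated assumptions: a regular expression stands in for the SGML parser, the transition table is transcribed by hand from the S_REQ, S_EXTRACT and S_CONVERT declarations of the consultation graph, and the names WORKFLOWS and handle as well as the attribute values in the usage note are hypothetical.

```python
import re

# One transition table per request type (per DTD). The keys are
# (state, event) pairs allowed by the content models of the states.
WORKFLOWS = {
    "CONSULT": {
        ("S_REQ", "E_REQ"): "S_EXTRACT",
        ("S_EXTRACT", "E_EXTRACT_OK"): "S_CONVERT",
        ("S_EXTRACT", "E_EXTRACT_NOK"): "S_ERR",
        ("S_CONVERT", "E_CONVERT_OK"): "S_END",
        ("S_CONVERT", "E_CONVERT_NOK"): "S_ERR",
    },
}

MESSAGE = re.compile(
    r'<\((?P<doctype>\w+)\)(?P<state>\w+)>\s*'
    r'<\((?P=doctype)\)(?P<event>\w+)(?P<attrs>[^>]*)>\s*'
    r'</\((?P=doctype)\)(?P=state)>'
)
ATTR = re.compile(r'(\w+)="([^"]*)"')


def handle(message):
    """Return (request type, next state, context) for one queued message."""
    match = MESSAGE.fullmatch(message.strip())
    if match is None:
        raise ValueError("not a well-formed scheduler message")
    transitions = WORKFLOWS.get(match["doctype"], {})
    next_state = transitions.get((match["state"], match["event"]))
    if next_state is None:
        raise ValueError("event not allowed in this state")
    # The attributes carry the context that the scheduler restores.
    return match["doctype"], next_state, dict(ATTR.findall(match["attrs"]))
```

Given the message <(CONSULT)S_EXTRACT><(CONSULT)E_EXTRACT_OK USER="tct" LG="FR" ID="B1" FORMAT="RTF"></(CONSULT)S_EXTRACT>, handle selects the CONSULT table, moves to S_CONVERT and returns the four attributes as context. Adding a request type amounts to adding one more table, which mirrors the one-DTD-per-request design.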

In practice, the DTD, the skeleton of the associated LPD, and even the messages associated with the commands can be generated automatically from a formal high-level description of the graph. This is beyond the scope of this paper.

As a conclusion, we might say that the scheduler represents an SGML module where the input document is a sequence of messages created in real time, whose purpose is to drive the server. Using the mechanism of concurrent grammars for each type of request, the scheduler and BPSRV implement a truly multitasking environment, that is entirely SGML-based. Benefits of using SGML as the backbone of the system include:

the richness that messages can have, which boils down to the richness SGML instances can have,
the inherent multitasking capabilities of concurrent grammars,
all that being directed by a proven and fully-featured SGML engine.

Repository - document storage model

The document is stored in an Oracle database. The budget contains 11 languages and 7 volumes. One table is used per language and per volume.

In the budget's DTD, every element of the nomenclature is assigned a required ID attribute. Each editorial object has its root ID. To store the structure associated with an ID, at the level of that node, each of its subnodes is stored with its ID in a recursive way. Recursion ends at the level of the granularity of the DTD. As soon as there is no ID in the substructure of an element, that element is considered to be a leaf in the tree, and its content is stored as a whole, as one string in the database.

This scheme allows us to implement efficiently incremental updates and retrievals of parts of a document. It also allows a locking mechanism up to the level of granularity of the DTD, and all features of relational databases (indexes and recovery mechanisms) can be used to enhance the system's performance.
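The recursive decomposition can be illustrated with a short Python sketch. Several assumptions apply: a dictionary stands in for the database tables, the instance is written as well-formed XML so that the standard library parser can read it, every child of a non-leaf node is assumed to carry an ID, and the element names in the usage note (TITLE, CHAPTER) are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_id_below(elem):
    """True if some descendant of elem carries an ID attribute."""
    return any("ID" in d.attrib for d in elem.iter() if d is not elem)


def store(elem, rows, parent=None):
    """Store elem under its ID; recurse while IDs occur in the substructure."""
    node_id = elem.attrib["ID"]
    if has_id_below(elem):
        children = [child for child in elem if "ID" in child.attrib]
        rows[node_id] = {"parent": parent, "tag": elem.tag, "content": None,
                         "children": [child.attrib["ID"] for child in children]}
        for child in children:
            store(child, rows, node_id)
    else:
        # Leaf at the granularity of the DTD: keep the content as one string.
        content = (elem.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in elem)
        rows[node_id] = {"parent": parent, "tag": elem.tag,
                         "content": content, "children": []}
    return rows


def retrieve(rows, node_id):
    """Rebuild the markup of the subtree rooted at node_id."""
    row = rows[node_id]
    if row["content"] is not None:
        inner = row["content"]
    else:
        inner = "".join(retrieve(rows, child) for child in row["children"])
    return f'<{row["tag"]} ID="{node_id}">{inner}</{row["tag"]}>'
```

Because each ID owns one row, a single chapter can be retrieved, locked or replaced without touching its siblings, which is the property that incremental updates rely on.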

Versioning

When an update is sent to the repository, only the updates are actually stored, and the version number of the document is incremented. The history of modification and creation is kept on disk and any version can be consulted on-line at any time.
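One possible reading of this scheme is a delta store, sketched below in Python. It is an assumption-laden illustration rather than a description of the actual repository: the class VersionedStore is hypothetical, versions are numbered from 1, a document is modelled as a mapping from ID to content, and deletions are not handled.

```python
class VersionedStore:
    """Keeps the first version in full and only the changes afterwards."""

    def __init__(self, initial):
        self.deltas = [dict(initial)]

    @property
    def version(self):
        return len(self.deltas)

    def snapshot(self, version):
        """Rebuild the state of any version by replaying the deltas."""
        state = {}
        for delta in self.deltas[:version]:
            state.update(delta)
        return state

    def update(self, new_state):
        """Store only the entries that differ from the current version."""
        current = self.snapshot(self.version)
        delta = {key: value for key, value in new_state.items()
                 if current.get(key) != value}
        self.deltas.append(delta)
        return self.version
```

Each call to update increments the version number while storing only the changed entries, and snapshot can rebuild any earlier version on request.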

Conclusion

We have described a complex SGML-based client-server system that is used for the creation and maintenance of the European Union's budget, a huge 11-language document, revised three times a year.

We have shown that an SGML system can be much more than a system with SGML instances as input and output. We have described and illustrated how SGML is used in every aspect of the system, ranging from the server modules, through SGML-based processing modules, to an SGML-formatted messaging scheme between clients and server.

Finally, we have outlined how at the very heart of the system presented here is SIT, the SGML Technologies Group's fully-featured SGML parser and integrated application language.

Please mail your comments to Tom Catteau at tct@acse.be

This paper was first published in the Conference Proceedings of SGML'97 US, December 1998, pp 645-653.