[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/proto2prod.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

From Prototype to Production System

Managing the Growth

Authors

Stéphane Bidoul

Keywords

Scalability
Production Systems
Dissemination Systems
Workflow
Quality Assurance
Repositories
Collaborative Work
Conversions

Abstract

This paper addresses scalability of SGML authoring and dissemination systems. An area highlighted is the need to have a set of detailed production procedures taking into account human as well as automated operations.

Introduction

SGML information systems usually come into being as small-scale prototypes supporting a few users and a relatively small set of representative documents. After a successful proof-of-concept phase comes production on a larger scale, and the problems encountered while growing to a full-scale production system are of a totally different nature from those uncovered during prototyping. For example, the authoring and dissemination systems impose different and sometimes contradictory constraints, which often show up only under high-volume, high-update-rate conditions.

This paper addresses scalability. Setting aside the more obvious aspects of scalability, it highlights some issues that are not always considered when designing complex document management systems.

One aspect highlighted is the need for a set of detailed production procedures, which must be adhered to in order to avoid, among other potential problems, the cascading effects of incorrectly entered data. A useful analogy is an industrial production system, where there is:

software to take over work previously performed by humans;
closely defined workflow procedures;
the need for an information base;
the need for quality assurance.

The paper also discusses the handling of objects in a well-populated information base, where software systems deal with anything from relatively small amounts of data to huge amounts of information of many kinds, together with its manipulation and management.

Overview of a typical non-trivial SGML system

This section gives an overview, based on real-world examples, of typical SGML systems. Essential components are described, as well as components which, though not directly SGML-related, are essential parts of a successful solution. Different requirements for production systems and dissemination systems are highlighted.

This description will serve as a reference model which will be used throughout this paper to support the discussion.

By non-trivial is meant a system with the following characteristics:

multi-user,
multilingual,
distributed, where authors and translators are possibly spread across a WAN.

While a single-user unilingual authoring environment can easily be created using an off-the-shelf SGML editor, storing the work in progress in operating system files, the situation gets more complex as soon as one of the above factors is added.

Production system

The components usually found in a production system include:

the SGML repository,
the authoring and translation environments,
the workflow subsystem,
the communication subsystems,
the quality assurance environment.

The SGML repository

The repository ensures the robust, persistent storage of the SGML documents being produced. As a minimum it provides the necessary features for:

collaborative work including document locking and simultaneous work on different parts of the same document,
version control,
validation.

For a production environment, sophisticated search capabilities are somewhat less important than in a dissemination environment. Indeed, authors usually know what they want to do and which document they need to change.

In a multilingual environment the repository (or some associated component) should provide support for translators, such as replicating document structure changes automatically.

The authoring and translation environments

These document creation and modification environments usually take the form of an SGML editor, associated with content manipulation tools, such as a thesaurus, spelling and grammar checkers, and translation memories.

When user training costs are an issue, or when the turnover of users is high, it can sometimes be economically justified to develop an authoring environment by customizing already deployed tools (for example word processors) which are familiar to users. Training costs are thus reduced in exchange for a one-time customization effort.

In the latter case, conversion filters will be necessary between SGML and the user's authoring tool format.

It must be noted that a full SGML solution does not necessarily avoid the use of some sort of conversion filters.

It could be that the target DTD is not easy to handle for all authors; some constructs may be too complex to handle with an SGML editor (for example complex relationships between figures in a budget document).
In such a case a conversion step could transform the data to an intermediate format (for example tables) that is easier to visualize and edit, while the inverse filter would perform a final validation and conversion back to the internal format.
In some cases not all validation rules can be expressed in a DTD (for example for content-dependent validations). In such cases some additional validation filters are necessary.
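
As an illustration of a content-dependent rule that a DTD cannot express, the following minimal sketch checks that the items of a budget add up to its declared total. The element names (budget, item, total) are hypothetical, and the sample is written as well-formed XML only so that Python's standard parser can read it; a real filter would sit behind an SGML parser.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

def check_budget_totals(source):
    """Return one error message per budget whose items do not add up."""
    errors = []
    root = ET.fromstring(source)
    for budget in root.iter("budget"):
        items = sum((Decimal(i.text) for i in budget.findall("item")), Decimal(0))
        total = Decimal(budget.findtext("total"))
        if items != total:
            errors.append(
                f"budget {budget.get('id')}: items sum to {items}, total says {total}")
    return errors

# Hypothetical sample: b1 is consistent, b2 is not.
SAMPLE = """<doc>
<budget id="b1"><item>100.50</item><item>20</item><total>120.50</total></budget>
<budget id="b2"><item>10</item><item>5</item><total>16</total></budget>
</doc>"""

for message in check_budget_totals(SAMPLE):
    print(message)
```

Such a filter would typically run after DTD validation, so that it can assume the elements it inspects are present.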

The workflow subsystem

The workflow subsystem is an important part of a multi-user production environment. Its purpose is to track the work performed by each person working on the system, to allow precise monitoring of the production process, and to support the decisions taken at each step of the procedure.

In some cases the workflow system can be used to dispatch the work automatically to users according to the result of the previous step.
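
Automatic dispatching of this kind can be sketched as a transition table keyed on the current step and its outcome. The step names, outcomes and roles below are hypothetical and serve only to illustrate the idea; they are not taken from any particular workflow product.

```python
# Hypothetical table: (current step, outcome) -> (next step, role to notify).
TRANSITIONS = {
    ("authoring", "done"): ("review", "reviewer"),
    ("review", "accepted"): ("translation", "translator"),
    ("review", "rejected"): ("authoring", "author"),
    ("translation", "done"): ("proofreading", "proofreader"),
    ("proofreading", "accepted"): ("published", None),
    ("proofreading", "rejected"): ("translation", "translator"),
}

def dispatch(step, outcome, history):
    """Record the finished step and return (next_step, role)."""
    if (step, outcome) not in TRANSITIONS:
        raise ValueError(f"no transition from {step!r} on {outcome!r}")
    history.append((step, outcome))  # the history supports monitoring of the process
    return TRANSITIONS[(step, outcome)]

history = []
step, role = dispatch("authoring", "done", history)
step, role = dispatch(step, "rejected", history)
print(step, role)
print(history)
```

Keeping the rules in a table rather than in code makes it easier to adapt the procedure when the production process changes.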

The communication subsystem

The communication subsystem is often overlooked in a document management environment.

A few alternatives are often present:

a file sharing system,
a file transfer system,
an e-mail system.

The file sharing solution is very efficient on a LAN, while the file transfer or e-mail approach has to be set up when users are spread over a potentially slow WAN.

Dissemination system

Components covered include:

the documents' input environment,
the quality assurance environment,
the SGML repository,
the conversion filters.

The documents' input environment

In a dissemination system there is usually an environment where documents are validated and prepared for insertion in the database. This is typically a place where authors put documents that are ready to be delivered. After a quality assurance phase, these documents are periodically picked up by a batch system, which loads them into the dissemination database and indexes them for faster retrieval.
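
The indexing step of such a batch job can be pictured with a very small inverted index. This is only a sketch under simplifying assumptions: documents are plain strings keyed by a hypothetical identifier, and a real system would index the SGML structure as well as the text.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical batch of validated documents.
batch = {"d1": "Budget procedure for the year", "d2": "Translation procedure"}
index = build_index(batch)
print(sorted(search(index, "procedure")))
print(sorted(search(index, "budget procedure")))
```

Because the index is built in batch, readers only ever perform lookups, which suits the read-mostly load of a dissemination repository.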

The SGML repository

In a dissemination environment the repository should provide the following basic features:

support for a large number of concurrent read-only accesses,
support for a sophisticated search engine.

Here the search capability must be emphasized, because the users requesting the documents usually do not know the structure of the information base; thus they need support from the system to find their way through the potentially vast amount of information.

Collaborative work support is less important, however, since most users only need read-only access, and the input environment only needs to lock documents for the duration of the update. This is somewhat different from the production environment, where users need to lock (parts of) documents while they work on them.

The conversion filters

Conversion filters for dissemination have different requirements from those used in a production environment. For instance, an SGML to RTF conversion filter needs only to focus on presentation for dissemination, while in a production system the SGML to RTF conversion should focus on keeping the structure for later reconstruction of the SGML source.

Scalability

This section discusses the scalability of SGML systems. The more obvious aspects are not discussed. These necessarily include:

ensuring the repository can handle large documents and/or a large number of them,
ensuring the repository can handle the required number of users,
ensuring the editors can comfortably handle large and/or complex documents.

While certainly not exhaustive, this paper focuses on a few specific scalability-related problems, which may not be self-evident at first glance.

Allow users to work on parts of documents

In cases where the size of individual documents is potentially large, care must be taken to provide users with chunks of documents they can handle easily.

A first approach is to split the documents according to a predefined granularity. However, this is not always convenient because at different steps of the process a different granularity is necessary. For instance:

an author who knows precisely what to do may want to work on small parts of documents to gain speed;
later, translators may wish to receive larger chunks containing all the text to be translated;
during a proofreading phase it is convenient to have access to the full document to fix typos, which are spread all over the text.

The simple examples above show that the granularity of documents must not be fixed once and for all.

When users lock documents for long periods, a coarse granularity can quickly become a scalability problem: users working on different parts end up waiting for each other, simply because the system locks the whole document for one user at a time. It is thus important for the system to let users select precisely which part of the document they want to work on.

This feature, however, has an impact on the repository, which must let users access and also lock individual parts of the document, with a granularity that varies over time. The authoring systems must also be ready to receive fragments of documents.
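
One way to picture locking with variable granularity is to identify each fragment by its path from the document root, and to refuse a lock whenever the requested fragment contains, or is contained in, a fragment locked by another user. The class below is a minimal sketch under that assumption; the path representation and the names used are hypothetical.

```python
class FragmentLocks:
    """Locks on document fragments identified by their path from the root."""

    def __init__(self):
        self._owners = {}  # path (tuple of element ids) -> user holding the lock

    @staticmethod
    def _overlaps(a, b):
        # Two fragments overlap when one path is a prefix of the other,
        # that is, when one fragment contains the other or they are the same.
        shorter = min(len(a), len(b))
        return a[:shorter] == b[:shorter]

    def acquire(self, user, path):
        """Lock the fragment for the user; return False if another user blocks it."""
        path = tuple(path)
        for locked, owner in self._owners.items():
            if owner != user and self._overlaps(locked, path):
                return False
        self._owners[path] = user
        return True

    def release(self, user, path):
        """Release the lock if it is held by this user."""
        if self._owners.get(tuple(path)) == user:
            del self._owners[tuple(path)]

locks = FragmentLocks()
print(locks.acquire("author_a", ["doc1", "chapter2", "section1"]))  # a small part
print(locks.acquire("author_b", ["doc1", "chapter3"]))              # a disjoint part
print(locks.acquire("translator", ["doc1", "chapter2"]))            # contains a locked part
```

With whole-document locking the second request would already have been refused; here only the third one is, because it covers a section that is still being edited.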

Variable granularity also has an important impact on the workflow system.

Indeed, most workflow systems are tailored to work on individual objects that keep the same scope throughout the process. However, it has been seen that at different steps of a procedure it is desirable to have different granularity levels. It is very important that the workflow system is flexible enough to allow for those scope variations across time. For example authors could work on several parts of a document, each part taking a whole acceptance procedure. Then, when the time comes to translate the work of the authors, the workflow system must be able to acknowledge the fact that all the authored parts are translated by a single user operating on a larger document fragment.
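
The change of scope between authoring and translation can be sketched as follows: each authored part is tracked separately, and a translation task on a larger fragment becomes available only when every authored part inside that fragment has been accepted. The path tuples and status values are hypothetical.

```python
def parts_within(fragment, part_status):
    """Return the authored parts whose path lies inside the given fragment."""
    fragment = tuple(fragment)
    return [path for path in part_status if path[:len(fragment)] == fragment]

def ready_for_translation(fragment, part_status):
    """True when the fragment has authored parts and all of them are accepted."""
    parts = parts_within(fragment, part_status)
    return bool(parts) and all(part_status[p] == "accepted" for p in parts)

# Hypothetical state: each part went through its own acceptance procedure.
part_status = {
    ("doc1", "chapter1", "section1"): "accepted",
    ("doc1", "chapter1", "section2"): "accepted",
    ("doc1", "chapter2", "section1"): "in review",
}
print(ready_for_translation(("doc1", "chapter1"), part_status))
print(ready_for_translation(("doc1",), part_status))
```

The translator can thus receive chapter 1 as a single unit while chapter 2 is still in review, without the workflow system losing track of the individual parts.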

Taking care of variable granularity allows a system to manage documents as large as several megabytes modified by tens of users, and in principle it places no limit on document size.

Implement an asynchronous system

As soon as conversion filters take part in the production process, there are potentially long operations. By long we mean here more than a few tens of seconds.

This is a perfect place for a bottleneck. The worst-case prototype architecture would be a linear system, processing requests one at a time and potentially requiring each user to wait until the request is complete.

An individual request may take a few minutes, which is acceptable for individual users requesting a few conversions. However, if requests are queued, when several users issue commands at the same time, the first will be happy while the last one may well be angry after waiting for half an hour or more.
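
The effect of queuing can be checked with a small calculation. The figures are assumptions chosen to match the orders of magnitude above: ten users each submit a three-minute conversion at the same moment, and requests are served first come, first served by a given number of parallel workers.

```python
import heapq

def completion_times(durations, workers=1):
    """Return the finishing time of each request, in submission order,
    assuming all requests arrive at time 0 and are served in that order."""
    free_at = [0.0] * workers  # time at which each worker becomes available
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished

requests = [3.0] * 10  # minutes, hypothetical
print(completion_times(requests, workers=1)[-1])  # last user, strictly sequential
print(completion_times(requests, workers=4)[-1])  # last user, four parallel workers
```

With one worker the last user waits 30 minutes; with four workers the wait drops to 9 minutes, provided the server really has the capacity to run four conversions at once. No user waits less than the three minutes of a single conversion, whatever the number of workers.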

Several improvements are necessary to support more than a few users.

Since a large-scale production system is likely to have a lot of server-side processing, care must be taken that the server is able to handle the work of multiple concurrent users without producing bottlenecks. The only way to significantly reduce the elapsed time seen by one user is to improve the performance of the individual processes; improving the total throughput of the system is comparatively easier.
System architects used to traditional transactional systems often overlook this problem.
Document management processing often involves long procedures, which must be taken into account in early system design. This means careful control of the system load, to let multiple users get their job done as soon as possible while avoiding overloading the server. This also means that each long process must be designed to be reentrant, since a single non-reentrant process may reduce the total throughput of the system.
The point is that, because individual document processing operations take longer than those of traditional applications, we need to implement system monitoring modules even when the number of concurrent users is relatively small.
The user interface can also be improved by implementing a mailbox-like system, where the user posts requests and gets notified when the results are ready. This will free the user interface for other tasks, such as preparing other requests.
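
A mailbox-like interface combined with load control can be sketched with a bounded pool of workers. In the example below, convert is a hypothetical stand-in for a long conversion filter, and the Mailbox class and its method names are illustrative assumptions rather than part of any existing product.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def convert(document):
    """Stand-in for a long-running conversion filter (hypothetical)."""
    time.sleep(0.05)
    return document.upper()

class Mailbox:
    def __init__(self, max_workers=2):
        # max_workers caps the server load; extra requests wait in the pool's queue.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self.inbox = {}  # user -> list of finished results

    def post(self, user, document):
        """Queue a request and return immediately, leaving the user free."""
        future = self._pool.submit(convert, document)
        future.add_done_callback(lambda done: self._deliver(user, done.result()))
        return future

    def _deliver(self, user, result):
        # Runs when the conversion completes; this is the notification step.
        with self._lock:
            self.inbox.setdefault(user, []).append(result)

    def close(self):
        """Wait for all pending requests to finish."""
        self._pool.shutdown(wait=True)

box = Mailbox(max_workers=2)
for name in ("ann", "bob", "ann"):
    box.post(name, f"draft by {name}")
box.close()
print(box.inbox)
```

The pool size is the single knob that trades individual waiting time against server load, which is where a monitoring module would intervene.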

Watch the communication subsystems

While not an SGML scalability topic per se, the following may take some implementors by surprise.

When remote users need to be connected to a document management system, we often think of e-mail to let them receive documents and submit their input.

While this is a perfectly valid and elegant solution, it must be taken into account that an existing e-mail system, which works very well for day-to-day exchange of small messages between workers, might not support the load of an automated system exchanging potentially large documents.

Dissociate the production from the dissemination system

As has been seen above, the requirements of a dissemination system are totally different from those of a production system. Separating the dissemination repository from the production repository not only avoids security problems (for example when external users must not have access to work in progress), but also allows you to choose the product that is best for each task.

The importance of the workflow procedures

In this section we highlight the fact that a badly designed workflow procedure, perhaps copied from existing manual procedures, combined with new system constraints, can lead to delays and extra work.

Imagine the following simple manual procedure:

authors change a document in some master language,
translators apply changes to other linguistic versions,
proofreaders ensure that all languages are coherent, and fix typos,
the document is published.

This procedure is relatively straightforward and works very well.

Now it is decided to automate this procedure with an SGML system. Among other improvements the system brings automated validation of the translators' work, ensuring they produce documents which have the same structure as the master linguistic version. This eases the work of the proofreaders, who can rely on the fact that the linguistic instances are coherent at the structural level, and can concentrate on the content.
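
The structural check on the translators' work can be pictured as a comparison of element skeletons, ignoring all text. The sketch below uses well-formed XML and Python's standard parser as a stand-in for an SGML toolchain, and the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

def skeleton(element):
    """Nested (tag, children) tuples: the structure with all text removed."""
    return (element.tag, tuple(skeleton(child) for child in element))

def same_structure(master_source, translation_source):
    """True when both documents have exactly the same element structure."""
    return (skeleton(ET.fromstring(master_source))
            == skeleton(ET.fromstring(translation_source)))

MASTER = "<sec><title>Budget</title><list><item>One</item><item>Two</item></list></sec>"
FRENCH = "<sec><title>Budget</title><list><item>Un</item><item>Deux</item></list></sec>"
BROKEN = "<sec><title>Budget</title><list><item>Un</item></list><p>Deux</p></sec>"

print(same_structure(MASTER, FRENCH))
print(same_structure(MASTER, BROKEN))
```

A check of this kind accepts the first translation and rejects the second, in which a list item was turned into a paragraph. Note that it can only guarantee that a translation mirrors the master version, not that the master version itself is correctly tagged.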

If this system is put into operation while keeping the same work procedure, it will induce a subtle problem. Indeed, if an author introduces a small structural error (say, a list item is badly tagged), the translators will reproduce it in all the target languages. In the manual system this error may remain unnoticed, since it may not be visible using a presentation-oriented system. Translators would produce a correct list and the master language version would be fixed by translators just before publishing.

In the new system, however, the translators will produce the document in all languages with the error, since their only option is to produce a document which is coherent with the master version. If the proofreaders then detect the error, they will need to fix it in all languages!

This is typically a problem which is easily overlooked during a prototyping phase, where a few documents are tested without time constraints. In high-volume conditions, however, where deadlines are critical and the number of languages is large, it can cause grey hairs for the proofreading and quality assurance teams.

So here is an adapted procedure, taking into account the new features of the system:

authors change a document in some master language;
proofreaders validate and correct the master language version;
translators apply changes to other linguistic versions;
proofreaders ensure that all languages are coherent at the content level, and fix typos in the other linguistic versions only;
the document is published.
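
The difference between the two procedures can be expressed as a simple count. The sketch below is a deliberate simplification: it assumes that every structural error left in the master version is replicated in every target language, and that each correction costs the same. The figures are hypothetical.

```python
def corrections_needed(structural_errors, target_languages, proofread_master_first):
    """Count the corrections caused by structural errors in the master version."""
    if proofread_master_first:
        return structural_errors                        # fixed once, before translation
    return structural_errors * (1 + target_languages)   # master plus every translation

# Hypothetical case: 3 structural errors, 10 target languages.
print(corrections_needed(3, 10, proofread_master_first=False))
print(corrections_needed(3, 10, proofread_master_first=True))
```

Under these assumptions the original procedure requires 33 corrections and the adapted one only 3, and the gap grows linearly with the number of languages.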

Quality Assurance

Quality assurance is a crucial part of a scalable system. Only good quality content will allow for smooth operation and future extensions of the system. While this may seem obvious at first glance, a few simple rules must be kept in mind while implementing the quality procedures, in order to avoid bottlenecks.

Quality assurance in document management systems can take two forms, human and automated, the latter coming in two variants:

human validation by sampling;
automated validation, which produces a quality report;
automated validation, which enforces some predefined quality rules.

Both techniques have advantages and drawbacks.

While manual sampling is the most accurate, it is time-consuming and costly, and it cannot take place at just any point in the procedure. For instance, manual sampling is not feasible on work in progress.

Automated validation is usually cheaper and can be exhaustive (that is, it may be applied to all documents). It is not feasible, however, to perform some kinds of content validation mechanically.

Both techniques are thus necessary to achieve good results.

To avoid bottlenecks, and avoid some financial surprises, it is necessary to evaluate precisely the time necessary for each task.

As described in the previous section, it is also important to place the quality assurance phase(s) at the right moment in the production process. This is especially true for automated validation tools which enforce some predefined rules. Indeed, some rules which must hold for the finished document do not necessarily hold during document production (for example linguistic coherency rules). In such cases, it is advisable that the validation tools only produce warnings and do not enforce rules that are not yet applicable.
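
One possible way to organize this is to attach to each rule the set of phases in which it is enforced; in any other phase a failed rule only produces a warning. The rule names, phase names and document representation below are hypothetical.

```python
REQUIRED_LANGUAGES = {"en", "fr", "de"}  # hypothetical set of linguistic versions

RULES = [
    {"name": "title present",
     "check": lambda doc: bool(doc.get("title")),
     "enforce_in": {"authoring", "translation", "final"}},
    {"name": "all languages present",
     "check": lambda doc: REQUIRED_LANGUAGES <= doc["languages"],
     "enforce_in": {"final"}},
]

def validate(doc, phase, rules=RULES):
    """Return (accepted, messages); a failed rule blocks only where it is enforced."""
    accepted, messages = True, []
    for rule in rules:
        if rule["check"](doc):
            continue
        if phase in rule["enforce_in"]:
            accepted = False
            messages.append(f"ERROR: {rule['name']}")
        else:
            messages.append(f"WARNING: {rule['name']}")
    return accepted, messages

draft = {"title": "Budget", "languages": {"en", "fr"}}
print(validate(draft, "translation"))
print(validate(draft, "final"))
```

The same document is accepted with a warning while translation is still under way, and rejected once it is presented as final.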

Conclusion

We have described the typical components of an SGML production system, and of an SGML-based document dissemination system.

Based on these typical system models, a few potential scalability problems have been described, related not to individual systems but to their integration in a complex production process. For each topic, hints at technical solutions have been given.

A system with fixed granularity (where users can only work on whole documents) can give rise to scalability issues when the document size grows and multiple users need to work simultaneously to meet deadlines. This is solved by using an SGML repository which gives access to variable-granularity document fragments, but this in turn requires an adapted workflow system which is able to track parts of documents.
If one fails to take into account the long-running nature of document processing operations, the system can experience bottlenecks as soon as a few users work simultaneously. We have highlighted the need for a system-monitoring module, such as those found in high-load transactional systems.
Also highlighted has been the fact that a badly designed workflow procedure, perhaps copied from existing manual procedures, combined with new system constraints, can lead to delays.
The manual and automatic quality assurance procedures are critical parts of a production system. They need to be evaluated carefully, to define both their objectives and their cost. It is also necessary to integrate them at the right moment in the production process.

These real-world examples confirm that a successful medium- to large-scale document production system must be designed and planned using techniques similar to those used for industrial production plants.

Please mail your comments to Stéphane Bidoul at sbi@acse.be

This paper was first published in the Conference Proceedings of SGML'97 US, December 1997, pp 533-537.