[This local archive copy mirrored from the canonical site: http://www.sgmltech.com/papers/proto2prod.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
SGML information systems usually come into being in the form of small-scale prototype systems supporting a few users and a relatively small set of representative documents. After a successful proof-of-concept phase comes the time of production on a larger scale where the problems encountered are of a totally different nature from those uncovered during the prototyping phase.
This paper addresses scalability of SGML authoring and dissemination systems. An area highlighted is the need to have a set of detailed production procedures taking into account human as well as automated operations.
SGML information systems usually come into being in the form of small-scale prototype systems supporting a few users and a relatively small set of representative documents. After a successful proof-of-concept phase comes the time of production on a larger scale where the problems encountered while growing to a full-scale production system are of a totally different nature from those uncovered during the prototyping phase. For example there are the different and sometimes contradictory constraints of the authoring and dissemination systems, which often show up only in high-volume/high-update rate conditions.
This paper addresses scalability. Leaving aside the more obvious aspects of scalability, it highlights some issues that are not always considered when designing complex document management systems.
One aspect highlighted is the need for a set of detailed production procedures, which must be adhered to in order to avoid, among other potential problems, the cascading effects of incorrectly entered data. An analogy can be drawn with an industrial production system, where there is:
The handling of objects in a well-populated information base is also discussed: software systems must cope with anything from relatively small amounts of data to huge amounts of information of many kinds, together with its manipulation and management.
This section gives an overview, based on real-world examples, of typical SGML systems. Essential components are described, as well as components which, though not directly SGML-related, are essential parts of a successful solution. Different requirements for production systems and dissemination systems are highlighted.
This description will serve as a reference model which will be used throughout this paper to support the discussion.
By non-trivial we mean a system with the following characteristics:
While a single-user unilingual authoring environment can easily be created using an off-the-shelf SGML editor, storing the work in progress in operating system files, the situation gets more complex as soon as one of the above factors is added.
The components usually found in a production system include:
The repository ensures the robust, persistent storage of the SGML documents being produced. As a minimum it provides the necessary features for:
For a production environment, sophisticated search capabilities are somewhat less important than in a dissemination environment. Indeed, authors usually know what they want to do and which document they need to change.
In a multilingual environment the repository (or some associated component) should provide support for translators, such as replicating document structure changes automatically.
These document creation and modification environments usually take the form of an SGML editor, associated with content manipulation tools, such as a thesaurus, spelling and grammar checkers, and translation memories.
When user training costs are an issue, or when the turnover of users is high, it can sometimes be economically justified to develop an authoring environment by customizing already deployed tools (for example word processors) which are familiar to users. Training costs are thus reduced in exchange for a one-time customization effort.
In the latter case, conversion filters will be necessary between SGML and the user's authoring tool format.
It must be noted that a full SGML solution does not necessarily avoid the use of some sort of conversion filters.
The workflow subsystem is an important part of a multi-user production environment. Its purpose is to track the work performed by each person working on the system, to allow precise monitoring of the production process, and to support the decisions taken at each step of the procedure.
In some cases the workflow system can be used to dispatch the work automatically to users according to the result of the previous step.
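Automatic dispatching of this kind can be pictured as a small routing table. The following is only a minimal sketch: the step names, outcomes and roles are hypothetical and are not taken from any particular workflow product.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: (current step, outcome) -> (next step, role).
ROUTES = {
    ("authoring", "done"): ("review", "reviewer"),
    ("review", "accepted"): ("translation", "translator"),
    ("review", "rejected"): ("authoring", "author"),
    ("translation", "done"): ("proofreading", "proofreader"),
}


@dataclass
class WorkItem:
    doc_id: str
    step: str = "authoring"
    # Audit trail of (step, user, outcome), used to monitor production.
    history: list = field(default_factory=list)


def dispatch(item: WorkItem, user: str, outcome: str) -> str:
    """Record who did what, then move the item to its next step.

    Returns the role that should receive the item next.
    """
    key = (item.step, outcome)
    if key not in ROUTES:
        raise ValueError(f"no route for step={item.step!r}, outcome={outcome!r}")
    item.history.append((item.step, user, outcome))
    item.step, role = ROUTES[key]
    return role
```

The history list doubles as the tracking record: each entry states which user completed which step and with what result.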
The communication subsystem is often overlooked in a document management environment.
A few alternatives are often present:
The file sharing solution is very efficient on a LAN, while the file transfer or e-mail approach has to be set up when users are spread over a potentially slow WAN.
Components covered include:
In a dissemination system there is usually an environment where documents are validated and prepared for insertion in the database. This is typically a place where authors put documents that are ready to be delivered. After a quality assurance phase, these documents are periodically picked up by a batch system which loads them into the dissemination database and indexes them for faster retrieval.
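As an illustration of the load-and-index step, the sketch below builds a simple word index over a batch of approved documents. It is a toy model under stated assumptions: documents are plain strings keyed by a hypothetical identifier, and the class and method names are invented for this example. A real dissemination database would also index the SGML structure.

```python
import re
from collections import defaultdict


class DisseminationIndex:
    """Toy dissemination store with a word-level inverted index."""

    def __init__(self):
        self.documents = {}               # doc_id -> text
        self.postings = defaultdict(set)  # word -> ids of documents containing it

    @staticmethod
    def _words(text):
        return set(re.findall(r"\w+", text.lower()))

    def load_batch(self, approved):
        """Load a batch of {doc_id: text} that has passed quality assurance."""
        for doc_id, text in approved.items():
            # When a document is updated, drop its old index entries first.
            if doc_id in self.documents:
                for word in self._words(self.documents[doc_id]):
                    self.postings[word].discard(doc_id)
            self.documents[doc_id] = text
            for word in self._words(text):
                self.postings[word].add(doc_id)

    def search(self, *words):
        """Return the ids of the documents containing every given word."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()
```

Because indexing happens in the batch step, read-only users never wait for it. The cost is paid once per load rather than once per query.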
In a dissemination environment the repository should provide the following basic features:
Here the search capability must be emphasized, because the users requesting the documents usually do not know the structure of the information base; thus they need support from the system to find their way through the potentially vast amount of information.
Collaborative work support is less important, however, since most users only need read-only access, and the input environment only needs to lock documents for the duration of the update. This is somewhat different from the production environment, where users need to lock (parts of) documents while they work on them.
Conversion filters for dissemination have different requirements from those used in a production environment. For instance, an SGML to RTF conversion filter needs only to focus on presentation for dissemination, while in a production system the SGML to RTF conversion should focus on keeping the structure for later reconstruction of the SGML source.
This section discusses the scalability of SGML systems. The more obvious aspects are not discussed. These necessarily include:
While certainly not exhaustive, this paper focuses on a few specific scalability-related problems, which may not be self-evident at first glance.
In cases where the size of individual documents is potentially large, care must be taken to provide users with chunks of documents they can handle easily.
A first approach is to split the documents according to a predefined granularity. However, this is not always convenient because at different steps of the process a different granularity is necessary. For instance:
The simple examples above show that the granularity of documents must not be fixed once and for all.
When users are locking documents for long periods, it can quickly become a scalability problem because a coarse granularity can lead to users working on different parts waiting for each other, simply because the system needs to lock documents for one user at a time. It is thus important for the system to let users select precisely which part of the document they want to work on.
This feature, however, has an impact on the repository, which must let users not only access but also lock individual parts of the document, with a granularity that varies over time. The authoring systems must also be ready to receive fragments of documents.
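One way to picture locking with variable granularity is to address each fragment by its path in the document tree and to refuse a lock only when another user already holds an enclosing or an enclosed fragment. The sketch below assumes this model. The path tuples and the class name are hypothetical, and a production repository would add persistence, time-outs and transactions.

```python
class FragmentLocks:
    """Locks on document fragments addressed by a path such as
    ("manual", "chapter2", "section3")."""

    def __init__(self):
        self._owners = {}  # path tuple -> user holding the lock

    @staticmethod
    def _overlaps(a, b):
        # Two fragments overlap when one path is a prefix of the other.
        n = min(len(a), len(b))
        return a[:n] == b[:n]

    def acquire(self, user, path):
        """Try to lock a fragment; return True on success."""
        path = tuple(path)
        for held, owner in self._owners.items():
            if owner != user and self._overlaps(held, path):
                return False  # another user holds an overlapping fragment
        self._owners[path] = user
        return True

    def release(self, user, path):
        """Release a lock, but only if this user actually holds it."""
        path = tuple(path)
        if self._owners.get(path) == user:
            del self._owners[path]
```

With this scheme two authors can work on different chapters of the same document at the same time, while a translator who later asks for the whole document must wait until every enclosed lock has been released.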
Variable granularity also has an important impact on the workflow system.
Indeed, most workflow systems are tailored to work on individual objects that keep the same scope throughout the process. However, it has been seen that at different steps of a procedure it is desirable to have different granularity levels. It is very important that the workflow system is flexible enough to allow for those scope variations across time. For example authors could work on several parts of a document, each part taking a whole acceptance procedure. Then, when the time comes to translate the work of the authors, the workflow system must be able to acknowledge the fact that all the authored parts are translated by a single user operating on a larger document fragment.
Taking care of variable granularity allows a system to manage documents as large as several megabytes modified by tens of users. In principle, there is then no limit on document size.
As soon as conversion filters take part in the production process, some operations become potentially long. By long we mean here more than a few tens of seconds.
This is a perfect place for a bottleneck. The worst-case prototype architecture would be a linear system that processes requests one at a time and potentially makes each user wait until their request is complete.
An individual request may take a few minutes, which is acceptable for individual users requesting a few conversions. However, if requests are queued, when several users issue commands at the same time, the first will be happy while the last one may well be angry after waiting for half an hour or more.
Several improvements are necessary to support more than a few users.
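One possible improvement, sketched here under assumptions, is to accept requests immediately and to run several conversions in parallel. The convert function is a hypothetical stand-in for a real filter, and the figures in the usage note are illustrative rather than measured.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def convert(doc_id: str, seconds: float) -> str:
    """Stand-in for a long-running conversion filter (hypothetical)."""
    time.sleep(seconds)
    return f"{doc_id}.rtf"


def run_conversions(doc_ids, seconds, workers):
    """Submit all requests at once; return (results, elapsed seconds)."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() returns at once, so no user is blocked while queueing.
        futures = [pool.submit(convert, d, seconds) for d in doc_ids]
        results = [f.result() for f in futures]
    return results, time.monotonic() - start


def worst_case_wait(requests: int, minutes_each: float, workers: int) -> float:
    """Minutes until the last of `requests` simultaneous requests completes."""
    return math.ceil(requests / workers) * minutes_each
```

For instance, ten simultaneous requests of three minutes each give worst_case_wait(10, 3, 1) == 30 minutes on a linear system, against worst_case_wait(10, 3, 5) == 6 minutes with five workers. Threads are sufficient when each filter runs as an external program; a CPU-bound filter written in Python would need separate processes instead.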
While not an SGML scalability topic per se, the following may take some implementors by surprise.
When remote users need to be connected to a document management system, e-mail is often the first solution considered to let them receive documents and submit their input.
While this is a perfectly valid and elegant solution, it must be taken into account that an existing e-mail system, which works very well for day-to-day exchange of small messages between workers, might not support the load of an automated system exchanging potentially large documents.
As has been seen above, the requirements of a dissemination system are totally different from those of a production system. Separating the dissemination repository from the production repository not only avoids security problems (for example when external users cannot have access to work-in-progress), but it also allows you to choose which product is best for each task.
In this section we highlight the fact that a badly designed workflow procedure, perhaps copied from existing manual procedures, combined with new system constraints, can lead to delays and extra work.
Imagine the following simple manual procedure:
This procedure is relatively straightforward and works very well.
Now it is decided to automate this procedure with an SGML system. Among other improvements the system brings automated validation of the translators' work, ensuring they produce documents which have the same structure as the master linguistic version. This eases the work of the proofreaders, who can rely on the fact that the linguistic instances are coherent at the structural level, and can concentrate on the content.
If this system is put into operation while keeping the same work procedure, it will induce a subtle problem. Indeed, if an author introduces a small structural error (say, a list item is badly tagged), the translators will reproduce it in all the target languages. In the manual system this error may remain unnoticed, since it may not be visible using a presentation-oriented system. Translators would produce a correct list and the master language version would be fixed by translators just before publishing.
In the new system, however, the translators will produce the document in all languages with the error, since their only option is to produce a document which is coherent with the master version. If the proofreaders then detect the error, they will need to fix it in all languages!
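The effect can be reproduced with a small structural comparison. The sketch below is an assumption-laden illustration: it uses well-formed XML and Python's standard xml.etree.ElementTree parser as a stand-in for a validating SGML parser, and the element names are invented for the example.

```python
import xml.etree.ElementTree as ET


def skeleton(elem):
    """Return the element structure as nested tuples, ignoring all text."""
    return (elem.tag, tuple(skeleton(child) for child in elem))


def same_structure(master_src: str, translation_src: str) -> bool:
    """True when both documents have exactly the same element tree."""
    return skeleton(ET.fromstring(master_src)) == skeleton(ET.fromstring(translation_src))


# Master version with a small structural error: the second list item
# is tagged as a paragraph.
master = "<doc><list><item>One</item><para>Two</para></list></doc>"

# A translation that faithfully reproduces the error is accepted...
faithful_fr = "<doc><list><item>Un</item><para>Deux</para></list></doc>"
# ...while a translation with a correctly tagged list is rejected.
corrected_fr = "<doc><list><item>Un</item><item>Deux</item></list></doc>"
```

Here same_structure(master, faithful_fr) is true and same_structure(master, corrected_fr) is false: the check compares a translation with the master, not with what the author intended, so a translator cannot repair the error locally.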
This is typically a problem which is easily overlooked during a prototyping phase, where a few documents are tested without time constraints. However, in high-volume conditions, where deadlines are critical and the number of languages is large, it can give grey hairs to the proofreading and quality assurance teams.
So here is an adapted procedure, taking into account the new features of the system:
Quality assurance is a crucial part of a scalable system. Only good quality content will allow for smooth operation and future extensions of the system. While this may seem obvious at first glance, a few simple rules must be kept in mind while implementing the quality procedures, in order to avoid bottlenecks.
Quality assurance in document management systems can take two forms:
Both techniques have advantages and drawbacks.
While manual sampling is the most accurate it is time-consuming, costly, and cannot take place at any time in the procedure. For instance, manual sampling is not feasible on work in progress.
Automated validation is usually cheaper and can be exhaustive (that is, it may be applied to all documents). It is not feasible, however, to perform some kinds of content validation mechanically.
Both techniques are thus necessary to achieve good results.
To avoid bottlenecks, and avoid some financial surprises, it is necessary to evaluate precisely the time necessary for each task.
As described in the previous section, it is also important to place the quality assurance phase(s) at the right moment in the production process. This is especially true for automated validation tools which enforce some pre-defined rules. Indeed, some rules which must hold for the finished document do not necessarily hold during document production (for example linguistic coherency rules). In such cases, it is advisable that the validation tools only produce warnings for the rules which may not yet be applicable, rather than enforcing them.
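A validation tool of this kind can attach to each rule the phases in which a failure is blocking, and report failures in any other phase as warnings. The following is a minimal sketch only: the phase names, the rule names and the dictionary used to represent a document are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the document passes
    enforced_in: frozenset         # phases in which a failure is blocking


def validate(doc: dict, phase: str, rules):
    """Return (errors, warnings) for the given production phase."""
    errors, warnings = [], []
    for rule in rules:
        if rule.check(doc):
            continue
        # A failed rule blocks only in the phases where it is enforced.
        (errors if phase in rule.enforced_in else warnings).append(rule.name)
    return errors, warnings


RULES = [
    Rule("has-title", lambda d: bool(d.get("title")),
         frozenset({"authoring", "translation", "publication"})),
    # A linguistic coherency rule that is only meaningful for the finished document.
    Rule("all-languages-present", lambda d: d["required"] <= d["languages"],
         frozenset({"publication"})),
]

doc = {"title": "Guide", "languages": {"en"}, "required": {"en", "fr"}}
```

With this document, validate(doc, "translation", RULES) reports the missing language as a warning only, whereas validate(doc, "publication", RULES) reports it as an error.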
We have described the typical components of an SGML production system, and of an SGML-based document dissemination system.
Based on these typical system models, a few potential scalability problems have been described, not related to individual systems but to their integration in a complex production process. For each topic, hints have been given to technical solutions.
Those real-world examples confirm that a successful medium- to large-scale document production system must be designed and planned using techniques similar to those used for industrial production plants.
Please mail your comments to Stéphane Bidoul at firstname.lastname@example.org
This paper was first published in the Conference Proceedings of SGML'97 US, December 1997, pp 533-537.
Copyright © 1997 The SGML Technologies Group