[This local archive copy is from the official and canonical URL, http://www.sgmltech.com/papers/sbi1198.htm; please refer to the canonical source document if possible.]

A Transactional Approach to SGML Storage:

Why You Should Ask More From Your Repository

Author

Stéphane Bidoul

Keywords

SGML
XML
Repository
Database
Schema
Transaction

Abstract

Most SGML (Standard Generalized Markup Language) repositories are heavily oriented towards document storage. Because of this, there is a tendency to have an interface that is based on a check-out/check-in mechanism of documents or parts of documents. Such an interface is very well adapted to the way in which humans work when interacting with the repository. However, when SGML is considered as a data modelling language, and the stored data gets more complex, the document-oriented check-out/check-in approach becomes inappropriate as a data manipulation language.

In this paper the benefits of a transaction-based interface to an SGML database are presented, along the lines of the update capabilities of traditional databases. Several real-world applications of this mechanism are described. An interface of this type is then presented, and it is shown why this is a very flexible way to access any SGML database, including document-oriented information bases.

Biographical Note

Stéphane Bidoul is a project manager and has been working at ACSE sa/nv (a member of the SGML Technologies Group) since 1993 as a developer and systems architect for object-oriented distributed applications and complex documentary workflow automation systems (automation of the editorial process for the European Community budget, automation of the legislative procedures for the Belgian French Community Parliament, etc). All these applications have in common their use of SGML, either as a document storage and exchange medium, or as a formal message specification tool for communications between distributed application processes. He obtained a degree, specializing in electromechanical engineering, from the Free University of Brussels in 1992; he may be contacted at sbi@sgmltech.com.

Introduction

Information repositories are an important component of most information systems. These repositories take the form of a database (relational, object-oriented, or other) in traditional applications. When the information to be manipulated by the system is perceived as being 'documents', the choice goes to document repositories, which have very different characteristics. Most notably, document repositories have support for versioning, but, compared to traditional databases, they lack fine-grained access facilities to parts of the information (the granularity is usually the document or some kind of document fragment).

For today's businesses, which are increasingly information driven, the information stored in documents becomes as important as information stored in traditional corporate databases. However, this information is often inaccessible to the knowledge worker, because it is not available on-line, and when available, it is usually in a mostly unstructured format, unsuitable for precise automated queries and processing. Indeed, where in corporate databases it is possible to formulate queries and manipulate virtually any single item of data, the documents are often handled as an indivisible unit of information, except for simple meta-data (title, author, ...).

When SGML is applied to document management systems, unfortunately it is often seen only as a standard 'data format'. This view only addresses the problem of protection of data against tool changes; it does not add much in the way of semantics to the information.

This paper emphasizes the need to design document-oriented information systems much like traditional database-oriented applications, with thorough process and data analysis and modelling. To support this, it is held that the functionalities of SGML repositories must match more closely the capabilities of database management systems. Techniques to achieve this goal are presented and real-world applications developed by our Group using such a system illustrate the benefits.

The SGML/DBMS (Database Management System) Analogy

In many respects SGML concepts can be compared with database concepts. The most interesting analogy is the one whereby the SGML DTD (Document Type Definition) is considered to be a data modelling language. In database parlance, the DTD is equivalent to the combination of the schema and integrity constraints.

Consider a trivial example of a contact database, where each record holds a name, an e-mail address, and (optionally) a telephone number. In a database, a DDL (Data Definition Language) is used to define the schema of the database. In SQL (Structured Query Language), for instance, the DDL statement for the example could be the following:


create table CONTACTS
(
  ID    number(10)   not null unique,
  NAME  char(256)    not null,
  EMAIL char(64),
  PHONE char(24)
);

The structure of the data, together with a few constraints, is expressed in the DDL: the identifier must be present and unique, and the name must be present.

A corresponding SGML DTD could look like this:


<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS - - (CONTACT)*>
<!ELEMENT CONTACT - - (NAME,EMAIL?,PHONE?)>
<!ATTLIST CONTACT
          ID    ID    #REQUIRED
>
<!ELEMENT (NAME|EMAIL|PHONE) - - (#PCDATA)>
]>

Here, the DTD expresses roughly the same structure and constraints as the SQL table creation statement above.

This analogy of DTD versus schema plus constraints is now accepted by many people, and more and more applications are using SGML and/or XML (eXtensible Markup Language) as a general data modelling and representation tool, in addition to more traditional document structuring.

Extending the Analogy

At the repository level, most SGML database systems are heavily oriented towards document management. The interface they present to users and programmers is based on a check-out/check-in paradigm. To change the content of the database, a document or a fragment of a document must be extracted, changed, and put back into the system. This process is well adapted to the way in which humans work when interacting with the system. However, application programs could benefit from a more flexible interface.

Taking the contact database example, it is evident that the SQL database provides many features to query, insert, delete, and update individual contacts and/or contact data in a precise way. For instance the following SQL statement:

update CONTACTS set EMAIL='sbi@sgmltech.com' where ID=1;

would update the EMAIL field of the contact record with ID 1. The following SQL statement:

update CONTACTS set NAME=null where ID=1;

would fail, however, because it would break the constraint ensuring that the NAME field always has a non-null value.
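The behaviour of the two statements above can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The column types are SQLite approximations of the types in the DDL shown earlier, and the inserted name is a placeholder; neither comes from the original system.

```python
import sqlite3

# In-memory database; integer/text stand in for number(10)/char(n).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table CONTACTS (
      ID    integer not null unique,
      NAME  text    not null,
      EMAIL text,
      PHONE text
    )
    """
)
# Placeholder record so that there is something to update.
conn.execute("insert into CONTACTS (ID, NAME) values (1, 'Sample Name')")

# Precise update of a single field of a single record.
conn.execute("update CONTACTS set EMAIL = 'sbi@sgmltech.com' where ID = 1")

# An update that breaks the NOT NULL constraint is rejected.
try:
    conn.execute("update CONTACTS set NAME = null where ID = 1")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row = conn.execute("select NAME, EMAIL from CONTACTS where ID = 1").fetchone()
```

The point of the sketch is that the database, not the application, refuses the second statement, and that the first statement touches one field only.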

If the SGML equivalent of the contacts is stored in an SGML repository, most systems provide only a less flexible check-out/check-in approach which is not very well suited to the creation of a contact database management application.

Thus, continuing the analogy, it could be said that current SGML databases have a reasonably good DDL (Data Definition Language), namely the DTDs, but a very poor DML (Data Manipulation Language).

While this example is trivial and the data probably not suitable for storing in an SGML system, the last section of this paper (Real-world applications) shows cases where complex SGML production systems benefit greatly from a true SGML database providing both a sophisticated DML and a check-out/check-in interface.

In the next section, approaches are presented that allow a real DML to be created for SGML databases.

Proposed Features

Four basic concepts of an SGML database are discussed in this section:

addressing techniques needed to identify the content to be manipulated;
elementary data manipulation operations;
validation services;
versioning services.

By way of summary, the interfaces of the system are then described.

Addressing

A DML needs ways to address the data that is to be manipulated. In SQL this functionality is provided by the 'where clauses'. Two broad categories of addressing are needed:

content addressing, used to identify content to be read, deleted, or updated;
position addressing, used to identify positions in the tree where new content must be inserted.

There are many possible location addressing techniques that can be used, the TEI (Text Encoding Initiative) extended pointers being an example. The HyTime location module also proposes very general addressing techniques.

Addressing techniques are based on the parse tree. At a minimum there is the need to address the tree nodes, for instance, through a combination of their ID attributes and a relative address (à la treeloc). It must also be possible to address data chunks (between element nodes). Position addresses can be expressed relative to tree nodes.
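The combination of an ID attribute and a relative address can be sketched as follows. This is an illustration only, not the addressing code of the system described here: it uses Python's xml.etree.ElementTree on a well-formed XML fragment, the function names find_by_id and resolve are hypothetical, and it assumes the treeloc convention in which the first step is always 1 (the element carrying the ID) and each later step is a 1-based child index.

```python
import xml.etree.ElementTree as ET

DOC = """<SECTION ID="SEC1">
  <TITLE>The section title</TITLE>
  <COMMENTS ID="SEC1-C">
    <p>Some text</p>
    <p>Some more text</p>
  </COMMENTS>
</SECTION>"""


def find_by_id(root, id_value):
    """Return the first element whose ID attribute equals id_value."""
    for elem in root.iter():
        if elem.get("ID") == id_value:
            return elem
    raise KeyError(id_value)


def resolve(root, root_id, treeloc="1"):
    """Resolve an (ID, treeloc) pair to an element node.

    Assumed convention: the first step is 1 and denotes the element
    carrying the ID; every later step is a 1-based child index.
    """
    steps = [int(s) for s in treeloc.split()]
    if not steps or steps[0] != 1:
        raise ValueError("treeloc must start with 1")
    node = find_by_id(root, root_id)
    for step in steps[1:]:
        children = list(node)
        if not 1 <= step <= len(children):
            raise IndexError(f"no child number {step} under <{node.tag}>")
        node = children[step - 1]
    return node


tree = ET.fromstring(DOC)
second_para = resolve(tree, "SEC1-C", "1 2")  # second child of COMMENTS
```

A position address for insertion could be expressed in the same way, as a resolved node plus a "before", "after", or "inside" qualifier; that extension is not shown.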

Data Manipulation Operations

Data manipulation operations include:

inserting content at a given position address;
deleting content (data content or whole nodes);
updating content (data content or whole nodes).

The type of API (Application Programming Interface) that can be provided to execute these operations depends in part on the validation services requested from the repository. This is the subject of the next section.

Validation

It goes without saying that it is important that the data be kept valid against the corresponding DTD, as the DTD is considered to be the schema of the database. Two approaches are possible to achieve this:

validate at all times, during each elementary operation;
validate on transaction boundaries.

Each method has its strengths and weaknesses. The first allows the data to be parsed when inserted into the repository, the content in the repository remaining valid at all times. It is therefore possible to restore the context in the parser and build the parse tree as the data is being inserted in the repository. Because parsing is allowed, it is possible to have a full SGML repository. It is also similar to the way in which relational databases work, ensuring that the integrity constraints are valid at all times.

The second method does not permit the parsing of SGML input since this operation generally requires a valid context, which is not necessarily always available. Parsing XML is permitted, however, provided the well-formedness is preserved. It is thus possible to provide an API to manipulate the tree and validate it against the DTD at the request of the client application, and on transaction boundaries. One such API could be the DOM (Document Object Model), currently under development at the W3C (World Wide Web Consortium).

Both techniques have advantages and disadvantages, and are thus useful in different applications. Without going into too much detail, in general it could be said that the first is well suited to machine processing, while the second is more adapted to interactive manipulation of the repository content (with an SGML editor, for instance).
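The difference between the two approaches can be illustrated with a small sketch. It is a simplification under stated assumptions: the content models of the contacts DTD are hand-translated into regular expressions over child element names, a transaction is a list of Python functions applied to a copy of the tree, and the names is_valid and run_transaction are hypothetical. No SGML parsing is involved.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Content models as regular expressions over child element names, a
# simplified stand-in for (CONTACT)*, (NAME,EMAIL?,PHONE?) and (#PCDATA).
CONTENT_MODELS = {
    "CONTACTS": r"(CONTACT )*",
    "CONTACT": r"NAME (EMAIL )?(PHONE )?",
    "NAME": r"",
    "EMAIL": r"",
    "PHONE": r"",
}


def is_valid(root):
    """Check every element's children against its content model."""
    for elem in root.iter():
        model = CONTENT_MODELS.get(elem.tag)
        if model is None:
            return False  # undeclared element type
        children = "".join(child.tag + " " for child in elem)
        if not re.fullmatch(model, children):
            return False
    return True


def run_transaction(root, operations, validate_each_step=False):
    """Apply operations to a copy; return (tree, committed)."""
    working = copy.deepcopy(root)
    for op in operations:
        op(working)
        if validate_each_step and not is_valid(working):
            return root, False  # rejected in the middle of the transaction
    if not is_valid(working):
        return root, False  # rejected at the transaction boundary
    return working, True


def remove_name(root):
    contact = root.find("CONTACT")
    contact.remove(contact.find("NAME"))


def insert_name(root):
    name = ET.Element("NAME")
    name.text = "New name"
    root.find("CONTACT").insert(0, name)


db = ET.fromstring(
    "<CONTACTS><CONTACT ID='C1'><NAME>Old name</NAME>"
    "<EMAIL>sbi@sgmltech.com</EMAIL></CONTACT></CONTACTS>"
)
# Validating after each operation rejects the temporarily invalid state.
_, strict_ok = run_transaction(db, [remove_name, insert_name], validate_each_step=True)
# Validating on the boundary accepts it, because the final state is valid.
db, boundary_ok = run_transaction(db, [remove_name, insert_name])
# A transaction whose final state is invalid is rolled back either way.
db, remove_only_ok = run_transaction(db, [remove_name])
```

The same two operations are thus accepted or refused depending only on when validation takes place, which is why the two approaches suit different kinds of client.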

Versioning

Document-oriented applications often have a need for version control. Version control covers many different needs including the tracking of changes made to the documents, the retrieval of past versions, and so on. Database-oriented applications generally do not provide this functionality.

The 'best of both worlds' approach presented in this paper is a system which aims to provide equal support to both document-oriented and database-oriented applications. As such, it provides basic support for versioning, powerful enough to build sophisticated versioning systems, while keeping the fine-grained operations of the data manipulation language.

Keeping this approach in mind, here is a minimum set of features to support versioning in an SGML database:

each transaction is a logical unit of work, but also increases the version number of the instance;
an 'undo' feature can be used to restore an instance to a previous version;
it is also possible to clean up historical data, when it becomes unnecessary to keep it.
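These three features can be sketched in a few lines. The class below is hypothetical and deliberately naive: it keeps a full snapshot per version, whereas a real repository would more plausibly store only the changes, and it assumes that 'undo' discards the versions that follow the restored one.

```python
import copy
import xml.etree.ElementTree as ET


class VersionedInstance:
    """One stored instance with a version number per committed transaction."""

    def __init__(self, root):
        self._history = {1: copy.deepcopy(root)}  # version number -> snapshot
        self.version = 1

    @property
    def current(self):
        return self._history[self.version]

    def commit(self, operations):
        """Apply a transaction as one unit of work; return the new version."""
        working = copy.deepcopy(self.current)
        for op in operations:
            op(working)
        self.version += 1
        self._history[self.version] = working
        return self.version

    def undo(self, to_version):
        """Restore a previous version (assumption: later versions are dropped)."""
        if to_version not in self._history or to_version > self.version:
            raise ValueError(f"version {to_version} is not available")
        for v in [v for v in self._history if v > to_version]:
            del self._history[v]
        self.version = to_version

    def purge_history(self, keep_from):
        """Drop snapshots older than keep_from; the current one is always kept."""
        keep_from = min(keep_from, self.version)
        for v in [v for v in self._history if v < keep_from]:
            del self._history[v]


def set_value(root):
    root.find("VALUE").text = "1500"


inst = VersionedInstance(ET.fromstring("<FIGURES><VALUE ID='V1'>1000</VALUE></FIGURES>"))
v2 = inst.commit([set_value])
text_at_v2 = inst.current.find("VALUE").text
inst.undo(1)
text_after_undo = inst.current.find("VALUE").text
```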

Sample Transactions

A system working along these lines has been built by our Group. The following samples illustrate the kind of elementary operation which can be executed by the system. Of course, very complex transactions can be built by combining the basic primitives.

Consider this sample SGML fragment:

<SECTION ID="SEC1">
 <TITLE>The section title</>
 <FIGURES>
  <VALUE ID="V1">1000</>
  <VALUE ID="V2">2000</>
  <VALUE ID="V3">3000</>
 </FIGURES>
 <COMMENTS ID="SEC1-C">
  <p>Some text</>
  <p>Some more text</>
 </COMMENTS>
</SECTION>

The following transaction

<DELETE>
  <!-- address of element to be deleted -->
  <ELEM-LOC ROOT-ID="V3">
</DELETE>

would remove the third value.

The following transaction

<UPDATE>
  <!-- address of element content to be updated -->
  <ELEM-CONTENT-LOC ROOT-ID="SEC1-C" TREELOC="1 2">
  <!-- new content -->
  <INPUT>Some new text</INPUT>
</UPDATE>

would replace the content of the second paragraph.

This would lead to the following result:

<SECTION ID="SEC1">
  <TITLE>The section title</>
  <FIGURES>
    <VALUE ID="V1">1000</>
    <VALUE ID="V2">2000</>
  </FIGURES>
  <COMMENTS ID="SEC1-C">
    <p>Some text</>
    <p>Some new text</>
  </COMMENTS>
</SECTION>
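The effect of the two transactions can be reproduced with the following sketch. It is not the implementation built by our Group: the sample is rewritten as well-formed XML (full end tags instead of the SGML short form) so that Python's xml.etree.ElementTree can load it, the transactions are represented as plain dictionaries instead of being parsed from markup, and the names locate and apply_transaction are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<SECTION ID="SEC1">
  <TITLE>The section title</TITLE>
  <FIGURES>
    <VALUE ID="V1">1000</VALUE>
    <VALUE ID="V2">2000</VALUE>
    <VALUE ID="V3">3000</VALUE>
  </FIGURES>
  <COMMENTS ID="SEC1-C">
    <p>Some text</p>
    <p>Some more text</p>
  </COMMENTS>
</SECTION>"""


def locate(root, root_id, treeloc="1"):
    """Return (parent, element) for an ID plus treeloc address."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = next(e for e in root.iter() if e.get("ID") == root_id)
    for step in treeloc.split()[1:]:  # the first step (1) is the ID'd element
        node = list(node)[int(step) - 1]
    return parents.get(node), node


def apply_transaction(root, operations):
    """Apply DELETE and UPDATE operations in order, modifying root in place."""
    for op in operations:
        parent, node = locate(root, op["root_id"], op.get("treeloc", "1"))
        if op["kind"] == "DELETE":
            if parent is None:
                raise ValueError("the root element cannot be deleted")
            parent.remove(node)
        elif op["kind"] == "UPDATE":
            for child in list(node):  # replace the whole content
                node.remove(child)
            node.text = op["input"]
        else:
            raise ValueError(f"unknown operation {op['kind']!r}")


section = ET.fromstring(SAMPLE)
apply_transaction(
    section,
    [
        {"kind": "DELETE", "root_id": "V3"},
        {"kind": "UPDATE", "root_id": "SEC1-C", "treeloc": "1 2",
         "input": "Some new text"},
    ],
)
remaining_values = [v.get("ID") for v in section.iter("VALUE")]
paragraphs = [p.text for p in section.iter("p")]
```

An INSERT operation would follow the same pattern, with a position address in place of a content address.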

Needless to say, such transactions are not intended for end-users. It is very important, however, that such a precise level of control be available to applications:

when the check-out/check-in approach is used, the modified fragment goes through a difference analyser which generates the transaction, only updating the modified content, limiting history space consumption, and increasing the performance of the check-in operation;
any program wanting to manipulate the stored data can also generate transactions directly, much like an SQL application updating a relational database.

Interfaces

As shown in the above diagram, several interfaces are available to access the SGML database.

The most important one is the transaction interface (the DML for the SGML database), used to change the instances stored in the database. This is the lowest-level interface, providing fine-grained write access to the stored content.
Another very important interface is the browsing and navigation API. This API is very similar to the DOM and provides read-only access to the stored objects (elements, attributes, text content, and so on).
The check-out interface is built using the primitives of the navigation API. This is at a higher level and used by document-oriented applications which are capable of parsing SGML or XML.
The traditional check-in operation is built using the transaction interface, with the help of an integrated delta analyser. This 'SGML diff' application compares the stored version of the check-in fragment with the new version submitted by the user and generates a transaction which is submitted to the database in the normal way. From the user's point of view, the check-in works at the fragment level; however, the actual changes to the stored content are limited to the parts effectively modified by the user.
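The role of the delta analyser can be illustrated with a deliberately naive sketch. It is not the 'SGML diff' application mentioned above: it ignores mixed content, it replaces a whole element as soon as the tag, the attributes, or the number of children differ instead of searching for a minimal set of changes, and the operation names UPDATE-TEXT and REPLACE are hypothetical. Addresses are tuples of treeloc steps relative to the checked-out fragment.

```python
import xml.etree.ElementTree as ET


def diff(stored, submitted, path=(1,)):
    """Return a list of operations turning `stored` into `submitted`."""
    same_shape = (
        stored.tag == submitted.tag
        and stored.attrib == submitted.attrib
        and len(stored) == len(submitted)
    )
    if not same_shape:
        # Coarse fallback: resubmit the whole element.
        return [("REPLACE", path, ET.tostring(submitted, encoding="unicode"))]
    ops = []
    old_text = (stored.text or "").strip()
    new_text = (submitted.text or "").strip()
    if old_text != new_text:
        ops.append(("UPDATE-TEXT", path, new_text))
    for index, (old, new) in enumerate(zip(stored, submitted), start=1):
        ops.extend(diff(old, new, path + (index,)))
    return ops


stored = ET.fromstring(
    "<COMMENTS ID='SEC1-C'><p>Some text</p><p>Some more text</p></COMMENTS>"
)
submitted = ET.fromstring(
    "<COMMENTS ID='SEC1-C'><p>Some text</p><p>Some new text</p></COMMENTS>"
)
transaction = diff(stored, submitted)
```

Although the user checked in the whole COMMENTS fragment, the generated transaction touches the second paragraph only, which is the behaviour described for the check-in interface.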

Implementation Considerations

Storage Model

When storing SGML in databases, one common approach is to work at the entity management level. This approach consists in the creation of an entity manager which fetches the entities from a database instead of operating system files. The entities are then stored as chunks in the database and version control acts at the entity level.

This approach is relatively easy to implement and does not require a high level of SGML awareness from the repository. It allows for the storage of the SGML fragments 'as-is', keeping the SGML source intact.

To support the requirement to have fine-grained write access to the stored content, a radically different approach was chosen.

The main stored objects are SGML instances, elements, attributes, text chunks, and processing instructions. Internal entities are resolved, except for SDATA entities. External SGML text entities are also resolved. SUBDOC entities are stored as separate instances in the database, while data entities are stored in the database as separate chunks.

Another point is that this approach allows for the creation of structure-controlled SGML applications, as defined in [Goldfarb 90], pages 588-93. Applications working with the content stored in the SGML database do not need the help of a parser, since the ESIS (Element Structure Information Set) is immediately available through the browsing and navigation API.

When updating, the application needs to provide a transaction, which contains the fragments to be inserted in the database in the form of SGML data, which must be valid at the place where it is inserted in the database. Inserted fragments can be as small as needed (a new paragraph or a new attribute value, for instance). These are parsed by the database and converted to the corresponding storage objects. They are then immediately available for processing through the browsing and navigation API.

Alternatively, a check-in operation can be emulated by providing a transaction saying 'update that element with this new content', or by using a diffing process to generate the transaction corresponding to the smallest set of modifications needed to reproduce the changes required by the user.

Storage Back-End

As the schema of the database is provided by the DTDs, schema facilities of the back-end database are not used. Thus, there is no direct mapping between SGML concepts (elements, attributes, and entities) and relational database concepts (records).

Should an object-oriented database be used, SGML objects could be mapped to stored objects. However, we chose to have an architecture which is independent of the storage back-end. Thus, the basic requirement for a storage back-end is the capability to store and retrieve binary chunks. The content of those chunks is managed by the SGML database layer. Additional services of the storage back-end are of course exploited (robustness - commit/rollback, concurrency, security). Additionally, this allows the SGML database to run on top of flat files, as well as with an RDBMS (Relational Database Management System) such as Oracle.
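The separation between the SGML database layer and the storage back-end can be sketched as a minimal chunk-store interface. Everything in this sketch is an assumption made for illustration: the ChunkStore protocol, the two back-ends, and the trivial serialization of an element into a chunk do not describe the actual storage format.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ChunkStore(Protocol):
    """The only capability required from a back-end: keyed binary chunks."""

    def put(self, key: str, chunk: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class MemoryStore:
    def __init__(self):
        self._chunks = {}

    def put(self, key, chunk):
        self._chunks[key] = chunk

    def get(self, key):
        return self._chunks[key]


class FlatFileStore:
    """One file per chunk; keys are assumed to be safe file names."""

    def __init__(self, directory):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def put(self, key, chunk):
        (self._dir / key).write_bytes(chunk)

    def get(self, key):
        return (self._dir / key).read_bytes()


# The layer above decides what a chunk contains; the back-end never looks inside.
def save_element(store: ChunkStore, key: str, tag: str, text: str) -> None:
    store.put(key, f"{tag}\0{text}".encode("utf-8"))


def load_element(store: ChunkStore, key: str):
    tag, text = store.get(key).decode("utf-8").split("\0", 1)
    return tag, text


memory = MemoryStore()
save_element(memory, "V1", "VALUE", "1000")
from_memory = load_element(memory, "V1")

with tempfile.TemporaryDirectory() as tmp:
    files = FlatFileStore(tmp)
    save_element(files, "V1", "VALUE", "1000")
    from_files = load_element(files, "V1")
```

A relational back-end would implement the same two methods on top of a table with a key column and a binary column, leaving the upper layer unchanged.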

Real-World Applications

Descriptions are given in this section of real-world applications of such a transactional interface to SGML repositories.

Manipulating Embedded Structured Data

Consider a document that contains text and highly structured numerical data. The classic check-out/check-in paradigm works well for a user who wants to change the text or the numerical data in a stand-alone authoring environment. This authoring environment could even be a specialized tool should the numbers have a structure that is too complex for display using an SGML editor.

However, if an application program had to manipulate these numbers, it would probably benefit from having them stored in a traditional structured database with flexible and precise access techniques (to perform computations and advanced validations, for instance).

Another situation where the check-out/check-in approach is not practical is when the changes to the content are specified in the form of 'change requests' which are not immediately applied. Typical change requests are 'update that number to this new value', or 'delete this section'. Using the transactional approach, the change request can be defined without the need to do a check-out first. Once defined, change requests can be applied later, in any order.

The approach presented in this paper provides the best of both worlds: a check-out/check-in mechanism with versioning, well suited to an editorial approach, with the flexibility that is expected for structured data manipulation.

Consider a customs tariff regulation, a legal document defining the rates applicable for importation of various goods. It includes textual parts (comments) and highly structured numerical values associated with short text labels (the rates). A traditional approach would lead the systems architect to store the structured data in a relational database and textual parts in a document storage system, with the need to have a complex synchronization mechanism between the two separate databases.

Based on the SGML database, the system can store the rates and textual comments in a single SGML instance, where synchronization between the two 'kinds' of data is ensured at all times. Features of the system include the following.

A 'traditional' document editor to manipulate the textual parts and perform proof-reading operations on the rates. A check-out/check-in approach is used for those parts, where the end-user selects the parts on which he wants to work by browsing the table of contents of the document.
A specialized tariff editor is provided to manipulate the rates, providing highly specialized features for manipulating the structure (split, merge, transpose, ...). This editor directly generates transactions submitted to the repository, in the same way as a traditional database application would do with an SQL database.
The rates data is immediately available to other applications which can access the SGML database to obtain the value of single rates.

Benefits

In short, it can be said that the repository provides a single storage and access medium for all the data, with capabilities of both:

a document management system for document-oriented work on textual content by end-users;
a database management system for the handling of numerical data by automated subsystems and end-user applications with a specialized user interface.

Having a common storage system for both numerical and textual data ensures the coherency of the data at each step of the production process.

Replicating Changes in a Multilingual Environment

In a multilingual environment, where documents are updated frequently, translators spend most of their time finding the changes authors made to the master language version. Once they have found the relevant changes, it appears that most changes are language-independent: numbers have been modified, parts suppressed, and so on. In language-dependent modifications, an important part is structure (eg chapters, tables). Finally, the actual text must be translated.

In practice, many changes authors make to the master language version of the document can be applied automatically to other linguistic versions. Here are some examples of such changes:

deleting a chapter, graphic, or table;
modifying numbers in a table;
when inserting a new chapter or table, a skeleton can be generated in other languages.

The basic principle is simple, using transactions on the SGML database. A specialized difference analyser compares the modified document submitted by the user to the original version in the repository. The resulting transaction is split into a language-independent and a language-dependent part. Both are applied on the master language version while only the language-independent part is applied to the other linguistic versions. The translators only have to complete or update truly language-dependent content.
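The split can be sketched as follows. The classification rule is an assumption made for the sake of the example (deletions and purely numeric updates are treated as language-independent), as are the dictionary representation of operations and the function names; the sketch also assumes that corresponding elements carry the same IDs in every linguistic version.

```python
def is_language_independent(op):
    """Assumed rule: deletions and purely numeric updates need no translation."""
    if op["kind"] == "DELETE":
        return True
    if op["kind"] == "UPDATE":
        return op["input"].replace(".", "", 1).isdigit()
    return False


def split_transaction(operations):
    """Split a master transaction into (independent, dependent) parts."""
    independent = [op for op in operations if is_language_independent(op)]
    dependent = [op for op in operations if not is_language_independent(op)]
    return independent, dependent


master_transaction = [
    {"kind": "DELETE", "root_id": "V3"},
    {"kind": "UPDATE", "root_id": "V1", "input": "1500"},
    {"kind": "UPDATE", "root_id": "SEC1-C", "treeloc": "1 2",
     "input": "Some new text"},
]
independent, dependent = split_transaction(master_transaction)

# The master version receives both parts; every other version receives the
# language-independent part only, and the rest is left to the translators.
to_apply = {
    "master": independent + dependent,
    "other versions": independent,
}
```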

Benefits

The possibility of defining updates to the repository as transactions (as opposed to check-in of fragments) gives rise to a very efficient solution. Once a 'master' transaction has been computed and split into its language-dependent and language-independent parts, the language-independent part can be applied on any number of 'slave' languages. Moreover, this split enables the translators to be shown a 'content only' view of the changes made by the authors, where all the language-independent changes have been filtered out.

Conclusion

Considering the DTD as the Data Definition Language (DDL) of an SGML repository, an approach to provide an equally sophisticated Data Manipulation Language (DML) has been shown. An SGML/XML repository working along the principles highlighted in this paper can be considered more like a true database management system than a document storage and retrieval system. This mechanism can be used to build complex applications that manipulate structured information stored in the repository, as well as document-oriented systems based on a check-out/check-in interface.

A repository providing such fine-grained data manipulation primitives is a key towards the creation of sophisticated corporate information systems where the data stored in documents is treated on a par with data stored in traditional databases.

Reference

[Goldfarb 90] Charles F. Goldfarb, The SGML Handbook, Clarendon Press, 1990.

Please e-mail your comments to Stéphane Bidoul at sbi@sgmltech.com.

This paper was first published in the Conference Proceedings of Markup Technologies '98 US, November 1998, pp 101-107.