The Document Database: Relational, Object Oriented or Hybrid?

Michael J. McNamara
Xyvision Limited
246 Bedford Avenue
Slough
Berkshire
SL1 4RJ
United Kingdom
e-mail: Mike.McNamara@xyvuk.com

Introduction

Not since the LAN has there been an arrival more anticipated in mainstream corporate computing than that of the object oriented database. Like the fabled "Year of the LAN," the "Year of OO" may be an occasion more marked by its prediction than by its observance. Certainly, evidence of that arrival to date is mixed. On the one hand, the most successful of the object oriented vendors, Object Design, of Burlington, Massachusetts, made it to the top of Inc. Magazine's 1994 list of the fastest growing small companies. On the other, Object Design's cross-town rival, Ontos, was recently forced to reorganize its management team and focus on making what the Aberdeen Group, a Boston-based consulting firm, calls Object-Relational Enablers (OREs) - a class of product that lets object databases co-exist with relational databases, not replace them.

Ontos wasn't alone. Other products following the ORE approach appeared last year from dozens of companies. Many of these ventures, like Boston-based Open Environment Corporation's Encompass, focused on adding SQL programming capabilities to object oriented middleware, such as DCE (Open Software Foundation's middleware for distributed client/server systems). Even forthcoming operating systems from Microsoft, Apple, and IBM employ object oriented technology to enable enterprise-wide, distributed computing across systems that can include relational databases.

Meanwhile, the relational database companies themselves are not standing still. Some are offering hybrid products that combine relational and OO features. Progress Software Corporation (Burlington, MA) has added a front end to its database system, allowing it to handle OO-structured data within relational tables. Newcomer Illustra Information Technologies (Oakland, CA) offers a genuine OO database - except that it talks SQL (the standard relational database interface language) even though it stores objects directly rather than as relational tables.

Other database vendors prefer to downplay the OO approach altogether in favor of adding raw performance enhancements to their flagship relational products - features which greatly enhance the products' competitive positions relative not only to other relational database systems, but also to substitute technologies - among them OO database systems. Object Design's largest corporate investor is IBM. Yet, in early 1995, the computer giant will announce DB2 Parallel Edition - its entry into the parallel Unix market. The product will provide table partitioning and parallel support for the full range of database functions, including sorts, indexing, and utilities. In late 1993, Informix (Menlo Park, CA) announced its own parallelized database engine, allowing users to perform concurrent sorts, indexes, and backup-and-restore functions. In future versions, the company plans to support table partitioning, enabling companies to store data on separate disks, based on content, to reduce I/O contention. Even before such performance enhancements are considered, relational databases yield impressive performance.

Users seem to have three choices - they can stick with the established, performance-optimized relational databases, opt for the emerging (but still untested) capabilities of object oriented database systems, or take a middle path by buying a hybrid database or some other variant of the ORE-style solution. For the third option, we prefer to use the word "inclusive" - as the term more aptly suggests the need to combine technologies where appropriate without presupposing what that might mean in a particular application domain.

Document-Based Information Storage

One of the newer arenas in the database battle is document management. This is an interesting development. It wasn't that long ago that documents were things you simply stored in files and retrieved in one piece when you wanted to display or print them. The whole notion of documents as data is relatively new. With the acceptance of sophisticated encoding languages, such as SGML (Standard Generalized Markup Language), it is now possible for high-end publishing and document storage and retrieval systems to break down documents into components (tables, figures, text blocks, etc.) and store these as uniquely addressable items. Those items can later be selectively retrieved, published as a whole document or combined and recombined to produce a variety of products.

The advantages of storing documents as information are compelling. Complex and physically large documents, such as legal research reports, pharmaceutical catalogs, and equipment maintenance manuals are more useful if the information they contain can be accessed randomly as in a database. Material need not always be presented in the same sequence or include the same information. Interested in carburetors? Now all the information on carburetors in your repair manuals can be presented as a single "package." There is no need to make the reader wade through extraneous material about other parts of all your engines. Want the information delivered on CD instead of paper? Now you can publish the same document without page breaks, and even include hyperlinks, sound and live action video. Do many of the engines in your product line share parts in common? Now you can store only one copy of redundant text for all the documents that use it.

The power to mix and match document content requires that document management systems see documents as compilations of objects, such as paragraphs, graphics, chapters, and books. Value-added information can be included in these objects (e.g., document type, data type, links to other objects, etc.) that allows objects to be accessed, retrieved, and processed. For example, objects can be tagged to show which product configuration the information pertains to.
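As a rough illustration of this idea, the following Python sketch models document components as objects that carry value-added information alongside their content. The class name `DocObject`, its fields, the helper `select_for`, and the sample data are all hypothetical, chosen only for illustration; they do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DocObject:
    """One addressable document component (hypothetical structure)."""
    object_id: str
    doc_type: str                                       # e.g. "paragraph", "graphic"
    content: str
    configurations: set = field(default_factory=set)    # product configurations it pertains to
    links: list = field(default_factory=list)           # ids of related objects


def select_for(objects, configuration):
    """Return, in stored order, the objects tagged for one product configuration."""
    return [o for o in objects if configuration in o.configurations]


# Sample data (invented for this sketch).
manual = [
    DocObject("p1", "paragraph", "Carburetor adjustment.", {"engine-A", "engine-B"}),
    DocObject("p2", "paragraph", "Fuel injection service.", {"engine-B"}),
    DocObject("g1", "graphic", "carburetor.tif", {"engine-A"}, links=["p1"]),
]

engine_a = select_for(manual, "engine-A")
print([o.object_id for o in engine_a])   # prints ['p1', 'g1']
```

The same stored objects can yield a different "package" for each configuration simply by changing the tag passed to `select_for`.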

According to an object oriented view, a paper document is merely one instance of how a document's components are rendered. What is much more important is how that information is organized and stored. Which brings us back to the database question. Because high-end document management systems treat the components of documents as objects, should not the document databases that support documents be object oriented? Well, no. At least not yet. The performance advantages of relational databases are simply too overwhelming, as are the investments that user organizations have already made in relational technology.

The Inclusive View

The view of computing that seems to have emerged in the 1990s is quite different from the "us-versus-them" mentality of the late 80s. Back then it was PCs versus minicomputers versus mainframes, token ring versus Ethernet, and UNIX versus everyone. If only from habit, it is difficult to overcome the urge to view technologies as if they were competing companies, and to expect the newest technology will always win. But as technology has become more powerful, the us-versus-them paradigm no longer dominates our lives - for two reasons. First, technology is now too important to be left in the hands of the technologists alone. Where before a company's technical experts might have thrown out a working solution in favor of one perceived as more elegant, today there is substantial economic pressure on those individuals to harvest value from what already exists.

Secondly, technology itself is becoming more inclusive. The whole recent emphasis on client/server computing has stimulated interest in an array of technologies for overcoming barriers - whether created by market competition, geography, or the evolution of technology itself.

Object oriented technology epitomizes the inclusive trend. By its very definition, OO means the encapsulation of external agents within a process. How those agents do what they do isn't the point. What matters is that they provide a consistent output when provided a consistent input. Encapsulation, data hiding, methods, and polymorphism are just some of the techniques defined by the OO paradigm to enable different pieces of software to work together - even if they were written by different people, for different reasons, at different times.

Another technology relied on by client/server computing, and that forms the basis of OO thinking, is the API (application programming interface). APIs are the key ingredient for building multi-tier, plug and play, distributed systems. They are "glue logic" that allows different programs to work together, without their developers necessarily knowing the details of how each other's program is written. Programmers simply know that inter-program messages (parameters passed with procedure calls) must conform to a specified API - that they contain certain kinds of content, formatted in a certain way.

By definition, the OO part of the application doesn't care how the external program does what it does. By the same token, the OO approach does not care how the data is stored. What is important is the interoperability, or communication, between the two. By using API-like approaches, a relational database can be used as effectively as, if not more effectively than, an OO database. The organization receives the benefits of both the OO and non-OO worlds. Additional performance features are now available to the OO software - without the heavy burden of conversion (including an extensive learning curve). With APIs, the choice between OO and relational technology is not an all or nothing proposition. The power of OO-style programming is that it encourages inclusion of disparate technologies rather than their competition. The benefits of inclusion are especially apparent where documents are involved.
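The point can be made concrete with a minimal sketch, assuming a small storage API of our own invention. The names `ObjectStore`, `RelationalStore`, `InMemoryStore`, and `round_trip` are hypothetical, and SQLite stands in for a relational database only because it ships with Python. The application function is written against the API alone and runs unchanged on either back end.

```python
import sqlite3
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical API: the only thing the OO application layer sees."""

    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, object_id: str) -> bytes: ...


class RelationalStore(ObjectStore):
    """Implements the API on top of a relational table (SQLite for illustration)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects (id TEXT PRIMARY KEY, data BLOB)"
        )

    def put(self, object_id, data):
        self.db.execute(
            "INSERT OR REPLACE INTO objects (id, data) VALUES (?, ?)",
            (object_id, data),
        )
        self.db.commit()

    def get(self, object_id):
        row = self.db.execute(
            "SELECT data FROM objects WHERE id = ?", (object_id,)
        ).fetchone()
        if row is None:
            raise KeyError(object_id)
        return row[0]


class InMemoryStore(ObjectStore):
    """Stand-in for any other back end, such as an OO database."""

    def __init__(self):
        self._items = {}

    def put(self, object_id, data):
        self._items[object_id] = data

    def get(self, object_id):
        return self._items[object_id]


def round_trip(store: ObjectStore) -> bytes:
    """Application code: written against the API, not against a back end."""
    store.put("para-17", b"Torque the bolts to specification.")
    return store.get("para-17")


assert round_trip(RelationalStore()) == round_trip(InMemoryStore())
```

Swapping the back end means constructing a different `ObjectStore`; nothing in `round_trip` changes.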

The Perfect Document Manager

Ask the prospective user of a document management system to produce a wish list of the perfect system, and details concerning technology will likely fall near the bottom, if they make it on the list at all. If the user is a manager, chances are that aspects such as "return on investment", "integrates with the current environment", and "data security" are important. If the user works with the software day-to-day, desirable attributes might be "easy to use", "can handle a wide variety of documents of any size", and "high reliability." Although not specifically technical, taken together these points lead toward a technology strategy that is inclusive - one that employs OO document management functionality layered on a conventional relational database - and away from a pure OO approach. Why? The answer is that OO databases provide no extra OO benefits in a document management application (those benefits belong to the management software) but can carry substantial costs and risks. APIs provide an established, well-understood mechanism for bringing non-OO software, like relational database systems, under the OO tent. Frankly, in the case of a document management system, an OO database may even be something of a mismatch. That is because OO databases have the extra cost and complexity needed to store procedures (an object's methods), not just data. Since computer-executable procedures are not normally part of documents, that capability is wasted in document management applications.

The main reasons for an inclusive approach, however, are not technical but marketing. An OO database vendor may be the fastest growing company, but the $30 million of revenue these companies generated in 1994 is insignificant compared to the $3 billion generated by Oracle, Sybase, Informix and the other household names of database management. There are good reasons why these companies are household names. Their high revenues indicate an acceptance with customers who in turn have invested not only in the database systems themselves but in an entire supporting infrastructure of corporate licenses, user education, and third-party tie-ins they would be loath to sacrifice to look like a trend setter.

Relational databases are also attractive to customers for other reasons. They are well understood, they provide easy access based on a universal access language, and they allow for highly efficient retrieval of customized reports. In the early 80s, it was this ability to generate highly individualized reports on-the-fly that first attracted end users away from hierarchical databases such as IMS. That original need still exists today. In fact, if OO databases are to flourish in the corporation they would do well to take that particular page from the relational database book. If OO databases have an appeal to the end user (as opposed to the programmer looking for convenient persistent storage of objects) it has yet to be articulated. Most users are shielded from the details of the applications they use. What they are mostly concerned about is the speed, integrity, and security of their databases - features they have confidence that relational databases will provide.

Tables and Blobs

It is one thing to say you use relational databases. It is another to depend on them. An OO application implies that the internal mechanics of data storage are irrelevant to the rest of the application. What matters is that the data that comes back from an address is the same data that went out to that address. Changing databases should be as easy as changing plugs in a wall socket. If relational databases are in vogue today, fine, developers will write APIs for them. But if tomorrow OO databases become the standard, developers can write APIs for them too. Mechanics are, however, important where the performance of the application is concerned. A key question is: if you are going to use a relational database, how will you use it? To answer that question you need to know how this type of database works.

Relational databases offer two types of data containers - tables and blobs. Cells, formed by the intersection of table rows and columns, are appropriate for holding data that are of limited size and likely to be sorted or summarized in some way, such as employee names or sales figures. Blobs (a recent development, partly in response to the increasing popularity of document management and graphic applications) are free-form areas that can contain any kind of data at all - even a video. The key question for optimizing document data storage in a relational database, then, is which container should hold a document's meta data and which should hold its content.

Looking at how relational databases work, the answer is almost obvious. Meta data consists of things like document names, lists of objects, author names and so on. This data tends to be of predetermined (and limited) size and it is also used to find and organize the content portion of the object. Meta data therefore fits the physical characteristics as well as the purpose of relational tables. Document content, on the other hand, is open ended - it can be of any size and any type. It is viewed or published and less likely to be used as a pointer or a tag. Therefore, blobs are the more appropriate container for content.
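The division of labor described above can be sketched as follows. This is a minimal illustration, assuming SQLite (bundled with Python) as the relational database; the table name `doc_objects`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE doc_objects (
        object_id  INTEGER PRIMARY KEY,
        doc_name   TEXT NOT NULL,     -- meta data: small, predictable, searchable
        author     TEXT NOT NULL,
        data_type  TEXT NOT NULL,     -- e.g. 'text', 'graphic', 'video'
        content    BLOB NOT NULL      -- content: any size, any type
    )
    """
)
db.executemany(
    "INSERT INTO doc_objects (doc_name, author, data_type, content) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Repair Manual", "Smith", "text", b"Remove the carburetor cover."),
        ("Repair Manual", "Jones", "graphic", bytes([0x49, 0x49, 0x2A, 0x00])),
        ("Parts Catalog", "Smith", "text", b"Part 1: gasket."),
    ],
)

# The meta data columns are used to find and organize the objects...
rows = db.execute(
    "SELECT object_id, data_type FROM doc_objects "
    "WHERE doc_name = ? ORDER BY object_id",
    ("Repair Manual",),
).fetchall()

# ...and the blob is fetched only when the content itself is needed.
first_id = rows[0][0]
(content,) = db.execute(
    "SELECT content FROM doc_objects WHERE object_id = ?", (first_id,)
).fetchone()
```

Searching and sorting touch only the small table columns; the blob is read last, and only for the objects actually selected.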

What about putting content outside the database in an external file and using the relational table only as a convenient way to organize and access document data? Several document management vendors prefer this approach (and others offer it as an option). It should be available to those users who prefer having their documents available to them as files. Most of the world is not yet object oriented and many tools require a UNIX or DOS file as input. Furthermore, many users are simply more accustomed to dealing with documents as files than as disembodied components within a database.

There are downsides to external file storage, however. One is that these files are outside the protection of the document management system. They can be altered or removed without the knowledge, security checks or revision control services that these systems offer. Another advantage that may be compromised is the ability to maintain document objects in known configurations or views. One configuration may organize content as an approved document; another might be organized as work in progress for the next revision; a third might be organized by subject category; and so on. Putting all these configurations out on the file system as unique documents may mean having redundant copies of the content these documents share. Use of a simple pointer and file system also makes it more difficult to pull together pieces of different configurations (which now exist as discrete files) in order to create or view a different configuration from one that has already been stored. These types of limitations defeat the purpose of having object oriented document management in the first place.
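By contrast, when content stays inside the database, a configuration can be stored as an ordered list of references to shared components rather than as a separate copy. The sketch below assumes SQLite and a hypothetical two-table layout (`components` and `configurations`); the names and sample data are illustrative only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE components (
        component_id TEXT PRIMARY KEY,
        content      BLOB NOT NULL
    );
    CREATE TABLE configurations (
        config_name  TEXT NOT NULL,
        position     INTEGER NOT NULL,
        component_id TEXT NOT NULL REFERENCES components(component_id),
        PRIMARY KEY (config_name, position)
    );
    """
)
db.executemany("INSERT INTO components VALUES (?, ?)", [
    ("intro", b"Introduction."),
    ("step1", b"Step 1, approved text."),
    ("step1-draft", b"Step 1, revised draft."),
])
db.executemany("INSERT INTO configurations VALUES (?, ?, ?)", [
    ("approved", 1, "intro"),
    ("approved", 2, "step1"),
    ("work-in-progress", 1, "intro"),         # same stored copy as above
    ("work-in-progress", 2, "step1-draft"),
])


def assemble(config_name):
    """Return the content of one configuration, in document order."""
    rows = db.execute(
        "SELECT c.content FROM configurations AS g "
        "JOIN components AS c ON c.component_id = g.component_id "
        "WHERE g.config_name = ? ORDER BY g.position",
        (config_name,),
    ).fetchall()
    return [r[0] for r in rows]
```

Both configurations include the introduction, yet it is stored once; a new configuration is a few more rows in `configurations`, not another set of files.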

Regardless of how document management systems store data (as files, blobs or objects), users will not wish to sacrifice flexibility. Some relational databases limit the amount of data that can be stored in a blob. If the content stored is too large, it reduces the ability of the system to mix and match pieces of content such as when the user wants to substitute French captions for English or the steps in a repair task change from one product to another. If the content elements are too small it may over-complicate the task of organizing or publishing the document.
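The caption example suggests why granularity matters. In the following sketch, which uses hypothetical element ids and invented sample text, captions are stored as their own content elements, so substituting French for English is a lookup rather than an edit of a larger block.

```python
# Hypothetical content elements, keyed by (element id, language).
# A language of None marks an element that is language-neutral.
elements = {
    ("fig1-image", None): "[engine diagram]",
    ("fig1-caption", "en"): "Figure 1. Engine assembly",
    ("fig1-caption", "fr"): "Figure 1. Assemblage du moteur",
}

# A document is an ordered list of element ids, independent of language.
document = ["fig1-image", "fig1-caption"]


def render(document, language):
    """Pick the language-specific variant if one exists, else the neutral one."""
    return [
        elements.get((element_id, language), elements.get((element_id, None)))
        for element_id in document
    ]


english = render(document, "en")
french = render(document, "fr")
```

Had the image and its caption been stored as a single element, each language would need its own copy of the whole figure.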

A New Standard

Selecting the right product always means knowing what questions to ask. In the past, the answers to those questions typically involved an either-or type of choice. The promise of OO-technology is that users are no longer locked into that type of thinking. With a hybrid approach, they can have the flexibility of treating documents as collections of objects, and have the performance advantages of relational databases. They don't have to go out on a limb with new technology - yet they can enjoy the benefits of that technology. Hybrid, object-relational enabling, inclusive - whatever you call it, the organization receives the maximum leverage from technology investments.

In the era of the API, object oriented software and client/server architectures have created new benchmarks for technology excellence. Reuse, performance, return on investment, encapsulation of legacy resources, modularity, and change are the buzzwords. What is key to the user is the opportunity to work with the most advanced features, in a stable environment, with the opportunity to transition smoothly into new technologies. What counts is the underlying architecture of the document management system - and whether it measures up to today's benchmarks.