[Mirrored from WWW.SGML.COM WWW server: http://www.sgml.com/tag/9030101.htm]
One of the most misunderstood issues the SGML implementor faces is the need for a database management tool. Commercial database management systems are relatively pricey, and they are difficult to evaluate unless you have one to play with for a while. The problem is, vendors of these tools don't want to tie up their valuable products and staffs while you learn about these tools.
Information Architects has developed The World's Cheapest SGML Database Management System (twcsbdms). The tool, available for free, is designed to help the SGML implementor learn the concepts of storing hierarchical information in a database.
This article briefly covers the need for SGML-smart database systems, some technical details about twcsbdms, and screen shots of the software itself.
There comes a time in the learning process of the SGML implementor when he or she will discover that the things we have been calling "books" and "documents" actually contain information that has value beyond the actual bound object.
The discovery sometimes happens when the new implementor goes through the formal process of "document analysis" and finds that there are intellectual objects that contain other intellectual objects, and that they have some kind of relationship. This relationship can be indicated in a book by using two-dimensional design concepts like varying the type size and placing rules and other graphical elements around the intellectual objects.
The discovery that a book is no longer just a book is usually followed by a complete sense of panic, because what was once possible to comprehend suddenly becomes much larger and harder to understand.
Now we have these things called chapters, and citations, and part numbers that have intelligence beyond their mere presence in a document. This leads the implementor to the conclusion that we need to consider another medium to store this information, since a book is a pretty static repository. What is needed is some kind of database to store these information objects electronically so they can be accessed by computer procedures.
The implementor first tries to break the book into objects and save them as files in the file system. This appears to work for the larger pieces of information, such as books and chapters, but doesn't allow much access to reusable objects that are stored deeper in the structure. There is also a problem when putting the pieces back together to create a publication. Strike one. The implementor needs some kind of manager to look after the data.
The next place to look is to other areas where electronic storage has already solved a problem. The accounting department uses relational databases to store its information. Relational database managers are relatively cheap, and seem to be doing a good job for accounting. The implementor borrows a license and starts to store his new information objects in the relational database. It doesn't take long, however, to learn that a relational database is not necessarily a good place to hold information that is of an inherently hierarchical nature. Strike two.
The implementor gets on the phone and calls some new friend from the local SGML users' group. The friend tells him that he needs to get an object-oriented database that is designed to deal with SGML structures intelligently. He asks where he can find one of these things to play with and learn. His friend tells him that these databases are very new, and that he would need to get in line to buy one, let alone test drive it. Strike three.
I am constantly asked about data repositories for SGML data. Even for people who are already familiar with database management tools, hierarchical storage of intelligent text objects is a foreign concept. Dealing with these collections of information is very different from simply mapping a relation between a customer number and a table containing accounting information. What is needed is a tool that can track peer-to-peer relationships, but can also manage relationships that span up, down, and across the family tree. At the same time, the database repository needs to understand information that might not be directly related to the data: the name and employee number of the author, what his or her writing experience is, when each object was created, archived, and last accessed, and other information that can be used to manage the publishing effort.
Each SGML database manufacturer takes a different approach to the problem. There are some basic commonalities, however, between the different products. From these common traits we have built The World's Cheapest SGML Database Management System.
Consider a simple document structure:
book
  title
  chapter
    title
    section
      title
      paragraph
      paragraph
    section
      title
      paragraph
  chapter
    title
    paragraph
    paragraph
To put this in SGML terms, the dtd looks like this:
<!ELEMENT book - - (title, chapter+) >
<!ELEMENT title - O (#PCDATA) >
<!ELEMENT chapter - O (title, (paragraph | section)+) >
<!ELEMENT paragraph - O (#PCDATA) >
<!ELEMENT section - O (title, paragraph+) >
<!ATTLIST (book | title | chapter | paragraph | section) id ID #IMPLIED >
It is not difficult to see that, by thinking of each of these elements as a potential intelligent object, each can be authored and published independently from the others. The problem is to store these objects in such a way that they can be extracted with all of their descendants intact, and that their position is remembered so their ancestors can make use of them.
Twcsbdms takes the approach of isolating every single object and tracking it and its relationships to the other objects in the database. It does this by using the SGML parser provided by the Infrastructures for Information SGML Application Server to break the incoming document into pieces, which it then stores using a simple relational database manager. Every object is given a unique identifier, and is tracked for its lifetime by this number.
In the common case where an element contains another element, some indication of the nested object's unique identifier is placed with the parent object so that all objects can be rebuilt when necessary.
When an object is accessed, all objects that are contained within (its descendants) are rebuilt and displayed for the user.
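The rebuild step can be sketched in a few lines of Python. This is only an illustration of the idea, not the twcsbdms code: the in-memory `store` dictionary, the `{{id}}` placeholder syntax, and the `rebuild` function are all assumptions made for the example.

```python
import re

# Hypothetical store: unique identifier -> (element name, content).
# Nested objects appear in the content as {{id}} placeholders (assumed syntax).
store = {
    1: ("book", "{{2}}{{3}}"),
    2: ("title", "A Short Book"),
    3: ("chapter", "{{4}}{{5}}"),
    4: ("title", "Chapter One"),
    5: ("paragraph", "Some text."),
}

PLACEHOLDER = re.compile(r"\{\{(\d+)\}\}")


def rebuild(object_id):
    """Return the object as tagged text with all of its descendants expanded."""
    name, content = store[object_id]
    # Replace each placeholder with the rebuilt text of the object it names.
    expanded = PLACEHOLDER.sub(lambda m: rebuild(int(m.group(1))), content)
    return f"<{name}>{expanded}</{name}>"
```

Calling `rebuild(3)` returns just the chapter with its title and paragraph, while `rebuild(1)` returns the whole book. The same function serves any level of the hierarchy because each object only knows the identifiers of its immediate children.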
The SGML parser that is included in the product is a tool developed by Infrastructures for Information. It takes the form of a dynamic link library (dll) that we have integrated by using calls from our database front-end. Twcsbdms uses a Microsoft Access-compatible database to store information objects.
On import, every object is extracted and given a unique identifier. This identifier is a serial number that is incremented for each object. In the case where an object contains another object, the contained object is also extracted and given a unique identifier, then a placeholder replaces its namesake in the container object's relational database record.
After each object has been parsed and disassembled, the database manager creates a record in the "element" relational table for the element and its contents, and records in the "attribute" relational table that contain all of the attribute information. Then, the contents of each element are placed in the content field of its element record.
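A rough sketch of this import process follows, under several stated assumptions. Python's `xml.etree.ElementTree` stands in for the SGML parser (so the input must have explicit end tags), `sqlite3` stands in for the Access-compatible database, and the column names and `{{id}}` placeholder syntax are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def import_document(conn, text):
    """Disassemble a document into one "element" record per element."""
    # Table layouts are assumptions; the article names the tables only.
    conn.execute(
        "CREATE TABLE element (id INTEGER PRIMARY KEY, name TEXT, content TEXT)")
    conn.execute(
        "CREATE TABLE attribute (element_id INTEGER, name TEXT, value TEXT)")
    counter = 0  # serial number, incremented for each object

    def store(node):
        nonlocal counter
        counter += 1
        object_id = counter
        # Character data stays in place; each child becomes a placeholder.
        parts = [node.text or ""]
        for child in node:
            parts.append("{{%d}}" % store(child))
            parts.append(child.tail or "")
        conn.execute("INSERT INTO element VALUES (?, ?, ?)",
                     (object_id, node.tag, "".join(parts)))
        for attr_name, value in node.attrib.items():
            conn.execute("INSERT INTO attribute VALUES (?, ?, ?)",
                         (object_id, attr_name, value))
        return object_id

    return store(ET.fromstring(text))
```

Importing `<book id="b1"><title>T</title><chapter>...</chapter></book>` into a fresh `sqlite3.connect(":memory:")` connection gives the book record the identifier 1 and a content field made up only of placeholders for its title and chapter.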
There is a rules file that allows the user to indicate which elements are not to be processed. The rules file takes what we call "partially qualified generic identifiers" (pqgi).
section:para:list:item
*:xref
*:table:row:cell
If the system finds an element that is in the list, it will not create a record for that element, but will place the element in the content field of the database. This increases the performance of the system, since a record need not be created for every element, but it removes the ability to extract these elements by themselves. The pqgi is a form of the fully qualified generic identifier, which indicates the hierarchy of a particular element all the way to the document type. The pqgi uses a wildcard character, "*", to indicate that any number of elements might appear at this point. In the example above, an xref element will not be processed no matter where it is in the document hierarchy, and cells will not be processed only if they appear in a row inside of a table.
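One way to read the matching rule is sketched below in Python. The function names are hypothetical, and two points are assumptions rather than documented twcsbdms behavior: that "*" may stand for zero elements as well as several, and that a pattern with no wildcard must match the whole path from the document element.

```python
def pqgi_matches(pattern, path):
    """Return True if an element's path matches a pqgi pattern.

    pattern: a rule such as "*:table:row:cell".
    path: element names from the document element down to the element tested.
    """
    def match(p, e):
        if not p:
            return not e  # pattern used up: match only if the path is too
        if p[0] == "*":
            # "*" stands for any number of elements (assumed: including none).
            return any(match(p[1:], e[i:]) for i in range(len(e) + 1))
        return bool(e) and p[0] == e[0] and match(p[1:], e[1:])

    return match(pattern.split(":"), list(path))


def is_excluded(rules, path):
    """True if any rule says the element should not get its own record."""
    return any(pqgi_matches(rule, path) for rule in rules)


rules = ["section:para:list:item", "*:xref", "*:table:row:cell"]
```

With these rules, `is_excluded(rules, ["book", "chapter", "para", "xref"])` is true, while a cell that is not inside a table row, such as `["book", "chapter", "list", "cell"]`, still gets its own record.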
Figure 1 shows the main hierarchical view of the database. The "plus" sign to the left of the element name indicates that there are sub-elements or data content. A "minus" sign indicates that all children of that element are already shown. Clicking on either expands or collapses the structure for that element and displays the other sign. The little document icon indicates data content, which is displayed in the hierarchy window.
Extracting data can be done at any level in the hierarchy. Double-clicking the mouse on any element name will extract that element, including all data and structural content. For example, double-clicking on the section tag in the view shown in figure 2 will extract just that section.
The system uses the Windows notepad application to display extracted data (figure 3).
Of course, a real document database system would allow you to extract data, make changes, and put it back into the database. This is significantly more complex, and outside the scope of this project.
The World's Cheapest SGML Database Management System is one of the first SGML applications to support the SGML Open public entity catalog standard. The system allows the user to point to a catalog file that will be used to resolve public identifiers into system identifiers. Each entry in the SGML Open catalog file contains the keyword PUBLIC, followed by the public identifier and its system identifier equivalent:
PUBLIC "-//TAG//DTD TAG Article DTD//EN" "c:\user\tag\dtd\article.dtd"
PUBLIC "-//SIG//DTD Brian and Dale's Excellent DTD//EN" "http://www.SGML.com/SGMLig/dtd/excellen.dtd"
PUBLIC "-//IAI//ELEMENTS Harmony Section Elements//EN" "\\roark\user_drive\iai\harmony\section.dtd"
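Resolving a public identifier against such a file amounts to a table lookup. The Python sketch below shows the idea; the `load_catalog` name is hypothetical, and the parser is deliberately simplified. It assumes both identifiers are written in double quotes, as in the entries above, and it ignores every other kind of catalog entry.

```python
import re

# One PUBLIC entry: keyword, quoted public identifier, quoted system identifier.
ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"', re.IGNORECASE)


def load_catalog(text):
    """Map each public identifier in the catalog text to its system identifier."""
    return {public_id: system_id for public_id, system_id in ENTRY.findall(text)}


catalog_text = r"""
PUBLIC "-//TAG//DTD TAG Article DTD//EN" "c:\user\tag\dtd\article.dtd"
PUBLIC "-//SIG//DTD Brian and Dale's Excellent DTD//EN" "http://www.SGML.com/SGMLig/dtd/excellen.dtd"
"""

catalog = load_catalog(catalog_text)
```

A parser that meets the public identifier `-//TAG//DTD TAG Article DTD//EN` in a document type declaration would then look it up in `catalog` and open the file named by the system identifier.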
Because element content is stored in database fields, the size of each element's content is limited to 32k by the database manager we are using. Storing the data like this is how the system emulates the hierarchical nature of an object-oriented database management system.
There is a limit to the number of element and content objects that can be displayed in the hierarchical list window. The limit is approximately 1,100 objects.
Loading performance is slow because of the relational database paradigm we are using. A fully-loaded database could take a couple of minutes to load and display.
We have no plans to turn this product into a production database. However, we do plan to expand the capabilities to further enhance its educational mission. Two main areas that we will explore concern the integration of the database into an editing environment, and integration into a delivery environment.
In order to integrate the product into an editing environment, some sophisticated code needs to be written so the system can track the changes between the version that was checked out and the edited version that is checked in. Another concern is the case where a single object is shared by two different parents. Suppose such is the case, and the object is changed. There needs to be some provision to keep the old copy, since it is conceivable that not all parents will want their copy of the child changed. Another issue is how to assign unique identifiers to the modified data. What happens if a single paragraph is broken into two paragraphs and edited extensively? These are tricky questions that all SGML database vendors must address.
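One possible policy for the shared-object case is copy-on-write: an edit made on behalf of one parent creates a new object with a new identifier, and only that parent is pointed at it. The Python sketch below illustrates this policy only. The `ObjectStore` class and its methods are hypothetical, and nothing here describes how twcsbdms or any commercial product handles the problem.

```python
import itertools


class ObjectStore:
    """Toy store in which a child object may be shared by several parents."""

    def __init__(self):
        self._ids = itertools.count(1)  # serial unique identifiers
        self.content = {}               # object id -> text
        self.children = {}              # object id -> list of child ids

    def add(self, text, children=()):
        object_id = next(self._ids)
        self.content[object_id] = text
        self.children[object_id] = list(children)
        return object_id

    def edit_for_parent(self, parent_id, child_id, new_text):
        """Give one parent an edited copy; other parents keep the old object."""
        new_id = self.add(new_text, self.children[child_id])
        self.children[parent_id] = [
            new_id if c == child_id else c for c in self.children[parent_id]
        ]
        return new_id
```

If a warning paragraph is shared by two chapters and edited on behalf of the first, the second chapter still refers to the original identifier and sees the original text. The cost of this choice is that the two copies now drift apart unless something records that one was derived from the other.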
Integrating into a delivery environment should be a little easier, since it does not involve predicting how changes should be managed. One plan is to integrate the database query capabilities with a hypertext transport protocol (http) server. By making the database accessible to the server, an integrator could write html forms to send queries to the database, which would extract the appropriate information and translate it to html on the fly. This kind of database-web server integration is already being done by database manufacturers, so the teaching goal of twcsbdms would be served by adding this capability.
Information Architects would like to acknowledge the help and fine products of Michel Vulpe and his staff at Infrastructures for Information. Their product, the SGML Application Server, made this project possible.
Ron Wilhelm wrote the basic product while finishing his studies in Computer Publishing at the Rochester Institute of Technology in Rochester, New York. He completed the development release as an employee of Information Architects.
The SGML University Press is making the product available in its next release of the SGML Power Tools cd-rom. Until then, the product is available here.
Contact: Information Architects
Gary Smith Voice: +1 303-766-1336
fax: +1 303-680-4906
e-mail: info@sgml.com
Contact: Infrastructures for Information
Michel Vulpe Voice: +1 416-920-6489
fax: +1 416-920-6493
Contact: SGML University
e-mail: sgmlu@sgml.com
Bio: Brian Travis is the president of Information Architects, Inc., the Managing Editor of <TAG> The SGML Newsletter, and is a principal on the ANSI X3V1 committee.