[This local archive copy mirrored from the canonical site: http://www.drmacro.com/papers/sgmlstoragemodel.html; links may not have complete integrity, so use the canonical document at this URL if at all possible.]
Author: W. Eliot Kimber
Last updated: 30 August 1998
Copyright © W. Eliot Kimber, 1998. This document may be copied or quoted as long as the copyright notice is maintained or quotes are properly attributed.
This is an explanation of the SGML Storage Model. For additional information, see http://www.isogen.com/papers/sgmldocm, a paper on SGML document management that covers many of these same issues. The paper at http://www.isogen.com/papers/sgmldocm/index.html may also be interesting. This paper does not address the issue of representing or addressing the logical structures derivable from the content of storage objects.
SGML was designed from the start to be a platform-independent data representation format. Two base requirements on the design of SGML are the ability to use multiple storage objects to represent a single logical document and the ability to refer to non-SGML storage objects from SGML documents. Because SGML needs to be platform independent, the design cannot make any assumptions about how storage is managed. It also has to enable the easy movement of collections of storage objects from system to system.
To meet these requirements SGML provides an abstract storage model ("entities") that is then mapped to real storage by the local processing system ("entity manager").
CONFUSION ALERT!!!! The term "entity" as used in the SGML standard is not the same as "entity" as used in the STEP standard. In SGML an "entity" is an abstract storage object. The STEP term "entity" is more analogous to the SGML term "element". In this document, the unqualified term "entity" is used with its SGML meaning of "abstract storage object".
This approach leads to several facilities that are fundamental to the use of SGML and that serve to give SGML a particularly (if not uniquely) robust storage model:
In SGML (and HyTime) all addressing starts with references to the storage objects that contain the things to be addressed (the reference may be implicit, however, as in the case of SGML ID references). For example, to address an SGML element, you must first address the document entity that contains that element. To address a part in a STEP repository, you must first address the storage object that is the repository.
The SGML standard (augmented by ISO/IEC 10744:1997 Annex A.6 Formal System Identifiers) defines the following terms related to storage objects (these definitions are paraphrases; see the standard for the formal definitions):
NOTE: The SGML standard conflates internal string macros ("internal text entities") with abstract storage objects ("external entities"). Internal text entities are not relevant to this discussion--they are ignored in what follows.
One of the key aspects of entities is their indirection. One of SGML's basic principles is that everything you need to know about a document in order to parse it (and, to a lesser degree, to process it) should be in the document prolog. Thus, references to storage objects from the document's content are indirected through an entity declaration, which binds a local storage object name to an external identifier. The external identifier for the entity then addresses the storage object or objects that contain the entity's data (the "entity body"). The external identifier may be a direct pointer to the storage objects (a "system identifier") or an indirect pointer (a "public identifier").
This approach has several advantages:
In addition, SGML requires support for the type of name indirection necessary to support URN-type schemes, that is, schemes for naming storage objects using some form of universal name. All useful and conforming SGML systems (but not necessarily XML systems) must provide some form of public-ID-to-system-ID mapping facility, which most do through the use of SGML Open (OASIS) entity mapping catalogs (a well-established industry standard).
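The public-ID-to-system-ID mapping just described can be sketched in a few lines of Python. This is a minimal illustration, not a conforming catalog processor: it reads only PUBLIC entries (the keyword followed by two quoted literals), ignores every other catalog feature including comments, and the function names `load_catalog` and `resolve`, the sample public identifier, and the paths are all hypothetical. Preferring a catalog match over an explicit system identifier, and letting the first entry for a public identifier win, are assumptions of this sketch; a real entity manager may apply a different policy.

```python
import re

# One PUBLIC entry: the keyword followed by two double-quoted literals.
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def _normalize(public_id):
    """Collapse runs of whitespace so equivalent public IDs compare equal."""
    return " ".join(public_id.split())


def load_catalog(text):
    """Return a dict mapping public identifiers to system identifiers."""
    mapping = {}
    for public_id, system_id in _PUBLIC_ENTRY.findall(text):
        # Assumption of this sketch: the first entry for a public ID wins.
        mapping.setdefault(_normalize(public_id), system_id)
    return mapping


def resolve(public_id, system_id, catalog):
    """Choose the system identifier for an external identifier.

    Either argument may be None, mirroring declarations that carry only
    a public identifier or only a system identifier.
    """
    if public_id is not None:
        key = _normalize(public_id)
        if key in catalog:
            return catalog[key]
    if system_id is not None:
        return system_id
    raise LookupError("cannot resolve external identifier")


SAMPLE_CATALOG = '''
PUBLIC "-//Example Corp//DTD Maintenance Manual//EN" "dtds/manual.dtd"
PUBLIC "-//Example Corp//DTD Maintenance Manual//EN" "ignored/duplicate.dtd"
'''
```

With this in place, `resolve("-//Example Corp//DTD Maintenance Manual//EN", None, load_catalog(SAMPLE_CATALOG))` yields the first mapped path, while an identifier with no catalog entry falls back to whatever system identifier the declaration supplied. Moving the DTD then means editing one catalog line rather than every document that names it.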
As indicated in the foregoing sections, all SGML systems consist, in one way or another, of the same layers. At the bottom are the physical storage managers, the systems that actually manage data on storage media (file systems, databases, etc.). Above the storage managers is the entity manager layer. Above the entity manager is the SGML parser and any processing applications. Processing applications talk to the SGML parser to get parsed SGML documents and to the entity manager directly to get data entities:
Processing Application    (Does work with parsed SGML data and other data)
   !  A          !  A
   V  !          !  !
SGML Parser      !  !     (Provides access to parsed SGML documents)
   !  A          !  !
   V  !          V  !
Entity Manager            (Provides access to entities)
   !  A
   V  !
Storage Managers          (Provide access to "raw" data)

Figure 1. Layers in an SGML System
The communication flow is as follows:
Note that all data access goes through the entity manager, which provides a protective layer between the processing of the data and the details of its storage. Of course, this layer may be very thin if it is just a passthrough to the file system or some well-known database or repository system. It is up to document authors to decide how indirect their external identifiers are (that is, whether they use public identifiers or system identifiers).
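The layering can be made concrete with a small Python sketch. Everything here is hypothetical and greatly simplified: `InMemoryStore` stands in for a real storage manager, `ToyParser` only decodes text rather than parsing SGML, and the entity name `pump-photo` and the identifiers `doc-1` and `img-7` are invented for illustration. The point is only the shape of the calls: the parser and the application both obtain data through the entity manager and never touch storage directly.

```python
from typing import Dict, Protocol, Tuple


class StorageManager(Protocol):
    """Bottom layer: returns raw bytes for a manager-specific identifier."""

    def fetch(self, identifier: str) -> bytes: ...


class InMemoryStore:
    """Hypothetical storage manager backed by a dict; it stands in for a
    file system, database, or repository."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def fetch(self, identifier: str) -> bytes:
        return self._objects[identifier]


class EntityManager:
    """Middle layer: maps declared entity names to storage objects."""

    def __init__(self) -> None:
        self._managers: Dict[str, StorageManager] = {}
        self._declarations: Dict[str, Tuple[str, str]] = {}

    def register_manager(self, name: str, manager: StorageManager) -> None:
        self._managers[name] = manager

    def declare(self, entity: str, manager: str, identifier: str) -> None:
        # Plays the role of an entity declaration: name -> external identifier.
        self._declarations[entity] = (manager, identifier)

    def open_entity(self, entity: str) -> bytes:
        manager, identifier = self._declarations[entity]
        return self._managers[manager].fetch(identifier)


class ToyParser:
    """Stand-in for the SGML parser: it reads its input only through the
    entity manager."""

    def __init__(self, entities: EntityManager) -> None:
        self._entities = entities

    def parse(self, entity: str) -> str:
        return self._entities.open_entity(entity).decode("utf-8")


def build_demo() -> Tuple[EntityManager, ToyParser]:
    """Wire the three layers together with sample data."""
    store = InMemoryStore({"doc-1": b"<manual>...</manual>", "img-7": b"\x89PNG"})
    entities = EntityManager()
    entities.register_manager("memory", store)
    entities.declare("maint-and-repair", "memory", "doc-1")
    entities.declare("pump-photo", "memory", "img-7")
    return entities, ToyParser(entities)
```

An application would call `parser.parse("maint-and-repair")` for SGML content and `entities.open_entity("pump-photo")` for a data entity. Swapping `InMemoryStore` for a different storage manager changes nothing above the entity manager layer.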
One key aspect of SGML's abstract storage model is that it is not closed over SGML data: you can use it to model any kind of data regardless of how it is stored. The entity concept provides a neutral abstraction for storage objects and the storage manager concept provides a neutral abstraction for data management systems. The entity declarations used in SGML documents provide a generic syntactic connection between references to the abstractions and references to the realities that underlie them.
In short, any repository of any sort can be a storage manager. SGML simply requires that you formally declare it before you invoke it. The declaration provides just enough information that processing systems can have some hope of mapping storage identifiers to storage managers and human observers have some hope of understanding how the data is supposed to work.
In essence, a storage manager is nothing more than another entity that itself contains entities. Thus, any system of storage managers (repositories) and storage objects can be modeled by creating an SGML document that declares the storage managers and declares as entities the storage objects they contain. Such a document can be termed a "god" document because it is the one document that describes all the other entities in the system. In theory, the universe could be represented by a document that declares all the objects in the universe as entities--thus "god" document.
Another way to think about this is that the only way to address storage objects in SGML is in the context of their declaration as entities in some document. Thus there must exist an unnamed (and unnamable in an SGML-only context) document that serves as the ultimate context for referring to all other entities.
In the context of a single repository, the repository itself is the god document: it knows about all the storage objects inside itself and therefore represents a closed universe of entities. The repository is not itself namable from within itself nor does it need to be.
However, as soon as you have two repositories, there must be a larger context in which both repositories exist as named objects so that references can be made between them. Thus, there must be a repository that contains both repositories as storage objects. This repository can be an SGML document that declares both repositories as entities, thus establishing a common name space in which both repositories can be referenced by name.
For any set of repositories, it is always possible to synthesize an SGML document that declares them all as entities. This document can itself be declared as an entity by another document to form an even larger name space, and so on ad infinitum.
Thus, SGML provides a simple, standardized mechanism for creating hierarchies of repositories (entity name spaces) without limit. By the same token, SGML provides a standardized mechanism for addressing storage objects of any type (and by application of a related mechanism, any identifiable logical component within those storage objects). This mechanism can be used as part of the task of integrating SGML content with non-SGML content or simply as a way of representing and interchanging the organization of repositories.
The SGML approach need not be limited to SGML documents (although SGML documents provide a well-standardized and supported representation syntax). Any data representation can get the same benefits by providing the same levels of abstraction and indirection mechanisms for pointing to storage objects. There is no magic to the SGML syntax beyond its standardization.
The following is a sample god document that binds together three repositories: a STEP repository, a relational database, and an SGML document repository. It uses the syntax of formal system identifiers, documented in <http://www.ornl.gov/sgml/wg8/document/n1920/html/clause-A.6.html>. Note that it is sufficient to provide the entity declarations in order to define the storage object relationships--there is no need to also create instance elements that reference the entities. This interpretation is formalized by the "bounded object set" facility of the HyTime Architecture (see <http://www.ornl.gov/sgml/wg8/document/n1920/html/clause-6.2.html>). This god document is then used by a second document in the service of creating a hyperlink between objects in different repositories.
<!DOCTYPE god-document [

<!-- Declare list of storage manager notations defined in this doc: -->
<?IS10744 FSIDR
  COM-UUID
  generic-STEP-repository
  STEP-part-21
  odbc-db
  texcel-information-manager
>

<!--=================================================
    Storage manager notations:
    =================================================-->

<!NOTATION COM-UUID
  PUBLIC "Microsoft COM UUID object name space"
>
<!ATTLIST #NOTATION COM-UUID
  fsidr -- Define mapping to base storage manager forms in HyTime std. --
    NAME
    #FIXED "localsm" -- Local storage manager --
>

<!NOTATION generic-STEP-repository
  PUBLIC "ISO 10303//NOTATION Repository of STEP Entities//EN"
>
<!ATTLIST #NOTATION generic-STEP-repository
  in -- Container entity that contains the storage object --
    NAME
    #REQUIRED
  fsidr -- Define mapping to base storage manager forms in HyTime std. --
    NAME
    #FIXED "contnrsm" -- Container storage manager --
>

<!NOTATION STEP-part-21
  PUBLIC "ISO 10303-21//NOTATION Character Representation of STEP Repositories//EN"
>
<!ATTLIST #NOTATION STEP-part-21
  in -- Container entity that contains the storage object --
    NAME
    #REQUIRED
  fsidr -- Define mapping to base storage manager forms in HyTime std. --
    NAME
    #FIXED "contnrsm" -- Container storage manager --
>

<!NOTATION odbc-db
  PUBLIC "-//ODBC Owner//NOTATION Object Database B?? C???//EN"
>
<!ATTLIST #NOTATION odbc-db
  in -- Container entity that contains the storage object --
    NAME
    #REQUIRED
  fsidr -- Define mapping to base storage manager forms in HyTime std. --
    NAME
    #FIXED "contnrsm" -- Container storage manager --
>

<!NOTATION texcel-information-manager
  PUBLIC "+//IDN texcel.com//NOTATION Texcel Information Manager//EN"
>
<!ATTLIST #NOTATION texcel-information-manager
  in -- Container entity that contains the storage object --
    NAME
    #REQUIRED
  fsidr -- Define mapping to base storage manager forms in HyTime std. --
    NAME
    #FIXED "contnrsm" -- Container storage manager --
>

<!--==================================================
    Storage object notations:
    ==================================================-->

<!NOTATION SGML
  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language//EN"
>
<!NOTATION STEP-entity
  PUBLIC "ISO 10303//NOTATION STEP Entity//EN"
>
<!NOTATION customer.record
  PUBLIC "My private structure for customer records in Oracle"
>

<!--=====================================================
    Now declare some repository instances:
    =====================================================-->

<!ENTITY car-parts -- STEP repository, named using its COM unique ID --
  SYSTEM "<COM-UUID>{00000306-0000-0000-C000-000000000046}"
  NDATA generic-STEP-repository
>
<!ENTITY plane-parts -- Part 21 file --
  SYSTEM "<osfile>/data/STEP/repositories/aircraft/parts/engine.p21"
  NDATA STEP-part-21
>
<!ENTITY boat-parts -- Part 21 file --
  SYSTEM "<osfile>/data/STEP/repositories/boats/parts/hull.p21"
  NDATA STEP-part-21
>
<!ENTITY cust-database -- Relational database of customers --
  SYSTEM "cust-database" -- Value is a database name as defined by ODBC --
  NDATA odbc-db
>
<!ENTITY im-one -- Texcel IM repository, named using Windows filename --
  SYSTEM "<osfile>c:/texcel/repositories/im-one.txc"
  NDATA texcel-information-manager
>

<!--===================================================
    Now declare some entities in those repositories:
    ===================================================-->

<!ENTITY sump-pump.DC-3245-0-Z -- STEP representation of a sump pump part no. DC-3245-0-Z --
  SYSTEM "<generic-STEP-repository in='car-parts'>OID-87420890001"
  NDATA STEP-entity -- System ID is an object ID in the repository. --
>
<!ENTITY oil-pump.BRR-34G789-10-23 -- STEP representation of an oil pump part no. BRR-34G789-10-23 --
  SYSTEM "<STEP-part-21 in='plane-parts'>#361"
  NDATA STEP-entity -- System ID is an ID in the Part 21 file. --
>
<!ENTITY pwc.hull-2345-C -- STEP representation of a hull, part no. 2345-C --
  SYSTEM "<STEP-part-21 in='boat-parts'>#23"
  NDATA STEP-entity -- System ID is an ID in the Part 21 file. --
>
<!ENTITY three-initial-corp -- Customer database entry for TIC --
  SYSTEM "<odbc-db in='cust-database'>(select where CUSTID is 'tic-0000001')"
  NDATA customer.record
>
<!ENTITY maint-and-repair -- Maintenance and repair manual (SGML document) --
  SYSTEM "<texcel-information-manager in='im-one'>RID 10000004556"
  NDATA SGML
>

<!ELEMENT god-document - O EMPTY -- No need for content -- >
]>
<god-document>
This may seem like a lot of declaration, but in practice the storage manager and data notations would be relatively static and shared across a large number of documents. In addition, you would normally have a large number of storage object entities relative to the number of storage managers and data notations, which this example does not show.
The first set of notations declares a set of storage manager notations. The external identifiers for the notations should resolve to the authoritative specifications for those storage manager types. The storage manager declarations also serve to declare the attributes used to configure and describe the storage manager instances.
For each storage manager, the attribute "fsidr" indicates the base type of storage manager it is. These types are defined in the FSI annex of the HyTime standard. The two types used here are "localsm" (local storage manager) and "contnrsm" (container storage manager).
Container storage managers require the declaration of an attribute nominally named "in". The "in" attribute is used in the declaration of entities contained by the container storage managers to indicate which storage manager instance they are in (more about which below).
The second set of notations declares the data content notations to be used for entities. There is no necessary relation between the storage manager notations and the data entity notations. In this example, there are two different storage managers for STEP entities, just to make the point. I could also have several different repositories for SGML documents, for example.
The third set of declarations is the declaration of the actual container storage manager entities (repository instances). Note that these are entity declarations, not notation declarations. From the SGML point of view, a repository is just another storage object, one that happens to contain other storage objects. Note that the notations of the storage manager entities are the storage manager notations declared above them. For example, there are two different Part 21 repositories, one for plane parts and one for boat parts (representing repositories provided by two different suppliers). Note also that the system IDs for the storage managers are direct references to storage objects, either in the local file system (the "osfile" storage manager, defined in Annex A.6), in the COM UUID name space, or in the ODBC name space. You could, of course, have one container storage manager in another.
The last set of declarations consists of the data entities themselves. Each data entity uses a formal system identifier to name the type of storage manager that contains it, the storage manager instance it's in, and the storage-manager-specific identifier for the storage object. As should be obvious, storage managers are named using start-tag syntax within the system identifier literal. The storage manager name (the "GI" of the tag) must be the name of a declared storage manager notation. The "in" attribute names a declared storage manager entity whose notation is the same as the storage manager name.
Thus, from the entity declaration you can get the containing storage manager from the value of the "in" attribute and from that get the name of the storage manager type. An entity manager must provide some way to correlate storage manager notation external IDs (system IDs or public identifiers) to code. For example, you might have registered a Part 21 parser for the STEP-part-21 storage manager notation. To get a particular entity from the Part 21 file, the entity manager would presumably give the Part 21 parser the name of the particular Part 21 file and wait to get the data back (however that might be).
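The first step an entity manager would take, splitting a formal system identifier into its parts and following the "in" attribute to the containing entity, can be sketched in Python. This is an illustration under simplifying assumptions rather than an implementation of Annex A.6: it handles a single start-tag followed by one storage object identifier, accepts only quoted attribute values, and rejects untagged system identifiers. The names `StorageObjectSpec`, `parse_fsi`, and `container_chain` are hypothetical.

```python
import re
from typing import Dict, List, NamedTuple


class StorageObjectSpec(NamedTuple):
    manager: str                # storage manager name (the tag's GI)
    attributes: Dict[str, str]  # attributes such as "in"
    identifier: str             # storage-manager-specific identifier


# A start-tag with optional quoted attributes, then the identifier text.
_FSI = re.compile(
    r"""^<([A-Za-z][\w.-]*)((?:\s+[\w.-]+\s*=\s*(?:'[^']*'|"[^"]*"))*)\s*>(.*)$""",
    re.DOTALL,
)
_ATTR = re.compile(r"""([\w.-]+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")


def parse_fsi(system_id: str) -> StorageObjectSpec:
    """Split '<manager attr="v">identifier' into its three parts."""
    match = _FSI.match(system_id)
    if match is None:
        raise ValueError(f"not a tagged formal system identifier: {system_id!r}")
    manager, raw_attributes, identifier = match.groups()
    attributes = {}
    for name, single, double in _ATTR.findall(raw_attributes):
        # findall yields '' for the alternative that did not match.
        attributes[name] = single or double
    return StorageObjectSpec(manager, attributes, identifier)


def container_chain(entity: str, declarations: Dict[str, str]) -> List[str]:
    """Names of the entities that contain `entity`, innermost first.

    `declarations` maps entity names to their system identifier literals.
    """
    chain: List[str] = []
    spec = parse_fsi(declarations[entity])
    while "in" in spec.attributes:
        container = spec.attributes["in"]
        if container in chain:
            raise ValueError(f"circular containment at {container!r}")
        chain.append(container)
        spec = parse_fsi(declarations[container])
    return chain


# Two declarations copied from the sample god document.
DECLARATIONS = {
    "plane-parts": "<osfile>/data/STEP/repositories/aircraft/parts/engine.p21",
    "oil-pump.BRR-34G789-10-23": "<STEP-part-21 in='plane-parts'>#361",
}
```

Here `parse_fsi(DECLARATIONS["oil-pump.BRR-34G789-10-23"])` gives the manager name STEP-part-21, the attribute in='plane-parts', and the identifier #361, and `container_chain` walks from the oil pump entity to the plane-parts file that holds it. A fuller entity manager would then look up the code registered for each manager name and hand it the identifier.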
It should be clear from this example that any system of related repositories and data objects can be represented using these techniques. Everything in this example is standardized except for the storage-manager-specific system identifiers, which are, of course, non-standard. However, they may be well defined and well understood (e.g., Microsoft COM, ODBC, etc.).
Finally, note that the god document shown above can itself be used as a central repository of addressable objects from SGML documents using normal HyTime-defined indirect addressing, avoiding the need to duplicate these declarations in multiple documents. For example, to create a document that links some part of the maintenance manual to one of the parts, you could do this:
<!DOCTYPE Proc-to-Part-Link [

<!-- Declare use of the HyTime architecture: -->
<?IS10744 arch
  name="hytime"
  public-id="ISO/IEC 10744:1997//NOTATION Hypermedia/Time-based Structuring Language (HyTime)//EN"
  dtd-public-id="ISO/IEC 10744:1997//DTD AFDR Meta-DTD Hypermedia/Time-based Structuring Language (HyTime)//EN"
  form-att="HyTime"
  renamer-att="HyNames"
  doc-elem-form="HyDoc"
  options="hylink nmsploc"
>

<!-- Declare god document we'll use for pointing to entities indirectly: -->
<!ENTITY god.doc
  SYSTEM "<osfile>/data/repositories/god.docs/god.doc.sgm"
  SUBDOC -- Avoid need to declare SGML notation. SUBDOC doesn't really mean anything special. --
>

<!-- Element type declarations omitted. See HyTime architecture for nmsploc and hylink element types. -->
]>
<Proc-to-Part-Link
  procedure="pump-install-task"
  part="oil-pump.BRR-34G789-10-23"
  HyTime="hylink"
>
<desc>Relates the installation procedure to the part being installed</desc>

<!-- Indirection to maintenance and repair manual document: -->
<nmsploc id="maint.manual" namespc="entities" locsrc="god.doc">maint-and-repair</nmsploc>

<!-- Indirection to pump installation task within maintenance manual: -->
<nmsploc id="pump-install-task" namespc="elements" locsrc="maint.manual">proc.000001</nmsploc>

<!-- Indirection to pump entity: -->
<nmsploc id="oil-pump.BRR-34G789-10-23" namespc="entities" locsrc="god.doc">oil-pump.BRR-34G789-10-23</nmsploc>
</Proc-to-Part-Link>
Notice that for this document, which represents a single hyperlink, all we had to do was declare the god document as an entity and set up the indirect addresses to get to the entities declared in the god document. This serves to further centralize the details of data storage so that only the god document need change when the data moves from location to location. For example, if the plane parts, which are currently in a Part 21 file, were imported into a STEP repository, only the declaration for the oil pump entity in the god document would change--the hyperlink document above would be unaffected.
SGML's abstract storage model and entity declaration syntax, coupled with the Formal System Identifier facility of ISO/IEC 10744:1997, provide a robust mechanism for representing systems of repositories and storage objects, regardless of their data types.