[This local archive copy mirrored from the canonical site: http://www.drmacro.com/papers/sgmlstoragemodel.html; links may not have complete integrity, so use the canonical document at this URL if at all possible.]

The SGML Storage Model

Author: W. Eliot Kimber

Last updated: 30 August 1998

Copyright © W. Eliot Kimber, 1998. This document may be copied or quoted as long as the copyright notice is maintained or quotes are properly attributed.

This is an explanation of the SGML Storage Model. For additional information, see http://www.isogen.com/papers/sgmldocm, a paper on SGML document management that covers many of these same issues. The paper at http://www.isogen.com/papers/sgmldocm/index.html may also be interesting. This paper does not address the issue of representing or addressing the logical structures derivable from the content of storage objects.

INTRODUCTION

SGML was designed from the start to be a platform-independent data representation format. Two base requirements on the design of SGML are the ability to use multiple storage objects to represent a single logical document and the ability to refer to non-SGML storage objects from SGML documents. Because SGML needs to be platform independent, the design cannot make any assumptions about how storage is managed. It also has to enable the easy movement of collections of storage objects from system to system.

To meet these requirements, SGML provides an abstract storage model ("entities") that is then mapped to real storage by the local processing system ("entity manager").

CONFUSION ALERT!!!!

The term "entity" as used in the SGML standard is not the same as "entity" as used in the STEP standard. In SGML an "entity" is an abstract storage object. The STEP term "entity" is more analogous to the SGML term "element".

In this document, the unqualified term "entity" is used with its SGML meaning of "abstract storage object".

This approach leads to several facilities that are fundamental to the use of SGML and that serve to give SGML a particularly (if not uniquely) robust storage model:

  1. The separation of the logical or abstract model of storage and any real storage manager
  2. At least two levels of possible indirection from a reference to a storage object to its local storage location.
  3. The concept that all storage objects have an associated "data content notation" that names the rules governing the interpretation of the data (roughly, "data type" or MIME type).

In SGML (and HyTime) all addressing starts with references to the storage objects that contain the things to be addressed (the reference may be implicit, however, as in the case of SGML ID references). For example, to address an SGML element, you must first address the document entity that contains that element. To address a part in a STEP repository, you must first address the storage object that is the repository.

TERMINOLOGY

The SGML standard (augmented by ISO/IEC 10744:1997 Annex A.6 Formal System Identifiers) defines the following terms related to storage objects (these definitions are paraphrases, see the standard for the formal definitions):

container storage manager
A storage manager that exists as a storage object in some other storage manager. E.g., repositories, ZIP files, tar archives, JAR files, etc. All repositories are, by definition, container storage managers.
document
An SGML document. An SGML document is a single (logical) storage object. SGML documents are character strings that nominally start with "<!SGML", making them, in theory, self describing. However, the SGML standard allows both the "<!SGML" and "<!DOCTYPE" declarations to be omitted. For documents of this type, it is simply an assertion that a given storage object is in fact an SGML document.
entity
An abstract storage object. Entities are either "text entities", meaning fragments of SGML documents, or "data entities", meaning entities that are standalone objects (either non-SGML data or complete SGML documents). SGML subdocument entities are a special form of data entity with an assumed notation of "SGML".
entity declaration
The part of an SGML document that binds a local entity name to the external identifier of the entity. Within a document, entities (storage objects) are referenced by their local entity name.
entity manager
A necessary component of an SGML system that sits between an SGML parser or processing application and the storage managers that store the data. Its purpose is to translate references to entities in SGML documents into data usable by the parser or processing application. The SGML standard does not define anything about entity managers except that they must exist.
external identifier
The part of an entity declaration that identifies, directly or indirectly, the storage object or objects that contain the entity's data. External identifiers may be public identifiers, system identifiers, or both.
formal public identifier
A public identifier that conforms to ISO 8879:1986 and ISO 9070. Formal public identifiers can be made universally unique by using a "registered owner name" as the first field of the identifier. Informal public identifiers may also be universally unique, but the mechanism for ensuring uniqueness would not be defined by SGML (e.g., URNs, UUIDs, etc.).
formal system identifier
A system identifier that conforms to ISO/IEC 10744:1997 Annex A.6, Formal System Identifier Definition Requirements (FSIDR). Can include an invocation of a declared storage manager, including parameters in addition to the storage object identifier.
public identifier
A system-independent identifier for a storage object. Public identifiers must be mapped to local storage objects by some non-SGML-defined means, such as SGML Open (Oasis) entity mapping catalogs. Public identifiers that conform to ISO 9070 are "formal public identifiers". Public identifiers are, conceptually, a form of Uniform Resource Name (URN).
SGML document entity
The entity that contains an SGML document (may be the root of a tree of related entities). An SGML document entity must contain the prolog ("<!SGML" and "<!DOCTYPE" declarations) and the "document instance", that is, the root element of the document. Both the prolog and instance may be organized into multiple subordinate external text entities.
storage manager
A system (normally software) that manages storage objects. Typical storage managers are file systems, databases, and Web servers. In SGML, storage manager types are declared as data content notations. Storage manager instances are declared as data entities.
storage object
A unit of data. In SGML, storage objects are interpreted by entity managers to produce either character streams (for text entities) or byte streams or data pointers (for data entities).
storage object identifier
A storage-manager-specific identifier for a storage object. E.g., a filename, object ID, etc.
system identifier
A system-specific storage object identifier, e.g. a filename, object ID, etc. System identifiers that conform to ISO/IEC 10744:1997, Annex A.6 are "formal system identifiers". They bind a storage manager name to a storage object identifier. The SGML standard (ISO 8879) does not define anything about system identifiers except that they are specified as string literals.

NOTE: The SGML standard conflates internal string macros ("internal text entities") with abstract storage objects ("external entities"). Internal text entities are not relevant to this discussion--their existence is ignored in the following.

ENTITIES AND INDIRECTION

One of the key aspects of entities is their indirection. One of SGML's basic principles is that everything you need to know about a document in order to parse it (and, to a lesser degree, to process it) should be in the document prolog. Thus, references to storage objects from the document's content are indirected through an entity declaration, which binds a local storage object name to an external identifier. The external identifier for the entity then addresses the storage object or objects that contain the entity's data (the "entity body"). The external identifier may be a direct pointer to the storage objects (a "system identifier") or an indirect pointer (a "public identifier").

This approach has several advantages:

  1. The details of storage are separated from the content and concentrated in the document prolog. This makes it possible to change how the data is stored without affecting the document content (the initial references to entities). It also makes it possible to determine all the storage object dependencies for a document by processing only the prolog rather than the entire document. This can be a significant factor when the document content is quite large relative to the prolog (e.g., a document of 1 MB or more). This would almost always be the case for STEP-type repositories exported in SGML/XML form.
  2. A single abstract storage object (entity) can be stored in several storage objects or as a portion of a larger storage object, in addition to being stored as a single storage object. This allows the actual storage to be managed separately from the abstract view of the storage as seen by authors and SGML parsers and processors. Thus, different optimization or packaging techniques can be used for data without affecting the abstract organization of the data. A likely example would be to spread single entities over a number of 2K fields in a relational database to take advantage of the storage optimizations they provide over normal blob storage. Reassembling the 2K chunks into a single virtual data object would be handled by the storage manager and/or entity manager and would be transparent to the SGML parser and processing applications.
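
The second point can be illustrated with a minimal Python sketch. It assumes a hypothetical table of (entity name, sequence number, chunk) rows standing in for the 2K database fields. The names ChunkedStore, put, and get are invented for illustration and do not come from any SGML tool.

```python
# Hypothetical sketch: one abstract entity stored as many fixed-size chunks.
CHUNK_SIZE = 2048  # the "2K fields" mentioned above


class ChunkedStore:
    """Stands in for a relational table of (entity, sequence, chunk) rows."""

    def __init__(self):
        self.rows = []  # list of (entity_name, sequence_number, chunk_bytes)

    def put(self, name, data):
        """Split the entity body into CHUNK_SIZE pieces, one row per piece."""
        for seq, start in enumerate(range(0, len(data), CHUNK_SIZE)):
            self.rows.append((name, seq, data[start:start + CHUNK_SIZE]))

    def get(self, name):
        """Reassemble the chunks, in order, into one virtual storage object."""
        chunks = sorted((seq, chunk) for n, seq, chunk in self.rows if n == name)
        return b"".join(chunk for _, chunk in chunks)


store = ChunkedStore()
body = b"<chapter>" + b"x" * 5000 + b"</chapter>"
store.put("chapter-1", body)
restored = store.get("chapter-1")
```

Code that calls get never learns how many rows were involved, which is the transparency the paragraph describes.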

In addition, SGML requires support for the type of name indirection necessary to support URN-type schemes, that is, schemes for naming storage objects using some form of universal name. All useful and conforming SGML systems (but not necessarily XML systems) must provide some form of public-ID-to-system-ID mapping facility, which most do through the use of SGML Open (Oasis) entity mapping catalogs (a well-established industry standard).
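
As a rough illustration of public-ID-to-system-ID mapping, the following Python sketch reads catalog entries of the form PUBLIC "public id" "system id" and uses them to resolve an external identifier. It is a simplification under stated assumptions: it handles only quoted PUBLIC entries, it lets the first matching entry win, and it always prefers the catalog over a declared system identifier. The file paths are hypothetical.

```python
import re

# Matches one catalog entry of the form: PUBLIC "public id" "system id"
PUBLIC_ENTRY = re.compile(r"""PUBLIC\s+(["'])(.*?)\1\s+(["'])(.*?)\3""", re.S)


def parse_catalog(text):
    """Return a dict mapping public identifiers to system identifiers."""
    mapping = {}
    for match in PUBLIC_ENTRY.finditer(text):
        public_id = " ".join(match.group(2).split())  # normalize white space
        mapping.setdefault(public_id, match.group(4))  # assumption: first entry wins
    return mapping


def resolve(public_id, system_id, mapping):
    """Assumed policy: use the catalog mapping, else the declared system id."""
    return mapping.get(" ".join(public_id.split()), system_id)


# Hypothetical catalog; the system identifiers are made-up local paths.
catalog = '''
PUBLIC "ISO 10303//NOTATION STEP Entity//EN" "/notations/step-entity.txt"
PUBLIC "+//IDN texcel.com//NOTATION Texcel Information Manager//EN"
       "/notations/texcel-im.txt"
'''
mapping = parse_catalog(catalog)
```

Moving a storage object then means editing one catalog entry rather than every document that names the public identifier.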

THE LAYERS IN AN SGML SYSTEM

As indicated in the foregoing sections, all SGML systems consist, in one way or another, of the same layers. At the bottom are the physical storage managers, the systems that actually manage data on storage media (file systems, databases, etc.). Above the storage managers is the entity manager layer. Above the entity manager is the SGML parser and any processing applications. Processing applications talk to the SGML parser to get parsed SGML documents and to the entity manager directly to get data entities:

     Processing Application (Does work with parsed SGML data and other data)
         ! A     ! A
         ! !     V !
         ! !  SGML Parser   (Provides access to parsed SGML documents)
         ! !     ! A
         V !     V !
        Entity Manager      (Provides access to entities)
            !    A
            V    !
        Storage Managers    (Provide access to "raw" data)
Figure 1. Layers in an SGML System

The communication flow is as follows:

  1. A processing application decides to process an SGML document entity. It communicates the storage object identifier (e.g., filename) of the document entity to the SGML parser.
  2. The SGML parser gives the storage object identifier to the entity manager and requests the character string stored in the object. (The parser only operates on character strings and so can only ask for character strings.)
  3. The entity manager interprets the storage object identifier into a set of storage manager/storage object pairs. For each pair (there may be only one), it communicates with the storage manager to request the data in the storage object. Assuming it gets some data, it then interprets the data to produce a character string acceptable to the parser (this might include things like normalizing compressed character sets into normalized byte-length strings, normalizing byte order, concatenating multiple data files, etc.).
  4. The parser interprets the character string as an SGML document, communicating the results of parsing to the processing application (either as a stream of events or by building an in-memory representation of the parsed document {a "grove"}). When it finds a reference to an external text entity, it communicates the external identifier specified by the declaration for that entity to the entity manager, which applies step (3) to it.
  5. The processing application interprets the data from the SGML parser. It finds something that it interprets as a reference to a data entity (e.g., an attribute with a value prescription of "ENTITY"). Using the information it got from the parser, it looks up the external identifier of the entity. It communicates the external identifier to the entity manager with the request "give me the data for this entity". The entity manager then returns the data (the details of this communication are a function of the API provided by the entity manager--it is not standardized). The processing application then does whatever it wants with the data. If the data is in fact another SGML document, it might perform step (1) on it.

Note that all data access goes through the entity manager, which provides a protective layer between the processing of the data and details of its storage. Of course, this layer may be very thin if it is just a passthrough to the file system or some well-known database or repository system. It is up to document authors to decide how indirect their external identifiers are (that is, do they use public identifiers or system identifiers).
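
The layering and communication flow above can be sketched in Python as follows. This is a toy model rather than a real SGML system: the "parser" only expands &name; references, the identifier form "manager:object+manager:object" is an assumption made up for this sketch, and all class and function names are hypothetical.

```python
import re


class DictStorageManager:
    """Hypothetical storage manager: raw bytes keyed by storage object id."""

    def __init__(self, objects):
        self.objects = objects

    def read(self, storage_object_id):
        return self.objects[storage_object_id]


class EntityManager:
    """Maps identifiers to (storage manager, storage object) pairs."""

    def __init__(self, storage_managers):
        self.storage_managers = storage_managers  # name -> storage manager

    def _pairs(self, identifier):
        # Assumed identifier form: "manager:object[+manager:object...]"
        for part in identifier.split("+"):
            manager_name, object_id = part.split(":", 1)
            yield self.storage_managers[manager_name], object_id

    def get_bytes(self, identifier):
        """Data entities: concatenate the raw data from every pair."""
        return b"".join(sm.read(oid) for sm, oid in self._pairs(identifier))

    def get_text(self, identifier, encoding="utf-8"):
        """Text entities: the parser only ever sees character strings."""
        return self.get_bytes(identifier).decode(encoding)


def parse(entity_manager, identifier, declarations):
    """Toy stand-in for a parser: expands &name; references to text entities."""
    text = entity_manager.get_text(identifier)

    def expand(match):
        # The declaration binds the local entity name to an external identifier.
        return parse(entity_manager, declarations[match.group(1)], declarations)

    return re.sub(r"&([\w.-]+);", expand, text)


files = DictStorageManager({"doc.sgm": b"<doc>&chap1;</doc>"})
database = DictStorageManager({"row1": b"<chapter>Part one, ",
                               "row2": b"part two</chapter>"})
manager = EntityManager({"file": files, "db": database})
declarations = {"chap1": "db:row1+db:row2"}
document = parse(manager, "file:doc.sgm", declarations)
```

Neither parse nor its caller touches a storage manager directly. Everything goes through the EntityManager, which also hides the fact that the chapter entity is stored as two database rows.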

STORAGE MANAGERS, REPOSITORIES AND GOD DOCUMENTS

One key aspect of SGML's abstract storage model is that it is not closed over SGML data: you can use it to model any kind of data regardless of how it is stored. The entity concept provides a neutral abstraction for storage objects and the storage manager concept provides a neutral abstraction for data management systems. The entity declarations used in SGML documents provide a generic syntactic connection between references to the abstractions and references to the realities that underlie them.

In short, any repository of any sort can be a storage manager. SGML simply requires that you formally declare it before you invoke it. The declaration provides just enough information that processing systems can have some hope of mapping storage identifiers to storage managers and human observers have some hope of understanding how the data is supposed to work.

In essence, a storage manager is nothing more than another entity that itself contains entities. Thus, any system of storage managers (repositories) and storage objects can be modeled by creating an SGML document that declares the storage managers and declares as entities the storage objects they contain. Such a document can be termed a "god" document because it is the one document that describes all the other entities in the system. In theory, the universe could be represented by a document that declares all the objects in the universe as entities--thus "god" document.

Another way to think about this is that the only way to address storage objects in SGML is in the context of their declaration as entities in some document. Thus there must exist an unnamed (and unnamable in an SGML-only context) document that serves as the ultimate context for referring to all other entities.

In the context of a single repository, the repository itself is the god document: it knows about all the storage objects inside itself and therefore represents a closed universe of entities. The repository is not itself namable from within itself nor does it need to be.

However, as soon as you have two repositories, there must be a larger context in which both repositories exist as named objects so that references can be made between them. Thus, there must be a repository that contains both repositories as storage objects. This repository can be an SGML document that declares both repositories as entities, thus establishing a common name space in which both repositories can be referenced by name.

For any set of repositories, it is always possible to synthesize an SGML document that declares them all as entities. This document can itself be declared as an entity by another document to form an even larger name space, and so on ad infinitum.

Thus, SGML provides a simple, standardized mechanism for creating hierarchies of repositories (entity name spaces) without limit. By the same token, SGML provides a standardized mechanism for addressing storage objects of any type (and by application of a related mechanism, any identifiable logical component within those storage objects). This mechanism can be used as part of the task of integrating SGML content with non-SGML content or simply as a way of representing and interchanging the organization of repositories.

APPROACH IS NOT LIMITED TO USE WITHIN SGML DOCUMENTS

The SGML approach need not be limited to SGML documents (although SGML documents provide a well-standardized and supported representation syntax). Any data representation can get the same benefits by providing the same levels of abstraction and indirection mechanisms for pointing to storage objects. There is no magic to the SGML syntax beyond its standardization.

SAMPLE GOD DOCUMENT

The following is a sample god document that binds together three repositories: a STEP repository, a relational database, and an SGML document repository. It uses the syntax of formal system identifiers, documented in <http://www.ornl.gov/sgml/wg8/document/n1920/html/clause-A.6.html>. Note that it is sufficient to provide the entity declarations in order to define the storage object relationships--there is no need to also create instance elements that reference the entities. This interpretation is formalized by the "bounded object set" facility of the HyTime Architecture (see <http://www.ornl.gov/sgml/wg8/document/n1920/html/clause-6.2.html>). This god document is then used by a second document in the service of creating a hyperlink between objects in different repositories.

<!DOCTYPE god-document [
  <!-- Declare list of storage manager notations defined in this doc: -->
  <?IS10744 FSIDR
    COM-UUID
    generic-STEP-repository
    STEP-part-21
    odbc-db
    texcel-information-manager
  >
  <!--=================================================
      Storage manager notations:
      =================================================-->

  <!NOTATION COM-UUID
    PUBLIC "Microsoft COM UUID object name space"
  >
  <!ATTLIST #NOTATION COM-UUID
    fsidr  -- Define mapping to base storage manager forms in HyTime std. --
      NAME
      #FIXED "localsm" -- Local storage manager --
  >

  <!NOTATION generic-STEP-repository
    PUBLIC "ISO 10303//NOTATION Repository of STEP Entities//EN"
  >
  <!ATTLIST #NOTATION generic-STEP-repository
    in       -- Container entity that contains the storage object --
      NAME
      #REQUIRED
    fsidr  -- Define mapping to base storage manager forms in HyTime std. --
      NAME
      #FIXED "contnrsm"  -- Container storage manager --
  >

  <!NOTATION STEP-part-21
    PUBLIC "ISO 10303-21//NOTATION Character Representation of STEP
            Repositories//EN"
  >
  <!ATTLIST #NOTATION STEP-part-21
    in       -- Container entity that contains the storage object --
      NAME
      #REQUIRED
    fsidr  -- Define mapping to base storage manager forms in HyTime std. --
      NAME
      #FIXED "contnrsm" -- Container storage manager --
  >

  <!NOTATION odbc-db
    PUBLIC "-//ODBC Owner//NOTATION Object Database B?? C???//EN"
  >
  <!ATTLIST #NOTATION odbc-db
    in       -- Container entity that contains the storage object --
      NAME
      #REQUIRED
    fsidr  -- Define mapping to base storage manager forms in HyTime std. --
      NAME
      #FIXED "contnrsm" -- Container storage manager --
  >

  <!NOTATION texcel-information-manager
    PUBLIC "+//IDN texcel.com//NOTATION Texcel Information Manager//EN"
  >
  <!ATTLIST #NOTATION texcel-information-manager
    in       -- Container entity that contains the storage object --
      NAME
      #REQUIRED
    fsidr  -- Define mapping to base storage manager forms in HyTime std. --
      NAME
      #FIXED "contnrsm" -- Container storage manager --
  >


  <!--==================================================
      Storage object notations:
      ==================================================-->

  <!NOTATION SGML
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language//EN"
  >
  <!NOTATION STEP-entity
    PUBLIC "ISO 10303//NOTATION STEP Entity//EN"
  >
  <!NOTATION customer.record
    PUBLIC "My private structure for customer records in Oracle"
  >


  <!--=====================================================
      Now declare some repository instances:
      =====================================================-->

  <!ENTITY car-parts -- STEP repository, named using its COM unique ID --
    SYSTEM "<COM-UUID>{00000306-0000-0000-C000-000000000046}"
    NDATA generic-STEP-repository
  >

  <!ENTITY plane-parts -- Part 21 file --
    SYSTEM "<osfile>/data/STEP/repositories/aircraft/parts/engine.p21"
    NDATA STEP-part-21
  >

  <!ENTITY boat-parts -- Part 21 file --
    SYSTEM "<osfile>/data/STEP/repositories/boats/parts/hull.p21"
    NDATA STEP-part-21
  >

  <!ENTITY cust-database -- Relational database of customers --
    SYSTEM "cust-database" -- Value is a database name as defined by ODBC --
    NDATA odbc-db
  >

  <!ENTITY im-one   -- Texcel IM repository, named using Windows filename --
    SYSTEM "<osfile>c:/texcel/repositories/im-one.txc"
    NDATA texcel-information-manager
  >

  <!--===================================================
      Now declare some entities in those repositories:
      ===================================================-->

  <!ENTITY sump-pump.DC-3245-0-Z
    -- STEP representation of a sump pump part no. DC-3245-0-Z --
    SYSTEM "<generic-STEP-repository in='car-parts'>OID-87420890001"
    NDATA STEP-entity
    -- System ID is an object ID in the repository. --
  >

  <!ENTITY oil-pump.BRR-34G789-10-23
    -- STEP representation of an oil pump part no. BRR-34G789-10-23 --
    SYSTEM "<STEP-part-21 in='plane-parts'>#361"
    NDATA STEP-entity
    -- System ID is an ID in the Part 21 file. --
  >

  <!ENTITY pwc.hull-2345-C
    -- STEP representation of a boat hull part no. hull-2345-C --
    SYSTEM "<STEP-part-21 in='boat-parts'>#23"
    NDATA STEP-entity
    -- System ID is an ID in the Part 21 file. --
  >

  <!ENTITY three-initial-corp -- Customer database entry for TIC --
    SYSTEM "<odbc-db in='cust-database'>(select where CUSTID is 'tic-0000001')"
    NDATA customer.record
  >

  <!ENTITY maint-and-repair -- Maintenance and repair manual --
    SYSTEM "<texcel-information-manager in='im-one'>RID 10000004556"
    NDATA SGML
  >
  <!ELEMENT god-document
    - O
    EMPTY -- No need for content --
  >
]>
<god-document>

This may seem like a lot of declaration, but in practice the storage manager and data notations would be relatively static and shared across a large number of documents. In addition, you would normally have a large number of storage object entities relative to the number of storage managers and data notations, which this example does not show.

The first set of notations declares a set of storage manager notations. The external identifiers for the notations should resolve to the authoritative specifications for those storage manager types. The storage manager declarations also serve to declare the attributes used to configure and describe the storage manager instances.

For each storage manager, the attribute "fsidr" indicates the base type of storage manager it is. These types are defined in the FSI annex of the HyTime standard. The two types used here are "localsm" (local storage manager) and "contnrsm" (container storage manager).

Container storage managers require the declaration of an attribute nominally named "in". The "in" attribute is used in the declaration of entities contained by the container storage managers to indicate which storage manager instance they are in (more about which below).

The second set of notations declares the data content notations to be used for entities. There is no necessary relation between the storage manager notations and the data entity notations. In this example, there are two different storage managers for STEP entities, just to make the point. I could also have several different repositories for SGML documents, for example.

The third set of declarations is the declaration of the actual container storage manager entities (repository instances). Note that these are entity declarations, not notation declarations. From the SGML point of view, a repository is just another storage object, one that happens to contain other storage objects. Note that the notations of the storage manager entities are the storage manager notations declared above them. For example, there are two different Part 21 repositories, one for plane parts and one for boat parts (representing repositories provided by two different suppliers). Note also that the system IDs for the storage managers are direct references to storage objects, either in the local file system (the "osfile" storage manager, defined in Annex A.6), in the COM UUID name space, or in the ODBC name space. You could, of course, have one container storage manager in another.

The last set of declarations declares the data entities themselves. Each data entity uses a formal system identifier to name the type of storage manager that contains it, the storage manager instance it's in, and the storage-manager-specific identifier for the storage object. As should be obvious, storage managers are named using start-tag syntax within the system identifier literal. The storage manager name (the "GI" of the tag) must be the name of a declared storage manager notation. The "in" attribute names a declared storage manager entity whose notation is the same as the storage manager name.

Thus, from the entity declaration you can get the containing storage manager from the value of the "in" attribute and from that get the name of the storage manager type. An entity manager must provide some way to correlate storage manager notation external IDs (system IDs or public identifiers) to code. For example, you might have registered a Part 21 parser for the STEP-part-21 storage manager notation. To get a particular entity from the Part 21 file, the entity manager would presumably give the Part 21 parser the name of the particular Part 21 file and wait to get the data back (however that might be).
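
To make the lookup concrete, here is a minimal Python sketch that splits a system identifier of the kind used above into its storage manager name, attributes, and storage object identifier, then follows the "in" attribute from an entity to its container. It covers only the simple single-tag form shown in this example, not the full formal system identifier syntax of Annex A.6. The function names parse_fsi and locate are hypothetical.

```python
import re

# One leading start-tag: <name attr='value' ...>
TAG = re.compile(r"""^<([\w.-]+)((?:\s+[\w.-]+\s*=\s*(?:'[^']*'|"[^"]*"))*)\s*>""")
ATTR = re.compile(r"""([\w.-]+)\s*=\s*(?:'([^']*)'|"([^"]*)")""")


def parse_fsi(system_id):
    """Split a system identifier into (storage manager, attributes, object id)."""
    match = TAG.match(system_id)
    if match is None:  # no storage manager tag: treat as an informal system id
        return None, {}, system_id
    attrs = {name: single or double
             for name, single, double in ATTR.findall(match.group(2))}
    return match.group(1), attrs, system_id[match.end():]


# Entity name -> system identifier, copied from the god document above.
ENTITIES = {
    "plane-parts": "<osfile>/data/STEP/repositories/aircraft/parts/engine.p21",
    "oil-pump.BRR-34G789-10-23": "<STEP-part-21 in='plane-parts'>#361",
}


def locate(entity_name):
    """Follow "in" attributes outward, returning the chain of containers."""
    chain = []
    while entity_name is not None:
        manager, attrs, object_id = parse_fsi(ENTITIES[entity_name])
        chain.append((manager, object_id))
        entity_name = attrs.get("in")
    return chain


chain = locate("oil-pump.BRR-34G789-10-23")
```

An entity manager built along these lines would use each storage manager name in the chain to select the registered code, for instance a Part 21 parser for STEP-part-21.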

It should be clear from this example that any system of related repositories and data objects can be represented using these techniques. Everything in this example is standardized except for the storage-manager-specific system identifiers, which are, of course, non-standard. However, they may be well defined and well understood (e.g., Microsoft COM, ODBC, etc.).

Finally, note that the god document shown above can itself be used as a central repository of addressable objects from SGML documents using normal HyTime-defined indirect addressing, avoiding the need to duplicate these declarations in multiple documents. For example, to create a document that links some part of the maintenance manual to one of the parts, you could do this:

<!DOCTYPE Proc-to-Part-Link [
<!-- Declare use of the HyTime architecture: -->
<?IS10744 arch
  name="hytime"
  public-id="ISO/IEC 10744:1997//NOTATION Hypermedia/Time-based Structuring Language (HyTime)//EN"
  dtd-public-id="ISO/IEC 10744:1997//DTD AFDR Meta-DTD Hypermedia/Time-based Structuring Language (HyTime)//EN"
  form-att="HyTime"
  renamer-att="HyNames"
  doc-elem-form="HyDoc"
  options="hylink nmsploc"
>

<!-- Declare god document we'll use for pointing to entities indirectly: -->
<!ENTITY god.doc
  SYSTEM "<osfile>/data/repositories/god.docs/god.doc.sgm"
  SUBDOC -- Avoid need to declare SGML notation. SUBDOC doesn't really
            mean anything special. --
>

<!-- Element type declarations omitted. See HyTime architecture for
     nmsploc and hylink element types. -->
]>
<Proc-to-Part-Link
  procedure="pump-install-task"
  part="oil-pump.BRR-34G789-10-23"
  HyTime="hylink"
>
<desc>Relates the installation procedure to the part being installed</desc>

<!-- Indirection to maintenance and repair manual document: -->
<nmsploc
  id="maint.manual"
  namespc="entities"
  locsrc="god.doc">maint-and-repair</nmsploc>

<!-- Indirection to pump installation task within maintenance manual: -->
<nmsploc
  id="pump-install-task"
  namespc="elements"
  locsrc="maint.manual">proc.000001</nmsploc>

<!-- Indirection to pump entity: -->
<nmsploc
  id="oil-pump.BRR-34G789-10-23"
  namespc="entities"
  locsrc="god.doc">oil-pump.BRR-34G789-10-23</nmsploc>
</Proc-to-Part-Link>

Notice that for this document that represents a single hyperlink, all we had to do was declare the god document as an entity and set up the indirect addresses to get to the entities declared in the god document. This serves to further centralize the details of data storage so that only the god document need change when the data moves from location to location. For example, if the plane parts, which are currently in a Part 21 file, were imported into a STEP repository, only the declaration for the oil pump entity in the god document would change--the hyperlink document above would be unaffected.
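
The effect of this centralization can be mimicked with a small Python sketch. It is a loose model, not an implementation of HyTime location addressing: the two dictionaries stand in for the god document's entity declarations and the link document's nmsploc elements, and the replacement system identifier (including the 'plane-repository' entity name and the OID-EXAMPLE value) is hypothetical.

```python
# Declarations in the god document: entity name -> system identifier.
god_doc = {
    "maint-and-repair": "<texcel-information-manager in='im-one'>RID 10000004556",
    "oil-pump.BRR-34G789-10-23": "<STEP-part-21 in='plane-parts'>#361",
}

# nmsploc elements in the link document: id -> (namespc, locsrc, name).
link_doc = {
    "maint.manual": ("entities", "god.doc", "maint-and-repair"),
    "pump-install-task": ("elements", "maint.manual", "proc.000001"),
    "oil-pump.BRR-34G789-10-23": ("entities", "god.doc",
                                  "oil-pump.BRR-34G789-10-23"),
}


def resolve(loc_id):
    """Return a readable description of what a location address points to."""
    namespc, locsrc, name = link_doc[loc_id]
    if namespc == "entities":
        # Assumes every entity-name location uses god.doc as its source.
        return "entity at " + god_doc[name]
    # Element names are looked up inside whatever the location source resolves to.
    return "element " + name + " in " + resolve(locsrc)


link_doc_before = dict(link_doc)
before = resolve("oil-pump.BRR-34G789-10-23")

# The plane parts move into a STEP repository: only the god document changes.
god_doc["oil-pump.BRR-34G789-10-23"] = (
    "<generic-STEP-repository in='plane-repository'>OID-EXAMPLE"  # hypothetical
)
after = resolve("oil-pump.BRR-34G789-10-23")
task = resolve("pump-install-task")
```

The link document dictionary is identical before and after the move, while the resolved address of the oil pump changes.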

CONCLUSION

SGML's abstract storage model and entity declaration syntax, coupled with the Formal System Identifier facility of ISO/IEC 10744:1997, provide a robust mechanism for representing systems of repositories and storage objects, regardless of their data types.