[This local archive copy mirrored from: http://www.stg.brown.edu/webs/tei10/tei10.papers/Simonspaper.html; see the canonical version of the document.]

Text Encoding Initiative
Tenth Anniversary User Conference

Using architectural forms to map TEI data into an object-oriented database

Gary F. Simons
Summer Institute of Linguistics
gary_simons@sil.org

Abstract

This paper develops a solution to the problem of importing existing TEI data into an existing object-oriented database schema without changing the TEI data or the database schema. After investigating the general problem of where the mismatch lies between the SGML model and the object model, the paper proposes a solution based on architectural processing. Two meta-DTDs are used, one to define the architectural forms for the object model and another to map the existing SGML data onto those forms. A full example using a critical text in TEI markup is developed.

Much of the promise of SGML lies in the fact that descriptively marked up data can be interchanged freely and used by multiple applications for analytical processing or publication formatting. Indeed, this is part of the motivation behind the Text Encoding Initiative's Guidelines for Electronic Text Encoding and Interchange [TEI94]. Because an SGML DTD has much in common with the conceptual model that results from an object-oriented analysis of a problem domain, one would expect SGML data to be particularly amenable to import into software that uses an object-oriented data model. This is not a trivial task, however, since there are some fundamental differences between the SGML model of data and the object model.

The paper explores this general problem as it develops a solution to a more specific problem, namely, how to import existing SGML data into an existing object-oriented database schema without changing either the SGML data or the database schema. The target system is an object-oriented database system named CELLAR (for Computing Environment for Linguistic, Literary, and Anthropological Research [RST93]). The solution uses architectural processing to map the SGML data onto architectural forms that the CELLAR system can use to construct the appropriate structure of objects.

Section 1 of the paper discusses the basic differences between the SGML model of data and the object model, and illustrates why the mapping from SGML elements to objects is not a trivial one. Section 2 introduces the DTD for an architecture that maps SGML data onto objects. Section 3 gives a complete example of the automated process by which the SGML data are mapped onto this architectural DTD via an intermediate meta-DTD that encodes the mapping. The example used is that of a critical text edition encoded in TEI format. Finally, section 4 discusses the implementation and the results that have been achieved thus far.

The SGML model versus the object model

The problems inherent in importing SGML data into an object database stem from the differences between the SGML model of data and the object model of data. In speaking of the "object model of data," I am referring specifically to the way object databases [Cat97] and conceptual modeling languages [Bor85] represent information. Such systems replace the simple instance variables of an object-oriented programming language with attributes that encapsulate integrity constraints and the semantics of relationships to other objects.

The SGML model in a nutshell

In SGML, the fundamental unit of data representation is the element. Each element must have a generic identifier; it may optionally have a number of attributes or content or both. Each attribute has a name and a value; the value is represented by a string of characters. The content of an element may consist of character data or embedded elements or a combination of both. These generalizations may be expressed in terms of the following declarations:

<!ELEMENT element   - - (attr* & content?)   >
<!ATTLIST element   gi    NAME #REQUIRED     >

<!ELEMENT attr      - O  EMPTY               >
<!ATTLIST attr      name  NAME #REQUIRED
                    value CDATA #IMPLIED     >

<!ELEMENT content   - - (#PCDATA | element)* >

The object model in a nutshell

In the object model, the fundamental unit of data representation is the object. Each object must have a class, and is either a primitive object that stores primitive data like a string or a number, or is a complex object that has attributes. Each attribute has a name and a value; the value consists of embedded objects. These generalizations may be expressed in terms of the following declarations:

<!ELEMENT object    - - (attr)*                     >
<!ATTLIST object    class NAME #REQUIRED            >

<!ELEMENT attr      - - (primitiveObject | object)* >
<!ATTLIST attr      name  NAME #REQUIRED            >

<!ELEMENT primitiveObject    - - (#PCDATA)          >
<!ATTLIST primitiveObject    class NAME #REQUIRED   >

An unsatisfactory default mapping from elements to objects

Element and object are superficially similar: generic identifier corresponds to class, both have attributes, and both occur recursively. They differ fundamentally, however, in the nature of the attributes and the recursion. With elements, the attributes cannot contain embedded structure; the recursion of elements is allowed only within the content of an element. With objects, there is no specialized notion of content; rather, the recursive embedding of further objects takes place within the attributes.

An SGML document following the model of section 1.1 can be automatically mapped onto the object model of section 1.2 by making four transformations:

Convert every instance of <element gi=X>...</element> to <object class=X>...</object>.
Convert every instance of <attr name=X value=Y> to <attr name=X><primitiveObject class="String">Y</primitiveObject></attr>.
Convert every instance of <content>...</content> to <attr name="content">...</attr>.
Embed every instance of #PCDATA within the tags <primitiveObject class="String">...</primitiveObject>.

For example, the following sample SGML element contains an instance of each of the four conditions listed above:

<phrase rend="ital">an italic phrase</phrase>

Following the nutshell model of SGML in section 1.1, this corresponds to the following semantic representation:

<element gi="phrase">
   <attr name="rend" value="ital">
   <content>an italic phrase</content>
</element>

This would be converted into the following object representation by the proposed default mapping:

<object class="phrase">
   <attr name="rend">
      <primitiveObject class="String">ital</primitiveObject>
   </attr>
   <attr name="content">
      <primitiveObject class="String">an italic phrase</primitiveObject>
   </attr>
</object>
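
The four transformations are mechanical enough to sketch in a few lines of Python. The following is only an illustration of the default mapping, not part of CELLAR: the Element class, the dictionary representation of objects, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """An SGML element in the nutshell model of section 1.1 (hypothetical class)."""
    gi: str
    attrs: dict = field(default_factory=dict)    # attribute name -> string value
    content: list = field(default_factory=list)  # PCDATA strings and Elements


def string_object(text):
    """Wrap a character string as a primitive object of class String."""
    return {"class": "String", "value": text}


def default_map(element):
    """Apply the default mapping of section 1.3 to one element, recursively."""
    # Transformation 1: the generic identifier becomes the class.
    obj = {"class": element.gi, "attrs": {}}
    # Transformation 2: each SGML attribute becomes an attribute holding a String.
    for name, value in element.attrs.items():
        obj["attrs"][name] = [string_object(value)]
    # Transformations 3 and 4: all content lands in one attribute named
    # "content", with PCDATA wrapped as String objects.
    if element.content:
        obj["attrs"]["content"] = [
            string_object(item) if isinstance(item, str) else default_map(item)
            for item in element.content
        ]
    return obj


phrase = Element("phrase", {"rend": "ital"}, ["an italic phrase"])
result = default_map(phrase)
```

Here result has the same shape as the object representation shown above: a phrase object whose rend and content attributes each hold one String.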

An example of the kind of mapping we need

The default transformation described in the preceding section can easily be done on any SGML document, but it will seldom yield a result that actually fits the conceptual model of a target object database. Consider, for instance, the following simplistic SGML document:

<document>
   <creationDate>12-Jun-97</creationDate>
   <title>
      <maintitle>The main title</maintitle>
      <subtitle>a subtitle</subtitle>
   </title>
   <authors>
      <author>
         <name>First Author</name>
         <affil>Some Company</affil>
      </author>
      <author>
         <name>Second Author</name>
         <affil>Another Company</affil>
      </author>
   </authors>
   <p>An introductory paragraph</p>
   <div1><!-- The first section --></div1>
   <div1><!-- The second section --></div1>
</document>

The above represents a typical approach to encoding a document in SGML. But compare it to the following, which is also typical of how a Document class might be defined in an object database:

class Document has
   creationDate : Date
   title        : TitleStatement
   authors      : sequence of Person
   content      : sequence of Paragraph or Division

The default mapping proposed in section 1.3 would first go wrong by putting all the subelements within the document in a single attribute named content; instead we want to map them into four different attributes. The first three subelements (<creationDate>, <title>, and <authors>) correspond to Document attributes of the same name. The remaining subelements (<p> and two instances of <div1>) correspond to objects that go into the Document attribute named content (which happens not to be explicitly tagged). Though the first three subelements correspond to attributes, they differ significantly in the way they do so. <creationDate> additionally carries the information that the embedded PCDATA content should be mapped onto a primitive object of class Date. <title> not only corresponds to the attribute title but also to an object of class TitleStatement (which in turn has attributes maintitle and subtitle). By contrast, <authors> corresponds to the attribute and nothing more; each embedded <author> element corresponds to an object of class Person.

This example illustrates the following fundamental result when comparing the SGML model to the object model: some SGML elements encode an object, some encode an attribute, and still others simultaneously encode both. The basic challenge of importing SGML data into an object database is to determine which of these cases holds for each of the element types occurring in the data, and then to express formally how each maps onto the corresponding classes and attributes of the target database schema.

An architecture for mapping SGML data into objects

The HyTime standard [ISO92] first introduced the concept of architectural forms as a way to associate standardized semantics with elements in user-defined DTDs [DD94]. Now that this notion has been generalized in the SGML Extended Facilities (defined in Annex A of the revised HyTime standard [ISO97]), we can use it to good advantage in solving the problem at hand. Architectural forms provide a mechanism we can use to express the semantics of how SGML elements map onto the object model. See [Cov97] for pointers to other applications of architectural forms.

There are two basic element forms in the architecture, <object> and <attr>. Rather than having a third form for the case when an element corresponds to both an object and an attribute, this case is treated as being a mapping to an object, and the object form adds an architectural attribute to name the attribute it also maps to. A third form, <ignore>, is used for the case when the SGML element does not correspond to anything in the target object model, so that the element content should be processed as though the start and end tags were not there. The definitions of these three forms are given below. (The definition of the architecture is abridged for the sake of this presentation; see [Sim97b] and [Sim97c] for the full definition.)

<!-- CELLAR.DTD (abridged version)
     Meta-DTD of the CELLAR architecture for mapping
     SGML data into CELLAR's object model of data             -->

<!ENTITY % content "object | attr | ignore | #PCDATA"           >

<!--                                                          --
  -- OBJECT: the element corresponds to an object in CELLAR   --
  --                                                          -->
<!ELEMENT object - -  (%content;)*                              >
<!ATTLIST object
     class       -- Create this class of CELLAR object        --
                 CDATA #REQUIRED
     parentAttr  -- Put the object in this attr of its parent --
                 CDATA #IMPLIED  
     contentAttr -- Put embedded objects in this attribute    --
                 CDATA #IMPLIED
     pcdataClass -- Create this class for embedded PCDATA     --
                 CDATA "String"  
     encoding    -- Put embedded strings in this encoding     --
                 CDATA #IMPLIED  
     id          -- A unique identifier for this object       --
                 ID    #IMPLIED
     attrName    -- Set this attribute of the object ...      --
                 CDATA #IMPLIED
     attrValue   -- ... to this value                         --
                 CDATA #IMPLIED
     attrType    -- The value is an IDREF or of named class   --
                 CDATA "String"                                 >

<!--                                                          --
  -- ATTR: the element corresponds to an attribute in CELLAR  --
  --                                                          -->
<!ELEMENT attr - -    (%content;)*                              >
<!ATTLIST attr
     contentAttr -- Put embedded objects in this attribute    --
                 CDATA #IMPLIED
     pcdataClass -- Create this class for embedded PCDATA     --
                 CDATA "String"  
     encoding    -- Put embedded strings in this encoding     --
                 CDATA #IMPLIED                                 >

<!--                                                          --
  -- IGNORE: the element corresponds to nothing in CELLAR;    --
  --         ignore it at this level, but process its content --
  --                                                          -->
<!ELEMENT ignore  - - (%content;)*                              >

The easiest way to explain these forms is by example. In the illustrative document in section 1.4, the <document> element corresponds to an object of class Document; the element content (unless an embedded element names a specific target attribute) goes into the content attribute of the object. The <document> element would be augmented as follows to indicate its mapping into the object model:

<document cellar=object class="Document" contentAttr="content">

This says that in the architecture named cellar, this <document> element corresponds to an <object> element whose class is "Document" and whose contentAttr is "content".

The <creationDate> element corresponds to an attribute. Its content goes into the creationDate attribute, and the embedded PCDATA needs to be converted into Date objects. Thus,

<creationDate cellar=attr contentAttr="creationDate" pcdataClass="Date">

The <title> element corresponds to a TitleStatement object, but it also corresponds to an attribute in that it maps into the title attribute of its parent object (that is, the Document). Thus,

<title cellar=object class="TitleStatement" parentAttr="title">

Finally, the <authors> element corresponds to the authors attribute; thus,

<authors cellar=attr contentAttr="authors">
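
To make the interplay of class, parentAttr, contentAttr, and pcdataClass concrete, here is a minimal Python sketch of how a processor might interpret the three forms. It is not the CELLAR implementation. The tuple representation of elements, the function name build, and every table entry marked "assumed" are inventions for this example; only the first four entries come from the discussion above.

```python
# Architectural attributes per element type. Entries marked "assumed" are
# not given in the text and are supplied only to complete the example.
FORMS = {
    "document":     {"cellar": "object", "class": "Document", "contentAttr": "content"},
    "creationDate": {"cellar": "attr", "contentAttr": "creationDate", "pcdataClass": "Date"},
    "title":        {"cellar": "object", "class": "TitleStatement", "parentAttr": "title"},
    "authors":      {"cellar": "attr", "contentAttr": "authors"},
    "maintitle":    {"cellar": "attr", "contentAttr": "maintitle"},                     # assumed
    "subtitle":     {"cellar": "attr", "contentAttr": "subtitle"},                      # assumed
    "author":       {"cellar": "object", "class": "Person"},                            # assumed
    "name":         {"cellar": "attr", "contentAttr": "name"},                          # assumed
    "p":            {"cellar": "object", "class": "Paragraph", "contentAttr": "text"},  # assumed
}


def build(node, owner, target_attr, pcdata_class="String"):
    """Add whatever `node` encodes to `owner`, the nearest enclosing object.

    `target_attr` is the attribute of `owner` that currently receives
    embedded objects; `pcdata_class` is the class created for PCDATA.
    Returns the new object for an <object> form, otherwise None.
    """
    if isinstance(node, str):  # PCDATA becomes a primitive object
        owner["attrs"].setdefault(target_attr, []).append(
            {"class": pcdata_class, "value": node})
        return None
    gi, children = node
    form = FORMS[gi]
    if form["cellar"] == "object":
        obj = {"class": form["class"], "attrs": {}}
        if owner is not None:
            # parentAttr, when present, overrides the enclosing target attribute.
            slot = form.get("parentAttr", target_attr)
            owner["attrs"].setdefault(slot, []).append(obj)
        for child in children:
            build(child, obj, form.get("contentAttr"), form.get("pcdataClass", "String"))
        return obj
    if form["cellar"] == "attr":
        # No object is created; the content is redirected to the named attribute.
        for child in children:
            build(child, owner, form["contentAttr"], form.get("pcdataClass", "String"))
        return None
    # ignore: process the content as though the tags were not there.
    for child in children:
        build(child, owner, target_attr, pcdata_class)
    return None


doc = ("document", [
    ("creationDate", ["12-Jun-97"]),
    ("title", [("maintitle", ["The main title"]), ("subtitle", ["a subtitle"])]),
    ("authors", [("author", [("name", ["First Author"])]),
                 ("author", [("name", ["Second Author"])])]),
    ("p", ["An introductory paragraph"]),
])
document = build(doc, None, None)
```

The resulting Document object has four attributes (creationDate, title, authors, and content), which is the shape the class definition in section 1.4 calls for. Whitespace handling and error checking are omitted.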

From TEI data file to object database: a complete example

As stated in the introduction, the goal of this work is to import existing SGML data into an existing object-oriented database schema without changing the SGML data or the database schema. This section demonstrates a full example of the process. The SGML data file is a critical edition in TEI markup of a passage from the Second Epistle of Clement. A fuller treatment of this sample text along with examples of what can be done with it in the CELLAR environment is given in [Sim97a].

The input file and its DTD

The file for the critical text is as follows. Note that a significant portion of the content has been elided in the interest of brevity. The Greek text is encoded in TLG beta code.

<!DOCTYPE TEI.2 SYSTEM "textcrit.dtd"> 
<TEI.2>
<text>
<front>
<docTitle>2 Clement, chapter 7</docTitle>
<witlist>
<wit id=A type=Manuscript>Codex Alexandrinus
<bibl>A Greek uncial of the fifth century.  Housed in the British 
Museum.  Published in:  The Codex Alexandrinus in reduced photographic
 facsimile, with an introduction by F. G. Kenyon, London 1909.
</bibl></wit>
<wit id=C type=Manuscript>Codex Constantinopolitanus
   <bibl> . . . </bibl></wit>
<wit id=S type=Manuscript>Syriac Version
   <bibl> . . . </bibl></wit>
<wit id=L type=Edition>Lightfoot 1890
<bibl>Lightfoot, J. B.  1890.  The Apostolic Fathers: Clement,
Ignatius, Polycarp (2nd edition).  Part One: Clement, volume 2, pages 210-261.
Macmillan.  (Reprinted 1989 by Hendrickson Publishers, Peabody, MA)
</bibl></wit>
<wit id=Lb type=Edition>Loeb edition
   <bibl> . . . </bibl></wit>
<wit id=B type=Edition>Bihlmeyer 1970
<bibl> . . . </bibl></wit>
<wit id=W type=Edition>Wengst 1984
   <bibl> . . . </bibl></wit>
</witlist>
</front>
<body>
<div n=7>
<!-- ***************** Verse 1 ********************* -->
<s n=1>
w(/ste
<app><rdg wit='A L Lb B'>ou)=n</rdg>
   <rdg wit='C S W'><omit></rdg></app>
a)delfoi/
<app><rdg wit='A L Lb B'>mou</rdg>
   <rdg wit='C W'><omit></rdg></app>
a)gwnisw/meqa ei)do/tej, o(/ti e)n xersi\n o(
<app><rdg wit='C S L Lb B W'>a)gw\n</rdg>
   <rdg wit='A'>ai)w/n</rdg></app>
kai\ o(/ti ei)j tou\j fqartou\j a)gw=naj kataple/ousin
polloi/, a)ll' ou) pa/ntej stefanou=ntai,
<app><rdg wit='C L Lb B W'>ei) mh\</rdg>
   <rdg wit='A'>oi( mh/</rdg>
   <rdg wit='S'>ei) mh\ mo/non</rdg></app>
oi( polla\ kopia/santej kai\ kalw=j a)gwnisa/menoi.
</s>
<!-- and so forth for remaining verses  -->
</div>
</body></text>
</TEI.2>

The DTD for this file is the following:

<!-- TextCrit.DTD

     A DTD for encoding a text critical edition.  All tags
     are from the TEI guidelines (Text Encoding Initiative).
     The content models have been simplified to deal only
     with the tags needed for the sample text of II Clement.
     The aim is to faithfully represent the TEI scheme of
     markup without having to deal with the huge TEI DTD.

     This DTD reflects the "Parallel segmentation method"
     of encoding.  See section 19.2.3 of the TEI Guidelines.

     Gary Simons, Summer Institute of Linguistics
     Last revised: 18 October 1997                      -->  

<!ELEMENT TEI.2     - - ( text )              >

<!ELEMENT text      - - ( front, body )       >

<!ELEMENT front     - - ( docTitle, witList ) >

<!ELEMENT docTitle  - - (#PCDATA)             >

<!ELEMENT witList   - - ( wit+ )              >

<!ELEMENT wit       - - ( #PCDATA, bibl? )    >
<!ATTLIST wit       id   ID    #REQUIRED
                    type CDATA #REQUIRED      >

<!ELEMENT bibl      - - (#PCDATA)             >

<!ELEMENT body      - - ( div+ )              >

<!ELEMENT div       - - ( s+ )                >
<!ATTLIST div       n  CDATA  #IMPLIED        >

<!ELEMENT s         - - ( #PCDATA | app )+    >
<!ATTLIST s         n  CDATA  #IMPLIED        >

<!ELEMENT app       - - ( rdg+ )              >

<!ELEMENT rdg       - - ( #PCDATA | omit )    >
<!ATTLIST rdg       wit  IDREFS  #REQUIRED    >

<!ELEMENT omit      - O  EMPTY                >

The target object model

The conceptual model for the objects and attributes into which we want to import the input file is diagrammed below. The notation and the model are explained in [Sim97a]. Here suffice it to say that solid arrows mean "contains" and the dotted arrow means "holds pointers to."

[Figure: conceptual model of the target objects and attributes (Textcrit.gif)]

The automated process for mapping elements to objects

At the outset, two DTDs are given. For this example they are:

textcrit.dtd: The original DTD for the SGML document to be imported to the object database (see section 3.1); this is called the client DTD in the HyTime standard.
cellar.dtd: The meta-DTD for the CELLAR architecture (see section 2); this is called the architectural DTD.

To perform the automatic mapping from the client DTD to the architectural DTD, two additional DTDs must be defined:

my-textcrit.dtd: A substitute for textcrit.dtd which adds an invocation of architectural processing features to the original client DTD (see section 3.3.1 below).
map-textcrit.dtd: The meta-DTD which maps the elements and attributes of the client DTD onto the elements and attributes of the architectural DTD (see section 3.3.2 below).

The process for automatically mapping a client document onto its corresponding architectural document follows these steps:

Create a modified version of the client DTD that invokes architectural processing.
Create an intermediate DTD that maps from the client DTD to the architectural DTD.
Associate the client document with the modified DTD.
Run the architecture engine to translate the client document into the corresponding architectural document.

This process is illustrated in the subsections which follow.

Create a client DTD that invokes architectural processing

The input file we are using (from section 3.1) uses a DTD in the file textcrit.dtd. The first step is to define an alternate version of this DTD which invokes the desired architectural processing features. The result is as follows:

<!-- my-textcrit.dtd
     This is a version of textcrit.dtd that invokes
     the mapping to CELLAR architectural forms. -->

<?ArcBase mapping>

<!ENTITY % mappingDTD SYSTEM "map-textcrit.dtd" >
<!NOTATION mapping SYSTEM>
<!ATTLIST #NOTATION mapping
    ArcDocF  NAME  #FIXED "TEI.2" 
    ArcDTD   CDATA #FIXED "%mappingDTD" >

<!ENTITY % originalDTD SYSTEM "textcrit.dtd" >
%originalDTD;

Note that this DTD does not modify the original declarations for the elements and attributes of the client DTD in any way. Rather, it duplicates them exactly by including the original DTD in full at the end. The purpose of this version of the DTD is to declare that the architecture named mapping is to be used. This is done with the <?ArcBase mapping> processing instruction. Following this is the architectural support declaration. It consists of a notation declaration followed by an attribute definition list that sets options which control the architecture engine. In this case, ArcDocF specifies the generic identifier for the document (top-level) element of the architectural document, and ArcDTD names the file which contains the architectural DTD. For this step in the process, the architectural document is a new version of the <TEI.2> document that adds the attributes for the CELLAR architectural forms.

Create the mapping DTD

The second DTD to be created is a meta-DTD that defines the mapping of the elements in the client DTD onto the elements of the architectural DTD. The result for our example is as follows:

<!-- map-textcrit.dtd
     This maps textcrit.dtd onto CELLAR arc forms
     Gary Simons, SIL, 18 Oct 1997 -->

<!afdr "ISO/IEC 10744:1992" --Allow multiple ATTLIST declarations-->

<?ArcBase cellar>
<!ENTITY % cellarDTD SYSTEM "cellar.dtd" >
<!NOTATION cellar SYSTEM>
<!ATTLIST  #NOTATION cellar
    arcDocF  NAME  #FIXED object 
    arcFormA NAME  #FIXED cellar
    arcNamrA NAME  #FIXED cellarNames
    ArcDTD   CDATA #FIXED "%cellarDTD" >

<!ATTLIST TEI.2
     cellar      NAME  #FIXED object
     class       CDATA #FIXED CriticalText    >

<!ATTLIST text
     cellar      NAME  #FIXED ignore          >

<!ATTLIST front
     cellar      NAME  #FIXED ignore          >

<!ATTLIST docTitle
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED title           >

<!ATTLIST witList
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED authorities     >

<!ATTLIST wit 
     cellar      NAME  #FIXED object 
     cellarNames CDATA #FIXED "class type attrValue id"
     attrName    CDATA #FIXED siglum
     attrType    CDATA #FIXED String
     contentAttr CDATA #FIXED description     
  -- id          automatically preserved from
                 client attr of same name --  >

<!ATTLIST bibl 
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED source          >

<!ATTLIST body
     cellar      NAME  #FIXED attr
     contentAttr CDATA #FIXED body            >

<!ATTLIST div
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED CriticalTextChapter
     contentAttr CDATA #FIXED contents
     attrName    CDATA #FIXED n
     attrType    CDATA #FIXED String
     cellarNames CDATA #FIXED "attrValue n"   >

<!ATTLIST s 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED CriticalTextVerse
     contentAttr CDATA #FIXED contents
     attrName    CDATA #FIXED n
     attrType    CDATA #FIXED String
     cellarNames CDATA #FIXED "attrValue n"
     encoding    CDATA #FIXED GKOb            >

<!ATTLIST app
     cellar      NAME  #FIXED object
     class       CDATA #FIXED TextVariation
     contentAttr CDATA #FIXED readings        >

<!ATTLIST rdg
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED Reading
     contentAttr CDATA #FIXED text
     attrName    CDATA #FIXED witnesses
     attrType    CDATA #FIXED IDREFS
     cellarNames CDATA #FIXED "attrValue wit" >

<!ATTLIST omit 
     cellar      NAME  #FIXED object 
     class       CDATA #FIXED String          >

<!ENTITY % originalDTD SYSTEM "textcrit.dtd"  >
%originalDTD;

This DTD declares cellar as the name of its base architecture. The architectural support attributes for this architecture declare that:

object is the top-level document element in the architectural document (ArcDocF),
cellar is the attribute in the client document which names the corresponding architectural form to use in the architectural document (ArcFormA),
cellarNames is the "attribute renamer" attribute (ArcNamrA; see below for an explanation), and
cellar.dtd (from section 2) is the architectural DTD (ArcDTD).

Like the DTD for the original document, this meta-DTD is also for <!DOCTYPE TEI.2>; thus the original DTD is included in full without modification at the end. The <!AFDR> declaration at the beginning instructs the SGML parser to permit duplicate ATTLIST declarations in this meta-DTD; otherwise it would be a syntax error for the DTD to both define an ATTLIST for an element and to read one from the original DTD.

The bulk of this meta-DTD consists of duplicate ATTLIST declarations for the elements in the client DTD. Their purpose is to add declarations for the attributes of the cellar architecture.

The mapping rules use these features that have not already been illustrated or discussed:

The "attribute renamer" (see, for instance, cellarNames under <wit>) takes a list of paired names. The architectural attribute which is the first member of a pair takes on the value of the client attribute which is the second member. Thus the first pair defined for <wit> says that the name for the class of the object to create comes out of the type attribute of the client element.
The three architectural attributes attrName, attrType, and attrValue work together to map an attribute of the client element onto an attribute of the target object. When the attrType is IDREF or IDREFS (as under <rdg>), the resulting value is a set of pointers to the objects associated with the given IDs.
The encoding architectural attribute allows one to build a multilingual database [ST97]. For instance, the declaration under <s> says that all of the strings in the content of <s> (including all its subelements) should be created with the CELLAR language encoding named "GKOb" (for ancient Greek).
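
The effect of the attribute renamer can be sketched in Python. This is a simplified illustration under stated assumptions, not the behavior of any particular architecture engine: the function name apply_renamer and the constant OBJECT_FORM_ATTRS are hypothetical, defaults declared in the architectural DTD (such as pcdataClass "String") are not filled in, and any renamer features beyond plain name pairs are ignored.

```python
# Attribute names declared for the <object> form in cellar.dtd (section 2).
OBJECT_FORM_ATTRS = ("class", "parentAttr", "contentAttr", "pcdataClass",
                     "encoding", "id", "attrName", "attrValue", "attrType")


def apply_renamer(attrs, renamer="cellarNames", arch_attrs=OBJECT_FORM_ATTRS):
    """Return the architectural attributes derived from one client element.

    `attrs` holds every attribute value of the client element, including
    the #FIXED ones contributed by the mapping meta-DTD.
    """
    tokens = attrs.get(renamer, "").split()
    # Each pair reads "architectural-name client-name".
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    result = {}
    for name in arch_attrs:
        # A renamed attribute takes its value from the paired client
        # attribute; otherwise a client attribute of the same name is used.
        source = pairs.get(name, name)
        if source in attrs:
            result[name] = attrs[source]
    return result


# The first <wit> element of the sample text, with the fixed attributes
# that map-textcrit.dtd adds to it.
wit = {
    "id": "A",
    "type": "Manuscript",
    "cellar": "object",
    "cellarNames": "class type attrValue id",
    "attrName": "siglum",
    "attrType": "String",
    "contentAttr": "description",
}
arch = apply_renamer(wit)
```

For this element the class comes from the client attribute type, and attrValue comes from id, so the siglum attribute of the resulting Manuscript object is set to "A". The id attribute itself is carried over unchanged because the client and architectural names coincide.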

Associate the client document with the modified DTD

Before performing the final step of automatic translation, the client document instance must be changed to use the modified DTD defined in section 3.3.1. That is,

<!DOCTYPE TEI.2 SYSTEM "my-textcrit.dtd">
<TEI.2>
   <!-- the content is as in section 3.1 -->
</TEI.2>

Run the architecture engine to translate the document

The final step is to run the architecture engine, which translates a client document instance into an architectural document instance. The parsers in the SP family [Cla97] are able to do this. For instance, the following command line

nsgmls -Amapping clement.sgm

applies just the mapping architecture and results in an output which adds the architectural attributes to the original document instance. The command line

nsgmls -Amapping -Acellar clement.sgm

applies the cellar architecture as well and performs the translation of the client document instance into the corresponding document that uses the object markup system of the CELLAR architecture.

For instance, performing this translation step on the sample Clement text (from section 3.1) yields a document like the following (note that most of the content is elided to avoid excessive detail):

<object class="CriticalText">
   <attr contentAttr="title" pcdataClass="String">
      2 Clement, chapter 7</attr>
   <attr contentAttr="authorities">
      <object class="Manuscript" id="A" contentAttr="description"
         attrName="siglum" attrType="String" attrValue="A"
         pcdataClass="String">
         Codex Alexandrinus
         <attr contentAttr="source" pcdataClass="String">
            A Greek uncial of the fifth century. . . </attr>
      </object>
      <!-- The other six authorities -->
   </attr>
   <attr contentAttr="body">
      <!-- The CriticalTextChapter and its contents -->
   </attr>
</object>

Parsing the architectural document into CELLAR

The final step in the process is to run a method of the CELLAR system that invokes a data input parser that converts the architectural document instance into the corresponding structure of objects. The input to the CELLAR parser is the ESIS output file of the nsgmls parser. At the heart of the implementation is a recursive function of 125 lines that processes one element at a time from the ESIS stream. This function relies on another 125 lines of code in smaller supporting functions. The source code for this parser is listed in full and explained in an electronic working paper [Sim97b].
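
For readers who want a feel for this step, the following Python sketch builds nested objects from ESIS lines. It is not the CELLAR parser documented in [Sim97b]: it uses an explicit stack rather than a recursive function, represents objects as dictionaries, and omits IDREF resolution, the encoding attribute, and most ESIS escape sequences. The sample input is written by hand in the style of nsgmls output (upper-cased names, attribute lines preceding the start tag they belong to) and is an assumption, not a captured run.

```python
import re


def unescape(data):
    # Decode the record-end and backslash escapes only; other escapes
    # (octal codes, SDATA brackets) are left untouched in this sketch.
    return re.sub(r"\\(n|\\)", lambda m: "\n" if m.group(1) == "n" else "\\", data)


def parse_esis(lines):
    """Build nested objects from ESIS lines of an architectural document."""
    pending = {}   # attributes waiting for the next start tag
    root = None
    stack = []     # frames of (owner object, target attribute, PCDATA class)
    for line in lines:
        code, rest = line[:1], line[1:].rstrip("\n")
        if code == "A":                       # attribute of the next element
            name, kind, *value = rest.split(" ", 2)
            if kind != "IMPLIED":
                pending[name.lower()] = value[0] if value else ""
        elif code == "(":                     # start tag
            attrs, pending = pending, {}
            owner, target, pcdata = stack[-1] if stack else (None, None, "String")
            form = rest.lower()
            if form == "object":
                obj = {"class": attrs["class"], "attrs": {}}
                if "attrname" in attrs:       # attrName / attrValue / attrType
                    obj["attrs"][attrs["attrname"]] = [
                        {"class": attrs.get("attrtype", "String"),
                         "value": attrs.get("attrvalue")}]
                if owner is None:
                    root = obj
                else:
                    slot = attrs.get("parentattr", target)
                    owner["attrs"].setdefault(slot, []).append(obj)
                stack.append((obj, attrs.get("contentattr"),
                              attrs.get("pcdataclass", "String")))
            elif form == "attr":
                stack.append((owner, attrs["contentattr"],
                              attrs.get("pcdataclass", "String")))
            else:                             # ignore: keep the current frame
                stack.append((owner, target, pcdata))
        elif code == ")":                     # end tag
            stack.pop()
        elif code == "-":                     # character data
            owner, target, pcdata = stack[-1]
            text = " ".join(unescape(rest).split())
            if text:
                owner["attrs"].setdefault(target, []).append(
                    {"class": pcdata, "value": text})
    return root


ESIS_SAMPLE = """\
ACLASS CDATA CriticalText
(OBJECT
ACONTENTATTR CDATA title
(ATTR
-2 Clement, chapter 7
)ATTR
ACONTENTATTR CDATA authorities
(ATTR
ACLASS CDATA Manuscript
AID TOKEN A
APARENTATTR IMPLIED
ACONTENTATTR CDATA description
AATTRNAME CDATA siglum
AATTRVALUE CDATA A
AATTRTYPE CDATA String
(OBJECT
-Codex Alexandrinus
ACONTENTATTR CDATA source
(ATTR
-A Greek uncial of the fifth century.
)ATTR
)OBJECT
)ATTR
)OBJECT
C
"""
text = parse_esis(ESIS_SAMPLE.splitlines())
```

The sample yields a CriticalText object whose title attribute holds one String and whose authorities attribute holds one Manuscript with siglum, description, and source attributes.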

Conclusion

The CELLAR architecture that has been implemented is actually richer than what is presented above. It also handles cases where:

an element does not correspond to anything in the object model so that it must be discarded along with all its content;
an element actually corresponds to two objects, one embedded within the other; and
the exact mapping relationship is conditioned by the context.

The full architecture, its implementation, and a number of complete examples (including all the files needed to run them) are presented in an electronic working paper [Sim97b].

The results to date have been promising. The goal of developing a general solution to the problem of importing SGML data into an existing object database schema has been achieved. Given the fact that the method permits superfluous markup to be ignored and unmappable elements to be discarded altogether, it is always possible to achieve a translation from an SGML file into a structure of objects in the database. The usefulness of the result depends on the degree of congruence between the conceptual model of the markup for the source data in SGML and that of the schema for the target object database.

Acknowledgments

I am deeply indebted to my colleague Robin Cover who has helped in many ways over the course of this project. He has gone the extra mile in helping me to find resources and in offering useful feedback and encouragement.

Bibliography

[Bor85] Borgida, A. (1985) Features of languages for the development of information systems at the conceptual level. IEEE Software 2(1): 63-72.

[Cat97] Cattell, R.G.G., et al. (1997) The Object Database Standard 2.0. San Francisco: Morgan Kaufmann.

[Cla97] Clark, J. (1997) SP: An SGML System Conforming to International Standard ISO 8879 -- Standard Generalized Markup Language, version 1.2. <http://jclark.com/sp/>. See especially "Architectural form processing," <http://jclark.com/sp/archform.htm>.

[Cov97] Cover, R. (1997) Architectural Forms and SGML Architectures, in The SGML/XML Web Page. <http://www.sil.org/sgml/topics.html#archForms>.

[DD94] DeRose, S. and Durand, D. (1994) Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. See especially pages 79-90.

[ISO92] International Organization for Standardization. (1992) ISO/IEC 10744. Hypermedia/Time-based Structuring Language: HyTime.

[ISO97] International Organization for Standardization. (1997) Architectural Form Definition Requirements (AFDR), Annex A.3 of ISO/IEC N1920, Information Processing--Hypermedia/Time-based Structuring Language (HyTime), Second edition 1997-08-01. <http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html>.

[RST93] Rettig, M., Simons, G., and Thomson, J. (1993) Extended Objects. Communications of the ACM 36(8):19-24.

[Sim97a] Simons, G. (1997) Conceptual modeling versus visual modeling: a technological key to building consensus. Computers and the Humanities 30(4):303-319.

[Sim97b] Simons, G. (1997) Importing SGML data into CELLAR by means of architectural forms. <http://www.sil.org/cellar/import/>.

[Sim97c] Simons, G. (1997) Using architectural forms to map SGML data into an object-oriented database, in Proceedings of SGML/XML '97, Washington, D. C., 8-11 December 1997. See <http://www.gca.org/conf/sgml97/> for conference information.

[ST97] Simons, G., and Thomson, J. (in press) Multilingual data processing in the CELLAR environment. To appear in John Nerbonne (ed.), Linguistic Databases. Stanford, CA: Center for the Study of Language and Information. (The original working paper is available at <http://www.sil.org/cellar/mlingdp/mlingdp.html>.)

[TEI94] Sperberg-McQueen, C. M. and Burnard, L. (1994) Guidelines for Electronic Text Encoding and Interchange. Chicago and Oxford: Text Encoding Initiative.