Using XML to Describe a Document Hierarchy

Problem: Storing information about file hierarchies and links

Full paper at http://www.stg.brown.edu/pub/filerep.html

We need to represent information about a set of hierarchically linked documents in order to manipulate them as we perform a conversion from HTML->XHTML->OEB (XML).

Specifically, we have developed a web-based system for grabbing web sites (or sub-trees of them) and converting them to OEB documents. This system must allow easy manipulation and reconfiguration of the file set, file attributes, etc.

Solution: Use an XML file description to describe the data, and XSLT for management and transformations

This strategy turned out to support, efficiently and naturally, a wider range of uses than originally anticipated.

Benefits:

Multi-purposing: a document instance can be used for many purposes
Generalizable: the DTD and the associated approach and software tools can easily be generalized into a Web mirroring tool or a website analysis tool

The general structure of the DTD is that of <document> and <media> elements, each one of which corresponds to a file. The first few element declarations in the DTD are as follows (attribute declarations trimmed for brevity):

    <!ELEMENT filerep       (user, metadata, spine+, directoryroot) >
    <!ELEMENT directoryroot ( (directory | document | media)+ ) >
    <!ELEMENT directory     ( (directory | document | media)+ ) >
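
To make the nesting concrete, the following Python sketch walks a small fragment shaped like the declarations above and collects one entry per file. The sample fragment, the name attributes on <directory> and <media>, and the helper list_files are illustrative assumptions; they are not taken from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <directoryroot> is shown, and the name
# attributes on <directory> and <media> are assumed for illustration.
SAMPLE = """
<directoryroot>
  <document name="index.html" id="root-index.html"/>
  <directory name="images">
    <media name="logo.gif" id="root-images-logo.gif"/>
  </directory>
</directoryroot>
"""

def list_files(node, prefix=""):
    """Return (kind, path) pairs for every <document> and <media> below node."""
    found = []
    for child in node:
        if child.tag == "directory":
            # A directory contributes a path segment and is searched recursively.
            found.extend(list_files(child, prefix + child.get("name", "") + "/"))
        elif child.tag in ("document", "media"):
            found.append((child.tag, prefix + child.get("name", "")))
    return found

files = list_files(ET.fromstring(SAMPLE))
for kind, path in files:
    print(kind, path)
```

Because <directory> may contain further <directory> elements, a recursive walk of this kind mirrors the content model directly.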

Issues:

(1) Reconciling need for IDs/IDREFs with need to retain filenames

Although we gave <document> elements IDs, so we could refer to them, we realized it was expedient to retain the actual file path, in order to (for example) transform or validate the actual files via system commands.

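One uses IDs/IDREFs for things such as easily finding all the files to which a given file has links (or, conversely, which files point to a given file), or displaying the titles of items in the spine. XML and SGML require ID/IDREF values to be names: they must begin with a letter (XML also allows an underscore or colon) and may otherwise contain only letters, digits, periods, and hyphens (XML also permits underscores and colons). File paths, however, may contain other characters, most notably the slash and, in SGML, the underscore.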

Solution: Use a modified version of the file path for IDs/IDREFs, and add a separate CDATA attribute containing the actual file path. Here is an example of a <document> element:

    <document name="index.html" id="root-index.html" source="fetched">
      <uri>index.html</uri>
      <origuri>http://www.guildhallinn.com/index.html</origuri>
      <title>Guildhall Inn Bed and Breakfast</title>
    </document>
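
The exact rule used to derive an ID from a path is not given here. As a minimal sketch, assuming that every character outside letters, digits, period, and hyphen is replaced by a hyphen and that the "root-" prefix seen in the example is prepended, the mapping could look like this in Python (path_to_id is a hypothetical helper):

```python
import re

def path_to_id(path, prefix="root-"):
    """Derive an ID-safe name from a relative file path (assumed scheme)."""
    # Replace every character outside letters, digits, period, and hyphen
    # (for example "/" and "_") with a hyphen.
    safe = re.sub(r"[^A-Za-z0-9.-]", "-", path)
    # The prefix starts with a letter, so the result is a legal name
    # even when the path itself begins with a digit.
    return prefix + safe

print(path_to_id("index.html"))
print(path_to_id("images/site_map.html"))
```

A mapping like this is lossy (for instance, "a_b.html" and "a-b.html" would collide), which is one more reason to keep the real path alongside the ID.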

(2) Multiple hierarchies and non-hierarchical document sets

The solution described above works very well for a standard website whose home page functions as the root of the document hierarchy, but our requirements include other structures:

a.) Multiple hierarchies

The user may wish to include several distinct websites in their OEB book.

b.) Non-hierarchical document sets

The user may wish to make a book from XML source documents that have no links to one another. For example, the user might want to create a book from several disparate XML TEI documents.

The solution to both problems was to create a "dummy" root element. This is useful both for adding a whole new tree to an already-uploaded HTML hierarchy and for uploading multiple, non-linked TEI documents. For example, an upload of non-linked TEI documents would have a fake root:

               MyDoc (fake root)
                /      |      \
               /       |       \
        chap1.xml   chap2.xml  chap3.xml
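
How the fake root is encoded is not spelled out here. One possible encoding, sketched in Python under stated assumptions, is a placeholder <document> that holds a <link> to each real document. The helper add_fake_root, the source="dummy" value, and the link ID pattern are hypothetical.

```python
import xml.etree.ElementTree as ET

def add_fake_root(directoryroot, root_name):
    """Insert a placeholder <document> that links to every existing document."""
    # Collect the targets first so the fake root never links to itself.
    targets = [doc.get("id") for doc in directoryroot.iter("document")]
    # source="dummy" is an assumed marker for a document with no real file.
    fake = ET.Element("document",
                      {"name": root_name, "id": root_name, "source": "dummy"})
    for n, target in enumerate(targets, start=1):
        ET.SubElement(fake, "link",
                      {"id": f"{root_name}-link{n}", "idref": target})
    directoryroot.insert(0, fake)
    return fake

# Three unlinked chapters, as in the diagram above.
directoryroot = ET.Element("directoryroot")
for chapter in ("chap1.xml", "chap2.xml", "chap3.xml"):
    ET.SubElement(directoryroot, "document",
                  {"name": chapter, "id": "root-" + chapter})

fake_root = add_fake_root(directoryroot, "MyDoc")
print([link.get("idref") for link in fake_root.findall("link")])
```

For the multiple-hierarchy case, the same idea would apply with links only to the home page of each tree rather than to every document.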

(3) Links: Meta-information about relationship between files

As mentioned, one of the reasons to make sure IDs work correctly is to make it easy to add and delete documents or subtrees from the hierarchy. To achieve this, we first decided to set up a pointer (using an IDREF on the <link> element) to each <document> from within the <document> elements that pointed to it.

However, for reasons of ease and speed, we replaced this uni-directional linking with bi-directional linking. That is, we added the <linkers> element, which holds pointers to the other <document> elements that link to the <document> in which the <linkers> element is located:

    <document name="index.html" id="root-index.html" source="fetched">
      <uri>index.html</uri>
      <origuri>http://www.guildhallinn.com/</origuri>
      <title>Guildhall Inn Bed and Breakfast</title>
      <link id="root-index.html-link1" idref="root-mapDir.html">
        <origuri>./mapDir.html</origuri>
        <fulluri>http://www.guildhallinn.com/mapDir.html</fulluri>
      </link>
      <link id="root-index.html-link2" idref="root-touristInfo.html">
        <origuri>./touristInfo.html</origuri>
        <fulluri>http://www.guildhallinn.com/touristInfo.html</fulluri>
      </link>
    </document>

    <document name="mapDir.html" id="root-mapDir.html" source="fetched">
      <uri>mapDir.html</uri>
      <origuri>http://www.guildhallinn.com/mapDir.html</origuri>
      <title>Map and Directions</title>
      <linkers>
        <itemref idref="root-index.html"/>
      </linkers>
    </document>
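
The back-pointers can be derived mechanically from the forward links. The Python sketch below shows one way to do so on a trimmed-down sample; add_linkers is a hypothetical helper, and the sample omits the child elements that are not needed for the computation.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<directoryroot>
  <document name="index.html" id="root-index.html">
    <link id="root-index.html-link1" idref="root-mapDir.html"/>
    <link id="root-index.html-link2" idref="root-touristInfo.html"/>
  </document>
  <document name="mapDir.html" id="root-mapDir.html"/>
  <document name="touristInfo.html" id="root-touristInfo.html"/>
</directoryroot>
"""

def add_linkers(root):
    """Add an <itemref> under <linkers> in every document that is a link target."""
    by_id = {doc.get("id"): doc for doc in root.iter("document")}
    for source in list(by_id.values()):
        for link in source.findall("link"):
            target = by_id.get(link.get("idref"))
            if target is None:
                continue  # the link points outside the file set
            linkers = target.find("linkers")
            if linkers is None:
                linkers = ET.SubElement(target, "linkers")
            ET.SubElement(linkers, "itemref", {"idref": source.get("id")})

root = ET.fromstring(SAMPLE)
add_linkers(root)
for doc in root.iter("document"):
    print(doc.get("id"), [ref.get("idref") for ref in doc.iter("itemref")])
```

With the back-pointers stored, deleting a document only requires visiting the documents named in its <linkers> element rather than scanning the whole file set. Note that this sketch appends entries each time it runs, so it should be applied once to a tree that has no <linkers> elements yet.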

Conclusion

As stated above, this approach to representing information about file hierarchies and links is generalizable and is not limited to the use to which we have put it. Anyone adopting it should be aware of the pitfalls described here, which arise when representing these hierarchies in XML; we have described the most important ones, along with their solutions. That said, using XML to represent this information makes one's life considerably easier, if for no other reason than that XSLT is such a wonderful tool. Had we not used XML, we would probably have needed a larger number of separate tools. XML also made it very easy to re-purpose the information gathered about the file hierarchies. This is an advantage of SGML/XML in which we have believed for quite some time; a practical demonstration of its truth is nonetheless striking and gratifying.