Using XML to Describe a Document Hierarchy

Paper by Carole E. Mah, carolem@stg.brown.edu
Original idea by Shawn Zeller, shawn@stg.brown.edu
Last modified Saturday, 02-Sep-2000 14:03:49 EDT

Originally presented in a shorter form as a poster session at Extreme Markup Languages 2000 in Montreal, Canada, August 15-18, 2000.

Problem: Storing information about file hierarchies and links

We need to represent information about a set of hierarchically linked documents in order to manipulate them as we perform a conversion from HTML->XHTML->OEB (XML).

Specifically, we have developed a web-based system for grabbing web sites (or sub-trees) and converting them to OEB documents, while allowing the user to interactively add, subtract or replace documents, as well as add a whole new tree or sub-tree. An additional requirement was ease of OEB-related modifications, such as creating fallbacks, adding documents to the spine, and editing the metadata.

This meant we had to store a great deal of meta-information -- filenames, the relationships between files, attributes of the files such as their media type, markup scheme, etc. -- in a way that would allow easy manipulation and reconfiguration.

Solution: Use XML file description to describe data and XSLT for management and transformations

-- a strategy that turned out to efficiently and naturally support a wider range of uses than originally anticipated.

Originally, we planned to store all this meta-information in a database, which is a well-understood tool for manipulating related pieces of data.

However, it soon became clear that the varying kinds of operations we needed to perform on and using this metadata, and the nature of that metadata, made an XML solution more powerful and easier to implement. Furthermore, it would fit in well with the tools and methods we already had in place to do the transformation of the data itself (i.e. translating the documents themselves from HTML->OEB, using transformation tools such as XSLT and XML::DOM).

Therefore, we developed a DTD for this purpose. As we did so, it became increasingly apparent that this solution was the right one as it allowed easy multi-purposing -- the same document instance can be used to do such varied things as

provide various displays of information about the files to assist the user in modifying the document set
make the OEB package file
provide necessary auxiliary information for the actual data transformation

This revealed a further benefit to the solution: it could be equally useful as a general Web mirroring tool or as a website analysis tool.

The general structure of the DTD is that of <document> and <media> elements, each one of which corresponds to a file. The first few element declarations in the DTD are as follows (attribute declarations trimmed for brevity):

<!ELEMENT filerep	  (user, metadata, spine+, directoryroot) >
<!ELEMENT directoryroot ( (directory | document | media)+ ) >
<!ELEMENT directory	  ( (directory | document | media)+ ) >
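Because a <directory> may itself contain further <directory> elements, any consumer of the file has to walk the tree recursively. The following is a minimal Python sketch of such a walk, for illustration only (the system described here used XSLT and XML::DOM). The `name` attributes on <directory>, <document>, and <media>, and the sample file names, are assumptions, since the attribute declarations are trimmed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the nesting follows the declarations above.
SAMPLE = """
<directoryroot>
  <document name="index.html"/>
  <directory name="images">
    <media name="map.jpg"/>
    <directory name="icons">
      <media name="home.gif"/>
    </directory>
  </directory>
</directoryroot>
"""

def walk(node, prefix=""):
    """Yield (element kind, path) for every file below node, depth first."""
    for child in node:
        if child.tag == "directory":
            # Recurse, extending the path with this directory's name.
            yield from walk(child, prefix + child.get("name") + "/")
        elif child.tag in ("document", "media"):
            yield child.tag, prefix + child.get("name")

files = list(walk(ET.fromstring(SAMPLE)))
for kind, path in files:
    print(kind, path)
```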

Issues:

(1)

One problem we encountered during the development of the DTD was with IDs. Initially it seemed easiest to store each file path name in the ID of the <document> or <media> element. Using IDs in this way has many advantages -- for removing files, adding files to the spine, and getting a quick file list for various uses, an ID/IDREF scheme is essential. However, XML/SGML mandates that IDs/IDREFs may contain only NAME characters -- under SGML's reference concrete syntax, period (.), hyphen (-), a-z, A-Z, and 0-9; XML additionally permits the underscore (_) and colon (:). File paths, in contrast, may have other characters; the underscore, which the SGML rules exclude, is particularly common in filenames, and there are always slashes (/) in the file path as directory separators.

The directory separator is the first problem: since period, hyphen, letters, and numbers can all appear in filenames, the slash cannot be unambiguously converted into any one of those characters. However, the hyphen is rare enough in filenames that our initial quick solution was to change slashes into hyphens in the ID. We then did a regular expression substitution from hyphen back to slash before trying to manipulate the actual files. A better solution might have been to scan the filename for hyphens and, if there are any, make the directory separator something other than a single hyphen -- for example, something unlikely to occur in a filename, such as "-.-" or some other unlikely series of hyphens and periods.
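The separator idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the substitution we actually ran: the function names, the list of candidate separators, and the sample paths are all hypothetical. The sketch also checks that a candidate really round-trips, because a multi-character separator can collide with hyphens and periods adjacent to a slash even when the separator itself never occurs in the path.

```python
def encode(path, sep):
    """Replace every directory separator with sep."""
    return path.replace("/", sep)

def decode(ident, sep):
    """Reverse encode(); only reliable if sep was chosen by choose_separator()."""
    return ident.replace(sep, "/")

def choose_separator(path, candidates=("-", "-.-", "-..-")):
    """Return the first candidate that is absent from path and round-trips."""
    for sep in candidates:
        if sep not in path and decode(encode(path, sep), sep) == path:
            return sep
    raise ValueError("no usable separator for " + path)

for path in ("images/map.jpg", "inns/guild-hall/index.html"):
    sep = choose_separator(path)
    print(path, "->", encode(path, sep))
```

Note that the chosen separator would have to be recorded alongside the ID, since decoding depends on it.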

However, the underscores and other non-NAME characters call for an even better solution. By simply adding a CDATA attribute (we'll call it FILENAME) to <document> and <media>, whose content is the original unaltered filename, the problem is solved. One uses the FILENAME attribute when gathering filenames for use in (for example) transforming or validating the actual files via system commands, and one uses ID/IDREFs for things such as easily finding all the files to which a given file has links (or conversely which files point to a given file) and displaying the titles of items in the spine. Here is an example of a <document> element:

<document name="http://www.guildhallinn.com/index.html"
      id="http--www.guildhallinn.com-index.html"
      source="fetched">
   <title>Guildhall Inn Bed and Breakfast</title>
</document>
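The division of labor between the two attributes can be sketched as follows in Python (illustrative only). The sketch assumes that the `name` attribute in the example above carries the unaltered filename; the <media> entry, its file name, and its ID are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<directoryroot>
  <document name="http://www.guildhallinn.com/index.html"
            id="http--www.guildhallinn.com-index.html" source="fetched">
    <title>Guildhall Inn Bed and Breakfast</title>
  </document>
  <!-- hypothetical entry whose file name contains an underscore -->
  <media name="http://www.guildhallinn.com/inn_photo.jpg"
         id="http--www.guildhallinn.com-inn-photo.jpg" source="fetched"/>
</directoryroot>
"""

root = ET.fromstring(SAMPLE)

# IDs are for lookups and cross-references inside the description file.
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

# The unaltered names are what gets handed to tools that touch real files.
filenames = [el.get("name") for el in root.iter()
             if el.tag in ("document", "media")]

print(by_id["http--www.guildhallinn.com-index.html"].findtext("title"))
for name in filenames:
    print(name)
```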

This is a sufficient structure for some uses of the system. However, in our case, the web hierarchy did not need to be preserved in its original form, but rather re-written for use on a non-networked handheld device. For this reason, it made more sense to re-write the file paths entirely during transformation, while still keeping every bit of information about the original file paths and hierarchy, as that information is crucial during the upload process for keeping track of file relationships, uploading additional files, etc.

For this reason, the DTD's final form added elements such as <uri> for the new, local file path, and <origuri> for the original file path. Here is an example of a <document> element with these added elements and with the URIs re-written in terms of a local, non-web filesystem (as a contrast to the example above):

<document name="index.html" id="root-index.html" source="fetched">
   <uri>index.html</uri>
   <origuri>http://www.guildhallinn.com/index.html</origuri>
   <title>Guildhall Inn Bed and Breakfast</title>
</document>
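As a rough illustration of this re-writing step, the Python sketch below builds such a <document> element from an original URL. The function name, the "root-" ID prefix rule, and the choice of "index.html" as the default page for a URL ending in a slash are assumptions made for the example, not a description of our transformation code.

```python
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET

def make_document(orig_url, title, default_page="index.html"):
    """Build a <document> whose <uri> is a local path and whose
    <origuri> keeps the original location untouched."""
    path = urlsplit(orig_url).path.lstrip("/")
    if path == "" or path.endswith("/"):
        path += default_page                    # assumed default page name
    doc = ET.Element("document", {
        "name": path.rsplit("/", 1)[-1],        # last path segment
        "id": "root-" + path.replace("/", "-"), # assumed ID scheme
        "source": "fetched",
    })
    ET.SubElement(doc, "uri").text = path
    ET.SubElement(doc, "origuri").text = orig_url
    ET.SubElement(doc, "title").text = title
    return doc

doc = make_document("http://www.guildhallinn.com/index.html",
                    "Guildhall Inn Bed and Breakfast")
print(ET.tostring(doc, encoding="unicode"))
```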

(2)

The solution described above works very well for a standard website whose home page functions as the root of the document hierarchy, but our requirements include other structures:

Multiple hierarchies
The user may wish to include several distinct websites in their final OEB book. For example, our Vermont traveller may wish to make a book containing the websites for three different Bed & Breakfast Inns in that region.
Non-hierarchical document sets
The user may wish to make a book using XML source documents which may not have any links to one another. For example, the user might want to create a book from several disparate XML TEI documents.

The solution to both of these problems was simply to create a "dummy" root element. For example, this is useful both for adding a whole new tree or subtree to an already-uploaded HTML hierarchy, and for uploading multiple, non-linked TEI documents.

An HTML upload might start as:

       guildhallinn.com-index.html 
        /              |         \ 
       /               |          \ 
     mapdir.html     about.html   touristinfo.html 
       | 
       | 
     map.jpg

And then the user might want to add another Inn:

                  Inns (fake root) 
                  /               \ 
                 /                 \ 
         guildhallinn.com        innatwillowpond.com 
            |                           | 
            |                           | 
          (...)                       (...)

Similarly, an upload of several non-linked TEI documents would also have a fake root:

                  MyDoc (fake root) 
                 /         |        \ 
                /          |         \ 
             chap1.xml  chap2.xml   chap3.xml

This makes it possible to use the same DTD and all the same file/web-grabbing software for many different kinds of file uploading scenarios, regardless of whether the document set is actually hierarchical. The one problem to be aware of when doing this is naming conflicts, because (for example) a single web hierarchy can contain many "index.html" files. Even for a single tree, we take care of this by specifying the whole file path in the ID, but with a fake root one has to supplement this by creating fake subdirectory names as well.
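The fake root and the fake subdirectory names can be sketched as follows in Python. This is only an illustration: the function name, the `name` attributes on <directoryroot> and <directory>, and the rule of prefixing every ID and IDREF with the subdirectory name are assumptions. The sketch also assumes that all IDREFs inside a tree point to IDs in that same tree.

```python
import xml.etree.ElementTree as ET

def combine(trees, root_name):
    """Hang several independent trees under one fake root.

    trees maps a fake subdirectory name to a list of <document>/<media>
    elements; the elements are modified in place.
    """
    root = ET.Element("directoryroot", {"name": root_name})
    seen = set()
    for subdir, elements in trees.items():
        directory = ET.SubElement(root, "directory", {"name": subdir})
        for element in elements:
            # Prefix every id and idref so equal file names cannot collide.
            for node in element.iter():
                for attr in ("id", "idref"):
                    if node.get(attr) is not None:
                        node.set(attr, subdir + "-" + node.get(attr))
            if element.get("id") in seen:
                raise ValueError("duplicate id: " + element.get("id"))
            seen.add(element.get("id"))
            directory.append(element)
    return root

guildhall = [ET.fromstring('<document name="index.html" id="index.html"/>')]
willowpond = [ET.fromstring('<document name="index.html" id="index.html"/>')]
inns = combine({"guildhallinn.com": guildhall,
                "innatwillowpond.com": willowpond}, "Inns")
ids = [doc.get("id") for doc in inns.iter("document")]
print(ids)
```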

(3)

As previously mentioned, one of the reasons to make sure IDs/IDREFs work correctly is to make it easy to add and delete documents or subtrees from the hierarchy. To achieve this, we first decided to set up a pointer (using an IDREF on the <link> element) to each <document> from within the <document>s that pointed to it.

<document name="index.html" id="root-index.html" source="fetched"> 
    <uri>index.html</uri> 
    <origuri>http://www.guildhallinn.com/</origuri> 
    <title>Guildhall Inn Bed and Breakfast</title> 
    <link id="root-index.html-link1" idref="root-mapDir.html"> 
      <origuri>./mapDir.html</origuri> 
      <fulluri>http://www.guildhallinn.com/mapDir.html</fulluri> 
    </link> 
    <link id="root-index.html-link2" idref="root-touristInfo.html"> 
      <origuri>./touristInfo.html</origuri> 
      <fulluri>http://www.guildhallinn.com/touristInfo.html</fulluri> 
    </link> 
  </document> 
 
  <document name="mapDir.html" id="root-mapDir.html" source="fetched"> 
    <uri>mapDir.html</uri> 
    <origuri>http://www.guildhallinn.com/mapDir.html</origuri> 
    <title>Map and Directions</title> 
  </document> 
 
  <document name="touristInfo.html" id="root-touristInfo.html" source="fetched"> 
    <uri>touristInfo.html</uri> 
    <origuri>http://www.guildhallinn.com/touristInfo.html</origuri> 
    <title>Tourist Attractions Near Guildhall Inn</title> 
</document>
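With pointers in one direction only, finding the documents that link to a given document, or deleting a document cleanly, means scanning every <link> in the file. The Python sketch below illustrates that cost; it is not our implementation, the function names are hypothetical, and the sample omits the <uri>, <origuri>, <fulluri>, and <title> children for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<directoryroot>
  <document name="index.html" id="root-index.html" source="fetched">
    <link id="root-index.html-link1" idref="root-mapDir.html"/>
    <link id="root-index.html-link2" idref="root-touristInfo.html"/>
  </document>
  <document name="mapDir.html" id="root-mapDir.html" source="fetched"/>
  <document name="touristInfo.html" id="root-touristInfo.html" source="fetched"/>
</directoryroot>
"""

def incoming(root, target_id):
    """Return the ids of all documents holding a <link> to target_id."""
    return [doc.get("id") for doc in root.iter("document")
            if any(link.get("idref") == target_id
                   for link in doc.findall("link"))]

def remove_document(root, target_id):
    """Delete the document with target_id and every <link> pointing to it."""
    for parent in list(root.iter()):      # snapshot, since the tree is mutated
        for child in list(parent):
            is_target = (child.tag == "document"
                         and child.get("id") == target_id)
            is_dangling = (child.tag == "link"
                           and child.get("idref") == target_id)
            if is_target or is_dangling:
                parent.remove(child)

root = ET.fromstring(SAMPLE)
linkers_before = incoming(root, "root-mapDir.html")
remove_document(root, "root-mapDir.html")
remaining_ids = [doc.get("id") for doc in root.iter("document")]
remaining_links = [link.get("idref") for link in root.iter("link")]
print(linkers_before, remaining_ids, remaining_links)
```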

However, for reasons of ease and speed, we replaced this uni-directional linking with bi-directional linking. That is, we added the <linkers> element, which holds pointers to the other <document>s that point to the <document> within which the <linkers> element is located:

<document name="index.html" id="root-index.html" source="fetched"> 
    <uri>index.html</uri> 
    <origuri>http://www.guildhallinn.com/</origuri> 
    <title>Guildhall Inn Bed and Breakfast</title> 
    <link id="root-index.html-link1" idref="root-mapDir.html"> 
      <origuri>./mapDir.html</origuri> 
      <fulluri>http://www.guildhallinn.com/mapDir.html</fulluri> 
    </link> 
    <link id="root-index.html-link2" idref="root-touristInfo.html"> 
      <origuri>./touristInfo.html</origuri> 
      <fulluri>http://www.guildhallinn.com/touristInfo.html</fulluri> 
    </link> 
  </document> 
 
  <document name="mapDir.html" id="root-mapDir.html" source="fetched"> 
    <uri>mapDir.html</uri> 
    <origuri>http://www.guildhallinn.com/mapDir.html</origuri> 
    <title>Map and Directions</title> 
    <linkers> 
      <itemref idref="root-index.html"/> 
    </linkers> 
  </document> 
 
  <document name="touristInfo.html" id="root-touristInfo.html" source="fetched"> 
    <uri>touristInfo.html</uri> 
    <origuri>http://www.guildhallinn.com/touristInfo.html</origuri> 
    <title>Tourist Attractions Near Guildhall Inn</title> 
    <linkers> 
      <itemref idref="root-index.html"/> 
    </linkers> 
</document>
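The <linkers> elements are fully determined by the <link> elements, so they can be derived mechanically. The following Python sketch shows one way to do so; it is an illustration, not our code. The function name is hypothetical and the sample again omits the URI and title children.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """
<directoryroot>
  <document name="index.html" id="root-index.html" source="fetched">
    <link id="root-index.html-link1" idref="root-mapDir.html"/>
    <link id="root-index.html-link2" idref="root-touristInfo.html"/>
  </document>
  <document name="mapDir.html" id="root-mapDir.html" source="fetched"/>
  <document name="touristInfo.html" id="root-touristInfo.html" source="fetched"/>
</directoryroot>
"""

def add_linkers(root):
    """Rebuild every <linkers> element from the <link> elements."""
    docs = list(root.iter("document"))    # snapshot, since the tree is mutated
    incoming = defaultdict(list)
    for doc in docs:
        for link in doc.findall("link"):
            sources = incoming[link.get("idref")]
            if doc.get("id") not in sources:   # list each linker only once
                sources.append(doc.get("id"))
    for doc in docs:
        for stale in doc.findall("linkers"):   # drop any outdated back-pointers
            doc.remove(stale)
        if doc.get("id") in incoming:
            linkers = ET.SubElement(doc, "linkers")
            for source_id in incoming[doc.get("id")]:
                ET.SubElement(linkers, "itemref", {"idref": source_id})

root = ET.fromstring(SAMPLE)
add_linkers(root)
back_pointers = {doc.get("id"): [ref.get("idref")
                                 for ref in doc.findall("linkers/itemref")]
                 for doc in root.iter("document")}
print(back_pointers)
```

With the back-pointers in place, deleting a document only requires visiting the documents named in its own <linkers> element rather than scanning the whole file.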

Conclusion

As stated before, this approach to the representation of information about file hierarchies and links is generalizable, and is not limited to the use to which we have put it. Anyone attempting to use this approach should be aware of the inherent pitfalls described here which arise when representing these hierarchies using XML; I have described the most important ones, and their solutions. However, one should also remember that using XML to represent this information really makes one's life easy, if for no other reason than because XSLT is such a wonderful tool. If we had not used XML we might have had to multiply the number of tools needed. XML also made it very easy to re-purpose the information gathered about the file hierarchies. This is an advantage of SGML/XML in which we have believed for quite some time; a practical demonstration of its truth is nonetheless striking.