[Archive copy mirrored from: http://www-personal.umich.edu/~janete/tei_etd.htm]

An SGML/HTML Electronic Thesis and Dissertation Library

Janet Erickson
April 22, 1997

Example pages of a proposed Dissertation and Thesis Library are available at http://dns.hti.umich.edu/misc/diss.example/.

The Electronic Thesis and Dissertation Project (ETD) was launched in 1987 at an Ann Arbor meeting arranged by UMI and attended by representatives of Virginia Tech, the University of Michigan, SoftQuad, and ArborText. Virginia Tech funded the development of a Document Type Definition (DTD) for dissertations and theses; SoftQuad's Yuri Rubinsky wrote the initial DTD. The project continued at VT, with collaboration from the Coalition for Networked Information, the Council of Graduate Schools, and UMI among others. Since 1994, many of VT's students have submitted their dissertations and theses in Adobe's Portable Document Format (PDF). The long-term plan is to have the theses and dissertations submitted in both PDF and SGML. VT is waiting for suitable software to develop before requiring submissions to be in SGML format.

The aim of this project is to describe a potential online library of dissertations and theses at the University of Michigan. The focus is on the SGML markup of sample dissertations and a user interface for searching and retrieval. For this project, I acquired four dissertations to show the breadth of types that would need to be covered by the selected DTD. The first is the doctoral dissertation of Rebecca Price-Wilkin on the architectural history of a French church (The Late Gothic Abbey Church of Saint-Riquier). Price-Wilkin's document will be used as an example of an image-rich dissertation. The second is the doctoral dissertation of David Ruddy on the medieval travelogue Mandeville's Travels (Scribes, Printers, And Vernacular Authority: A Study in the Late-Medieval And Early-Modern Reception of Mandeville's Travels). Ruddy's will show how historic texts can be represented in this model. A third dissertation is from Michele Tepper, and is titled "The Mind of His Own Country": Nation and the Embodiment of Culture in Modernist Literature. It demonstrates the diversity that can be found within a dissertation, as each chapter can be thought of as an independent unit. Fourth, I acquired the dissertation of William Wheeler on global warming and agriculture from the University of Pennsylvania's School of Agriculture Economics and Rural Sociology. This document, titled Three Essays on Discounting and the Evaluation of Global Warming Policies, contains many tables, graphs, and formulas, allowing me to speak to these important issues.

The DTD

The first challenge to the project was selection of a Document Type Definition (DTD), or the rules by which markup would be applied to each document. The ETD project at Virginia Tech had completed a DTD (the ETD DTD) for use on theses and dissertations. I initially elected to use this DTD for several reasons. It seemed rather simple, such that students with a passing knowledge of HTML markup could learn to use it with little difficulty. Also, it had been designed with the material in mind, so there was some anticipation of suitability to the task. Lastly, I had made the faulty assumption that the DTD had been used and perfected such that it would be the most efficient way to mark up these dissertations.

As I investigated the ETD DTD, its many limitations became apparent. This DTD provides few attributes, most of which are 'id's. This does limit the complexity that is presented to a novice user, but also reduces the DTD's flexibility. No line breaks are available and there is no additional containment for the different parts of a <head>. You can also have only one <head> element in each division. Footnotes are referred to with <a> elements, as in HTML; this would work if the <footnote> element had been referred to in a content model other than its own. Use of this DTD was also hampered by the complexity of the four dissertations selected for the project. One of the documents has several lists and an introduction that precede the chapters; the ETD DTD does not have structures for these, so these lists would have to go into the first <chapter> element. The <chapter> element does not include attributes other than 'id,' so a 'type' attribute could not be added for clarification. After the chapters, there is a conclusion, followed by the figures and illustrations. This same dissertation has 150 illustrations and figures. There is no clear indication of what structure within the DTD to use for the images, though the <mm> element (multimedia object) might work. It is, however, not referenced in any content model other than its own, so documents marked up with the ETD DTD could not use the <mm> element. In sum, the ETD DTD, as I found it in early March 1997, was insufficiently tested and had enough problems that it could not be used for this project.

Because of the limitations of the ETD DTD, I selected an alternative DTD. I am most familiar with the Text Encoding Initiative (TEI) DTD; it has the flexibility to deal with the complexities of many documents, including dissertations. Markup of all four dissertations was done using the TEI Lite DTD. Since my initial investigation of the ETD DTD beta version, it has been updated to version 1.0. This version fixes many of the problems with the beta by adding various floating elements to the <etd> content model and including multimedia objects (<mm>) in other content models. I discovered this new version too late to change all of the dissertations to that DTD. Future electronic dissertation projects may well be able to use this DTD successfully.

The Markup

The process used to mark up each dissertation was slightly different depending on its length and complexity. Two of the four began as Microsoft Word for Macintosh files; Price-Wilkin's dissertation was in Word for Windows NT/95 format and Wheeler's was in WordPerfect 7 for Windows 95. I began with Tepper's dissertation, as it was the shortest, had no images, and had a reasonable number of footnotes. For this document, I used the SGML tools bundled with WordPerfect 7 for Windows 95. It was a simple process of cutting and pasting from the Rich Text Format (RTF) that I had created from the MSWord for Macintosh version to the SGML document instance. The only difficulties lay in inserting lines of poetry, as WP's software does not have the split function that SoftQuad's Author/Editor does. The split function allows the user to surround a larger portion of text as a particular element, then split that section into smaller instances of the same element. In WP, each line had to be separately tagged with an <L>, making for a tedious but effective process.

Ruddy's dissertation is longer than Tepper's and has more extensive footnotes. It also includes Middle English characters that need to be referred to by character entity reference, such as the thorn (þ). The manner in which I processed this dissertation was determined by the substantial number and size of the footnotes. I saved the document as RTF, uploaded the file to a UNIX machine, then used a short Perl program to mark up the text automatically. This processing relied on my ability to distinguish among the various RTF codes: each group starts with a curly brace, followed by the codes describing the text from that point to the ending curly brace (e.g., {\footnote Source cited above.}). Unlike an SGML end tag, the ending marker is generic and does not indicate the element to which it refers. Because of this, some guesses had to be made about where the footnotes ended. Occasionally the Perl algorithm failed and notes had to be indicated by hand. Also, some special characters were missed in the processing or were transposed to another character in changing from one platform to another. Some of these transposed or missing characters were not fixed in the final version. Missing characters would have to be found in the original word-processed file and then placed in the SGML version. My final step in converting from RTF to SGML eliminated all RTF codes that had not been identified and converted in earlier steps; the characters those codes were meant to represent were therefore lost.
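The footnote step can be illustrated with a short sketch. It is written in Python rather than Perl and is not the program used for this project: the function name convert_footnotes and the choice of a TEI <note place="foot"> element as output are assumptions made for illustration. It also assumes that RTF control words are introduced by a backslash and that literal braces in the text are backslash-escaped. Rather than guessing where a footnote ends, the sketch counts nested braces, and it leaves an unbalanced group untouched so that it can be corrected by hand.

```python
def convert_footnotes(rtf: str) -> str:
    """Replace each {\\footnote ...} group with a <note> element (illustrative only)."""
    marker = "{\\footnote"
    out = []
    i = 0
    while True:
        start = rtf.find(marker, i)
        if start == -1:
            out.append(rtf[i:])          # no more footnotes: copy the remainder
            break
        out.append(rtf[i:start])         # text before the footnote group
        depth = 0
        j = start
        while j < len(rtf):
            ch = rtf[j]
            if ch == "\\":
                j += 2                   # skip a control word's first letter or an escaped brace
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:           # this brace closes the footnote group
                    break
            j += 1
        if depth != 0:
            out.append(rtf[start:])      # unbalanced group: leave it for hand correction
            break
        body = rtf[start + len(marker):j].strip()
        out.append('<note place="foot">' + body + "</note>")
        i = j + 1
    return "".join(out)
```

Any RTF codes nested inside the footnote (italics, for example) are passed through unchanged and would still need their own conversion step.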

As I moved through the various dissertations, they became more complex, added more elements, and I learned more efficient ways of processing them. Wheeler's dissertation on agricultural economics had more than 50 complex equations. The document was created in WordPerfect 7 so the formulas were done in WP's equation editor. At this point, I could redo the equations using TeX, a mathematical representation language. I would need to learn TeX to do this and would also have to understand the formulas well enough that in redoing them I could make accurate reproductions. As I did not have the time or expertise in mathematics to do this, I chose instead to take advantage of another WP feature: automatic conversion to HTML. As part of this conversion, WP changed the equations to GIF images and the other formatting to HTML codes. These codes were regular and more easily identified than the RTF codes used in processing Ruddy's dissertation. Thus, with another Perl program, I was able to change the markup from the HTML DTD to the TEI Lite DTD. This again required some clean-up, though hand-processing was quite limited in comparison to previous efforts. ISO characters were successfully changed by WP from internal coding to character entity references. Final tweaking of the document was done using Author/Editor and PSGML through EMACS.

Price-Wilkin's dissertation presented the most complex challenges. It includes several indexes that were included in the front matter, and appendices in the back matter. Between these were an introduction, four chapters, and a conclusion, followed by a substantial number of figures, illustrations, and tables. Each figure and illustration was provided in three forms: a thumbnail, 100 dpi images for on-screen viewing, and 300 dpi for printing. I chose to use the 100 dpi images in the SGML version to decrease the download time. TEI Lite has apparatus to enclose a thumbnail image in a reference to the larger image, but I chose not to do this to simplify the processing and re-conversion to HTML for normal browsers. The images provided with Price-Wilkin's dissertation were in JPEG format, which is not supported in older versions of the SGML browser Panorama. I used a UNIX shell script to convert these images to GIF format, which is understood by both the new and old versions of Panorama.

Again, I used WordPerfect as a first step to convert from RTF to HTML. The HTML produced by WP is quite generic: what should be an <H1> in HTML is converted as <P><STRONG>, and everything else seemed to be a <P>. It was therefore difficult to identify such structures as bibliographies, lists, and quotes. Due to the length of this dissertation and time constraints, some of these structures remain as <P>s. Another useful enhancement of the markup would be links from the references to figures and illustrations within the text to the appropriate image in that section of the document.

Searching Dissertations

Dissertations submitted to the University of Michigan can currently be located through the MCAT database on the University's MIRLYN online catalog. Works from all U.S. universities can be searched through the Dissertation Abstracts International (DAI) CD-ROM from UMI and the print version of DAI. The Dialog commercial database also contains the DAI. Each of these sources provides varied access points to the document surrogates; the actual dissertation is not searchable. MCAT allows the user to search author, title, and abstract fields of UM dissertations. The print DAI includes keyword and author indexes for all dissertations done in the U.S. and selected ones from other countries. The CD-ROM allows searching on title, author, institution, year, UMI order number, keyword, and advisor's name for more recent dissertations (post-1980). Abstracts are not available on the CD-ROM for dissertations completed between 1861 and 1979.

The interface for this project builds on what is available on the CD-ROM. Therefore, users will be able to search on the following fields: title, author, institution,(1) year, UMI order number, advisor's name, keyword in the title or abstract, and keyword in the text. Each of these fields should be combinable using boolean operators, though this version does not implement that option. The results from a search will be presented in various ways, depending on the number of resulting hits and the fields searched.

Retrievals for searches on title, author, institution, year, UMI number, or advisor's name are all done in the same manner. If there is only one result, the user will see its title and abstract and the choice between viewing the dissertation in HTML or SGML. If HTML is selected, the user would see a list of the parts, such as chapters and their titles (if available), from which to select, and the choice of seeing the whole document in HTML. If SGML is selected, the whole document is sent to the user's SGML browser. Because of the size of some dissertations, delivery of only part of the SGML document to the user is desirable. The delivery of single chapters in DIVs is complicated by IDs and IDREFs that cross DIV boundaries, such as pointers that are linked to notes via ID/IDREF. The single result keyword query displays its result in the same manner as the other query types, except that a list of the hits within the dissertation replaces its abstract in the display. These hits are shown using keyword in context (KWIC), so that 20 characters of context surround each.
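The ID/IDREF complication can be made concrete with a small check. The following Python sketch is illustrative only: it assumes that pointers carry their IDREF in a 'target' attribute (as TEI Lite's <ptr> and <ref> do) and that attribute values are double-quoted; the function name dangling_targets is hypothetical. Given the text of a single division, it reports the targets whose IDs are defined somewhere outside that division and which would therefore break if the division were delivered alone.

```python
import re

# Attribute patterns; quoted values are assumed for simplicity.
ID_RE = re.compile(r'\bid\s*=\s*"([^"]+)"', re.IGNORECASE)
TARGET_RE = re.compile(r'\btarget\s*=\s*"([^"]+)"', re.IGNORECASE)

def dangling_targets(division):
    """Return the set of pointer targets not defined inside this division."""
    ids = set(ID_RE.findall(division))
    targets = set(TARGET_RE.findall(division))
    return targets - ids
```

An empty result means the division is self-contained; a non-empty one means the notes (or whatever else is pointed to) would have to travel with it.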

If a search results in more than one hit, the user will see a list of titles and abstracts in date order (most recent first) with the choice of HTML or SGML. As above, choosing SGML would download the whole dissertation to the user's desktop; selection of HTML would provide a table of contents, allowing the user the choice of seeing the whole document or only a portion of it. Below is a decision diagram for this interface design.

[Figure: decision diagram]

Table of Contents items will include the introduction, chapters, bibliography, and appendices; each is indicated by the 'type' attribute on the numbered DIV element. The ability to automatically generate this TOC list is dependent on consistent markup in the choice of attribute values.
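A minimal sketch of how such a TOC list might be generated is given below, in Python. It is not the implementation used for the example pages. The 'type' values in TOC_TYPES are assumed spellings, and the sketch assumes that a division's <head>, when present, immediately follows its start tag. It shows why consistent attribute values matter: a division typed 'chap' instead of 'chapter' would simply be skipped.

```python
import re

# A numbered DIV start tag, optionally followed at once by its <head>.
DIV_RE = re.compile(
    r'<div(\d)\b([^>]*)>\s*(?:<head>(.*?)</head>)?',
    re.IGNORECASE | re.DOTALL,
)
TYPE_RE = re.compile(r'\btype\s*=\s*"([^"]*)"', re.IGNORECASE)

# Assumed attribute values; a real collection would fix these by convention.
TOC_TYPES = {"introduction", "chapter", "bibliography", "appendix"}

def build_toc(sgml):
    """Return (level, type, heading) tuples for divisions that belong in the TOC."""
    entries = []
    for m in DIV_RE.finditer(sgml):
        level, attrs, head = int(m.group(1)), m.group(2), m.group(3)
        t = TYPE_RE.search(attrs)
        if t and t.group(1).lower() in TOC_TYPES:
            entries.append((level, t.group(1).lower(), (head or "").strip()))
    return entries
```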

The dissertations will be stored and searched as SGML. Results retrieved as SGML can be sent to the user's desktop without modification. When whole document results are retrieved as HTML, an intervening Perl program will convert those SGML tags with corresponding HTML codes to that markup; other SGML tags will be stripped out of the results. If a user selects a part of the document to retrieve (e.g., Chapter 1 only) in a non-keyword search, the full DIV of the appropriate section will be converted to HTML and returned. This retrieval will utilize the region-generating functions of the OpenText indexing software.
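The conversion rule described here (translate the tags that have an HTML counterpart, strip the rest) can be sketched in a few lines. The example is in Python rather than Perl, and the particular mapping in TAG_MAP is an assumption chosen for illustration, not the mapping used by the project. Attributes are dropped, and the sketch assumes no attribute value contains a '>' character.

```python
import re

# Assumed TEI Lite -> HTML correspondences (illustrative, not exhaustive).
TAG_MAP = {
    "p": "p",
    "head": "h2",
    "q": "blockquote",
    "list": "ul",
    "item": "li",
    "hi": "em",
}

TAG_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9.]*)[^>]*>')

def tei_to_html(sgml):
    """Map known tags to HTML and strip all other tags, keeping their content."""
    def swap(m):
        closing, name = m.group(1), m.group(2).lower()
        html = TAG_MAP.get(name)
        if html is None:
            return ""                      # no HTML counterpart: drop the tag only
        return "<" + closing + html + ">"
    return TAG_RE.sub(swap, sgml)
```

Because stripped tags leave their content behind, the text of an inline note would run into the surrounding paragraph; a fuller converter would need a rule for notes in particular.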

The multiple result keyword query displays its results in the same manner as the other query types, except that a list of the hits within each dissertation replaces the abstracts in the display. These hits are shown using keyword in context (KWIC), so that each is surrounded by 20 characters of context. On-the-fly conversion of the SGML to HTML means that the HTML does not have to be regenerated if changes are made in the SGML source and that only one copy of each dissertation needs to be stored on the system.
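The keyword-in-context display can be sketched as follows. This Python fragment is illustrative only; the function name kwic and the whole-word, case-insensitive matching are assumptions, and it operates on plain text from which the markup has already been removed.

```python
import re

def kwic(text, keyword, width=20):
    """Return (left context, hit, right context) for each whole-word match."""
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]   # up to `width` characters before
        right = text[m.end():m.end() + width]               # up to `width` characters after
        hits.append((left, m.group(0), right))
    return hits
```

Each tuple can then be rendered as one line of the hit list, with the middle element highlighted.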

For two of the dissertations (Price-Wilkin and Tepper), I chose to include both the footnotes and the text as divisions in an overall division with 'type' equal to 'chapter.' With this in mind, a keyword search that had hits within the text sub-division of the chapter would need to retrieve the larger chapter in order to include the notes with the text. The position of the notes does not complicate matters in Ruddy's dissertation, as the pointer to each note is immediately followed by the note itself. The Wheeler dissertation needs some revision in this regard, as the notes for all three chapters are set off in a DIV0 of their own at the end of the complete document. As there are only 20 footnotes in this dissertation, they could easily be moved to follow the appropriate pointers.

Users can also browse the collection by year or subject. The other available fields, such as author, can be searched to produce pertinent lists to browse. Browsing by year lets the user see what has been added to the collection recently. Subject browsing makes it easier to see the latest research in an area. In the ETD model, initial subject keywords are assigned by the author and placed in a <keyword> element in the document header. From these keywords, indexers within the library would assign controlled vocabulary subject headings, such as Library of Congress Subject Headings. For ease of subject access and indexing, an alternative approach would be more appropriate. Each dissertation will be listed under the department to which it was submitted, with inter-departmental dissertations listed in each relevant section. This provides a simple mechanism for determining where to put a dissertation. This system of indexing may not scale well to a larger collection. As the collection grows, indexing could be done with a combination of department and specialization within the department.

In browsing the Year List or Subject List, the user could select either SGML or HTML, as shown in the diagram below. Retrieval of the whole or part of a dissertation proceeds as above.

[Figure: dissertation retrieval diagram]

Queries and System Response

From the initial screen (figure 1, appendix), the user can access the dissertations in several ways. First, they can select the 'browse' option which leads to another screen that offers a choice of browsing by author, subject, or year (figure 3). Screen shots of these interface pages are included as an appendix to this document. The author listing (figure 4) contains all dissertations in alphabetical order by author's last name, followed by the title and year it was produced, then the choice of SGML or HTML. Selecting SGML retrieves the whole document in Panorama; selecting HTML leads the user to another page with the table of contents for that dissertation (figure 6). Significant TOC items such as chapters and appendices will have hypertext links to the relevant SGML text division (changed to HTML in the process of retrieval). The user can also choose to retrieve the complete document in HTML.

Browsing by year and subject (figure 5) is quite similar to browsing by author; the initial screen of each is different, however. Selection of browsing by year brings up a list of all dissertations in completion date order; browsing by subject produces a list of dissertations by UM academic department.

The current search screen is fairly rudimentary and non-functional (figure 2). It is anticipated that boolean options will be added to this interface to allow more complex searches. The single result of a search for "tepper" as an author produces the author/title/year listing, the choice of HTML or SGML, and the complete abstract (figure 7). All of this information can be pulled from the original SGML source.

A keyword search on the term "chronicle" results in many hits (figure 8). The screen shot prepared for this query is only an example of the possible results. Again, the author, title, and year of the dissertation are provided along with the HTML/SGML choice. For each dissertation with results, the larger heading of the textual division that the hit occurs in is provided as a link, with the specific keyword hits highlighted in a list. This list includes these keyword hits with approximately 20 letters of context on either side, so that hits that use the word in a relevant context are easier to locate.

Why Use SGML?

The use of SGML markup on dissertations allows far more complex searching. For fully marked up documents, searches can be made on bibliographic citations or such citations could be extracted from each dissertation to create a citation database as a secondary product. Because the whole dissertation would be online, it could be searched and retrieved, rather than searching and retrieving only the limited document surrogate (title and abstract), then waiting for delivery of the complete dissertation. Logical divisions within the text can be marked up; this structure can be utilized for retrieval of these smaller portions of a document to reduce download time. SGML is also independent of platform, such that a single document can be shown successfully on any number of computers without conversion. It lacks the proprietary coding that makes word-processed documents difficult to transfer between applications and platforms. As Web technology improves, the raw SGML will become even more useful as it will translate to this new system equally well.

When the ETD project was first suggested in 1987, the World Wide Web and HTML did not exist. Teaching graduate students how to do SGML markup would have been quite a challenge at that point. Since then, the Web has introduced students to HTML. HTML is easy to learn, with few rules and a small set of simple tags. Many word processors can already produce HTML without the user knowing anything about markup. One might be tempted to leave the dissertations in HTML format, as it is currently so much easier to produce than SGML. However, HTML markup is generic, with few ways to distinguish between the various parts of a document. It does not provide textual divisions that would let you section off pieces to be retrieved. The automatic markup done by these word processors is simplistic, with all text blocks tagged as <P> and headings as <P><STRONG>. With this type of generic tagging, it is difficult to pinpoint variant structures for application of SGML markup. In addition, though a user may specify what each word-processing style maps to in HTML, the conversion does not always produce HTML markup that conforms to the user-set mappings.

One problem with using SGML is the variety in markup that can be produced by multiple users. As can be seen in the various ways that I have presented notes within each dissertation, there is a great deal of variation available to the user. In this case, notes were placed immediately following the pointer, at the end of the chapter, and at the end of the document. What is most striking about this is the fact that a single SGML tagger allowed this much variety in markup. One can only imagine the variations in markup among graduate students.

One way to avoid significant variation in applying markup is to have a central office for converting word-processed dissertations to SGML. Within the University, there is already standardization required in preparation and formatting of dissertations. Dissertation printouts submitted to Rackham are reviewed for compliance with these standards. With SGML, a stylesheet attached to the SGML document would impose these formatting rules. The SGML DTD would impose some restrictions on how markup could be applied to a dissertation, and this markup would be reviewed. As SGML conversion tools grow more sophisticated and simplified, it will be easier to rely on the markup output by these tools.

Other Important Decisions

Special characters

The Price-Wilkin and Ruddy dissertations include a number of non-ASCII characters. Price-Wilkin's includes French text; Ruddy's includes both modern and medieval Western European characters. The Western European characters should be available in ISO Latin-1 and therefore available via standard HTML character entity references. The Middle English characters, such as the yogh, will have to be dealt with in some other way -- perhaps GIFs -- until Unicode is the norm. For this project, the Latin-1 characters were changed to character entity references in the conversion process; the remaining special characters will show up incorrectly in Ruddy's dissertation.
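The split between characters that can be expressed as Latin-1 entity references and those that cannot can be sketched as follows. This Python fragment is illustrative and was not part of the project's conversion; it relies on the standard library's table of HTML entity names (html.entities.codepoint2name), and the function name to_entity_refs is hypothetical. Characters outside Latin-1, such as the yogh, are left in place and collected in a list so that they can be handled separately.

```python
from html.entities import codepoint2name

def to_entity_refs(text):
    """Return (converted text, list of characters with no Latin-1 entity)."""
    out = []
    unresolved = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                                # plain ASCII passes through
        elif cp <= 0xFF and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")    # e.g. &eacute;, &thorn;
        else:
            out.append(ch)                                # outside Latin-1: keep and report
            unresolved.append(ch)
    return "".join(out), unresolved
```

Note that the thorn falls within Latin-1 and so converts cleanly, while the yogh ends up in the unresolved list.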

Integrating Current Technology

Mathematicians and economists, among others, have developed mechanisms for presenting formulas and equations. The TeX formula description language is frequently used to typeset formulas, and is the language referred to specifically in the ETD DTD as a way to process equations outside of the SGML document. As noted above, the equations used in Wheeler's dissertation were created with WordPerfect's equation editor and converted to GIFs. This is a temporary, insufficient solution to the problems of presenting mathematics online. The Graduate School will have to work with the departments to determine whether students will be required to submit their formulas as images, in TeX format, or in some other manner. Another area where practice varies is in selecting image types. Price-Wilkin used JPEG images for the photographs and diagrams in her dissertation. A common version of Panorama (vers. 1.50), an SGML browser, is unable to display JPEG images without use of an external viewer. Consequently, the images were converted to GIFs for this project. Version 2.0 of Panorama is able to handle JPEG and many other image formats internally, so this should not be a long-term problem.

Conclusion

A library of dissertations in SGML is a feasible endeavor. The difficulties I encountered in selecting and using a DTD may have been settled with the revision of the ETD DTD. It is structurally straightforward and both this structure and the element names fall into a pattern familiar to most graduate students. With the popularity of HTML, the task of teaching students how to use SGML is simplified. Conversion to SGML from the word-processed format would add a step to the dissertation submission process, an addition that may be resisted by some already burdened students. Display of characters outside ISO Latin-1 and of mathematical formulas still poses significant problems for implementation of this system. These problems are outweighed by the benefits of having the University's dissertations available online. The SGML structure provides multiple access points and comprehensive searching, while archiving the documents in a format that is not platform-dependent.


1. As this is intended to be a collection of UM dissertations, this option may not be available initially. As more institutions create digital libraries of their dissertations, this option would be added to allow the user to broaden their search.