Managing Medieval Gigabytes: DRAFT

by Murray McGillivray, Carl Gutwin, and Todd Reed

Note: this draft version is missing references and links.
1.0 Introduction

1.1 E-texts for humanists

In the last few years there has been a tremendous increase in the amount of electronic text ("e-text" in this paper) available to humanists, and this is certainly only the beginning of an increasing digitization of the humanities that will remain a trend well into next century. Most professional humanists now sit at a desk on which the traditional tools of our trade--the books, pens, little scraps of notes, and so on that have characterized the lair of the humanist since before the advent of printing--have been joined by a little hulking electronic monster of one brand or another. A few of us still simply curse at the absurdly complex apparatus that has replaced our typewriters; most of us, however, are at least beginning to make use of the Internet and other tools of electronic scholarship. The new tools, ranging from electronic library catalogues and bibliographies, through new vehicles of scholarly communication, to digitized versions of primary texts, make some traditional tasks easier, and they also enable new approaches to many fields. Their increased popularity among scholars means that the amount of data that must be stored in electronic form for scholarly use is increasing alarmingly quickly.

Among the developments of the last few years that have contributed to this increase in the use and preparation of e-texts, the most important one is probably the evolution of the Internet itself and the advantages it offers for easy electronic mail and increasingly easy search and retrieval software--beginning with ftp and uucp and escalating first into gopher and then to the World Wide Web.

For humanists in particular, the work of the Text Encoding Initiative has also been a key element helping us to enter the electronic world. We might also cite the popular summer seminars offered by the Princeton-Rutgers Center for Electronic Texts in the Humanities in conjunction with the University of Toronto's Centre for Computing in the Humanities, and such efforts to take advantage of the special characteristics of electronic communication as the burgeoning world of humanities listservs, electronic fora such as InterScripta, and electronic journals (including, now, the pre-publication release of papers by Exemplaria). It is clear that in some of these areas, humanists have been following the lead of colleagues in the sciences, where electronic publication of research results has become one of the most common methods of scholarly communication, a trend that seems certain to overtake the humanities.

1.2 Special status of medieval and Renaissance studies

Medievalists and Renaissance specialists, however, have been leaders rather than followers of this trend in the humanities. There is not space to list here all of the contributions we have made to the field of electronic text, but it is useful to remember that even in one little corner of our world, the area of Old English studies, we can count such early developments as the ANSAXNET listserver, Pat Conner's Beowulf Workstation, the electronic database of the Dictionary of Old English, and Peter Baker's OE Edit program (just to give examples).

The contributions to this conference show that this early prominence of medieval and Renaissance studies in the world of electronic text is not something that is destined to go away. Ambitious projects on the scale of the Canterbury Tales Project, Kevin Kiernan's Beowulf project, Ian Lancashire's Renaissance Electronic Text project, the Plea Rolls project, Buzzetti and Tabaroni's work, and the Speculum Naturale Project serve rather to confirm that prominence.

Like scholars in other areas, and like librarians and government holders of databases, medievalists, in the forefront of humanists, are starting to have to be concerned about storage and dissemination of what is becoming a large amount of electronic material. Because of their particular needs, including the need to store and retrieve images of manuscript pages, medievalists are also peculiarly hungry for storage megabytes and particularly in need of advanced indexing and retrieval mechanisms.

1.3 Digital libraries

The concept of the "digital library" has become a focus for discussion of storage and retrieval questions regarding both public-domain and copyright material--in the early years of the next century, we are likely to see the unveiling of large repositories of digital information, to which we will be able to gain access through publicly available networks, paying a small charge for the use of copyright materials. Some existing collections of electronic texts, such as the materials at the University of Virginia and at the Oxford Text Archive and the catalogue at Princeton-Rutgers, give us part of an idea of the kind of repositories the humanities will have in the future. These, however, are very small projects compared to such things as the multi-agency Digital Libraries Initiative, a collaborative project of the National Science Foundation, the U.S. Defense Department's Advanced Research Projects Agency (ARPA), NASA, and six American universities. This American initiative focusses (as a whole) on earth and space sciences, aiming to make very large collections of documents, maps, photographs, motion video, and so on, available to researchers and where appropriate to the general public. The Library of Congress has its "National Digital Library Program," which is also very ambitious, and we are probably all aware of similar, though more modest, efforts such as the British Library's electronic explorations and the Curia project in Ireland.

Looking towards the future, the outcome of these trends will almost certainly be the availability of massive electronic text collections, on a par with today's conventional libraries. Just to give an idea of what that might mean, we can estimate that the contents of the University of Calgary Libraries, including books, microforms, audio-visual materials, maps and air photos, represent roughly 7,000 gigabytes (7,000,000 megabytes) of data, including roughly 4,000 gigabytes of textual data.

1.4 Our project

We have begun a project to explore ways of dealing with such future enormous collections of e-texts, making use wherever possible of existing tools for document identification and structuring, for compression and indexing, and for delivery. In this forward-looking project, we are concerned with a number of questions, the answers to which are not intuitively obvious, at least to us. This conference itself has highlighted for us some aspects of these questions that we had been ignoring, such as the degree to which we might expect future archives of humanities materials to be characterized by cross-document hypertext or hypermedia linking. We have, on reflection, no real idea of how central to make support for such links in our general conception. It is our confident expectation that we will continue to be flummoxed in this way by emergent problems. Nevertheless, we have made some heuristic assumptions that are guiding our design choices for the pilot project stage of this work.

2.0 Assumptions and design criteria

2.1 Assumption 1: We will need to deal with very large collections of humanities e-texts.

Our response to this assumption is to adopt as a base for the pilot project a public-domain compression and indexing utility called "mg" ("managing gigabytes") developed by Ian Witten, Alistair Moffat, and Timothy Bell. This software deals especially well with text, achieving high compression rates on large text corpora. It is particularly well adapted to compressing humanities materials, since it supports both text and image compression, and has a separate and very efficient set of routines for compression of text-images, that is, of pictures of printed book or journal pages. It does not, however, deal very well with long documents, nor does it handle internal linkages of any kind between one compressed chunk and another; we will describe the problems these inadequacies create for us later in this paper.

2.2 Assumption 2: A structured document language like SGML will continue to form the basis for text encoding in the humanities.

We have heard SGML criticized very cogently as a vehicle for encoding of medieval and Renaissance primary materials at this conference. Nevertheless, the multi-year international effort of the Text Encoding Initiative has produced a standard that, while it may not adequately serve the needs of all communities of humanists by itself, will undoubtedly serve as a basis for further evolution in text encoding. SGML as adopted by TEI has important advantages for humanists, including cross-platform portability, defined structures for dealing with many encoding situations we face in making electronic versions of primary texts, and reliance on international standards. There are already quite a number of texts available in TEI P2 or P3, and the advent of TEI Lite is likely to make the standard even more popular.

Two of the disadvantages of TEI for our project are that every text has to be accompanied by a (sometimes very complex) Document Type Declaration (DTD), and that internal sections of texts, such as chapters or paragraphs, are not easily separable from the whole text because of the hierarchical structuring that the DTD enforces. Again, these disadvantages will be discussed below.

2.3 Assumption 3: The searches that humanists are likely to undertake should be well supported.

What kinds of search humanists might want to undertake if it were possible to access a full-text index of the data equivalent of a large research library is in some ways difficult to predict, since the type of search desired will be conditioned to some extent by the size of the archive. Imagine, for example, that it were possible to search your whole university library for uses of the word "revolution" between 1730 and 1776 in either British or American texts, and to have the relevant documents delivered to your desktop--with the word highlighted--within seconds.

We have to keep in mind that an indexed large collection may itself fairly dramatically change the kinds of research that people do. Thomas Corns suggests that when databases are available that encompass whole periods of literature, computer-based analysis will be able to see individual works in their proper context:

"The technical transformations in computer technology of the last half decade, some still in the process of percolating through to humanities computing, offer opportunities for radical transformation in literary computing, in its scale, in the ends it may address, and in its status and role in the larger academic community. Simply, enhanced text storage and retrieval can alter not only what we may do but can relate in myriad ways to the concerns and approaches which currently animate English studies and the study of other literatures." [p. 129]

Because this is a pilot project, we have decided to concentrate on supporting types of search that we know that humanists already carry out--searches for words and collocations, for uses of particular words and collocations within defined distances from one another, and searches for words and collocations within particular elements of a document (for instance, in a title, or in a poem). We have borrowed these goals fairly shamelessly from existing tools for textual analysis, such as TACT, WordCruncher, and the PAT search engine used at Virginia (among other places).

Researchers in the sciences are typically looking for the whole of a short document--a relevant scientific or technical report. Researchers in the humanities tend to use whole books more, and to be interested, therefore, in sections of a longer text. We want, for this reason, to also support some kinds of "browsing" before the user takes the step of ordering a book or article from the archive, and this is one of the more difficult technical challenges we face.

2.4 Assumption 4: Future systems by which digital libraries may be served to scholars will resemble the World Wide Web.

Initially, for this reason, we are developing a utility that will compress and index large collections of SGML-encoded humanities materials and serve them to the Web. We are pleased to be able to rely on freely available Web tools, particularly Panorama Free, whose availability means that we do not have to expend our energy designing an SGML browser.

2.5 Project goal

The goal of this project is to construct a framework and pilot system for creating and making available large collections of humanities texts and images. The pilot system will be a suite of tools, based on mg and SGML, that will compress and index large collections of SGML-encoded e-texts and associated image files and that will allow users to search or browse the compressed archive using Web tools and retrieve either significant portions of e-texts in response to word and collocation searches, or whole e-texts.

3.0 An overview of the creation and delivery of humanities e-text collections

An overview of the structural framework that also shows the various pieces in this project is given in Figure 1. The framework shows five stages in the process of creating collections from e-texts and delivering them to users and researchers across the World-Wide Web: document preparation, archive creation, search and retrieval, wide-area access, and final display. Our approach to each of these areas is described below.

Document preparation

In the first stage, e-texts and images are prepared for collection. As mentioned above, we assume that collections of humanities e-texts will consist of text encoded using a scheme such as the TEI's version of SGML. Preprocessing will resolve a number of issues for each source text. First, large works must be divided into smaller pieces (hereafter called "documents") in order to provide finer-grained access. Second, we record the structure of each work, in order to support navigation and queries based on structure. Third, we find and mark links in the text that point to other documents, pictures, or manuscript images. Finally, we annotate images with the text that they are associated with so that they can be retrieved from the archive through textual queries. Section 4.2 below discusses the issues raised by these tasks in more detail.

Creation of the collection

Next, the e-texts are put together into a collection, and a full-text index is created that will be the basis for later queries. An engine like mg that creates a full-text index when it compresses the collection is necessary because of the potential sizes of future collections: mg can produce a document archive and full-text index that together use less than fifty percent of the texts' original space.

mg creates archives based on the unit of a document, a section of text that is of an "appropriate" size for later retrieval. For example, it may be appropriate to divide a novel into chapters, pages, or paragraphs; a poem could be divided into pages, stanzas, or even lines.

Search and retrieval

Researchers and users of the text collection will retrieve documents by submitting queries to a search engine, which will then find matching documents through the collection's index. In our framework, queries will be composed in a public interface such as a World-Wide Web form and then passed to mg's search engine.

mg provides built-in support for ranked queries and boolean queries. Additional types of search, such as phrase search, proximity-based search, or investigation of collocation, must be supported with a post-processor.
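To make the post-processing idea concrete, the following sketch (in Python; the function names and the list of candidate documents are hypothetical, and the initial boolean query would be issued to mg separately) keeps only those candidate documents in which two search terms occur within a given number of words of each other.

# Hypothetical post-processor for proximity search: a boolean query to mg
# returns candidate documents, and we keep only those in which the two
# terms occur within `window` words of each other.
def within_distance(text, term1, term2, window):
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    positions1 = [i for i, w in enumerate(words) if w == term1]
    positions2 = [i for i, w in enumerate(words) if w == term2]
    return any(abs(i - j) <= window
               for i in positions1 for j in positions2)

def proximity_filter(candidates, term1, term2, window=5):
    # `candidates` is assumed to be a list of decoded document strings
    # obtained from a boolean query such as "noble & wif".
    return [doc for doc in candidates
            if within_distance(doc, term1.lower(), term2.lower(), window)]

The same two-stage pattern (a cheap index lookup followed by a slower scan of the few documents retrieved) underlies the other kinds of post-processing discussed later in this paper.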

Wide area distribution

Access to the collection will be provided to a large community through the now-common World-Wide Web. A Web browser such as Netscape or Mosaic will be used to select collections for consideration, to compose queries, and to review search results.

A Web browser serves two purposes. First, it provides a standard, multi-platform, widely available tool for delivering the collection to any place in the world that has an Internet link. Second, it provides a simple mechanism for first-stage review of search results, reducing the number of documents that must be decoded in full. To show these first-stage results, we can display structural information that indicates what original text the document comes from, and its location within that text. The initial results of searches, that is, the paragraphs, poems, etc. that contain the words or phrases being searched for, may be served to the user translated into HTML, to be followed by the full e-text in SGML if the user requests it.
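A minimal sketch of such a translation step follows (Python; the tag mapping is purely illustrative, and a real converter would be driven by the DTD of the collection rather than by a fixed table).

import re

# Illustrative mapping from a few TEI-style tags to HTML for quick
# first-stage display; tags not listed here are simply dropped.
TAG_MAP = {
    'p': 'p', 'head': 'h2', 'lg': 'blockquote',
    'l': 'br', 'hi': 'em', 'title': 'cite',
}

def sgml_to_html(fragment):
    def convert(match):
        close, name = match.group(1), match.group(2).lower()
        html = TAG_MAP.get(name)
        if html is None:
            return ''                       # drop unknown tags
        if html == 'br':
            return '' if close else '<br>'  # line breaks have no close tag
        return '</%s>' % html if close else '<%s>' % html
    return re.sub(r'<(/?)(\w+)[^>]*>', convert, fragment)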

Final display of texts

When the user has selected documents to view, we can display them as they were meant to be seen using a tool that converts SGML and TEI codes into screen representations. This tool allows the format, fonts, and special characters that exist in the original text to be properly displayed. In this project, as mentioned above, we use an application called Panorama, from SoftQuad (although we use the commercial version, Panorama Pro, for development, there is a publicly-available version, Panorama Free).

The final display interface must support the types of actions that users and researchers wish to perform. These include:
- the ability to see the overall structure of the original text, to show the context of retrieved documents, and to provide a mechanism for navigating through the text;
- highlighting search terms in the document, which may require insertion of additional tags into the retrieved text for display purposes (a sketch of this step follows the list);
- the ability to move forward and backwards to other documents in the work, which will require fetching appropriate documents from the collection;
- the ability to follow links to pictures, manuscript images, or other works.
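
As an illustration of the second of these requirements, the following sketch (Python; the <HIT> element is hypothetical and would need a corresponding entry in the viewer's stylesheet) wraps each occurrence of a search term in a highlighting tag before the document is handed to the display tool.

import re

# Wrap each occurrence of a search term in a hypothetical <HIT> element so
# that the SGML viewer can render it highlighted. The text is first split
# on tags so that matches inside existing markup are left alone.
def highlight(document, terms):
    # `terms` is assumed to be a non-empty list of search terms.
    pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, terms)),
                         re.IGNORECASE)
    pieces = re.split(r'(<[^>]*>)', document)   # keep tags as separate pieces
    for i, piece in enumerate(pieces):
        if not piece.startswith('<'):
            pieces[i] = pattern.sub(r'<HIT>\1</HIT>', piece)
    return ''.join(pieces)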

The Panorama viewer acts as one of the "helper" applications for the Mosaic WWW browser. When Mosaic encounters a file that is encoded with SGML, it sends the text to Panorama for display. The Panorama interface can also send requests back to Mosaic, such as a request for a different part of the document or for an external link.

4. Tools & techniques:

This section describes in greater detail some of the tools and techniques that were outlined above. We first describe the mg text-compression and indexing system, and then discuss some of the strategies that we use to deal with TEI-encoded documents. Although the following paragraphs provide some technical details, they also introduce issues that must be considered when building a literary text collection.

4.1 The mg system

mg is a software engine for compressing, indexing, and retrieving documents and images. It consists of applications that support the creation of large collections, and tools for searching those collections. mg works under the Unix operating system, and has been used for production of many kinds of collections including a large digital library of computer science documents in New Zealand, which is mounted on the WWW. The main strength of mg is its ability to handle large collections (the name mg comes from "managing gigabytes"). Below, we introduce the four main functions of mg, and then summarize mg's strengths and weaknesses for the purposes of this project and for humanities e-text collections more generally.

4.1.1 Text compression

Standard electronic representations of text (such as ASCII) do not store text very efficiently. By using knowledge about regularities in the text, compression schemes can significantly reduce the amount of storage required and the amount of time needed to send a document across a communication link. A variety of schemes are used: one, for example, replaces sequences of text with pointers to previous occurrences of those sequences. This technique saves space because a pointer usually takes less space than the text itself.
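The following toy sketch (Python) illustrates the pointer idea; it is not the scheme mg itself uses, but it shows why repeated sequences compress well.

# Toy compressor: repeated character sequences are replaced by
# (offset, length) pointers to earlier occurrences in the text.
def compress(text, window=4096, min_match=4):
    out, i = [], 0
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(text) and j + length < i and
                   text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            out.append(('ptr', best_off, best_len))   # pointer to earlier text
            i += best_len
        else:
            out.append(('lit', text[i]))              # literal character
            i += 1
    return out

On the phrase "to be or not to be", for example, the second "to be" is stored as a single pointer back to the first occurrence rather than as five separate characters.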

The price of compression is the time spent in compressing and decompressing the text, and in the added complications of allowing random access to the compressed text. There are trade-offs between these costs and the amount of compression desired.

Archives that mg constructs typically use only 30% of the space used by the original files. The index itself takes up about 7-10% of the size of the original text file, and auxiliary files add another 5%. The total size of the index and the compressed archive is from 42% to 45% of the size of the original collection of e-texts.

4.1.2 Images in collections

Images often comprise a large part of the space used in text collections, and images can be similarly compressed to reduce space and transmission time. While text must be compressed so that the original can be reconstructed exactly (called "lossless" compression), images can be further squeezed by permitting a certain amount of information to be lost in the process (called "lossy" compression). These two techniques may be used for different kinds of images in literary collections; certain images, such as manuscript images, may or may not require exact reconstruction, depending on how closely researchers will need to examine them.

mg does not include software to compress most image types, so they must be compressed before inclusion in an mg archive. Common compressed image formats such as JPEG, GIF, or TIFF are possible choices.
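To illustrate the two choices, the following sketch (Python, assuming the Pillow imaging library is available; the file names are hypothetical) saves the same scanned page in a lossy and a lossless format.

from PIL import Image

# A hypothetical manuscript page scanned as an uncompressed TIFF.
page = Image.open("ms_page.tif")

# Lossy compression: much smaller, but fine detail may be lost.
page.save("ms_page.jpg", quality=85)

# Lossless compression: larger, but the original can be reconstructed exactly.
page.save("ms_page.png")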

4.1.3 Index construction

mg creates a full-text index to the works in a collection, meaning that the index lists all the documents in which a particular word appears. mg does not include information about where the words appear within a document, but these locations can be found by secondary searching of the decoded documents.

Unlike many indexing applications, mg includes "stop words" like "the," "is," and "and" in its index. These words are often ignored by indexing utilities because they can account for up to 30% of the total words in the collection and can greatly increase the size of the index. In collections of humanities e-texts, however, these words could play important roles in many types of analysis. mg uses innovative data structures and compresses the index to save space, and can index a collection's entire vocabulary for only a small increase in index size.

mg also uses stemming and case-folding in its indexes. Stemming involves conflating index entries for words that share morphological roots (for example, poets, poet, and poetical would all be stemmed to poet). Stemming can provide for searches based on these roots, but also decreases the accuracy of what is retrieved. Case-folding treats all characters as lower case, reducing the number of words that must appear in the index, but preventing any searches that use specific cases. Both of these drawbacks can be overcome by carrying out a secondary search of the retrieved documents (as long as their number is small).
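Such a secondary search might look like the following sketch (Python; the retrieved document list and the search term are hypothetical), which restores case sensitivity after mg's case-folded index has returned its candidates.

import re

# mg's case-folded index returns documents containing "march", "March",
# and "MARCH" alike; a secondary pass restores case sensitivity.
def case_sensitive_filter(documents, term):
    pattern = re.compile(r'\b%s\b' % re.escape(term))   # exact case required
    return [doc for doc in documents if pattern.search(doc)]

# Usage: keep only documents in which the month name appears capitalized.
# hits = case_sensitive_filter(retrieved_docs, "March")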

4.1.4 Search and retrieval

mg provides access to collections through a separate query engine (called mgquery). The query engine supports two types of search, ranked search and boolean search.

Ranked search takes a list of terms as input and attempts to determine which documents in the collection are most similar to the search terms. Documents where the search terms occur frequently are considered to be more similar, and the set of retrieved documents can be ranked using this metric. Ranked searches are appropriate in situations where the user does not wish to search for specific terms, but knows a set of words that pertain to a subject or topic.
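The idea behind ranked search can be sketched as follows (Python; this simplified term-frequency scoring is only an illustration, not mg's actual weighting formula, which also takes factors such as document length into account).

# Simplified ranked search: score each document by how often the query
# terms occur in it, then sort by score. `documents` is assumed to map
# document identifiers to their decoded text.
def rank(documents, query_terms):
    terms = [t.lower() for t in query_terms]
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for score, doc_id in scored if score > 0]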

Boolean search involves the construction of a boolean expression, combining search terms with the boolean operators and, or, and not. Boolean searches are more precise than ranked searches, but they often require more care in construction in order to keep the number of retrieved documents small.

There are several other types of search that may be necessary for analysis or reference use of a literary collection. Although not directly supported by mg, many of these can be supported through postprocessing. This usually entails that the system construct a boolean query to obtain a small set of candidate documents, and then perform slower text searches and comparisons on those documents using regular expressions.

The types of search that we feel should be supported initially include indications of a term's global frequency, phrase search, proximity-based search, and search for the collocates of a particular word.
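As an illustration of the last of these, a collocation search might be post-processed along the following lines (Python; the window size and the set of retrieved documents are hypothetical).

from collections import Counter

# Count the words that occur within `window` words of a node word across a
# set of retrieved documents: a rough-and-ready list of collocates.
def collocates(documents, node, window=4):
    counts = Counter()
    node = node.lower()
    for text in documents:
        words = [w.strip('.,;:!?"()').lower() for w in text.split()]
        for i, w in enumerate(words):
            if w == node:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts.most_common(20)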

4.1.5 Summary

mg provides many useful capabilities to a project such as this one, but also presents certain constraints and limitations. mg's strengths are its excellent compression, its full-text index that includes stop words, its abilities to include images in collections, the possibilities for producing multiple indexes for a single collection, and the fact that it is a freely-available system.

mg's main weaknesses revolve around its notion of a document, and around support for different kinds of search. First, mg has a somewhat restrictive notion of a document that does not take the structure of a text into consideration. In a literary collection, different texts will require different ideas of what a document is, and this forces additional work in recording structure so that users can navigate the collection in natural ways.

Second, mg does not support a wide range of searches. Although many of these can be accomplished through postprocessing, this solution increases the overall system's complexity and requires additional processing time. In addition, since mg's index locates words only to the document level, additional work is necessary to find and display search terms.

4.2 TEI encoding

4.2.1 Structural information in SGML

Applying mg to SGML-encoded text should help to solve some of the problems inherent in mg. Specifically, TEI or other SGML encoding contains the very information that is missing in mg, that is, information about the structure of an e-text. mg itself does not really know anything about the documents into which it has divided an e-text except that they are text. SGML tagging, on the other hand, and particularly structure-rich SGML, like TEI encoding, means that the e-text itself contains information about its parts and their relation to one another. In a sense, however, this information about any particular piece of an e-text is only contained in the e-text as a whole (including the SGML Document Type Declaration and the Document Instance). And because mg must divide the e-text into smaller portions to make search and retrieval efficient, we need to find a way to indicate, within each small portion, its relation to the whole structure of the e-text from which it is taken.

4.2.2 Partitioning SGML texts into mg documents

Prior to indexing and compressing TEI-encoded text with mg, the texts must be preprocessed to partition each text into documents. In most cases, partitioning a work into paragraphs is appropriate. In other cases, such as poetry, more judicious decisions are required about document granularity. A short poem would likely be contained in its own document, while a lengthy poem should be reduced to smaller units. Ideally, an automatic technique, applying certain predetermined heuristics, can be used to generate logical units of text appropriate for the type of work.
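A first approximation of such a technique is sketched below (Python; the element names used as break points and the document separator are illustrative assumptions, and mg's own input conventions would be followed in practice).

import re

# Split a TEI-encoded text into smaller "documents" for mg, breaking at the
# start of the elements listed below. The element names and the separator
# are placeholders, not mg's own conventions.
BREAK_ELEMENTS = r'<(?:p|lg|div2)[ >]'    # paragraphs, stanzas, subsections
SEPARATOR = '\x02'                        # placeholder document separator

def partition(tei_text):
    starts = [m.start() for m in re.finditer(BREAK_ELEMENTS, tei_text)]
    bounds = [0] + starts + [len(tei_text)]
    pieces = [tei_text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p.strip()]

def write_for_mg(tei_text, outfile):
    with open(outfile, 'w') as f:
        f.write(SEPARATOR.join(partition(tei_text)))

In practice the list of break elements would itself be chosen per work, so that a collection of lyric poems, a long narrative poem, and a prose treatise would each be divided at the level appropriate to it.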

4.2.3 Document granularity

Fine document granularity significantly reduces the amount of useless information returned by a query and improves the efficiency with which useable results are returned to the user. Unfortunately, choosing a fine granularity has the adverse effect of isolating text from surrounding material that often provides important contextual information. For example, a search of the works of Chaucer for the word "noble" might return the line "That she had, this noble wif" if each line was made an mg document, but without knowledge of its immediate context (the surrounding text) and the work from which it comes (the Book of the Duchess), this line would be of little use to most researchers.

4.2.4 Including structural and contextual information in mg documents

In SGML-encoded texts, the list of tags that are open at the point where a particular mg document begins is also of particular interest. Such information can be used to support advanced queries and the ability to browse a work (such as jumping to bibliographic information, or to the start of the enclosing chapter). Support of these features in mg requires that all contextual information be prepended to each document. We are devising an engine to read the SGML tags from the e-text and use them together with document-number information obtained by a first compression run to construct a header for each mg document of something like the following form:

<MGHDR>
<TITLE>Troilus and Criseyde</TITLE>
<AUTHOR>Geoffrey Chaucer</AUTHOR>
<EDITOR>W.W. Skeat</EDITOR>
<PLACE>Oxford</PLACE>
<PUBLISHER>Clarendon</PUBLISHER>
<YEAR>1896</YEAR>
<DTDDOC>23575</DTDDOC>
<TEIHDRDOC>23576</TEIHDRDOC>
<FIRSTDOC>23577</FIRSTDOC>
<LASTDOC>23995</LASTDOC>
<OPENTAGS>Div1 type=book n=1 {23577}\stanza n=26 {23603}\</OPENTAGS>
</MGHDR>

Such a document header can be used to provide the user with enough information as a first query response--for example, title and author of the work plus location information within the work--that he or she will often be able to know whether looking at the document itself would be useful or not. It can also be used to make mg support such browsing strategies as (for the header above) looking at the whole of Book 1 of Troilus and Criseyde to make sure that it is what is wanted, and it can be used to locate bibliographic information (in the TEI Header) or to identify for the delivery system the mg documents that contain the whole work and its DTD, should a user wish to order up the complete e-text. Enclosing the entire header between the tags <MGHDR> and </MGHDR> avoids the possibility that SGML tags internal to the document will be confused with the mg header tags.
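To give an idea of how the OPENTAGS field might be produced, the following sketch (Python; it ignores SGML tag minimization and empty elements, and the document numbers would come from the first compression run mentioned above) maintains a stack of the elements that are open at a given offset in the e-text.

import re

# Scan an SGML-encoded e-text up to a given offset and return the stack of
# elements that are still open at that point, so that the OPENTAGS field of
# the mg header can be written out at each document boundary.
TAG = re.compile(r'<(/?)([A-Za-z][\w.-]*)([^>]*)>')

def open_tags_at(text, offset):
    stack = []
    for m in TAG.finditer(text, 0, offset):
        closing, name, attrs = m.group(1), m.group(2), m.group(3).strip()
        if closing:
            if stack and stack[-1][0].lower() == name.lower():
                stack.pop()
        else:
            stack.append((name, attrs))
    return stack   # e.g. [('Div1', 'type=book n=1'), ('stanza', 'n=26')]

The stack returned for a document boundary would then be serialized, together with the document numbers of the enclosing units, into the OPENTAGS format shown in the example header above.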

Adding such a header to each document does not significantly increase the size of the compressed archive, since much of the text of the header will be repeated from document to document and will therefore be replaced with pointers. In a test of this strategy on Project Gutenberg (ASCII) texts, the original collection without headers before compression was 80MB, resulting in a 40MB archive after compression; with headers added the collection was 90MB before compression and the compressed archive was 41MB.