SGML: Canterbury Tales

**The Canterbury Tales Project**

Authors: Peter Robinson and Elizabeth Solopova

The Canterbury Tales Project, Faculty of English, University of Oxford

In this talk, we will outline the aims of the Canterbury Tales Project, say something of the methods we are using to achieve these aims, and reflect on the particular difficulties we have found in our way. Our first major publication, the CD-ROM of the fifty-eight pre-1500 manuscripts and early printed editions of the Wife of Bath's Prologue, is to be published within the next three months. We will illustrate what we have to say by demonstration from the prototype CD-ROM.

The Canterbury Tales Project has two major aims. Our first aim is as old as textual scholarship: we want to find out, as nearly as we can, what Chaucer actually wrote. To do this, we have to begin as all textual critics must when faced with a large and complex textual tradition: we have to establish, using all the means at our disposal, a narrative history of the textual tradition. Once we have a clear sense of the sequence of copying of the manuscripts, we can begin to discriminate which manuscripts appear closest to the head of the tradition, and in turn use that information to filter Chaucer's own text from the mass of scribal variation. Though this aim is old, some of the methods we use -- computer collation, techniques of computerised stemmatics borrowed from evolutionary biology, database analysis -- are very much of the late twentieth century.

Our second aim grows from the first. In order to arrive at a history of the whole textual tradition, we have to gather and analyse every piece of relevant information in every one of the eighty-eight pre-1500 witnesses to the Canterbury Tales. That is: we have to acquire copies of each of the 25,000 pages of text; we have to transcribe every word on every page of every witness; we have to collate all these transcriptions word by word against one another to create the record of agreement and variation which will be the foundation of our narrative history of the text. We have become aware, as we accumulate this body of information, that we are creating an extraordinary research resource: exact original-spelling transcriptions of some six million words of fifteenth-century manuscript and early printed edition material. Not only this, but we are assigning every one of these six million words to a lemma and a grammatical category, so that it will be possible (for example) to locate every occurrence of the first person present singular, the second person present singular subjunctive, etc., of the verb 'to be' in all this material. All this, of course, will be useful to us in our search for Chaucer's own text. But clearly it will be even more useful for scholars who may have no interest in Chaucer (unthinkable as it seems, to have no interest in Chaucer): for researchers into the history of the language, into dialect, into orthographic and morphological change.

This, then, leads to the second major aim of the Canterbury Tales Project: to make available to scholars all the material we gather, in as useful a form as we can manage. Although our first aim, to recover Chaucer's text, remains our own primary ambition, we have come to see that this second aim may well be the more important. It is certainly the most achievable: we have so far seen only a fraction of the material, and it will be years before we are able to make any firm statements about the whole textual tradition. Even when we do, scholars are likely to quarrel with us, and perhaps reject our whole approach. In the meantime, scholars can start using the resources we are making; and they can continue to use these resources for generations, long after our arguments for the priority of this manuscript or that are forgotten.

Clearly, we are not yet ready to say very much about the history of the tradition. Our success, or otherwise, in this first aim cannot be judged from this first CD-ROM. However, you will be able to judge from it how far we might achieve our second aim: the provision of information useful to other scholars.

Here is the opening screen of the latest prototype of the Wife of Bath's Prologue on CD-ROM. We are using the program DynaText, from Electronic Book Technologies, to present our work. You will see, in this opening screen, the DynaText metaphor of table of contents to the left, and text to the right.

Figure 1: DynaText opening screen

The table of contents gives an immediate impression of the scope of the CD-ROM: sections containing the base text for collation, the witnesses, the collation; articles; an all-text spelling database and bibliography. On the right-hand side, beneath the title and an invitation to see a full electronic title page and introduction, you see the beginning of the base text for collation. This base text is, in essence, the text of the Hengwrt manuscript, very lightly emended. At the head of the base text, a rubric invites the reader to click on any word to see just what readings the witnesses have, or do not have; or the reader can click on the number beside the line to see which witnesses have or do not have that line. Notice that the numbers beside the line are in red: throughout the CD-ROM, red is used to indicate the beginning of a hypertext link. We follow the invitation in the rubric, and click on the first word of the text proper, 'Experience'. The textual variant screen appears, showing us all the variants at that word:

Figure 2: Textual variants screen

A screen like this will appear for each of the six thousand words in the base text, showing all the variants at that word. A rubric at the top of the screen gives information about available hypertext links. Below this, we are reminded of the line in the base text and of the word in the base text, 'Experience', at which all the readings shown in this screen appear. We are then given the forty witnesses which have the same substantive reading 'Experience' as the base text; below these are the ten witnesses which have the variant 'Experiment', and so on. The rubric indicates two hypertext possibilities. The first possibility is to see the text of all witnesses to a particular reading, by clicking on the icon to the left of that reading. Thus, clicking on the icon beside the reading 'Experiment' invokes this screen:

Figure 3: Witnesses with 'Experiment'

Here is the text of this line in all ten witnesses which read 'Experiment'. Observe that only one of these witnesses actually has the spelling 'experiment'. In our collation, we have regularised the spelling variation out to leave only substantive variation at this level. We have not discarded the information about spelling variation: as we show later, this information about spelling variation becomes the foundation of the spelling databases.
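The regularisation step described above can be pictured with a minimal sketch. This is not the Collate implementation: the function name, the regularisation table, and the sigils W1 to W3 with their spellings are invented for illustration. Only the Fi reading with its unexecuted capital follows the account given in this talk.

```python
from collections import defaultdict

# Spellings as transcribed, keyed by witness sigil.
# W1-W3 and their spellings are hypothetical; Fi shows the missing capital as [..].
TRANSCRIBED = {
    "Fi": "[..]xperiment",
    "W1": "Experience",
    "W2": "experyence",
    "W3": "Experyment",
}

# Hypothetical regularisation table: original spelling (lower-cased) -> substantive reading.
REGULARISED = {
    "experience": "Experience",
    "experyence": "Experience",
    "experyment": "Experiment",
    "[..]xperiment": "Experiment",
}


def group_by_reading(transcribed, regularised):
    """Group witness sigils under the substantive reading they attest."""
    groups = defaultdict(list)
    for sigil, spelling in transcribed.items():
        # Unknown spellings are kept as their own reading rather than guessed at.
        reading = regularised.get(spelling.lower(), spelling)
        groups[reading].append(sigil)
    return dict(groups)


GROUPS = group_by_reading(TRANSCRIBED, REGULARISED)
```

The original spellings are not thrown away in this sketch: `TRANSCRIBED` is left untouched, so the same data can still feed a spelling database.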

The first hypertext possibility from the textual variant screen was to see the text of all witnesses to a reading, by clicking the icon to the left, as above. The second hypertext possibility from the textual variant screen is to click on the sigil for any one witness. This will take the reader to that line in the transcription of that witness (you can also reach the same point by clicking on the line for that witness in the last screen). Thus, clicking on the sigil Fi against the reading 'Experiment' takes us to the first line of our transcription of the Wife of Bath's Prologue in the Cambridge Fitzwilliam McClean 181 manuscript:

Figure 4: The Fitzwilliam manuscript in transcription

Note here that we do not show 'E' as the first letter of 'Experiment' but a bracketed ellipsis, [..]. If we want to see what the manuscript actually has, above this first line in the transcription there is a camera icon. If we click on this, a digitised image of this page appears:


Figure 5: The Fitzwilliam manuscript -- image and transcription

We can see now why in the transcription the first letter appears as [..]: the scribe left a space for the ornamental capital but this was never executed. Wherever there is a page break in the witness, our transcription shows a camera icon, and you can compare our transcription of any page with an image of the witness just by clicking on that icon. Thus, in this CD-ROM you will never be more than a few clicks away from a manuscript image; it will be very easy indeed for people to find the mistakes in our transcription.

You can burrow your way into the witnesses, as this account shows: by moving from the base text to the collation to a transcription to a page image, and so on. The CD-ROM will also permit you to go, very easily, direct to a transcription, or a page image, or many other places. Look again at the table of contents window:

Figure 6: The table of contents window

The + signs against some contents entries indicate that further headings lie beneath that entry. Here, clicking on the + beside 'The Witnesses' brings up headings for each of the fifty-eight witnesses. Then, opening up the entry for Ad1 (British Library Additional 5140) shows the separate items for each witness: notes on the transcription; the transcription itself; the description of the witness; the transcription of the glosses; a catalogue of all the images of this part of the witness; and finally the spelling database for that witness. You move to any one of these simply by selecting the item you want to see and clicking on it. Here is the introduction to the transcription, pointing out particular difficulties in our transcription.

Figure 7: An introduction to a transcription

Following this, we have the transcription itself, as we have already seen. Then we have a description of the witness, as follows:

Figure 8: A witness description

These descriptions have been prepared by Dan Mosser, of Virginia Polytechnic Institute and State University, as part of his projected new description of all manuscripts of the Canterbury Tales. Over the last fifteen years, Professor Mosser has examined every manuscript of the Canterbury Tales and we are fortunate to be able to present something of the results of his research on this CD-ROM. After the witness description, we present a transcription of the glosses in every manuscript:

Figure 9: Transcription of glosses

The transcription of the glosses has been made by Stephen Partridge, of the University of British Columbia, and we are again fortunate, as we have been with Dan Mosser, to be able to publish this. The transcriptions of the glosses will be linked to the transcriptions in the text by hypertext links both in the text and in the glosses. Thus, in this example the arrow in the top window beside line 9 of the transcription of British Library Additional 5140 (Ad1) will take the reader to the gloss on that line, shown in the bottom window. In turn, the arrow against that gloss in that window will take the reader back to the transcription.

The last item for each witness is the spelling database. These spelling databases are perhaps the most remarkable and novel feature of the CD-ROM. Here is the beginning of the spelling database for the Ad1 manuscript from the prototype CD-ROM, covering fifty lines of the text of the Wife of Bath's Prologue:

Figure 10: The spelling database for a witness

Clicking on any lemma at the left (in red, on the CD-ROM) will bring up a summary of all the spellings of that word in that witness:

Figure 11: Summary of spellings of a word in a witness

Here, we are told that there are fourteen occurrences of 'and' in Ad1. Further, we are told that four of these are in initial position, and in these cases the word is spelt 'And'. If the user clicks on the '(10)' beside the 'and' in the window above, DynaText will bring up a window showing all occurrences of non-initial 'and' in this witness, thus:

Figure 12: All occurrences of non-initial 'and'

Clicking on the line number to the left will take the reader straight to that point in the transcription. Similarly, clicking on the '(4)' beside 'And' in the previous window will bring up all instances of line initial 'and' in Ad1:

Figure 13: All occurrences of initial 'and'

In this instance, we have assigned each of the spellings of 'and' to the lemma, 'and', and also further classified each spelling as 'initial' (that is, standing first in the line) or not. Similarly, we have assigned every instance of the verb 'to be' to the lemma 'be' and then defined the grammatical form for each occurrence of the verb 'to be': as first person present singular, second person present singular subjunctive, etc. We do this as part of the collation process. Working through the Collate interface, this lemmatisation and part of speech classification can be done extremely quickly. We estimate it will take us around six weeks to achieve a full lemmatisation and part of speech classification of every one of the three hundred thousand words in the fifty-eight witnesses of the Wife of Bath's Prologue. There will be a spelling database similar to the above for every one of the fifty-eight witnesses on the Wife of Bath's Prologue CD-ROM. Each of these spelling databases will contain a record of the 6000 spellings, approximately, of the words in that witness, lemmatised and classified as above.
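A single-witness spelling database of this kind can be sketched as a count of spellings keyed by lemma and line position. The sketch below is an illustration under stated assumptions, not the structure Collate generates: the function name is hypothetical, grammatical category is omitted for brevity, and the sample lines are invented rather than taken from any witness.

```python
from collections import Counter

# Invented sample: each line is a list of (spelling, lemma) pairs.
SAMPLE_LINES = [
    [("And", "and"), ("lordes", "lord"), ("and", "and"), ("ladyes", "lady")],
    [("And", "and"), ("eek", "eek")],
    [("Thus", "thus"), ("&", "and")],
]


def build_spelling_database(lines):
    """Count each spelling under its lemma, split into line-initial and non-initial."""
    database = Counter()
    for line in lines:
        for position, (spelling, lemma) in enumerate(line):
            placement = "initial" if position == 0 else "non-initial"
            database[(lemma, placement, spelling)] += 1
    return database


DATABASE = build_spelling_database(SAMPLE_LINES)

# Summary for one lemma, in the manner of the 'and' summary screen.
AND_SUMMARY = {key[1:]: count for key, count in DATABASE.items() if key[0] == "and"}
```

Keeping the placement in the key is what allows initial 'And' and non-initial 'and' to be listed and counted separately.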

In addition to these fifty-eight 'single witness' spelling databases, the CD-ROM will contain a single 'all witness' spelling database. This will draw together all the 300,000 spellings in all fifty-eight witnesses into a single database, as follows:

Figure 14: All-witness spelling database

As with the spelling database for an individual witness, clicking on the lemma to the left (in red, on the CD-ROM) will bring up the all-witness spelling database open at that point:

Figure 15: An entry in the all-witness spelling database

This screen shot shows the five different spellings of non-initial 'and' found in the fifty-line sample of the fifty-eight witnesses to the Wife of Bath's Prologue in the prototype CD-ROM. The spellings of initial 'and' are grouped separately. Hypertext links will take the reader to the single-witness spelling database for a given witness (as in Figure 11 above), or to the text of all occurrences of this spelling in a particular witness (as in Figures 12 and 13 above). Similarly, there are hypertext links from every word in the spelling database for each witness to the corresponding position in the all-witness spelling database.
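The relation between the single-witness databases and the all-witness database can be sketched as a simple merge. Everything here is an assumption made for illustration: the function name, the sigils W1 and W2, and the counts are invented, and the real databases are SGML generated by Collate rather than Python dictionaries.

```python
# Invented per-witness counts, keyed by (lemma, placement, spelling).
PER_WITNESS = {
    "W1": {("and", "initial", "And"): 4, ("and", "non-initial", "and"): 10},
    "W2": {("and", "initial", "And"): 3, ("and", "non-initial", "&"): 7,
           ("and", "non-initial", "and"): 2},
}


def merge_databases(per_witness):
    """Merge single-witness databases into one, recording which witness has each spelling."""
    merged = {}
    for sigil, database in per_witness.items():
        for (lemma, placement, spelling), count in database.items():
            spellings = merged.setdefault((lemma, placement), {})
            # Each spelling maps back to the witnesses attesting it, which is
            # the information a link to a single-witness database would need.
            spellings.setdefault(spelling, {})[sigil] = count
    return merged


ALL_WITNESS = merge_databases(PER_WITNESS)
```

In this sketch the entry for non-initial 'and' lists every spelling found in any witness, and under each spelling the witnesses in which it occurs.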

In this account, we have concentrated on what is unusual in this CD-ROM: the presentation of textual variation, both of spellings and of substantive readings, in the context of full-text transcriptions, collations and digital images of each witness. Other parts of this CD-ROM are more conventional. In the 'Articles' section we present writings about our work. Some of the articles republish items published in our first Occasional Papers volume. Others are written specifically for this CD-ROM: thus articles by Dan Mosser on aspects of his witness descriptions. Finally, the 'General Bibliography' presents a bibliography of relevant works.

The fact that we are able to give this presentation of our prototype, and intend to publish the full CD-ROM within three months, is sufficient proof that we have found solutions to the major technical problems of making an electronic edition. We will say very little in this talk about how we did this work. In essence, we have transcribed all the witnesses into plain text files containing markup in Collate format. We then use Collate to carry out the collations, and to generate all the collations and spelling databases in Standard Generalised Markup Language (SGML). We also use Collate to convert the witness files into SGML. Accordingly, all our work -- all transcription, all collation -- is done with Collate encoding, which is a much simpler and easier form of markup to use than SGML. Indeed, we very rarely have to see SGML, or work with it directly, but instead rely on the tools in Collate to make the SGML for the CD-ROM. Most of the tools we use will be available in the forthcoming Project edition of Collate.

People expert in computer encoding are rare; perhaps as rare as people expert in Middle English. We could never do our work on the Canterbury Tales if we who work on this project all had to be expert in both computer encoding and Middle English. One of us, Peter Robinson, is proficient in computer encoding. The other, Elizabeth Solopova, feels about computer encoding as Professor Blorenge, professor of French Literature in Nabokov's Pnin, felt about his subject: Blorenge did not know French and disliked Literature. Already, the tools are sufficiently advanced to permit advanced work to be done without specialist computer knowledge. Moreover, the intellectual issues underlying the realisation of electronic editions will press with more and more urgency as the editions mature. We became aware in our work of several major areas of difficulty in what we are doing. We will dwell on three of them: our transcription, our spelling databases, and our understanding of what an electronic edition is.

The benefit of computer readable transcripts and spelling databases is that they enable large scale statistical research designed to produce 'objective results'. But this benefit conceals a trap: at their core inevitably lie interpretative decisions. This paradox is itself reason for caution in the use of such electronic tools, and it imposes a duty of consistency and transparency on those who make the interpretative decisions underlying these data collections. It became evident to us that in order to make our decisions reliable and predictable, our practice had to be very carefully weighed and well documented. All our decisions had to be a compromise between the requirements of consistency, utility and philological exactness, even though these requirements do not always well agree among themselves.

When working out our transcription and lemmatisation policies we tried to minimise the need for subjective decisions on a case-by-case basis, by simply eliminating some of the possible distinctions. Thus we decided to transcribe all first letters of line-initial words in verse as emphatic, in manuscripts where the scribe's usual practice is to use emphatic letters at line beginnings. For some letters scribes do not have distinct emphatic and unemphatic forms, and had we decided to keep this distinction, the choice (emphatic letter or not) would have had to be made 'impressionistically' by every transcriber. This would have largely undermined the value of this information.

When we preferred to keep distinctions which need many interpretative decisions, we brought this to the attention of our readers in the transcription introductions and in the lemmatisation statement. Thus in the transcription introductions to most manuscripts, we say that word division is uncertain and that it was often difficult to decide whether a spelling is written as one word or as two. Considerations of precision did not allow us to regularise word division. Nor was it possible to give a single rule for treatment of all such spellings: what looks like one word at first reading may look like two words at a second reading by the same person.

To carry out transcription we had to interpret every potentially significant mark on every manuscript page in accordance with our transcription policy. This was not always easy. In our article on the 'Principles of Transcription' (published in the first Canterbury Tales Project Occasional Papers volume) we explain our choice of a graphemic scheme: that is, a scheme which aims to preserve all graphemically distinct spellings. Thus, we neither level all spellings to a standard, as is common practice in printed editions, nor do we try to record all information about different letter forms, as in a graphetic analysis. However, given the uncertainties of Middle English scribal practice, it can be very difficult to determine whether particular marks on a page have graphemic meaning; and if they do have graphemic meaning, what meaning.

For example: we have chosen to transcribe as potentially significant marks, tails and flourishes occurring on final letters in many manuscripts. There are cases where flourishes appear to be undoubtedly meaningful: this is often the case with flourishes which stand for final -e in words ending in -re. In some such examples a flourish represents a stressed final -e, e.g. in the word 'tre' in Mm 733. Potentially meaningful use is sometimes revealed by comparison of manuscript spellings: 'sire' spellings in Hg often correspond to 'sir' with a flourish in El. At the same time tails often seem to have no graphemic meaning. We did not transcribe tails which occur on final vowels. In one manuscript -- La (British Library Lansdowne 851) -- we had to discard the attempt to record tails: they occur after virtually every final letter and are clearly ornamental and not graphemic.

As we carry out the collation, we assign every word in every witness to a lemma, or headword, and declare its grammatical form. This task presents many difficulties analogous to those inherent in our transcription system. It is clearly more useful to other scholars to sort all the spellings by lemma and grammatical category, than just heaping all the spellings into an undifferentiated mass. But it will only be useful if the sorting, the lemmatisation, is appropriate and transparent. It must not be too fine, and therefore risk engaging in precious distinctions which will annoy. Nor can lemmatisation be too coarse, and fail to make the divisions scholars may reasonably expect. Perhaps most important, scholars must easily grasp what distinctions have been made and not made, and why.

Just as it is often difficult to decide how we should represent particular marks on the page in our transcription, so it is often impossible to arrive at a firm decision as to just what grammatical category, or even to what lemma, a particular word should be assigned. Once more, the flux of Middle English over this period makes it difficult to decide just what grammatical categories we should determine.

One of the problems we encountered was how to treat such verbs as 'wol', 'shal', 'kan', 'may' and 'moot': as modal verbs, in anticipation of their modern state, or in the same way as all other notional verbs and so to determine in each case their mood and tense. 'sholde', for example, occurs both in preterite contexts where it can be perfectly well described as past indicative, and in present contexts, very much as modern English 'should'.

Another example of uncertainty is how to treat such prepositional phrases as 'to yeere', 'today', 'on live', 'a live', 'a caterwawed', 'on honde', 'a bedde' and so on. They can all be spelt either as one word or as two words. Do we treat them as nouns with prepositions, or as adverbs? We wished to reflect the transitional state of the language in our classification: some of these expressions were in the process of becoming 'full' adverbs. In the language of some scribes this process was more advanced than in the language of others, and we wanted this to be shown in our system. Because of this we did not want to adopt one rule for all such cases and to treat them all as, for example, nouns with prepositions. One way was to be guided by the highly irregular word division of the manuscripts, and to consider them adverbs when they are spelt as one word and nouns with prepositions when they are spelt as two words. Thus 'a liue' would be a preposition and a noun (lemmata 'on' and 'lyf'), whereas 'aliue' would be an adverb (lemma 'aliue'). We could also take into account the form of the preposition: if it is in a reduced form ('a') the expression is an adverb; if not, it is a noun with a preposition. However, the reduced form is common for 'on' and 'of', but not for 'by' or 'to', for example. This rule would work only for some of the cases.
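The first of these options, classification by word division, can be put as a minimal sketch. The function and the tiny lexicon are hypothetical; only the lemmata 'on', 'lyf' and 'aliue' come from the example above, and the entry for 'live' is our assumption by analogy.

```python
# Hypothetical mini-lexicon: spelling -> lemma.
PREPOSITION_LEMMATA = {"a": "on", "on": "on"}
NOUN_LEMMATA = {"liue": "lyf", "live": "lyf"}  # 'live' -> 'lyf' is assumed


def classify_by_word_division(form):
    """Return (spelling, lemma, category) triples for a one- or two-word form."""
    parts = form.split()
    if len(parts) == 1:
        # Written as one word: treated as a 'full' adverb, lemma taken as spelt.
        return [(form, form, "adverb")]
    if len(parts) == 2:
        prep, noun = parts
        # Written as two words: a preposition followed by a noun.
        return [
            (prep, PREPOSITION_LEMMATA.get(prep, prep), "preposition"),
            (noun, NOUN_LEMMATA.get(noun, noun), "noun"),
        ]
    raise ValueError(f"expected one or two words, got: {form!r}")


ONE_WORD = classify_by_word_division("aliue")
TWO_WORDS = classify_by_word_division("a liue")
```

The sketch makes the weakness of the rule plain: the outcome depends entirely on the word division recorded in the transcription, which is itself often uncertain.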

Our system had to anticipate various difficulties. Thus we have chosen to mark the oblique case of monosyllabic nouns, often expressed through final -e. We decided to make this distinction not only for nouns ending in a consonant, but also for nouns with final -e. This is because some nouns have alternative forms with and without -e, and in such cases it is not clear whether the final -e is due to oblique case or always occurs in this word. An example is the word 'birthe'. It occurs twice in Hg, both times in the form 'birthe': once in a prepositional phrase 'in oure birthe' WBP 400, and once in objective case 'that he his birthe took' MLT 192. At the same time this word -- an early Middle English adoption of an Old Norse endingless nominative -- did occur without final -e in Middle English. We have chosen to mark the oblique case of such nouns to avoid prejudgement about the function of final -e.

Very often in our work on the transcription scheme and spelling databases we felt overwhelmed by the need to make numerous choices. We aimed at a system that would do justice to the diversity of language of the manuscripts, that would be logical and consistent as far as this is reasonable and possible, and would provide scholars with information as complete as possible for research. We do not know how well we met each of these requirements, and we look forward to our work being tested by scholars.

Our most general concern is: just what sort of edition are we creating? Or, to put the question another way: who is going to use this CD-ROM; how will they use it? We have been too absorbed in the last years with the struggle to do this work to consider these questions overmuch. Because there never has been an edition like this before -- indeed we do not know whether it should be called an edition -- we have no answers to these questions. But we do have some hopes. We hope that by bringing the manuscripts so close to the reader, in all their richness and all their confusion of readings and spellings, we will bring the period, the language, and Chaucer himself far closer to the reader. The fixity of a printed edition distances the text from the reader. We hope that through our many texts the immediacy of each scribe's attempts to understand and to transmit might be borne upon the reader, as they have been borne upon us. It might be feared that, with so many texts, Chaucer's own text might disappear into a miasma of variation. We hope the reverse: that we can help the reader find his or her way through the variation to a clearer perception of what Chaucer did write, and did not write. We have some ideas as to how this might be achieved; but this is a subject for another time.