Some Problems of TEI Markup and Early Printed Books (by Carole E. Mah and Julia H. Flanders)

Mirrored from: http://dynaweb.stg.brown.edu/wwp_books/DL/57

Transcription challenges inherent in the document: The problem of dual emendation and correction

Advantages of an SGML solution

In constructing our textbase, the WWP envisions that it will be used in the following ways: to get a diplomatic transcription of the source for scholarly work, to get a clear transcription for reading, to verify the accuracy of the transcription, and to do useful searching. These goals seem to fulfill many of the most common features people expect of an electronic textbase. In addition, they fulfill many of the features one expects from various kinds of printed scholarly editions. SGML is instrumental in making it possible to achieve all these goals, but also brings to light some challenges which deserve careful thought. These challenges are not introduced by SGML but are inherent to doing transcription of early modern printed books; using SGML merely foregrounds the issues.

In constructing our textbase, the WWP's policy is one of diplomatic transcription. Diplomatic transcription involves transcribing the text of the document without making any emendations or corrections of apparent errors in the source text based either on one's own judgment or on other versions of the text. By contrast, it is standard practice when producing reprints or critical editions to correct apparent errors; reprints typically do not even consider it necessary to inform the reader which words were corrected. When producing a critical edition, a scholar produces corrections to apparent errors only after comparing the readings of a given word in a wide range of relevant documents in order to determine, based on this body of evidence, whether the apparent error is in fact an error (such as a printer's error or a singular misspelling on the part of the author) or whether it is a period spelling or an idiosyncratic spelling (but consistent across editions or works) specific to that author. One of the reasons a diplomatic transcription does not presume to correct apparent errors is that it is by definition a transcription of only a single version of a text. Another reason, when working with pre-Victorian literature by women, is that spelling was not standardized before the 19th century, and for women even less so given their lesser access to formal education.

A clear transcription for reading means a transcription that does not require the reader to wade through the difficulties of apparent errors and old printing conventions. This is an ideal transcription for the casual reader and for many teaching situations. In such a transcription, not correcting apparent errors could lead to great confusion in understanding meaning in context in some cases. In addition, the audience for such a transcription might not be interested in the uncorrected version. The reader of such a transcription would also not be interested in or necessarily be able to understand conventions such as the printing (prior to about 1650) of certain letter combinations in an abbreviated fashion. For example, a character with a short horizontal bar, small letter, or other mark directly over it (not above and to the side as in a superscript) is called a brevigraph. A "y" with a mark of some sort over it can stand for "the", "thou", or "that." An "e" with a short bar over it can stand for the letter combination "en" or "em". Reproducing these brevigraphs as closely as possible for such an audience would be counter-productive.

SGML gives one the ability to simultaneously record both the apparent error and one's correction, without compromising the policy of diplomatic transcription, since (using the TEI <sic> element) the document's content remains unchanged, with one's correction recorded on the attribute value:

<sic corr="whether">mhether</sic>

In this way, both a reading version and a diplomatic version can be produced from the single source transcription. The programmer can produce the diplomatic version of a text by specifying that the content of all <sic> elements should be honored (producing in this instance "mhether") and can produce the reading version by specifying that the content of the <sic> elements should be ignored in favor of the value of the corr attribute (producing in this instance "whether"). The same argument goes for the transcription of abbreviations such as brevigraphs; both forms can be encoded simultaneously using the TEI <abbr> element with the expansion recorded on the attribute value:

<abbr expan="condition">c[oacute]diti[oacute]</abbr>

The diplomatic version of the text would then honor the content of the <abbr> element, giving the abbreviated form with the brevigraphs. The reading version would ignore the content of the <abbr> element in favor of the value of the expan attribute, giving the expanded form (i.e. "condition").
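The two derivations just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WWP's actual processing software: it assumes flat (non-nested) <sic> and <abbr> elements written exactly as in the examples above, and the sample sentence and the function names `diplomatic` and `reading` are hypothetical.

```python
import re

# Patterns for the two flat encodings shown above (assumption: no nesting,
# attributes always double-quoted, exactly one attribute per element).
SIC = re.compile(r'<sic corr="([^"]*)">(.*?)</sic>')
ABBR = re.compile(r'<abbr expan="([^"]*)">(.*?)</abbr>')

def diplomatic(text):
    """Honor element content: keep the source reading, drop the tags."""
    text = SIC.sub(lambda m: m.group(2), text)
    return ABBR.sub(lambda m: m.group(2), text)

def reading(text):
    """Honor attribute values: substitute the correction or expansion."""
    text = SIC.sub(lambda m: m.group(1), text)
    return ABBR.sub(lambda m: m.group(1), text)

# Hypothetical sentence built around the two encodings from the text.
sample = ('Ask <sic corr="whether">mhether</sic> the '
          '<abbr expan="condition">c[oacute]diti[oacute]</abbr> holds.')

print(diplomatic(sample))  # Ask mhether the c[oacute]diti[oacute] holds.
print(reading(sample))     # Ask whether the condition holds.
```

Both outputs come from the same encoded string; only the processing choice differs.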

Having used SGML for transcription, one can simultaneously or separately produce two versions--a diplomatic version and a reading version--from the single source document and can thus cater to a variety of audiences without duplicating one's labor. Traditional scholarly editing projects usually provide no such access, choosing one approach or the other, but making no attempt to provide both. For instance, Malone Society type facsimile reprints provide the exact facsimile of the original characters and do not provide expansions. In contrast many critical editions expand and modernize everything. Using SGML to create an electronic edition gives one the option to enlarge the number of audiences for one's texts. An electronic edition has at least the potential to serve a wide variety of audiences from the specialist in early modern printed books, to the linguist specializing in the Renaissance, to the well-educated generalist, to the undergraduate student, to the general public. (In fact, the attempt to cater to as many audiences as possible is another way, in addition to broad range in time and chronology, that the WWP distinguishes itself from other humanities text encoding projects. And again, this is both an asset and a liability.)

Using SGML also gives the reader a way to verify the accuracy of the transcription. Having both the apparent error and the correction at hand (for instance in an online version that incorporates both the reading and diplomatic versions) lets the reader know that the error is inherent and was not introduced by the transcriber, especially since, in a textbase of this size, there will inevitably be some errors introduced by the transcriber and the reader would want to distinguish between the two. This is of course a problem for traditional scholarly work as well. For instance, a Malone Society reprint presents the doubtful reading as is in the body of the text, and in the front matter provides a table of the doubtful readings and the editor's corrections together with a reference to the page on which the doubtful reading occurs. This is so obvious a problem that it seems unnecessary to mention it except that people often assume or expect otherwise in an electronic environment.

Apart from the original unexpanded forms' intrinsic interest for many scholars, having both the abbreviated form as well as the expanded form of a word allows the reader, as with apparent errors, the ability to verify the transcription. One could perhaps dispense with transcribing the original abbreviated form and simply encode only the expansion; e.g. simply type "condition" rather than:

<abbr expan="condition">c[oacute]diti[oacute]</abbr>

Then one could provide (instead of textual markup) a scanned image of the original page for comparison. However, should people doubt the accuracy of our expansion of several "e"'s with brevigraphs as "em" rather than "en" or vice versa, they might want to do a search for all abbreviated forms (&emacr;) in context, and make their own judgment. This can easily be done if both versions are encoded in tandem; comparing each of our expansions to a scanned image would be quite a laborious back-and-forth task.
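A search of this kind might look like the following Python sketch. It is offered as an assumption-laden illustration only: the function name `abbreviations_in_context`, the `width` parameter, and the sample text are hypothetical, and the brevigraph is represented by the bracketed placeholder used in the encoded examples of this article rather than by a real SGML entity.

```python
import re

# Matches a flat <abbr> element as written in the examples (no nested tags).
ABBR = re.compile(r'<abbr expan="([^"]*)">([^<]*)</abbr>')

def abbreviations_in_context(text, mark, width=20):
    """List every <abbr> whose content contains `mark`, with its expansion
    and up to `width` characters of raw context on either side."""
    hits = []
    for m in ABBR.finditer(text):
        expan, content = m.group(1), m.group(2)
        if mark in content:
            before = text[max(0, m.start() - width):m.start()]
            after = text[m.end():m.end() + width]
            hits.append((content, expan, before, after))
    return hits

# Hypothetical encoded passage containing two abbreviations.
sample = ('she was <abbr expan="tempted">t[eacute]pted</abbr> by the '
          '<abbr expan="condition">c[oacute]diti[oacute]</abbr> offered')

hits = abbreviations_in_context(sample, "[eacute]")
for content, expan, before, after in hits:
    print(f"{content!r} expanded as {expan!r}, preceded by {before!r}")
```

Because the abbreviated form and its expansion sit side by side in the encoding, the reader can judge each "em" or "en" decision without returning to a page image.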

Verification is not the only reason to provide both the abbreviated and expanded forms of a given word and both the corrected and uncorrected forms of a given word. Powerful searching capabilities are the most oft-heralded feature of electronic documents and especially SGML-encoded ones (which provide for sophisticated context-sensitive searching based on the structural hierarchy of the document elements). Expansion provides the full form of a word, which is what most people expect to encounter when writing or using programs for useful searching of a text or an entire textbase. For example the Oxford English Dictionary is using the WWP textbase as a new corpus upon which they can do searches for occurrences of words that may supplant the current earliest attested usage of that word. This is a significant and powerful resource for them, since prior to the existence of such electronic textbases, they had to do their research by hand. Similarly, correction of apparent errors provides meaningful data for searching programs (e.g. a search across one or several textbases for the word "pickle" would not turn up the occurrence misspelled "qickle" unless that word were tagged appropriately).

Simple Cases

For both corrections and expansions, then, the SGML approach is clearly more robust than the traditional non-electronic approach (and many non-SGML electronic approaches), allowing one to provide a variety of audiences with, in each case, two different readings of the same text. In addition, however, in many texts there are large sets of non-overlapping examples of both corrections and expansions; in these cases, providing all four possible permutations of the two is a simple matter. For example, consider:

exqectati[oacute]

where the "q" is a typesetter's mistake. There are four possible readings of this word:

exqectati[oacute]  (uncorrected, abbreviated)
exqectation  (uncorrected, expanded)
expectati[oacute]  (corrected, abbreviated)
expectation  (corrected, expanded)

To encode this, one could do the following:

ex<sic corr="p">q</sic>ectati<abbr expan="on">[oacute]</abbr>

-- a very straightforward application of the TEI tagset. From this single source document one could then derive all four possible readings of every such instance in the document using simple, unambiguous processing programs to produce four full different text versions.
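As a rough illustration of such a processing program, the Python sketch below derives all four readings from the encoding above. It rests on assumptions not made in the original: the fragment is treated as well-formed XML and wrapped in a hypothetical <w> element so that the standard library parser accepts it, and the names `render` and `all_readings` are invented for the example.

```python
import xml.etree.ElementTree as ET
from itertools import product

ENCODED = 'ex<sic corr="p">q</sic>ectati<abbr expan="on">[oacute]</abbr>'

def render(node, use_corr, use_expan):
    """Return the text of `node`, taking attribute values where requested."""
    if node.tag == "sic" and use_corr:
        return node.get("corr", "")
    if node.tag == "abbr" and use_expan:
        return node.get("expan", "")
    out = node.text or ""
    for child in node:
        out += render(child, use_corr, use_expan)
        out += child.tail or ""  # text following the child inside `node`
    return out

def all_readings(fragment):
    """Map each (use_corr, use_expan) choice to the reading it produces."""
    root = ET.fromstring(f"<w>{fragment}</w>")  # <w> is a hypothetical wrapper
    return {(c, e): render(root, c, e)
            for c, e in product((False, True), repeat=2)}

readings = all_readings(ENCODED)
for (corrected, expanded), word in readings.items():
    print(corrected, expanded, word)
```

Each of the four switch settings is applied uniformly to every element of a given type, which is what keeps the processing simple and unambiguous in the non-overlapping case.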

One could in fact choose any number of such approaches depending on one's analysis of the likely audiences, the sophistication of available software, and the relative amount of additional labor involved in each approach. In contrast, in most traditional publishing situations, no one would even attempt to solve this problem in a way that would facilitate providing four separate full texts; rather, the inevitable result would involve being forced to choose which bits of information to lose--which would be least useful to a given (usually single) target audience. Therefore, in the traditional type facsimile situation the doubtful reading would be printed (perhaps with a table of doubtful readings in the appendix), and the abbreviations would remain unexpanded. This provides a basic diplomatic version. In fact this same choice might be made by many a TEI-conformant project. The point is that SGML gives one the option not to have this be the only choice, so that if chosen, it is not by default but by an analysis of labor, audience, and software.

Additional layers of complexity: an in-depth example

What does one do when the set of doubtful readings and the set of abbreviations overlap? That is (for example) what if the character or characters involved in a doubtful reading are also characters that have brevigraphs? Given the complexity of the textual issues, in both the traditional and the SGML publishing situation, some information will be lost no matter what one does. The traditional solution would not differ in the case of such overlap as compared to cases in which there is no overlap. However, the variety of ways of nesting structural hierarchies using SGML means that there are many ways of solving such problems. Many of these solutions are excellent and the choice will depend on the intended audience(s). A few of the potential solutions should be avoided because the way in which they nest elements introduces unnecessary difficulties and because they leave too much up to the processor rather than making as much as possible clear in the encoding. These are issues a traditional publisher has neither the privilege nor the burden of facing.

A typical Women Writers Project example of such a complex situation is shown below, taken from Foxe's Actes and Monuments (which contains a version of Anne Askew's Examinations):

t[eacute]p[eacute]ted

Fully expanded and corrected, this would be:

tempted

The four major readings of this word are:

t[eacute]p[eacute]ted  (uncorrected, abbreviated)
tempemted  (uncorrected, expanded)
t[eacute]pted  (corrected, abbreviated)
tempted  (corrected, expanded)

If the intended audience made it necessary to provide only one or two readings, it would of course be a trivial matter to encode this word with a simple encoding such as either

t<abbr expan="em">[eacute]</abbr>p<abbr expan="em">[eacute]</abbr>ted

<abbr expan="tempemted">t[eacute]p[eacute]ted</abbr>

in order to get either of the following two readings

t[eacute]p[eacute]ted  (uncorrected, abbreviated)
tempemted  (uncorrected, expanded)

To get the first one (no expansion) the processor would simply ignore the attribute values on <abbr>, whereas to get the second one the processor would heed them. Similarly, the simple encodings

t[eacute]p<sic corr="">[eacute]</sic>ted

<sic corr="t[eacute]pted">t[eacute]p[eacute]ted</sic>

would yield either of the following two readings

t[eacute]p[eacute]ted  (uncorrected, abbreviated)
t[eacute]pted  (corrected, abbreviated)

by taking or not taking the value of the corr attribute on the <sic> element.

A more complex possible encoding is (let us call it Example 5):

t<abbr expan="em">[eacute]</abbr>p<sic corr=""><abbr expan="em">[eacute]</abbr></sic>ted

One could produce the following readings from this encoding:

t[eacute]p[eacute]ted (uncorrected, abbreviated) [using CONTENT of <sic>; CONTENT of <abbr>]
tempemted (uncorrected, expanded)  [using CONTENT of <sic>; ATTRIBUTE of <abbr>]
t[eacute]pted (corrected, abbreviated)  [using ATTRIBUTE of <sic>; CONTENT of <abbr>]
tempted (corrected, expanded)  [using ATTRIBUTE of <sic>; ATTRIBUTE of <abbr>]

How do these readings derive from the encoding? Two assumptions are at work here; the first is one made by the programmer; the second is an unavoidable result of the tree structure of an SGML-encoded document. First, assume that each reading is produced by treating all instances of a given element in the same way. That is, if the decision is made to take the attribute value on an element and therefore ignore its content, then all instances of the element are so treated by the programmer (this is a natural and easy assumption to implement since it requires no special actions on the part of the programmer). Second, assume that when tags nest, the outer tag (or parent element) takes precedence over the inner (or child element). Thus, in the second nesting, whatever we had as the content of the second <abbr> would be ignored in favor of the value of the expan attribute on it:

t<abbr expan="em">[eacute]</abbr>p<abbr expan="em">NONSENSE</abbr>ted

This would also produce:

tempemted (corrected, expanded) [using ATTRIBUTE of <sic>; ATTRIBUTE of <abbr>]

Following this same logic,

t<abbr expan="em">[eacute]</abbr>p<abbr expan="NONSENSE"><sic corr="">[eacute]</sic></abbr>ted

would produce:

tempNONSENSEted (corrected, expanded) [using ATTRIBUTE of <sic>; ATTRIBUTE of <abbr>]

Now study another, deceptively similar-looking encoding (let us call it Example 6):

t<abbr expan="em">[eacute]</abbr>p<abbr expan="em"><sic corr="">[eacute]</sic></abbr>ted

The only difference between this encoding and the previous one (i.e. Example 5) is that the nesting of the second <abbr> and the <sic> is reversed. One might think this would make no difference. However, observe that this encoding produces the following readings:

t[eacute]p[eacute]ted (uncorrected, abbreviated) [using CONTENT of <sic>; CONTENT of <abbr>]
tempemted (uncorrected, expanded)  [using CONTENT of <sic>; ATTRIBUTE of <abbr>]
t[eacute]pted (corrected, abbreviated)  [using ATTRIBUTE of <sic>; CONTENT of <abbr>]
tempemted (corrected, expanded)  [using ATTRIBUTE of <sic>; ATTRIBUTE of <abbr>]

Closely examining the last of these reveals that:

tempted  (corrected, expanded)

cannot be produced from this encoding, whereas it can be from the opposite nesting. As explained above, this is because the parent element takes precedence over the child. Now it should be clear that the only way to get "tempted" from this encoding is by not making the first assumption described above that all instances of a given element should be treated identically. Rather, the programmer would have to specify that the second <abbr> should be treated differently than the first:

tempted (corrected, expanded) [using ATTRIBUTE of <sic>; ATTRIBUTE of first <abbr>, CONTENT of second <abbr>]

Here, in the first instance of <abbr>, the attribute value is taken, whereas in the second instance the content of the <abbr> element is taken.
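The difference between Examples 5 and 6 can be reproduced with a small Python sketch that applies the two assumptions literally: every instance of an element is treated the same way, and a parent whose attribute is taken suppresses everything inside it. The code is an illustration under assumptions, not the WWP's software: the fragments are parsed as XML inside a hypothetical <w> wrapper, and `render` and `corrected_expanded` are invented names.

```python
import xml.etree.ElementTree as ET

EXAMPLE_5 = ('t<abbr expan="em">[eacute]</abbr>p'
             '<sic corr=""><abbr expan="em">[eacute]</abbr></sic>ted')
EXAMPLE_6 = ('t<abbr expan="em">[eacute]</abbr>p'
             '<abbr expan="em"><sic corr="">[eacute]</sic></abbr>ted')

def render(node, use_corr, use_expan):
    """Parent precedence: once an attribute is taken, children are ignored."""
    if node.tag == "sic" and use_corr:
        return node.get("corr", "")
    if node.tag == "abbr" and use_expan:
        return node.get("expan", "")
    out = node.text or ""
    for child in node:
        out += render(child, use_corr, use_expan)
        out += child.tail or ""
    return out

def corrected_expanded(fragment):
    """Take ATTRIBUTE of <sic> and ATTRIBUTE of <abbr> uniformly."""
    root = ET.fromstring(f"<w>{fragment}</w>")  # <w> is a hypothetical wrapper
    return render(root, use_corr=True, use_expan=True)

print(corrected_expanded(EXAMPLE_5))  # tempted
print(corrected_expanded(EXAMPLE_6))  # tempemted
```

With uniform treatment, only the nesting in Example 5 lets the empty corr value override the inner expansion; recovering "tempted" from Example 6 would require special-casing the second <abbr>.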

The TEI-conformant textbase project might very well turn to the use of feature structures at this point (discussing this option is outside the scope of this article--for WWP purposes the added functionality feature structures would provide does not seem to outweigh the complexities they would introduce); however, that choice aside, these examples exhibit several important points about encoding early printed books. One is the utter necessity, when choosing which encoding scheme to implement, that the choice be well-documented and consistently implemented. Implementing several of the possible choices in a single document would be unwise in the extreme. It is also clear that it is important to choose a scheme which depends as little as possible on how the programmer treats the encoded document in using it to produce various versions of the text and depends as much as possible on the clarity and simplicity of the information provided by the markup. That is, markup situations which might force one to treat different instances of an element in different ways should be avoided at all costs. Finally, in coming to a decision about which encoding scheme best achieves the goals of multiple versions of a text from a single source document, the facilitation of searching, and the ease of verification of transcription, it is important to consider which produceable versions are the most desirable to likely audiences.

The current preferred solution at the WWP takes all of these points into consideration. In particular, we have concluded that the "tempemted" reading is less desirable among our potential audiences than the other three readings, since it is likely that any reader who would want abbreviations expanded would also like the errors to be corrected. Further, we wanted to avoid the potential confusion of the sort of nesting explored in the above examples. Therefore, we settled on the following:

<abbr expan="tempted">t[eacute]p<sic corr="">[eacute]</sic>ted</abbr>

This would make the following three readings possible:

t[eacute]p[eacute]ted  (uncorrected, abbreviated)
t[eacute]pted  (corrected, abbreviated)
tempted  (corrected, expanded)

With this encoding, no special treatment of different instances of an element is necessary to produce the desired results. Finally and perhaps most significantly (both from the point of view of readers who might want to look at the raw SGML and of student transcribers learning how to encode), this solution is much less long-winded and much easier to understand than the ones above.
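A quick way to check which readings this encoding supports is to enumerate every uniform processing choice, as in the Python sketch below. It is a hedged illustration only: the fragment is parsed as XML inside a hypothetical <w> wrapper, and `render` is an invented helper that lets a parent's attribute override its content.

```python
import xml.etree.ElementTree as ET
from itertools import product

WWP = '<abbr expan="tempted">t[eacute]p<sic corr="">[eacute]</sic>ted</abbr>'

def render(node, use_corr, use_expan):
    """Take attribute values where requested; otherwise descend into content."""
    if node.tag == "sic" and use_corr:
        return node.get("corr", "")
    if node.tag == "abbr" and use_expan:
        return node.get("expan", "")
    out = node.text or ""
    for child in node:
        out += render(child, use_corr, use_expan)
        out += child.tail or ""
    return out

root = ET.fromstring(f"<w>{WWP}</w>")  # <w> is a hypothetical wrapper
# Four uniform choices collapse to three distinct readings, because taking
# the expan value on the outer <abbr> hides the <sic> inside it.
distinct = sorted({render(root, c, e)
                   for c, e in product((False, True), repeat=2)})
print(distinct)
```

The uncorrected, expanded form never appears, and no instance of either element needs to be treated differently from any other.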

It is important to re-emphasize that the problem of concurrent provision of correction and expansion is an intellectual and scholarly challenge that exists independently of SGML's existence. Using SGML merely makes it possible (in the simple, non-overlapping case) to address the question rather than to avoid it, duck it, or be forced to ignore it for obvious want of a solution. In the more complex case SGML can introduce some difficulties but only because it also introduces the opportunity to provide multiple versions of a text from a single transcription--something that would not otherwise be possible. In the section that follows, we will examine a case where it seems to be the nature of SGML itself that creates the encoding challenge.