[This local archive copy is from the official and canonical URL, http://www.wwp.brown.edu/training/intro/; please refer to the canonical source document if possible.]
November 19, 1998
Version 2.0
The initial encoding of a text at the WWP involves four main steps. The first step is to read and analyze the text for basic structural and textual features. This first step is known as "Document Analysis". The second step is to encode the text. Next comes proofreading the text. The final step is to input the corrections from the proofreading. There may be subsequent rounds of proofreading and corrections input after these initial four steps.
The actual encoding uses the principles and syntax given in the TEI Guidelines, as modified and amended by the WWP. These WWP modifications and specifications are recorded in the Encoding Resolutions FileMaker Pro Database. Hereafter, I refer to this via the shorthand "Encoding DB". One can find the answer to most specific encoding questions by referring to this database and to the TEI Guidelines ("P3") themselves. (For example, "what specific tags do I need to use to encode a table of contents?")
However, these specifics are much easier to implement if one first has a general understanding of the way documents are structured. It is important that we all encode in a consistent manner and that we think through the complexities of the texts carefully before attempting to encode them.
This document will give you an idea of the major textual and structural issues you will need to keep in mind. The intent here is to introduce you to the general concepts of encoding at the WWP. You should not expect this document to introduce you to the specific tags used to implement these concepts. Once you understand the basic concepts and terminology, you can use these ideas and terms to find the "how-to" answers to specific tagging questions by using the FileMaker DB and the TEI Guidelines.
The TEI defines all texts as consisting of front matter, a body, and back matter. Front matter and back matter do not always occur, but there is always a body. Thus, a document might only have a title page, a dedication, a preface, and three chapters. The title page, the preface, and the dedication are all front matter. The three chapters comprise the main body of the text; there is no back matter in this case. Another text might additionally contain (for example) an index, which would be considered back matter.
Thus the general structure of any document is that of front matter, a body, and back matter; each of these three can contain one or more major textual divisions. For example, front matter may consist of a title page, a table of contents, two dedicatory epistles, and a preface. By definition, everything in a document nests inside a major division, and a major division must always nest inside the front, body, or back. That is, nothing can occur in between major divisions. With a few exceptions (notably the title page), every major structural division of a text is called a "division" (the specific tag is <div>). To distinguish one kind of division from another, each is given an attribute coinciding with its function. For example, a chapter is a <div type="chapter">.
When analyzing any WWP text, the first structural decisions should always be to figure out the major divisions. Very generally, a typical WWP text might structurally decompose as:
<text>
  <front>
    <titlePage>All the stuff on the title page</titlePage>
    <div>Some front matter, perhaps a dedicatory letter</div>
    <div>Some more front matter, perhaps a preface</div>
    <div>Yet more front matter, etc.</div>
  </front>
  <body>
    <div>A major division of the body, say, a chapter</div>
    <div>Another one</div>
    <div>Another one</div>
    <div>etc.</div>
  </body>
  <back>
    <div>Some piece of back matter, an appendix, index, etc.</div>
    <div>Another one</div>
  </back>
</text>
This example does not use WWP-specific tagging in full, so as not to get bogged down in specific tagging issues.
Conceptually, all texts will break down structurally in a similar manner. The specific structure and therefore the specific tagging will of course differ depending on the particular document.
The major structural differences between documents are usually due to differences in genre. The major divisions in the body of a prose work could be (for example) chapters, sections, essays, revelations, prayers, etc. Which of these applies to a document will depend on whether the work in question is a novel, a political tract, a religious tract, etc. The major divisions in the body of a verse work are usually poems. The major divisions in the body of a dramatic work are usually acts (and, within those, scenes).
When analyzing a document, one should think along these general lines. However, do not be irrevocably wed to the specific names of divisions which we have used so far at the WWP. New documents will sometimes reveal a new major type of division which so far has not been encountered.
In addition, a document can be quite complex and involve a good bit of nesting of different types of major divisions. For example, a dramatic work may be written in verse form. A dramatic work might contain within it another play (the actors in the play acting out a play of their own). Or, for instance, a fictional novel may contain several large poems. Finally, a text might contain both political musings and poems, as well as a short play thrown in for good measure. Thus it is often helpful to draw a tree diagram or other helpful visual aid of the structure of a document in order to gain a full understanding of its structure.
Thus it is quite likely that your document (inside the front, body, back or all three) will contain several levels of nesting divisions, e.g.,
<body>
  <div>
    <div>[stuff]</div>
    <div>[stuff]</div>
    <div>
      <div>[stuff]</div>
      <div>[stuff]</div>
    </div>
  </div>
  <div>[stuff]</div>
  <div>[stuff]</div>
  <div>
    <div>[stuff]</div>
  </div>
</body>
The TEI refers to major structural elements below the level of the <div> as "chunks". Each division of any text is usually subdivided into major chunks. These chunks differ according to genre. In prose, the chunk-level element is the paragraph (<p>). In verse, the chunk-level element is the line group (<lg>). Line groups consist further of metrical lines (<l>). In dramatic works, the chunk-level element is the speech (<sp>). Speeches will usually consist further of either paragraphs or line groups.
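As a rough illustration of the three chunk-level elements just described, the fragments below show one division each of prose, verse, and drama. Only the element names and the "chapter" type come from the discussion above; the type values "poem" and "scene" and all of the placeholder text are invented for the example, and real WWP tagging carries further attributes that are documented in the Encoding DB.

```sgml
<!-- Prose: the division is subdivided into paragraphs -->
<div type="chapter">
  <p>First paragraph of the chapter.</p>
  <p>Second paragraph of the chapter.</p>
</div>

<!-- Verse: the division is subdivided into line groups,
     each of which contains metrical lines -->
<div type="poem">
  <lg>
    <l>First line of the stanza,</l>
    <l>second line of the stanza.</l>
  </lg>
</div>

<!-- Drama: the division is subdivided into speeches;
     this speech is in verse, so it contains a line group -->
<div type="scene">
  <sp>
    <lg>
      <l>A speech written in verse,</l>
      <l>continuing on a second line.</l>
    </lg>
  </sp>
</div>
```

A prose speech would contain one or more <p> elements in place of the <lg>.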
The TEI has some other major groups of tags to accommodate structures which do not fall into these major groupings. Most of these follow some sort of common sense; for example, many major structural divisions begin with a heading, e.g.:
<div>
  <head>Chapter 1: Chickens</head>
  <p rend="reg">stuff</p>
  <p rend="reg">stuff</p>
</div>
Below or inside the "chunk" level is the "phrase" level. Thus, any element which you might want to tag within a paragraph, heading, line group, speech, etc., is a "phrase-level" element, e.g. (<name> is the phrase-level element in this example):
<div>
  <head>Chapter 1: Swords</head>
  <p rend="reg">blah blah blah it was called <name>Excalibur</name> blah blah</p>
  <p rend="reg">more blah blah blah</p>
</div>
In general, the WWP encodes phrase-level elements only if they are "renditionally distinct" from the surrounding text. For instance, a word or phrase that is italicized within a paragraph which is otherwise regular roman (upright) text should be tagged using a phrase-level element. The main exception to this is proper nouns, which should be tagged whether or not they are renditionally distinct. There is more about renditional distinction a little farther on in this document.
The TEI defines another group of elements that it calls "inter-level" elements, which occur both within and between paragraphs. For example, notes, lists, and figures (e.g. drawings) fall into this category. There are other major groups of elements which you can find out about by browsing through the Guidelines.
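To make the idea of an inter-level element concrete, here is a minimal sketch using a list, which may appear either inside a paragraph or between two paragraphs. The content is placeholder text and the attributes are kept to a minimum; it is meant only to show the two positions, not the WWP's actual tagging of lists.

```sgml
<div type="chapter">
  <!-- A list occurring within a paragraph -->
  <p>The author names three virtues:
    <list>
      <item>the first virtue</item>
      <item>the second virtue</item>
      <item>the third virtue</item>
    </list>
    and then returns to her argument.</p>

  <!-- A list occurring between paragraphs -->
  <list>
    <item>the first item</item>
    <item>the second item</item>
  </list>

  <p>Another paragraph.</p>
</div>
```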
There are some "chunk" and "phrase"-level elements that belong in any document. However, many other phrase and chunk-level elements belong only in verse; others belong only in dramatic works. (Chapters 6-10 of the TEI Guidelines are organized with this in mind.)
The concepts just described here will help you to understand the general order of the chapters in the TEI Guidelines. The Guidelines also contain a table of contents, and a useful index. However, the most useful feature of P3 is the back half of the second volume. This is an alphabetical listing of all TEI elements. For each element there is a short description of the element and a list of its attributes. There is also a reminder about what sort of element it is (chunk, phrase, inter, etc.), and perhaps an example of its use. Most usefully, there is also a list of all the elements in which that particular element may occur (i.e. all the elements inside which it may nest), and all the elements which may nest within it. Finally, there is a reference to the chapters in which the element is fully discussed and explained.
The Markup Documentation DB contains WWP-specific information about the topics described in P3. It also contains information about general encoding topics outside the scope of P3. That is, the database is considerably more specific than the TEI Guidelines, and contains examples and specific rules about how the WWP actually implements the rules, suggestions, and ideas in P3. We try to comply with P3 whenever possible, but, given its scope (the whole international text encoding community), it does not always provide a way, or the best way, for us to do things.
To assist encoders in using the database, we have broken down subjects into small conceptual pieces so that it is easy to find an example and description of exactly what you are trying to do. Hopefully, this level of granularity will make it possible for you to find information quickly and accurately. When you open the database, you will see a search screen, which gives you instructions on using the database. Please read these instructions carefully.
A term used throughout both this and other WWP documents is Office Text (OT for short). This refers to the texts from which we transcribe. We use this term rather than source text or original text since our Office Texts are usually photocopies of source texts, or photo-reproductions of microfilm. The distinction is important because transcription based on such secondary material can be significantly different from transcription based on an actual source text. This is because both photocopying and the reproduction of microfilm may introduce many uncertainties into the text: blotches, illegible gutters, blurriness, etc. You must read the document A Guide to the Women Writers Project Office Text and Transcription Databases before beginning to encode. (As of November 1998 that document is a little bit out of date, so check with Julia to see if you really need to read it.)
When analyzing your document and when actually encoding it, always think in terms of function and structure, rather than in terms of appearance. Pieces of text may be italicized, aligned to the right, at the foot of the page, bigger than other text, indented, etc. However, such variations in appearance and layout are not by themselves significant features. Rather, they help you think about the text in terms of its structure. Their primary function is in giving the reader clues as to the function of the piece of text in question and to major structural shifts and divisions in the text. Thus, if a single word in a paragraph is in all capital letters or italics, do not first ask yourself "how do I make this thing all capitals"; instead, ask, "why is this capitalized?" You will also record the fact of its capitalization, but you first need to decide for yourself why it is capitalized (is it for rhetorical emphasis, or because it is a name? etc.). Similarly, seeing some small text at the bottom of the page should not make you think "how do I record the fact that this text is at the bottom of the page?" Rather, it should make you think "what is this thing at the bottom of the page?" You will figure out the answer based on what you already know about printing and publishing conventions, and on what you read here and in other documents about the printing and publishing conventions of the past several centuries. Again, you will encode the fact that the text appears at the bottom of the page, but this information is secondary to the main piece of information, which is that the piece of text is (for example, most probably) a note.
Read the first chapter in Philip Gaskell's A New Introduction to Bibliography entitled "The Hand-printed Book" for an understanding of the basic concepts and terminology necessary to encode information about the physical page.
You can learn specific information about these topics (including notes on tag usage and examples) by referring to the tutorial on forme works, accessible from the main Training web tutorial page. The TEI provides the <fw> element (which stands for "forme work") to encode "any of the unchanging portions of a page forme, such as: running heads, running footers, page numbers, catch-words, other material repeated from page to page, which falls outside the stream of the text" (P3, 18.3, 556).
The WWP uses at least one and often both of the reference systems provided within the text as an explicit page reference system for the encoded file. By "explicit page reference system" we mean a method for uniquely identifying, sequencing, and locating individual pages in the text. The two systems provided within the text are page numbers and signatures. Both of these are always encoded literally as they appear on the page. In addition, however, if they are being used as the reference system for the encoded text they are encoded in a regularized form as an attribute value on the appropriate milestone element.
WWP texts span a period of considerable change in printing practices, and as a result certain basic structures vary a great deal from text to text. Page numbering in the early period was often done unsystematically or inaccurately, while signatures (on which printers depend for their construction of the text) tend to be more accurate and reliable. Some of our texts have no page numbers at all, or have page numbers which are so disordered that they cannot provide a useful way of navigating within the text, or a common system of reference for scholarly work. Scholars working in the periods covered by our textbase have different habits and expectations about textual reference systems, depending on the kinds of texts they are accustomed to working with. The WWP's approach to reference systems is based on the need to take into account these scholarly expectations, and to take into account the practices which govern the creation of standard printed editions and existing scholarly references to our texts and to others in the period.
There are numerous rules about the correct encoding of reference systems. See the entry on reference systems (and also the entries on page numbers and on signatures) in the Markup Documentation DB for a comprehensive guide to encoding reference systems.
The central, enabling assumption underlying the design of SGML as a system for text encoding is that all texts consist of ordered hierarchies, wherein all elements are fully nested within other elements, with no overlap. In the following example, the first fragment is fully nested, while the second shows the kind of overlap that SGML does not permit:
<element1>blah blah <element2>stuff stuff</element2></element1>
<element1>blah blah <element2>stuff stuff</element1></element2>
This assumption is built into SGML at the most basic level, and it makes possible the unique functionality of SGML: the ability to specify and regulate a document's structure, and then use that structure for complex activities such as searching, processing, document comparison, and the like. At the same time, it has been apparent at least since SGML began to be used to transcribe existing documents that a given text can--and frequently does--contain multiple hierarchies whose elements overlap one another. An obvious case is that of physical structures like pages and textual structures like paragraphs, which are structured independently of one another and frequently overlap. Such situations may derive from the very nature of the traditional printed book, where a physical object with a certain architecture contains an abstract structure of an entirely different sort: a sequence of words which make up a text. In addition, though, such overlap may occur within the text itself, between such different structures as the grammatical syntax of the text and its poetic form. Although the textual structure of a document has an obvious usefulness for research and navigation, the physical structure is also of immense importance, particularly for researchers who study the relationship between the text and its physical embodiment.
In a traditional printed book, these different structures are unobtrusively accommodated, in a manner which has arisen over the long history of book production. However, the activity of text encoding, by making explicit the relationship which governs a set of individual details, creates linkages which--within the SGML paradigm of nested hierarchy--get in each other's way. These different structures derive from methodological perspective rather than from genre, a point which has several consequences. First, it guarantees that overlapping structures will be evident in any document which is encoded to accommodate more than one methodology. And second, in order to assign precedence among the structures--to decide which ones to encode straightforwardly and which to encode using an alternate system--the textbase designer must decide which methodologies best represent the intended function of the textbase. An electronic text project which was content to limit itself to encoding a single hierarchy would be able to provide only the most restricted version of the text. For a project like the WWP, which is engaged in transcribing rare books for a scholarly as well as a more general audience, multiple hierarchies result unavoidably from the textual and physical features of the document required by this audience. Moreover, some methodological perspectives themselves require attention to multiple hierarchies, so that even limiting one's encoding to a single approach cannot always avoid engaging with problems of overlap.
Accommodating multiple hierarchies is a difficult problem for which a number of solutions have been proposed. The TEI Guidelines summarize the basic options: concurrent markup, boundary marking with empty elements, and fragmentation and joining.
Of these, concurrent markup best preserves the actual structuring of the source, but unfortunately it is not accommodated by most existing SGML parsers. Each of the other two has its limitations, but lends itself to a particular subset of the encoding problems which the WWP most frequently encounters.
<pb>, <lb>, etc.
Using empty elements--elements which do not enclose any content, but simply mark a point in the text--avoids overlapping elements by not marking the element itself. There are two ways of using empty elements for this purpose. One is to use empty elements as substitutes for begin-tags and end-tags. The WWP does not use this method; for a discussion, see the second half of "Some Problems of TEI Markup and Early Printed Books" by Julia Flanders.
The other method of using empty elements is to use them as milestones, which mark the boundary between two regions of text. The use of such a boundary marker relies on the assumption that the elements it defines abut directly, without any interstitial material; where one element ends, the next one begins. This makes milestone elements ideal for encoding structures like pages or typographical lines. Since each set of milestones defines a system of objects which follow one another without interruption, which never nest inside one another, and which together encompass the entire text (so that there is, for instance, nothing which is not on a page), the lack of explicitly enclosed elements does not create significant processing problems.
The WWP uses milestone elements in the way just described to encode features specifically pertaining to the physical structure of the source document. These include primarily pages and typographical lines, about both of which it can safely be assumed that when one line or page has ended, another one will begin immediately. Thus using a simple boundary marker for these features does not create any active difficulties.
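The following sketch shows milestones at work in a paragraph that runs across a page break. The text and the page number are invented, and the use of the n= attribute to carry a regularized page number is an assumption made for illustration; the attributes the WWP actually uses are documented in the Markup Documentation DB.

```sgml
<p>
<!-- each <lb> marks the point at which a new typographical line begins -->
<lb>This paragraph begins at the foot of one page,
<lb>continues for a second typographical line,
<!-- <pb> marks the point at which a new page begins; it is an
     empty element in SGML syntax, so it has no end-tag -->
<pb n="13">
<lb>and ends at the top of the following page.
</p>
```

Because the page is never tagged as an enclosing element, the paragraph does not have to be closed and reopened at the page break, and no overlap arises.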
Using a boundary marker as a universal solution to the problem of overlapping elements, however, is undesirable. In some cases, both overlapping elements may be of a type which is inconvenient to encode using an empty boundary element. For instance, both quotations and paragraphs occur frequently enough and are of enough structural importance that they require tagging as normal elements; to tag them with empty elements throughout a text would generate unnecessary and awkward complexity. For such a case, the second remaining approach suggested by TEI--fragmentation and joining--is more useful. This approach does not use empty elements, but instead divides one of the textual features into smaller segments which nest within the elements of the other textual feature, thus avoiding the overlap. For instance,
<p rend="reg">
  <q ID=Q1 next=Q2>This quotation,</q> she said, <q ID=Q2 prev=Q1>is interrupted; it has been divided into two sections and linked using attributes.</q>
</p>
The WWP uses this fragmentation approach for a variety of purposes where empty boundary elements are inadequate: for instance, to encode interrupted quotations, or quotations which span several verse lines.
For more detailed, WWP-specific tagging examples and discussion pertaining to quotations, please read the latest entries in the Markup Documentation DB; we now solve many <q> and <quote> problems of this sort using the part= attribute with possible values "I", "M", or "F" (initial, medial, or final). Sometimes one has to use a combination of approaches to solve the problem.
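As a hedged sketch of the part= approach, here is a quotation that spans two verse lines. The verse is invented, and only the attribute name and its values ("I", "M", "F") come from the description above; the exact WWP conventions should be checked in the Markup Documentation DB.

```sgml
<lg>
  <!-- the quotation begins partway through the first line -->
  <l>She answered him, <q part="I">I cannot stay</q></l>
  <!-- and finishes at the end of the second line -->
  <l><q part="F">beyond the setting of the sun.</q></l>
</lg>
```

Each <q> nests entirely within a single <l>, so the quotation and the metrical lines no longer overlap. A quotation running over three or more lines would presumably use part="M" for the fragments in between.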
"Rendition" is a term used to describe many things about the appearance, position, etc. of an element. As previously noted, renditional information can act as a visual clue to the reader about the function of an element and the general structure of the text. Renditional information is not in and of itself the primary piece of information about an element. Renditional information includes:
The TEI does not provide an adequate means of recording all this information. The WWP has devised its own scheme for doing so, called "Renditional Ladders". In the TEI scheme the rend= attribute only takes one value. In contrast, the WWP's renditional ladders system allows one to specify many things about the rendition of an element. Instead of just an attribute-value pair (<name rend="italic">), renditional ladders allow an attribute to have many keyword-value pairs (<name rend="slant(italic)weight(bold)">). "Slant" and "weight" are both examples of renditional ladder keywords. A hotsheet containing all the keywords exists in the Red Binder (the centralized printed resource for encoders). Explanations and examples of the use of all the keywords are in the Encoding DB.
As explained below in the section on the TEIHeader, consistent and frequently occurring renditional information can be stored as default information so that the encoder does not have to specify the whole renditional ladder every single time a rendition occurs.
Every TEI-conformant document must have a TEIHeader at the top. This is the section of the document that identifies the document, its source (the OT), and other pertinent bibliographic information. It also contains information about how the document was encoded, the way tags are used, who did the encoding and what changes they made along the way. The requirements for the contents of the TEIHeader take up an entire chapter in P3 (Chapter 5) and are very confusing and voluminous. Therefore, we have sorted through it and have created a WWP-specific TEIHeader template, which will already be at the top of any document you begin to encode. Following the TEIHeader examples given on the training web page and in the Encoding DB, you need only fill in a few fields in this template, change a few others, and delete a few lines as necessary.
The information you will add to the TEIHeader falls into three categories: bibliographic information, tag usage information, and information about yourself and the changes you make to your document. Most of the information you will add is bibliographic in nature. You can obtain this and most other bibliographic information in the OT Database, as described in the document "A Guide to the Women Writers Project Office Text and Transcription Databases". Another addition you will make is a list of definitions of your usage of tags (in the <tagsDecl> section). Filling in this section according to the specifics of your document will greatly reduce the amount of work you have to do once you begin encoding your document. For instance, if all stage directions in a play you are encoding are aligned to the right, italicized, and preceded by a square bracket, you can record this information once in the <tagsDecl>, rather than having to record it each time a stage direction occurs. Thus the <tagsDecl> functions (among other things) as a defaulting mechanism. There is more about this in the Encoding DB. Finally, you must record every major stage of your operation on the document in the change log of the TEIHeader (in the <revisionDesc>). Do not record every single act of opening and closing the document and making changes! Do record every major step: initial transcription, proofreading, corrections entry, etc. This is very important -- if you leave or otherwise are forced to stop midway, or if we need to find out what progress we as a project are making, this is the major way for you and others to keep track of the state of your document.
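To show roughly where these three kinds of information live, here is a bare outline of a header following the general P3 structure. It is a sketch only and not the WWP template: the bracketed text is placeholder content, the prose inside <tagUsage> simply restates the stage direction example above, and the WWP template (already present at the top of your file) may arrange or name things differently.

```sgml
<teiHeader>
  <fileDesc>
    <!-- bibliographic information about the file and its source (the OT) -->
    <titleStmt><title>[title of the text]</title></titleStmt>
    <publicationStmt><p>[publication information]</p></publicationStmt>
    <sourceDesc><p>[description of the Office Text]</p></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <tagsDecl>
      <!-- tag usage information; also serves as a defaulting mechanism -->
      <tagUsage gi="stage">Stage directions are aligned to the right,
        italicized, and preceded by a square bracket.</tagUsage>
    </tagsDecl>
  </encodingDesc>
  <revisionDesc>
    <!-- one entry per major step, not per editing session -->
    <change>
      <date>[date]</date>
      <respStmt><name>[encoder's name]</name><resp>[role]</resp></respStmt>
      <item>Initial transcription completed.</item>
    </change>
  </revisionDesc>
</teiHeader>
```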
Title pages are perhaps the most difficult part of the encoding process because they contain many elements and are quite complex and often idiosyncratic, depending on (among other things) the historical conditions in which the document was printed. Thus, here is a general description which will help untangle the mess.
Title pages generally consist of:
The title can often be broken up into smaller pieces, such as the main title, a sub title, and a further descriptive title, or an alternate title. There will often also be figures, drawings, ornaments, and handwriting on a title page.
The WWP's texts include a wide variety of notes. The most basic distinction is between notes which appear in the original text and notes which have been added by the WWP for one reason or another. Of the first group, there is also a distinction between notes written by the author, and notes added by a contemporary editor. Of the second group, there is a distinction between notes which are for internal use, and notes which record information for the use of the public. By far the most common kind of note is the note written by the author.
Notes are often "anchored" to the body of the text. For example, there might be an asterisk, numeral, or other symbol superscripted before or after a word in a paragraph; at the bottom of the page, that same symbol appears next to the actual content of the footnote. There is a huge variety of symbols that early modern printed works use to anchor notes (in addition to those one is used to seeing in 20th-century documents, such as successive numerals and asterisks). Sometimes even lower-case alphabetic characters (usually surrounded by parentheses, and not superscripted) are used as anchors, often for marginal notes. Marginal notes are quite often unanchored. The reader is expected to be able to figure out to what the marginal note refers by noting its position in the margin relative to the text. Usually it refers to some word, phrase, or idea mentioned on the same line as the top line of the note. Sometimes the thing being referenced in the body of the text will be renditionally distinct from the surrounding text. The anchoring system for endnotes varies, often involving a page and line number reference in the endnotes, and perhaps a repetition of the word or phrase in question (often paraphrased). There is often no anchoring symbol in the body of the text itself.
Notes can be quite complex; a text might contain footnotes to footnotes, marginal notes, endnotes, or footnotes which span the bottoms of several pages. There are a number of ways of encoding notes, and one of the more apparently compelling issues is where to place the <note> element itself: in the flow of the text where the note is anchored; at the position on the page where the note text appears; or in a separate division of the document altogether. Encoding footnotes which span several pages is especially vexing. The difficulties of recording the page breaks within the note and the page breaks in the main flow of text are quite non-trivial; other problems arise as well. Because of this, putting the note on the page where it actually occurs is quite difficult. Therefore the WWP puts all notes (except endnotes, which get encoded right where they are) in a separate division of the document called the <hyperdiv>. This is an element we have invented (it is not in the TEI). There are other things which may go in the <hyperdiv>, such as cast lists for plays which contain no actual printed cast lists.
As explained in Chapter 2 of P3, entity references should be used to capture any special character that is not directly and universally accessible as a regular keyboard character (i.e. that is not part of the basic set of ASCII characters).
For the WWP, this means that inverted letters, foreign characters and diacritics, long s, symbols, and other such characters should be transcribed using entities. Here are pictures of some of these things:
The top picture shows a typical foreign character, in this case an "o" with a circumflex over it (in "boccores"). In the bottom picture, there is first an example of the long s (looking like an "f" to the untrained eye, it is the first letter in "sub-sophists"). This picture also has an example of some typical symbols (here, their function is as footnote anchors). These symbols are the double dagger (after "jest") and the section symbol (after "position").
Ligatures are the exception. We do not treat these as special characters but transcribe them as separate letters, even though printers made ligatures by taking two or more letters, joining them together, and casting them as a single piece of type (a single piece of metal). Here is a picture of some ligatures:
There are three ligatures in this example, in the words "reflects" (an fl ligature and a ct ligature) and "lavish'd" (an sh ligature). These would simply be transcribed as if they were not ligatures but just separate, individual characters. The long s in the sh ligature would still be transcribed as a long s, using an entity reference.
Diphthongs (ae, oe) do not count as ligatures; we do treat these as special characters (and therefore use entity references to transcribe them):
The word "primaeval" here would be transcribed using an entity reference for the "ae" diphthong.
Read the appropriate sections of the books available in the office on this subject by McKerrow, Bowers, and Gaskell (ask Julia which) for more information on the subjects of type, characters, ligatures, etc. (the general topic of typesetting and printing).
These pictures and explanations should give you an idea of the kind of special characters you will find. The purpose here is not to make you worry about how to transcribe the characters in the pictures but to get a feel for the issues. When you begin transcribing, you should ask questions and also refer to the reference materials in the FileMaker DB, in the Red Encoding Binder (the Entities section near the back) and in the Entities section of the Training web page.
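For readers who have not seen one, an entity reference is a name placed between an ampersand and a semicolon. The sketch below uses names from the standard ISO entity sets (aelig, ocirc, Dagger, sect) purely as an assumption for illustration; the names the WWP actually uses, including the one for the long s, are listed in the FileMaker DB and may differ.

```sgml
<!-- the "ae" diphthong in "primaeval" -->
prim&aelig;val

<!-- an "o" with a circumflex over it -->
&ocirc;

<!-- the double dagger and the section symbol used as footnote anchors -->
jest&Dagger;
position&sect;
```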
As just discussed, many characters which appear in WWP texts (e.g. dashes, digraphs, and accented characters) cannot be typed on a standard keyboard and hence must be treated specially, using an entity reference. However, in addition to these there are characters which can be typed using a standard keyboard but which still need special treatment because of their special use in markup. These fall into three categories:
Thorough discussion of this issue can be found in the FileMaker DB, including listings of all characters in each category and the entities to be used. If you have any doubt about whether an entity should be used for a specific character, check the FileMaker DB and ask questions.
Please suggest improvements on any confusing, misleading, or incomplete aspects of this document by sending me email or talking to me.
Carole_Mah@brown.edu