[Cache from http://www.tei-c.org/Sample_Manuals/Leiden.html; please use this canonical URL/source if possible.]
Converting Leiden-style editions to TEI Lite XML
by T. J. Finney, 2001
These recommendations concern the translation into TEIxLite documents of printed editions that employ the Leiden conventions defined in Chronique d'Égypte 13-14 (1932), pages 285-7. They may also be applied where a transcription is made directly from a manuscript. TEIxLite is an Extensible Markup Language (XML) version of the TEI Lite document type definition. TEI Lite (TEI U5) represents a subset of the full Text Encoding Initiative guidelines (TEI P3).
The recommendations should be read in conjunction with the TEI Lite specification. Although TEI Lite is adequate for most features encountered in a printed edition, there are situations where the encoding methods of the full TEI guidelines are better. Following TEI Lite allows the present recommendations to use a widely adopted framework that is relatively well supported. This in turn should maximize the utility of Leiden-style editions that have been translated into TEIxLite documents according to these recommendations. However, the gain is achieved at a cost of bending less appropriate features of TEI Lite to purposes for which entirely appropriate features exist in the full TEI guidelines. This set of recommendations takes a minimalist approach to rendering features likely to be encountered in Leiden-style transcriptions. A more comprehensive approach that used an XML version of the full TEI guidelines would be less vulnerable to charges of 'tag abuse'.
At this point it may be appropriate to give a cursory introduction to XML so that what follows may be better understood. XML provides a way to describe the structure and content of a text. A document type definition (DTD) sets down definitions and structural rules to be followed by conforming documents. Not all XML documents require a DTD. However, those that have one must conform in order to be valid. TEIxLite documents conform to the TEIxLite DTD.
Features of a text are marked and described using markup elements which consist of start and end tags contained in angle brackets. Start tags may include attributes that take on particular values.
E.g. <anElement someAttribute='x' anotherAttribute='y'>marked text</anElement>
An element may be empty, in which case its start and end tags can be contracted.
E.g. <emptyElement someAttribute='x' anotherAttribute='y'/>
XML also uses entity and character references which offer a shorthand method of referring to predefined items by their names and Unicode numbers, respectively. An entity reference stands for a string of characters and may be used to represent a character that cannot be directly entered. Entity references must be defined and included before they can be used. The TEIxLite DTD currently includes four sets of standard entity references defined in the files 'iso-lat1.ent', 'iso-lat2.ent', 'iso-num.ent', and 'iso-pub.ent'. These cover a range of symbols, punctuation, and Latin characters with diacritical marks. An entity reference consists of the entity's name placed between '&' and ';'.
E.g. soft hyphen: &shy;
Character references refer to Unicode characters. Almost every conceivable character is included in Unicode. A particular character's code can be determined from the code charts found at the Unicode web site (http://www.unicode.org/). Decimal character references are placed between '&#' and ';', while hexadecimal (i.e. base sixteen) references are placed between '&#x' and ';'. Codes given in the Unicode charts are hexadecimal.
E.g. soft hyphen: &#173; = &#xAD;
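As a rough illustration of how the decimal and hexadecimal forms relate, the following Python sketch derives both character references from a character's Unicode code point. The function name char_refs is hypothetical. Python's html.unescape is used only as a convenient way to resolve references; it recognises the HTML entity names, which are assumed here to agree with the ISO entity sets for 'shy'.

```python
import html
from typing import Tuple


def char_refs(ch: str) -> Tuple[str, str]:
    """Return (decimal, hexadecimal) character references for one character."""
    code = ord(ch)
    return f"&#{code};", f"&#x{code:X};"


# The soft hyphen is U+00AD in the Unicode charts, i.e. 173 in decimal.
decimal_ref, hex_ref = char_refs("\u00ad")
print(decimal_ref, hex_ref)

# The entity reference and both character references resolve to one character.
resolved = {html.unescape(ref) for ref in ("&shy;", decimal_ref, hex_ref)}
print(len(resolved))
```

The same function can be applied to any character found in the Unicode code charts, for example a combining diacritical mark.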
Returning now to the recommendations, an essential part of a valid TEI document is the header, which serves as the electronic title page. Among other things, the header serves to distinguish between the editor of a printed transcription and the person responsible for its conversion to a TEIxLite document.
The first aim of the person converting a printed edition should be to faithfully reproduce its content. Responsibility for editorial decisions is assumed to belong to the editor of the printed transcription and not the one converting it to XML. Consequently, the 'resp' attribute of the elements used in these recommendations should have a default value of 'ed' or the editor's initials. If it is thought necessary to introduce changes, responsibility for each change needs to be noted using one of the methods provided in TEI Lite (see TEI U5 sections 9 and 20). The <sic> element is suitable for the purpose, with the value of its 'resp' attribute identifying the person responsible. In addition, a log of changes should be kept in the header's revision description.
The recommendations are presented below, either as Leiden-style features paired with methods of rendition, or as example renditions under general headings. Each recommendation may include a description of the Leiden-style feature, an authoritative basis [in square brackets], and commentary. References to TEI documents give the document number then section number. For example, TEI P3 18.2.3 means TEI P3, section 18.2.3, and TEI U5 10 means TEI U5, section 10. Recommendations for which no authoritative basis is shown should be treated with due caution.
a\.b\.g\.d\. <unclear>abgd</unclear>
('\.' represents a dot printed beneath the preceding letter.)
Letters that are really doubtful or so imperfect that, apart from the context, they could be read more than one way.
[TEI P3 18.2.3]
These recommendations assume that doubtfulness or loss of letters is due to manuscript damage. That is, the 'reason' attribute of the relevant elements should have a default value of 'damage'.
.... or +-4 <unclear><gap extent='4'/></unclear>
Illegible letters of which the approximate number is known.
[TEI U5 10]
[....] or [+-4] <gap extent='4'/>
Lost letters of which the approximate number is known.
[TEI P3 18.1.7]
A letter of which any trace remains belongs outside the brackets.
] or [ ] or [ <gap/>
Lost letters of which the approximate number is unknown.
[TEI P3 18.2.4]
[abgd] <add>abgd</add>
The letters are lost, but restored from a parallel or by conjecture.
TEI Lite does not have the <supplied> element for restored text that is contained in the full TEI specification (TEI P3 18.1.7). It is therefore necessary to find an alternative among the available elements. The <gap> element is a logical choice in view of its use for the other categories of lost letters. However, it is an empty element and cannot contain the restored text. This leaves the <add> element, which is suitable for 'letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector' (TEI U5 10). In this case, the annotator is the editor, and the annotation is the supplement.
a(bgd) a<abbr expan='bgd'/>
Parentheses indicate resolution of an abbreviation or symbol.
[TEI P3 6.4.5]
The value of the 'expan' attribute is intended to replace whatever is contained within the <abbr> tags. In the case of an empty element, the expansion replaces a blank.
A printed symbol may be represented by the corresponding Unicode character reference, if it exists.
If there is no appropriate character reference, the symbol should be represented by an empty <abbr> element with the 'type' attribute set to 'symbol'.
<abgd> <sic corr='abgd'/>
Letters the editor regards as mistakenly omitted by the scribe.
[TEI P3 6.5.1]
{abgd} <sic corr=''>abgd</sic>
Letters the editor regards as mistakenly included by the scribe.
[TEI P3 6.5.1]
[[abgd]] <del hand='hx'>abgd</del>
Letters deleted in the manuscript.
[TEI P3 18.1.4]
The scribe responsible for an alteration is specified using the 'hand' attribute, with recommended values of 'h1' for the first hand (the scribe), 'h2' for the second, and so on. A value of 'hx' may be used if the hand cannot be identified.
E.g. a deletion by the second hand: <del hand='h2'>abgd</del>.
The mode of deletion may be specified using the 'type' attribute. Refer to TEI P3 18.1.4 for suggested values of this attribute.
Any values of the hand attribute that are used must be declared in the header's profile description. A separate <ident> element is required for each hand, and the entire set is enclosed in a <creation> element.
<creation>
<ident id='h1'>First hand.</ident>
<ident id='h2'>Second hand.</ident>
<ident id='hx'>Unidentified hand.</ident>
</creation>
This approach is necessary due to the exclusion from TEI Lite of the <hand> and <handList> elements that are featured in the full TEI guidelines.
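Because nothing in the TEIxLite DTD is described here as enforcing the link between a 'hand' value and its <ident> declaration, a small check can help. The Python sketch below lists 'hand' values that are used but not declared. The function name undeclared_hands and the <sample> wrapper element are hypothetical and exist only for this illustration; a real document would be a complete TEIxLite file.

```python
import xml.etree.ElementTree as ET

# A much reduced, hypothetical fragment: <sample> is only a wrapper for this sketch.
SAMPLE = """
<sample>
  <creation>
    <ident id='h1'>First hand.</ident>
    <ident id='h2'>Second hand.</ident>
    <ident id='hx'>Unidentified hand.</ident>
  </creation>
  <q><del hand='h2'>dgba</del> <add hand='h3'>abgd</add></q>
</sample>
"""


def undeclared_hands(root: ET.Element) -> list:
    """Return 'hand' values that have no matching <ident id='...'> declaration."""
    declared = {el.get("id") for el in root.iter("ident")}
    used = {el.get("hand") for el in root.iter() if el.get("hand")}
    return sorted(used - declared)


root = ET.fromstring(SAMPLE)
print(undeclared_hands(root))
```

In this fragment the third hand is used in an <add> element but never declared, so it is the only value reported.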
'abgd' <add hand='hx'>abgd</add>
Interlinear additions which are difficult to print above the lines of the transcription.
[TEI P3 18.1.4]
This method of rendition can also be used for scribal additions that are not interlinear. It is important to always include the 'hand' attribute to distinguish a scribal addition from an editorial supplement. The location of an addition may be specified using the 'place' attribute, with TEI P3 18.1.4 providing a list of suggested values.
E.g. a matching deletion and addition: <del type='subpunction' hand='h2'>dgba</del> <add place='supralinear' hand='h2'>abgd</add>.
abgd <hi rend='overline'>abgd</hi>
Lines drawn above letters to indicate 'nomina sacra' or numerals.
The function of such a line is to highlight the associated text. In contrast to a scribal symbol or compendium, it does not stand for some other text. It is therefore not appropriate to encode the line using character references.
Letters that represent numerals are encoded as such.
E.g. Greek numeral 'alpha': <num value='1'><hi rend='overline'>a</hi></num>
Note: There are character references for the upper (&#x0374;) and lower (&#x0375;) Greek signs that indicate numerical use of letters.
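Several of the bracket-based pairings above are regular enough to be applied mechanically. The following Python sketch shows one way this might be done for plain-text input; it is not a complete converter. The name leiden_to_tei and the rule table are hypothetical. The sketch assumes that brackets are balanced and not nested (apart from the double brackets of a deletion), writes '+-' as it is printed above, and omits the <abgd> and {abgd} forms, unpaired brackets, underdots, and overlines, all of which need human judgement or richer input.

```python
import re

# (pattern, replacement) pairs, applied in order. Double brackets come first
# so that [[abgd]] is not mistaken for a bracketed supplement.
RULES = [
    (re.compile(r"\[\[([^\[\]]+)\]\]"), r"<del hand='hx'>\1</del>"),
    # [....]: one dot per lost letter.
    (re.compile(r"\[(\.+)\]"), lambda m: f"<gap extent='{len(m.group(1))}'/>"),
    # [+-4]: approximate number of lost letters given as a figure.
    (re.compile(r"\[\+-(\d+)\]"), r"<gap extent='\1'/>"),
    # [ ]: lost letters of unknown number.
    (re.compile(r"\[\s*\]"), "<gap/>"),
    # [abgd]: lost letters restored by the editor.
    (re.compile(r"\[([^\[\]]+)\]"), r"<add>\1</add>"),
    # a(bgd): resolution of an abbreviation.
    (re.compile(r"\(([^()]+)\)"), r"<abbr expan='\1'/>"),
]


def leiden_to_tei(text: str) -> str:
    """Apply each rule in turn to a line of Leiden-style transcription."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(leiden_to_tei("a(bgd) [[dgba]] [....] [abgd]"))
```

The output of such a pass still has to be checked against the printed edition, since the 'hand', 'resp', and 'cert' attributes depend on what the editor actually states.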
Diacritical marks and punctuation
Any diacritical mark included in a Leiden-style transcription is encoded using the corresponding Unicode character reference as determined from the relevant code chart. (See, for example, the 'Combining Diacritical Marks' chart.) The character reference follows the modified letter.
E.g. a rough breathing: o&#x0314;n
A punctuation mark that cannot be directly entered at the keyboard is also encoded using the corresponding Unicode character.
E.g. Greek medial point: ·
Whereas editors often insert spaces between words transcribed from 'scriptio continua' manuscripts, they do not normally provide hyphens to mark words divided at line ends. The one translating a Leiden-style transcription to TEIxLite must therefore indicate whether or not the words are divided.
The soft hyphen entity or character reference (&shy; = &#173; = &#xAD;) should be used to indicate word division at the end of a line. The space character is sufficient for word division within lines.
<lb n='1'/> doulous autou pros
<lb n='2'/>tous gewrgous labein tous kar&shy;
<lb n='3'/>pous autou kai labontes oi gewr&shy;
<lb n='4'/>goi tous doulous autou o&#x0314;n men
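One benefit of marking divided words in this way is that the running text can be recovered mechanically. The Python sketch below illustrates the idea on lines taken from the example above: a soft hyphen before a line break joins the two halves of a word, while any other line break separates whole words. The function name running_text is hypothetical, and the sketch assumes that the only markup present is the <lb/> element.

```python
import re

SOFT_HYPHEN = "\u00ad"


def running_text(transcription: str) -> str:
    """Remove line breaks, rejoining words divided by a soft hyphen."""
    text = transcription.replace("&shy;", SOFT_HYPHEN)
    # A soft hyphen at a line end means the word continues on the next line.
    text = re.sub(SOFT_HYPHEN + r"\s*<lb[^>]*/>\s*", "", text)
    # Any remaining line break separates whole words.
    text = re.sub(r"\s*<lb[^>]*/>\s*", " ", text)
    return text.strip()


sample = """<lb n='2'/>tous gewrgous labein tous kar&shy;
<lb n='3'/>pous autou kai labontes oi gewr&shy;
<lb n='4'/>goi tous doulous autou"""

print(running_text(sample))
```

If the soft hyphens were omitted, 'kar' and 'pous' would be indistinguishable from two separate words, which is why the division must be recorded during conversion.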
Page, column, and line divisions
TEI Lite has specific elements for page and line breaks but omits the column break element found in the full TEI specification. The more general <milestone> element may be used if it is necessary to indicate a column break. TEI Lite advises against mixing page and line break elements with milestone elements in this manner (TEI U5 5). However, the alternative of using milestone elements for each kind of division is quite costly in terms of the extra keystrokes required: compare <lb n='1'/> with <milestone n='1' unit='line'/>.
These are empty elements that precede the features they mark. The 'n' attribute gives the number of the page, column, or line that begins at the marked point.
Use of the milestone element to mark column breaks should be mentioned in the header's editorial declaration.
E.g. <editorialDecl><p>A milestone element with the 'unit' attribute set to 'column' represents a column break.</p></editorialDecl>
Recto, verso, and papyrus direction
Recto and verso sides of a codex leaf may be indicated by appending 'r' or 'v', respectively, to the folio number given in the 'n' attribute of the 'pb' element.
E.g. folio 7 recto: <pb n='7r'/>
Printed editions use arrows to indicate papyrus direction. The corresponding character reference can be used for an arrow that appears in the editorial discussion or notes. Where the arrow is part of the manuscript transcription, it should be encoded by assigning a value of 'horizontal' or 'vertical' to the 'rend' attribute of the 'pb' element. To use a character reference in this context is wrong because it implies that the arrow is part of the manuscript's text.
E.g. fibers horizontal: <pb rend='horizontal'/>
Canonical reference points are marked with empty milestone elements. The 'n' attribute identifies the standard division that begins at the marked point while the 'unit' attribute specifies the kind of division.
<milestone n='Matthew' unit='book'/>
<milestone n='21' unit='chapter'/>
<milestone n='34' unit='verse'/>
<milestone n='45' unit='verse'/>
Such a use of milestone elements should be mentioned in the header's editorial declaration.
E.g. <editorialDecl><p>The 'unit' attribute of milestone elements is used to indicate biblical book, chapter, and verse divisions.</p></editorialDecl>
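To show how software might use such reference points, the following Python sketch walks through the milestone elements in document order and builds a canonical reference for each verse. The function name verse_references and the 'Book chapter:verse' output format are assumptions made for this illustration, and the <q> wrapper simply stands in for the enclosing transcription.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<q>
<milestone n='Matthew' unit='book'/>
<milestone n='21' unit='chapter'/>
<milestone n='34' unit='verse'/>
<lb n='1'/> doulous autou pros
<milestone n='45' unit='verse'/>
</q>
"""


def verse_references(root: ET.Element) -> list:
    """Return a reference string for every verse milestone, in document order."""
    current = {}  # most recent value seen for each kind of division
    refs = []
    for ms in root.iter("milestone"):
        current[ms.get("unit")] = ms.get("n")
        if ms.get("unit") == "verse":
            refs.append(f"{current.get('book')} {current.get('chapter')}:{current['verse']}")
    return refs


print(verse_references(ET.fromstring(SAMPLE)))
```

Because each milestone only marks where a division begins, the book and chapter in force at any verse are simply the most recent ones encountered.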
Overall document structure
Leiden-style editions usually contain an editor's discussion, the manuscript's transcription, associated notes, and a bibliography. This document structure is emulated using <div0> elements with corresponding attribute values. The manuscript transcription is enclosed in a quotation element, thus indicating that it derives from an independent source.
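<div0 type='discussion'>
<p>Discussion starts here.</p>
</div0>
<div0 type='transcription'>
<q><lb n='1'/> doulous autou pros</q>
</div0>
<div0 type='notes'>
<note>First note goes here.</note>
</div0>
<div0 type='bibliography'>
<listBibl><bibl>First bibliographic citation goes here.</bibl></listBibl>
</div0>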
TEI Lite recommends that, if possible, the body of a note should be inserted in the encoded text at its point of reference (TEI U5 7). As mentioned above, the present recommendations group editorial notes in a separate division in order to emulate the usual structure of a Leiden-style edition. As a consequence, each editorial note needs to be linked to the relevant text.
This is achieved using a <ptr> or <ref> element, depending on whether a single point or a span of text is referenced. The 'target' attribute of the <ptr> or <ref> element is set equal to the 'id' attribute of the relevant note. Alternatively, <anchor> elements may be inserted at the beginning and end of the span. The 'target' and 'targetEnd' attributes of the <note> element are then set equal to the 'id' values of the respective <anchor> elements. The one converting the Leiden-style edition to XML must supply unique values for the matching 'target' and 'id' attributes. [TEI U5 8.1]
(1) a <ptr> element marks the reference point:
<lb n='3'/>pous autou kai labontes<ptr target='n1'/> oi gewr&shy;
<note id='n1'>so most MSS; labontes de 1555 and the Sahidic.</note>
(2) a <ref> element encloses the annotated section:
<lb n='3'/>pous autou <ref target='n1'>kai labontes</ref> oi gewr&shy;
<note id='n1'> </note>
(3) <anchor> elements mark the start and end of the annotated section:
<lb n='3'/>pous autou <anchor id='n1a'/>kai labontes<anchor id='n1b'/> oi gewr&shy;
<note target='n1a' targetEnd='n1b'> </note>
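Since the matching 'target' and 'id' values are supplied by hand, it is easy to leave a link pointing nowhere. The Python sketch below reports every 'target' or 'targetEnd' value that has no corresponding 'id' in the same document. The function name dangling_targets and the <sample> wrapper are hypothetical; the check is a convenience and does not replace validation against the DTD.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<sample>
  <q>pous autou kai labontes<ptr target='n1'/> oi <ref target='n2'>gewrgoi</ref></q>
  <note id='n1'>so most MSS</note>
</sample>
"""


def dangling_targets(root: ET.Element) -> list:
    """Return (element, attribute, value) for each link with no matching 'id'."""
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    missing = []
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            value = el.get(attr)
            if value and value not in ids:
                missing.append((el.tag, attr, value))
    return missing


print(dangling_targets(ET.fromstring(SAMPLE)))
```

Here the <ptr> element resolves to the note with id 'n1', while the <ref> element points to 'n2', for which no note exists.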
A manuscript may include a commentary on the primary text. It may also have scribal annotation besides alterations to the primary text. In contrast to editorial notes, the encoded versions of these kinds of annotation are included at the places to which they refer. They are enclosed in <note> elements whose 'type' attribute is set to 'scribal' or 'commentary' as the case requires. Responsibility for the enclosed text is indicated with the 'resp' attribute. A scribal note is attributed to the relevant scribe using 'h1' for the first hand, 'h2' for the second, and so on. For commentary, the 'resp' attribute identifies the original author or is set to 'unknown' when authorship has not been established. The location of notes and commentary may be recorded using the 'place' attribute, suggested values of which can be found at TEI P3 6.8.1.
(1) a scribal note placed in the left margin by the third hand:
<note type='scribal' resp='h3' place='left'>amaqestate kai kake. afes ton palaion. mh metapoiei.</note>
(2) commentary by a known author:
<note type='commentary' resp='Ephraem of Syria'> </note>
(3) commentary by an unknown author:
<note type='commentary' resp='unknown'> </note>
Bibliographic citations are placed within <bibl> elements which may include further elements such as <author>, <title>, <editor>, <pubPlace>, <publisher>, <date>, and <biblScope>. The <listBibl> element encloses a list of citations.
The <title> element's 'level' attribute takes allowable values of 'm' for monographic (i.e. pertaining to a work published as a distinct item), 's' for series, 'j' for journal, 'u' for unpublished, and 'a' for analytic (i.e. pertaining to articles, poems, etc., published as part of a larger item). There is also a 'type' attribute for classifying the title as 'main', 'subordinate', 'parallel', 'abbreviated', and so on.
The <author> element encloses the statement of primary intellectual responsibility for a work, while the <editor> element contains a secondary statement of responsibility. The latter element's 'role' attribute has a default value of 'editor' but may take any appropriate value including 'translator', 'compiler', or 'illustrator'.
Place of publication, publisher, and date are marked with the corresponding elements shown above. The <biblScope> element contains page numbers, section numbers, etc., that define which parts of the work are referenced. An optional 'type' attribute may be used to specify the kind of reference. Appropriate values include 'pages', 'chapter', 'volume', 'part', and 'issue'. [TEI U5 13, TEI P3 6.10.2]
References in the text are linked to their counterparts in the bibliography using <ptr> or <ref> elements in the same manner as described above for notes.
E.g. a reference in the text that points to an item in the bibliography:
<p><ref target='JDT1997'>Thomas (1997, 8)</ref> regards the hands of P104 and P90 as similar </p>
<bibl id='JDT1997'>
<author>J. David Thomas</author>
<title level='a'>4404. Matthew XXI 34-37; 43 and 45 (?)</title>
<title level='s'>The Oxyrhynchus Papyri</title>
<editor>E. W. Handley, U. Wartenberg, et al.</editor>
<publisher>Egypt Exploration Society</publisher>
</bibl>
Language and transliteration scheme
Languages encountered in the document are declared in the header's profile description. Non-Latin letters are encoded with Unicode character references or a standard transliteration scheme. Any transliteration scheme that is used must be identified in the profile description. Transliteration should only be applied to letters. Character references are used for diacritical marks etc.
<language id='eng'>Text besides the transcription is in English.</language>
<language id='grc'>The transcription is in Greek. Letters are transliterated according to TLG Beta Code.</language>
Any foreign word or phrase outside the transcription is marked with a <foreign> element whose 'lang' attribute is set to the relevant language code. The language of the transcription is specified using the enclosing quotation element's 'lang' attribute.
(1) a foreign word outside the transcription:
<note>The papyrus reads <foreign lang='grc'>palin</foreign> </note>
(2) the transcription:
<q lang='grc'><lb n='1'/> doulous autou pros </q>
Where the editor has supplied lost text, resolved an abbreviation or symbol, added text regarded as mistakenly omitted by the scribe, or deleted text regarded as spurious, the 'cert' attribute may be used to indicate the editor's level of confidence in that decision. Values recommended here are 'high', 'med', and 'low'. As a guide, 'high' indicates C > 75% (beyond reasonable doubt), 'med' indicates 25% < C < 75% (doubtful), and 'low' indicates C < 25% (very doubtful), where 'C' is the confidence level. There should be no presumption of the editor's level of confidence unless it is stated or clearly implied.
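The guide values above can be expressed as a simple mapping from a percentage to a 'cert' value. The Python sketch below is one possible reading. The function name cert_value is hypothetical, and because the text does not say how exactly 25% and 75% should be treated, the sketch assumes that both boundary values count as 'med'.

```python
def cert_value(confidence_percent: float) -> str:
    """Map a confidence level C (in percent) to a value for the 'cert' attribute."""
    if not 0 <= confidence_percent <= 100:
        raise ValueError("confidence must lie between 0 and 100")
    if confidence_percent > 75:
        return "high"  # beyond reasonable doubt
    if confidence_percent < 25:
        return "low"   # very doubtful
    return "med"       # doubtful; exactly 25 and 75 are assumed to fall here


for c in (90, 50, 10):
    print(c, cert_value(c))
```

Such a mapping is only useful where the editor actually states a level of confidence; as noted above, none should be presumed.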
3. Validating the resultant document
The converted Leiden-style edition should be validated using an XML parser. The following outline shows how this might be done:
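Ensure that the DTD file and the entity files are saved under the names by which they are referenced: 'teixlite.dtd' in the document type declaration, and the entity file names given in the DTD itself. The files themselves can be obtained from the following locations: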
TEIxLite DTD file: http://www.tei-c.org/Lite/DTD/teixlite.dtd
Standard entity reference files: http://www.tei-c.org/XML_Entities
A number of XML parsers are available free of charge. Two are mentioned here in case they are helpful.
Microsoft's XML Notepad validates a document that has a DTD when the document is opened from within the program.
Xerces is a Java-based parser provided by the Apache Software Foundation. Both the parser and Sun Microsystems' Java platform must be installed in order to run this program. The parser can be invoked using the following command:
java dom.DOMCount -v <filename>
where <filename> is replaced with the file name of the TEIxLite document to be validated.
Each parser will respond with error messages if the document being validated contains XML errors or does not conform to the DTD.
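Before running a validating parser, it can be useful to confirm that a fragment is at least well-formed. The Python sketch below does this with the standard library's ElementTree module. It is only a preliminary check and is not a substitute for the validating parsers mentioned above: it does not read the DTD, so it cannot test conformance, and it will report named entity references such as &shy; as undefined. The function name well_formedness_error is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(xml_text: str) -> Optional[str]:
    """Return None if the text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# The second fragment leaves the <lb> element unclosed.
print(well_formedness_error("<p><lb n='1'/>abgd</p>"))
print(well_formedness_error("<p><lb n='1'>abgd</p>"))
```

An empty element written without its closing slash is a common slip when converting by hand, and it is caught at this stage.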
4. TEIxLite template
The following is a TEIxLite template for converting Leiden-style editions to XML.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Template for converting a Leiden-style edition to TEIxLite.
T. J. Finney, 2001.
Fill in the places marked ... -->
<!DOCTYPE TEI.2 SYSTEM 'teixlite.dtd'>
<!-- Full bibliographic description of electronic file -->
<!-- Enter title of source text here -->
<title>A machine readable version of ...</title>
<!-- Editor of source text -->
<!-- One responsible for making electronic file -->
<!-- Information on publication of electronic file -->
<!-- One responsible for making electronic file available -->
<!-- Availability. Status may be free, unknown, or restricted. -->
<!-- Description of source text -->
<!-- Relationship between electronic file and source -->
<!-- Extent of source included -->
<!-- Editorial practices applied during encoding -->
<p>Encoded according to:
<bibl><title>Converting Leiden-style editions to TEI Lite XML</title>
<author>T. J. Finney</author><date>2001</date></bibl></p>
<!-- Descriptions of milestone usage (optional).
Delete these p tags and contents if not used. -->
<!-- Declaration of classification systems (optional).
Delete these classDecl tags and contents if not used. -->
<!-- Insert taxonomy codes and bibliographic references as required -->
<!-- Description of non-bibliographic aspects of electronic file -->
<!-- Information about creation of the text -->
<!-- Add more hands if required -->
<ident id='h1'>First hand.</ident>
<ident id='h2'>Second hand.</ident>
<ident id='h3'>Third hand.</ident>
<ident id='h4'>Fourth hand.</ident>
<ident id='hx'>Unidentified hand.</ident>
<!-- Insert appropriate language codes and descriptions here.
eng = English
fra = French
deu = German
grc = Greek (ancient)
ell = Greek (modern)
lat = Latin
See ISO 639-2/T for other three letter language codes.
Delete the transliteration sentence if not applicable. -->
<language id='...'>Text besides the transcription is in ...</language>
<language id='...'>The transcription is in ...
Letters are transliterated according to ...</language>
<!-- Classification of electronic file's text (optional).
Delete these textClass tags and contents if not used. -->
<!-- Insert taxonomy codes and text classifications as required.
Scheme must match one of the taxonomy codes given above. -->
<!-- Revision history of electronic file. Insert changes as required. -->
<!-- Editor's discussion.
Sample may be initial, medial, final, unknown, or complete. -->
<div0 type='discussion' sample='...'>
<!-- Manuscript transcription. Sample values as given above. -->
<div0 type='transcription' sample='...'>
<!-- Insert appropriate language code here -->
<!-- Transcription notes. Sample values as given above. -->
<div0 type='notes' sample='...'>
<!-- Bibliography. Sample values as given above. -->
<div0 type='bibliography' sample='...'>
Elliott, Tom, Hugh Cayless, and Helen Hawkins. tei.epidoc: structured markup of Greek and Latin epigraphic texts: a proposed "best-practice" guide. Version 0.2, 2001. Online: http://www.stoa.org/markup/epidoc02.html
"Essai d'unification des méthodes employées dans les éditions de papyrus." Chronique d'Égypte 13-14 (1932): 285-87.
Sperberg-McQueen, C. M. and L. Burnard. TEI Lite: An Introduction to Text Encoding for Interchange. TEI U5. Text Encoding Initiative, 1995. Online: http://www.tei-c.org/TEI/Lite/.
Sperberg-McQueen, C. M. and L. Burnard, eds. Guidelines for Electronic Text Encoding and Interchange. TEI P3, rev. ed. Oxford: Text Encoding Initiative, 1999. Online: http://www.tei-c.org/Guidelines/index.htm.
Thomas, J. David. "4404. Matthew XXI 34-37; 43 and 45 (?)". Pages 7-9 in The Oxyrhynchus Papyri. Vol. 64. Edited by E. W. Handley, U. Wartenberg, R. A. Coles, N. Gonis, M. W. Haslam, and J. D. Thomas. London: Egypt Exploration Society, 1997.