For the Christian Classics Ethereal Library
Version 0.8, Monday, July 20, 1998
Harry Plantinga
This paper describes the markup that is used to prepare texts for the Christian Classics Ethereal Library (CCEL) in Microsoft Word. The markup consists of certain paragraph and character styles and XML tags used for specified purposes. A template contains the styles that are used as well as macros and a header that may be filled out for the bibliographic information of the head section. A program will then convert from ThML-formatted Word documents to XML.
The Theological Markup Language (ThML) is an XML-based markup language with support for information often used in theological study, such as scripture references and commentary, synchronization of multiple related texts, and indexing systems such as Strong's numbers. Another design goal was that the language represent all of the information about a text needed for use in a rich digital library and the information represented in other common formats used for theological etexts. More information on the design of the language, as well as a full definition, is available in the paper ThML: Theological Markup Language for the Christian Classics Ethereal Library. This document describes the guidelines for formatting ThML documents in Microsoft Word, using the ThML template, which contains styles, macros, a toolbar, a menu, and a header template.
Preparing Etexts for the CCEL in Microsoft Word
To prepare a text for the CCEL with Microsoft Word, the first step is to get the ThML Template, ThML08.doc, and put it in the Templates folder, inside the Microsoft Office folder. The template can be downloaded from the ThML web page, http://ccel.wheaton.edu/ThML. Once it is installed, you can create a new ThML document by choosing New from the File menu and selecting the ThML template. You can also attach the template to an existing document with the Templates and Add-Ins item on the Tools menu. In either case, the resulting document will have ThML tools available. The document is then typed or scanned, if necessary, and formatted with appropriate styles and markup codes as described below. Footnotes may be entered as ordinary footnotes in Word, using the Insert | Footnote… menu item or the Insert-Footnote-Now shortcut, Alt+Ctrl+F.
If Microsoft Word is not available, the document may be entered in any word processor, and the special formatting in Word can be left for someone else. It is still helpful to format for the CCEL as much as possible, though. This would include using appropriate font sizes and styles, paragraph indentation, Greek/Hebrew fonts if needed, and inserting footnotes in the method supported by the word processor. There should be carriage returns only at the end of paragraphs, not lines, and blank lines should not be added between paragraphs except where there is extra space in the text. Some of the XML codes described below could also be entered during data entry, if desired, especially codes for page breaks and notes.
Paragraph and Character Styles
Much of the formatting in Word is done by applying character and paragraph styles to the document. Paragraph style sheets are named groupings of styles for paragraphs, such as single-space, indent first line, Times New Roman 11-point, etc. A paragraph style can be applied to a paragraph by selecting it from the left-most dropbox on the formatting toolbar. The ThML template provides several paragraph styles that should be used for formatting documents—styles such as Body Text, Body Text First Indent, Heading 1, Verse, BlockQuote, and others.
Character styles are similar to paragraph styles, except that they only contain character formatting and they may occur within a paragraph style. The ThML styles and their shortcut keys are listed below; character styles are marked "(character)".
Style Name | Shortcut Keys | Description
Body Text | ctrl-alt-b | Text with no first-line indent
Body Text First Indent | ctrl-alt-i | Text with first line indent
Comment (character) | ctrl-alt-c | Comment -- ignored
Default (character) | ctrl-alt-d | Default paragraph font
Heading 1 | ctrl-alt-1 | Level-1 heading
Heading 2 | ctrl-alt-2 | Level-2 heading
Heading 3 | ctrl-alt-3 | Level-3 heading
Heading 4 | ctrl-alt-4 | Level-4 heading
HTML (character) | ctrl-alt-h | HTML (or XML) markup
Name (character) | ctrl-alt-n | A person's name
Verse | ctrl-alt-v | Poetry, verse, etc.
XML | ctrl-alt-x | XML (or HTML) markup
When markup requires attributes (e.g. lang="el"), paragraph styles are not sufficient, and XML or HTML tags are used. The markup may consist of opening and closing tags with attributes, surrounding some text, as for example <foreign lang="el">logos</foreign>. The opening and closing tags and the contained text are called an "element." The markup may also consist of an empty element, that is, an element that doesn't contain any text, such as <pb/ n="37">. In that case, there is a trailing slash after the element name and no closing tag. These elements are represented in a Word document as red, hidden, Courier New text. (In fact, any text that is red will be interpreted as markup.) This style may be applied to text by using the XML paragraph style or the HTML character style. These styles are identical, and used identically, except that one is a paragraph style and one is a character style. Note also that elements must be properly nested: <b><I>x</I></b> is legal but <b><I>x</b></I> is not.
ThML documents have a head section, with information about the document, and a body section, containing the document itself. When a new ThML document is created, a template for the head section appears. As much of the template as possible can be filled in. If possible, the MARC record should be retrieved from the Library of Congress gateway (http://lcweb.loc.gov/z3950/gateway.html), in machine-readable and formatted form, and inserted into the header at the appropriate spot. The information in the MARC record can then be pasted into other sections of the header.
The body of the document, placed between the <body> and </body> tags of the template, should contain everything in the print edition of the book. It should be made to look as similar to the book as possible using the ThML template. In fact, if desired, the ThML styles may be modified to make the document look more like the book, though style names shouldn't be changed and styles other than those in the template should not be used.
Headings and Divisions
Headings for the preface, table of contents, and index, chapter titles, section heads, and the like should all be formatted using the styles Heading 1, Heading 2, Heading 3, or Heading 4. These styles can also be applied with ctrl-alt-1, etc. and viewed or modified in the outline view of a document.
XML divn tags are used to mark the structural sections of a document. These may often match the Heading paragraphs, but they offer more control. They are used for building a table of contents and also to specify points at which an electronic text may be split into pages or chunks for easy viewing. For example, the header for this section might be marked up in this way:
<div4 type="Section" title="Headings and Divisions" n="3">
Headings and Divisions
…</div4>
<div1> is used for top-level divisions, such as the preface, table of contents, and chapter titles. <div2> is used for lower-level divisions, and so on. The type and n attributes are optional; if they are present, they will be used in the Table of Contents and possibly other locations such as page heads.
The <insertContents/ level="2"> tag may be used to insert a table of contents at a particular location in the text. With level="2", all of the <div1> and <div2> entries would be gathered and listed in a hierarchical list. Each entry would be linked to the appropriate section. Note that to replace the existing table of contents with the new one, the <added> and <deleted> elements may be used, as in this example:
<added>
<insertContents/ level="2">
</added>
<deleted>
[original table of contents here]
</deleted>
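The gathering step behind <insertContents/> can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CCEL conversion software: the function names collect_contents and render_contents are hypothetical, the tags are found with regular expressions because the markup shown here is not guaranteed to parse as strict XML, and the "(untitled)" fallback for a missing title attribute is an assumption.

```python
import re

# Matches opening divN tags such as <div1 type="Chapter" title="Preface" n="1">.
DIV_OPEN = re.compile(r'<div([1-9])\b([^>]*)>')
# Matches quoted attributes such as title="Preface".
ATTR = re.compile(r'(\w+)="([^"]*)"')


def collect_contents(text, level):
    """Return (depth, title) pairs for every divN with N <= level, in document order."""
    entries = []
    for match in DIV_OPEN.finditer(text):
        depth = int(match.group(1))
        if depth > level:
            continue  # deeper divisions are left out of the table of contents
        attrs = dict(ATTR.findall(match.group(2)))
        # Assumption: fall back to a placeholder when no title attribute is given.
        entries.append((depth, attrs.get("title", "(untitled)")))
    return entries


def render_contents(entries):
    """Render the entries as an indented plain-text outline."""
    return "\n".join("  " * (depth - 1) + title for depth, title in entries)


sample = (
    '<div1 title="Preface">...</div1>'
    '<div1 title="Chapter 1">'
    '<div2 title="Headings and Divisions">'
    '<div3 title="Details">...</div3>'
    '</div2>'
    '</div1>'
)
print(render_contents(collect_contents(sample, level=2)))
```

With level=2 the <div3> entry is skipped, so the outline lists the two <div1> titles with the <div2> title indented beneath the second.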
It is often useful to know the page breaks from the print edition of a book. They may be used as targets for subject index entries that identify the page of the entry or to display a text with the pagination of the print edition. Page breaks are marked by the insertion of <pb/> tags, with the n attribute giving the page number of the upcoming page (<pb/ n="37"> or <pb/ n="xii">). These elements should appear at the start of the identified page.
Many electronic texts will also have images of pages available online. The pb element will also take an href attribute specifying a URI for an image of the page (e.g. <pb/ n="37" href="gif/0021a.gif">). So that it is not necessary to add the attribute for every pb element, intelligent incrementing will be used to derive URLs for later page breaks.
Paragraphs
Normal paragraphs of text may be formatted with the Body Text First Indent style. This is a single-spaced paragraph with indented first line. The Body Text style is similar, except that the first line is not indented. It is used for the first paragraph of a chapter or the continuation of a paragraph after a figure, for example. Body Text 2, a double-spaced version of Body Text, is also available.
The BlockQuote paragraph style should be used for extended quotations. A BlockQuote paragraph is normally indented on both sides. There is also some extra space before and after a BlockQuote paragraph.
Footnotes may be entered as normal footnotes in Word, and they will be converted to XML notation in the Word to XML conversion process. However, it may at times be preferable to enter notes using the XML notation directly, in order to take advantage of the greater flexibility offered, or because the word processor in use doesn't support footnotes.
The XML notation for notes uses the <note> element, following the syntax used by TEI Lite [e.g. <note place="foot" resp="whp" anchored>See http://www_tei.uic.edu/orgs/tei/lite </note>]. The place attribute specifies where the note appears in the text (e.g. end, foot, inline, interlinear, or margin). The resp attribute identifies the person responsible for the note -- for example, the author, editor, or a person's initials. The anchored attribute specifies that the note is anchored at an exact location; margin notes typically are not anchored.
Plain, numbered, or bulleted lists with several levels of indent may be represented with Word styles List, List 2, List 3, List 4 for the plain version; List Bullet, List Bullet 2, …; List Number, List Number 2, etc. There are also styles called List Continue, List Continue 2, …, used for additional paragraphs of a list entry.
The plain list is converted to the HTML elements <UL> and <BR>, the bullet list to <UL> and <LI>, and the numbered list to <OL> and <LI>.
This is a plain list with
continued entry.
continued entry.
Terms, Definitions, and Glossaries
Some documents contain a glossary. It should be surrounded by <glossary> tags, and individual terms and definitions should use paragraph styles called Term and Definition. These are converted to the HTML elements <dl>, <dt>, and <dd>.
<glossary>
Agape
Greek for the unconditional love which God extends to his people.
Apotheosis
An ancient theological word used to describe the process by which a Christian becomes more like God.
</glossary>
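The conversion of Term and Definition paragraphs to <dl>, <dt>, and <dd> might look roughly like the following Python sketch. It is an illustration only: the function glossary_to_html and its list of (style, text) pairs are hypothetical stand-ins for whatever the actual Word-to-XML converter reads from the document.

```python
from html import escape


def glossary_to_html(paragraphs):
    """Convert (style, text) pairs to an HTML definition list.

    `paragraphs` is a hypothetical stand-in for the styled paragraphs found
    between <glossary> and </glossary>.
    """
    tags = {"Term": "dt", "Definition": "dd"}
    lines = ["<dl>"]
    for style, text in paragraphs:
        if style not in tags:
            raise ValueError(f"unexpected style in glossary: {style!r}")
        tag = tags[style]
        # Escape &, <, and > so the paragraph text cannot break the markup.
        lines.append(f"  <{tag}>{escape(text)}</{tag}>")
    lines.append("</dl>")
    return "\n".join(lines)


entries = [
    ("Term", "Agape"),
    ("Definition", "Greek for the unconditional love which God extends to his people."),
]
print(glossary_to_html(entries))
```

Rejecting any style other than Term or Definition is a design choice of this sketch; a real converter might instead pass other paragraphs through unchanged.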
Theological books often contain verse -- poetry, hymns, or versified presentation of material such as the Psalms. Verse is often typeset with varying levels of indentation. These are represented with Verse 1, Verse 2, and Verse 3 paragraph styles. In the example below, the first and third line of each stanza is formatted with the Verse 1 paragraph style, the second with Verse 2, and the fourth with Verse 3.
O God, a world of empty show,
And sated with the weary sum
Sweet childhood of eternal life!
Thine Arms, to whom I turn and cling
G. Ter Steegen
Attributions to authors, of poetry or letters for example, may be given the Attribution paragraph style, as in the "G. Ter Steegen" attribution in the poem above. These are by default rendered as right-justified, italic text. Names that occur in text may be given the Name character style. Then they can be found for inclusion in an index of names referred to.
Scripture
Scripture references may be marked with the <scripRef> element, as in this example:
<scripRef passage="Rom. 8:27,28; 10:8-13" version="NIV">
Romans
The version attribute specifies the translation or version, and the passage attribute is a list of scripture references separated by commas or semicolons. Each reference may consist of a book name (or abbreviation), a chapter, and a verse. The chapter and verse are separated by a colon or period. If the book name or chapter is missing, it is assumed to be the same as in the previous reference. If two references are separated by a dash, all of the intervening verses are included as well. In the case of books with only one chapter, a reference consists of a book name or abbreviation and a verse. Book names should be as they appear in the version cited or a unique prefix of at least two letters of the name. Abbreviations that are not prefixes may also be accepted by programs that process ThML documents.
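The rules for the passage attribute can be made concrete with a small Python sketch. It is a simplified illustration, not the parser used for the CCEL: the name parse_passage is hypothetical, and the sketch assumes single-word book names (optionally preceded by a digit, as in "1 John"), treats a bare number after a book name as a chapter, and handles neither one-chapter books nor ranges that cross chapters.

```python
import re

PART = re.compile(
    r'^(?:(?P<book>(?:[1-3]\s*)?[A-Za-z]+)\.?\s*)?'  # optional book name or abbreviation
    r'(?P<start>\d+(?:[:.]\d+)?)'                     # chapter:verse, chapter.verse, or a bare number
    r'(?:\s*-\s*(?P<end>\d+))?$'                      # optional last verse of a range
)


def parse_passage(passage):
    """Expand a passage attribute into (book, chapter, first_verse, last_verse) tuples."""
    refs = []
    book = chapter = None
    for part in re.split(r'[;,]', passage):
        match = PART.match(part.strip())
        if match is None:
            raise ValueError(f"cannot parse reference: {part!r}")
        numbers = [int(n) for n in re.split(r'[:.]', match.group("start"))]
        if match.group("book") is not None:
            book = match.group("book")
        if len(numbers) == 2:
            chapter, verse = numbers
        elif match.group("book") is not None:
            # Assumption: a bare number after a book name is a chapter.
            chapter, verse = numbers[0], None
        else:
            # Book and chapter are carried over from the previous reference.
            verse = numbers[0]
        last = int(match.group("end")) if match.group("end") else verse
        refs.append((book, chapter, verse, last))
    return refs


for ref in parse_passage("Rom. 8:27,28; 10:8-13"):
    print(ref)
```

For the passage in the example, the sketch yields three references: Romans 8:27, Romans 8:28 (book and chapter carried over), and Romans 10:8 through 10:13.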
Software for processing ThML texts will likely have a scripture parser incorporated that finds scripture citations and marks them appropriately, so that it will not be necessary to mark citations by hand. However, parsing text to find and identify scripture references involves several difficulties. One problem is that different translations of the Bible use different versification schemes. For example, Psalms 9 and 10 in the King James Version are combined into a single Psalm 9 in the Septuagint. In order to interpret a reference, the versification scheme in use must be known. Scripture references will be assumed to be compatible with the versification scheme used by the KJV, ASV, NASB, NIV, and TLB unless otherwise specified.
Also, context is sometimes necessary in interpreting a reference. A passage may refer to Romans 8:28 at one point and later to verses 29 and 30 and chapter 10:8-13. A parser should be able to identify the context in most cases, but in some cases it may be necessary to set the context or turn the parser off. The <scripContext/ version="NIV" passage="Romans 8"> element is used to set the default context for the parser, and the <scripParseOff/> and <scripParseOn/> elements may be used to turn the parser off or on, to prevent linking of a passage such as "Bob had 2 apples and John 3." The version attribute may be set in a scripContext element but it is never set by the parser.
In theological texts, scripture is also sometimes quoted. In this case, it is not desirable to link the reference to the scripture passage, but it may be desirable to incorporate the passage into a table of scripture references. Quotations of scripture may be marked with the <scripture> element. A passage may be represented as in this example:
<scripture passage="Mark 7:16" version="NKJV">
If anyone has ears to hear, let them hear!</scripture>
Explanation or commentary on a passage will be marked with the <scripCom> tag, as in this example:
<scripCom passage="Mark 7:16">
Mark 7:16. This admonition seems to apply to most everyone . . .</scripCom>
Index Entries
Passages in the text may be marked for insertion into an index using the <index> element. For example, one might mark a passage for inclusion in a subject index this way:
<index type="subject" subject1="Christian Life" subject2="Sanctification" title="Apotheosis">
Apotheosis (or Deification) is an ancient theological word commonly used in Eastern theology to describe the process by which a Christian becomes more like God . . . </index>
The title attribute is used in the Table of Contents. If it is not present, the text inside the <index> element is used as a title.
Text in a foreign language may be marked with the foreign tag and the lang attribute. For example, the Greek passage <foreign lang="el">logos</foreign> may be marked as shown. The lang attribute values are as specified in ISO 639. Some examples are Dutch: nl, English: en, French: fr, German: de, Greek: el, Hebrew: he, Latin: la, Spanish: es, Portuguese: pt, Russian: ru.
If the language uses characters not available in the ISO-8859-1 (Latin-1) character set, they may be represented with the Latin-1 character set using an appropriate font. For example, <foreign lang="el"><font face="Symbol">logos</font></foreign>. The Greek and Hebrew fonts used for the CCEL are the excellent, freeware SIL Galatia and SIL Ezra fonts and related software from the Summer Institute of Linguistics, used here in a Greek example (logov) and a Hebrew example (hwhy). The latter method depends upon the availability of a particular font to the client.
Hypertext Links
Hypertext links can be inserted using the Microsoft Word link facility, perhaps using the ctrl-k shortcut. Links can be in either HTML or XML format. This is an example of a link to the CCEL.
Horizontal rules that span 30% of the page can be inserted with a paragraph using the HR30 style. These would be rendered in HTML as <hr align="center" width="30%">. The above paragraph is an example. The paragraph below, of style HR, represents a horizontal rule that spans the entire page. Of course, any HTML including horizontal rules can be inserted directly, thus: <hr width="50%">
Electronic texts formatted according to these guidelines will be converted to XML format using custom software. Browsers that support XML will be able to use the resulting texts directly, and the format is semantically rich enough that it will be possible to convert texts to a variety of other formats without loss. Those formats may include multi-file HTML webs, plain text, PDF, OnLine Bible, Docbook, Windows Help, and others.
Libraries making use of all of the semantic information in books using this markup will be able to provide a variety of capabilities not often found in digital libraries. These capabilities include global scripture and subject indexes, indexes of foreign words and names mentioned, flexible alignment of texts, linking of arbitrary texts and dictionaries or lexicons, display of the pages of a book as text or image, cross reference systems, and the ability to convert automatically to new formats that may be needed.