[Cache version of the document published at http://www.cs.usask.ca/faculty/devito/e-TLL/apa95paper.html; please use the canonical reference if possible. Additional information is available in the document section Electronic Thesaurus Linguae Latinae.]
This is an HTML version of a paper presented by Ann DeVito at the meetings of the American Philological Association in San Diego, California, 28 December 1995. It describes plans to use a grammar to direct the automated tagging of the electronic TLL database.
The Consortium for Latin Lexicography (CLL), based at the University of California at Irvine under the Directorship of Patrick Sinclair, has just begun to plan the development of an electronic version of the Thesaurus Linguae Latinae (TLL). We hope to release, in several years' time, an electronic TLL on CD-ROM, complete with its own search engine and user interface. With an electronic TLL, users will be able to find and examine articles that meet the criteria they specify with a speed and thoroughness that is impossible using the printed version. In order to support computerized search of this sort, it is necessary to tag the TLL database appropriately. This paper concentrates on the CLL's grammar-directed approach to automating the tagging of the electronic TLL.
A TLL article can be a complex entity that holds many different kinds of information, including the lemma, ancient and modern discussions of etymology, variants in spelling, form, gender, and prosody, ancient discussions of meaning, and modern discussions of the word's survival in Romance languages, as well as comprehensive lists of citations showing how the word was used by Latin authors. Ideally, with the electronic TLL one should be able to conduct searches on specific kinds of information within articles: for example, searches of lemmata only; searches of citations only; or all articles that cite a certain author or title. It also would be advantageous to display only selected sections of a long and complex TLL article.
In order to support these functions, it is necessary to tag the different types of information and sections within TLL articles so that each can be recognized as a discrete entity. These functional tags, while unseen by the users of the electronic TLL, will enable our computer software to conduct sophisticated searches and to display selected types of information.
There are a number of steps involved in translating the TLL from the printed page to an electronic database that is fully tagged. The first step in the translation is transferring the TLL from the printed page to computer disk. The TLL text is typed in manually by a data-entry firm. As they enter the text of the TLL, the data-entry typists add Beta Code to indicate the typeface of the printed page. (This Beta Code is an extended version of the Beta Code used by the Thesaurus Linguae Graecae to encode Greek characters.)
Figure 1 shows a reproduction of part of a printed page in the TLL (col. 966 l.43-52).
Figure 2, below, shows the same passage in Beta Code.
{100{101&1ores}101 &3v. &oros.}100
{100{101&1o%26re%26sco}101 &3v. &auresco.}100
{100{101&3? &1oresta}101: &thres [1&3anglosax.&]1 &7Gloss. &V 376, 6 [1&3item &7Gloss.&13L &Corp. o 229; ora vestis &3temptat Lindsay&]1.}100
{100{101&1orestiade%26%27s &3f., $O)RESTIA/DES.}101 &3i. q. <20OREADES>20&: &7Fest. &p. 185 %19es nymphae montium cultrices.}100
{100{101&1orestion &3n., $O)RE/STION}101 &[1&3resp. apud Diosc. gr.&]1. &3nomen herbae q. e. <20INULA>20&: &7Plin. &nat. 14, 108 herba, quam alii helenion, alii medi- cam, alii symphyton, ... alii %19on [1%19um &3var. l.&]1 ... vocant [1&3in eadem fere serie nominum&: &7Diosc. &`1, 28 p. 19, 8 M. [&3scrib. &%19stini]. 5, 77 p. 195, 20]1. &7Gloss. &III 571, 48 %19on [1%11stimi &3trad.&]1 i. eleniu.}100
The Beta Code sample in Figure 2 is filled with typographical codes: "&1" signifies bold Roman font, "&3" italic Roman, and "$" the Greek alphabet; other codes mark macrons and similar features of the printed page.
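As a rough illustration of what these font codes convey, the following Python sketch splits a Beta-coded string into runs of text labelled by typeface. It is not part of the CLL software. The function name `typeface_runs`, the default typeface, and the handling of codes this paper does not explain (reported as "unknown") are assumptions made for the example.

```python
import re

# Font codes whose meaning is stated in this paper. Any other code
# (for example "&7") is reported as unknown rather than guessed at.
FONT_NAMES = {
    "&": "normal roman",
    "&1": "bold roman",
    "&3": "italic roman",
    "$": "greek",
}

# A font code is "&" followed by optional digits, or "$".
# ASSUMPTION: a digit right after "&" always belongs to the code.
FONT_CODE = re.compile(r"&\d*|\$")


def typeface_runs(beta):
    """Split Beta-coded text into (typeface, text) runs."""
    runs = []
    font = "normal roman"  # ASSUMPTION: text starts in normal Roman
    pos = 0
    for match in FONT_CODE.finditer(beta):
        text = beta[pos:match.start()]
        if text:
            runs.append((font, text))
        code = match.group()
        font = FONT_NAMES.get(code, "unknown " + code)
        pos = match.end()
    if beta[pos:]:
        runs.append((font, beta[pos:]))
    return runs


print(typeface_runs("&1ores &3v. &oros."))
```

Applied to the body of the first article in Figure 2, the sketch reports "ores " in bold Roman, "v. " in italic Roman, and "oros." in normal Roman. The runs describe only the form of the page, not the function of each element.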
At present, 100 sample columns (50 printed pages) of the TLL have been typed and Beta-coded in this manner. This is only a preliminary step, however. We cannot ask the data-entry typists, who do not know Latin and who are not familiar with the structure of TLL articles, to do anything more than reproduce the typography of the printed page with Beta Code. Yet the Beta Code only preserves the form of the printed TLL page; in order to search the electronic TLL effectively, we must be able to distinguish the function of different elements within each TLL article. Thus, we must replace the formal Beta Code with functional tags that will define the functions of different parts of the TLL articles.
The CLL has decided to use the Text Encoding Initiative (TEI) implementation of SGML (Standard Generalized Markup Language) as functional tags to distinguish the different types of information found within TLL articles. TEI offers features that we need in our tags for the TLL, such as functional, as well as formal, definitions, and the use of both start and end tags to surround an element; it also has the advantage of being an emerging standard in tagging electronic texts in the Humanities.
Figure 3 shows a few short TLL articles that have been tagged with TEI by C. M. Sperberg-McQueen. At the top is the tagged article "ores v. oros." Note that <ENTRY> tags surround the article; <FORM> and <ORTH> tags surround the lemma; <XR> tags surround the cross reference. Some typographical information is still preserved here, in the <LBL REND="ital"> tag that surrounds the "v.", but the most important feature is the ability to distinguish the functions of different parts of the article.
<ENTRY><FORM><ORTH>ores</ORTH></FORM> <XR><LBL REND="ital">v.</LBL> <REF>oros</REF>.</XR></ENTRY>
<ENTRY><FORM><ORTH>ōrēsco</ORTH></FORM> <XR><LBL REND="ital" >v.</LBL> <REF>auresco</REF>.</XR></ENTRY>
<ENTRY ID="oresta"><FORM><LBL REND="ital">?</LBL> <ORTH>oresta</ORTH>:</FORM> <SENSE><TRANS><DEF>thres</DEF> (<LANG REND="ital">anglosax.</LANG>)</TRANS> <SEG><BIBL><TITLE REND="sc">Gloss.</TITLE> V 376, 6</BIBL></SEG> (<HI REND="ital" >item</HI> <EG><CIT><BIBL><TITLE REND="sc">Gloss. <HI REND="sup">L</HI></TITLE> Corp. o 229</BIBL> <Q>ora vestis</Q></CIT></EG> <HI REND="ital">temptat Lindsay</HI>).</SENSE></ENTRY>
<ENTRY><FORM><ORTH>orestiade¯˘s</ORTH> <GEN REND="ital">f.,</GEN></FORM> <DEF><MENTIONED LANG="grc">O)RESTIA/DES.</MENTIONED></DEF> <DEF><HI REND="ital"> i. q.</HI> <MENTIONED REND="exp">oreades</MENTIONED>:</DEF> <EG><CIT><BIBL><TITLE REND="sc">Fest.</TITLE> p. 185</BIBL> <QUOTE><OREF>es nymphae montium cultrices.</QUOTE></CIT></EG></ENTRY>
<ENTRY><FORM><ORTH>orestion</ORTH> <GEN REND="ital">n.</GEN>,</FORM> <ETYM><MENTIONED LANG="grc" >O)RE/STION</MENTIONED> (<HI REND="ital">resp. apud Diosc. gr.</HI>).</ETYM> <DEF><HI REND="ital">nomen herbae q. e. <MENTIONED REND="exp">inula</MENTIONED></HI>:</DEF> <EG><CIT><BIBL><AUTHOR REND="sc">Plin.</AUTHOR> nat. 14, 108</BIBL> <QUOTE>herba, quam alii helenion, alii medi- cam, alii symphyton, ... alii <OREF>on (<OREF>um <HI REND="ital">var. l.</HI>) ... vocant</QUOTE></CIT></EG> <NOTE>(<HI REND="ital">in eadem fere serie nominum</HI>: <BIBL><AUTHOR REND="sc"> Diosc.</AUTHOR> 1, 28 p. 19, 8 M. <NOTE>[<HI REND="ital">scrib.</HI> <MENTIONED REND="rom"><OREF>stini</MENTIONED>]</NOTE>. 5, 77 p. 195, 20</BIBL>).</NOTE> <EG><CIT><BIBL><TITLE REND="sc">Gloss.</TITLE> III 571, 48</BIBL> <QUOTE><OREF>on <NOTE>(<MENTIONED REND="rom"><OREF TYPE="nonlemma">stimi</MENTIONED> <HI REND="ital">trad.</HI>)</NOTE> i. eleniu.</QUOTE></CIT></EG></ENTRY>

TEI Tagging by C. M. Sperberg-McQueen
In devising a tag set for the TLL, our starting point will be the tag set for printed dictionaries defined in the TEI P3 guidelines (Sperberg-McQueen and Burnard 1994), which is the set used in the figure above; however, since this tag set does not meet all of the needs of a historic lexicon like the TLL, the CLL hopes to work with the TEI and other interested parties to develop a tag set for historic lexica. (These plans are contingent upon all parties receiving the requisite funding.) The CLL is confident that it will be possible to devise a TEI-conformant SGML tag set that will be able to define each of the sections and types of information within the TLL.
Recall that as a preliminary step we have the TLL typed onto disk, with Beta Codes added that preserve the typography of the printed page. As a second step, we must convert the Beta-Coded TLL text to text with TEI tags that define the function of elements within articles.
Usually the conversion of one type of encoding of a textual database to another--in this case, the conversion of Beta Code to TEI tags--is carried out in a series of steps, translating the tags in one set of elements after another. Unfortunately, this series of translation steps is often ad hoc, with translations carried out as the need presents itself. This can result in confusion and in uneven results. The text of the TLL is so complex and so large that we can ill afford any degree of confusion or inconsistency in the tagging process. For this reason, the CLL has decided to follow the lead of the developers of the Oxford English Dictionary on Compact Disc (Oxford University Press 1992) and use a grammar to guide the automated tagging process (Berg and others 1990).
Grammars have been used in Computer Science for many years to define and check the syntax of programming languages (Aho and Ullman 1977). The developers of the electronic OED created a similar type of grammar that defined the syntax, or structure, of articles in the OED (Berg and others 1990; Kazman 1986). The grammar was used to transduce--i.e., parse and tag--the text of the OED, separating and marking individual articles and the different types of information contained in the sections within each article.
Following this line of research, we are creating a grammar for the TLL. The TLL grammar attempts to define the syntax of the TLL and its articles, delineating each of the sections of a TLL article by taking advantage of the typographical conventions used in the printed TLL (and reproduced in the Beta-Coded text of the TLL), as well as the clues offered by the order of elements and by certain phrases, abbreviations, and punctuation. Our plan is to use the TLL grammar to guide the conversion of the Beta-Coded text that reproduces the typography of the TLL to TEI-tagged text that defines the functions of distinct elements within TLL articles.
The CLL expects to develop a transducer generator, software that will read in a grammar--in this case, the TLL grammar--and generate automatic tagging software, formally known as a transducer, that is based on that grammar. This tagging software will then read in and parse Beta-Coded TLL text according to the TLL grammar as it applies the corresponding TEI tags, thus providing a measure of verification during the tagging process.
The grammar-driven approach to automatic tagging offers some significant advantages. First, one avoids the potential for confusion that is inherent in applying a series of ad hoc translations to a text. Second, because the grammar defines the structure of TLL articles, any tagging process that is guided by a grammar becomes a verification process as well. Whenever the grammar-driven tagging software encounters anomalous structures that are not expected in an article, that article is flagged so that it can be reviewed. Although there are errors that a grammar-driven tagger cannot catch (e.g., misspellings of words other than keywords employed by the grammar), the grammar-driven process does provide a level of verification that is absent from conventional automated tagging techniques.
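The "tag or flag" behaviour described above can be sketched in Python for the simplest kind of article, the cross-reference. This is a minimal sketch under stated assumptions, not the CLL transducer. A single regular expression stands in for the grammar. The reading of "{100 ... }100" as article delimiters and "{101 ... }101" as headword delimiters is inferred from Figure 2 rather than stated in this paper. The names `CROSS_REF` and `tag_or_flag` are hypothetical. Beta Codes for macrons inside the lemma are passed through unconverted.

```python
import re

# ASSUMED shape of a cross-reference article, read off Figure 2:
#   {100{101&1LEMMA}101 &3v. &TARGET.}100
CROSS_REF = re.compile(
    r"\{100\{101&1(?P<lemma>[^}&]+)\}101 &3v\. &(?P<target>[^.&]+)\.\}100"
)

# TEI output modelled on the first entry of Figure 3.
TEMPLATE = (
    "<ENTRY><FORM><ORTH>{lemma}</ORTH></FORM> "
    '<XR><LBL REND="ital">v.</LBL> <REF>{target}</REF>.</XR></ENTRY>'
)


def tag_or_flag(articles):
    """Tag articles that fit the expected structure; flag the rest."""
    tagged, flagged = [], []
    for article in articles:
        match = CROSS_REF.fullmatch(article)
        if match is None:
            # Anomalous structure: set aside for human review.
            flagged.append(article)
        else:
            tagged.append(TEMPLATE.format(**match.groupdict()))
    return tagged, flagged


tagged, flagged = tag_or_flag([
    "{100{101&1ores}101 &3v. &oros.}100",
    "{100{101&1ores}101 &3v. oros.}100",  # font code missing before target
])
print(tagged)
print(flagged)
```

The first input is the first article of Figure 2, and the tagged result has the same form as the first entry of Figure 3. The second input lacks the font code before the cross-reference target, so it is flagged rather than tagged, which is the verification effect the grammar-driven approach aims for.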
At present, work is proceeding on a preliminary version of the TLL grammar. Figure 4 depicts the highest level of the grammar and shows how the grammar defines the basic structure of the TLL and its articles.
<TLL> ::= <front> <body> <back>
<body> ::= <article>+
-- An article may be complex or may consist only of lemma
-- and cross reference
<article> ::= [<quest>] <artHeader> <artMain> [<artSupp>]
            | <lemmaSect><crossRefSect>.
<artHeader> ::= <hwdgrp> [<prelimSect>]
<artMain> ::= [<definition>] <wordHistory>
<wordHistory> ::= <senseLevel>+
<senseLevel> ::= <sense> <citation>+
<citation> ::= <author> <work> <loc> <quote>

Key:
A ::= B    A consists of B
[A]        A is optional
A+         A occurs 1 or more times
A|B        either A or B occurs
<A>        A is non-terminal
The TLL grammar uses a notation known as Backus-Naur Form or BNF--the main points of this notation are summarized in the Key shown at the bottom of Figure 4.
The portion of the grammar depicted here shows that the TLL as a whole is composed of front matter, back matter, and the body, which is defined as a series of articles. An article may be complex (the option specified on the first line of the article definition), or may consist only of a lemma and a cross-reference (the option specified on the second line). A complex article always has a header and a main section; it may have an optional prefatory question mark (applied to questionable entries) and/or supplement.
Figure 5 shows a somewhat more detailed level of the grammar. The article header always has a headword group and may contain an optional preliminary section. The headword group consists of the lemma and a number of other optional sections that give variant spellings, part-of-speech information, cognates, and even a second headword group. In Figure 5, only the lemma section (<lemmaSect>) is defined in detail. A lemma section consists of the lemma in boldface; thus the grammar indicates that the Beta Code for bold Roman must precede the lemma. Note that the non-terminal <boldRomanCode> is defined at the bottom of Figure 5 by the terminal symbol "&1", which is the Beta Code inserted by the data-entry typists to indicate boldface. The lemma itself is defined as one or more letters or "special characters"; the "special character" designation is necessary because Beta Codes for macrons and breves may be interspersed with letters in the lemma. It should now be clear how the TLL grammar defines the functional structure of the TLL in terms of the Beta-Coded form that preserves the typography of the printed page.
<article> ::= [<quest>] <artHeader> <artMain> [<artSupp>]
            | <lemmaSect><crossRefSect>.

Article Header
<artHeader> ::= <hwdgrp> [<prelimSect>]
<hwdgrp> ::= [<cross> ] [<digit>]<lemmaSect> [<varSpellComment>] [, <p-o-s>] [, <cognate>] [<italRomanCode>et [depon.] <hwdgrp>].

Lemma section: consists solely of the boldface lemma word
<lemmaSect> ::= <boldRomanCode> <lemma>
<lemma> ::= <letterOrMark>+

Alphabets and digits
<letterOrMark> ::= <letter>|<specChar>
<letter> ::= a | b | ... | y | z
<digit> ::= 0 | 1 | ... | 8 | 9

Font style codes
<normRomanCode> ::= &
<boldRomanCode> ::= &1
<italRomanCode> ::= &3
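To make the lemma rules of Figure 5 concrete, the following Python sketch stores them as data and interprets them with a small backtracking recognizer. It is an illustration only, not the transducer generator the CLL plans to build. The encoding of rules as lists, the `("plus", X)` marker for "X+", and the function names are choices made for this example. The rule for <specChar> is an assumption, since Figure 5 does not define it: "%26" is the macron code discussed in this paper, and "%27" is taken from Figure 2 without a stated meaning.

```python
import string

# Each non-terminal maps to a list of alternatives (A|B); each
# alternative is a sequence of symbols. ("plus", X) stands for X+.
GRAMMAR = {
    "<lemmaSect>": [["<boldRomanCode>", "<lemma>"]],
    "<lemma>": [[("plus", "<letterOrMark>")]],
    "<letterOrMark>": [["<letter>"], ["<specChar>"]],
    "<letter>": [[c] for c in string.ascii_lowercase],
    "<specChar>": [["%26"], ["%27"]],  # ASSUMPTION: not in Figure 5
    "<boldRomanCode>": [["&1"]],
}


def ends(symbol, text, pos):
    """Yield each position at which `symbol` can stop, starting at `pos`."""
    if isinstance(symbol, tuple):        # ("plus", X): one or more X
        for mid in ends(symbol[1], text, pos):
            yield mid
            if mid > pos:                # avoid looping on empty matches
                yield from ends(symbol, text, mid)
    elif symbol in GRAMMAR:              # non-terminal: try each alternative
        for alternative in GRAMMAR[symbol]:
            yield from seq_ends(alternative, text, pos)
    elif text.startswith(symbol, pos):   # terminal: literal Beta Code text
        yield pos + len(symbol)


def seq_ends(symbols, text, pos):
    """Yield each position at which the whole sequence can stop."""
    if not symbols:
        yield pos
        return
    for mid in ends(symbols[0], text, pos):
        yield from seq_ends(symbols[1:], text, mid)


def recognizes(start, text):
    """True if `text` as a whole can be derived from `start`."""
    return any(end == len(text) for end in ends(start, text, 0))


print(recognizes("<lemmaSect>", "&1o%26rno"))
print(recognizes("<lemmaSect>", "&3v."))
```

Because the rules are plain data, changing the grammar changes the recognizer without touching the matching code. A real transducer would also have to record which rule matched where, so that the corresponding TEI tags could be emitted.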
Figure 6 shows the parse tree for the simple headword group orno, -avi, -atum, -are and shows how the Beta Code representation of the lemma can be parsed into its constituent parts.
We can see how the parse tree works by reading the tree from the bottom up. At the bottom is the Beta Code for the headword group, # &1o%26rno, &%19a%26vi%26, %19a%26tum, %19a%26re. The "#" symbol, which stands for a "cross" symbol in Beta Code, is parsed into <cross>, which is the optional first element in the headword group. "&1", the Beta Code symbol for bold Roman, is parsed into <boldRomanCode>, which is always the first element of <lemmaSect>. "o" followed by "%26" is parsed into o<macron>, which is then parsed into <longO>. This element, followed by the letters "rno", is parsed into <letterOrMark>+, which makes up the <lemma> per se. <boldRomanCode> and <lemma> form the <lemmaSect>. With <cross> and <lemmaSect>, we have the first two elements of <hwdgrp> accounted for. The last element of <hwdgrp>, <p-o-s>, is not parsed in this diagram due to lack of space, but it does parse correctly. With all required elements of <hwdgrp>, including punctuation, accounted for in the Beta Code, # &1o%26rno, &%19a%26vi%26, %19a%26tum, %19a%26re succeeds in parsing correctly.
Any headword group that could be parsed successfully in this manner would be tagged appropriately with TEI tags distinguishing the lemma and the part of speech information -- any that did not would be set aside for review.
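The lowest step of the parse described above, in which "o" followed by "%26" becomes a long o, can be sketched in Python as a conversion to Unicode. This is only an illustration of one way the output character might be produced. The function name `decode_macrons` is hypothetical, only the macron code "%26" is handled, and all other Beta Codes are left untouched.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def decode_macrons(beta):
    """Replace each Beta Code macron (%26) with a Unicode macron.

    The code follows the letter it modifies, as a combining mark does,
    so a plain substitution followed by NFC normalization yields
    precomposed long vowels where Unicode provides them.
    """
    return unicodedata.normalize("NFC", beta.replace("%26", COMBINING_MACRON))


print(decode_macrons("o%26rno"))
print(decode_macrons("o%26re%26sco"))
```

For the lemma "o%26re%26sco" of Figure 2 this yields the form "ōrēsco" that appears inside the <ORTH> tags of Figure 3.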
Although the CLL will attempt to devise one grammar that defines the structure of the TLL and its articles, it is expected that this archetypal grammar will have to be broken down into a number of subsidiary grammars for the actual translation process. This expectation is based on the experience of the developer of the OED grammar, who found that it was necessary to break down the translation process into a series of steps, each governed by a grammar that had been refined especially for that step (Kazman 1986). We may also find that we need to develop different grammars (and their subsidiaries) for different stages in the TLL's 100-year history.
Creating this grammar will be difficult and time-consuming, since TLL articles range in complexity from one-line cross-references to multi-page entries. Moreover, although there is a basic structure to the typical TLL entry, the TLL is riddled with inconsistencies of form owing to the fact that the TLL has been compiled by a number of scholars over a span of more than 100 years. Nevertheless, it is hoped that the TLL grammar will succeed in automatically preparing most of the text of the TLL for efficient electronic search.
Grammar-driven tagging offers the benefits of consistent tagging as well as a high degree of verification during the tagging process. The CLL is committed to developing a grammar for the TLL that will guide the automated process of applying TEI tags that will support the effective search of different elements of TLL articles. The TLL grammar may also be interesting in its own right as it sheds light on regularities and irregularities within the TLL.
At present, the electronic version of the TLL is only in the planning stages. It will take several years--and a lot of luck at finding grant money--to complete the project. It is hoped that this will be the first of a series of progress reports on the development of the electronic TLL.
The Oxford English Dictionary on Compact Disc. 2nd ed. Oxford: Oxford University Press, 1992.
Aho, A.V. and J.D. Ullman. Principles of Compiler Design. Reading, MA: Addison-Wesley, 1977.