[Cache version from http://www.cs.usask.ca/faculty/devito/e-TLL/ach95poster.html; please see the canonical version if possible.]

Developing an Electronic Thesaurus Linguae Latinae

Ann DeVito
Technical Advisor
Consortium for Latin Lexicography
July 1995

This is an HTML version of a poster presented by Ann DeVito at the meetings of the ACH/ALLC (Association for Computers and the Humanities/Association for Literary and Linguistic Computing) in Santa Barbara, California, 11 - 15 July 1995. The HTML document was created on August 27, 1995.

  1. The Thesaurus Linguae Latinae
  2. The Consortium for Latin Lexicography
  3. The potential benefits of an electronic TLL
  4. Software to display and search the electronic TLL
  5. Preparing the text database for the electronic TLL
  6. Automating the tagging process
    1. Using a grammar to automate tagging: the OED CD-ROM
    2. Designing a grammar for the TLL
    3. A very simple grammar for the TLL
    4. A closer look at a very simple grammar for the TLL
    5. Using the grammar in the automatic tagging process
    6. Anticipated difficulties
  7. The present state of development of the electronic TLL
  8. Contact addresses
  9. References

1. The Thesaurus Linguae Latinae

The Thesaurus Linguae Latinae is a Latin lexicon that attempts to catalogue thoroughly the use of the language in surviving Latin texts to 600 A.D. (with full coverage of texts down to 200 A.D. and coverage of an extensive selection of texts between 200 and 600 A.D.). Work on this ambitious lexicon, begun in 1894, continues to this day. To date, ten thick volumes that consume three feet of shelf space have been produced. The TLL is the Latin philologist's fundamental tool, providing detailed notes on usage and a wealth of citations, both of Latin texts and of scholarly articles.

2. The Consortium for Latin Lexicography

In 1994, it was determined that the TLL could become even more useful if it were made available in electronic form. Thus, the Consortium for Latin Lexicography (CLL) came into being, with the production of an electronic version of the TLL as its mandate. CLL is composed of members from the Classics Department at the University of California at Irvine, the editors of the Thesaurus Linguae Latinae, and the Department of Computer Science at the University of Saskatchewan. The Director of CLL is Patrick Sinclair of the Department of Classics at the University of California at Irvine.

3. The potential benefits of an electronic TLL

An electronic TLL has the potential to be an extremely valuable tool, bringing the advantages of computerized search to the wealth of information in the TLL. With an electronic TLL, users could find and examine articles that meet the criteria they specify--for example, all articles that cite a certain author or work. Users would be able to find articles of interest with a speed and thoroughness that is impossible with the printed version.

It is also hoped that an electronic version of the TLL will be easy enough to use that it will appeal to a broader base of users, including students who may be intimidated by the size of the printed version.

4. Software to display and search the electronic TLL

In designing the software to search and display the text of the electronic TLL, CLL is committed to supporting three properties above all: a) rapid, thorough, and accurate search and retrieval, b) intuitive navigation, and c) a clear, easy-to-read display. The software will allow users to find and examine articles that meet the criteria they specify--for example, all articles that cite a certain author or text. The software also will provide navigational tools that allow users to move quickly to articles of interest and to browse through the lexicon. It should be an easy matter to move to a specific article, to browse through consecutive articles, to refer back to previously consulted articles, or to follow cross-references to other articles. The graphical user interface will support multiple windows to display different types of information and should be straightforward and consistent.

In addition to allowing users to browse and search the articles of the TLL, the software will display the Praemonenda de Rationibus et Usu Operis (the introduction to the TLL) and the TLL's Index Librorum Scriptorum Inscriptionum.

Given the international nature of the TLL's users and of CLL, we hope to make our user interface available in English, German, French, and, of course, Latin.

5. Preparing the text database for the electronic TLL

The foremost problem confronting the developers of an electronic TLL is that of preparing the text so that it can support useful and efficient computerized search. A TLL article can be a complex entity that holds many different kinds of information, including the lemma, ancient and modern discussions of etymology, variants in spelling, form, gender, and prosody, ancient discussions of meaning, and modern discussions of the word's survival in Romance languages, as well as comprehensive lists of citations showing how the word was used by Latin authors. Ideally, with the electronic TLL one should be able to conduct searches on specific kinds of information within articles. It also would be advantageous to display only selected sections of a TLL article on screen. In order to support these functions, it is necessary to tag the different types of information and sections within TLL articles so that each can be recognized as a discrete entity. These tags, while unseen by the users of the electronic TLL, will enable the computer software to conduct sophisticated searches and to display selected types of information.
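To illustrate why tagging matters for search, here is a minimal Python sketch of a query over tagged articles. It is not the project's software: the class names, the field names, and the sample records are assumptions invented for illustration, and the placeholder data is not taken from the TLL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    # Hypothetical fields: who is cited, in which work, where, and the quotation.
    author: str
    work: str
    loc: str
    quote: str


@dataclass
class Article:
    lemma: str
    citations: List[Citation] = field(default_factory=list)


def articles_citing(articles: List[Article], author: str) -> List[str]:
    """Return the lemmas of all articles that cite the given author."""
    return [
        a.lemma
        for a in articles
        if any(c.author == author for c in a.citations)
    ]


# Placeholder records, invented for illustration only.
sample = [
    Article("lemma-one", [Citation("Author A", "Work 1", "1.1", "...")]),
    Article("lemma-two", [Citation("Author B", "Work 2", "2.3", "...")]),
    Article("lemma-three", [
        Citation("Author B", "Work 2", "4.7", "..."),
        Citation("Author A", "Work 3", "5.2", "..."),
    ]),
]

print(articles_citing(sample, "Author A"))
```

Because the author is stored as its own tagged field, the query inspects only that field and cannot be misled by the same string appearing elsewhere in an article.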

6. Automating the tagging process

With the large amount of complex text presented by the TLL, manual tagging is not feasible. It is necessary to find a way to automate the tagging process. We start with a preliminary electronic version of the TLL text, typed in manually by a data-entry firm. The data-entry typists enter the text of the TLL, adding codes that indicate the typeface of the printed page. Because form is often aligned with function in the TLL typeface scheme, these typeface codes will prove invaluable in the automated tagging process. In addition, the typists add codes that distinguish the most basic divisions of a TLL article. (Because the TLL articles are entirely in Latin, it is not possible to ask typists to distinguish anything more subtle than these most basic divisions.) At present, 100 sample columns (50 printed pages) of the TLL have been typed and encoded in this manner.

Next comes the difficult task of automatically converting this encoded text to text that is marked with tags that distinguish the sections and different types of information within an article. This process utilizes the clues provided by the typeface codes, keywords in the text, certain punctuation conventions, and the expected order of sections within an article. The best way to make use of these clues is to design and utilize a grammar that defines the lexicon itself.

6.1. Using a grammar to automate tagging: the OED CD-ROM

The development of the second edition of the Oxford English Dictionary on Compact Disk (Oxford University Press 1992) provides a precedent for distinguishing specific types of information in a large, complex, and sometimes inconsistent lexicon like the TLL. The developers of the OED CD-ROM used a grammar to process text so that it could be organized for efficient search (Berg, Gonnet, and Tompa 1990). Grammars have been used in computer science for many years to define and check the syntax of programming languages (Aho and Ullman 1977). The developers of the OED CD-ROM created a similar type of grammar that defined the syntax, or structure, of the OED and its articles (Berg, Gonnet, and Tompa 1990; Kazman 1986). The grammar was used to parse and transduce the text of the OED in order to distinguish and tag individual articles and the different types of information contained within each article.

6.2. Designing a grammar for the TLL

Following this line of research, the developers of the electronic TLL plan to create a grammar for the TLL. The TLL grammar will attempt to define the syntax of the TLL and its articles, delineating the different types of information within a TLL article by taking advantage of the typographical conventions used in the printed TLL and the clues offered by certain keywords, abbreviations, and punctuation.

Creating this grammar will be difficult and time-consuming, since TLL articles range in complexity from one-line entries, containing no more than a cross-reference, to complex multi-page entries that contain discussions of different levels of meaning, ancient and modern deliberations on etymology, variant spellings and forms, and more, all supported by numerous citations of ancient authors and grammarians. We expect, however, that the results of using the grammar in the automatic tagging process will repay the effort.

6.3. A very simple grammar for the TLL

A grammar is a series of rules, each of which shows how an element of the TLL can be decomposed into other elements. Figure 1 shows a grammar that illustrates the basic structure of the TLL, greatly simplified. The grammar shows that the TLL as a whole consists of front matter, the body, and back matter. The content of the front and back matter is ignored in this simple grammar, but the grammar defines the body as a series of articles. Each article has a header, a main section, and an optional supplement. The article header always has a headword group that consists of the lemma and, optionally, part-of-speech information. The article header may also contain a "preliminary section" that can hold twelve different types of information, from modern and ancient discussions of etymology to discussions concerning textual criticism. Any or all of these twelve subsections may be present in the preliminary section.

The grammar shows that the main section of the article may consist solely of a cross reference, or it may illustrate the meaning and usage of the word by collecting a number of citations together in the "word history" section. In recent volumes of the lexicon, a general definition precedes the word history section. The word history section consists of one or more senses of the word, each illustrated by citations of Latin texts.


Figure 1. A very simple grammar for the TLL

TLL := front body back

body := article+

article := artHeader artMain artSupp?

artHeader := hwdgrp prelimSect?

hwdgrp := lemmaSect p-o-s?

prelimSect := modEtymSub? ancEtymSub? varSpellSub? abbrSub? NTSub? varClassSub? varFormSub? varProsSub? ancMeanSub? liaSub? romanceSub? textCritSub?

artMain := (crossRef | (definition? wordHistory))

wordHistory := senseLevel+

senseLevel := sense citation+

citation := author work loc quote

Key to symbols:

A := B    A consists of B
A?        A is optional
A+        A occurs 1 or more times
A|B       either A or B occurs
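The rules of Figure 1 can also be held as plain data, which makes simple consistency checks possible. The Python sketch below is only an illustration: the dictionary layout and the helper names `symbols` and `undefined_symbols` are assumptions, not part of the TLL project. It lists the elements that Figure 1 uses without defining, which are the ones that lower-level rules would still have to spell out.

```python
import re
from typing import Dict, List

# The rules of Figure 1, transcribed as element name -> right-hand side.
GRAMMAR: Dict[str, str] = {
    "TLL": "front body back",
    "body": "article+",
    "article": "artHeader artMain artSupp?",
    "artHeader": "hwdgrp prelimSect?",
    "hwdgrp": "lemmaSect p-o-s?",
    "prelimSect": (
        "modEtymSub? ancEtymSub? varSpellSub? abbrSub? NTSub? varClassSub? "
        "varFormSub? varProsSub? ancMeanSub? liaSub? romanceSub? textCritSub?"
    ),
    "artMain": "(crossRef | (definition? wordHistory))",
    "wordHistory": "senseLevel+",
    "senseLevel": "sense citation+",
    "citation": "author work loc quote",
}


def symbols(rhs: str) -> List[str]:
    """Extract element names from a right-hand side, dropping ? + | ( )."""
    return re.findall(r"[A-Za-z][A-Za-z-]*", rhs)


def undefined_symbols(grammar: Dict[str, str]) -> List[str]:
    """List elements that some rule uses but that have no rule of their own."""
    used = {name for rhs in grammar.values() for name in symbols(rhs)}
    return sorted(used - set(grammar))


# The preliminary section lists twelve optional subsections.
print(len(symbols(GRAMMAR["prelimSect"])))
print(undefined_symbols(GRAMMAR))
```

Elements such as lemmaSect and ancEtymSub appear in the second list because Figure 1 stops above the level at which they are defined.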

6.4. A closer look at a very simple grammar for the TLL

If we look at a few rules from a lower level of a simple TLL grammar, we see that typographical clues, keywords, and punctuation form an important part of the grammar. In the printed TLL, the lemma always appears in bold roman font. In the encoded text of the TLL, typists have entered the typographical code "&1" before any text that appears in bold roman on the printed page. Thus in the example below, lemmaSect is defined as the bold roman code "&1" followed by the lemma itself, which is defined as a series of alphabetic characters. The typographical clues in the encoded text of the TLL are used as defining elements in the grammar.

lemmaSect := "&1" lemma

lemma := char+

char := {"a", "b", "c", ... "z"}

Keywords are also important in the grammar. The ancient etymology subsection of the preliminary section usually begins with the keywords de origine or derivatur a(b) in italics, which are denoted by the typographical code "&3". This subsection always ends with a period. Thus, the rules defining the ancient etymology subsection are:

ancEtymSub := ("&3" ("de origine" | "derivatur a" | "derivatur ab"))? ancEtym "."

ancEtym := char+
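As a rough illustration, the lemmaSect and ancEtymSub rules can be written as regular expressions. This Python sketch is not the project's parser. It assumes that a typeface code is immediately followed by its text and that whitespace may separate a keyword from the text after it, neither of which the rules above specify. The function names and the sample strings are invented for illustration.

```python
import re
from typing import Optional, Tuple

# lemmaSect := "&1" lemma ; lemma := char+ ; char := "a".."z"
LEMMA_SECT = re.compile(r"&1(?P<lemma>[a-z]+)")

# ancEtymSub := ("&3" ("de origine" | "derivatur a" | "derivatur ab"))? ancEtym "."
# Assumption: optional whitespace between the keyword and ancEtym.
ANC_ETYM_SUB = re.compile(
    r"(?:&3(?P<keyword>de origine|derivatur ab|derivatur a)\s*)?"
    r"(?P<ancEtym>[a-z]+)\."
)


def match_lemma_sect(text: str) -> Optional[str]:
    """Return the lemma if the whole text is a lemmaSect, else None."""
    m = LEMMA_SECT.fullmatch(text)
    return m.group("lemma") if m else None


def match_anc_etym_sub(text: str) -> Optional[Tuple[Optional[str], str]]:
    """Return (keyword or None, ancEtym) if the whole text is an ancEtymSub."""
    m = ANC_ETYM_SUB.fullmatch(text)
    return (m.group("keyword"), m.group("ancEtym")) if m else None


# Made-up encoded strings, not taken from the TLL.
print(match_lemma_sect("&1abdo"))
print(match_anc_etym_sub("&3derivatur ab abdo."))
print(match_anc_etym_sub("&3de origine abdo"))  # no closing period
```

Since char covers only the letters a to z, ancEtym in this very simple grammar can match no more than a single unbroken word. A fuller grammar would need a richer definition.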

6.5. Using the grammar in the automatic tagging process

The grammar for the TLL will guide the automatic tagging process. A program will read in the encoded text with its typographical codes and parse it by comparing it to the grammar. As the program finds matches for specific types of information or section markers, it will apply the appropriate tags. For example, when the program finds text preceded by the bold roman code "&1" at the beginning of a headword group, it will tag the text as a lemma; when the program finds the keywords de origine or derivatur a(b) preceded by the italic code "&3" inside the preliminary section of an article, it will tag the keywords and the text that follows as belonging to the ancient etymology subsection.
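The two tagging examples just described might look roughly like the Python sketch below. It uses a regular-expression substitution as a stand-in for a grammar-driven parser, and it ignores the positional conditions (beginning of a headword group, inside the preliminary section) that a real parser would check. The angle-bracket tag names, the function name `tag`, and the sample input are assumptions made for illustration. The etymology text is allowed to be anything up to the first period, which is looser than the char+ rule.

```python
import re

# One alternative per kind of information, tried at each position in the text.
PATTERN = re.compile(
    r"&1(?P<lemma>[a-z]+)"
    r"|(?P<ancEtymSub>&3(?:de origine|derivatur ab|derivatur a)[^.]*\.)"
)


def tag(encoded: str) -> str:
    """Replace typeface-coded spans with tagged spans; leave other text as is."""

    def replace(match: "re.Match[str]") -> str:
        if match.group("lemma") is not None:
            return "<lemma>" + match.group("lemma") + "</lemma>"
        # Drop the two-character "&3" code; keep keywords, text, and period.
        body = match.group("ancEtymSub")[2:]
        return "<ancEtymSub>" + body + "</ancEtymSub>"

    return PATTERN.sub(replace, encoded)


# Made-up encoded input, not taken from the TLL.
print(tag("&1abdo &3de origine abscondo."))
```

The keywords stay inside the ancEtymSub tag, matching the description above that both the keywords and the text that follows belong to the subsection.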

It is likely that the automatic tagging process will require a series of passes. Thus, the main grammar for the TLL will need to be decomposed into several specialized grammars, one for each pass (cf. Kazman 1986).
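A multi-pass arrangement could be organized along the lines of the following Python sketch, in which each pass is a function that consumes the output of the previous one. This is an assumption about one possible design, not the method of Kazman 1986 or of the TLL project. The function names, the two-pass split, and the sample input are invented for illustration. Only the two typeface codes named in this document, "&1" and "&3", are handled.

```python
import re
from typing import Any, Callable, List, Optional, Tuple

KEYWORDS = ("de origine", "derivatur ab", "derivatur a")


def split_runs(encoded: str) -> List[Tuple[str, str]]:
    """Pass 1: split encoded text into (typeface code, text) runs."""
    return [
        (code, text.strip())
        for code, text in re.findall(r"&(\d)([^&]*)", encoded)
    ]


def tag_runs(runs: List[Tuple[str, str]]) -> List[Tuple[Optional[str], str]]:
    """Pass 2: assign a tag name to each run, or None if no rule applies."""
    tagged: List[Tuple[Optional[str], str]] = []
    for code, text in runs:
        if code == "1":  # bold roman
            tagged.append(("lemma", text))
        elif code == "3" and text.startswith(KEYWORDS):  # italic keyword
            tagged.append(("ancEtymSub", text))
        else:
            tagged.append((None, text))
    return tagged


def run_passes(data: Any, passes: List[Callable[[Any], Any]]) -> Any:
    """Apply each pass in order, feeding every result to the next pass."""
    for one_pass in passes:
        data = one_pass(data)
    return data


# Made-up encoded input, not taken from the TLL.
print(run_passes("&1abdo &3de origine abscondo.", [split_runs, tag_runs]))
```

Keeping each pass small means that a specialized grammar only has to describe the structure that its own pass is responsible for.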

6.6. Anticipated difficulties

Although the automatic tagging program should notify the developers whenever it is unable to parse and tag an article successfully, the developers will also have to look out for passages that the program tags incorrectly. Even with a comprehensive grammar, there may be ambiguities in the text that cannot be resolved automatically. Moreover, although there is a basic structure to each TLL entry, the database is riddled with inconsistencies of form because the TLL has been compiled by a number of scholars over a span of more than 100 years. Given these potential problems, careful proofreading of the tagged text will be necessary.

Despite these anticipated difficulties, it is hoped that the TLL grammar will succeed in automatically preparing most of the text of the TLL for efficient electronic search. We also expect the construction of a grammar for the TLL to be a worthwhile, albeit challenging, experience that will teach us a great deal about the structure of this historic lexicon.

7. The present state of development of the electronic TLL

As of July 1995, CLL is just beginning the development process. Presently, we are determining the requirements of the electronic version of the Thesaurus Linguae Latinae, i.e., the features that the completed electronic TLL software should possess and the constraints that it should satisfy. A systematic formulation of the requirements is a very helpful first step in software development, since it ensures that all parties involved agree on the desired characteristics of the product they are developing. These requirements will also help us determine how the elements in the TLL should be tagged.

Once a set of requirements has been formulated and agreed upon, and once the design of the project has been reviewed, work can begin on designing and testing the TLL grammar. We hope to build a prototype of the electronic TLL that will serve as a test-bed for the grammar and the automatic tagging process, as well as a preview of the user interface for the search and display software.

At present, the electronic TLL project is funded by the Council for Collaborative Research in the Humanities at the University of California at Irvine.

8. Contact addresses

For more information on the electronic TLL project and the Consortium for Latin Lexicography, please contact:

Patrick Sinclair
Director
Consortium for Latin Lexicography
Department of Classics, 156 HH
University of California
Irvine, CA 92717-2000
USA
Phone: (714) 856-5931
Fax: (714) 824-2464
CLL@uci.edu

Ann DeVito
Technical Advisor
Consortium for Latin Lexicography
Department of Computer Science
University of Saskatchewan
Saskatoon, SK S7N 0W0
Canada
devito@cs.usask.ca

9. References

Aho, A.V., and J.D. Ullman. 1977. Principles of Compiler Design. Reading, MA: Addison-Wesley.

Berg, D.L., G.H. Gonnet, and F.W. Tompa. 1990. "The New Oxford English Dictionary Project at the University of Waterloo." Computational Lexicology and Lexicography: Special Issue Dedicated to Bernard Quemada (= Linguistica Computazionale 6), edited by A. Zampolli. Pisa: Giardini.

Kazman, R. 1986. Structuring the Text of the Oxford English Dictionary through Finite State Transduction. University of Waterloo Technical Report CS-86-20.



Written by Ann DeVito, July 1995.
Copyright 1995. This document may not be reproduced without the author's permission.
Most recent update 12 December 1995.
This page maintained by Ann DeVito; please send queries and comments to devito@cs.usask.ca.