ALLC/ACH 2000 Session 2 Abstracts

(2.1) Text Encoding

(2.1.1) Solutions for the Delivery of Thematically-Tagged Text

Terry Butler
Greg Coulombe
Sue Fisher
University of Alberta, Canada

Introduction

The Orlando Project has developed prototype delivery software which gives end users access to our literary history textbase. Richly-tagged SGML data is automatically converted to XML, and presented to users through a custom application (which runs locally on their machine, and communicates with a back-end XML server). The design of the user interface has been developed through a formal user needs analysis, conducted with a local Pilot Users Group. In the process, we have learned a great deal about how to exploit the richness of a heavily-tagged textbase, and how to present this information selectively to end users (meeting their information requirements without overburdening them with complexity).

The Goals of our Project

The Orlando Project is applying state-of-the-art software technology to traditional fields of study in the humanities. We are writing a literary history of women's writing in Britain, as both a conventional published text and as an SGML tagged textbase. At present (November 1999) we have documents on 850 British women writers and documents on 590 other writers. For each author we have a pair of interdependent documents - a biography and a writing life history. This material is supplemented by 13,600 events, which are discrete dated items providing the further essential and enriching political, social and cultural background to the work. Events vary in their depth of coverage, but are in every case in one way or another related to the literary history which we are writing. Here are three examples of events:

1863: Selective chronology: British Women writers: Florence Nightingale privately printed an anonymous pamphlet, Note on the supposed protection afforded against venereal disease by recognizing and putting it under police regulation. [keyword: law and legislation] [keyword: body/health - venereal disease]

August 1863: Selective chronology: British Women writers: Florence Nightingale corresponded with Harriet Martineau, outlining the case against the Contagious Diseases Acts. (Vicinus 441)

by 1871: Comprehensive chronology: Social climate: The Royal Commission on the Contagious Diseases Acts rejected a suggestion that soldiers and sailors be required to submit to the same regular examinations required of the prostitutes they frequented. The commission believed "there is no comparison to be made between prostitutes and the men who consort with them. With the one sex the offence is committed as a matter of gain; with the other it is an irregular indulgence of a natural impulse." This illustrates the double standard that held women to be sexually unresponsive and men to be prey to strong desire; paradoxically, this belief coexisted with the notion that women were emotional and irrational, while men were more enlightened and controlled.

Delivery Plans

The Orlando Project received SSHRC funding in 1994. Our grant proposal at that time argued that SGML was the only feasible means to capture and encode the complex thematic approach to literary history which the project required. As to the ultimate means of delivery for this information, we anticipated that the technology landscape would be utterly changed 5 years on. We believed that there would be ways to deliver SGML to end users at the very end of the 20th century (we were also aware, in 1994, that there were perfectly acceptable ways of converting and delivering SGML information). We have found that XML is the means to the end which we hoped would appear. XML is a rapidly developing W3 Consortium standard, which will permit the direct delivery of tagged information to end users. An XML audit of our textbase, carried out in 1998, showed us that (for delivery purposes) our textbase could be transformed from SGML to XML without any loss of its intellectual value. We are able today to deliver our richly-tagged information to a client program (running in an XML browser, such as Internet Explorer 5, or in a custom application which supports XML through third-party software such as IBM's XML toolkit).

User Needs Assessment

Having received a great deal of positive encouragement from the scholarly community that the information we are developing is of considerable interest to them, we began a formal process of user needs assessment. A richly-tagged textbase such as ours can be exploited by end users in a wide variety of ways:

subject-specific searches can create customized chronologies and research texts for reading
imposing chronological limits can highlight issues and create connections which standard "period" labels obscure
consistent tagging allows two or more documents to be compared "side-by-side", to reveal new insights about authors and their context

The most important issue for us was to "bridge" between the complex tag set which we have created and the terminology and information expectations which will characterise our end users. The strengths of our tags are their rigour and the highly detailed descriptions of their meaning. Their deficiency (from the point of view of the end user) is that this knowledge is locked up in a single tag name which may be opaque (such as our Cultural Formation tag) or dangerously obvious (such as our Name tag, which has a precise meaning and occupies a specific niche in a constellation of about a dozen "personal name" tags). In order to drive the development of the software from the users' point of view (rather than our own), we struck a Pilot Users Group. This group (about a dozen people) were drawn from representative communities who we expect will be interested in accessing our information, including:

professors, graduate students, and undergraduate students
scholars in fields such as English literature and History
librarians and information scientists

The program for this group was devised in order to elicit their expectations and desires for our software, without raising the question of what the software would look like or how it would work. We began with meetings where the group were given only written and oral accounts of our Project's goals and content; we elicited the group's own descriptions and terminology for our areas of interest. In the fall of 1999, building upon our team's sense of what kinds of access we could provide to end users, the Pilot Users Group was asked to comment on an on-screen mock-up of our delivery software. These sessions were conducted as formal focus groups [Greenbaum; Jordan]; the sessions were recorded and team notetakers wrote down the comments and suggestions from the users group. Because the software on-screen was truly "throw away", we were able to genuinely encourage the users to critique it and explore their preferences and expectations. We also surveyed the computer equipment and level of experience of the user group; we will expand this survey, to make sure we create delivery software which our target users can run, and which they will be able to learn to use effectively.

Software Architecture

Our prototype delivery software is being written in a client/server fashion. The client end is a Java program which uses XML-aware code to request XML documents from the server, to process them (by sorting, selecting, and sub-setting), and then to display them using XSL (the XML stylesheet language). Although it is technically possible to execute this part of the process inside an XML-capable browser, the nature of our textbase and the kinds of interaction which we wish to provide are rather unlike the Web-page metaphor. Our textbase can be queried to draw together coherent document sub-sections from many documents at once, which can be presented to the user in various forms, such as a customised chronology or a synoptic view of relevant sections from the lives or works of many authors at once. For this reason we feel the creation of an independent delivery program is desirable. A similar consideration operates with respect to linking within our textbase. We are implementing a much richer form of linking than the web at present provides; a great deal of the linking which end users will be able to explore will be generated automatically through the carefully and consistently tagged text. Users who are viewing text of interest will be able to pursue that interest by traversing automatic links which will open up from our elaborately tagged text. The server side of this architecture will make available our tagged textbase (as an XML document collection) which will respond to user queries by selecting and sending XML documents to the client program. We have explored various technologies to provide this searching and delivery on the back-end, including Java and CGI formats (using both Perl and SGREP to handle the searching).

The obvious advantage of this approach is that the server can be implemented in more than one way (and be revised and extended as new technologies appear), while the front end client program remains the same (or is extended and improved on an independent trajectory). We are making extensive use of standard technologies, such as XML, XSL, and HTTP (for the communication between client and server). This will aid the process of generalising this software to meet the needs of other users who wish to present SGML or XML text to users without "rendering it down" to display-only formats like HTML.
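The client-side step of selecting, sub-setting, and sorting tagged events into a customised chronology can be sketched as follows. This is a minimal illustration in Python rather than the project's Java client, and it makes several assumptions: the element and attribute names (`events`, `event`, `date`, `text`, `keyword`) are hypothetical and are not the Orlando tag set, the sample data loosely paraphrases the three events quoted earlier, and the keyword attached to the 1871 event is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical server response. The element and attribute names are
# illustrative only and are not the Orlando Project's tag set; the keyword
# on the 1871 event is assumed for the sake of the example.
SAMPLE = """<events>
  <event date="1871">
    <text>The Royal Commission on the Contagious Diseases Acts rejected a suggestion.</text>
    <keyword>law and legislation</keyword>
  </event>
  <event date="1863-08">
    <text>Nightingale corresponded with Martineau.</text>
  </event>
  <event date="1863">
    <text>Nightingale privately printed an anonymous pamphlet.</text>
    <keyword>law and legislation</keyword>
    <keyword>body/health - venereal disease</keyword>
  </event>
</events>"""


def chronology(xml_text, keyword, first_year, last_year):
    """Select events carrying `keyword` within the year range, sorted by date."""
    root = ET.fromstring(xml_text)
    selected = []
    for event in root.iter("event"):
        date = event.get("date")
        year = int(date[:4])  # the first four characters are taken as the year
        keywords = [k.text for k in event.findall("keyword")]
        if keyword in keywords and first_year <= year <= last_year:
            selected.append((date, event.findtext("text")))
    return sorted(selected)


for date, text in chronology(SAMPLE, "law and legislation", 1860, 1875):
    print(date, text)
```

Because the selection depends only on the tagging, the same function could serve any keyword or date range a user chooses; a real client would also need to handle imprecise dates such as "by 1871", which this sketch ignores.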

Issues

XML is an emerging standard. The software support for XML is beginning to appear; our strategy will be more effective as XML becomes ubiquitous and a variety of robust XML-capable tools emerge.

The current effort is a "prototype"; the exercise of deploying it will have both successes and failures, from which we will learn.

We have been very careful to avoid using the "Web" metaphor - our textbase can be delivered in ways which are much more dynamic and more informative than a Web delivery metaphor would imply. This ambition is to some extent undercut by the expectations of our Pilot Users Group, who came to the material with "Web on the brain". A classic case of this was the specific comment that we ought not to use a certain shade of blue for text if it was not a link, because "blue means link".

References

Greenbaum, Thomas L (1997). The Handbook for Focus Group Research. Second edition; Sage Publications.
Jordan, Patrick W et al. (1996). Usability Evaluation in Industry. Taylor & Francis.
McConnell, Steve (1996). Rapid Development. Microsoft Press.
Butler, Terry and Fisher, Sue (1998). "Orlando Project: Issues when Moving from SGML to XML for Delivery of Content-Rich Encoded Text". Presentation at Markup Technologies '98, Chicago, Nov. 12-13, 1998.
Butler, Terry (1999). "Can a Team Tag Consistently? Experiences on the Orlando Project". Presentation given at ACH-ALLC 1999, Charlottesville VA, June 1999.


(2.1.2) Meaning and Interpretation of Markup

C. M. Sperberg-McQueen
World Wide Web Consortium, USA

Claus Huitfeldt
University of Bergen, Norway

Allen Renear
Brown University, USA

Markup is inserted into textual material not at random, but to convey some meaning. An author may supply markup as part of the act of composing a text; in this case the markup expresses the author's intentions. The author creates certain textual structures simply by tagging them; the markup has performative significance. In other cases, markup is supplied as part of the transcription in electronic form of pre-existing material. In such cases, markup reflects the understanding of the text held by the transcriber; we say that the markup expresses a claim about the text.

In the one case, markup is constitutive of the meaning; in the other, it is interpretive. In each case, the reader (for all practical purposes, readers include software which processes marked up documents) may legitimately use the markup to make inferences about the structure and properties of the text. For this reason, we say that markup licenses certain inferences about the text.

If markup has meaning, it seems fair to ask how to identify the meaning of the markup used in a document, and how to document the meaning assigned to particular markup constructs by specifications of markup languages (e.g. by DTDs and their documentation).

In this paper, we propose an account of how markup licenses inferences, and how to tell, for a given marked up text, what inferences are actually licensed by its markup. As a side effect, we will also provide an account of what is needed in a specification of the meaning of a markup language. We begin by proposing a simple method of expressing the meaning of SGML or XML element types and attributes; we then identify a fundamental distinction between distributive and sortal features of texts, which affects the interpretation of markup. We describe a simple model of interpretation for markup, and note various ways in which it must be refined in order to handle standard patterns of usage in existing markup schemes; this allows us to define a simple measure of complexity, which allows direct comparison of the complexity of different ways of expressing the same information (i.e. licensing the same inferences) about a given text, using markup.

For simplicity, we formulate our discussion in terms of SGML or XML markup, applied to documents or texts. Similar arguments can be made for other uses of SGML and XML, and may be possible for some other families of markup language.

Related work has been done by Simons (in the context of translating between marked up texts and database systems), Sperberg-McQueen and Burnard (in an informal introduction to the TEI), Langendoen and Simons (also with respect to the TEI), Huitfeldt and others in Bergen (in discussions of the Wittgenstein Archive at the University of Bergen, and in critiques of SGML), Renear and others at Brown University, and Welty and Ide (in a description of systems which draw inferences from markup). Much of this earlier work, however, has focused on questions of subjectivity and objectivity in text markup, or on the nature of text, and the like. The approach taken in this paper is somewhat more formal, while still much less formal and rigorous than that taken by Wadler in his recent work on XSLT.

Let us begin with a concrete example. Among the papers of the American historical figure Henry Laurens is a draft Laurens prepared of a letter to be sent from the Commons House of Assembly of South Carolina to the royal governor, Lord William Campbell, in 1775. Some words have lines through them, and others are written above the line. The editors of Laurens's papers interpret the lines through words as cancellations, and the words above the lines as insertions; an electronic version of the document, using TEI markup and reflecting these interpretations, might read thus:
<DEL>It was be</DEL> <DEL>For</DEL> When we applied to Your Excellency for leave to adjourn it was because we foresaw that we <DEL>were</DEL> <ADD>should continue</ADD> wasting our own time ... 
From the DEL elements, the reader of the document is licensed to infer that the letters "It was be", "For", and "were" are marked as deleted; from the ADD element, the reader may infer that the words "should continue" have been added. Software might rely on these inferences in the course of making a concordance or displaying a clear text; human readers will rely on them in interpreting the historical document. Note that the markup here stops short of licensing the inference that "should continue" was substituted for "were". The editors could license that inference as well by appropriate markup, if they wished. Human readers may make the inference on their own, given the linguistic context; software cannot safely infer a substitution every time an addition is adjacent to a deletion.
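How software might rely on these inferences can be sketched in a few lines. The following is an illustration only, assuming Python's standard ElementTree parser and a hypothetical enclosing P element (added so the sample parses as one well-formed fragment); the function names are invented for the example. It lists the inferences licensed by DEL and ADD and derives a clear text from them. Note that nothing in it infers a substitution.

```python
import xml.etree.ElementTree as ET

# The Laurens sample from above, wrapped in a hypothetical <P> element so
# that it parses as a single well-formed XML fragment.
SAMPLE = (
    "<P><DEL>It was be</DEL> <DEL>For</DEL> When we applied to Your Excellency "
    "for leave to adjourn it was because we foresaw that we <DEL>were</DEL> "
    "<ADD>should continue</ADD> wasting our own time ...</P>"
)


def inferences(root):
    """List (predicate, text) pairs licensed by the DEL and ADD elements."""
    licensed = []
    for elem in root.iter():  # document order
        if elem.tag == "DEL":
            licensed.append(("deleted", "".join(elem.itertext())))
        elif elem.tag == "ADD":
            licensed.append(("added", "".join(elem.itertext())))
    return licensed


def clear_text(elem):
    """Return the text with deleted material dropped and added material kept."""
    if elem.tag == "DEL":
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(clear_text(child))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
print(inferences(root))
print(" ".join(clear_text(root).split()))  # collapse leftover whitespace
```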

A simple way to capture the meaning of markup is to define, for each markup construct, a set of open sentences - sentences with unbound variables - which express the inferences licensed by the use of that construct. In formal reasoning, such open sentences may be transformed into logical predicates in the usual way.

For example, the TEI element type DEL is said by the documentation to mark "a letter, word or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator or corrector" (TEI P3, p. 922). We take this to mean that when a DEL element is encountered in a document, the reader is licensed to infer that the material so marked has been deleted. In formal contexts, we may write "deleted(X)"; we can specify the meaning of the DEL element and of the logical predicate "deleted(X)" by means of an open sentence: "X has been deleted, or marked as deleted, or ..." etc. The variable X is to be bound, in practice, to the contents of the DEL element. If we imagine a variable named 'this', instantiated to each element of a document in turn, and a function 'contents' which returns the contents of its argument, then the meaning of the DEL element becomes "deleted(contents(this))", or equivalently "contents(this) has been deleted ..." etc.

The TEI element type HI, similarly, "marks [its contents] as graphically distinct from the surrounding text" (TEI P3, p. 1013). We can capture the meaning of HI by the open sentence "X is graphically distinct from the surrounding text", or "highlighted(X)", where X is, as before, to be replaced by "contents(this)".

Attributes may be treated similarly. The 'rend' attribute on the HI element "describes the rendition or presentation of the word or phrase highlighted". In the example
<HI REND="gothic">And this Indenture further witnesseth</HI> that the said <HI REND="italic">Walter Shandy</HI>, merchant, in consideration of the said intended marriage ... 
the HI elements convey the information that the contents of those elements are distinct from their surroundings, while the 'rend' attributes on the HI elements specify how. The meaning of the 'rend' attribute is expressed by the open sentence "X is rendered in style Y." An HI element with a 'rend' attribute thus means "X is graphically distinct from its surroundings, and X is rendered in style Y".

Perhaps the simplest method of interpreting markup is to assume that

The meaning of every element type is expressed by an open sentence whose single unbound variable is to be bound to 'contents(this)'.
The meaning of every attribute is expressed by an open sentence with two unbound variables, one of which is to be bound to 'contents(this)' and the other to 'value(this,attribute-name)' (i.e. to the value of the attribute in question). In other words, each attribute defines some relation R which holds between the contents of the element and the value of the attribute.
All inferences licensed by any two elements are compatible.

The set of inferences applicable to any given location L is then the union of the inferences licensed by all the elements within which L is contained. Let us call this the 'union model' of interpretation.
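The union model can be made concrete with a short sketch. The following is an illustration under stated assumptions, not a definitive implementation: the dictionaries of open sentences, the function names, and the way a location is identified (by a piece of text it contains) are all choices made for this example, and a hypothetical DEL element has been wrapped around part of the Shandy sample so that more than one open element contributes inferences.

```python
import xml.etree.ElementTree as ET

# Open sentences keyed by element type and by attribute name. The wording
# follows the text above; the dictionary layout itself is an assumption.
ELEMENT_MEANING = {
    "HI": "{X} is graphically distinct from the surrounding text",
    "DEL": "{X} has been deleted",
}
ATTRIBUTE_MEANING = {
    "REND": "{X} is rendered in style {Y}",
}


def contents(elem):
    """The 'contents(this)' function: all text inside the element."""
    return "".join(elem.itertext())


def licensed_by(elem):
    """Inferences licensed by one element and by its attributes."""
    out = set()
    x = repr(contents(elem))
    if elem.tag in ELEMENT_MEANING:
        out.add(ELEMENT_MEANING[elem.tag].format(X=x))
    for name, value in elem.attrib.items():
        if name in ATTRIBUTE_MEANING:
            out.add(ATTRIBUTE_MEANING[name].format(X=x, Y=value))
    return out


def union_model(root, target):
    """Union of the inferences of every element containing the text `target`."""
    result = set()

    def visit(elem, open_elements):
        open_elements = open_elements + [elem]
        # Text directly inside elem: its leading text and its children's tails.
        direct = [elem.text or ""] + [child.tail or "" for child in elem]
        if any(target in piece for piece in direct):
            for ancestor in open_elements:
                result.update(licensed_by(ancestor))
        for child in elem:
            visit(child, open_elements)

    visit(root, [])
    return result


# The DEL wrapper is hypothetical; it is not part of the original example.
SAMPLE = (
    '<P><HI REND="gothic">And this Indenture further witnesseth</HI> that the '
    'said <DEL><HI REND="italic">Walter Shandy</HI>, merchant</DEL></P>'
)

for sentence in sorted(union_model(ET.fromstring(SAMPLE), "Walter")):
    print(sentence)
```

For the location of "Walter" the sketch collects three sentences: one from DEL, one from HI, and one from the 'rend' attribute. For "merchant" it collects only the DEL sentence, since no HI element is open at that point.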

The union model is simple, and provides a good first approximation of the rules of inference for marked up text. But it is not wholly adequate.

First, it fails to distinguish distributed properties (such as 'italic' or 'highlighted') from sortal properties (such as paragraphs, sections, or - as illustrated above - deletion). It is as true to say "The word 'And' is in black-letter" as to say it of the entire phrase, and the meaning of the example given above would not change if the HI elements were split into two or more adjacent pieces each with the same 'rend' value. Conversely, two HI elements with the same attribute values can be merged without changing the meaning of the markup. Other elements mark properties which are NOT distributed equally among the contents, and cannot be split or joined without changing the meaning of the markup. From the markup

<P>Reader, I married him.</P>

we can infer the existence of one paragraph, but we cannot infer that "Reader" is itself a paragraph. Such properties we call 'sortal' properties, borrowing a term of art from linguistics. Elements marking sortals are usefully countable; those marking distributed properties are not.
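The practical difference between the two kinds of property can be sketched as a normalisation step: adjacent elements marking the same distributed property may be merged without loss, while elements marking sortals must be left alone. This is a minimal illustration in Python; the set `DISTRIBUTIVE` and the function name are assumptions made for the example, and which element types count as distributive would have to come from the specification of the markup language.

```python
import xml.etree.ElementTree as ET

# Assumption for illustration: HI marks a distributed property, P does not.
DISTRIBUTIVE = {"HI"}


def merge_adjacent(parent):
    """Merge adjacent child elements that mark the same distributed property."""
    kept = []
    for child in list(parent):
        prev = kept[-1] if kept else None
        mergeable = (
            prev is not None
            and child.tag in DISTRIBUTIVE
            and prev.tag == child.tag
            and prev.attrib == child.attrib
            and not prev.tail                  # nothing between the two elements
            and len(prev) == 0 and len(child) == 0  # text-only elements
        )
        if mergeable:
            prev.text = (prev.text or "") + (child.text or "")
            prev.tail = child.tail
            parent.remove(child)
        else:
            kept.append(child)
    return parent


split_hi = ET.fromstring(
    '<P><HI REND="gothic">And </HI><HI REND="gothic">this Indenture</HI> further</P>'
)
print(ET.tostring(merge_adjacent(split_hi), encoding="unicode"))

two_paragraphs = ET.fromstring("<DIV><P>Reader, </P><P>I married him.</P></DIV>")
print(len(merge_adjacent(two_paragraphs).findall("P")))  # still two paragraphs
```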

Second, the union model fails to allow a correct interpretation of inherited values and overrides, as illustrated by the TEI 'lang' attribute or the xml:lang attribute of XML. In fact, some inferences do contradict each other, and specifications of the meaning of markup need to say which inferences are compatible, and which are in conflict, and how to adjudicate conflicts.
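Inheritance with override can be sketched for xml:lang as follows. The sample sentence and the function name are invented for the illustration; the only library behaviour relied on is that Python's ElementTree reports the xml:lang attribute under the XML namespace URI. Under the plain union model the word "Zeitgeist" would be asserted to be both English and German; here the nearest enclosing value wins.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root):
    """Pair each stretch of text with the language in force at that point."""
    result = []

    def visit(elem, inherited):
        lang = elem.get(XML_LANG, inherited)  # a local value overrides the inherited one
        if elem.text and elem.text.strip():
            result.append((elem.text.strip(), lang))
        for child in elem:
            visit(child, lang)
            # Text after the child belongs to elem again, so elem's value applies.
            if child.tail and child.tail.strip():
                result.append((child.tail.strip(), lang))

    visit(root, None)
    return result


# Hypothetical sample, not taken from any of the texts discussed above.
SAMPLE = '<P xml:lang="en">He murmured <FOREIGN xml:lang="de">Zeitgeist</FOREIGN> and left.</P>'

for text, lang in effective_lang(ET.fromstring(SAMPLE)):
    print(lang, text)
```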

Third, the union model allows inferences about a location L only on the basis of markup on open elements (those which contain L); in order to handle common idioms of SGML and XML, a model of interpretation must handle

upward propagation: the meaning of an element may depend in part on its contents; this is unusual in colloquial SGML/XML systems, but is a regular feature of proposals to eliminate attributes from markup languages.
context dependency: the meaning of an element may depend on its context; trivial examples include TEI's HI and FOREIGN, which can mean 'not-Roman' and 'not-English' in one context, and 'not-italic' and 'not-German' in others.
ordinal position, relative or absolute; dependence of meaning upon ordinal position is seldom an explicit feature of markup languages, but dependence of processing based on position is a standard feature of style-sheet languages.
milestone elements; these convey information by position in the beginning-to-end scan of the linear form of the document, rather than by position in the tree.
linking: out-of-line or 'standoff' markup conveys information about location L based not only on open elements, but on elements which point at L or some ancestor of L.
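Of the cases listed, milestone elements show most plainly why open elements alone are not enough. In the sketch below, which assumes a page-break milestone PB with an N attribute and an invented sample, the page on which a stretch of text falls is determined by scanning the document from beginning to end and remembering the last milestone seen; the PB element is never an ancestor of the text it governs.

```python
import xml.etree.ElementTree as ET


def text_with_pages(elem, state=None):
    """Yield (text, page) pairs in linear document order.

    `state` carries the most recently seen PB value across the whole scan,
    independently of the element tree's nesting.
    """
    if state is None:
        state = {"page": None}
    if elem.tag == "PB":
        state["page"] = elem.get("N")
    if elem.text:
        yield elem.text, state["page"]
    for child in elem:
        yield from text_with_pages(child, state)
        if child.tail:
            yield child.tail, state["page"]


# Hypothetical sample: the second page begins in the middle of a paragraph.
SAMPLE = (
    '<DIV><PB N="1"/><P>Reader, I <PB N="2"/>married him.</P>'
    "<P>A quiet wedding we had.</P></DIV>"
)

for text, page in text_with_pages(ET.fromstring(SAMPLE)):
    print(page, text)
```

The second paragraph is reported as lying on page 2 although no element containing it says so, which is exactly the kind of inference the simple union model cannot produce.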

Other methods of associating markup with meaning are imaginable, but we believe a survey of existing DTDs will show that all or virtually all current practice is covered by any model of interpretation which encompasses the complications just outlined.

Essentially, these can be handled by extending the rules for binding variables in the open sentences which specify the meaning of a given markup construct. The simple union model allows only 'contents(this)' and 'value(this,attribute-name)'; the constructs listed above require more complex expressions, roughly equivalent in expressiveness to the TEI extended-pointer notation or to the patterns of the XPath language defined by W3C.

Complexity of the semantics associated with an element type or attribute may be measured by the number of unbound variables in the open slots, by the complexity of the expressions which are to fill them, and by the amount or kind of memory required to allow full generation of the inferences licensed by markup in a particular text.

References

DeRose, Steve et al. (1990). "What is Text, Really?" Journal of Computing in Higher Education 1: 3-26.
Huitfeldt, Claus (1995). "Multi-Dimensional Texts in a One-Dimensional Medium." CHum 28: 235-241.
Langendoen, D. Terence, and Simons, Gary F. (1995). "Rationale for the TEI Recommendations for Feature-Structure Markup." CHum 29.3: 191-209.
[Laurens, Henry.] (1985). "Commons House of Assembly to Lord William Campbell." The Papers of Henry Laurens, ed. David R. Chesnutt et al. University of South Carolina Press, Columbia, S.C. Vol. 10, pp. 305-308.
Pichler, Alois (1993). "What is Transcription, Really?" ACH/ALLC '93, Georgetown.
Renear, Allen, Durand, David G., and Mylonas, Elli (1995). "Refining our notion of what text really is: the problem of overlapping hierarchies." Research in Humanities Computing. Oxford University Press, Oxford. Originally delivered at ALLC/ACH '92.
Simons, Gary F. (1997). "Conceptual Modeling versus Visual Modeling: A Technological Key to Building Consensus." CHum 30.4: 303-319.
Sperberg-McQueen, C. M., and Burnard, Lou (eds) (1994). Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago, Oxford: ACH, ALLC, and ACL.
Sperberg-McQueen, C. M., and Burnard, Lou (1995). "The Design of the TEI Encoding Scheme." CHum 29: 17-39.
Wadler, Philip (1999). "A formal semantics of patterns in XSLT." Paper presented at Markup Technologies '99.
Welty, Christopher, and Ide, Nancy (1999). "Using the Right Tools: Enhancing Retrieval from Marked-up Documents." CHum 33: 59-84. Originally delivered at TEI 10, Providence (1997).


(2.1.3) A Formal Model for Lexical Information

Nancy Ide
Vassar College, USA

Adam Kilgarriff
ITRI Brighton, UK

Laurent Romary
LORIA/CNRS, France

1. Introduction

The structure and content of lexical information has been explored in considerable depth in the past, primarily in order to determine a common model that can serve as a basis for encoding schemas and/or database models. For the most part, descriptions of lexical structure have been informed by the format of printed documents (e.g., print dictionaries), which varies considerably over documents produced by different publishers and for different purposes, together with the requirements for instantiation in some encoding format (principally, SGML). However, the constraints imposed by these formats interfere with the development of a model that fully captures the underlying structure of lexical information. As a result, although schemas such as those provided in the TEI Guidelines exist, they do not provide a satisfactorily comprehensive and unique description of lexical structure and content.

We believe that in order to develop a concrete and general model of lexical information, it is essential to distinguish between the formal model itself and the encoding or database schema that may ultimately instantiate it. That is, it is necessary to consider, in the abstract, the form and content of lexical information independent of requirements and/or limitations imposed by its ultimate representation as an encoded or printed object. This is especially important since these eventual representations will vary from one application to another; in particular, lexical information may be encoded not only for the purposes of publishing in print or electronic form, but also for creating computational lexicons, terminology banks, etc. for use in natural language processing applications. It is therefore essential to develop a model that may be subsequently transformed into a variety of alternative formats.

In this paper, we outline a formal model for lexical information that describes (a) the structure of this information, (b) the information associated with this structure at various levels, and (c) a system of inheritance of information over this structure. We then show how the structure may be instantiated as a document encoded using the Extensible Markup Language (XML). Using the transformation language provided by the Extensible Style Language (XSL), we then demonstrate how the original XML instantiation may be transformed into other XML documents according to any desired configuration (including omission) of the elements in the original. Because of its generality, we believe our model may serve as a basis for representing, combining, and extracting information from dictionaries, terminology banks, computational lexicons, and, more generally, a wide variety of structured and semi-structured document types.

2. Overview of the theoretical model

The underlying structure of lexical information can be viewed as embedded partitions of a lexicon, in which no distinction is made among embedded levels. A model of lexical information can be thus described as a recursive structure comprised, at each level, of one or more nodes. This structure is most easily visualized as a tree, where each node may have zero or more children. That is, at any level n, a node is either a leaf (i.e., with no children) or can be decomposed as:

T = [T1, T2, ..., Tk]

where each Ti (for i from 1 to k) is a node at level n+1.

Properties may be attached to any node in the structure with the PROP predicate. The expression

PROP(T,P)

indicates that the property P is attached to node T.

Properties are associated with nodes either by explicit assignment, or they may be inherited from the parent node. The object of our model is to identify the ways in which properties are propagated through levels of structure. For this purpose, we consider properties to be Feature-Value pairs expressed as terms of the form FEAT(F,V), where F and V are tokens designating a feature (e.g., POS) and a value. In the simplest case, values are atomic (e.g., NOUN) but may also consist of sets of feature-value pairs. This representation is consistent with the base notation associated with feature structures, a common framework for representing linguistic information.

3. Propagating information across levels

We define three types of features:

Cumulative features, which may take more than one value and may thus be inherited and combined along the structure. For example, for a cumulative feature DOMAIN, if the property FEAT(DOMAIN,NAVIGATION) is associated with a node at level n and FEAT(DOMAIN,LAW) is associated with its child at level n+1, by inheritance the node at level n+1 will be assigned the property FEAT(DOMAIN,NAVIGATION + LAW).
Overwriting features, which take only one value at a time. This implies that only one instance of an overwriting feature may appear at a given node and that the corresponding properties are propagated along the structure unless and until a new value is specified for that feature. In such a case, the new value "overwrites" the earlier one and is subsequently propagated to nodes in its subtrees.
Local features, which apply only at the node with which they are associated; i.e., they are not propagated through the structure. Cross-references are an example of a local feature, since they apply only to the level of description with which they are directly associated.

The full paper will provide details of this formalism.
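Pending those details, the three propagation rules can be sketched as follows. This is an illustration only and not the authors' formalism: the `Node` class, the assignment of particular features to the three types, the use of tuples for accumulated values, and the decision to treat unclassified features as overwriting are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed classification of features, for illustration only.
CUMULATIVE = {"domain"}
LOCAL = {"xr"}
# Any other feature (e.g. "pos", "orth") is treated as overwriting.


@dataclass
class Node:
    features: dict = field(default_factory=dict)  # feature -> value, as in FEAT(F,V)
    children: list = field(default_factory=list)


def effective_features(node, inherited):
    """Combine a node's own features with those inherited from its parent."""
    # Local features of the parent are not passed down.
    result = {f: v for f, v in inherited.items() if f not in LOCAL}
    for feat, value in node.features.items():
        if feat in CUMULATIVE:
            result[feat] = result.get(feat, ()) + (value,)  # accumulate values
        else:
            result[feat] = value  # overwriting, or local to this node
    return result


def walk(node, inherited=None, depth=0):
    """Yield (depth, effective features) for every node, parents first."""
    feats = effective_features(node, inherited or {})
    yield depth, feats
    for child in node.children:
        yield from walk(child, feats, depth + 1)


tree = Node(
    {"domain": "NAVIGATION", "pos": "noun", "xr": "see also X"},
    [Node({"domain": "LAW"}, [Node({"pos": "verb"})])],
)

for depth, feats in walk(tree):
    print("  " * depth, feats)
```

In this example the child node ends up with both DOMAIN values, as in the NAVIGATION + LAW case described above, keeps the inherited part of speech until the grandchild overwrites it, and does not receive the cross-reference.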

4. Creating representations

Lexical information can be represented as a tree structure reflecting, in large part, the natural hierarchical organization of entries found in printed dictionaries. This hierarchical organization (e.g., division into homographs, senses, sub-senses, etc.) enables information to be applied over all sub-levels in the hierarchy, thus eliminating the need to re-specify common information.

For example, consider the following definition from the Collins English Dictionary (CED):

EX.1: overdress

overdress vb. (zzzz) 1. To dress (oneself or another) too elaborately or finely. ~n. (yyyy) 2. A dress that may be worn over a jumper, blouse, etc.

This information can be represented in tree form as follows:

[orth: overdress]
    [pos: verb
     pron: zzzz
     def: To dress (oneself or another) too elaborately or finely]
    [pos: noun
     pron: yyyy
     def: A dress that may be worn over a jumper, blouse, etc.]

Each node in the tree represents a partition of the information in the entry, and information is inherited over sub-trees. Thus in this example, the orthographic form "overdress" appears at the top node and applies to the entire entry; the entry is then partitioned into two sub-trees, for verb and noun, each of which is associated with specific information about part of speech, pronunciation, and definition.

The final paper will provide similar examples from dictionaries as well as terminological data banks.

5. Extracting information from the tree

We define a tree traversal as any path starting at the root of the tree and following, at each node, a single child of that node. A full traversal is a path from the root to any leaf; a partial traversal extends from the root to any node in one of its subtrees.

As a tree is traversed, each node is associated with a set of features including: (a) features associated with the node during tree creation, and (b) features determined by applying the rules for propagating overwriting, cumulative, and local features. Thus, at any node, all applicable information is available for some unique partition of the lexical space. Nodes near the top of the tree represent very broad categories of partition; leaf nodes are associated with information for the most specific usage of the entry.

6. Encoding the information in XML

We define an XML encoding format for the structures described above:

Elements

<struct> represents a node in the tree. <struct> elements may be recursively nested at any level to reflect the structure of the corresponding tree. <struct> is the only element in the encoding scheme that corresponds to the tree structure; all other elements provide information associated with a specific node.

<alt> alternatives are bracketed in parallel <alt> elements, which may appear within any <struct>. <brack> is a general-purpose bracketing element to group associated features.

Base elements correspond to various features, such as (for dictionaries) orth, pron, hyph, syll, stress, pos, gen, case, number, gram, tns, mood, usg, time, register, geo, domain, style, def, eg, etym, xr, trans, and itype (analogous to dictionary elements defined in the TEI Guidelines).

Attributes

Attributes are used to provide information specific to the element on which they appear and are not inherited in a tree traversal.

The following shows the corresponding XML encoding for "overdress":

<struct>
    <orth>overdress</orth>
    <struct>
        <pos>verb</pos>
        <pron>zzzz</pron>
        <def>To dress (oneself or another) too elaborately or finely</def>
    </struct>
    <struct>
        <pos>noun</pos>
        <pron>yyyy</pron>
        <def>A dress that may be worn over a jumper, blouse, etc.</def>
    </struct>
</struct>
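As a rough check that such an encoding carries the same information as the tree, the following Python sketch parses a well-formed version of the "overdress" entry with the standard library and recovers one feature set per leaf <struct>. The function name `leaf_readings` is hypothetical, end tags are written out in full so that the fragment is well-formed XML, and <alt> and <brack> are not handled.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<struct>
  <orth>overdress</orth>
  <struct>
    <pos>verb</pos>
    <pron>zzzz</pron>
    <def>To dress (oneself or another) too elaborately or finely</def>
  </struct>
  <struct>
    <pos>noun</pos>
    <pron>yyyy</pron>
    <def>A dress that may be worn over a jumper, blouse, etc.</def>
  </struct>
</struct>
"""


def leaf_readings(struct, inherited=None):
    """Return one feature dict per leaf <struct>, inheriting from ancestors."""
    features = dict(inherited or {})
    nested = []
    for child in struct:
        if child.tag == "struct":
            nested.append(child)          # a sub-tree, visited below
        else:
            # Assumption: every non-<struct> child is a simple feature element.
            features[child.tag] = (child.text or "").strip()
    if not nested:
        return [features]
    readings = []
    for sub in nested:
        readings.extend(leaf_readings(sub, features))
    return readings


xml_readings = leaf_readings(ET.fromstring(ENTRY))
for reading in xml_readings:
    print(reading["orth"], reading["pos"], reading["pron"])
```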

7. Transforming the XML document

The Extensible Style Language (XSL) is a part of the XML framework that enables transformation of XML documents into other XML documents. The best-known use of XSL is the formatting of documents for display on web browsers. However, XSL also provides a powerful transformation language that can be used to convert an XML document describing lexical information by selecting, rearranging, and adding information to it. Thus, a document encoded according to the specifications outlined in the previous section can be manipulated to serve any application that relies on part or all of its contents. The current version of the XSL transformation language is available at <http://metalab.unc.edu/xml/books/bible/updates/14.html>.

Lack of space prevents providing examples; the final paper will include these.
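In the meantime, the kind of selection and rearrangement described above can be illustrated in Python rather than XSL. This is only an analogue of what a stylesheet would do, not the authors' transformation: the output element names `entries` and `entry` and the function `flatten` are invented for this sketch, and it handles only a two-level tree like the "overdress" example.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<struct><orth>overdress</orth>"
    "<struct><pos>verb</pos><pron>zzzz</pron>"
    "<def>To dress (oneself or another) too elaborately or finely</def></struct>"
    "<struct><pos>noun</pos><pron>yyyy</pron>"
    "<def>A dress that may be worn over a jumper, blouse, etc.</def></struct>"
    "</struct>"
)


def flatten(source_xml, keep=("orth", "pos", "def")):
    """Build one flat <entry> per sense, keeping only the features in `keep`."""
    root = ET.fromstring(source_xml)
    headword = root.findtext("orth")      # stated once at the root
    out = ET.Element("entries")           # hypothetical output vocabulary
    for sense in root.findall("struct"):  # direct sub-trees only
        entry = ET.SubElement(out, "entry")
        for name in keep:
            value = headword if name == "orth" else sense.findtext(name)
            if value is not None:
                ET.SubElement(entry, name).text = value
    return out


result = flatten(SOURCE)
print(ET.tostring(result, encoding="unicode"))
```

Here the pronunciation is dropped and the headword is copied into each sense, the sort of reshaping an application that needs only part of the entry might request.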

(2.2) The Electronic Classroom

(2.2.1) Teaching Cybertext Writing, Design, and Editing: Language, Image, Linking, Thinking

Christopher Funkhouser
New Jersey Institute of Technology, USA

Ira Shor, in Critical Teaching and Everyday Life, proposes a student-centered pedagogy which theorizes that everyone immersed in mass culture is "habituated to a dizzying pace of life." (63) Describing the factors today's teachers face, Shor writes about the "addicting standard of stimulation" set by radio, television and other illuminated media, certifying that a "hyped use of words in pictures fits into the whole accelerated gestalt of daily life." (63-64)

Accepting this as contemporary circumstance, methods must be constructed and made readily available to help teachers bridge the gap between past and present in terms of technology and the humanities classroom. Digital technology is changing the whole nature of education in our society. This means that professors and students from all disciplines need to be prepared to read and transmit their work in new ways via the computer.

My essay will outline, then describe in detail, successful methodology established in teaching "Electronic Publishing" classes to graduate and undergraduate students with interests across multiple disciplines at New Jersey Institute of Technology. The primary objective of "Electronic Publishing" is to enhance a previously untrained student's ability to use computers effectively and intelligently to create and design texts in academic, commercial, or other settings. Projects in this course of study intend to build understanding and functional skills in the visual presentation and online structuring of information. Students learn how to create interactive online documents that incorporate language with visual aspects of computerized text by combining graphics, sound, animation, text, and video into compelling content. The approach to teaching cybertext writing and design I have developed at New Jersey Institute of Technology since 1997 is effective for students presenting research in every area of the humanities, including languages and literature, history, philosophy, music, art, film studies, linguistics, anthropology, archaeology, creative writing, and cultural studies.

As a pedagogue, I formulate this discipline as an investigative, processual endeavor that demands the understanding and application of two human principles in conjunction with four essential aspects of design. To learn and succeed as online producers of text, students must first embrace and attempt to embody the concepts of patience and organization; the fundamental areas of attention in creating hypertext documents are introduced as: language, image, linking, and thinking. Every technical and aesthetic aspect, or problem, of document construction may be addressed through a series of questions, and a checklist of formal considerations associated with these principles and areas of attention.

All of the dimensions or elements within the principles and aspects of design highlighted above will be fully addressed and explained in the paper. Among the multiple subjects that arise in this discussion of how to teach students to produce cybertext are: gathering and formatting content, conducting research on the Internet, presenting effective visual communication, strategizing and solving technical problems, interlinking and layering documents, and otherwise establishing objectives and sensible schemes for online documents. In "Electronic Publishing," students are eventually introduced to a completely different language: the relentlessly precise language of computer programming, HTML, which intervenes with content and re-creates sense and vision within cybertext writing and editing. Code is language that handles the work of online producers: writing, image, and sound. Sometimes it is relatively easy to understand and use; at other times it may be fearfully complicated. Methods of conceptualizing (for students) what HTML is, and how to make use of it in humanities projects, will be outlined in this presentation.

This essay will, in addition to covering materials listed above, offer a detailed account of the various components of the "Electronic Publishing" courses, which consist of a month of unique design-oriented research followed by two months of "hands-on" work. Students in the course not only study electronic publishing, they do electronic publishing by editing two editions of a journal based on their personal academic or creative pursuits. An electronic portfolio of a student's work in every class they are enrolled in must also be completed as part of "Electronic Publishing."

The program of "Electronic Publishing" designs a technology plan for other Humanities-oriented departments interested in developing curricula around electronic publishing initiatives, Internet communication, and hardware/software management schematics. Methods of reading and presenting work using technologically sophisticated computers and networks are made clear by my process; students are quickly able to exhibit and exercise their learning in these courses. At the conclusion of my paper, I will present guidelines formulated for the assessment of student generated work.

For current "Electronic Publishing" course materials on the World Wide Web, see: <http://www-ec.njit.edu/~cfunk/353>, <http://www-ec.njit.edu/~cfunk/605>

(2.2.2) A Toolbox for the Electronic Classroom

Peter L. Havholm
Larry L. Stewart
The College of Wooster, USA

Writing about the new economy, Jeff Madrick remarks that the kinds of good jobs increasing most rapidly "require communication skills, social ease, and basic reasoning abilities ..." Acquiring such skills, he believes, "may only be possible through higher education, where students are exposed to a sophisticated culture, a variety of experiences, and varying disciplines that require analysis of facts and concepts" (33). In our view, the conventional classroom on a college or university campus remains the best facility for such an education, and we think the new technologies can be used to make it even more powerful.

By contrast, much excitement about technology in governing bodies is lavished on various money-saving adaptations of distance education using the web. Perhaps as a result, some advocates of increased use of technology in higher education seem to believe the traditional classroom anachronistic (see Daniel, 1996 and 1997). In our opinion, such thinking bodes ill for the kind of learning we see as vital. Rather than using technology to replace teachers and conversation, we think it should take its place with more conventional tools as another way to enhance teaching and learning conceived in traditional ways.

In what follows, we describe (and illustrate in presentation) a technologically enhanced classroom and accompanying tools that operationalize a philosophy of pedagogy that puts technology at the service of active learning. While we have previously presented some of the tools we use in this classroom (Havholm and Stewart 1996, 1998), this presentation aims at showing a style of teaching we believe to be particularly promising. While it is too soon to claim more than anecdotal success, we know it promotes active learning because it demonstrably extends students' powers of inquiry.

We and colleagues increasingly favor this kind of use of technology - in a range of disciplines - at The College of Wooster. It saves no money in the short term, however. Rather than doing away with buildings or teachers, it adds technology to a conventional classroom housing a small number of people, one of whom is salaried. But over the long term, if we are right, our graduates have the intellectual and cultural capital that allows them to think and learn independently. They will not need expensive re-training every time their environment changes a little.

Our electronic classroom looks like a seminar room, with a table in the middle, surrounded by comfortable chairs. It differs in that along its walls twenty to thirty networked computers stand ready, each linked to one another, to a screen/video projector overhead, and to the internet. Such a classroom clearly values physically proximate talk, but it also brings the huge resources of the internet to any conversation that wishes them. Moreover, it makes possible the easy use of a range of new tools that encourage active learning.

For example, Peter Havholm and our colleague Jenna Hayward use a freeware beta version of PennMUSH in a course on dramatic structure to allow the class (of 29) to improvise a seven-episode serial drama. Students play characters and invent actions on-line, edit the logs of their online sessions into scripts, and then publish a final version on the web for friends (on- and off- campus) to read.

Because of the technology, they can write and publish a play that belongs to all of them. Most important, however, is that this exercise is not done in a playwriting class but in a study of drama. Rather than honing writing skills, writing and publishing a play in this class tests the principles of structure students are learning from their reading of a dozen plays and Aristotle's Poetics. And because the technology makes publication so easy, the whole project takes only about 15% of class time.

Having students write and publish a drama to test theoretical ideas was a natural development from another kind of project several of us in the English department use in our writing courses. In the Journalism course, in Introduction to Non-fictional Writing, and in English 101 as well as in the course Writing for Magazines, students spend two to five weeks writing, editing, designing, and producing a magazine, using page layout software, which they then either give away or sell on campus.

The publishing projects have pleased several of us because students so much enjoy writing to intrigue and amuse their peers - and because the projects make self-evidently necessary the tasks of re-writing, careful consideration of audience and voice, and editing. No need for exhortation about these activities; one cannot make a magazine to impress one's friends without them. We also believe that preparing writing for publication - with headlines, pullquotes, illustrations, captions, and the rest - provides valuable experience in imagining oneself as one's reader and in visual thinking.

Among the tools that have been particularly useful in courses in narrative or narrative theory is the Linear Modeling Kit (or LMK), a program the two of us designed and have worked with for several years (see Havholm and Stewart, 1996). The LMK is essentially an authoring system, and it allows users to create applications that generate any kind of text according to principles proposed by the user. For example, a student can use the LMK to create a "folktale generator" by entering what the student perceives to be the parts or elements of a folktale, any principles of order among those parts, and characteristic text for each part. Depending on the complexity of the input, the generator will produce hundreds, thousands, or millions of different texts. In our classes, students have created not only folktale generators but bildungsroman generators, romance generators, tragedy generators, and argument generators.

As do all the activities we discuss here, working with the LMK acts as an heuristic, forcing students to move back and forth between theory and practice. To produce an LMK generator, students must first abstract principles from narratives they have read and then turn them into instructions for their generators. The generator then operationalizes the student's theory; it produces narratives created according to the principles the student has derived.
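The way a modest "theory" yields a large number of texts can be shown with a short Python sketch. It is not the LMK itself: the three parts, their sample sentences, and the function `generate_all` are invented for illustration, and the only ordering principle assumed is a fixed sequence of parts.

```python
from itertools import product

# Hypothetical student "theory" of a folktale: ordered parts, each with
# characteristic texts entered by the student.
FOLKTALE = [
    ("departure", ["A traveler left the village.",
                   "A stranger set out at dawn."]),
    ("trial", ["A wolf barred the road.",
               "A river had to be crossed.",
               "A riddle was posed."]),
    ("return", ["The traveler came home wiser.",
                "The village held a feast."]),
]


def generate_all(parts):
    """Return every text made by choosing one option per part, in order."""
    option_lists = [options for _name, options in parts]
    return [" ".join(choice) for choice in product(*option_lists)]


tales = generate_all(FOLKTALE)
print(len(tales))
print(tales[0])
```

The number of distinct texts is the product of the option counts for each part (here 2 x 3 x 2), which is why adding a few parts or options quickly moves a generator from hundreds of texts to millions.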

Another of the tools we use in the electronic classroom is the Stylistic Analysis Kit (or SAK), a combination concordance and counting program with a nearly flat learning curve. Although the SAK is a fairly conventional program, its ease of use separates it from many of the tools used by professional researchers and makes it ideal for the student in the classroom.

When analyzing their own papers, students are almost always driven back to their texts by, for example, discovering their average sentence length to be half that of the person sitting at the next computer or by learning that "the" comprises 14.7% of their total words. Here, students move between the abstraction of statistics and their own practice as writers. Even those who seem generally to lack curiosity are fascinated by the statistical record of their writing and eager to determine what practices account for those statistics.
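The two kinds of output mentioned here, counts and concordance lines, can be approximated in a few lines of Python. This is a sketch of the general technique, not the SAK: the function names, the naive sentence splitter, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def style_stats(text):
    """Return sentence count, word count, average sentence length, word shares."""
    # Naive assumption: sentences end at runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "share": {w: n / len(words) for w, n in counts.items()},
    }


def concordance(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words on each side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


sample = "The cat sat on the mat. The dog barked."
stats = style_stats(sample)
print(stats["avg_sentence_length"], round(stats["share"]["the"], 3))
print(concordance(sample, "dog"))
```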

There has recently been much publicity about the ease with which new hardware and software can be used to create complex video projects. Our students have begun to find that - like desktop publishing software - the new video tools can be used to explore and test ideas. Ben Spieldenner chose to make a video as his final project in our colleague Jenna Hayward's course in Post-Colonial Literature. In Urban Legends, he wanted to evoke reflection congruent with one of the principal issues of the course. The class had talked about how easy it is to essentialize one's own culture while seeing other cultures as "different." Ben chose several urban legends - that Disney makes heavy use of phallic imagery in The Little Mermaid, that there's a boy with a shotgun in the background of a scene in Three Men and a Baby, that there's a hanged Munchkin in The Wizard of Oz if you look closely enough at the right moment, and that spiders can lay eggs on your face - to shake our presuppositions. He wanted his presentations of these legends to push his audience into problematizing their own culture. He thinks that our urban legends show us to be more "different" than we think we are. But he wanted to stimulate thought, not to impose his ideas on the class.

In Understanding Computers and Cognition, Terry Winograd and Fernando Flores make a convincing case against the use of computers as "restricted to representing knowledge as the acquisition and manipulation of facts, and communication as the transferring of information" (78). Rather, they argue that we need to design computers as "equipment for language" so that they can "create new possibilities for the speaking and listening that we do" (79).

Our version of the electronic classroom and our use in it of the tools we have described reflect this understanding of technology. In every case, students use the tools to interrogate ideas in ways novel in humanistic study. In an important sense, each tool allows students to test their understanding: the effectiveness of a serial drama tests ideas about dramatic structure; reader response to a published magazine tests convictions about rhetoric; the variety of lawful stories an LMK generator produces tests the powers of the theory of narrative it has been "taught"; the SAK's quantitative analysis leads to testing qualitative ideas about writing style; and his classmates' response to Spieldenner's Urban Legends tested his hypothesis that presenting urban legends can help us think in new ways about "difference."

In every case, we believe, the technology adds power to students' ability to question and therefore to understand - in the context of a kind of discussion as old as learning.

References

Boyer, Ernest L (1990). Foreword. Campus Life: In Search of Community. Princeton UP, Lawrenceville, NJ.
Daniel, J. S. (1996). The Mega-Universities and Knowledge Media. (Open & Distance Learning) Kogan, London.
Daniel, J. S. (1997). Why Universities Need Technology Strategies. Change, July/August, pp. 10-17.
Havholm, Peter and Stewart, Larry (1996). Modeling the Operation of Critical Theory on the Computer. Computers and the Humanities, 30:2, pp. 107 - 115.
Havholm, Peter and Stewart, Larry (1996). Using a Narrative Generator to Teach Literary Theory. ALLC-ACH '96 Conference Abstracts. Bergen, pp. 135 - 37.
Havholm, Peter and Stewart, Larry (1998). Computers and Active Learning: Using the Stylistic Analysis and Linear Modeling Kits. ALLC-ACH '98 Conference Abstracts. Debrecen, pp. 135 - 37.
Madrick, J. Computers: Waiting for the Revolution. The New York Review of Books, 29-33.
Winograd, Terry and Flores, Fernando (1987). Understanding Computers and Cognition: A New Foundation for Design. Addison-Wesley, New York.

(2.2.3) Ekphrasis and the Internet: Connecting the Verbal and the Visual with Computer-mediated Student Projects in an Undergraduate Humanities Class

Donna Reiss
Tidewater Community College, USA

Art Young
Clemson University, USA

In her exploration of ekphrasis, the relationship between visual and verbal arts, Amy Golahny reminds us that references to the interconnectedness of the language of pictures and words date at least from the fifth century B.C. when Simonides said, "as in painting, so in poetry". In the first century B.C., Golahny adds, Horace said that "painting is mute poetry and poetry a speaking picture". Further consideration of the concept of ekphrasis by Murray Krieger and W.J.T. Mitchell brings our attention to this verbal-visual relationship up to date. However, at the end of the twentieth century, most undergraduate education in the humanities continues to approach these art forms separately or to focus on student-generated text alone for developing and communicating ideas.

The ease with which the Internet now allows students to exchange, create, and manipulate text and images offers new opportunities for engagement with the composing process. Because our goal as teachers of undergraduate writing and literature classes is creative as well as critical communication and because our pedagogy emphasizes active learning processes, we introduce our students to computing in and about the humanities. Dialogic writing within and beyond their classes enables students to enter into new discourse communities and to explore collaboratively the concepts of their courses. Creating, selecting, and manipulating visual images alone or in conjunction with text introduces students to expanded and contemporary composing processes. Publication of their compositions on the Internet provides them with an audience of other learners. They need not strive to be professional poets or painters to be makers of poems and paintings as a way to learn.

Although our students may read Blake at a Website or in an edition illustrated by his own drawings or read Auden's "Musée des Beaux Arts" accompanied by a reproduction of Brueghel's Landscape with the Fall of Icarus, the relationship between the visual and verbal has not been emphasized in undergraduate higher education, where science textbooks are likely to have more illustrations than literature anthologies. How do humanities teachers dramatize the connection between the visual and verbal for our students and thus help our students understand the interrelatedness of the linguistic and graphical arts? How do we revive their own creativity and cognitive skills with words and pictures? After all, our students probably illustrated their own words in elementary school but are seldom invited to do so in college.

New technologies, in particular the World Wide Web, are bringing words and pictures together for us and our students in ways that might bring those connections back to our college classrooms. Document design now extends beyond the one-inch margin requirements of MLA student manuscripts. Instead, our students are learning with us about screens and color and negative space and visual communication as integral to rather than decoration for the word.

We will describe undergraduate literature and writing projects in which student-generated words and graphics are central to communication of ideas. In these projects, publication of their compositions on the Internet encourages students to reflect on the connections between technology and art, word and image, private and public writing, and their own creative and critical processes. These projects give students opportunities to perceive and to communicate visually, orally, textually, kinesthetically - in other words, they provide multisensory learning experiences.

Theoretical foundations for student-generated compositions in this project come not only from Golahny, Krieger, and Mitchell but also from chapters on teaching in Learning Literature in an Era of Change: Innovations in Teaching. Terri Pullen Guezzar ("From Short Fiction To Dramatic Event: Mental Imagery, The Perceptual Basis of Learning in the Aesthetic Reading Experience") applies the theories of Rudolf Arnheim and Allan Paivio, who argue that privileging the verbal over the visual limits our cognitive development and that separating verbal from visual perception fragments our understanding of and communication about literature. Pedagogical theory is featured in "Figuring Literary Theory and Refiguring Teaching: Graphics in the Undergraduate Literary Theory Course," where Marlowe Miller maintains, "Graphics help students conceptualize complex and abstract theories so that they can identify the central concepts and assumptions of those theories."

Two accessible resources for teachers thinking about integrating new media into undergraduate education also are useful for encouraging colleagues to incorporate the Internet as a learning environment and to make computer-mediated student projects integral to the learning process. In Seven Principles for Good Practice in Undergraduate Education: Implementing with Technology, Arthur W. Chickering and Stephen C. Ehrmann describe ways the following tenets can be incorporated into computer-mediated instruction: contacts between students and faculty, reciprocity and cooperation among students, active learning techniques, prompt feedback, time on task, high expectations, respect for diverse talents and ways of learning. Cooperation among students and active learning techniques as well as respect for diverse learning styles all are supported by multisensory student online publications in which students create original works of art or combine text and images to learn and to communicate their learning.

Additional encouragement for teachers and students comes from Engines of Inquiry: Teaching, Technology, and Learner-Centered Approaches to Culture and History by Randy Bass, director of the American Crossroads Project, Georgetown University. Bass identifies "six kinds of quality learning" that "information technologies can serve to enhance": distributive learning, authentic tasks and complex inquiry, dialogic learning, constructive learning, public accountability, and reflective and critical thinking. Once again, collaborative student-generated projects are emphasized as effective learning strategies.

Teaching at two quite different types of institutions, Donna at a large multicampus urban-suburban open admissions community college on the Atlantic coast of Virginia and Art at a selective land-grant university emphasizing agriculture, engineering, science, and technology in the foothills of South Carolina, we both have found that opportunities to compose and share text and images have enriched learning for undergraduates. Examples from the work of our own students and of our colleagues' students will demonstrate some ways that novice scholars learn "from the inside out" by creating, selecting, combining, and manipulating text and images in electronic environments.

Using either a live Internet connection (preferable) or files on disk displayed through a Web browser as well as an overhead projector, we will present and analyze student work that illustrates the conjunction of visual and verbal knowledge and its significance for introducing undergraduate students to the artistic life of their community and to computer-mediated composing as well as for fostering their creative and cognitive development.

<http://onlinelearning.tc.cc.va.us/faculty/tcreisd/projects/achalc2k/>

References

Arnheim, Rudolf (1980). Visual Thinking. University of California Press, Berkeley.
Bass, Randy (26 Oct. 1998). Engines of Inquiry: Teaching, Technology, and Learner-Centered Approaches to Culture and History. <http://www.georgetown.edu/crossroads/guide/engines3.html>.
Chickering, Arthur W. and Ehrmann, Stephen C. (1997). Implementing the Seven Principles: Technology as Lever. <http://www.aahe.org/technology/ehrmann.htm>.
Chickering, Arthur W. and Gamson, Zelda (1987). Seven Principles for Good Practice in Undergraduate Education. AAHE Bulletin, March 1987.
Golahny, Amy, ed. (1996). The Eye of the Poet: Studies in the Reciprocity of the Visual and Literary Arts from the Renaissance to the Present.
Guezzar, Terri Pullen (2000). From Short Fiction To Dramatic Event: Mental Imagery, The Perceptual Basis of Learning in the Aesthetic Reading Experience. In Dona Hickey and Donna Reiss (eds) Learning Literature in an Era of Change: Innovations in Teaching, Stylus, 2000, Sterling, VA, pp. 74-86.
Krieger, Murray (1992). Ekphrasis: The Illusion of the Natural Sign. Johns Hopkins University Press, Baltimore.
Miller, Marlowe (2000). Figuring Literary Theory and Refiguring Teaching: Graphics in the Undergraduate Literary Theory Course. In Dona Hickey and Donna Reiss (eds) Learning Literature in an Era of Change: Innovations in Teaching, Stylus, 2000, Sterling, VA, pp. 61-73.
Mitchell, W. J. T. (1995). Picture Theory. University of Chicago Press, Chicago.
Paivio, Allan (1971). Imagery & Verbal Processes. Lawrence Erlbaum Associates, Mahwah, NJ.

(2.3) Stylistics

(2.3.1) SMART Project: Methods for Computer-based Research of Premodern Chinese Texts

Christian Wittern
Chung-Hwa Institute of Buddhist Studies, Taiwan

This presentation will start with a look at some of the problems encountered so far in a number of projects that tried to apply TEI [TEIP3] markup to premodern Chinese Buddhist texts. I have been working with the TEI Guidelines for more than seven years and published the first text, rather heavily marked up in TEI fashion, in 1995¹. Since then I have become involved with some other projects digitizing Chinese Buddhist texts, most prominently the work by the Chinese Buddhist Electronic Text Association (CBETA)². We now have about 200 MB of texts basically marked up³ according to the Guidelines.

All of these projects worked from printed editions published 80-100 years ago. One of the most obvious problems we encountered is the large number of non-standard characters found in these texts; TEI, and SGML in general, can handle these elegantly, although some important details should be noted⁴. Some of the more subtle problems involve structural elements specific to texts of the sphere of Chinese cultural influence. Examples of these elements include the notion of a scroll, which is carried over from the time when the documents were actually written on scrolls but still marks divisions in the printed editions. Being based on the physical medium, scroll divisions fall into a category similar to that of the LB, PB and MILESTONE elements in TEI, but they are usually associated with some other heading-like text, colophons and the like. While this could be handled in some way with the FW element, we decided to introduce a new element, JUAN (Chinese for scroll), and encode the information therein. Other structural elements that presented difficulties include sound glosses in the text, and colophons or other backmatter-like text that falls at the end of a scroll but in the middle of a DIV element continuing on the next scroll.

A second part of this presentation will give an overview of the recent developments in the SMART (System for Markup and Retrieval of Texts) project⁵. This project aims at providing a working environment for research and markup on East Asian texts by utilizing the TEI Guidelines (see also [SpMcQ91]) and other international, open standards. The environment tries to enable network-based collaboration and layered, private markup added to a central repository of texts, but it is also intended to be usable on stand-alone machines without a live connection to the Internet. So far, the basic framework has been outlined and some of the utilities built. Originally, the plan was to develop this into a collection of open modules that can interact through an open protocol, in the spirit of presentations at ACH/ALLC 1999 by Michael Sperberg-McQueen, Jon Bradley and others. However, since such a protocol specification is far from being finalized, I found that I would rather have a concrete implementation to play with and to iron out problems. I therefore recently decided to build the tools I would need on top of the Zope⁶ Web-Application platform. This is an OpenSource™ project built mainly with Python, implementing an object-oriented database and a complete framework for developing dynamic Web-Applications. It has strong support for XML and related standards and thus seems especially suited for the purpose at hand. All the methods are exposed through a URL-based interface and are also callable through XML-RPC.
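To make the XML-RPC route concrete, the sketch below uses Python's standard `xmlrpc.client` module to build the request body that a client would send, and then decodes it again as a server would. The method name `search` and its single argument are purely hypothetical; the actual method names of the SMART tools are not given here, and no network connection is made.

```python
import xmlrpc.client

# Hypothetical method name and argument, chosen only for illustration.
request_body = xmlrpc.client.dumps(("禪",), methodname="search")
print(request_body)

# Decoding the same payload recovers the parameters and the method name,
# which is the first step a receiving server performs.
params, method = xmlrpc.client.loads(request_body)
print(method, params)
```

Against a live server the same call would normally be made through `xmlrpc.client.ServerProxy`, which takes the server URL and performs this encoding behind an ordinary method call.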

The presentation in the context of the ALLC/ACH conference aims at contributing to a discussion of how such an open framework can be implemented, while at the same time showing some of the problems that arise when dealing with East Asian languages (see [ApWi96] and [CCAG80-85]). East Asian languages do not normally mark the word boundaries and even the definition of a word is highly disputed among linguists. In this situation, a list of all occurring words in the manner of a word-wheel cannot be applied. Additionally, the texts used here contain markup of textual variants, which complicates the creation of an index. Furthermore, different representations of the same character in machine-readable encodings have to be accounted for. An indexing method that takes these problems into account and also provides an abstraction from indexing of actual low-level locations in the text has been developed⁷.
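One widely used way around the missing word boundaries is to index overlapping character n-grams instead of words, after mapping variant character encodings to a canonical form. The Python sketch below illustrates that general idea only; it is not the method described in [Wit99], it ignores markup of textual variants, and the tiny `VARIANTS` table, the sample documents, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical variant table mapping alternate forms of a character to one
# canonical form. A real table would be far larger.
VARIANTS = {"禅": "禪", "仏": "佛"}


def normalize(text):
    """Replace each known variant character by its canonical form."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)


def build_index(docs, n=2):
    """Map every overlapping character n-gram to a set of (doc_id, offset)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        norm = normalize(text)
        for i in range(len(norm) - n + 1):
            index[norm[i:i + n]].add((doc_id, i))
    return index


def search(index, docs, query, n=2):
    """Find occurrences of `query`, treating variant characters as equal."""
    q = normalize(query)
    if len(q) < n:
        raise ValueError("query is shorter than the n-gram size")
    hits = []
    # The first n-gram gives candidate positions; each is then verified
    # against the normalized text so that longer queries match in full.
    for doc_id, offset in sorted(index.get(q[:n], ())):
        if normalize(docs[doc_id])[offset:offset + len(q)] == q:
            hits.append((doc_id, offset))
    return hits


docs = {"a": "禪宗語錄", "b": "仏教禅宗史"}
index = build_index(docs)
hits = search(index, docs, "禪宗")
print(hits)
```

Because the variant mapping is one character to one character, the offsets found in the normalized text are also valid in the original text.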

The SMART project will be utilized in two different contexts:

1. As a retrieval and interface engine for the Buddhist text database produced by the Chinese Buddhist Electronic Text Association. SMART will allow for retrieval with enhanced queries, and add markup based on these queries, thus providing a powerful way to gradually enrich the markup.
2. As the central research platform for a research project on texts of the Chan school in Chinese Buddhism. Here a smaller corpus of texts is used for building not only text with rich markup, but also supporting databases of proper names, sites and historical dates to allow for knowledge-base centered retrieval of the texts.

A demonstration of both applications will be given in this presentation.

Notes

^1. The Chan-Buddhist genealogical history Wudeng Huiyuan (first printed in 1253) on the ZenBase1 CD-ROM, see [App et al 95].
^2. The CBETA project website (mostly in Chinese) is at <http://ccbs.ntu.edu.tw/cbeta>.
^3. This basic markup follows the general ideas outlined in [Wit96].
^4. I will not go into detail for this audience, but some references to these problems can be found in the work by the Chinese Characters Analysis Group. More recently, we based our efforts on the work done by the Mojikyo Font Institute in Japan <http://www.mojikyo.gr.jp>.
^5. The project website is at <http://www.chibs.edu.tw/~chris/smart/>.
^6. For more information on Zope see <http://www.zope.org>.
^7. More information can be found in [Wit99].

References

RHComN: Research in Humanities Computing, Oxford: Clarendon, 1991ff. N is the sequential number of the volume.
[ApWi96] App, Urs and Wittern, Christian (1996). A New Strategy for Dealing with Missing Chinese Characters, Humanities and Information Processing No. 10, February 1996, p52-59.
[App et al 95] App, Urs, Kumiko, Fujimoto and Wittern, Christian (1995). ZenBase CD1. International Institute for Zen Buddhism, Kyoto.
[CCAG80-85] Chinese Character Analysis Group (Ed.) (1980 - 85). Chinese Character Code for Information Interchange, Vol. I-III, Taipeh 1980, 1982, 1985.
[CaZa91] Calzolari, Nicola and Zampolli, Antonio "Lexical Databases and Textual Corpora: A Trend of Convergence between Computational Linguistics and Literary and Linguistic Computing", in: [RHCom1], p273-307.
[Lanca91] Lancashire, Ian (Ed.) (1991). The Humanities Computing Yearbook 1989-90 A Comprehensive Guide to Software and other Resources. Clarendon Press, Oxford.
[Latz92] Latz, Hans-Walter (1992). Entwurf eines Modells der Verarbeitung von SGML-Dokumenten in versionsorientierten Hypertext-Systemen Das HyperSGML Konzept, Diss. Berlin 1992.
[Neum96] Neuman, Michael (1996). "You Can't Always Get What You Want: Deep Encoding of Manuscripts and the Limits of Retrieval", [RHCom5], p209-219.
[Rob94] Robinson, Peter M.W. (1994). "Collate: A program for Interactive Collation of Large Textual Traditions", [RHCom3], p32-45.
[SpMcQ91] Sperberg-McQueen, Michael, C. (1991). "Text Encoding and Enrichment", [Lanca91], p503f.
[TEIP3] Sperberg-McQueen, Michael C. and Burnard, Lou (Eds.) (1994). Guidelines for Electronic Text Encoding and Interchange, Chicago and Oxford.
[Wit93] Wittern, Christian (1993). "Chinese Character Encoding", The Electronic Bodhidharma, Nr. 3, July 1993, p44-47.
[Wit94] Wittern, Christian (1994). "Code und Struktur: Einige vorläufige Überlegungen zum Aufbau chinesischer Volltextdatenbanken", Chinesisch und Computer, Nr.9, April 1994, S.15-21.
[Wit95a] Wittern, Christian (1995). "The IRIZ KanjiBase", The Electronic Bodhidharma, Nr. 4, June 1995, p58-62.
[Wit95b] Wittern, Christian (1995). "Chinese character codes: an update", The Electronic Bodhidharma, Nr. 4, June 1995, p63-65.
[Wit96] Wittern, Christian (1996). "Minimal Markup and More - Some Requirements for Public Texts", Conference presentation at the 3rd EBTI meeting on April 7th, 1996 in Taipei, Taiwan.
[Wit99] Wittern, Christian (1999). "SMART: Format of the Index Files", Technical note published on the Internet at <http://www.chibs.edu.tw/~chris/smart/smindex.htm>. (First published July 20th, 1999, last revised January 10th, 2000)
[Yas96] Yasuoka, Koichi and Yasuoka, Yasuko (1996) Kanjibukuro, Kyoto.
<http://m-media.kudpc.kyoto-u.ac.jp/~yasuoka/kanjibukuro/>

(2.3.2) Word Order in Latin Prose Applied to a Case of Authorship Attribution: Book IV of the Stratagemata by Sextus Iulius Frontinus (1st century AD). The Contribution of Quantitative Methods via Computerized Text Analysis.

Empar Espinilla Buisan
Montserrat Nofre Maiz
University of Barcelona, Spain

Background

For some time, the Servei de Lexicometria at the University of Barcelona has been working in conjunction with the Latin Linguistics Group (Catalan acronym GLLUB) of the same university on the promotion and application of quantitative methods and computerized analysis of texts in the field of corpus languages, in this case Latin. Most of the work carried out to date has focused on questions of authorship attribution. Our main object of study is Sextus Iulius Frontinus, a writer of technical prose who was active during the first century AD. Parts of his work present problems of attribution; to add to the difficulty, there are few other candidates for the authorship of the doubtful text.

Three texts by Frontinus have survived: De agrimensura ("Agrimensura", fragments on land surveying and its legislation), Stratagemata ("Stratagems", a set of instructional anecdotes for Roman army officers, which illustrate the principles of the art of warfare through examples of stratagems selected from Greek and Roman history) and De aquaeductu urbis Romae ("On the aqueducts of the city of Rome", a treatise on the water supply of Rome). The problem of attribution arises with the fourth and last book of the Stratagemata. The hypotheses proposed by philologists for the date of book IV do not coincide: owing to the lack of qualitatively distinctive linguistic features, the pseudo-Frontinus has been placed in the first century (thus a contemporary of the author himself), at the beginning of the second, and between the fourth and fifth. For this reason we decided to work on this text of doubtful authorship by applying quantitative statistical analysis methods with computerized support (Espinilla-Nofre: 1998). In that study, we used some of the quantitative methods that are generally accepted for questions of authorship attribution (Holmes: 1994): the ratio of simple forms/occurrences, the ratio of forms/occurrences with a fixed number of occurrences (fixed N), the ratio of simple occurrences/forms, the ratio of hapax legomena/forms, Honoré's R function, the ratio of hapax dislegomena/forms and the study of the length of forms. These data allowed a comparison between the doubtful text and the rest of the works of Frontinus. The results of that first study highlighted two points:

Between the doubtful text and the texts reliably attributed to Frontinus there is no inconsistency (this finding underlines the difficulties facing traditional hypotheses).
Another point of reference is required, i.e. another author, with whom to compare the data obtained.

As the second stage of the project, we have therefore decided to approach the problem from another perspective. Following on from previous studies (Tweedie-Frischer: 1999; Frischer-Holmes-Tweedie, et al: 1999) and others, we are keen to analyze the order of the forms in the text in question and to compare it (1) with the texts recognized as Frontinian, and (2) with another text of a later date (control author, Tweedie: 1998). This analysis assumes that there was a change in word order in the Latin sentence between the classical era and the later period (Linde: 1923, Marouzeau: 1953). Although the computer is a considerable aid in performing quantitative analysis of the texts, our study has faced two particular problems from the very beginning:

The first derives from the premise of an established word order in Latin. The generalized opinion is that word order is basically S(ubject)-O(bject)-V(erb). However, there are a number of deviations, and certain scholars have questioned the assumption of this standard word order in Latin prose (Pinkster: 1991):

Deviations according to sentence type: unlike assertive sentences, in imperatives the verb is usually placed at the beginning.
Deviations according to type of clause (main or subordinate) and the use of different types of subordinate clauses.
Deviations deriving from the internal structure of the constituents of the sentence: the general tendency in Latin is to place the syntactically relevant constituents (the heavy material) on the right, and the constituents of less syntactic importance (the light elements) as near the beginning as possible, even though this tendency may be altered for pragmatic and semantic reasons; questions of theme and rheme, or topic and focus.

Nonetheless, in our study, we subscribe in principle to the premise that in the classical era the most common order followed by authors in Latin prose was SOV, and, in the later period, SVO.

The second intrinsic difficulty when working with corpus languages can be summarized as follows (Ramos: 1996):

Productivity: the corpus does not show which of the linguistic rules that can be extracted are the most productive.
Grammaticality: to clarify the grammatical differences observed between authors it is obviously impossible to consult a native speaker.
Representativity of the corpus: the corpus at our disposal is a set of materials that has been preserved due to a particular sequence of events. It is not specifically selected for study by linguists.

The Corpus Studied

For our study, we compared book IV of the Stratagemata of Frontinus with books I, II and III, and also with the work of a control author: De diversis fabricae architectonicae, by Caetius Faventinus, another writer of technical prose who lived in the later period.

Methodology

Technical data

The texts in the corpus were computerized in ASCII format.
The computer program used to analyze the corpus was TACT (Textual Analysis Computing Tools), version 2.1 gamma.
The corpus was coded with COCOA labels, following the marking guidelines of the MAKEBASE module in TACT.
The data were obtained using the USEBASE module in TACT.
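As an illustration of what COCOA-style coding looks like to a program, the sketch below reads lines in which a tag of the form `<category value>` sets a reference that stays in force until it is overridden. The category letters (`W` for work, `B` for book) and the sample text are invented for the example; the abstract does not give the actual tag set, and TACT's MAKEBASE is not reproduced here.

```python
import re

# A COCOA-style reference: <category value>, e.g. <B 4> (categories here are hypothetical).
COCOA_TAG = re.compile(r"<(\w+)\s+([^>]*)>")


def parse_cocoa(lines):
    """Return (context, text) pairs, where context holds the tag values currently in force."""
    context = {}
    parsed = []
    for line in lines:
        # Every tag on the line updates the running context.
        for name, value in COCOA_TAG.findall(line):
            context[name] = value.strip()
        # What remains after removing the tags is the text itself.
        text = COCOA_TAG.sub("", line).strip()
        if text:
            parsed.append((dict(context), text))
    return parsed
```

With such a structure, every stretch of text carries the work and book it belongs to, which is what allows counts to be compared between book IV and books I to III.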

Methods of analysis

We examine whether the verb is in final position in the various texts in our corpus.
We study the position of the direct object in relation to the verb that governs it.
We establish the type of clause (main or subordinate) in which the verb is found.
We establish differences between the position of the verb according to the type of subordinate clause.
We do not restrict ourselves to cases of direct objects represented by nouns or pronouns in the accusative, but also study cases of governed complement (in genitive, dative or ablative) and those in which the direct object is represented by a subordinate clause.
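The counts described above can be sketched as follows, assuming that each clause has already been annotated by hand with its type and with the sequence of its constituents. The data layout (a dictionary with `type` and `roles`) and the role labels `S`, `O` and `V` are assumptions for illustration, not the authors' coding scheme.

```python
from collections import Counter


def classify(clause):
    """Return (verb_final, order) for one clause; order is 'OV', 'VO' or None."""
    roles = clause["roles"]
    verb_final = bool(roles) and roles[-1] == "V"
    order = None
    if "O" in roles and "V" in roles:
        order = "OV" if roles.index("O") < roles.index("V") else "VO"
    return verb_final, order


def summarize(clauses):
    """Count verb-final clauses and OV/VO orders separately for each clause type."""
    stats = {}
    for clause in clauses:
        entry = stats.setdefault(clause["type"], Counter())
        verb_final, order = classify(clause)
        entry["clauses"] += 1
        entry["verb_final"] += int(verb_final)
        if order is not None:
            entry[order] += 1
    return stats
```

Here `O` stands for any governed complement, so an object in the genitive, dative or ablative, or an object clause, would be labelled in the same way as an accusative noun.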

Working Hypothesis And Results Obtained

The aim of our study is to provide arguments to corroborate or reject our working hypothesis: following the traditional assumption of Latin word order, the text of the Stratagemata recognized as authentically Frontinian (books I, II and III) must follow word order SOV, while the work of the control author will predominantly follow word order SVO. According to the word order we find in the doubtful book IV we will be able to place it in one or other era. We will thus have a set of data which, though unable to date the writing exactly, will lend support to one of the traditional philological hypotheses.

Bibliography

Agud, A., Fernandez Delgado, J. A. and Ramos Guerreira, A. (eds) (1996). Las lenguas de corpus y sus problemas linguisticos, Ediciones Clasicas, Madrid.
Espinilla Buisan, E. and Nofre Maiz, M. (1998). "Metodos estadisticos y problemas de autoria. El libro IV de las Estratagemas de S. Julio Frontino". In S. Mellet (ed) JADT 1998. 4emes Journees Internationales d'Analyse statistique des Donnees Textuelles, Universite de Nice-Sophia Antipolis-Centre National de la Recherche Scientifique-INaLF, Nice. 263-271.
Frischer, B., Holmes, D., Tweedie, F., et al. (forthcoming). "Word-order transference between Latin and Greek: The relative position of the accusative direct object and the governing verb in Cassius Dio and other Greek and Roman prose authors", Harvard Studies in Classical Philology (forthcoming).
Holmes, D.I. (1994). "Authorship attribution". Computers and the Humanities. 28, 87-106.
Linde, P. (1923). "Die Stellung des Verbs in der lateinischen Prosa". Glotta. 12, 153-178.
Marouzeau, J. (1953). L'ordre des mots en latin. Les Belles Lettres, Paris.
Pinkster, H. (1991). "Evidence for SVO in Latin?". In R. Wright (ed) Latin and the romance languages in the Early Middle Ages. Routledge, London. 69-92.
Ramos Guerreira, A. (1996). "El estatuto linguistico del corpus latino: algunas precisiones". In A. Agud et al. (eds) Las lenguas de corpus y sus problemas linguisticos. Ediciones Clasicas, Madrid. 35-54.
Siewierska, A. (ed) (1998). Constituent order in the languages of Europe. Mouton de Gruyter, Berlin-New York.
Tweedie, F.J. (1998). "The provenance of De Doctrina Christiana attributed to John Milton: a statistical investigation", Literary and Linguistic Computing. 13, 2, 77-87.
Tweedie, F.J. and Frischer, B.D. (1999). "Analysis of classical Greek and Latin compositional word-order data", Journal of Quantitative Linguistics. 6, 1.

(2.3.3) Back to the Cave of Shadows: Stylistic Fingerprints in Authorship Attribution

R. Harald Baayen
University of Nijmegen, The Netherlands

Fiona J. Tweedie
University of Glasgow, UK

Anneke Neijt
Hans van Halteren
University of Nijmegen, The Netherlands

Loes Krebbers
Max Planck Institute for Psycholinguistics, The Netherlands

Introduction

Attempts to assign authorship of texts have a long history. They have been applied to influential texts such as the Bible, the works of Shakespeare and the Federalist Papers. A wide variety of techniques from many disciplines have been considered, from multivariate statistical analysis to neural networks and machine learning. Many different facets of texts have been analysed, from sentence and word length to the most common or the rarest words, or linguistic features. Holmes (1998) provides a chronological review of methods used in the pursuit of the authorial "fingerprint".

A key issue raised at the panel on non-traditional authorship attribution studies at the ACH-ALLC conference in Virginia, 1999, by Joe Rudman is whether authorial "fingerprints" do in fact exist. Is it truly the case that any two authors can always be distinguished on the basis of their style, so that stylometry can provide unique stylistic fingerprints for any author, given sufficient data?

Despite the long history of authorship attribution, almost all stylometric studies have been carried out on the assumption that stylometric fingerprinting is possible. However, often control texts are inappropriately chosen or not available. In addition, the imposition of editorial or publisher's style can distort the original words of the author. To our knowledge, no one has yet carried out a strictly controlled experiment of authorship attribution, with texts of known authorship being analysed between and within genres as well as between and within authors.

In this abstract we present such an experiment. The next section describes the design of the experiment. This is followed by a description of the analysis carried out, then by the results and our conclusions.

Experimental Design

The experiment was carried out in Dutch. Eight students of Dutch literature at the University of Nijmegen participated in the study. All the students were native speakers of Dutch; four were in their first year of study, and four were in their fourth year. The students were asked to write texts of around 1000 words.

Each student wrote in three genres: fiction, argument and description. Three texts were written in each genre, on the following topics.

Fiction: a retelling of the fairy tale of Little Red Riding-Hood, a detective story about a murder in the university, and a romance of chivalry.
Argument: defending a position about the television program 'Big Brother', the unification of Europe, and smoking.
Descriptive: football, the upcoming new millennium, and a book-review of the book read most recently by the participant.

The order of writing the texts was randomised so that practice effects were reduced as much as possible. We thus have nine texts from each participant, making a total of seventy-two texts in the analysis. The main question is whether it will be possible to group texts by their authors using the state-of-the-art methods of stylometry. A positive answer would support the hypothesis that stylistic fingerprints exist, even for authors with a very similar background and training. A negative answer would argue against the hypothesis that each author has her/his unique stylistic fingerprint.

Analysis

There are many methods proposed for the analysis of texts in the attempt to identify authorship. In this abstract we describe three, and a fourth will be described at the conference. The first is that proposed by Burrows in a series of papers, see e.g. Burrows (1992), and used by many practitioners. Here we consider the frequencies of the forty most common words in the text. Principal components analysis is used to identify the most important aspects of the data.
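The following is a minimal sketch of this kind of analysis using only the Python standard library. Whitespace tokenisation, the function names and the use of power iteration to approximate the first principal axis are assumptions made for illustration; the analysis reported here was carried out with awk and R, and a full principal components analysis would return all components rather than just the first.

```python
from collections import Counter
from statistics import mean, pstdev


def top_words(texts, n):
    """Return the n most frequent words across all texts (whitespace tokens, lower-cased)."""
    totals = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in totals.most_common(n)]


def frequency_matrix(texts, words):
    """One row per text: relative frequencies of the chosen words, standardised by column."""
    rows = []
    for text in texts:
        tokens = text.lower().split()
        counts = Counter(tokens)
        rows.append([counts[w] / len(tokens) for w in words])
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def component_scores(matrix, axis):
    """Project each text (row) onto the given axis."""
    return [sum(x * a for x, a in zip(row, axis)) for row in matrix]


def first_component(matrix, iterations=200):
    """Approximate the first principal axis by power iteration on X^T X."""
    k = len(matrix[0])
    axis = [float(j + 1) for j in range(k)]  # arbitrary non-uniform starting vector
    length = sum(a * a for a in axis) ** 0.5
    axis = [a / length for a in axis]
    for _ in range(iterations):
        scores = component_scores(matrix, axis)
        new = [sum(s * row[j] for s, row in zip(scores, matrix)) for j in range(k)]
        length = sum(a * a for a in new) ** 0.5
        if length == 0:
            break
        axis = [a / length for a in new]
    return axis
```

In the setting described here, `top_words(texts, 40)` would select the forty most common words, and the scores of the seventy-two texts on the leading components would be plotted and inspected for grouping by author or genre.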

The second method considered is that of letter frequency. Work by Ledger and Merriam indicates that the frequencies of letters used in texts may be indicators of authorship. We use the standardised frequencies of the 26 letters of the alphabet, with capital and lower-case letters being treated together. As above, the standardised frequencies are analysed using principal components analysis.
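A sketch of the letter-frequency step is given below. It assumes that only the 26 unaccented letters are counted and that anything else (digits, punctuation, accented characters) is ignored; how accented Dutch letters were treated in the study is not stated, so that choice is an assumption.

```python
from collections import Counter
from statistics import mean, pstdev
import string


def letter_frequencies(text):
    """Relative frequency of each of the 26 letters, with capitals folded to lower case."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    return [counts[ch] / total if total else 0.0 for ch in string.ascii_lowercase]


def standardise(rows):
    """Convert each column to z-scores across texts (constant columns become zeros)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]
```

The standardised rows, one per text, would then be passed to a principal components analysis in the same way as the word frequencies.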

Thirdly, we consider methods of vocabulary richness. Tweedie and Baayen (1998) show that Orlov's Z and Yule's K represent two separate families of measures, measuring richness and repeat rate respectively. Plots of Z and K can be examined for structure.
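As a worked illustration of the repeat-rate measure, Yule's K can be computed from the frequency spectrum of a text as K = 10^4 (Σ m² V(m) − N) / N², where N is the number of tokens and V(m) is the number of word types occurring exactly m times. The sketch below assumes the text has already been split into tokens; Orlov's Z requires fitting a model to the spectrum and is not shown.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K: 10^4 * (sum over m of m^2 * V(m) - N) / N^2."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot compute K for an empty text")
    type_frequencies = Counter(tokens)              # word type -> number of occurrences
    spectrum = Counter(type_frequencies.values())   # m -> V(m), number of types seen m times
    second_moment = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (second_moment - n) / (n * n)
```

A text in which no word is repeated has K = 0, and K grows as more of the text is taken up by repeated words, which is why a lower K is read as a lower repeat rate.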

Finally, we are planning to tag the texts and to annotate them for constituent structure. Baayen et al. (1996) show that increased accuracy in authorship attribution can be obtained by considering the syntactic, rather than lexical, vocabulary. The results from this part of the analysis will be presented at the conference.

The texts written in this analysis are available from the authors upon request and, once all annotation has been completed, will be made available on the Web as well.

Results

Each student was asked to write around 1000 words in each text. In fact, the average text length is 908 words. The shortest text has 628 words and the longest 1342. The texts were processed using the UNIX utility awk and the R statistics package.

We first consider all of the texts together. The Burrows analysis of the most common function words shows no authorial structure. Genre appears to be the most important factor, with fiction texts having negative scores on the first principal component, while argumentative and descriptive texts have positive scores on this axis. In addition, argumentative texts tend to have higher values on the second principal component than descriptive texts. It appears that fiction texts are more similar to other fiction texts than they are to other texts by the same author. Analysis of letter frequencies gives similar results, while the measures of vocabulary richness show some indication of structure with respect to the education level of the writer. Those in their first year of studies appear to have lower values of K, and hence a lower repeat-rate. In addition, higher values of Z are the province of first-year students also, indicating a greater richness of vocabulary. When all of these measures are incorporated into a single principal components analysis the genre structure becomes even clearer. Fiction texts are found to the lower left of a plot of the first and second principal component scores, while the other genres are found in the upper right of the graph.

Given the structure evident in the principal components analysis, it seems sensible to split the texts by genre and consider each separately. In each case, within fiction, argumentative, and descriptive texts, again the education level is the only factor to be apparent.

Conclusions

It is apparent from the results described above that in this study, differences in genre override differences in education level and authorship. The absence of any authorial structure in the analyses shows that it is not the case that each author necessarily has her/his own stylometric fingerprint. Texts can differ in style while originating from the same author (Baayen et al., 1996; Tweedie and Baayen, 1998), and texts can have very similar stylometric properties while being from different authors. Of course, it is possible that larger numbers of texts from our participants might have made it possible to discern authorial structure more clearly. Similarly, it may also be that more fine-grained methods than we have used will prove sensitive enough to consistently cluster texts by author even for the small number of texts in our study. We offer, therefore, our texts to the research community as a methodological challenge. Given what we have seen thus far, we believe our results must alert practitioners of authorship attribution to take extreme care when choosing control texts and drawing conclusions from their analyses.

References

Baayen, R. H., van Halteren, H. and Tweedie, F. J. (1996). Outside the cave of Shadows. Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing 11(3):121-131.
Burrows, J. F. (1992). Not Unless You Ask Nicely: The Interpretative Nexus between Analysis and Information. Literary and Linguistic Computing 7(2):91-109.
Holmes, D. I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing 13(3):111-117.
Ledger, G. and Merriam, T. (1994). Shakespeare, Fletcher, and the Two Noble Kinsmen. Literary and Linguistic Computing 9(3):235-248.
Tweedie, F. J. and Baayen, R. H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities 32(5):323-352.

(2.4) Posters & Demonstrations

(2.4.1) The Digital Performance Archive

Rachael Beach
The Nottingham Trent University, UK

While digital performance events and experiments proliferate and new performance genres are beginning to emerge, no central record or archive of these developments is currently being collated. The Digital Performance Archive (DPA) aims to fill that gap by archiving and critically analysing significant new interdisciplinary developments in performance which draw upon (or exist within) digital media in its varied forms. To this end, The Digital Performance Archive will undertake a comprehensive study and recording of the development of 'digital performance' in the last decade of the Twentieth Century.

The study will cover both digital resources used in performance and digital resources on performance created in the two-year period 1999 to 2000 (with important precedents from the 1990s also being catalogued). Digital resources in performance range from theatrical productions and live-art installations that incorporate electronic media, to live-broadcast World Wide Web performances and Internet-based collaborations, to interactive drama and the new performative 'virtual environments' of MUDs, MOOs and IRC. Digital resources on performance include those being used to document, analyse and critique performance: from performing arts databases, websites and mailing lists, to academic CD-ROMs and laser discs. The project aims to be of value to researchers across a wide range of academic disciplines, from drama and performance to art and design, from the social sciences to computer science and cybernetics.

The Digital Performance Archive will have several outcomes. Firstly it will collate an extensive searchable database on the World Wide Web, in order that the public will be able to gain access to work held by the DPA, in particular to the websites and digital files provided by practitioners. Secondly, as material for a DVD, the DPA will document exemplars of digital performance on video. This interactive DVD will also include other significant documents and examples of digitally related performance, all of which will be critically examined. Lastly, the DPA will produce an academic publication presenting a critical overview of the field. As already stated there is currently no other archive that is devoted specifically to this type of work. Perhaps the main reasons for this are that these types of works are so current, so diverse and developing at such an extraordinary rate that there is, as yet, no accepted methodology for dealing with them. Whilst one of the DPA's objectives is to create this web-searchable database archive of the works within the field, clearly it also has an important role to play in beginning to create a structure for the study of the works in its collection and of the field as a whole. It is these seemingly conflicting roles of provider of raw information on the web and of interpreter in such a fast moving 'discipline' that make the project so interesting and ambitious.

In no part of the project is this paradox more evident than in the Web based database of works. Here the archive benefits from a comparison with photographs and photographic archives. Sekula writes that,

The photographic archives' components are not conventional lexical units, but rather are subject to the circumstantial character of all that is photographable. Thus it is absurd to imagine a dictionary of photographs, unless one is willing to disregard the specificity of individual images in favor of some model of typicality ... Clearly one way of 'taming' photography is by means of this transformation of the circumstantial and idiosyncratic into the typical and emblematic. This is usually achieved by a stylistic or interpretive feat, or by a sampling of the archive's offerings for a 'representative' instance.

Clearly, as in the above description, the DPA, whilst dealing with files and information of similar types, will not be dealing with standard content within these types, especially in such a broad and shifting field. However, an interpretative 'taming' route will be taken with the material chosen by the DPA for the DVD and publication. This approach, however, is not viable for the searchable web database, whose aim is to provide files to the researcher as the practitioner meant them to be seen. If this is not wholly possible visually, the files should at least be as free as possible from any interpretation that might be placed on them by the DPA.

Sekula continues:

Another way is to invent a machine, or rather a clerical apparatus, a filing system, which allows the operator/researcher/editor to retrieve the individual instance from the huge quantity of images contained within the archive. Here the photograph is not regarded as necessarily typical or emblematic of anything, but only as a particular image which had been isolated for purposes of inspection.

The creation of such a filing system is the ultimate aim of the DPA web database. However, Sekula creates a picture of an ideal situation in which a user finds exactly what they are looking for in the filing system. He does not consider that there are inherent biases in the creation of such a system. When attaching metadata to practitioners' files in order to allow database searches, the DPA must respect the descriptions attached to them by practitioners themselves, whilst also trying to create a system which is coherent and consistent (or typical and emblematic) across all the works. In addition, there needs to be some mediation between these two biases and the searches that the user brings to the database. It is a challenging aspect of the project.

This poster will provide delegates with the opportunity to see the breadth of works that the DPA must archive and define. It will be a chance to see how far the DPA has progressed with its acquisitions and with placing these acquisitions within its web 'filing system'. The poster will give delegates the opportunity to test this web searchable database and provide comment. The DPA will welcome the opportunity to discuss with delegates any similar problems and solutions they have encountered.

The archive is a joint project between the Digital Research Unit of the Department of Visual and Performing Arts at The Nottingham Trent University, and the Media and Performance Research Unit, Department of Media and Performance at the University of Salford.

References

Sekula, A. (1996). 'The Body and the Archive'. In R. Bolton (ed). The Contest of Meaning: Critical Histories of Photography. The MIT Press, Cambridge MA.

(2.4.2) An Assistant Tool for Verse-making in Basque Based on Two-Level Morphology

Bertol Arrieta
Xabier Arregi
Iñaki Alegria
University of the Basque Country, Spain

Introduction

In this paper we present a specialised word generator, which aims to be an assistant tool for Basque troubadours. Such a tool allows verse-writers to generate all the words that match a given word ending. We address some interesting aspects, namely the size of the generated list and the need to establish an order of relevance among the listed items.

This work can be seen as a way of re-using computational linguistic tools in the context of the Basque cultural means of expression. The technical foundations of this tool lie in a two-level morphological processor. The way in which words must be generated (starting from the end of the word) leads us to invert the generation process.

"Bertsolaritza": What Is It?

"Bertsolaritza" (the Basque term for verse-making) is an oral or written literary form with a long tradition and great popularity in the Basque Country. Similar forms are found in other countries, such as Cuba.

While the written mode is similar to poetry, the oral mode has a peculiarity: troubadours sing verses without previously knowing the theme. In other words, a theme is given to the singers and in a few seconds they have to think of a set of verses adjusted to the theme. These verses must hold to the formal conditions (measurement and rhyme) of the discipline.

This verse-making task is quite difficult, so great expertise is required. Because of that, some schools are devoted to teaching how to improvise this type of verse. In our view, the tool we are presenting may be quite useful in these verse-schools. For some decades, an oral verse-making competition has been organised in the Basque Country every four years. The wide reach of this event (thousands of Basques follow this competition with great interest, live or on TV) is a clear demonstration of the importance of this discipline. The idea of designing the tool presented here arose from this background. We hope that such an application will be a useful aid in the task of finding rhymes, particularly for inexperienced troubadours.

Reversing Of The Morphological Description

To make this tool we have re-used a morphological analyzer/generator for Basque developed a few years ago (Alegria et al., 96) and integrated into several tools such as spelling correctors and ICALL systems (Maritxalar et al., 97). The morphological description is based on Koskenniemi's two-level morphology model (Koskenniemi, 83).

The two-level system is based on two main components:

A lexicon where the morphemes (lemmas and affixes) and the possible links among them (morphotactics) are defined. The lexicon is divided into different sublexicons and each lexicon entry specifies its morphotactical information by means of a continuation class which is a set of sublexicons. Combining sublexicons (nodes) and continuation classes (arcs) the graph of morphotactics is defined.
A set of rules which controls the mapping between the lexical level and the surface level (changes at surface level when morphemes are linked) due to the morphophonological transformations (morphophonemics).

In order to get our inverted morphological analyser/generator for Basque we needed to reverse this morphological description. The goal is to build an inverted morphological generator for Basque, which will control the order of the proposals according to their suitability for being a rhyme. The inverted morphological generator will obtain all the possible forms corresponding to a known ending, instead of generating the possible forms corresponding to the beginning. We took into account two choices to reverse the morphological description.

The first one consists of manipulating the automaton that is created from the morphological description of Basque. This option initially looked good because we did not need to manipulate the lexicon and the rules; we only manipulated the automaton. But, analysing this option in depth, we realised that our inverted Deterministic Finite Automaton (DFA) would actually become a Non-Deterministic Finite Automaton (NDFA) at an intermediate stage of the transformation process, and trying to convert the NDFA back into a DFA would cause a combinatorial explosion.
The second option consists of manipulating and reversing the lexicon and the rules directly, before using the compilers (Karttunen and Beesley, 92)(Karttunen, 93). This approach, therefore, involves the implementation of the programs that invert the lexicon, the morphotactics and the phonological rules automatically.
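
The risk attached to the first option can be illustrated with a toy automaton. The sketch below is not the automaton compiled from the Basque description; the states, symbols and function names are invented for illustration. It simply flips every arc of a small DFA and checks whether the result is still deterministic.

```python
from collections import defaultdict

def reverse_transitions(transitions):
    """Flip every arc of a DFA given as {(state, symbol): next_state}."""
    reversed_arcs = defaultdict(set)
    for (source, symbol), target in transitions.items():
        reversed_arcs[(target, symbol)].add(source)
    return dict(reversed_arcs)

def is_deterministic(arcs):
    """True if every (state, symbol) pair leads to exactly one state."""
    return all(len(targets) == 1 for targets in arcs.values())

# Hypothetical toy DFA accepting "ab" and "cb": state 0 is initial, 3 is final.
toy_dfa = {(0, "a"): 1, (0, "c"): 2, (1, "b"): 3, (2, "b"): 3}

reversed_toy = reverse_transitions(toy_dfa)
print(reversed_toy[(3, "b")])          # two possible targets for one symbol
print(is_deterministic(reversed_toy))  # the reversed machine is an NDFA
```

After reversal, state 3 has two outgoing arcs labelled "b", so the machine is no longer deterministic. Restoring determinism by the standard subset construction can, in the worst case, produce a number of states exponential in the size of the NDFA.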

Considering the risks of the first choice, we decided to develop the second method. This process was divided into three steps:

Reversing the lexicon: This task deals with the inversion of all the morphemes. The order of the characters inside the morphemes is inverted. For instance 'big' would be converted to 'gib'.
Converting the continuation classes into "backward classes": The basis of the morphotactics in the two-level model is the continuation classes (Koskenniemi, 83). We have programmed a script to convert the continuation classes into "backward classes", so that for each inverted morpheme we have the group of morphemes that can go before it. This looks easy, but it raises some problems. Lexicons containing final classes have to be defined as root lexicons, and consequently the backward class of the original root lexicons must be null.

For example: Let ADJECT be the continuation class of adjectives with two syllables or less. Suppose that this class has a single lexicon containing the suffixes -er and -est, and that the continuation class of these suffixes is null. Once the conversion has been made, -er and -est will be in the root lexicon, and their backward class will include the adjectives with two syllables or less.

Reversing the rules: The rules are expressed as follows:

To reverse a rule, only its contexts have to be changed: the left and right contexts are swapped and each one is reversed. The contexts are regular expressions, so it is necessary to distinguish between data (to be reversed) and regular-expression operators and reserved characters (to be left as they are).

For example, the rule y:i <=> _ +: s #: ; will be converted to y:i <=> #: s +: _ ;
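
The three steps can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the scripts used in the project: the lexicon is a plain dictionary, backward classes are computed per sublexicon rather than per morpheme, rule contexts are flat space-separated symbol sequences with no regular-expression operators, and the names "Root" and "AdjSuffix" are hypothetical.

```python
def reverse_lexicon(lexicon):
    """Step 1: reverse the characters of every morpheme.

    lexicon maps a sublexicon name to a list of
    (morpheme, continuation class) pairs, where a continuation
    class is a set of sublexicon names.
    """
    return {
        name: [(morpheme[::-1], classes) for morpheme, classes in entries]
        for name, entries in lexicon.items()
    }

def backward_classes(lexicon):
    """Step 2: for each sublexicon, collect the sublexicons that can precede it."""
    preceding = {name: set() for name in lexicon}
    for name, entries in lexicon.items():
        for _morpheme, classes in entries:
            for following in classes:
                preceding.setdefault(following, set()).add(name)
    return preceding

def new_root_lexicons(lexicon):
    """Sublexicons holding a word-final entry become roots after inversion."""
    return {
        name
        for name, entries in lexicon.items()
        if any(not classes for _morpheme, classes in entries)
    }

def reverse_rule(rule):
    """Step 3: swap the two contexts of a rule and reverse each of them.

    Only handles flat contexts such as '+: s #:'; brackets and
    regular-expression operators are outside this sketch.
    """
    body = rule.strip().rstrip(";")
    pair, operator, contexts = body.split(None, 2)
    left, right = contexts.split("_")
    new_left = " ".join(reversed(right.split()))
    new_right = " ".join(reversed(left.split()))
    parts = (pair, operator, new_left, "_", new_right, ";")
    return " ".join(part for part in parts if part)

# Hypothetical toy lexicon modelled on the English examples in the text.
toy = {
    "Root": [("big", {"AdjSuffix"}), ("small", {"AdjSuffix"})],
    "AdjSuffix": [("er", set()), ("est", set())],
}

print(reverse_lexicon(toy)["Root"][0][0])
print(backward_classes(toy))
print(new_root_lexicons(toy))
print(reverse_rule("y:i <=> _ +: s #:;"))
```

In the toy lexicon the suffix sublexicon becomes the new root, and the backward class of the original root is empty, which matches the behaviour described for the ADJECT example.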

Application To "Bertsolaritza": Finding Words That Rhyme With An Ending

Once the inverted analyser/generator for Basque was developed, we reused it in an application that finds rhymes based on the final part of a word. The application inverts the character sequence given by the user and then launches the generation with our inverted morphological generator. The output of the generation process - that is, all the words that match the given ending - is inverted again before being shown to the user in a Tcl/Tk interface. In this way our tool returns all the Basque words that have the same final sequence of characters as the sequence given by the user. In other words, the application finds all the words that rhyme with the word-ending given by the user.
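
The invert, generate, invert-again cycle can be sketched as below. As a deliberate simplification, the inverted morphological generator is replaced here by a sorted list of reversed full word forms, so the sketch shows only the flow of data; the word list and function names are invented for illustration.

```python
from bisect import bisect_left

def build_reversed_index(words):
    """Store every word reversed, in sorted order, so endings become prefixes."""
    return sorted(word[::-1] for word in words)

def find_rhymes(reversed_index, ending):
    """Return all indexed words that end with the given character sequence."""
    key = ending[::-1]                       # invert the user's input
    start = bisect_left(reversed_index, key)
    matches = []
    for reversed_word in reversed_index[start:]:
        if not reversed_word.startswith(key):
            break                            # sorted order: no later match possible
        matches.append(reversed_word[::-1])  # invert again before display
    return matches

# Stand-in word list; the real tool generates forms from the lexicon instead.
index = build_reversed_index(["biggest", "guest", "smallest", "table", "west"])
rhymes = find_rhymes(index, "est")
print(rhymes)
```

A static list only works for a closed vocabulary. In an agglutinative language the set of word forms is too large to enumerate in advance, which is why the tool described here generates the candidates from the inverted lexicon and rules.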

In order to improve the usefulness of the application, we considered it necessary to face the problem of the huge quantity of generated words that match with the sequence given by the user. Two solutions were implemented:

1. Establishing a kind of categorisation or class-partition among the morphemes, so that only one example (representative of the class) is returned when all the elements of the class are suitable to be shown. For instance, if the input is 'est', instead of returning all the adjectives with the superlative form added (too long!)

big + est --> biggest
small + est --> smallest
thin + est --> thinnest
...

the application will return only one example and a short explanation:

BIG + est --> biggest (ADJECTIVE + est)

2. Returning words sorted in the order that verse-makers prefer. The quality of the rhyme is better if the word is not a compound or a declined form. In the example above it would be better to use rhymes like 'guest' than declined words like 'smallest'.
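
Both solutions can be combined in a small presentation routine, sketched below. The Candidate record, its fields and the class label are hypothetical; the sketch also assumes that every member of a class is a suitable match, so that one representative per (class, suffix) pair is enough.

```python
from typing import NamedTuple, Optional

class Candidate(NamedTuple):
    word: str
    stem: str
    suffix: Optional[str]      # None for a word that is not declined
    stem_class: Optional[str]  # hypothetical class label, e.g. "ADJECTIVE"

def present(candidates):
    """List undeclined words first, then one representative per class."""
    # Solution 2: words that are not declined are the better rhymes.
    plain = sorted(c.word for c in candidates if c.suffix is None)

    # Solution 1: keep only the first example of each (class, suffix) pair.
    representatives = {}
    for c in candidates:
        if c.suffix is not None:
            representatives.setdefault((c.stem_class, c.suffix), c)

    grouped = [
        f"{c.stem.upper()} + {c.suffix} --> {c.word} ({cls} + {suffix})"
        for (cls, suffix), c in representatives.items()
    ]
    return plain + grouped

candidates = [
    Candidate("biggest", "big", "est", "ADJECTIVE"),
    Candidate("guest", "guest", None, None),
    Candidate("smallest", "small", "est", "ADJECTIVE"),
    Candidate("thinnest", "thin", "est", "ADJECTIVE"),
]
for line in present(candidates):
    print(line)
```

With this input the undeclined word 'guest' is listed first, and the three superlatives collapse into a single line of the form shown above.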

Conclusions And Future Improvements

Basque is a pre-Indo-European language of unknown origin, quite different from the surrounding European languages. The declension of Basque has fourteen different forms for each of the singular, plural and undefined numbers. All of these forms are added at the end of the word. Moreover, Basque is an agglutinative language in which morphemes can be added to other morphemes. These characteristics show the importance of the final parts of Basque words, which leads us to think that the inverted morphological analyser/generator will be useful for different applications. We have found an interesting use for such a generator in the world of "bertsolaritza". Given that the final parts of words (rhymes) are very important in verses, the inverted morphological analyser/generator can be an important aid for writing verses. Furthermore, an automatic method for inverting the morphological description has been defined. Such a method can be reused for any other language, provided that a two-level description is available as the starting point.

As future work we are considering (i) returning words with assonant rhyme; (ii) dealing with semantics in the selection module in order to improve the order of presentation; and (iii) publishing the application as a web page.

Acknowledgements

We would like to thank Xerox for letting us use their tools, and especially Lauri Karttunen.

References

Alegria, I., Artola, X., Sarasola, K. and Urkia, M. (1996). Automatic Morphological Analysis of Basque. Literary and Linguistic Computing 11 (4): 193-203. Oxford University Press.
Karttunen, L. and Beesley, K.R. (1992). Two-Level Rule Compiler. Xerox ISTL-NLTT-1992-2.
Karttunen, L. (1993). Finite-State Lexicon Compiler. Xerox ISTL-NLTT-1993-04-02.
Karttunen, L. (1994). Constructing Lexical Transducers. Proc. of COLING '94. 406-411.
Koskenniemi, K. (1983). Two-level Morphology: A General Computational Model for Word-Form Recognition and Production. University of Helsinki, Dept of General Linguistics. Publications No. 11.
Maritxalar, M., Diaz de Ilarraza, A. and Oronoz, M. (1997). From Psycholinguistic Modelling of Interlingua to a Computational Model. Proc. of CONLL97 Workshop (ACL Conference). Madrid 1997.
Lekuona et al. (1980). Bertsolaritza. Jakin 14. eta 15. Donostia 1980.


(2.4.3) The Second Version of the ICAME CD-ROM

Knut Hofland
University of Bergen, Norway

The demo will present the second edition of the ICAME CD-ROM. The CD contains 20 corpora of written, spoken and historical texts, with approximately 17 million words. The CD-ROM comes with full versions of WordSmith and TACT and the retrieval component of WordCruncher (for DOS). Most of the material is indexed with WordCruncher. All the manuals are included in electronic form.

Reference: <http://www.hit.uib.no/icame/cd>


(2.4.4) Working with Alignment of Text and Sound in Spoken Corpora

Knut Hofland
University of Bergen, Norway

The Bergen Corpus of London Teenage Language (COLT) has been transcribed, and the cassette tapes have been digitized to Windows WAV files (9 GB). The texts have been time-aligned with the sound files at the word level by the company Softsound in the UK. The poster will describe how this material is made available through the Corpus WorkBench from IMS in Stuttgart. The user can search the corpus by means of a Web browser and, from the resulting concordance, play the sound corresponding to each occurrence (5-15 seconds). For this purpose a program was written to deliver small pieces of a sound file across the Web. These sound extracts can be saved by the user and further analyzed with signal processing programs.
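
The abstract does not say how the delivery program was implemented. As a rough illustration of the central step, cutting a short extract out of a large WAV file, the following sketch uses the wave module from the Python standard library; the function name and parameters are invented, and the Web delivery itself is left out.

```python
import io
import wave

def extract_clip(wav_source, start_s, end_s):
    """Return the bytes of a WAV file covering start_s..end_s of wav_source.

    wav_source may be a file name or a binary file-like object.
    """
    with wave.open(wav_source, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        total = reader.getnframes()
        # Clamp both ends of the requested window to the file length.
        first = min(max(0, int(start_s * rate)), total)
        last = min(max(first, int(end_s * rate)), total)
        reader.setpos(first)
        frames = reader.readframes(last - first)

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setparams(params)    # same channels, sample width and rate
        writer.writeframes(frames)  # header frame count is fixed on close
    return buffer.getvalue()
```

The returned bytes form a complete WAV file, so an extract produced this way could be saved by the user and opened in other programs, as described above.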

Two Norwegian spoken corpora are also available for searching in this way. In one corpus, a mark was inserted manually in the transcripts every 10 seconds, and a program then generated an interpolated time stamp for each word. In the other corpus, the program SyncWriter was used while transcribing the text. This program keeps track of time information for each transcribed unit, and this information can be extracted from the data file together with the text. The time stamp for each word is interpolated between these values, and the text and time information are indexed by the search software.
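
The interpolation step might look like the sketch below. It assumes linear interpolation by word position between two neighbouring marks, i.e. that every word between two marks takes the same amount of time; the abstract does not specify the method actually used, and the function name and data layout are invented.

```python
def interpolate_word_times(words, marks):
    """Assign a time stamp to each word by linear interpolation.

    words: list of transcribed tokens.
    marks: sorted list of (word_index, seconds) pairs, one per manual
           or SyncWriter time mark.
    Words before the first mark or after the last one get None.
    """
    times = [None] * len(words)
    for (i0, t0), (i1, t1) in zip(marks, marks[1:]):
        span = i1 - i0
        for i in range(i0, i1):
            times[i] = t0 + (t1 - t0) * (i - i0) / span
    last_index, last_time = marks[-1]
    if last_index < len(words):
        times[last_index] = last_time
    return times

# Five words with marks at the first word (0 s) and the fifth word (10 s).
words = "a b c d e".split()
print(interpolate_word_times(words, [(0, 0.0), (4, 10.0)]))
```

Interpolating by character or syllable count instead of word count would be a natural refinement, since long words usually take longer to say.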

References:

COLT: <http://helmer.hit.uib.no/colt/>
Softsound Speech/Text alignment <http://www.softsound.com/SpeechText.html>
Corpus WorkBench <http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/index.html>
SyncWriter <http://www.sign-lang.uni-hamburg.de/Software/syncWRITER/info.html>
Demo concordance <http://helmer.hit.uib.no/test-of-sound.html>
