ACH/ALLC 1993 Conference Report

Peter Flynn Computer Centre, University College, Cork, Ireland

This report is a summary of the joint conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, held at Georgetown University, Washington DC, 16–19 June 1993. It contains a précis of the text published in the preprints, supplemented by the author's notes, but omissions occur for a few sessions where (a) no paper was available; (b) a panel discussion was held viva voce; or (c) a fuller report is available from the speaker. In dealing with topics sometimes outside my own field, I will naturally have made mistakes, and I ask the authors' pardon if I have misrepresented them.

A hypertext version of this report is available on the Internet and can be accessed through the World Wide Web using lynx, mosaic or similar browsers.

Tuesday, June 15

Preconference Activities

Wednesday, June 16

Opening Session

Welcomes: Mr. John J. DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus; Rev. Robert B. Lawton, S.J., Dean, Georgetown College; Susan K. Martin, University Librarian; Nancy Ide, President, Association for Computers and the Humanities; Susan Hockey, President, Association for Literary and Linguistic Computing

Keynote Speaker: Clifford Lynch, Director of Library Automation, Office of the President, University of California

The opening ceremony was held in the splendour of Gaston Hall, Georgetown University. Dr Michael Neuman, organiser of the conference, warmly welcomed attendees and then introduced each of the speakers.

In his welcoming speech, Mr John J DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus, noted that it was impressive that the Georgetown conference followed Oxford and preceded the Sorbonne. Georgetown had a campus in Italy and there seemed to be similarities between Georgetown and Florence in literature, history, philosophy and art. Working in computing in the humanities must be a little like working in Florence in the 16th century. We stand on the verge of several possibilities, and the very idea of `text' is central to shaping these possibilities. We will be facing serious questions in the years ahead, and it was appropriate that the purpose of the conference was to improve learning.

The Rev Robert B Lawton SJ, Dean of Georgetown College, spoke of the computer as sophisticated technology for retrieving and processing information, representing a profound movement in human evolution in which we extend our very powers of thinking. He hoped that the conference would lead to profitable conversations which would enrich the time spent at Georgetown.

Nancy Ide welcomed the audience on behalf of the Association for Computers and the Humanities. She concluded by pointing out that the theme suggests that we are at an important moment, setting the scene for the conference.

On behalf of the Association for Literary and Linguistic Computing, Chairman Susan Hockey complimented the Georgetown University organisers on their effort and effectiveness in organising the conference. The programme committee had put together a programme that shows where literary and linguistic computing and humanities computing will make contributions well into the next decade. There is great potential for working together with librarians to pursue electronic research, so that a significant contribution can be made to the electronic library of the future. The conference is a valuable opportunity to bring together the concerns of librarians and scholars with the skills of computer scientists, to develop programs for the creation of electronic texts and their manipulation.

Susan K. Martin, University Librarian, also noted the growing interest in electronic texts in the research library community. Librarians will take on new and exciting roles as the true potential of electronic texts becomes more fully understood.

In the opening keynote address, Clifford Lynch, Director of Library Automation, Office of the President, University of California, surveyed the current and future scenes for electronic information delivery and access. In a presentation which ranged over many topics with great clarity and vision, Dr Lynch stressed that the future lay in electronic information and in getting definitional handles on its components. He spoke about the technology and computing methods whereby some 12,000 constituent networks exist in which ideas can be exchanged and information accessed. Information now exists that is independent of particular computer technology and is usable through open standards; network servers can migrate from one generation of technology to the next, preventing the preservation disasters of the past, and information can now cross many generations of technology.

Inspiration and new analysis tools continue to allow researchers to build on the work of others. Databases and knowledge management by libraries have allowed them to become a central part of scientific enterprise to be used and shared internationally. Networks act as a facilitator of collaboration and for the inclusion of geographically remote researchers.

To establish how to handle electronic texts, there was a need for coordination between libraries, centres for electronic texts and initiatives such as the TEI. He also pointed out that intellectual property and copyright in textual material are turning out to be an incredible nightmare.

He proposed a move to overcome these problems by developing a superstructure for the use of textual resources. Networked information, he suggested, is the right perspective from which to think about these problems, and its future provides an interesting voyage, fundamental to the issues of scholarship.

Track 1, 11.00: Vocabulary Studies. Chair: Christian Delcourt (Université de Liège)

Douglas A Kibbee (University of Illinois) The History of Disciplinary Vocabulary: A Computer-Based Approach to Concepts of `Usage' in 17th-Century Works on Language.

By linking over 50 texts dealing with the French language, from Estienne's treatises to the Dictionnaire de l'Académie, in which the key issues in the debate over usage are mentioned (dialects, archaisms, neologisms, foreign borrowing, spelling, pronunciation, sociolinguistic variation, etc), it is possible to reconstruct what constituted the metalanguage of grammatical discussion in 17th-century France. He argues that the full-text database techniques of corpus linguistics can be brought to bear on the neglected analysis of the history of discourse (in particular the definition of `usage'), and that the importance of these disciplines in another age cannot be subject to current theoretical fashions.

Terry Butler, Donald Bruce (University of Alberta) Towards the Discourse of the Commune: Computer-Aided Analysis of Jules Vallès' Trilogy Jacques Vingtras

This study concentrates on the representational aspects of the discursive status of the Paris Commune of 1871, using a computer-aided analysis of the titular trilogy. The authors' hypothesis (that Proudhon's formal proposition of `anarchism' is realized in the narrative, metaphor and lexical items of Vallès, Rimbaud and Reclus) is being tested in two stages, firstly on unmarked texts, and secondly using PAT and Tact on TEI-marked versions of the texts, both to verify the existence of the empirical regularities and to ascertain heuristically any undiscovered patterns or relationships.

Track 2, 11.00: Statistical Analysis of Corpora. Chair: Nancy Ide (Vassar College)

Hans van Halteren (University of Nijmegen) The Usefulness of Function and Attribute Information in Syntactic Annotation

The author distinguishes two types of corpus exploitation: micro, where a specific phenomenon (for example, a linguistic element) is studied in detail, and macro, where groups of phenomena are studied on a corpus-wide basis (for example, to derive a probabilistic parser). Given the substantial effort needed to create syntactic annotation of corpus material, he examines the level of detail in annotation required for examples of each of the two types of exploitation identified. Micro-exploitation is usually more successful if it involves only categorisation, rather than exploring functional relationships; the macro-exploitation of parser generation is more problematic, since the use of the data is much more varied.

R Harald Baayen (Max-Planck Institute for Psycholinguistics) Quantitative Aspects of Lexical Conceptual Structure

In the analysis of lexical conceptual structure, distributional data may help in solving problems of linguistic underdetermination. With morphological productivity as an example, the author uses the Dutch Eindhoven corpus to show that the frequency distributions of inchoative and separative readings of the Dutch prefix `ont-' are statistically non-distinct. He argues that a linguistic analysis is called for in which deverbal and denominal reversatives are assigned identical lexical conceptual structures.

Elizabeth S Adams (Hood College) Let the Trigrams Fall Where They May: Trigram Type and Tokens in the Brown Corpus

Analysis of the distribution of trigram tokens in the Brown corpus shows that one-sixth of the trigrams occurred only once, 30 percent between 2 and 10 times, and a quarter between 11 and 100 times, with only 24 trigrams occurring over 10,000 times. A comparison of the increase in the number of types as documents were added (once in order of occurrence and once in random order) indicates that the incremental addition of documents to the corpus will not push the number of types over about 11,000. The computational effectiveness of trigram-based retrieval is emphasized, given its advantages in terms of usability.
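The type/token tally described here can be sketched in a few lines of Python (a minimal illustration only, not Adams's program; it counts character trigrams, and the study's unit of analysis may differ):

```python
from collections import Counter

def trigram_counts(text):
    """Count trigram tokens: every overlapping three-character window."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def frequency_bands(counts, bands=((1, 1), (2, 10), (11, 100))):
    """Tally how many trigram types fall into each token-frequency band."""
    return {band: sum(1 for freq in counts.values() if band[0] <= freq <= band[1])
            for band in bands}

counts = trigram_counts("the quick brown fox jumps over the lazy dog")
# 'the' occurs twice; the type/token distinction falls out of the Counter
```

Scaling this up to the Brown corpus is a matter of accumulating one Counter over all documents, which also makes the incremental type-growth comparison straightforward.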

Track 3, 11.00: The Academical Village: Electronic Texts and the University of Virginia (Panel)

John Price-Wilkin (University of Virginia) Chair
Jefferson's term for the integrated learning environment he envisaged has been taken as the metaphor for a vigorous pursuit of the development of computing resources for the Humanities at the University of Virginia. Built around SGML-tagged texts on VT100 and X-Windows platforms using PAT and Lector, the Institute for Advanced Technology in the Humanities is designed to place nationally-recognised scholars in an environment where they can experiment freely with computer-aided research projects.
Kendon Stubbs (University of Virginia)
The electronic text initiative got under way in 1991 with the goals of providing facilities to ordinary faculty, graduates and students (rather than specialists); making the texts available remotely, rather than only on Library premises; and focusing on SGML-tagged texts. It has proven its worth in making Humanities computing a high-impact, high-visibility and low-cost initiative; a catalyst for innovation elsewhere in the Library; and a part of the infrastructure and model for future development.
David Seaman (University of Virginia)
The Electronic Text Center is open most of the Library's regular hours, providing a walk-in service. Apart from the service, it provides an introduction to the technology and information about Humanities computing in a non-threatening way. The early signs are that faculty and researchers take readily to SGML-conformant texts and online tools.
David Gants (University of Virginia)
Two examples of computer applications in the Humanities come from teaching experience and from continuing research.
Edward Ayers (University of Virginia)
One project follows two communities, one Northern and one Southern, through the era of the [American] Civil War. It can be conceived as thousands of intertwined biographies, tracing the twists and turns in people's lives as they confronted the central experience of the American nation. The technology allows the inclusion of scanned images from microfilm of newspapers, the manuscript census, maps and other images, views of the battlescape, and political, economic and military news of the time.

Track 1, 2.00: Interrogating the Text: Hypertext in English Literature (Panel)

Harold Short (King's College, London), Chair
This session emphasises the pædagogical theory behind courseware design, examines the elaborate claims which have been made concerning the revolutionary impact of hypertext in education, and considers its facility for democratising education and allowing student-centred learning.
Patrick W. Conner, Rudolph P. Almasy (West Virginia University) Corpus Exegesis in the Literature Classroom: The Sonnet Workstation
The Sonnet Workstation is a HyperCard implementation of a literary corpus which allows students to read, compare and write about any corpus of short texts. It incorporates hypertext links to annotations and other texts, and offers a search routine to allow students to carry out searches of several megabytes of 16th-century English sonnets with a glossary and online thesaurus.
Mike Best (University of Victoria) Of Hype and Hypertext: In Search of Structure
Two practical examples of programs developed by the author illustrate the use of hypertext and explore some of the theoretical questions. DynaMark is a classroom program to allow commenting and reviewing by instructors and students, allowing the attachment of comments to text, and to enable the text to be studied as it were under a microscope. By contrast, Shakespeare's Life and Times lets the student expand from the text out into the world, using HyperCard to provide the links between classroom and the library, with access to hypermedia links such as music of the period and text spoken in the relevant dialects.
Stuart Lee (Oxford University) Hypermedia in the Trenches: First World War Poetry in Hypercard - Observations on Evaluation, Design, and Copyright
Much software is successful in packages based on periods culturally removed from the time of today's students (such as Beowulf or Shakespeare). Lee and Sutherland's HyperCard version of Isaac Rosenberg's Break of Day in the Trenches branches out into three main areas: Rosenberg's own life; analogues; and World War I. Rather than being a definitive teaching tool for WWI poetry, it is more of a prerequisite study for a tutorial or seminar on the poet. In an evaluation using a group of A-level literature students, a worrying 96% of them enjoyed it but felt they no longer needed to research the material as they had `seen everything'.

Track 2, 2:00: Discourse and Text Analysis. Chair: Estelle Irizarry (Georgetown University)

Greg Lessard, Michael Levison (Queen's University) Computational Models of Riddling Strategies

Previous research has demonstrated that there is a formalisable, learnable set of mechanisms which can generate, in principle, an unlimited set of `Tom Swifties' (a form of wordplay such as `I hate Chemistry, said Tom acidly'). This is now extended to analyse structures such as riddles (`Why did the dog go out into the sun? To be a hot dog'). Such riddles share an essential trait with Tom Swifties: they are learned and learnable linguistic strategies.

The VINCI natural language generation environment offers a context-free phrase-structure, a syntactic tree transformation mechanism, a lexicon and lexical pointer mechanism and a lexical transformation mechanism to provide a modelling environment suitable for such analyses.

The model analysed allows the selection of the specific different semantic traits which pose the problem, and generates a question containing them. Three different kinds of question are exemplified, and the paper examines in more detail the linguistic constraints on riddles, in particular the tension between lexicalisation of the correct answer versus productivity.

Walter Daelemans, Antal van den Bosch (Tilburg University), Steven Gilles, Gert Durieux (University of Antwerp) Learning Linguistic Mappings: An Instance-Based Learning Approach

One of the most vexing problems in Natural Language Processing is the linguistic knowledge acquisition bottleneck. For each new task, basic linguistic datastructures have to be handcrafted almost from scratch. This paper suggests the application of Machine Learning algorithms to automatically derive the knowledge necessary to achieve particular linguistic mappings.

Instance-Based Learning is a framework and methodology for incremental supervised learning, whose distinguishing feature is the fact that no explicit abstractions are constructed on the basis of the training examples during the training phase: a selection of the training items themselves is used to classify new inputs.

A training example consists of a pattern (a set of attribute/value pairs) and a category. The algorithm takes as input an unseen test pattern drawn from the input space and associates a category with it. The paper compares the results of this approach quantitatively to the alternative similarity-based approaches, and qualitatively to handcrafted rule-based alternatives.
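The distinguishing feature of the framework, storing training items verbatim and classifying by the most similar stored instance, can be illustrated with a minimal sketch (my own simplification using a simple feature-overlap metric, not the authors' implementation):

```python
def overlap(a, b):
    """Similarity = number of matching attribute values (simple overlap metric)."""
    return sum(1 for x, y in zip(a, b) if x == y)

class InstanceBasedLearner:
    """Stores training instances verbatim; no abstraction is built during training."""

    def __init__(self):
        self.memory = []  # list of (pattern, category) pairs

    def train(self, pattern, category):
        self.memory.append((tuple(pattern), category))

    def classify(self, pattern):
        # assign the category of the most similar stored instance
        return max(self.memory, key=lambda pc: overlap(pc[0], pattern))[1]

ibl = InstanceBasedLearner()
ibl.train(("a", "b", "c"), "X")
ibl.train(("a", "d", "e"), "Y")
ibl.classify(("a", "b", "z"))  # nearest stored instance is ("a", "b", "c"), so "X"
```

Real memory-based learners weight attributes by informativeness; the uniform overlap used here is the simplest possible choice.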

Michael J Almeida, Eugenie P Almeida (University of Northern Iowa) NewsAnalyzer - An Automated Assistant for the Analysis of Newspaper Discourse

The NewsAnalyzer program is intended to assist researchers from a variety of fields in the study of newspaper discourse. It works by breaking up newspaper articles into individual statements, which are then categorised along several syntactic, semantic and pragmatic dimensions. The segmented and categorised text can then serve as data for further analysis along lines of interest to the researcher, for example statistical, content or ideological analysis.

The classification scheme distinguishes between factual and non-factual statements: factual ones are further classified as stative or eventive, and non-factual ones have several further subcategories. This is done through a combination of shallow syntactic analyses, the use of semantic features on verbs and auxiliaries, and the identification of special function words.

The program is implemented in Scheme and runs on Apple Macintosh computers. A hand-coded version has been used in a linguistic-based content analysis of two weeks of newspaper coverage, in which all front-page stories were coded and analysed for writing style, and has also been used to study the ways in which newswriters reported predictions about the Presidential and Vice-presidential candidates [in the 1992 US Presidential election].

Track 3, 2:00: Networked Information Systems. Chair: Eric Dahlin (University of California, Santa Barbara)

Malcolm B Brown (Dartmouth College) Navigating the Waters: Building an Academic Information System

The Dartmouth College Information System (DCIS) is organised as a distributed computing resource using the layered OSI model for its network interactions. It includes a range of features.

The client/server model used introduces a modularity that was not previously available, and the use of the network frees the user from geographical constraints and delays. It is felt that the information thus made available has the potential equal to that of word processing to facilitate the basic work of Humanities scholarship.

Charles Henry (Vassar College) The Coalition for Networked Information (CNI), the Global Library, and the Humanities

The CNI could be the key in supporting the Humanities, as it builds upon its original programs of scholarship enhancement through the creation of a free and accessible global library of electronic holdings via the National Research and Education Network (NREN).

Through its working group `The Transformation of Scholarly Communication', the CNI intends to identify and help promulgate projects and programs in the Humanities that have significant implications for changing scholarship methodology and teaching when made available on the NREN.

Christian-Emil Øre (University of Oslo) The Norwegian Information System for the Humanities

This six-year project started in March 1991 to convert the paper-based archives of the collection departments in Norwegian universities to computer-based form, creating the `Norwegian Universities' Database for Language and Culture'. An estimated 750 man-years of work is required: currently 120 people, recruited from among the skilled unemployed, are engaged on a 50%-50% work and education programme.

The information core is held in SYBASE and accessed using PAT and SIFT on UNIX platforms. Using client-server technology, the database resides on a machine in Oslo and can be accessed through local clients (Omnis7 or HyperCard) or by remote X-terminals.

Data loading has started with four subprojects: coins (a collection of approximately 200,000 items at the University of Oslo), archaeology (reports in SGML on all archaeological sites in Norway), Old Norse (from 1537 CE, approximately 30,000 printed pages) and Modern Norwegian (the creation of a lexical database for the language). All text is stored in TEI-conformant form.

Track 1, 4:00: The Computerization of the Manuscript Tradition of Chrétien de Troyes' Le Chevalier de la Charrette (Panel)

Joel Goldfield (Plymouth State College), Chair and Reporter
The best work in the future of literary computing will be dramatically facilitated by the availability of databases prepared by those scholars who have a masterful knowledge of their discipline, who have availed themselves of detailed, appropriate encoding schemes, and who have envisioned the widest scope of uses of their databases at all levels of scholarship.
Karl D Uitti (Princeton University) Old French Manuscripts, the Modern Book, and the Image
`Text' is often equated with the `final', printed work of an author, but this is frequently an arbitrary construct: before printing, scribes considered themselves a part of the literary process, whereas our editions contain what we believe the mediæval author `wrote'. By replicating in database form an important Old French MS tradition, we wish to augment the resources open to scholars by making available an authentically mediæval and dynamic example of pre-printing technology.
Gina L Greco (Portland State University) The Electronic Diplomatic Transcription of Chrétien de Troyes's "Le Chevalier de la Charrette (Lancelot):" Its Forms and Uses
The Princeton project differs from the Ollier-Lusignan Chrétien database in that it includes all eight manuscripts. The participants believe it is important not to resolve scribal abbreviations but to preserve this information intact: the MS text will be transcribed exactly, with word division, punctuation and capitalization. It is our hope that the electronic `editing' will be continuous, but with control centred in Princeton.
Toby Paff (Princeton University) The `Charrette' Database: Technical Issues and Experimental Resolutions
Treating the `Charrette' materials as a database has several advantages over other approaches. While it provides fast access to words, structures, lines and sections for analysis, it also offers a rich array of resources for dealing with orthographic, morphological, grammatical and interpretative problems. The Foulet-Uitti edition is available in a SPIRES database and is augmented by lexicographical and part-of-speech indexes. The Postgres implementation allows the matching of dictionary searches with images of the manuscripts themselves.

Track 2, 4:00: Computer-Assisted Learning Systems. Chair: Randy Jones (Brigham Young University)

Eve Wilson (University of Kent at Canterbury) Language of Learner and Computer: Modes of Interaction

The difficulties of providing a Computer-Assisted Language Learning (CALL) system are exacerbated by the fact that the teacher is often not a computer specialist. Such packages need to accommodate both the needs and aptitudes of the learner and the goals of the teaching programme. Good interface design is essential so that it is easy for teachers to add material as well as being easy for students to use. The proposed system uses SGML to define the formal structure, and a sample DTD for a student exercise is included.

Floyd D Barrows, James B. Obielodan (Michigan State University) An Experimental Computer-Assisted Instructional Unit on Ancient Hebrew History and Society

The program content covers the story of the Hebrew people from their settlement in the Jordan valley to the fall of Jerusalem in 70 CE. It is implemented in ToolBook 1.5 for the IBM PC but can be ported to HyperCard for the Macintosh. The courseware is designed to provide interactive lessons on the forces that influenced the development of the Hebrew people, and students can select from 12 lesson units which include problem-solving questions. It can be used by enrolled students (for a grade) or by guest users (interest only), and provides pre- and post-test elements to judge performance.

Track 3, 4:00: Information Resources for Religious Studies. Chair: Marianne Gaunt (Rutgers University)

Michael Strangelove (University of Ottawa) The State and Potential of Networked Resources for Religious Studies: An Overview of Documented Resources and the Process of Creating a Discipline-Specific Networked Archive of Bibliographic Information and Research/Pedagogical Material

An increasing number and variety of networked archives related to religious studies have appeared in the last few years, based on LISTSERV and ftp. The author provides an overview of the experience of creating and cataloguing network-based resources in religious studies, and relates the process and online research strategy of writing a comprehensive bibliography and guide to networked resources in the field, The Electric Mystic's Guide to the Internet. The issues of size and growth rate, security, copyright and verification, skills and tools required, and the funding strategies needed are also discussed.

Andrew D. Scrimgeour (Regis University) Cocitation Study of Religious Journals

Cocitation Analysis was developed in 1973 and is used here to study how humanities scholars perceive the similarities of 29 core religion journals. The resultant map is a graphic picture of the field of religious studies as organized by its journals. The map depicts the spatial relationships between each speciality area and also between the individual journals within each area. These maps are useful for providing an objective technique for tracing the development of a discipline over time and are of potential benefit in teaching basic courses in religious studies.
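The raw material for such a map can be sketched as follows (a toy illustration with invented journal abbreviations, not Scrimgeour's data or code): two journals are cocited whenever a single document cites both, and the pairwise counts drive the spatial layout:

```python
from collections import Counter
from itertools import combinations

def cocitation_matrix(reference_lists):
    """For each pair of cited journals, count the documents citing both."""
    pairs = Counter()
    for refs in reference_lists:
        # sort so each unordered pair is counted under a single key
        for a, b in combinations(sorted(set(refs)), 2):
            pairs[(a, b)] += 1
    return pairs

# invented reference lists for three citing documents
docs = [["JBL", "JAAR", "JR"], ["JBL", "JAAR"], ["JR", "Numen"]]
m = cocitation_matrix(docs)
# ("JAAR", "JBL") are cocited in two documents; high counts pull journals together
```

Multidimensional scaling or clustering over these counts then yields the kind of two-dimensional map of speciality areas the paper describes.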

Evening Activities

5:45: ALLC Annual General Meeting
[text needed from Susan Hockey]
8:00: Report of the Text Encoding Initiative
[text needed from Lou Burnard/Michael Sperberg-McQueen]

Thursday, June 17

Track 1, 9:00: Hypertext Applications. Chair: Roy Flannagan, Ohio University

John Lavagnino (Brandeis University) Hypertext and Textual Editing

The recent innovation of hypertext has been taken as providing a clear and convincing solution to the problems facing textual scholars. The author examines what hypertext can and cannot do for editions compared with print.

Space is the predominant factor: it is unlimited in hypertext and has obvious attractions for handling multiple-version and apparatus criticus problems. The other main factor is ordering: hypertext can provide multiple views of a corpus that are unobtainable in print. The mechanisms used are the link, the connection between one point in hypertext and another, and the path, a series of pre-specified links.
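The link-and-path mechanism can be illustrated with a minimal sketch (illustrative only, with invented anchor names; not a description of any actual edition): links connect one anchor to another, and a path is followed by traversing a pre-specified series of links in order.

```python
# A toy hypertext: each anchor id points to the next anchor along a path
# (anchor names are invented for illustration).
links = {"reading-1": "variant-a", "variant-a": "apparatus-note"}

def follow(start, links):
    """Follow a pre-specified series of links (a 'path') from a starting anchor."""
    trail = [start]
    while trail[-1] in links:
        trail.append(links[trail[-1]])
    return trail

follow("reading-1", links)
# -> ["reading-1", "variant-a", "apparatus-note"]
```

A real edition would allow many links per anchor and several named paths; the single-successor dictionary here is the simplest case.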

Despite its obvious attractions, editors have been too limited in what they want from hypertext, seeking only solutions to the problems of traditional publishing rather than taking advantage of the new medium's possibilities.

Risto Miilumaki (University of Turku) The Prerelease Materials for Finnegans Wake: A Hypermedia Approach to Joyce's Work in Progress

The electronic presentation of a complicated work such as Finnegans Wake and its manuscripts can facilitate an international cooperative effort at a critical edition. The present approach covers the drafts, typescripts and proofs of chapters vii and viii of part I of the work, done in Asymetrix' ToolBook OpenScript for MS-Windows/MCI. The implementation allows synoptic browsing of the pages, using a `handwriting' font for manuscripts, a typewriter font for drafts and Times for proofs, with access to graphic images of the manuscript pages themselves.

Catherine Scott (University of North London) Hypertext as a Route into Computer Literacy

Humanities students required to undertake courses in computer literacy often do so on a campus-wide basis, regardless of their discipline. The paper proposes training in hypertext systems as a vehicle for instructing them in file creation, formatting, making links and incorporating graphics, so that they learn to use the computer's screen as a vehicle for presenting arguments, displaying interrelated information and providing readers with choices. The skills they learn are transferable to other applications, and the students who have gone through the UNL course have received it very enthusiastically.

Track 2, 9:00: Parsing and Morphological Analysis. Chair: Paul Fortier (University of Manitoba)

Hsin-Hsi Chen, Ting-Chuan Chung (National Taiwan University) Proper Treatments of Ellipsis Problems in an English-Chinese Machine Translation System

Conjunctions, comparatives and other complex sentences usually omit some constituents. These elliptical materials interfere with the parsing and the transfer in machine translation systems. This paper formulates Ellipsis Rules based on X-scheme. The differences between English and Chinese constructions are properly treated by a set of transfer rules.

Ellipsis is the omission of an element whose implied presence may be inferred from other components of the sentence, as in I like football and Kevin @ tennis (where the `@' stands for an omitted `likes'). The approach to parsing ellipsis is to divide the grammar rule base into normal (N) rules and ellipsis (E) rules. Recognition of one phrase (I like football) can then trigger a differential analysis of the remainder of the sentence, the relationship being governed by the E-rules.
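The division of labour between N-rules and E-rules can be illustrated with a toy sketch (my own drastic simplification in Python, not the authors' system; the verb list and agreement rule are invented for the example):

```python
VERBS = {"like", "likes", "plays", "hates"}  # toy lexicon for the sketch

def resolve_gapping(sentence):
    """Toy E-rule: if the second conjunct of 'X and Y' has no verb,
    copy the verb from the first conjunct (with naive agreement)."""
    left, _, right = sentence.partition(" and ")
    right_words = right.split()
    if not any(w in VERBS for w in right_words):
        verb = next(w for w in left.split() if w in VERBS)
        if verb == "like":      # naive 3rd-person agreement, illustration only
            verb = "likes"
        right_words.insert(1, verb)
    return left + " and " + " ".join(right_words)

resolve_gapping("I like football and Kevin tennis")
# -> "I like football and Kevin likes tennis"
```

The point of the E-rule design is visible even here: the normal analysis of the first conjunct is untouched, and only the recognition of a verbless second conjunct triggers the differential treatment.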

Chen identifies four elliptical constructions in English.

Because the specific features of elliptical construction in English are described by the uniform E-rules, the grammar rules for other phrases need not be changed. A set of lexical and structural transfer rules has been constructed to capture the differences between English and Chinese elliptical constructions, implemented on an English-Chinese machine translation system using Quintus Prolog and C.

Jorge Hankamer (University of California, Santa Cruz) keCitexts: Text-based Analysis of Morphology and Syntax in an Agglutinating Language

Text corpora are used for many purposes in the study of language and literature: frequency tables derived from corpora have become indispensable in experimental psycholinguistics. The keCi analyser has been developed to automatically lemmatise a library of texts in Turkish, a language with an agglutinating morphology.

The system matches the root at the left edge of the input string and follows a morphotactic network to uncover the remaining morphological structure of the string. The developing corpus has been converted from disparate formats into a common scheme in ASCII and, so far, 5000 lines (sentences, about 70Kb) have been `cleaned' by keCi.
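The left-edge root match followed by suffix stripping can be sketched as follows (a drastically simplified illustration, not keCi itself: the toy lexicon is invented, and real Turkish morphotactics and vowel harmony are ignored):

```python
ROOTS = {"ev": "house", "gel": "come"}          # toy root lexicon
SUFFIXES = {"ler": "PLURAL", "de": "LOCATIVE"}  # toy suffix inventory

def analyse(word):
    """Match a root at the left edge, then strip suffixes left to right."""
    for root in sorted(ROOTS, key=len, reverse=True):  # prefer longest root
        if word.startswith(root):
            rest, morphs = word[len(root):], [ROOTS[root]]
            while rest:
                for suf in sorted(SUFFIXES, key=len, reverse=True):
                    if rest.startswith(suf):
                        morphs.append(SUFFIXES[suf])
                        rest = rest[len(suf):]
                        break
                else:
                    return None  # no analysis for this word
            return morphs
    return None

analyse("evlerde")  # -> ['house', 'PLURAL', 'LOCATIVE']
```

A real analyser replaces the flat suffix list with a network constraining which suffix classes may follow which, which is what makes agglutinating morphology tractable.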

Juha Heikkilä, Atro Voutilainen (University of Helsinki) ENGCG: An Efficient and Accurate Parser for English Texts

The ENGCG parser constitutes a reliable linguistic interface for a wide variety of potential applications in the humanities and related fields, ranging from parsing proper via corpus annotation to information retrieval.

It performs morphological analysis with part-of-speech disambiguation and assigns dependency-oriented surface-syntactic functions to input wordforms, using the Constraint Grammar techniques developed by Karlsson (1990), expressing the constraints which exclude inappropriate alternatives in parsing.

The system exists in a LISP development version and a C++ production version running on a Sun Sparcstation 2.

Track 3, 9:00: Documenting Electronic Texts (Panel)

Annelies Hoogcarspel (Center for Electronic Texts in the Humanities), Chair. TEI Header, Text Documentation, and Bibliographic Control of Electronic Texts
While it is estimated that there are thousands of electronic texts all over the world, there is generally no standardized bibliographic control. If electronic texts were cataloged according to accepted standards, duplication could be avoided, and their use encouraged more effectively.
The Rutgers Inventory of Machine-Readable Texts in the Humanities is now maintained by CETH, the Center for Electronic Texts in the Humanities, and uses the standard AACR2 [Anglo-American Cataloging Rules (2nd Ed, 1988)]. The RLINMARC program is used to hold the data in MDF (computer files) format.
The file header described by the proposals (P2) of the Text Encoding Initiative (TEI) incorporates all the information needed to follow the rules of AACR2 as well as other information now often lacking. In particular, proper cataloging can indicate the degree of availability of a text, so where there is uncertainty about copyright questions, an entry could still indicate to the serious scholar whether a copy of the text is available or not.
Richard Giordano (Manchester University)
Lou Burnard (Oxford University)

Track 1, 11:00: Statistical Analysis of Texts. Chair: Joel Goldfield (Plymouth State College)

Thomas B Horton (Florida Atlantic University) Finding Verbal Correspondences Between Texts

In the early 1960s, Joseph Raben and his colleagues developed a program that compared two texts and found pairs of sentences in which one text contained verbal echoes of the other. Despite the flexibility of modern concordance systems, Raben's program appears not to have survived in modern form. This study examines the problem, Raben's solution and possible new approaches.

Using the accepted premise that Shelley was heavily influenced by Milton, Raben developed an algorithm to analyse canonically-converted sentences from the work of both authors. Although the algorithm has not been retested in 30 years, it has now been reimplemented using modern tools, and its effectiveness examined. Work is also proceeding on comparing this approach to a passage-by-passage approach using the ``word cluster'' technique.

David Holmes (The University of the West of England), Michael L. Hilton (University of South Carolina) Cumulative Sum Charts for Authorship Attribution: An Appraisal

Cumulative sum (CuSum) charts are primarily used in industrial quality control, but have found application in authorship attribution studies, and one particular technique (QSUM, Morton & Michaelson) has been the centre of forensic controversy in the UK in some allegedly forged-confession cases.

The QSUM test uses the assumption that people have unique sets of communicative habits, and implements CuSum charts to present graphically the serial data in which the mean value of the plotted variable is subject to small but important changes. But as there is as yet no statistically valid way of comparing two CuSum charts, any decision regarding their significance will necessarily be either subjective or arbitrary.

Experiments with weighted CuSums indicate that they perform marginally better than the QSUM test, but are not consistently reliable: it may be that authors do not follow habit as rigidly as would be needed for CuSum techniques to determine authorship correctly.
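The underlying CuSum computation is simple: accumulate each observation's deviation from the series mean (a minimal sketch with made-up figures; the QSUM test overlays the habit-variable CuSum, scaled to the same range, on the sentence-length CuSum and inspects the charts for divergence):

```python
# Sketch of the basic CuSum computation (illustrative only; the habit
# variable here, counts of two- and three-letter words, is one of those
# used in QSUM-style analyses).

def cusum(values):
    mean = sum(values) / len(values)
    out, total = [], 0.0
    for v in values:
        total += v - mean          # accumulate deviations from the mean
        out.append(total)
    return out

sentence_lengths = [12, 9, 15, 11, 13]     # words per sentence (made up)
habit_counts     = [5, 4, 7, 4, 5]         # 2/3-letter words per sentence

print(cusum(sentence_lengths))
print(cusum(habit_counts))
```

By construction the chart returns to zero at the end of the sample; the controversy described above concerns how, and whether, divergence between two such charts can be judged significant.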

Lisa Lena Opas (University of Joensuu) Analysing Stylistic Features in Translation: A Computer-Aided Approach

Computational linguistics can facilitate the examination of how successful a translation is in replicating important stylistic features of the original text. An example is the Finnish translation of Samuel Beckett's How It Is, which describes the writing process itself and the effort put into it.

A feature of this novel is its use of repetition, and research is cited on the effect of ``shifts'' in style occasioned by the non-coincidence of stylistic conventions such as repetition which differ between two languages. TACT was used to analyse consistency in the use of specific words and phrases in the translation, and it was noted that shifts have indeed occurred.

Track 2, 11:00: Phonetic Analysis. Chair: Joe Rudman (Carnegie Mellon University)

Wen-Chiu Tu (University of Illinois) Sound Correspondences in Dialect Subgrouping

In the classification of the sound properties of cognate words, there is a lexicon-based, equal-weighted technique for quantification which can be used without constructing phonological rules. In a study of Rukai (a Formosan language), a modification of Cheng's quantitative methods was used to store 867 words and variants. These were subjected to statistical refinement and then successfully subgrouped by a process using a data matrix of difference and sameness, a measure of the degree of similarity, and cluster analysis.

Ellen Johnson, William A Kretzschmar, Jr (University of Georgia) Using Linguistic Atlas Databases for Phonetic Analysis

The Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) [USA] has been used to analyse the well-known phonological feature of the loss of post-vocalic /r/. Using a specially-designed screen and laserprinter font in the upper half of the PC character set, database searches can be carried out on complex phonetic strings.

Two methods were examined: assigning each pronunciation a score, and treating each pronunciation as binary (i.e. with or without retroflexion). The techniques used are discussed, and the system for encoding phonetic symbols and diacritical marks is described.

Track 3, 11:00: Preserving the Human Electronic Record: Responsibilities, Problems, Solutions (Panel)

Peter Graham (Rutgers University), Chair
Whoever takes on the responsibility of preserving the electronic human record will find two problems: preservation of the media and preservation of the integrity of the data stored on them. Shifts of culture and training are needed among librarians and archivists to enable the continuation of their past print responsibilities into the electronic media.
Gordon B. Neavill (University of Alabama)
There are parallels between the oral tradition and the manuscript period, and the electronic environment. The malleability of electronic text contrasts sharply with the fixity of print, and the problem is re-emerging of how information should survive through time, and how we authenticate the intellectual content.
W Scott Stornetta (Bellcore)
In addition to authentication there is the problem of fixing a document at a point in time. A technique has been developed at Bellcore for timestamping a document digitally which satisfies both requirements, and work is under way to see if this is an appropriate tool for preserving scholarly information integrity.

Track 1, 2:00: The Wittgenstein Archives at the University of Bergen (Panel) Claus Huitfeldt (University of Bergen), Chair

Claus Huitfeldt, Ole Letnes (University of Bergen) Encoding Wittgenstein
A specialized encoding system is being developed to facilitate preparation, presentation and analysis of the texts being collected for the computerized version of Ludwig Wittgenstein's writings. The objective is to make both a strictly diplomatic, and a normalized and simplified reading-version of every manuscript in his 20,000-page unpublished Nachlass, with its constant revisions, rearrangements and overlap, using a modified version of MECS (Multi-Element Code System). The target is for scanned raster images, MECS transcriptions, a TEI version, the diplomatic and normalised/simplified versions in wordprocessor formats, a free-text retrieval system, and a filter/browser/converter.
Claus Huitfeldt (University of Bergen) Manuscript Encoding: Alphatexts and Betatexts
Within the markup scheme being used there are some important distinctions to handle the complexity of the material:
Alois Pichler (University of Bergen) What Is Transcription, Really?
The encoding of a text involves a number of different activities transferring the multidimensional aspects of a handwritten text into the unidimensional medium of a computer file. The author identifies nine specific kinds of coding and examines their use, and presents an alternative model to the hierarchic one used in MECS-WIT. The requirement of ``well-formedness'' should be regarded as only one rule among the other, equally valid, rules.

Track 2, 2:00: Data Collection and Collections. Chair: Antonio Zampolli (Istituto di Linguistica Computazionale)

Shoichiro Hara, Hisashi Yasunaga (National Institute of Japanese Literature) On the Full-Text Database of Japanese Classical Literature

The spread of computers which can handle Japanese-language processing has led to the start of the National Institute of Japanese Literature (NIJL)'s recension full-text database The Outline of Japanese Classical Literature (100 vols, 600 works), comprising databases containing the Texts, Bibliographies, Utilities (revision notes) and Notes (headnotes, footnotes, sidenotes etc).

Another project is under way to construct a full-text database of current papers in the natural sciences, using DQL and SGML. A visual approach has been chosen to overcome the inherent complexity of DQL's SQL parentage. An elemental query is written in a box attached to the node or leaf, and complex queries can be constructed by gathering these using the mouse.

The possibility of applying TEI standards to Japanese classical literature is being studied.

Ian Lancashire (University of Toronto) A Textbase of Early Modern English Dictionaries, 1499-1659

The scale of the task of adequately documenting Early Modern English may explain why a Dictionary project was not funded several decades ago, when research showed that such a project could turn out larger than the OED. However, a text database of Renaissance bilingual and English-only dictionaries would be feasible as a way of making available information that would appear in an EMED.

Using such an electronic knowledge base, a virtual Renaissance English dictionary could be constructed, using SGML tagging, and inverting the structure of bilingual texts: early results (ICAME, Nijmegen, 1992) indicate that there are some phrasal forms and new senses not found in the OED.

Dionysis Goutsos, Ourania Hatzidaki, Philip King (University of Birmingham) Towards a Corpus of Spoken Modern Greek

The analysis of the perennial and much-disputed problem of diglossia in contemporary Greek, and the problems of teaching Modern Greek both as a first and a foreign language, would be much facilitated if there were a database of the corpus of modern spoken Greek. Such a project has been proposed since 1986, and many of the technical problems identified then have since found at least partial solutions.

There have been many fragmented projects around the world on modern written Greek, and there is now a survey under way to determine the nature and size of the extant corpus.

Track 3, 2:00: Networked Electronic Resources: New Opportunities for Humanities Scholars (Panel)

Christine Mullings (University of Bath), Chair. HUMBUL: A Successful Experiment
The Humanities Bulletin Board (HUMBUL) was established experimentally in 1986 to meet the growing need for up-to-date information about the use of computer-based techniques in the Arts and Humanities in the UK and elsewhere. There are currently over 4,000 users, and the number is growing at around 60 per month. A survey of usage revealed that primary use was made of the Diary and Conferences section, Situations Vacant, and the list of other bulletin boards on JANET (the UK academic network). A recent addition is HUMGRAD, a mailing-list service for postgraduate students, who often feel isolated from other areas of their work, and the possibility of adding facilities like item expiry is being considered.
Richard Gartner (Bodleian Library) Moves Towards the Electronic Bodleian: Introducing Digital Imaging into the Bodleian Library, Oxford
Digital imaging presents new opportunities for conserving material and disseminating it to readers more efficiently than hitherto possible. The Bodleian Library, which has an ongoing programme of microfilming its more valuable material, is now applying digital imaging techniques to some areas, starting with the John Johnson Collection of Printed Ephemera's images of motor cars. The 7,000 images will be available in a Photo-CD database using JPEG, with attached textual information, searchable in full-text, and accessible both as a CD and by FTP (and possibly interactively when bandwidth allows). A further project will be the conversion of 60,000 slides of illuminated mediæval manuscripts with an associated database catalogue.
Jonathan Moffett (Ashmolean Museum) Making Resource Databases Accessible to the Humanities
A type of short-termism pervades society, and this conflicts with the inherent need for humanities project funding to be long-term and ongoing. Information Technology, while not a panacea, may provide more than a palliative, as three projects illustrate: There are thus problems with timescale no less than with users' expectations, personal skills, freedom of access, copyright, publication and upgrades. Bold decisions are required if full use is to be made of IT's potential.

Evening Activities

4:00: ACH Open Meeting
[text from Susan Hockey?]
5:30: Reception in Leavey Conference Center
7:00: Keynote Speaker [Introduction, Roy Flannagan (Ohio University)] Hugh Kenner, Franklin and Calloway Professor of English, University of Georgia
[summary text needed]
8:00: Conference Banquet in Leavey Conference Center

Friday, June 18

Track 1, 9:00: Text Encoding and Encoded Text. Chair: Lou Burnard (Oxford University)

Nancy Ide (Vassar College), Jean Veronis (GRTC/CNRS) An Encoding Scheme for Machine Readable Dictionaries

Previous work has made it clear that the development of a common dictionary encoding format is extremely difficult. This paper describes the encoding scheme for mono- and bilingual dictionaries developed by a working group of the Text Encoding Initiative, intended for use by publishers and lexicographers, computational linguists, and philologists and print historians.

Three different views of dictionaries exist: the typographic view, the textual view and the lexical view. Dictionary structure is typically factored to avoid redundancy, but the way in which the data is factored varies widely from one dictionary to another. The two DTDs provided for in the TEI guidelines cover the `regular' structure, allowing nesting of senses within homographs or entries; and the `free' DTD, which allows any element anywhere.

The high level of information compression within entries makes faithful reproduction of the typographic format difficult, and principles are given for the retention of the textual view where this is desired, and the difficulties of the rearrangements undertaken in providing a lexical view are discussed.

Peter Flynn (University College, Cork) Spinning the Web - Using WorldWideWeb for Browsing SGML

The development of the WorldWideWeb (W3) hypertext system has led to a growth in interest in the use of SGML-encoded texts for the dissemination of information. The relatively primitive facilities offered by the HTML DTD currently in use will be supplanted by those of HTML+ and it is hoped that a mechanism will be developed to allow the replication of TEI-encoded texts in a similar manner. Providing a W3 service is relatively straightforward but entails the acceptance of the onus to keep the machine running for the benefit of remote users. Some difficulties were encountered with diacriticals, but these have largely been solved. More important is the ease of use for the reader, and the ease with which documents can be made available: these are important aspects given the rapid drive towards computer literacy for humanities scholars and students.

Claus Huitfeldt (University of Bergen) MECS - A Multi-Element Code System

The Norwegian Wittgenstein Project (1980-1987) had at that time to construct its own markup system. When it was continued in 1989-90, alternatives were sought. SGML was found lacking in some critical areas (software, overlapping elements and the need to define the DTD in advance).

The Multi-Element Code System (MECS) was developed to allow a more flexible approach, while retaining sufficient similarities to allow accommodation to SGML's reference concrete syntax. It does not presuppose any hierarchical document structure and distinguishes between no-element tags, one-, poly- and N-element tags by the delimiters used. A Program Package contains software to create, validate, [re-]format, analyse and convert MECS-conforming documents.

Although MECS is better for descriptive coding purposes, there are some shortcomings, but work is ongoing to enable a closer coexistence between it and SGML.

Track 2, 9:00: Invited SIGIR Panel on Information Retrieval

Edward Fox (Virginia Technical University), Chair and Presenter Electronic Dissertation Project
Elizabeth D. Liddy (Syracuse University) Use of Extractable Semantics from a Machine Readable Dictionary for Information Tasks
Robert P. Futrelle (Northeastern University) Representing, Searching, Annotating, and Classifying Scientific and Complex Orthographic Text

Track 3, 9:00: Developing and Managing Electronic Texts Centers (Panel)

Mark Day (Indiana University), Chair and Presenter
Anita Lowry (University of Iowa)
John-Price Wilkin (University of Virginia)

Track 1, 11:00: Statistical Analysis in Literature and Philosophy. Chair: Helmut Schanze (Universität-Gesamthochschule)

Wilfried Ver Eecke (Georgetown University) Computer Analysis of Hegel's Phenomenology of Mind

The text of Hegel's Phenomenology of Mind is available in Baillie's translation in electronic form for Wordcruncher or Micro-OCP. The greater accuracy of the computer compared to existing indexes is demonstrated by reference to the larger number of occurrences found when searching for a word, and it is also possible for the words to be located accurately.

Searching for word combinations allowed definitions to be identified and compared with those of other philosophers, and enabled the location of those definitions in their context to be used as a basis for further philosophical interpretation.

Frequency analysis and concordance-making revealed that Hegel used a limited number of concepts to talk about desire, the list being confirmed by the theory of Lacan 150 years later. The author hopes to establish lists of concepts that are used for the first time in a particular problem. Preliminary results are not clear but include promising hints.

Tony Jappy (University of Perpignan) The Verbal Structure of Romantic and Serious Fiction [paper read by Glyn Holmes]

The author investigates the distribution of verbal forms in two novels by Virginia Woolf and three from Mills & Boon. Theoretical linguistics was applied by students using OCP to investigate the constative and commentative aspects of verb usage, and the results were compared with earlier work by Biber [1988]. The study illustrates the way that macroscopic text analysis using text retrieval and simple statistics can be conducted.

Thomas Rommel (University of Tuebingen) An Analysis of Word Clusters in Lord Byron's Don Juan

A computer-aided analysis of the poem provides the textual basis for thematically organised lists of all single elements within word clusters and thus helps determine for what purpose Byron may have used them. The poem was scanned and tagged using a modified SGML structure, and analysed using TUSTEP (Tübinger System von Textverarbeitungs-Programmen). What began as an analysis of clusters finally led to a vast number of new observations on how language is handled in one particular text: while literary computing is ideally suited for the initial preparation of the text, final classification, analysis and interpretation still has to be done by human beings.

Track 2, 11:00: Technological Enhancements. Chair: Mary Dee Harris (Language Technology) [Kathryn Burroughs Taylor substituting]

Kathryn Burroughs Taylor (McLean, Virginia) Transferring Automatic Speech Recognizer (ASR) Performance Improvement Technology to Optical Character Recognition (OCR)

Optical character recognition is the most promising data acquisition technology for the humanities computer user, but the error rate (or rate of correction-requirement) raises the question of how OCR could benefit from lessons learned in problem reduction for automatic speech recognition, as the two technologies appear to share enough of the same problem characteristics. The author proposes a model based on a frequency approach, as used in biphone and triphone modelling in speech recognition, to see if particular combinations of characters are causes of error. A Scanned Language System would take advantage of the power of the grammar and semantics resident in Natural Language systems to enhance the recognition power of OCR systems.
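The frequency idea can be sketched at its simplest: tally which character combinations appear in OCR output but never in a reference corpus, and flag them as likely recognition errors (toy data throughout; a real system would work from large confusion statistics and add the grammatical and semantic layers the author describes):

```python
from collections import Counter

# Sketch: flag character bigrams in OCR output that never occur in a
# reference corpus, as candidates for recognition errors (toy data).

def bigrams(text):
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def suspicious_bigrams(ocr_text, reference_text):
    known = bigrams(reference_text)
    return sorted(b for b in bigrams(ocr_text) if b not in known)

reference = "the quick brown fox jumps over the lazy dog"
ocr_output = "tbe quick brown fox"          # 'h' misread as 'b'
print(suspicious_bigrams(ocr_output, reference))
```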

David J. Hutches (University of California, San Diego) Lexical Classification: Examining a New Tool for the Statistical Processing of Plain Text Corpora

The author defines an algorithm for `mutual information' (Church & Hanks, 1990) which expresses the measure of association between two events (lexical items) in terms of the probability of their occurrence at a given distance (separation). Taking the Lancaster-Oslo-Bergen corpus, experimental phrases were generated, and the algorithm employed to derive word classes which were related to the function of the priming lexical items in the original context. The algorithm is based on the notion that the elements of a phrase mutually constrain one another and on the way in which lexical items can be characterized by the neighbourhoods in which they `live'. The results are encouraging, and may be a means of validating the techniques of representation which can be applied to the statistical analysis of linguistic phenomena in corpora.
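The Church & Hanks measure itself can be sketched as follows (the toy corpus and window size are invented for illustration; the experiments described above used the Lancaster-Oslo-Bergen corpus):

```python
import math
from collections import Counter

# Sketch of the Church & Hanks (1990) mutual-information measure:
#   I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
# with P(x, y) estimated from co-occurrences of y within a fixed
# window to the right of x (toy corpus, illustrative only).

def mutual_information(tokens, x, y, window=2):
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = 0
    for i, t in enumerate(tokens):
        if t == x and y in tokens[i + 1:i + 1 + window]:
            pairs += 1
    if pairs == 0:
        return float("-inf")
    p_xy = pairs / n
    p_x, p_y = unigrams[x] / n, unigrams[y] / n
    return math.log2(p_xy / (p_x * p_y))

corpus = "strong tea and strong coffee but powerful computers".split()
print(mutual_information(corpus, "strong", "tea"))
```

A high score indicates that the pair co-occurs more often than their individual frequencies would predict, which is what lets neighbourhoods of this kind characterize a lexical item.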

Track 3, 11:00: Design Principles for Electronic Textual Resources: Integrating the Uses, Users and Developers (Panel)

Susan Hockey (Center for Electronic Text in the Humanities), Chair
Over the last 30 years or so, humanities electronic textual resources have been developed in an ad hoc manner. More recently, publishers and libraries have begun to work with electronic texts and some specific projects are now being progressed as electronic resources rather than for specific research purposes. The development of software tools has been similarly ad hoc. This panel takes as its premise the view that requirements of scholarship must be the first priority, and that software tools must be tailored to these requirements (as for example, the work of the Text Encoding Initiative).
Nicholas Belkin (Rutgers University)
Elaine Brennan (Brown University)
Robin Cover (Dallas, TX)

Track 1, 2:00: Music Applications. Chair: Gordon Dixon (Manchester Metropolitan University)

Daniel C. Jacobson (University of North Dakota) Multi- Media Environments for the Study of Musical Form and Analysis

The integration of computer-assisted graphics, animation and CD-ROM/audio within an object-oriented environment gives today's music theory teachers the means to create platforms for the simultaneous investigation of local and large-scale musical processes. The author presents a demonstration of such techniques applied to Schubert's Erlkönig. When used in such ways, the power of multimedia allows students to see and hear sophisticated long-range structural relationships that could not be demonstrated using traditional teaching methods. Music educators who use and design this technology must do so with the full awareness that our visual representations no longer lie dormant on the page.

John Morehen (University of Nottingham) Computers and Authenticity in the Performance of Elizabethan Keyboard Music

The choice of keyboard fingering for a piece of early music can have a profound bearing on its musical interpretation. The paper describes a computer program which has been developed for determining an appropriate fingering for pieces of keyboard music of the English Virginalist school. Using a databank of fingerings from 1560 to 1630, and an input passage for which fingering is sought, the program scans first for exact matches, then for partial matches from 12 notes down to two notes, and finally uses comparative matches varying pitch and duration. The program has so far only been used for the period mentioned, but it could be used without modification for any sufficiently large body of previously-fingered music.
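The longest-match lookup described above can be sketched as follows (the databank entries and note representation are invented for illustration; the comparative-matching stage varying pitch and duration is omitted):

```python
# Sketch: search a databank of previously fingered passages for the
# longest matching sub-sequence of the input, from 12 notes down to 2,
# and transfer the stored fingering (toy databank, illustrative only).

# databank: note sequence (as a tuple) -> fingering
DATABANK = {
    ("c", "d", "e"): (1, 2, 3),
    ("e", "f", "g", "a"): (3, 4, 3, 4),
}

def suggest_fingering(passage, max_len=12):
    suggestions = [None] * len(passage)
    for length in range(max_len, 1, -1):            # longest matches first
        for i in range(len(passage) - length + 1):
            segment = tuple(passage[i:i + length])
            if segment in DATABANK:
                for j, finger in enumerate(DATABANK[segment]):
                    if suggestions[i + j] is None:  # keep longer match
                        suggestions[i + j] = finger
    return suggestions

print(suggest_fingering(["c", "d", "e", "f", "g", "a"]))
```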

Track 2, 2:00: Historical Information Systems. Chair: Willard McCarty (University of Toronto)

Espen S. Ore, Anne Haavaldsen (Norwegian Computing Centre for the Humanities) Computerizing the Runic Inscriptions at the Historic Museum in Bergen

The project aims are (a) to build a database of runic inscriptions at the Historic Museum; (b) to establish a graphical-based rune typology; (c) to develop computer-based methods for interpreting difficult-to-read signs; and (d) to develop computer-based methods for studying variations in runic form. Preliminary work began in May 1992, and a database structure and proposed set of tools has been established. The database will allow searches on linguistic criteria as well as containing images connected with the texts.

Daan van Reenen (Free University, Amsterdam) Early Islamic Traditions, History and Information Science

The author reports on research into the authenticity of Hadith literature, the most important legal source for Muslims outside the Qur'an. Each text is preceded by a chain of `transmitters', giving the reported provenance of the text from the Prophet to the writer. 2,000 traditions were analysed, covering the prohibition of making images. The texts are encoded in detail and loaded into an Ingres database, from where it is possible to pose complex questions about the characteristics of the traditions.

Angela Gilham (Tyne and Wear, UK) Knowledge-Based Simulation: Applications in History

This paper is an exploration of some of the possibilities inherent in building a knowledge-based simulation of a complex socio-political institution, based upon the emergent properties exhibited by a number of competing and cooperating agents. The institution is the House of Lords in the English Parliament of 1640-1642 (the `Long Parliament'). The author introduces and examines a hybrid form of simulation which employs techniques from distributed systems research, theories of agent modelling, and distributed artificial intelligence. The resulting model of the House of Lords is presented and discussed with reference to the problems of representing historical knowledge, and a tentative evaluation of the model is examined.

Track 3, 2:00: What Next After the TEI? Call for a Text Software Initiative (Panel)

Nancy Ide (Vassar College), Chair
The Text Encoding Initiative (TEI) and numerous text and corpus collection efforts have made realizable the availability of large amounts of electronic texts in an interchangeable format. As a result, the need for generally-available, flexible text-analytic software tools is substantially greater than ever. The panel is intended to discuss the founding principles of a Text Software Initiative (TSI), set up by the humanities computing community and specifically intended to provide the kinds of software appropriate for its research needs.
Malcolm Brown (Dartmouth College)
Mark Olsen (University of Chicago)
Jean Veronis (CNRS, Marseille)
Antonio Zampolli (Istituto di Linguistica, Pisa)

Track 1, 4:00: Signs, Symbols, and Discourses: A New Direction for Computer-Aided Literary Studies - New Responses (Panel)

Paul A. Fortier (University of Manitoba), Chair
Donald Bruce (University of Alberta) Towards the Implementation of Text and Discourse Theory in Computer-Aided Analysis
Paul Fortier (University of Manitoba) Babies, Bathwater, and the Study of Literature
Joel D. Goldfield (Plymouth State College) An Argument for Single-Author and Other Focused Studies Using Quantitative Criticism: A Collegial Response to Mark Olsen
Peter Shoemaker (Princeton University) and Gina L. Greco (Portland State University) Computer-Aided Literary Studies: Addressing the Particularities of Medieval Texts
This paper examines the manner in which computers have been integrated into mediæval textual studies. The mediæval process of writing was one of rewriting, interpreting and glossing a pre-existing text. Contemporary scholarship has attempted to address these problems, and these paradigms call for the analysis of large bodies of text data. Contrary to Olsen's assertion that a reorientation of theoretical models of computer research is necessary, contemporary mediæval studies already provide us with notions of textual analysis which are well-suited to computer development.
Ellen Spolsky (Bar-Ilan University) Have It Your Way and Mine: The Theory of Styles
Olsen has the right diagnosis of the problem: the theoretical foundations of stylistics were faulty, but machines were not the problem - if we need a new model, the author suggests that cognitive science can provide some interesting ones, for example vision theory.
Greg Lessard, Johanne Bénard (Queen's University) Computerizing Céline
It is important to view the computer not as a device for finding what we have missed, but as a means of focussing more closely on what we have already felt, and of verifying hypotheses. While the gap between traditional scholarship and computer-aided analyses is demonstrably wide in English-language work, this is not so in French-language work. Olsen treats computer use as a solitary vice: more and more, literary scholarship is becoming a social enterprise, and it is important that computer-aided analysis reflect this change. The author illustrates these points with details from the methodology of work on the writings of Céline.
Mark Olsen (University of Chicago) Critical Theory and Textual Computing

Track 3, 4:00: Issues in Humanities Computing Support (Panel)

Charles D. Bush (Brigham Young University), Chair and Presenter
The discussion focussed on the role of Humanities Computing Facilities (HCFs) in faculty teaching and research. Eleven questions were posed to the panel and the answers discussed.
Eric Dahlin (University of California, Santa Barbara)
Terry Butler (University of Alberta)
Christine Mullings (University of Bath)
Malcolm Brown (Dartmouth College)
Harold Short (King's College, London)

Saturday, June 19

Track 1, 9:00: Overview of Methodologies. Chair: Mark Olsen (University of Chicago)

Christian Delcourt (Université de Liège) Computational Linguistics from 500 BC to AD 1700

In order to shed light on those who performed textual computations before the advent of the computer, the author has brought together a vast collection of data which were spread in unconnected fields and hidden in hard-to-find documents, focussing on five branches:

Catherine N. Ball (Georgetown University) Automated Text Analysis: Cautionary Tales

Despite the increasing availability of online text corpora and text analysis tools, automated text analysis is not a solved problem: recent tests reveal varying degrees of accuracy. The author identifies four common errors:

The author calls for tools to be used with an awareness of the pitfalls, and for the development of evaluation metrics.

Jean-Jacques Hamm, Greg Lessard (Queen's University) Do Literary Studies Really Need Computers?

The authors question the usefulness of computer analysis in literary studies. While acknowledging that quantitative methods have contributed a great deal to the field, they raise questions about validity: small sample size, faulty selection of statistical tools, and the misinterpretation of results. They identify a tendency to associate perceived surplus with measures of frequency, and cite many examples where statistical readings are prone to errors of the sorts mentioned.

Track 2, 9:00: The British National Corpus: Problems in Producing a Large Text Corpus (Panel Session)

Gavin Burnage (Oxford University Computing Service), Chair and Presenter

The BNC was designed as a 100 million word corpus of modern British English, and the project is now over half way through its planned development stage. Texts are included in the corpus on four criteria: subject matter; date of first publication (mostly 1975 onwards); medium; and general level (high, middle or low). To maximise variety, the number of words in any one text does not normally exceed 40,000. Ninety million words will be written material, the remainder transcribed from spoken sources. Most text has had to be keyed in or scanned, and has been marked up using CDIF (Corpus Document Interchange Format), an application of SGML conforming to the standards of the TEI. Part-of-speech tagging is carried out at Lancaster University using a modified version of Claws. The production phase is scheduled to end in April 1994, when the corpus will become available for research.

Roger Garside (Lancaster University)

Frank Keenan (Oxford University Press)

Track 3, 9:00: The Scholar's Workbench and the `Edition': Legitimate Aspiration or Chimera (Panel)

Frank Colson (University of Southampton), Chair and Presenter. The Debate on Multi-Media Standards
In practice, constraints inherent in the traditional design of software have meant that scholars could not be provided with a seamless environment for work on electronic resources. With the emergence of Microcosm, implemented at Southampton's Media and Video Laboratory using HiDES and Kleio, scholars can explore the metaphor of an `edition'. Although the work of the TEI contains rigorously-defined standards for text, those for images, moving media and sound are still the subject of controversy.
Matthew Woollard (University of Southampton) The English version of Kleio
Kleio was developed from 1978 at the Max Planck Institute for History in Göttingen to create database-oriented software, integrate a variety of software tools and discuss the best strategies for analysing historical sources. In mid-1992, a consortium of British universities undertook the translation of the Kleio program and its documentation into English. It is currently being used in 40-50 projects (about 400 copies of the program), and is built on a new data model based on semantic networks for handling full text as well as structured data, providing separation between raw data and knowledge about it, and interfaces to other functions like mapping and typographics.
Dino Buzzetti (University of Bologna) Masters and Books in Fourteenth Century Bologna
This research programme studies the prosopography of Italian masters of arts and philosophy and the production of books which provide evidence of the teachings imparted by them. Kleio was selected to allow the storage of sources separately from hypotheses which might be developed from them. Such a database offers the special opportunity of comparing the different versions of a text, and the image-processing modules allow a one-to-one correspondence between alphanumerical and iconic representations of the textual segments.
Jean Colson (University of Southampton) Microcosm as the Dynamic Edition
To be effective, a scholar's workbench should allow the seamless transfer of data between various environments. Microcosm (for MS-Windows) allows readers to browse through large bodies of multimedia information by following links from one place to another. The separation of link from data not only provides for the creation and referencing of various forms of links, but allows users to maintain a nominated linkbase containing all the links they have created.

11:00: Featured Speaker: [Introduction, John Roper (University of East Anglia)] John Burrows (University of Newcastle, Australia) Noisy Signals? Or Signals in the Noise?

If we regard ourselves as the eventual transmitters and receivers [of signals], the model needs to be refined, because our own, often unwitting, interventions in the process override the boundaries between transmitter, signal and noise, as well as those between signal, noise and receiver. In computational stylistics, it is possible to decipher a superabundance of signals in what is usually dismissed as noise.

The analytical procedure employed here is to establish frequency-tables for the 30, 50 or 100 most common words of a given set of texts, to extract a correlation matrix, and subject it to a Principal Components analysis. In the second phase, the tables are subjected to distribution tests to determine which words play most part in separating the texts. Ten years' work [at Newcastle] has enabled the formation of useful inferences at extremely high levels of probability, including consistent and intelligible differences between 18th-, 19th- and 20th-century authors; between authors of different nationalities writing in English; and between male and female authors in earlier eras.
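The first phase of the procedure can be sketched in code. The following is a minimal illustration (not Burrows's own implementation, and the three toy texts are invented for the example): build relative-frequency tables for the most common words across a set of texts, compute the correlation matrix of those word variables, and project the texts onto principal components obtained by eigendecomposition of that matrix.

```python
from collections import Counter
import numpy as np

# Toy corpus standing in for a set of texts (hypothetical data).
texts = {
    "A": "the cat sat on the mat and the dog sat too",
    "B": "a dog and a cat ran on the grass and the path",
    "C": "the rain fell on the roof and on the path below",
}

# 1. Frequency table: relative frequencies of the N most common words.
tokens = {name: t.lower().split() for name, t in texts.items()}
overall = Counter(w for ws in tokens.values() for w in ws)
common = [w for w, _ in overall.most_common(5)]
freq = np.array([[tokens[name].count(w) / len(tokens[name]) for w in common]
                 for name in texts])

# 2. Correlation matrix of the word-frequency variables (columns).
corr = np.corrcoef(freq, rowvar=False)

# 3. Principal components: eigenvectors of the correlation matrix,
#    ordered by descending eigenvalue; project standardized texts onto them.
vals, vecs = np.linalg.eigh(corr)
order = np.argsort(vals)[::-1]
standardized = (freq - freq.mean(axis=0)) / freq.std(axis=0)
scores = standardized @ vecs[:, order]  # each text's position on the PCs
```

In a real study the word list would be the 30, 50 or 100 most common words of the whole set, and the resulting component scores would then be plotted and, in the second phase, supplemented by distribution tests on individual words.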

The gender differentiation was based on analyses of 99 first-person authors (45 female and 55 male) and revealed statistically significant differences between these subsets which increase substantially when the authors are also divided chronologically. The inference is that male and female authors write more like each other now than they used to do, but when the most powerful of the wordtypes acting as gender differentiators are subjected to Principal Components analysis, the resulting graph still locates most authors appropriately as male or female. The exceptions testify in their various ways to the claim that narrative style is affected by factors such as education and upbringing, so the results may be best understood as reflecting educational rather than gender differences.

11:30: Closing Ceremony.

Comments by Nancy Ide, President, Association for Computers and the Humanities; Susan Hockey, President, Association for Literary and Linguistic Computing; Michael Neuman, Local Organizer, ACH-ALLC93; Pierre Lafon, Local Organizer, ALLC-ACH94.