This report is a summary of the joint conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, held at Georgetown University, Washington DC, 16–19 June 1993. It contains a précis of the text published in the preprints, supplemented by the author's notes, but omits a few sessions for which (a) no paper was available; (b) a panel discussion was held viva voce; or (c) a fuller report is available from the speaker. In dealing with topics sometimes outside my own field, I will naturally have made mistakes, and I ask the authors' pardon if I have misrepresented them.
A hypertext version of this report is available on the Internet at http://curia.ucc.ie/info/achallc/georgetown.html and can be accessed through the WorldWideWeb using lynx, mosaic or similar browsers.
Welcomes: Mr. John J. DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus; Rev. Robert B. Lawton, S.J., Dean, Georgetown College; Susan K. Martin, University Librarian; Nancy Ide, President, Association for Computers and the Humanities; Susan Hockey, President, Association for Literary and Linguistic Computing
Keynote Speaker: Clifford Lynch, Director of Library Automation, Office of the President, University of California
The opening ceremony was held in the splendour of Gaston Hall, Georgetown University. Dr Michael Neuman, organiser of the conference, warmly welcomed attendees and then introduced each of the speakers.
In his welcoming speech, Mr John J DeGioia, Associate Vice President and Chief Administrative Officer for the Main Campus, noted that it was impressive that the Georgetown conference followed Oxford and preceded the Sorbonne. Georgetown had a campus in Italy and there seemed to be similarities between Georgetown and Florence in literature, history, philosophy and art. Working in computing in the humanities must be a little like working in Florence in the 16th century. We stand on the verge of several possibilities and the very idea of `text' is central to shaping these possibilities. We will be facing serious questions in the years ahead and it was appropriate that the purpose of the conference was to improve learning.
The Rev Robert B Lawton SJ, Dean of Georgetown College, spoke of the computer as representing sophisticated technology for retrieving and processing information that really represents a profound movement in human evolution when we can extend our very powers of thinking. He hoped that the conference would lead to profitable conversations which would enrich the time spent at Georgetown.
Nancy Ide welcomed the audience on behalf of the Association for Computers and the Humanities. She concluded by pointing out that the theme suggests that we are at an important moment, and set the scene for the conference.
On behalf of the Association for Literary and Linguistic Computing, Chairman Susan Hockey complimented the Georgetown University organisers for their effort and effectiveness in organising the conference. The programme committee had put together a programme that really shows where Literary and Linguistic Computing and Humanities Computing will make contributions well into the next decade. There is great potential for working together with librarians to pursue electronic research so that a significant contribution could be made to the electronic library of the future. The conference is a valuable opportunity to bring together the issues which concern librarians and scholars with the skills of computer scientists to develop programs for the creation of electronic texts and their manipulation.
Susan K. Martin, University Librarian, also noted the growing interest in electronic texts in the research library community. Librarians will take on new and exciting roles as the true potential of electronic texts becomes more fully understood.
In the opening keynote address, Clifford Lynch, Director of Library Automation, Office of the President, University of California, surveyed the current and future scenes for electronic information delivery and access. In a presentation which ranged over many topics with great clarity and vision, Dr Lynch stressed that the future lay in electronic information and getting definitional handles on its components. He spoke about the technology and computer methods whereby 12,000 constituent networks existed where ideas could be exchanged, and where information can be accessed. Information exists that is independent of particular computer technology and is usable through open standards and network servers that can migrate from one generation of technology to the next thereby preventing preservation disasters of the past; information can now cross many generations of technology.
Inspiration and new analysis tools continue to allow researchers to build on the work of others. Databases and knowledge management by libraries have allowed them to become a central part of scientific enterprise to be used and shared internationally. Networks act as a facilitator of collaboration and for the inclusion of geographically remote researchers.
To establish how to handle electronic texts there was a need for coordination between libraries, centres for electronic texts and initiatives such as the TEI. He also pointed to the fact that intellectual property and copyright of textual material is turning out to be an incredible nightmare.
A move to overcome problems by developing a superstructure for the use of textual resources was proposed. Networking information was the right kind of perspective for thinking about these problems. And the future of networking information provided an interesting voyage fundamental to the issues of scholarship.
By linking over 50 texts dealing with the French language, from Estienne's treatises to the Dictionnaire de l'Académie, in which the key issues in the debate over usage are mentioned (dialects, archaisms, neologisms, foreign borrowing, spelling, pronunciation, sociolinguistic variation, etc) it is possible to reconstruct what constituted the metalanguage of grammatical discussion in 17th century France. He argues that the full-text database techniques of corpus linguistics can be brought to bear on the neglected analysis of the history of discourse (in particular the definition of `usage') and that the importance of these disciplines in another age cannot be subject to current theoretical fashions.
This study concentrates on the representational aspects of the discursive status of the Paris Commune of 1871, using a computer-aided analysis of the titular trilogy. The authors' hypothesis (that Proudhon's formal proposition of `anarchism' is realized in the narrative, metaphor and lexical items of Vallès, Rimbaud and Reclus) is being tested in two stages, firstly on unmarked texts, and secondly using PAT and Tact on TEI-marked versions of the texts, both to verify the existence of the empirical regularities and to ascertain heuristically any undiscovered patterns or relationships.
The author distinguishes two types of corpus exploitation: micro, where a specific phenomenon (for example, a linguistic element) is studied in detail, and macro, where groups of phenomena are studied on a corpus-wide basis (for example, to derive a probabilistic parser). Given the substantial effort needed to create syntactic annotation of corpus material, he examines the level of detail in annotation required for the examples of each of the two types of exploitation identified. Micro-exploitation is usually more successful if it involves only categorisation, rather than exploring functional relationships; the macro-exploitation of parser generation is more problematic, since the use of the data is much more varied.
In the analysis of lexical conceptual structure, distributional data may help in solving problems of linguistic underdetermination. With morphological productivity as an example, the author uses the Dutch Eindhoven corpus to show that the frequency distributions of inchoative and separative readings of the Dutch prefix `ont-' are statistically non-distinct. He argues that linguistic analysis is called for in which deverbal and denominal reversatives are assigned identical lexical conceptual structures.
Previous research has demonstrated that there is a formalisable, learnable set of mechanisms which can generate, in principle, an unlimited set of `Tom Swifties' (a form of wordplay such as `I hate Chemistry, said Tom acidly'). This is now extended to analyse structures such as riddles (`Why did the dog go out into the sun? To be a hot dog'). Such riddles share an essential trait with Tom Swifties: they are learned and learnable linguistic strategies.
The VINCI natural language generation environment offers a context-free phrase-structure, a syntactic tree transformation mechanism, a lexicon and lexical pointer mechanism and a lexical transformation mechanism to provide a modelling environment suitable for such analyses.
The model analysed allows the selection of the specific different semantic traits which pose the problem, and generates a question containing them. Three different kinds of question are exemplified, and the paper examines in more detail the linguistic constraints on riddles, in particular the tension between lexicalisation of the correct answer versus productivity.
One of the most vexing problems in Natural Language Processing is the linguistic knowledge acquisition bottleneck. For each new task, basic linguistic datastructures have to be handcrafted almost from scratch. This paper suggests the application of Machine Learning algorithms to automatically derive the knowledge necessary to achieve particular linguistic mappings.
Instance-Based Learning is a framework and methodology for incremental supervised learning, whose distinguishing feature is the fact that no explicit abstractions are constructed on the basis of the training examples during the training phase: a selection of the training items themselves is used to classify new inputs.
A training example consists of a pattern (a set of attribute/value pairs) and a category. The algorithm takes as input an unseen test pattern drawn from the input space, and associates a category with it. The paper compares the results of this approach quantitatively to the alternative similarity-based approaches, and qualitatively to handcrafted rule-based alternatives.
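The classification step can be illustrated with a minimal sketch in Python. It assumes the simplest variant of the approach (every training item is stored, and similarity is a plain count of mismatching attributes); the attribute names and the toy training items are hypothetical and are not taken from the paper.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the attributes on which two patterns disagree."""
    keys = set(a) | set(b)
    return sum(1 for k in keys if a.get(k) != b.get(k))

def classify(memory, pattern, k=1):
    """Return the majority category among the k stored items nearest to pattern."""
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], pattern))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training items, each a (pattern, category) pair
memory = [
    ({"final": "s", "prev": "vowel"}, "plural"),
    ({"final": "d", "prev": "e"}, "past"),
    ({"final": "g", "prev": "n"}, "progressive"),
]

print(classify(memory, {"final": "d", "prev": "e"}))
```

No rule or other abstraction is built during training: the stored items themselves do the work at classification time, which is the distinguishing feature described above.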
The NewsAnalyzer program is intended to assist researchers from a variety of fields in the study of newspaper discourse. It works by breaking up newspaper articles into individual statements which are then categorised along several syntactic, semantic and pragmatic dimensions. The segmented and categorised text can then serve as data for further analysis along lines of interest to the researcher, for example statistical, content or ideological analysis.
The classification scheme distinguishes between factual and non-factual statements: factual ones are further classified between stative or eventive, and non-factual ones have several further subcategories. This is done through a combination of shallow syntactic analyses, the use of semantic features on verbs and auxiliaries, and the identification of special function words.
The program is implemented in Scheme and runs on Apple Macintosh computers. A hand-coded version has been used in a linguistic-based content analysis of two weeks of newspaper coverage in which all front-page stories were coded and analysed for writing style, and has also been used to study the ways in which newswriters reported predictions on the Presidential and Vice-presidential candidates [in the 1992 US Presidential election].
The Dartmouth College Information System (DCIS) is organised as a distributed computing resource using the layered OSI model for its network interactions. It includes features such as:
The client/server model used introduces a modularity that was not previously available, and the use of the network frees the user from geographical constraints and delays. It is felt that the information thus made available has a potential equal to that of word processing to facilitate the basic work of Humanities scholarship.
The CNI could be the key in supporting the Humanities, as it builds upon its original programs of scholarship enhancement through the creation of a free and accessible global library of electronic holdings via the National Research and Education Network (NREN).
Through its working group `The Transformation of Scholarly Communication', the CNI intends to identify and help promulgate projects and programs in the Humanities that have significant implications for changing scholarship methodology and teaching when made available on the NREN.
This six-year project started in March 1991 to convert the paper-based archives of the collection departments in Norwegian universities to computer-based form, making the `Norwegian Universities' Database for Language and Culture'. An estimated 750 man-years of work is required: currently 120 people are engaged from the skilled unemployed on a 50%-50% work and education programme.
The information core is held in SYBASE and accessed using PAT and SIFT on UNIX platforms. Using client-server technology, the database resides on a machine in Oslo and can be accessed through local clients (Omnis7 or HyperCard) or by remote X-terminals.
Data loading has started with four subprojects: coins (a collection of approximately 200,000 items at the University of Oslo), archaeology (reports in SGML on all archaeological sites in Norway), Old Norse (from 1537 CE, approximately 30,000 printed pages) and Modern Norwegian (the creation of a lexical database for the language). All text is stored in TEI-conformant form.
The difficulties of providing a Computer-Assisted Language Learning system (CALL) are exacerbated by the fact that the teacher is often not a computer specialist. Such packages need to accommodate both the needs and aptitudes of the learner and the goals of the teaching programme. Good interface design is essential so that it is easy for teachers to add material as well as being easy for the students to use. The proposed system uses SGML to define the formal structure, and a sample DTD for a student exercise is included.
The program content covers the story of the Hebrew people from their settlement in the Jordan valley to the fall of Jerusalem in 70 CE. It is implemented in ToolBook 1.5 for the IBM PC but can be ported to HyperCard for the Macintosh. The courseware is designed to provide interactive lessons on forces that influenced the development of the Hebrew people and students can select from 12 Lesson units, which include problem-solving questions. It can be used for enrolled students (giving a grade) or for guest users (interest only) and provides pre- and post-test elements to judge performance.
An increasing number and variety of networked archives related to religious studies have appeared in the last few years, based on LISTSERV and ftp. The author provides an overview of the experience of creating and cataloguing network-based resources in religious studies, and relates the process and online research strategy of writing a comprehensive bibliography and guide to networked resources in religious studies, The Electric Mystic's Guide to the Internet. The issues of size and growth rate, security, copyright and verification, skills and tools required, and the funding strategies needed are also discussed.
Cocitation Analysis was developed in 1973 and is used here to study how humanities scholars perceive the similarities of 29 core religion journals. The resultant map is a graphic picture of the field of religious studies as organized by its journals. The map depicts the spatial relationships between each speciality area and also between the individual journals within each area. These maps are useful for providing an objective technique for tracing the development of a discipline over time and are of potential benefit in teaching basic courses in religious studies.
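The counting step behind such a map can be sketched in a few lines of Python. This is an illustrative sketch only: the journal names and reference lists are hypothetical, and the subsequent scaling of the counts into a two-dimensional map is not shown.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count, for each pair of journals, how many citing papers cite both."""
    counts = Counter()
    for refs in reference_lists:
        # each pair is counted once per citing paper, in alphabetical order
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: the journals cited by each of three citing papers
papers = [
    ["Journal A", "Journal B", "Journal C"],
    ["Journal A", "Journal B"],
    ["Journal B", "Journal C"],
]
counts = cocitation_counts(papers)
print(counts.most_common())
```

Pairs of journals that are frequently cited together receive high counts and would be placed close together on the resulting map.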
The recent innovation of hypertext has been taken as providing a clear and convincing solution to the problems facing textual scholars. The author examines what hypertext can and cannot do for editions compared with print.
Space is the predominant factor: it is unlimited in hypertext and has obvious attractions for handling multiple-version and apparatus criticus problems. The other main factor is ordering: hypertext can provide multiple views of a corpus that are unobtainable in print. The mechanisms used are the link, the connection between one point in hypertext and another, and the path, which is a series of pre-specified links.
Despite its obvious attractions, editors have been too limited in what they want from hypertext, seeking only solutions to the problems of traditional publishing rather than taking advantage of the new medium's possibilities.
The electronic presentation of a complicated work such as Finnegans Wake and its manuscripts can facilitate an international cooperative effort at a critical edition. The present approach covers the drafts, typescripts and proofs of chapters vii and viii of part I of the work, done in Asymetrix' ToolBook OpenScript for MS-Windows/MCI. The implementation allows synoptic browsing of the pages, using a `handwriting' font for manuscripts, typewriter font for drafts and Times for proofs, with access to graphic images of the manuscript pages themselves.
The large numbers of Humanities students required to take courses in computer literacy often do so on a campus-wide basis, regardless of their discipline. The paper proposes the use of training in hypertext systems as a vehicle for instructing them in file creation, formatting, making links and incorporating graphics so that they learn to use the computer's screen as a vehicle for presenting arguments, displaying interrelated information and for providing readers with choices. The skills they learn are transferable into other applications, and the students who have gone through the UNL course have received it very enthusiastically.
Conjunctions, comparatives and other complex sentences usually omit some constituents. These elliptical materials interfere with the parsing and the transfer in machine translation systems. This paper formulates Ellipsis Rules based on X-scheme. The differences between English and Chinese constructions are properly treated by a set of transfer rules.
Ellipsis is the omission of an element whose implied presence may be inferred from other components of the sentence, as in I like football and Kevin @ tennis (where the `@' stands for an omitted `likes'). The approach to parsing ellipsis is to divide the grammar rule base into normal (N) rules and ellipsis (E) rules. Recognition of one phrase (I like football) can then trigger a differential analysis of the remainder of the sentence, the relationship being governed by the E-rules.
Chen identifies four elliptical constructions in English,
Because the specific features of elliptical construction in English are described by the uniform E-rules, the grammar rules for other phrases need not be changed. A set of lexical and structural transfer rules has been constructed to capture the differences between English and Chinese elliptical constructions, implemented on an English-Chinese machine translation system using Quintus Prolog and C.
Text corpora are used for many purposes in the study of language and literature: frequency tables derived from corpora have become indispensable in experimental psycholinguistics. The keCi analyser has been developed to automatically lemmatise a library of texts in Turkish, a language with an agglutinating morphology.
The system matches the root at the left edge of the input string and follows a morphotactic network to uncover the remaining morphological structures of the string. The developing corpus has been converted from disparate formats into a common scheme in ASCII and so far, 5000 lines (sentences, about 70Kb) have been `cleaned' by keCi.
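The general technique of left-edge root matching followed by a walk through a morphotactic network can be sketched as follows in Python. This is not the keCi implementation: the two-word lexicon, the network and the suffix labels are hypothetical, and vowel harmony and consonant alternations are deliberately ignored.

```python
ROOTS = {"ev", "kitap"}  # hypothetical mini-lexicon

# Hypothetical morphotactic network: state -> [(suffix forms, label, next state)]
LOCATIVE = (("de", "da", "te", "ta"), "LOCATIVE", "case")
NETWORK = {
    "root": [(("ler", "lar"), "PLURAL", "number"), LOCATIVE],
    "number": [LOCATIVE],
    "case": [],
}

def follow(rest, state):
    """Return the suffix labels that consume all of rest, or None if no path does."""
    if not rest:
        return []
    for forms, label, next_state in NETWORK[state]:
        for form in forms:
            if rest.startswith(form):
                tail = follow(rest[len(form):], next_state)
                if tail is not None:
                    return [label] + tail
    return None

def lemmatise(word):
    """Match a root at the left edge, then walk the network over the remainder."""
    for end in range(len(word), 0, -1):  # try the longest candidate root first
        root = word[:end]
        if root in ROOTS:
            labels = follow(word[end:], "root")
            if labels is not None:
                return root, labels
    return None

print(lemmatise("evlerde"))
```

A realistic analyser for an agglutinating language would need a far larger network and would have to check harmony between root and suffix; the sketch shows only the control flow.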
The ENGCG parser constitutes a reliable linguistic interface for a wide variety of potential applications in the humanities and related fields, ranging from parsing proper via corpus annotation to information retrieval.
It performs morphological analysis with part-of-speech disambiguation and assigns dependency-oriented surface-syntactic functions to input wordforms, using the Constraint Grammar techniques developed by Karlsson (1990) by expressing the structures which exclude inappropriate alternatives in parsing.
The system exists in a LISP development version and a C++ production version running on a Sun Sparcstation 2.
In the early 1960s, Joseph Raben and his colleagues developed a program that compared two texts and found pairs of sentences in which one text contained verbal echoes of the other. Despite the flexibility of modern concordance systems, Raben's program appears not to have survived in modern form. This study examines the problem, Raben's solution and possible new approaches.
Using the accepted premise that Shelley was heavily influenced by Milton, Raben developed an algorithm to analyse canonically-converted sentences from works of both authors. Although the algorithm has not been retested in 30 years, it has now been reimplemented using modern tools, and its effectiveness examined. Work is also proceeding on comparing this approach to a passage-by-passage approach using the ``word cluster'' technique.
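The kind of comparison involved can be illustrated with a much simplified stand-in in Python. It is not Raben's algorithm: the canonical conversion here is assumed to be nothing more than lower-casing and the removal of a short hypothetical list of function words, and the two sample sentences are invented for the illustration.

```python
import re

# Hypothetical function-word list; a real study would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "with", "that", "his", "her", "on"}

def canonical(sentence):
    """Reduce a sentence to its set of lower-cased content words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w not in STOPWORDS}

def echoes(text_a, text_b, min_shared=2):
    """Yield (sentence_a, sentence_b, shared_words) for pairs sharing enough words."""
    for sa in text_a:
        ca = canonical(sa)
        for sb in text_b:
            shared = ca & canonical(sb)
            if len(shared) >= min_shared:
                yield sa, sb, sorted(shared)

# Invented sentences standing in for the two texts
first = ["The pale moon rose over the silent sea."]
second = ["A silent sea lay under the pale stars.", "Morning came at last."]
for pair in echoes(first, second):
    print(pair)
```

Comparing every sentence of one text with every sentence of the other is quadratic in the number of sentences, which is one reason an indexed or passage-by-passage approach may be attractive for longer works.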
Cumulative sum (CuSum) charts are primarily used in industrial quality control, but have found application in authorship attribution studies, and one particular technique (QSUM, Morton & Michaelson) has been the centre of forensic controversy in the UK in some allegedly forged-confession cases.
The QSUM test uses the assumption that people have unique sets of communicative habits, and implements CuSum charts to present graphically the serial data in which the mean value of the plotted variable is subject to small but important changes. But as there is as yet no statistically valid way of comparing two CuSum charts, any decision regarding their significance will necessarily be either subjective or arbitrary.
Experiments with weighted CuSums indicate that they perform marginally better than the QSUM test, but are not consistently reliable: it may be that authors do not follow habit as rigidly as would be needed for CuSum techniques to determine authorship correctly.
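For readers unfamiliar with the chart itself, the underlying calculation is simply the running total of each value's deviation from the series mean. A minimal Python sketch follows, using hypothetical sentence lengths; it shows the plain CuSum only, not the weighted variant or the QSUM procedure.

```python
from itertools import accumulate

def cusum(values):
    """Cumulative sum of deviations from the mean of the series."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))

# Hypothetical sentence lengths (in words) for a five-sentence sample
sentence_lengths = [10, 14, 8, 12, 16]
print(cusum(sentence_lengths))
```

The series always returns to zero at its last point, so it is the shape of the path in between that is inspected; in authorship work a second series (counts of some habitual feature per sentence) would be charted in the same way and the two shapes compared, which is where the subjectivity noted above enters.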
Computational linguistics can facilitate the examination of how successful a translation is in replicating important stylistic features of the original text. An example is the Finnish translation of Samuel Beckett's How It Is, which describes the writing process itself and the effort put into it.
A feature of this novel is its use of repetition, and research is cited on the effect of ``shifts'' in style occasioned by the non-coincidence of stylistic conventions such as repetition which differ between two languages. TACT was used to analyse consistency in the use of specific words and phrases which were used in the translation, and it was noted that shifts have indeed occurred.
In the classification of the sound properties of cognate words, there is a lexicon-based, equal-weighted technique for quantification which can be used without constructing phonological rules. In a study of Rukai (a Formosan language), a modification of Cheng's quantitative methods was used to store 867 words and variants. These were subjected to statistical refinement and then successfully subgrouped by a process using a data matrix of difference and sameness, a measure of the degree of similarity, and cluster analysis.
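The three ingredients named here (a sameness/difference comparison, a similarity measure and cluster analysis) can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the paper: similarity is taken to be the plain proportion of items with identical coded forms, clustering is single-linkage, and the varieties and their codes are hypothetical.

```python
from itertools import combinations

def similarity(forms_a, forms_b):
    """Proportion of lexical items for which two varieties have the same form."""
    same = sum(1 for a, b in zip(forms_a, forms_b) if a == b)
    return same / len(forms_a)

def cluster(varieties):
    """Single-linkage agglomerative clustering; returns the merges in order."""
    clusters = [(name,) for name in varieties]
    merges = []
    while len(clusters) > 1:
        def link(pair):
            # similarity of the closest pair of members across the two clusters
            return max(similarity(varieties[x], varieties[y])
                       for x in pair[0] for y in pair[1])
        best = max(combinations(clusters, 2), key=link)
        merges.append((best[0], best[1], link(best)))
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges

# Hypothetical coded forms for four lexical items in three varieties
varieties = {
    "V1": ["a", "a", "b", "c"],
    "V2": ["a", "a", "b", "d"],
    "V3": ["x", "a", "y", "z"],
}
for merge in cluster(varieties):
    print(merge)
```

Every lexical item contributes equally to the similarity score, which corresponds to the equal weighting mentioned above; no phonological rules are needed to produce the grouping.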
The Linguistic Atlas of the Middle and South Atlantic States (LAMSAS) [USA] has been used to analyse the well-known phonological feature of the loss of post-vocalic /r/. Using a specially-designed screen and laserprinter font in the upper half of the PC character set, database searches can be carried out on complex phonetic strings.
Two methods were examined, assigning each pronunciation a score, and treating each pronunciation as binary (ie with or without retroflexion). The techniques used are discussed, and the system for encoding phonetic symbols and diacritical marks is described.
The spread of computers which can handle Japanese language processing has led to the start of the National Institute of Japanese Literature (NIJL)'s recension full-text database, The Outline of Japanese Classical Literature (100 vols, 600 works), comprising databases containing the Texts, Bibliographies, Utilities (revision notes) and Notes (headnotes, footnotes, sidenotes etc).
Another project is underway to construct a full-text database of current papers in the natural sciences, using DQL and SGML. A visual approach has been chosen to overcome the inherent complexity of DQL's SQL parentage. An elemental query is written in a box attached to the node or leaf, and complex queries can be constructed by gathering these using the mouse.
The possibility of applying TEI standards to Japanese classical literature is being studied.
The scale of the task of adequately documenting Early Modern English may explain why a Dictionary project was not funded several decades ago, when research showed that such a project could turn out larger than the OED. However, a text database of Renaissance bilingual and English-only dictionaries would be feasible as a way of making available information that would appear in an EMED.
Using such an electronic knowledge base, a virtual Renaissance English dictionary could be constructed, using SGML tagging, and inverting the structure of bilingual texts: early results (ICAME, Nijmegen, 1992) indicate that there are some phrasal forms and new senses not found in the OED.
The analysis of the perennial and much-disputed problem of diglossia in contemporary Greek, and the problems of teaching Modern Greek both as a first and a foreign language, would be much facilitated if there were a database of the corpus of modern spoken Greek. Such a project has been proposed since 1986, and many of the technical problems identified then have since found at least partial solutions.
There have been many fragmented projects around the world on modern written Greek, and there is now a survey under way to determine the nature and size of the extant corpus.
Previous work has made it clear that the development of a common dictionary encoding format is extremely difficult. This paper describes the encoding scheme for mono- and bilingual dictionaries developed by a working group of the Text Encoding Initiative, intended for use by publishers and lexicographers, computational linguists, and philologists and print historians.
Three different views of dictionaries exist: the typographic view, the textual view and the lexical view. Dictionary structure is typically factored to avoid redundancy, but the way in which the data is factored varies widely from one dictionary to another. The two DTDs provided for in the TEI guidelines cover the `regular' structure, allowing nesting of senses within homographs or entries; and the `free' DTD, which allows any element anywhere.
The high level of information compression within entries makes faithful reproduction of the typographic format difficult, and principles are given for the retention of the textual view where this is desired, and the difficulties of the rearrangements undertaken in providing a lexical view are discussed.
The development of the WorldWideWeb (W3) hypertext system has led to a growth in interest in the use of SGML-encoded texts for the dissemination of information. The relatively primitive facilities offered by the HTML DTD currently in use will be supplanted by those of HTML+ and it is hoped that a mechanism will be developed to allow the replication of TEI-encoded texts in a similar manner. Providing a W3 service is relatively straightforward but entails the acceptance of the onus to keep the machine running for the benefit of remote users. Some difficulties were encountered with diacriticals, but these have largely been solved. More important is the ease of use for the reader, and the ease with which documents can be made available: these are important aspects given the rapid drive towards computer literacy for humanities scholars and students.
The Norwegian Wittgenstein Project (1980-1987) had at that time to construct its own markup system. When it was continued in 1989-90, alternatives were sought. SGML was found lacking in some critical areas (software, overlapping elements and the need to define the DTD in advance).
The Multi-Element Code System (MECS) was developed to allow a more flexible approach, while retaining sufficient similarities to allow accommodation to SGML's reference concrete syntax. It does not presuppose any hierarchical document structure and distinguishes between no-element tags, one-, poly- and N-element tags by the delimiters used. A Program Package contains software to create, validate, [re-]format, analyse and convert MECS-conforming documents.
Although MECS is better for descriptive coding purposes, there are some shortcomings, but work is ongoing to enable a closer coexistence between it and SGML.
The text of Hegel's Phenomenology of Mind is available in Baillie's translation in electronic form for Wordcruncher or Micro-OCP. The greater accuracy of the computer compared to existing indexes is demonstrated by reference to the larger number of occurrences found when searching for a word, and it is also possible for the words to be located accurately.
Searching for word combinations allowed definitions to be identified and compared with those of other philosophers, and enabled the location of those definitions in their context to be used as a basis for further philosophical interpretation.
Frequency analysis and concordance-making revealed that Hegel used a limited number of concepts to talk about desire, the list being confirmed by the theory of Lacan 150 years later. The author hopes to establish lists of concepts that are used for the first time in a particular problem. Preliminary results are not clear but include promising hints.
The author investigates the distribution of verbal forms in two novels by Virginia Woolf and three from Mills & Boon. Theoretical linguistics was applied by students using OCP to investigate the constative and commentative aspects of verb usage, and the results were compared with earlier work by Biber. The study illustrates the way macroscopic text analysis using text retrieval and simple statistics can be conducted.
A computer-aided analysis of the poem provides the textual basis for thematically organised lists of all single elements within word clusters and thus helps determine for what purpose Byron may have used them. The poem was scanned and tagged using a modified SGML structure, and analysed using TUSTEP (Tübinger System von Textverarbeitungs-Programmen). What began as an analysis of clusters finally led to a vast number of new observations on how language is handled in one particular text: while literary computing is ideally suited for the initial preparation of the text, final classification, analysis and interpretation still has to be done by human beings.
Optical character recognition is the most promising data acquisition technology for the humanities computer user, but the error rate (or rate of correction-requirement) raises the question of how OCR could benefit from lessons learned in problem reduction for automatic speech recognition, as the two technologies appear to share enough of the same problem characteristics. The author proposes a model based on a frequency approach, as used in biphone and triphone modelling in speech recognition, to see if particular combinations of characters are causes of error. A Scanned Language System would take advantage of the power of the grammar and semantics resident in Natural Language systems to enhance the recognition power of OCR systems.
The author defines an algorithm for `mutual information' (Church & Hanks, 1990) which expresses the measure of association between two events (lexical items) in terms of the probability of their occurrence at a given distance (separation). Taking the Lancaster-Oslo-Bergen corpus, experimental phrases were generated, and the algorithm employed to word classes which were related to the function of the priming lexical items in the original context. The algorithm is based on the notion that the elements of a phrase mutually constrain one another and on the way in which lexical items can be characterized by the neighbourhoods in which they `live'. The results are encouraging, and may be a means of validating the techniques of representation which can be applied to the statistical analysis of linguistic phenomena in corpora.
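The association measure described above can be illustrated with a minimal sketch. It assumes the standard Church & Hanks formulation, I(x, y) = log2(P(x, y) / (P(x) P(y))), with the joint probability estimated from the number of times y occurs exactly a given number of tokens after x. The function name, the whitespace tokenisation and the toy token list are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(tokens, x, y, distance=1):
    """Mutual information (in bits) of x followed by y at a fixed separation.

    Assumed estimator: probabilities are simple relative frequencies over
    the token list; the paper's own estimator may differ.
    """
    n = len(tokens)
    counts = Counter(tokens)
    # Count positions where x is followed by y exactly `distance` tokens later.
    joint = sum(
        1
        for i in range(n - distance)
        if tokens[i] == x and tokens[i + distance] == y
    )
    if joint == 0:
        return float("-inf")  # the pair never co-occurs at this separation
    p_x = counts[x] / n
    p_y = counts[y] / n
    p_xy = joint / n
    return math.log2(p_xy / (p_x * p_y))


# Toy data (hypothetical): "a" is always followed directly by "b".
toy_tokens = "a b a b c c c c".split()
score = mutual_information(toy_tokens, "a", "b", distance=1)
```

A positive score indicates that the pair occurs at that separation more often than chance would predict; a score near zero indicates no association.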
The integration of computer-assisted graphics, animation and CD-ROM/audio within an object-oriented environment gives today's music theory teachers the means to create platforms for the simultaneous investigation of local and large-scale musical processes. The author presents a demonstration of such techniques applied to Schubert's Erlkönig. When used in such ways, the power of multimedia allows students to see and hear sophisticated long-range structural relationships that could not be demonstrated using traditional teaching methods. Music educators who use and design this technology must do so with the full awareness that our visual representations no longer lie dormant on the page.
The choice of keyboard fingering for a piece of early music can have a profound bearing on its musical interpretation. The paper describes a computer program which has been developed for determining an appropriate fingering for pieces of keyboard music of the English Virginalist school. Using a databank of fingerings from 1560 to 1630, and an input passage for which fingering is sought, the program scans first for exact matches, then for partial matches from 12 notes down to two notes, and finally uses comparative matches varying pitch and duration. The program has so far only been used for the period mentioned, but it could be used without modification for any sufficiently large body of previously-fingered music.
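The longest-match-first search described above might be sketched as follows. This is an illustration only, not the author's program: it assumes notes are represented as MIDI pitch numbers, that the databank is a list of (notes, fingers) pairs, and that matching is on exact pitch sequences. The final stage of comparative matching with varied pitch and duration is not attempted, and all names are hypothetical.

```python
def build_index(databank, max_len=12, min_len=2):
    """Map every note run of min_len..max_len notes to the first fingering seen for it."""
    index = {}
    for notes, fingers in databank:
        for length in range(min_len, max_len + 1):
            for start in range(len(notes) - length + 1):
                key = tuple(notes[start:start + length])
                index.setdefault(key, tuple(fingers[start:start + length]))
    return index


def suggest_fingering(passage, index, max_len=12, min_len=2):
    """Work left to right, taking the longest matching run at each position.

    Notes with no match of at least min_len notes are left as None.
    """
    result = [None] * len(passage)
    pos = 0
    while pos < len(passage):
        longest = min(max_len, len(passage) - pos)
        for length in range(longest, min_len - 1, -1):
            key = tuple(passage[pos:pos + length])
            if key in index:
                result[pos:pos + length] = index[key]
                pos += length
                break
        else:
            pos += 1  # nothing matched here; move on one note
    return result


# Hypothetical databank: one five-note ascending run fingered 1-2-3-4-5.
databank = [([60, 62, 64, 65, 67], [1, 2, 3, 4, 5])]
index = build_index(databank)
fingering = suggest_fingering([62, 64, 65, 70], index)
```

Here the first three notes of the input match a run in the databank and inherit its fingering, while the last note has no match and is left unfingered.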
The project aims are (a) to build a database of runic inscriptions at the Historic Museum; (b) to establish a graphical-based rune typology; (c) to develop computer-based methods for interpreting difficult-to-read signs; and (d) to develop computer-based methods for studying variations in runic form. Preliminary work began in May 1992, and a database structure and proposed set of tools have been established. The database will allow searches on linguistic criteria as well as containing images connected with the texts.
The author reports on research into the authenticity of Hadith literature, the most important legal source for Muslims outside the Qur'an. Each text is preceded by a chain of `transmitters', giving the reported provenance of the text from the Prophet to the writer. 2,000 traditions were analysed, covering the prohibition of making images. The texts are encoded in detail and loaded into an Ingres database, from where it is possible to pose complex questions about the characteristics of the traditions.
This paper is an exploration of some of the possibilities inherent in building a knowledge-based simulation of a complex socio-political institution, based upon the emergent properties exhibited by a number of competing and cooperating agents. The institution is the House of Lords in the English Parliament of 1640-1642 (the `Long Parliament'). The author introduces and examines a hybrid form of simulation which employs techniques from distributed systems research, theories of agent modelling, and distributed artificial intelligence. The resulting model of the House of Lords is presented and discussed with reference to the problems of representing historical knowledge, and a tentative evaluation of the model is examined.
In order to shed light on those who performed textual computations before the advent of the computer, the author has brought together a vast collection of data which were spread in unconnected fields and hidden in hard-to-find documents, focussing on five branches:
Despite the increasing availability of online text corpora and text analysis tools, automated text analysis is not a solved problem: recent tests reveal varying degrees of accuracy. The author identifies four common errors:
The authors question the usefulness of computer analysis in literary studies. While acknowledging that quantitative methods have contributed a great deal to the field, they raise questions about validity: small sample size, faulty selection of statistical tools, and the misinterpretation of results. They identify a tendency to associate perceived surplus with measures of frequency, and cite many examples where statistical readings are prone to errors of the sorts mentioned.
The aim of the BNC is to produce a 100 million word corpus of modern British English; the project is now over half way through its planned development stage. Inclusion in the corpus is on four criteria: subject matter; date of first publication (mostly 1975 onwards); medium; and general level (high, middle or low). To maximise variety, the number of words in any one text does not normally exceed 40,000. Ninety million words will be written material, the remainder transcribed from spoken sources. Most text has had to be keyed in or scanned, and has been marked up using CDIF (Corpus Document Interchange Format), an application of SGML conforming to the standards of the TEI. Part-of-speech tagging is carried out at Lancaster University using a modified version of CLAWS. The production phase is scheduled to end in April 1994, when the corpus will become available for research.
If we regard ourselves as the eventual transmitters and receivers [of signals], the model needs to be refined, because our own, often unwitting, interventions in the process override the boundaries between transmitter, signal and noise, as well as those between signal, noise and receiver. In computational stylistics, it is possible to decipher a superabundance of signals in what is usually dismissed as noise.
The analytical procedure employed here is to establish frequency-tables for the 30, 50 or 100 most common words of a given set of texts, to extract a correlation matrix, and subject it to a Principal Components analysis. In the second phase, the tables are subjected to distribution tests to determine which words play most part in separating the texts. Ten years' work [at Newcastle] has enabled the formation of useful inferences at extremely high levels of probability, including consistent and intelligible differences between 18th-, 19th- and 20th-century authors; between authors of different nationalities writing in English; and between male and female authors in earlier eras.
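The first phase of this procedure (frequency table, correlation matrix, Principal Components analysis) can be sketched in outline. The sketch below is an assumption-laden illustration, not the Newcastle implementation: it treats words as variables and texts as observations, tokenises on whitespace, and finds only the first principal component, using power iteration so that no external library is needed. All function names are hypothetical.

```python
import math
from collections import Counter


def frequency_table(texts, top_n):
    """Rows are texts; columns are relative frequencies of the top_n commonest words."""
    tokenised = [t.lower().split() for t in texts]
    overall = Counter(w for toks in tokenised for w in toks)
    words = [w for w, _ in overall.most_common(top_n)]
    table = [[toks.count(w) / len(toks) for w in words] for toks in tokenised]
    return words, table


def standardise_columns(table):
    """Centre each word column and scale it to unit length (constant columns become zeros)."""
    columns = []
    for col in zip(*table):
        mean = sum(col) / len(col)
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        columns.append([d / norm if norm > 1e-12 else 0.0 for d in dev])
    return columns


def correlation_matrix(columns):
    """Word-by-word correlations: dot products of the standardised columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def first_component(matrix, iterations=200):
    """Approximate the dominant eigenvector of the matrix by power iteration."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary fixed start vector
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


def component_scores(columns, vec):
    """Project each text onto the component so that texts can be plotted or compared."""
    n_texts = len(columns[0])
    return [sum(columns[w][t] * vec[w] for w in range(len(vec))) for t in range(n_texts)]


# Hypothetical table: four texts, three words; texts 0-1 and 2-3 form two groups.
table = [[0.10, 0.02, 0.05], [0.09, 0.03, 0.04], [0.02, 0.10, 0.05], [0.03, 0.09, 0.04]]
columns = standardise_columns(table)
corr = correlation_matrix(columns)
scores = component_scores(columns, first_component(corr))
```

In this toy table the first component places texts 0 and 1 on one side of zero and texts 2 and 3 on the other, which is the kind of separation the procedure looks for. A full analysis would extract further components and use a tested numerical library.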
The gender differentiation was based on analyses of 99 first-person authors (45 female and 55 male) and revealed statistically significant differences between these subsets which increase substantially when the authors are also divided chronologically. The inference is that male and female authors write more like each other now than they used to do, but when the most powerful of the wordtypes acting as gender differentiators are subjected to Principal Components analysis, the resulting graph still locates most authors appropriately as male or female. The exceptions testify in their various ways to the claim that narrative style is affected by factors such as education and upbringing, so the results may be best understood as reflecting educational rather than gender differences.
Comments by Nancy Ide, President, Association for Computers and the Humanities; Susan Hockey, President, Association for Literary and Linguistic Computing; Michael Neuman, Local Organizer, ACH-ALLC93; Pierre Lafon, Local Organizer, ALLC-ACH94.