Date: Tue, 1 Nov 1994 16:36:53 -0600
From: Susan Hockey <HOCKEY@zodiac.rutgers.edu>
Reply-To: ese@msstate.edu
To: Multiple recipients of list <ese@msstate.edu>
Subject: Developing access to electronic texts - summary of ADE presentation

At Peter's request, I am posting a summary of the remarks I made in the session `Scholarly Editions in the Digital Age' at the Association for Documentary Editing meeting last week. Comments are very welcome now. I will also base my presentation at the MLA (Session 736 at 10.15 on Friday 30 December) on these remarks and hope that this session will generate a good discussion. The other speakers are Charles Faulhaber and Michael Sperberg-McQueen.

Susan Hockey
Center for Electronic Texts in the Humanities
Rutgers and Princeton Universities

-----------------------------------------------------------------------------

Developing Access to Electronic Texts
Summary of Presentation for the Association for Documentary Editing

The electronic edition will consist of digital images, transcripts of text and scholarly apparatus of various kinds. A usable scholarly resource must emerge from this amorphous mass of material in such a way that the potential of the electronic medium for new forms of publication and research is fully exploited whilst still maintaining the integrity and authority of the traditional printed edition.

SGML is important for maintaining an archival form of the information and for separating the data from the software. The Text Encoding Initiative's implementation of SGML contains many features relevant to electronic editions. Areas which are not well catered for within the current version of the TEI Guidelines can be added in the future within the same overall framework. Two aspects of the TEI are particularly important: firstly, the header, which provides documentation of the electronic text and a link to the library catalog; and, secondly, the tags for analytic and interpretive information, which permit multiple and possibly conflicting interpretations to be associated with the same section of text. The encoding is incremental, with new scholars adding encoding to the same text (a small illustrative sketch of this layering follows below).

For the software, new users are often attracted by glitz, but it is the functionality which is important. It takes longer to grasp the scope of the functionality, and new users often want to `make a CDROM'. CDROMs are perhaps the medium of choice at present, since they fit in better with current publishing and library procedures, but they are limited in size, their lifespan is not known, and they present a closed system since the user cannot write to them. The information can only be used in pre-determined ways, and the variety of CDROMs available now presents substantial support problems for libraries.
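Here is a small sketch of the kind of layered analytic encoding described above. It is purely illustrative and uses Python rather than SGML; the data structures, field names and example annotations are my own inventions and are not prescribed by the TEI Guidelines. The point is only that interpretations from different scholars, including conflicting ones, can be attached to the same span of the base text without altering it.

    # Illustrative sketch only: interpretations from several scholars are
    # layered over the same stretch of text, by character offsets, without
    # changing the text itself.  Conflicting readings of the same span can
    # coexist, and a later scholar can add a further layer at any time.

    TEXT = "The quality of mercy is not strain'd"

    annotations = [
        {"start": 4, "end": 11, "who": "scholar-A", "note": "legal sense"},
        {"start": 4, "end": 11, "who": "scholar-B", "note": "theological sense"},
        {"start": 15, "end": 20, "who": "scholar-A", "note": "personification"},
    ]

    def annotations_for(start, end):
        """Return every interpretation attached to the span [start, end)."""
        return [a for a in annotations if a["start"] == start and a["end"] == end]

    # A third scholar adds a new, conflicting layer without touching the
    # encoding already in place.
    annotations.append({"start": 4, "end": 11, "who": "scholar-C", "note": "ironic usage"})

    for a in annotations_for(4, 11):
        print(TEXT[a["start"]:a["end"]], "--", a["who"], ":", a["note"])

In a real edition this layering would of course be expressed in SGML within the TEI framework, with each layer documented in the header.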
The future means of access to the electronic edition will be via the network, but we need to look beyond the current generation of network navigation tools (ftp, gopher, WAIS, etc.), which are too limited for this kind of data. The Mosaic interface to the World Wide Web is very popular at present, but it is based on a simple SGML application which is oriented towards presentation rather than analysis, and it causes fragmentation of the data into many small documents. Mosaic is useful as a front-end to a more sophisticated retrieval and analysis system. At present there are no authentication procedures on the Internet, so users have no way of knowing what they have got and where it came from. Maintenance is also dependent on volunteers, who are often more enthusiastic about new material than about keeping the old accessible.

The network-based system of the future needs to satisfy many, possibly conflicting, scholarly concerns, but it must also be manageable for maintenance purposes. The access software must provide a wide range of options, from general-purpose, easy-to-use facilities to detailed and specific requirements, and its specification must be based on research into what scholars want to do. Retrieval and concordance functions are reasonably well understood. TACT and OCP, two of the most widely used programs developed within the humanities computing community, are examples which include very flexible searching, allowing for different alphabets, punctuation, searches by frequency and so on. However, the retrieval programs in use today employ low-level string searching with Boolean logic, a technology which has not progressed since the 1960s. We need to separate homographs and to include lemmatization and morphological and syntactic analysis. One approach to developing more sophisticated retrieval is to create electronic linguistic resources, such as a lexical database from which a retrieval program will derive more information about the word or concept the user is searching for. The database will contain the lemma, a morphological analysis and common collocates to indicate the semantic field, and it can be treated as a dynamic object which is constantly updated.

The networked electronic medium also facilitates ways of dealing with multiple versions, enabling a model to be built of the development of a text, but we need to construct a prototype to investigate how well this might work in practice. The network also permits multiple annotations by different people on the same text. This would operate best in a controlled environment with annotation management software which would control who makes annotations and present a menu of annotations to users of the text. The network also facilitates access to parts of a document, enabling the user to model the scholarly process by drawing small units of information from different places. It is inefficient to store the information in small units, so an adequate linking or pointer mechanism must be set up.

The access software must be able to handle images, providing zoom, rotate, enhance and superimpose functions and the like. There is also a need to link images to transcripts, ideally at the word level or below. This would enable a user, for example, to click on a word in a digital image and move to all the other places where that word occurs, displaying the image of the source each time. Linking text to images is a very time-consuming process, and experiments to assist it by automation are already being conducted.

Various approaches are now being tested for authenticating electronic documents, one of the most notable being timestamping, which generates a unique number from the document; a server on the network then verifies the number.
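To make the `unique number' idea concrete, here is a small sketch in Python, offered only as an assumption about one possible implementation rather than a description of any particular timestamping service. It derives a fixed-length fingerprint from the full text of a document with a cryptographic hash function; a real system would also register the fingerprint, with the time of deposit, at a trusted server so that later alterations can be detected.

    # Sketch of the "unique number" behind document timestamping: any change
    # to the text, however small, produces a different fingerprint, so a
    # reader (or a network server holding the registered value) can confirm
    # that the copy in hand is the one that was deposited.

    import hashlib

    def fingerprint(document_text):
        """Derive a fixed-length number from the full text of a document."""
        return hashlib.sha256(document_text.encode("utf-8")).hexdigest()

    edition = "Electronic edition: transcript and apparatus ..."
    stamp = fingerprint(edition)

    assert fingerprint(edition) == stamp          # unchanged text verifies
    assert fingerprint(edition + " x") != stamp   # any alteration is detected
    print(stamp)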
It is also possible to timestamp only certain SGML elements, leaving the user free to modify the rest of the text.

Institutional support is needed for such a system to operate well. It is needed to guarantee the availability of the material, to ensure that it is supported, to manage updates in a controlled fashion, and to monitor usage with the purpose of optimizing access and effort. All this costs money, and an annual subscription, rather than pay-per-view, seems best for the academic community.

Developing the system as outlined here is a collaborative process. It will involve scholars, librarians and computer scientists. It will entail much research to examine the needs of different groups and to identify those requirements which are common to several groups and can thus be satisfied by the same computer code. Flexibility and a path to enhancement are essential. The system must provide texts which are recognized scholarly resources, and sufficient functionality in the software to exploit those texts in many different ways.

The development of such a system is also an iterative process. It is usually easier for people to comment on the functionality of software when they have used it for some time and have become fully aware of its features and limitations. Development in stages over a period of time will help to satisfy as many needs as possible and to bring together different viewpoints. In our view this is an essential component if the system is to last well into the next century.

Acknowledgements

These remarks draw on ideas from many sources, but I would particularly like to acknowledge the work of Peter Robinson, John Lavagnino, the Electronic Peirce Consortium, and the draft MLA proposals for electronic editions circulated by Peter Shillingsburg.