Date: Tue, 1 Nov 1994 16:36:53 -0600
From: Susan Hockey <HOCKEY@zodiac.rutgers.edu>
To: Multiple recipients of list <ese@msstate.edu>
Subject: Developing access to electronic texts - summary of ADE presentation
X-Comment: Electronic Scholarly Editions

At Peter's request, I am posting a summary of the remarks I made
in the session `Scholarly Editions in the Digital Age' at the
Association for Documentary Editing meeting last week. 

Comments are very welcome now. I will also base my presentation at
the MLA (Session 736 at 10.15 on Friday 30 December) on these remarks
and hope that this session will generate a good discussion. The other 
speakers are Charles Faulhaber and Michael Sperberg-McQueen.

Susan Hockey
Center for Electronic Texts in the Humanities
Rutgers and Princeton Universities
-----------------------------------------------------------------------------
Developing Access to Electronic Text

Summary of ADE Presentation for Documentary Editing

The electronic edition will consist of digital images, transcripts of text and
scholarly apparatus of various kinds. A usable scholarly resource must emerge
from this amorphous mass of material in such a way that the potential of the
electronic medium for new forms of publication and research is fully exploited
whilst still maintaining the integrity and authority of the traditional
printed edition. 

SGML is important for maintaining an archival form of the information and for
separating the data from the software. The Text Encoding Initiative's
implementation of SGML contains many features relevant for electronic
editions. Areas which are not well catered for within the current version 
of the TEI Guidelines can be added in the future within the same overall 
framework. Important aspects of the TEI are, firstly, the header, which
provides documentation of the electronic text and a link to the library
catalog, and, secondly, the tags for analytic and interpretive information,
which permit multiple and possibly conflicting interpretations to be
associated with the same section of text. The encoding is incremental: new
scholars can add their own encoding to the same text.
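
The idea of several scholars attaching different, even conflicting,
interpretations to the same passage can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it keeps the
interpretations outside the text as character offsets rather than as TEI
tags, and the class, field and scholar names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    start: int      # character offset where the span begins
    end: int        # character offset just past the end of the span
    scholar: str    # who made the claim (hypothetical names below)
    category: str   # kind of interpretation, e.g. "sense" or "theme"
    value: str      # the interpretation itself


def interpretations_for(layer, start, end):
    """Return every interpretation whose span overlaps [start, end)."""
    return [i for i in layer if i.start < end and start < i.end]


text = "I saw the bank at dawn."

# Each scholar appends to the same layer; nothing is overwritten, so
# conflicting readings of "bank" can sit side by side.
layer = []
layer.append(Interpretation(10, 14, "Scholar A", "sense", "river bank"))
layer.append(Interpretation(10, 14, "Scholar B", "sense", "financial institution"))
layer.append(Interpretation(18, 22, "Scholar B", "theme", "beginnings"))

for item in interpretations_for(layer, 10, 14):
    print(f"{text[item.start:item.end]!r}: {item.value} ({item.scholar})")
```

Because new encoding is only ever appended, the text itself and the earlier
interpretations remain untouched, which is the incremental property
described above.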

For the software, new users are often attracted by glitz, but it is the
functionality which is important. It takes longer to grasp the scope of the
functionality and new users often want to `make a CDROM'. CDROMs are perhaps
the medium of choice at present, since they fit in better with current
publishing and library procedures, but they are limited in size, their
lifespan is not known and they present a closed system since the user cannot
write to them. The information can only be used in pre-determined ways and the
variety of CDROMs available now presents substantial support problems for
libraries.

The future means of access to the electronic edition will be via the network,
but we need to look beyond the current generation of network navigation tools
(ftp, gopher, WAIS etc.), which are too limited for this kind of data. The
Mosaic interface to the World Wide Web is very popular at present, but 
it is based on a simple SGML application which is oriented towards 
presentation rather than analysis, and it causes fragmentation of the data 
into many small documents. Mosaic is useful as a front-end to a more 
sophisticated retrieval and analysis system. At present there are no 
authentication procedures on the Internet and so users have no way of 
knowing what they have got and where it came from. Maintenance is also
dependent on volunteers who are often more enthusiastic about new material
than keeping the old accessible.

The network-based system of the future needs to satisfy many, possibly
conflicting, scholarly concerns, but must also be manageable for maintenance
purposes. The access software must provide a wide range of options from
general purpose easy-to-use facilities to detailed and specific requirements.
The specification for this software must be based on research on what scholars
want to do. Retrieval and concordance functions are reasonably well
understood. TACT and OCP, two of the most widely used humanities computing
programs, both developed within the humanities computing community, are
examples which include very flexible searching, allowing for different
alphabets, punctuation, searches by frequency etc. However, retrieval programs
in use today employ low-level string searching with Boolean logic, a technology
which has not progressed since the 1960s. We need to separate homographs and
to include lemmatization and morphological and syntactic analysis. One approach
to developing more sophisticated retrieval is to create electronic linguistic
resources such as a lexical database from which a retrieval program will
derive more information about the word or concept which the user is searching
for. The database will contain the lemma, the morphological analysis and
common collocates to indicate the semantic field, and can be treated as a
dynamic object which is constantly being updated.
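
A minimal sketch of retrieval driven by a lexical database, as opposed to
plain string searching, might look like the following. It is not how TACT or
OCP work; the tiny lexicon, its entries and the function names are invented
for illustration, and homographs are separated by the crude assumption that
the analysis sharing most collocates with the nearby words is the right one.

```python
import re

# Hypothetical lexical database: surface form -> possible analyses.
# Each analysis records the lemma, a morphological label and common
# collocates that hint at the semantic field.
VERB = {"army", "way", "follow"}
METAL = {"pipe", "roof", "heavy"}
LEXICON = {
    "lead": [
        {"lemma": "lead (verb)", "morph": "present", "collocates": VERB},
        {"lemma": "lead (metal)", "morph": "noun, singular", "collocates": METAL},
    ],
    "leads": [
        {"lemma": "lead (verb)", "morph": "present, 3rd singular", "collocates": VERB},
    ],
    "led": [
        {"lemma": "lead (verb)", "morph": "past", "collocates": VERB},
    ],
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def best_analysis(tokens, position, window=4):
    """Pick the analysis sharing most collocates with the nearby words.

    Ties go to the analysis listed first in the lexicon.
    """
    analyses = LEXICON.get(tokens[position], [])
    if not analyses:
        return None
    context = set(tokens[max(0, position - window):position + window + 1])
    return max(analyses, key=lambda a: len(a["collocates"] & context))


def search_lemma(text, lemma, window=4):
    """Return (left context, word, right context) for each hit on the lemma."""
    tokens = tokenize(text)
    hits = []
    for pos in range(len(tokens)):
        analysis = best_analysis(tokens, pos, window)
        if analysis and analysis["lemma"] == lemma:
            left = " ".join(tokens[max(0, pos - window):pos])
            right = " ".join(tokens[pos + 1:pos + window + 1])
            hits.append((left, tokens[pos], right))
    return hits


sample = ("The general led the army north. Heavy lead covered the old roof. "
          "She leads the way home.")
for left, word, right in search_lemma(sample, "lead (verb)"):
    print(f"{left:>28} [{word}] {right}")
```

A search for the verb finds "led" and "leads" but skips the metal "lead",
which a Boolean string search for "lead" could not do. Since the lexicon is
ordinary data, it can be extended without touching the search code.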

The networked electronic medium also facilitates ways of dealing with multiple
versions, enabling a model to be built of the development of a text, but we
need to construct a prototype to investigate how well this might happen in
practice. The network also permits multiple annotations by different people on
the same text. This would operate best in a controlled environment with
annotation management software which would control who makes annotations and
present a menu of annotations to users of the text. The network also
facilitates access to parts of a document, enabling the user to model the
scholarly process by accessing small units of information from different
places. It is inefficient to store the information in small units and so an
adequate linking or pointer mechanism must be set up.
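
One way to picture such a pointer mechanism is given below, as a sketch
only. Documents are stored whole, and a pointer names a document and a range
of character offsets within it. The document identifiers, the sample texts
and the function names are all hypothetical, and character offsets are just
one of several possible addressing schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    doc_id: str   # which stored document the unit lives in
    start: int    # character offset where the unit begins
    end: int      # character offset just past the end of the unit


# Documents are kept whole; small units are only ever pointed at.
ARCHIVE = {
    "letter-1": "Dear friend, the manuscript arrived safely on Monday.",
    "diary-3": "Monday: received the manuscript; began the collation.",
}


def resolve(pointer, archive):
    """Fetch the unit of text a pointer refers to."""
    document = archive[pointer.doc_id]
    if not (0 <= pointer.start <= pointer.end <= len(document)):
        raise ValueError(f"pointer out of range for {pointer.doc_id}")
    return document[pointer.start:pointer.end]


def find_pointer(doc_id, phrase, archive):
    """Build a pointer to the first occurrence of a phrase."""
    start = archive[doc_id].index(phrase)
    return Pointer(doc_id, start, start + len(phrase))


# A scholar's working path through units taken from different places.
reading_path = [
    find_pointer("letter-1", "the manuscript arrived safely", ARCHIVE),
    find_pointer("diary-3", "received the manuscript", ARCHIVE),
]
for p in reading_path:
    print(f"{p.doc_id}[{p.start}:{p.end}] -> {resolve(p, ARCHIVE)}")
```

Offsets of this kind break as soon as the underlying document is edited, so
a working system would need a more robust way of addressing parts of a text.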

The access software must be able to handle images, enabling zoom, rotate,
enhance, superimpose functions and the like. There is also a need to link
images to transcripts ideally at the word level or below. This would enable a
user, for example, to click on a word on a digital image and move to all other
places where that word occurs, displaying the image of the source. Linking
text to images is a very time-consuming process and experiments to assist this
by automation are already being conducted. Various approaches are now being
tested for authenticating electronic documents, one of the most notable being
timestamping which generates a unique number from the document. A server on
the network then verifies the number. It is also possible to timestamp only
certain SGML elements, leaving the user free to modify the rest of the text.
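
The timestamping idea can be sketched as follows. This does not describe any
particular timestamping service. It assumes that the unique number is a
SHA-256 digest, that the network server can be stood in for by a small
in-memory class, and that the protected element is a simple tag with no
attributes. All names here are hypothetical.

```python
import hashlib
import re


def fingerprint(data: str) -> str:
    """Derive a fixed-length number (as hex digits) from the text."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def protected_content(document: str, element: str) -> str:
    """Join the contents of every <element>...</element> pair.

    A regular expression is enough for this flat example; real SGML
    would need a proper parser.
    """
    name = re.escape(element)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return "\n".join(pattern.findall(document))


class TimestampServer:
    """Stand-in for the network server that records and checks numbers."""

    def __init__(self):
        self._registry = {}

    def register(self, doc_id, number, when):
        self._registry[doc_id] = (number, when)

    def verify(self, doc_id, number):
        recorded = self._registry.get(doc_id)
        return recorded is not None and recorded[0] == number


original = ("<text><transcript>In the beginning was the word.</transcript>"
            "<note>Editor's first comment.</note></text>")

# Only the transcript is timestamped; the note stays free to change.
server = TimestampServer()
server.register("edition-1",
                fingerprint(protected_content(original, "transcript")),
                "1994-11-01")


def is_authentic(document):
    number = fingerprint(protected_content(document, "transcript"))
    return server.verify("edition-1", number)


annotated = original.replace("first comment", "revised comment")
tampered = original.replace("the word", "a word")
print(is_authentic(annotated))
print(is_authentic(tampered))
```

The copy with a revised note still verifies, because only the transcript
element was fingerprinted, while the copy with an altered transcript does
not.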

Institutional support is needed for such a system to operate well. It is
needed to guarantee the availability of the material, to ensure that it is
supported, to manage updates in a controlled fashion and to monitor usage with
the purpose of optimizing access and effort. All this costs money and an
annual subscription, rather than pay-per-view, seems best for the academic
community.

Developing the system as outlined here is a collaborative process. It will
involve scholars, librarians and computer scientists. It will entail much
research to examine the needs of different groups and to identify those
requirements which are common to several groups and can thus be satisfied by
the same computer code. Flexibility and a path to enhancement are essential.
The system must provide texts which are recognized scholarly resources and
sufficient functionality in the software to exploit those texts in many
different ways. The development of such a system is also an iterative
process. It is usually easier for people to comment on the functionality of
software when they have used it for some time and can become fully aware of
its features and limitations. Development in stages over a period of time will
help to satisfy as many needs as possible and to bring together different
viewpoints. In our view this is an essential component if the system is to
last well into the next century.

Acknowledgements

These remarks draw on ideas from many sources, but I would particularly like
to acknowledge the work of Peter Robinson, John Lavagnino, the Electronic
Peirce Consortium and the draft MLA proposals for electronic editions
circulated by Peter Shillingsburg.