SGML: Trip report, CETH Summer Seminar 1995


From: "C. M. Sperberg-McQueen" <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Date: Tue, 27 Jun 1995 10:04:52 CDT
Subject: Trip report: CETH Summer Seminar 1995
To: Text Encoding Initiative public discussion list <TEI-L@UICVM.UIC.EDU>



                 Trip Report:  CETH Summer Seminar '95


                         C. M. Sperberg-McQueen

                              25 June 1995


   The Center for Electronic Texts in the Humanities at Princeton and
Rutgers Universities held its fourth summer seminar earlier this month
under the title ELECTRONIC TEXTS IN THE HUMANITIES:  METHODS AND TOOLS.
The organizers paid me the compliment of inviting me to teach one ple-
nary session and several breakout sessions on SGML and the Text Encoding
Initiative, so I had the pleasure of attending the entire course.  The
two weeks of the seminar were too full to allow a comprehensive report
of their content to be made, and the intensity of the participants was
too great for a written summary to convey the full experience of the
seminar.  But it does seem worthwhile nevertheless to make at least a
brief report on the outstanding impressions of the seminar, while those
impressions are fresh.

   In previous years, the seminar had been organized as a unified series
of lectures and hands-on sessions in the Princeton computer labs.  This
year all participants attended the same series of plenary lectures and
one or two plenary hands-on sessions, but about a quarter of the
instructional time was reserved for special-interest sessions which ran
in parallel tracks.  By means of the shared plenary sessions, partici-
pants got a systematic overview of issues relating to electronic texts;
in the parallel tracks, it was possible to pursue certain issues in more
depth than has been possible at previous CETH summer seminars.  Susan
Hockey and Willard McCarty, the co-directors of the seminar, taught a
track on Textual Analysis.  Daniel Greenstein of Glasgow taught a track
on Tools for Historical Analysis.  Anita Lowry of Iowa led her partici-
pants through the technical, policy, personnel, and other issues of Set-
ting Up an Electronic Text Center, while Peter Robinson (Oxford), Geof-
frey Rockwell (McMaster) and I taught, respectively, tracks on Scholarly
Editing, Hypertext for the Humanities, and the Text Encoding Initiative
and SGML.  Participants included a range of researchers and members of
support staff, employed variously as librarians, faculty members, mem-
bers of professional research or technical staffs, and graduate stu-
dents.  There was a bias toward literary subjects, but ample
representation from linguistics and other textual disciplines.

   As usual, the participants were for the most part lodged in Princeton
dormitories; these are no more Spartan than those at many other campus-
es, I suppose, and two weeks of communal baths can be a nostalgic
experience for many of us, but the absence of air conditioning did seem
particularly brutal this year.  Some participants, more foresighted or
less nostalgic than the rest of us, took rooms at the Nassau Inn for the
duration.

   In the introductory plenary session, Susan Hockey introduced the
notion of electronic texts through a brief survey of text retrieval
tools (word lists, concordances, indices, etc.) and the history of
computer-assisted literary and linguistic studies (and their immediate
forebears, going back to the word-length studies of Mendenhall at the
end of the nineteenth century).  She also gave a survey of existing
archives and inventories.  A major problem confronting potential users
of electronic text is that many archival sites -- perhaps most -- do not
know precisely what they hold, or lack adequate bibliographic and
technical descriptions of it.  Until recently there has been no standard
method of documenting the texts; as a result, there is an urgent need to
compile a short list of the most essential information.  The potential
user must also come to grips with the wide variety
of encoding schemes in which material of interest may have been encoded,
with the general obscurity of the copyright situation, and with the
highly variable quality of existing texts.

   In the afternoon, Willard McCarty outlined the issues involved in
choosing whether to keyboard a text or to scan it, and demonstrated some
fairly typical microcomputer software for optical character recognition.
The enthusiasm of some participants was noticeably dampened when he
walked them through the calculation which reckons up, for a scanner with
98 or 99% accuracy, how many errors will be found on a typical page.
(If 1% of the 2000 characters on a typical typed page are in error,
there will be twenty errors on that page:  not enough to get a passing
grade in most first-year typing classes.)  Methods for raising the accu-
racy rate were discussed, the simplest of which seems to be contracting
out with a service bureau for a better rate (at, of course, a commensu-
rately higher cost).
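
   For anyone who wants to redo the arithmetic, the calculation can be
sketched in a few lines of Python (purely illustrative; the page size
and accuracy figures are simply the round numbers used above):

      def expected_errors(chars_per_page=2000, accuracy=0.99):
          # Expected number of misrecognized characters on one page,
          # given a per-character recognition accuracy.
          return chars_per_page * (1 - accuracy)

      for accuracy in (0.98, 0.99):
          errors = expected_errors(accuracy=accuracy)
          print("%.0f%% accuracy: about %.0f errors per page"
                % (accuracy * 100, errors))
      # 98% accuracy: about 40 errors per page
      # 99% accuracy: about 20 errors per page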

   On the second day, Willard McCarty gave an introduction to basic
tools, including most prominently the concordance.  While a full history
of the concordance remains to be written (and deserves to be written),
it can nevertheless be traced readily to the middle ages, when (as one
participant pointed out) it developed as the monasteries gave way to the
universities as the primary cultivators of literacy.  This intro-
ductory session was followed by a hands-on introduction to TACT, the
interactive concordance program developed at the University of Toronto
by John Bradley.  TACT is unfortunately rather fussy about running on
networked machines, so this session proved rather frustrating to some
participants, I gather.
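
   For readers who have not met the tool, the basic idea of a
keyword-in-context concordance display can be suggested by a toy Python
sketch (TACT, of course, does a great deal more than this):

      def kwic(text, keyword, width=30):
          # Print each occurrence of the keyword with up to `width`
          # characters of context on either side.
          words = text.split()
          for i, word in enumerate(words):
              if word.strip(".,;:!?").lower() == keyword.lower():
                  left = " ".join(words[:i])[-width:]
                  right = " ".join(words[i + 1:])[:width]
                  print("%*s  %s  %s" % (width, left, word, right))

      kwic("Water, water, every where, Nor any drop to drink.", "water")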

   In the afternoon of the second day, I gave an introduction to SGML
and the Text Encoding Initiative.  After describing the goals and syntax
of SGML, I took the participants through the steps of document analysis
and we performed a superficial but diverting analysis of a fragment of
the "Rime of the Ancient Mariner" (which won out by a narrow margin over
a fragment of Gibbon's DECLINE AND FALL OF THE ROMAN EMPIRE).  The
afternoon was concluded with a summary of background information on the
TEI and a brief overview of its contents.  After supper, a group of die-
hards reconvened in the basement computer lab at the Princeton Computer
Center to tag the "Rime of the Ancient Mariner" using the TEI's SGML
document type definition.  This was the first of a whole series of
improvised evening sessions added to the program ad hoc to address spe-
cific points of interest.

   On the third day, Bob Hollander of Princeton discussed the Dartmouth
Dante Project, in which he has managed to put into electronic form a
selection of running commentaries on Dante's Comedy ranging from the
fourteenth to the twentieth centuries, with software to enable them to
be searched, and for commentaries on the same passage to be compared.
Greg Murphy, who manages text systems for CETH, also gave an introduc-
tion to the ARTFL database at the University of Chicago.  The afternoon
was devoted to the parallel tracks.

   Thursday morning of the first week saw an introduction to issues of
scholarly editing from Peter Robinson (the author of a prominent colla-
tion program) with a panel discussion in which a number of practicing
editors participated.  David Chesnutt, the editor of the PAPERS OF HENRY
LAURENS; Hoyt Duggan, who is creating a massive parallel electronic
edition of the manuscripts (and eventually of the archetypes) of the
various texts of PIERS PLOWMAN; and Richard Finneran, the editor of a
hypertext edition of Yeats, were joined by Douglas Kincade of Princeton
University Press in a discussion of practical and theoretical issues in
electronic editions and their relation to paper editions.

   The first week was concluded by an exceptionally lively presentation
from Dan Greenstein on the topic of structured databases, and the trou-
bles of historians who use databases to summarize their data but wish to
keep the data tied closely to the textual witnesses from which the sum-
maries are derived.  He discussed the difficulties in yoking normalized
relational databases to text at such length, and with such vivid exam-
ples, that it was rather a relief when he finally began to discuss
methods of bringing text and database together in a useful way.  This is
territory first mapped out, I believe, by Manfred Thaller's programs
CLIO and (later) KLEIO, but Greenstein discussed at somewhat more length
the advantages of using the feature structure notation defined in the
TEI Guidelines.  This surprised no one, since Greenstein was one of the
principal actors in demonstrating the applicability of feature struc-
tures to areas outside linguistics.
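
   The underlying idea -- keep the structured summary and the wording of
the source tied together, rather than normalizing the text away -- can
be suggested by a small Python sketch (the record and its fields are
invented for illustration; the TEI feature structure notation Greenstein
discussed is of course richer than a flat dictionary):

      from dataclasses import dataclass

      @dataclass
      class Observation:
          source_text: str   # the witness's own wording, kept verbatim
          features: dict     # the database-style summary derived from it

      record = Observation(
          source_text="Joh. Smith, weaver, of this parish, aged abt. 40",
          features={"surname": "Smith", "occupation": "weaver", "age": 40},
      )

      # A structured query and the original wording remain available together.
      print(record.features["occupation"], "<--", record.source_text)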

   At the weekend, some participants went home, others to New York, and
still others spent much of a beautiful Saturday in the computer lab.  On
Sunday about half the participants in the seminar went into northern New
Jersey for a hike near the Delaware Water Gap.  Alumni of previous years
will remember this hike with some fondness, and will grieve, no doubt,
to learn that there were too many of us this year to fit into the diner
at the bottom of the hill, so that this year's participants had to
return home without initiation into the mysteries of scrapple.

   The second week began with an introduction to hypertext from Geoff
Rockwell, who was teaching the hypertext track.  He demonstrated free-
standing commercial hypertexts, the use of Hypercard to create one's own
hypertext, and of course the use of the World Wide Web and HTML for
hypertext delivery.  The high point, by common consensus, was the Mac-
beth segment, especially the demonstration of the karaoke Macbeth with
Rockwell as Macbeth.  Also on Monday morning, Greg Crane of Tufts
University described the Perseus Project, a large collection of
materials for
the study of classical civilization currently delivered with hypertext
functionality provided in Hypercard.  The Perseus materials, however,
Crane was careful to note, are encoded in non-proprietary standard
forms: the texts in SGML, the images in standard graphics formats at
significantly higher resolutions than are currently deliverable on the
desktop.  As a result, Perseus will be able to survive and be delivered
in other forms when Hypercard has gone the way of all software and dis-
appeared into obsolescence.  He capped his talk with a short presenta-
tion of the electronic version of the Greek lexicon of Liddell, Scott,
and Jones, which has recently been digitized thanks to a grant from the
National Endowment for the Humanities.  He had expected a great deal of
arduous work to be necessary, he said, before the material could be
retagged enough to be useful, but had recently discovered that even with
relatively rudimentary SGML tagging (identifying little more than
entries, definitions, and citations to classical authors) the lexicon
can be usefully consulted. He demonstrated how the lexicon could be
linked to the morphological analyzer developed for Perseus so that a
student can read a text on line, click on a form, and be sent to the
appropriate entry in Liddell and Scott.  (This does not work, of course,
for absolutely every form:  the analyzer is stumped by some forms.)  The
citations in the lexicon can also be analyzed, with large though not
absolutely complete success, so that the student can see when a particu-
lar word in Thucydides, for example, is discussed in the lexicon.
(Since the lexicon focuses, as does Perseus, on the most commonly stud-
ied texts, about one word in fifteen in Perseus texts is cited specifi-
cally in the lexicon.)  As the work of enhancing the markup of the dic-
tionary progresses, even more sophisticated tools and searches will be
possible.  But even the simple expedient of inserting a blank line
between senses can render a complex article more easily read in elec-
tronic form than in paper.
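
   The chain Crane demonstrated -- surface form, morphological analyzer,
lexicon entry -- can be suggested in miniature by a hypothetical Python
sketch (the forms, lemmata, and glosses below are invented placeholders;
the Perseus software is naturally far more capable):

      # Hypothetical analyzer output: inflected form -> lemma (headword).
      MORPH = {"elusen": "luo"}

      # Hypothetical lexicon keyed by lemma, as a minimally tagged entry
      # (headword plus definition) might appear.
      LEXICON = {"luo": "loosen, release; dissolve"}

      def look_up(form):
          lemma = MORPH.get(form)
          if lemma is None:
              return form + ": the analyzer is stumped by this form"
          return "%s -> %s: %s" % (form, lemma, LEXICON.get(lemma, "no entry"))

      print(look_up("elusen"))
      print(look_up("xyz"))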

   Tuesday, Anita Lowry discussed issues of institutional support for
electronic texts, drawing on her own extensive experience at Columbia
and Iowa and on the experience of colleagues elsewhere.  A guest lecture
from Richard Gartner of the Bodleian Library gave insight into how that
institution is responding, in complex ways corresponding to its own
labyrinthine organization, to the advent of electronic texts.  Paul
Evan Peters of the Coalition for Networked Information also appeared for
a guest lecture, in which he addressed policy issues at a national and
international level.  I never expect discussions of national or interna-
tional information policy to be coherent, let alone interesting, but
Paul Peters is the kind of speaker who could give policy analysis a good
reputation.

   Wednesday, the omnipresent Peter Robinson reappeared, this time to
discuss digital imaging techniques, which he has been dealing with in
connection with his project to create CD-ROM editions of all the manu-
scripts of the CANTERBURY TALES, tale by tale.  Kirk Alexander of
Princeton's Interactive Computer Graphics Lab gave a presentation on the
Piero Project, which uses CAD software to allow the user to study in
great detail the program of a fresco cycle by Piero della Francesca.

   On Thursday, Greg Murphy, Peter Robinson, and I gave a brief compara-
tive demonstration of three methods of delivering SGML-encoded electron-
ic text to potential readers. (By this time, the seminar had taken on
the character of a TEI tent meeting, and no one in the seminar was
admitting to any interest in any text NOT encoded in SGML, and prefera-
bly in TEI.)  First, we showed a sample text written in TEI form but
translated automatically into HTML, to illustrate the critical point
that delivery in HTML does not require that the document be maintained
in HTML; then we showed the same text as it might be delivered over the
network in SGML and displayed in Panorama, the SGML viewer distributed
by SoftQuad in both free and commercial forms.  Since the sample text
was not particularly complex or demanding typographically or hypertextu-
ally, Greg Murphy then showed some more complex materials encoded in
SGML, which can be found on the CETH home page at
http://www.princeton.edu/~gjmurphy/sgml/.  These include a selection of
Aesop's FABLES with two different Panorama navigators, and two articles
from the psycho-analytic literature on aphasia, one by Freud and one by a
later commentator.  Finally, Peter Robinson showed the original text,
compiled as a DynaText book.
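
   The critical point above -- that a text maintained in TEI can be
translated into HTML for delivery -- can be illustrated with a toy
Python sketch (the tag mapping is invented and attributes are simply
dropped; this is not the converter used in the demonstration, only an
illustration of the principle):

      import re

      # Toy mapping from a few TEI element names to rough HTML equivalents.
      TEI_TO_HTML = {
          "lg": "blockquote",   # verse stanza
          "l": "p",             # verse line
          "head": "h2",
      }

      def tei_to_html(text):
          # Replace known TEI start/end tags with HTML tags, dropping any
          # attributes; unknown element names are passed through unchanged.
          def swap(match):
              slash, name = match.group(1), match.group(2).lower()
              return "<%s%s>" % (slash, TEI_TO_HTML.get(name, name))
          return re.sub(r"<(/?)([A-Za-z0-9.]+)[^>]*>", swap, text)

      sample = ("<lg><l>It is an ancient Mariner,</l>"
                "<l>And he stoppeth one of three.</l></lg>")
      print(tei_to_html(sample))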

   In the second half of the morning, George Miller of Princeton spoke
about WordNet, a large semantic network of modern English he has been
working on for some time, and demonstrated some of the many varieties of
software which can exploit the information it contains.
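
   To give a flavor of what such a semantic network contains, here is a
toy Python sketch (the synonym sets and is-a links below are invented
miniatures, not WordNet data):

      # Each synset groups near-synonyms; HYPERNYM links a synset to a
      # more general one, so "is-a" chains can be followed upward.
      SYNSETS = {
          "canine": {"dog", "hound"},
          "mammal": {"mammal"},
          "animal": {"animal", "creature"},
      }
      HYPERNYM = {"canine": "mammal", "mammal": "animal"}

      def hypernym_chain(synset):
          # Follow is-a links from a synset up to the top of the hierarchy.
          chain = [synset]
          while synset in HYPERNYM:
              synset = HYPERNYM[synset]
              chain.append(synset)
          return chain

      print(" -> ".join(hypernym_chain("canine")))
      # canine -> mammal -> animal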

   On the Thursday afternoon and Friday morning of the second week, the
sessions were given over to presentations by participants in the semi-
nar, describing the long-term goals of the projects they were working
on, and the progress made on them during the course of the seminar.
These included several WWW pages constructed to address institutional or
disciplinary needs, a number of projects in linguistic analysis (of
endangered languages in Northwest China, of English, of Estonian, and of
Korean), and a variety of editions (of the glossa ordinaria, of a
twentieth-century writer influential in Futurist circles, of
revolutionary-era American state constitutions, of Mark Twain letters,
and of a poem by Pushkin, among others), as well as essays in literary
or stylistic analysis, materials for language instruction, and yet more.
As in previous years, the participants' presentations were a highlight
of the entire seminar, and strengthened me in my belief that with such
vigorous and interesting work going on, humanities computing is in very
healthy shape.

   Congratulations are due to the sponsors (CETH, together with the Cen-
tre for Computing in the Humanities at Toronto) and the organizers (an
untiring staff at CETH, with assistance from Princeton's Computing and
Information Technology group), who provide, in this annual seminar, a
signal service to all of us interested in the application of computers
to humanistic studies.

-C. M. Sperberg-McQueen