SGML: Trip Report, CNI (Sperberg-McQueen)

Date:         Fri, 21 Apr 1995 17:11:00 CDT
Reply-To: "C. M. Sperberg-McQueen" <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Sender: Text Encoding Initiative public discussion list
              <TEI-L%UICVM.bitnet@UTARLVM1.UTA.EDU>
From: "C. M. Sperberg-McQueen" <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject:      trip report (CNI)
To: Multiple recipients of list TEI-L <TEI-L%UICVM.bitnet@UTARLVM1.UTA.EDU>



                              Trip Report

                  Coalition for Networked Information
         Task Force Meeting, Washington, D.C. 10-11 April 1995
          CNI/AAUP Joint Initiative Workshop, 11-12 April 1995


                         C. M. Sperberg-McQueen
                             21 April 1995


   The spring meeting of the Coalition for Networked Information took
place earlier this month in Washington D.C.; it was immediately followed
by a workshop for participants in a joint project of CNI and the Ameri-
can Association of University Presses (AAUP).  I attended both; herewith
a short description, by no means complete, of some things which caught
my eye or ear.

   CNI is a joint project of the Association of Research Libraries, Edu-
com, and CAUSE; I hope the reader knows what these organizations are and
do, because if you don't, you won't find out here.  The major activities
of CNI, however, are staffed not by its three sponsors but by institu-
tions which pay annual membership dues to be included in the CNI Task
Force.  The specific task to be addressed by this Task Force is, to be
honest, not completely clear to me; one of its main functions (though I
assume not its only function) appears to be holding semiannual meetings
at which representatives from publishing houses, vendors of network ser-
vices, software, and hardware, universities, and other organizations
interested in fostering the networked information revolution, or at
least in surviving that revolution with their heads firmly attached to
their shoulders, can talk with each other and learn what is up in the
various arenas within which the revolution is being played out.  I
attended the meeting because the Model Editions Partnership organized by
David Chesnutt (more details below), in which I play a small role, has
recently been accepted as a participant in the CNI/AAUP joint initia-
tive; one result of that acceptance was an invitation to attend the
spring task force meeting.

   The theme of the spring meeting was Digital Libraries, and great
efforts had been made to enable participants to learn more about the
Digital Libraries Initiative (hereafter mostly DLI, for short) now
underway under the joint sponsorship of NSF, ARPA, and NASA.  After the
opening welcome by Paul Evan Peters (CNI's executive director), there
was a panel discussion by representatives of the three funding agencies,
and during the various breakout sessions there were project briefings by
the six consortia (based at Stanford, UC Berkeley, Carnegie Mellon,
Illinois, UC Santa Barbara, and Michigan) on the nature of their goals
and on their results so far.  The meeting agenda lists the purposes of
the meeting as (I quote):

*   to promote understanding of digital library research and development
    [...]
*   to promote understanding of key concepts and exemplary initiatives
    [in] networked information resource and discovery tools and services
*   to provide an environment in which people [...] can share experi-
    ences, visions, and plans
*   to provide an opportunity for people [...] to discuss network [...]
    policy issues and initiatives
*   to provide an opportunity for [people] to identify needs, to formu-
    late priorities, and to evaluate results

   The opening panel, I have to say, gave the meeting an inauspicious
beginning, because of the three speakers only one was lively or ener-
getic, and the diffident, mumbling delivery of the other two did
not do justice to the material they were presenting.  When one is sit-
ting in the second row, directly in front of the overhead projector,
should one normally expect to have trouble hearing or understanding the
speakers?  Pity the people in the far corners of the room!

   Steve Griffith of the National Science Foundation (NSF) described the
organizational structure of NSF, ARPA, and NASA, as a way of clarifying
the organizational context within which the Digital Libraries Initiative
arose.  He gave a brief overview of its chronology (announcement and
program solicitation in fall 1993, deadline for submissions in February
1994, four-stage proposal evaluation in spring and summer 1994,
announcement of awards in September 1994, projects to run about four
years, beginning in fall 1994), and noted some of the highlights of the
program (multi-agency sponsorship, multi-directorate support within NSF,
awards as cooperative agreements rather than as grants or contracts,
emphasis on multidisciplinary, multi-sector projects, and focus on cre-
ating experimental testbeds and prototypes).  He reviewed the list of
funded projects, and noted that the program has attracted much broader
public discussion than was originally expected, with recent cover sto-
ries in the COMMUNICATIONS OF THE ACM and other similar publications.

   He said the goals of the project were:

*   to advance fundamental research
*   to develop and distribute new digital-library technology
*   to build new applications and services
*   to establish commercial presence and influence in the digital
    library

If he described exactly what sort of commercial influence the sponsors
want to establish, my notes don't say, which is perhaps a shame.  He did
describe in more detail the research areas being addressed by the
project:

*   methods of capturing data in all forms
*   categorization of information resources and services
*   development of software and algorithms
*   development of tools, protocols, and procedures for digital library
    services
*   impact of digital libraries

Related areas within NSF include network management, cross-disciplinary
activities, and the multidisciplinary research program.

   Nand Lal from NASA's Goddard Space Flight Center then described
NASA's perspective on the digital libraries program.  Like any large
technical organization, NASA has lots of data to manage, and its data
holdings are growing at the rate of about 250 Gigabytes per day (raw;
the processed data runs about 1.2 Terabytes per day).  A national infor-
mation infrastructure will require significant technical advances;
NASA's work is driven by a vision of the digital library which "includes
the functionality of a traditional library, but is more than simply a
digitized version of the same.  It is a collection of INFORMATION
RESOURCES and SERVICES (accessible via the national information infra-
structure) that allows a subscriber EASY and TIMELY ACCESS to USEFUL
INFORMATION and knowledge at a REASONABLE COST."  (My notes have quota-
tion marks around this, so I believe I transcribed this right, but I
won't swear to it.)  I believe it was Mr. Lal who had the grace to conf-
ess to feeling a certain uneasiness talking about digital libraries to
an audience with so many professional librarians in attendance; perhaps
this sign of humility is why no one rose, later on, to ask how the digi-
tal library envisaged by NASA differs from (is MORE than) a digital ver-
sion of traditional libraries; if there are any goals in Lal's descrip-
tion of the digital library which existing libraries don't share, I
don't know what they are, and existing libraries have a much better
track record on providing information resources and services at reason-
able cost than digital libraries have now or are likely to have for some
time to come.

   The third speaker in the panel, Glenn Rikert, represented ARPA and
gave an overview of some of the technical, political, techno-political,
politico-economic, sociological, historical-tragical-poetical, and other
issues facing research funding agencies at this stage in the development
of the information infrastructure.  I would be happier, I think, if few-
er of them seemed to boil down somehow to making the Internet safe for
greed-heads -- er, sorry, commercial activities.  (Yes, yes, I know, the
net cannot forever remain exclusively a non-commercial playpen for aca-
demics and other researchers.  But isn't there something bizarre in the
notion that it's wrong for the government to subsidize an expensive
development effort for the free exchange of information, but fine if
that same expensive development effort subsidizes the efforts of people
to make money?  There are lots of reasons to support the development of
better security facilities, authentication, etc., for the net; why is
making money the only reason anyone in this panel mentioned?)

   At the conclusion of his talk, Rikert asked for questions from the
floor, and since I was sitting right in front of him, I was recognized
first.  I asked a question which had been on my mind since early in the
panel:  "How does it come about that our national digital-libraries ini-
tiative is organized by a consortium of agencies which includes neither
the National Endowment for the Humanities nor the Library of Congress?"
When phrased as bluntly as I put it, this question may have seemed rath-
er rude, which I regretted a bit, since I was only half sure that rude-
ness was called for.  The answer, however, did little to dispel my
uneasy feeling that the national digital library research program needs
a clearer understanding of the way existing libraries actually work, and
what they actually do.  As I understood it, the answer went something
like this (if this is an unfairly truncated paraphrase, I apologize):
DLI is a research program, and so NEH and LC did not seem relevant agen-
cies, as their interests were not directly involved.  Also, DLI is con-
cerned mostly with new digital material, not digital conversions of old
material.  Nevertheless, it is recognized that NEH and LC do have some
useful insights into relevant issues, and so the DLI staff do in fact
maintain regular contact with NEH, LC, and the National Historical Pub-
lications and Records Commission (NHPRC).  I did not quite know what to
say to this; it fell to Peter Graham of Rutgers, sitting in front of me,
to point out that both NEH and LC support research, neither restricts
its interests to conversion of existing material, and both can bring to
the table large bodies of relevant experience with complex information
resources and services of the sort digital libraries are going to have
to offer if they wish to be worthy of the name.  It is naive, at best,
for the DLI work to ignore the potential contributions of humanistic
research and major libraries.

   After a break, William Y. Arms of the Corporation for National
Research Initiatives (CNRI) gave a useful overview of "Concepts in Digi-
tal Library Research and Development."  He began with an overview of the
CS-TR (computer science technical reports) project which CNRI has coor-
dinated; this began as a very low-key experiment in small-scale informa-
tion sharing among a few high-powered computer science departments, but
has done, he said, very useful and significant work demonstrating meth-
ods of building network-based information resources.  Three of the six
DLI sites are also participants in CS-TR, which Arms interpreted as a
sign of CS-TR's useful role in helping people think usefully about digi-
tal information.  (It would surely be uncharitable to suggest it might
also be, say, a sign of a strong old-boy network.)

   Arms organized his talk as a series of fundamental points to be borne
in mind when contemplating the networked electronic future.  As I got
them down in my notes, these included:

*   the technical framework exists within a legal framework, including
    concepts of copyright, performance, intellectual property, libel,
    and obscenity, as well as the distinction between publishers and
    common carriers; both the technical and the legal framework, more-
    over, cross national boundaries, so both local and international
    issues must be kept track of.  The technical architecture must
    respect the intellectual property rights of creators and owners, and
    must help clarify the boundaries between different areas of respon-
    sibility.
*   some characteristics are common to all objects in the digital
    library (e.g. name, identifier, security constraints); others vary
    with the type of content (text, music, computer program, ...).
*   names and identifiers will need to be location-independent (so the
    object can move without being renamed), globally unique, and persis-
    tent.  We need methods of resolving them fast (i.e. into locations
    or addresses), decentralizing their administration, managing change-
    or version-control, and supporting standard naming schemes from all
    standard user interfaces.
*   a digital object is more than a sack of bits, and our technology
    needs to be aware of at least part of its internal structure.  That
    internal structure includes at least the following internal organs
    (parenthetical notes are mine, and may not be what Arms said; see
    also the sketch following this list):
    -   handle (I assume this is the name or ID, but am not sure)
    -   properties
    -   transaction log (where did this object come from and what has
        been done to it since its creation?)
    -   content (this, I assume, does remain basically a bag o' bits)
    -   signature (to allow us to authenticate the object we receive and
        detect tampering)

*   repositories must look after the data they hold.
*   the stored object is not necessarily the same as the object used by
    the user.  In general, objects are not replicated for use; instead
    of a replica of the stored object, the user gets the output from
    some program.  The user fetching a document to read it will see the
    output of a rendering engine; a musical score will be performed; a
    video game will be played; a database will be searched.  What the
    user normally wants is the output of some program, not a copy of the
    stored object.  And if the user does want a copy, for whatever rea-
    son, that too can be viewed as the output of a program:  the copy or
    file-transfer program. (The key point is that ANY program may be the
    one which should be inserted between the stored object and the user;
    our systems cannot blindly assume that the software in that slot
    will always be COPY or FTP.)
*   users want intellectual works, not digital objects.  A user may want
    THE FOLLETT REPORT, but the Follett report is available in a wide
    variety of forms (Postscript, marked-up, ASCII-only, ...) and prob-
    ably also in different versions (with or without all the annexes,
    etc.) -- these will be distinct digital objects, but they remain the
    same WORK, and when the user wants the Follett Report, a good system
    is going to have to know how to map from that intellectual work to
    the various digital objects actually on the disks of the servers.
*   variations in terminology often hamper understanding of digital
    library concepts; some words have such strong overtones, and the
    overtones vary so much from community to community, that those words
    may actively inhibit mutual understanding rather than promote it.
    Arms named DOCUMENT, PUBLISH, and WORK as particularly devious cul-
    prits here.  (He is right, too:  the TEI text-documentation commit-
    tee almost came to blows over the word PUBLISH, one long long long
    spring day.  Fortunately, there were a few calm heads in the group,
    and no one actually went home with a black eye.  Luckily, I can't
    remember who won and who lost on that issue, though it seemed so
    important at the time; I remember this when I need help keeping
    things in perspective.)
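
   Purely to make these points concrete for myself, here is a minimal
sketch, in Python, of what a digital object with the internal organs
listed above, sitting in a handle-resolving repository, might look like.
Arms showed no code, and every name, field, and function below is my own
guess, not CNRI's design.

    # Toy model of a "digital object" and its repository; all names are
    # my inventions, offered only to make the bullet points concrete.
    import hashlib
    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class DigitalObject:
        handle: str            # persistent, location-independent name
        properties: Dict[str, str]   # type-independent metadata
        transaction_log: List[str]   # provenance and history
        content: bytes               # the "bag o' bits" proper
        signature: str = ""          # digest, for tamper detection

        def sign(self) -> None:
            self.signature = hashlib.sha256(self.content).hexdigest()

        def verify(self) -> bool:
            expected = hashlib.sha256(self.content).hexdigest()
            return self.signature == expected

    class Repository:
        """Repositories must look after the data they hold."""
        def __init__(self) -> None:
            self._store: Dict[str, DigitalObject] = {}

        def deposit(self, obj: DigitalObject) -> None:
            obj.sign()
            obj.transaction_log.append("deposited")
            self._store[obj.handle] = obj

        def access(self, handle: str, renderer=bytes) -> object:
            # The user normally gets the OUTPUT of some program, not a
            # copy of the stored object; plain copying (the default
            # renderer here) is just one program among many.
            obj = self._store[handle]
            assert obj.verify(), "possible tampering"
            obj.transaction_log.append("accessed")
            return renderer(obj.content)

The renderer argument is the whole point: any program -- a formatter, a
score-playing engine, a database search -- may stand between the stored
object and the user, and plain copying is only the degenerate case.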

   Arms's talk seemed to me to demonstrate very clearly, if it needed
demonstration, why the Digital Libraries Initiative ought to be paying
closer heed to the experience of NEH and LC.  I hope the representatives
of the funding agencies were there, and listening, and I hope they rec-
ognized, among Arms's lists of outstanding problems, several chestnuts
of library cataloguing and public service doctrine, which are not nearly
so mysterious to librarians and humanists as they appear to be to com-
puter scientists discovering them for the first time.

   At this point the meeting broke up into a large number of parallel
sessions, of which I attended the project briefing on the DLI project at
Stanford.  This left me with very mixed feelings.  I went to school on
the Farm(1) and I want to like the DLI work being done there. And I did
rather like the report on the CS-TR project with which the session
began, although I never did find out what possessed them to decide that
what the world needed more than anything else was yet one more format
for the exchange of bibliographic data, unique to the CS-TR project:  it
isn't as though existing bibliographic formats cannot handle technical
reports.  Is not-invented-here syndrome really so epidemic among comput-
er scientists?

   The report on the DLI work proper, delivered by an amiable fellow
with a nice manner, put a lot of strain on my residual good will for my
alma mater.  In brief, he described a long complicated search for ways
to make it possible for users to interact with different net-based
resources through a single unified front end.  Good problem; lots of
people are interested in it.  In trying to address this problem, as near
as I can make out, the Stanford team first worked hard for a long time,
at the conclusion of which they had come very close to reinventing the
Common Command Language (an American national standard -- ANSI Z39.58 --
publicly available since the late 1980s), and then proceeded to come
very close to reinventing ANSI Z39.50 (the protocol for network-based
information retrieval).  Unfortunately, the reinventions do not appear
to have been improvements on the existing standards, and if the team had
some reasons for addressing these issues from scratch instead of imple-
menting -- or at least studying -- the existing standards, those reasons
never became clear.  (It did become moderately clear that the presenter,
at least, had not studied either standard.)  I was left with the uncom-
fortable feeling that the project could have saved several months of
work by having someone wander over to the -- traditional -- library and
look up some relevant literature, and read it.  I hope that the rest of
the Stanford project is better than this; I hope that the work actually
described was in reality better than the presentation suggested.  But
going by the session by itself, anyone would have to wonder why NSF,
ARPA, and NASA are spending a million dollars a year in research funds
for this work, instead of sending NISO a check for ninety-two dollars
(plus shipping) to get copies of Z39.58 and Z39.50.  The standards are
more thorough, they were developed through extensive public discussion
and reflect a wide consensus in interested communities, and they can be
observed in practice today.  Is it possible that the Stanford team did
not do any basic bibliographic research on their topic?  Is this the
kind of group we want inventing the digital library we are all going to
have to use?

   The second parallel session shed a more encouraging light on the DLI
projects.  I went to hear the presentation on the work at Urbana (with
which I am shamefully unfamiliar, given that Chicago and Urbana are sis-
ter campuses -- but like many sisters, they don't always communicate
that well).  William Mischo gave an energetic account of the work Urbana
is doing with SGML-encoded texts supplied by publishers for the project.
His discussion of SGML had enough rough edges to alarm a few purists
(from time to time I heard the fellow sitting beside me swear under his
breath at technical inaccuracies, infelicities, or vaguenesses), but on
the whole he seemed to have his head in the right place -- i.e. he is
after all using SGML instead of inventing yet another markup language.
I was interested to hear that the publishers are all using ISO 12083 --
and not terribly surprised to learn that they are all extending it in
different ways to handle tricky bits of their texts.  Mischo clearly
felt this reflected flaws in 12083; after the TEI's experiences trying
to create generally usable tag sets for widely disparate users, I'm not
so sure.  But the most interesting detail of the presentation, for me,
was the techniques being used in Urbana to index SGML documents using
relational databases.  In all essentials, the table structures used
resemble those proposed fifteen years ago or so by Henry Van Dyke Paru-
nak at a conference held in Ann Arbor.  (The meeting was before my time;
I read the paper in the proceedings of the conference, which were edited
by Richard Bailey).  I don't know whether Parunak was the first to pro-
pose such structures -- he is almost surely not the only one, since
standard methods of data normalization would lead to substantially the
same structure, no matter who did the analysis -- but I think of them as
Parunak's structures, since I worked through his paper in some detail
when I started worrying about these problems, and he made more sense
than anything else I could find.  It was good to see that they do work
pretty well in practice, even on fairly large bodies of material.
Whether an SQL server can keep up with search engines like Pat (the
engine developed at Waterloo for the Oxford English Dictionary) on large
bodies of data, however, remains an open question, and one on which my
expectations differ from Bill Mischo's -- maybe I can get him to accept
a small wager when I see him next ...
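
   For readers who, like me until recently, have not worked through
Parunak's paper, here is a minimal sketch, under my own assumptions, of
the kind of table structure that falls out of normalizing an SGML
document tree.  The table and column names are mine; I make no claim
that the Urbana schema looks exactly like this.

    # Parunak-style tables for indexing SGML, reconstructed from memory
    # of the idea rather than from anyone's actual schema.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE element (
        doc       TEXT,     -- document identifier
        id        INTEGER,  -- element number in document order
        parent    INTEGER,  -- id of parent element (NULL at the root)
        gi        TEXT,     -- generic identifier (tag name)
        first_tok INTEGER,  -- position of first token contained
        last_tok  INTEGER   -- position of last token contained
    );
    CREATE TABLE token (
        doc  TEXT,
        pos  INTEGER,       -- running token number in the document
        form TEXT           -- the word form itself
    );
    """)

    # A toy document: <div><head>Digital libraries</head> and SGML</div>
    elements = [("d1", 1, None, "div", 1, 4),
                ("d1", 2, 1, "head", 1, 2)]
    tokens = [("d1", 1, "Digital"), ("d1", 2, "libraries"),
              ("d1", 3, "and"), ("d1", 4, "SGML")]
    db.executemany("INSERT INTO element VALUES (?,?,?,?,?,?)", elements)
    db.executemany("INSERT INTO token VALUES (?,?,?)", tokens)

    # All occurrences of a word inside <head> elements: a containment join.
    hits = db.execute("""
        SELECT t.doc, t.pos, t.form FROM token t JOIN element e
        ON t.doc = e.doc AND t.pos BETWEEN e.first_tok AND e.last_tok
        WHERE e.gi = 'head' AND t.form = 'libraries'
    """).fetchall()
    print(hits)   # [('d1', 2, 'libraries')]

Joins of this containment sort are what such a scheme does naturally;
whether they stay fast on really large bodies of text is precisely the
question on which Bill Mischo and I might differ.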

   At about this point, the task force meeting began to follow the way
of all good meetings, and the hallway meetings and informal conversa-
tions take up more space in my memory than the formal sessions.  (CNI
has exceptionally long breaks, presumably to encourage this phenomenon.)
It was in this way that I heard some encouraging things about an elec-
tronic version of the MIDDLE ENGLISH DICTIONARY at Michigan, and a
project of the Big Ten university presses to experiment cooperatively
with electronic books (or was that later?).


   On the second day, I did hear Nancy Ide give a sound, informative
overview of what humanities computing is; she began with an introduction
to the Association for Computers and the Humanities, and continued with
a fast review of the history of humanities computing.  In the 1960s (and
before), the field concentrated largely on concordances, word indices,
and the like; in the 1970s, these activities continued -- after all, the
Oxford Concordance Program did not appear until the later 1970s -- but
there was also new emphasis on quantitative studies (especially of sty-
listics), linguistic analysis, content analysis (though this has roots
going back to the 1950s), semantic analysis, development of dictionar-
ies, scholarly editions, and the creation of corpora and collections.
About this time, Ide said, computational linguistics split off from lin-
guistic computing to become an identifiably distinct field, which tended
to involve itself more with complete parsers for toy languages or small
subsets of natural languages than with real texts.  In the 1980s, all
these trends continued, but methods improved.  The spread of microcom-
puters led to interest in the use of computers for teaching humanities
disciplines, and in particular for the teaching of composition.  Work in
computational linguistics led about this time to a strong interest in
lexical databases.  In the 1990s, Ide descries an emphasis on corpora,
on encoding issues and standards, on text databases, and on computation-
al lexicography.  There is also interest in ensuring that materials of
interest for humanists find their way into the digital library, includ-
ing hypertext and digital images.  Ensuring that this happens will
require high quality support for versioning of data, fine-grained detail
in the encoding, support for metadata and similar kinds of ancillary
information, and support for multiple views of the same data.

   Sneaking out early from Ide's session (it is the done thing at CNI,
apparently, to hop from session to session), I heard a cataloguer from
the Library of Congress describe a small tool he's written in Rexx to
run under OS/2, which materially simplifies the task of creating MARC
records from bibliographic descriptions in electronic form (e.g. from
OPAC screen dumps, or from the electronic text itself, or from an SGML-
encoded header like that defined by the TEI).  Made me want to get a
copy of OS/2 -- if only my hard disk were big enough!
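
   I did not see the Rexx code, so the following is nothing more than a
sketch, in Python rather than Rexx, of the general kind of mapping such
a tool performs.  The choice of MARC fields and subfields is only
illustrative, and the input dictionary stands in for whatever the real
tool parses out of an OPAC screen dump or a TEI header.

    # Illustrative mapping from TEI-header-ish fields to MARC-style
    # lines; field and subfield choices here are examples, not a
    # cataloguing standard, and certainly not the LC tool itself.
    def tei_header_to_marc(header):
        """Map a few bibliographic fields onto MARC-style lines."""
        fields = []
        if "author" in header:
            # 100: main entry, personal name
            fields.append("100 1  $a " + header["author"])
        if "title" in header:
            # 245: title statement
            fields.append("245 10 $a " + header["title"])
        if "publisher" in header or "date" in header:
            # 260: imprint (publisher, date)
            fields.append("260    $b %s $c %s"
                          % (header.get("publisher", ""),
                             header.get("date", "")))
        return fields

    record = tei_header_to_marc({
        "author": "Sperberg-McQueen, C. M.",
        "title": "Trip report: CNI task force meeting",
        "publisher": "TEI-L",
        "date": "1995",
    })
    print("\n".join(record))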

   Lunch was the final plenary session of the conference.  The lunch-
time speaker, a science fiction writer by the name of Daniel Keys Moran,
gave a rousing talk on the digital future and the digital present, with
lots of thought-provoking questions and lots of jokes (what is the dif-
ference between a con man and a computer salesman?  a con man knows when
he is lying -- why is it I can remember the jokes but can't describe the
thought-provoking parts coherently?).

   In the final parallel session of the meeting, Susan Hockey had organ-
ized a session on the humanities scholar and the digital library, which
was fairly well attended, considering the number of people leaving early
to catch planes home.  Hockey began with a description of what humanists
actually do with computers, and the challenges facing those who wish to
make computers more serviceable for humanistic research.  She spoke
about the fundamental importance of disagreement and dissent in humanis-
tic research and stressed the complexity of the cultural objects human-
ists deal with.  Texts studied by humanists may be characterized by mul-
tiple versions of the text, variant spellings, and multiple conflicting
interpretations of each version.  (Actually, ALL texts tend to have mul-
tiple versions and to be susceptible to multiple interpretations; for
some texts we simply ignore the complexity by pretending it doesn't
exist.  For culturally important texts, we prefer to face, not to
ignore, the complexity -- this is one way to gauge how important a text
is regarded as being.)  Serious work with texts will require complex
annotation (electronic Post-It notes are not going to suffice for seri-
ous work with real annotations on real texts!), provision for variant
spellings and conflicting readings, marginalia, and canonical reference
schemes (e.g. book chapter and verse in Biblical texts; Stephanus num-
bers for Plato, and so on).  Existing texts in ASCII-only form, or using
ad hoc markup schemes, have shown the limits of such mechanisms and
illustrate the necessity for better methods of text representation, to
ensure reusability, to provide slots for bibliographic control informa-
tion and change logs, and to enable images of printed or manuscript
sources to be synchronized with their transcriptions.
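
   To make the point about canonical reference schemes concrete:  what
one wants is to resolve "book chapter verse", or a Stephanus number,
onto the right stretch of whichever witness is at hand -- transcription,
edition, or page image.  The little sketch below is mine, not Hockey's;
the reference, the witnesses, the spans, and the readings are all
invented for illustration.

    # Toy resolver for canonical references; every value here is made up.
    REFERENCE_MAP = {
        "Gen 1:1": {
            "edition-A": (0, 54),            # character span in witness
            "facsimile-page-3": (112, 166),  # same verse on a page image
        },
    }

    VARIANTS = {   # conflicting readings of the same passage
        "Gen 1:1": {"edition-A": "In the beginning ...",
                    "edition-B": "When God began ..."},
    }

    def resolve(ref, witness):
        """Map a canonical reference onto a span in one witness."""
        return REFERENCE_MAP[ref][witness]

    print(resolve("Gen 1:1", "edition-A"))   # (0, 54)
    print(sorted(VARIANTS["Gen 1:1"]))       # ['edition-A', 'edition-B']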

   She next described some developments in the library community, with
particular emphasis, naturally, on the Center for Electronic Texts in
the Humanities (CETH), of which she is director, and its activities.
The continuation of the Rutgers Inventory of Machine-Readable Texts in
the Humanities -- not another let's-catalogue-the-internet project,
since most of the resources it covers are not net-accessible just now --
is now available on RLIN. In addition to the inventory, CETH has been
working on the TEI header and on defining mappings from the TEI header
into MARC and into relational database systems.  The TEI is important,
Hockey said, because it allows electronic texts to be associated with
the necessary metadata, it can handle the complex structures of real
texts, and it thus provides a potential push-off point for better search
and retrieval in the future.  Bill Arms had observed, the previous day,
that users want intellectual works, not digital artifacts; in the same
way, they need search and retrieval for words and concepts, not just for
strings of bytes in some underlying character set.  In addition to its
work on cataloguing issues and markup, CETH also organizes a summer sem-
inar in humanities computing, which she described.

   David Chesnutt then spoke about the requirements of practicing human-
istic scholars, with special attention to historical editors.  He began
by clarifying the notion of DOCUMENT, which for a historical editor is
likely to be one attestation of one version of a "work".  Thomas Jeffer-
son's draft of the Declaration of Independence is a document (and very
distinct from the final version, which is a different document, though
it attests the same work).  So are the diaries of unidentified individu-
als, letters (in authors' drafts, recipients' copies, letter-book cop-
ies, etc.), notes, legal papers, and yet other varieties of written
record.

   Modern historical editing began, Chesnutt said, in 1950 with the pub-
lication of the first volume of the PAPERS OF THOMAS JEFFERSON, edited
by Julian Boyd.  This project immediately attracted imitators, and
before long Yale was hosting the Benjamin Franklin papers, Chicago the
Madison papers, and South Carolina the papers of both John C. Calhoun
and the colonial planter Henry Laurens; Chesnutt is the third editor of
this last project.  Such historical editions face a difficult task in
selecting documents for publication, since it is economically infeasible
to publish all the surviving papers of most public figures.  They also
must find ways to make the documents understandable both linguistically
and historically to modern readers, while remaining faithful to the
originals; editions also attempt to make the historical information con-
veyed by the documents they publish accessible, mostly by means of
detailed indices (a 600-page volume of historical documents is likely to
include ten thousand index entries in a sixty-page index -- probably
five or ten times as large an index as is usually found in trade books
of comparable size).

   Electronic publication seems to hold great promise for historical
editions.  It may make it possible to publish more of the documents --
possibly all of them -- although a full electronic edition is likely to
incur more editorial costs than a selective edition in print or elec-
tronic form.  It will surely make possible more indexing and more ways
of making the material accessible to readers.  But it also poses some
new problems:  the indexing techniques historical editors now use are
predicated on the page as the unit of reference, and it's not immediate-
ly obvious to everyone how to index a pageless edition.
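
   One obvious line of attack, though I do not know whether any of the
partners will take it, is to let index entries point at stable element
identifiers in the electronic text instead of at page numbers.  The
identifiers and entries below are, of course, made up.

    # Index entries keyed to element identifiers rather than to pages;
    # terms and identifiers are invented for illustration only.
    index = {
        "Laurens, Henry":   ["doc042.p3", "doc117.p1"],
        "rice cultivation": ["doc042.p7"],
    }

    def cite(term):
        """Render an index entry as element-level citations."""
        return term + ": " + ", ".join(index.get(term, []))

    print(cite("Laurens, Henry"))   # Laurens, Henry: doc042.p3, doc117.p1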

   The Model Editions Partnership (I told you we'd get back to it even-
tually) unites seven existing historical editing projects in an experi-
ment in methods of electronic preparation and publication of historical
documents.  The partners are:

*   the Documentary History of the Ratification of the Constitution
*   the Papers of Elizabeth Cady Stanton and Susan B. Anthony
*   the Lincoln Legal Papers
*   the Papers of General Nathanael Greene
*   the Documentary History of the First Federal Congress of the United
    States of America
*   the Papers of Margaret Sanger
*   the Papers of Henry Laurens

They include conventional letterpress editions now starting (the
Stanton/Anthony papers will publish a selection of two thousand docu-
ments, from the fourteen thousand included in a microfilm edition
already completed), microfilm editions (the Sanger papers), a CD-ROM
edition (the Lincoln Legals), and letterpress editions already in prog-
ress, most of which started before the advent of computers in editorial
work.  The latter group face a particularly tricky situation with regard
to electronic publication:  many have been preparing their transcrip-
tions in electronic form for years, so for recent volumes they already
have their texts in electronic form (often in some project-internal gen-
eric markup scheme), but for their earlier volumes, only the printed text
exists.  The partnership will experiment with various ways of organizing
and presenting these materials in electronic form.

   Wherever the electronic samples prepared by the partnership use
"live" text, the TEI encoding scheme will be used.  Chesnutt concluded
his talk by explaining briefly why this is so.  The viability of SGML
has been amply demonstrated by the growth of the World Wide Web and
HTML, the WWW application of SGML.  HTML is not adequate for the needs
of the partnership, however:  it is too small and too limited.  The TEI
tag sets, by contrast, provide much more extensive support for the needs
of scholars.  The cancellations, interlineations, and other complica-
tions seen in the examples Chesnutt had shown at the beginning of his
talk can all be captured in TEI encoding; none can be captured in HTML.
In addition, the TEI tag sets can be extended where necessary, in prin-
cipled ways.  Because it is not tied to the capabilities of a particular
generation of software, the TEI may have a longer shelf life than other
electronic formats.  Print editions now use paper expected to survive
for three hundred years; it is to be hoped, Chesnutt said, that the TEI
encodings to be created by the Model Editions Partnership can last that
long, too.

   I concluded the session with a discussion of why the TEI was created
and the problems which must be faced by any serious attempt to encode
texts in an intellectually rigorous way for real research.  I spent most
of my time showing pictures of manuscripts and older printed texts, to
illustrate the kinds of problems which must be faced.  Happily, as time
has gone on, the TEI tag sets have grown capable of handling many,
though not all, of the challenges illustrated by my collection of trans-
parencies.  Concrete poetry, for example, continues to be a serious
problem; the best anyone can come up with is to scan an image of the
page, transcribe the text, and link the two.  This is fine, but it does
not constitute a real solution:  nothing in the encoding itself records
that the words form the shape of a pair of angel wings, or of a
battle axe, or of an apple with a bite taken out of it (and a worm in
the center).  And similarly, the problems of analytical bibliography,
paleography, and codicology, though they have been taken up by some
prominent practitioners of those fields, have yet to find deeply satis-
fying, generally accepted solutions.  These are areas for further work
within the TEI, and I hope those interested will be willing to serve on
a TEI work group when the time comes.

   The CNI task force meeting was immediately followed by a day and a half
of meetings of the CNI/AAUP Joint Initiative, in which representatives
of university presses met to discuss their efforts to explore electronic
publication, whether successes or failures, and to consider issues of
common interest.  In view of the length this trip report has already
attained, however, I think I had better break off here, and make a sepa-
rate trip report on the joint initiative's workshop.

  More to follow.  End of Part I.


---------------------------------

(1) Stanford was built on what was once a horse farm; hence the nick-
name "the Farm."