SGML: Trip report (CETH, May 1996)
From @UTARLVM1.UTA.EDU:owner-tei-l@UICVM.CC.UIC.EDU Thu May 23 19:16:49 1996
Message-Id: <TEI-L%96052317211957@UICVM.UIC.EDU>
Date: Thu, 23 May 1996 17:18:59 CDT
Reply-To: Michael Sperberg-McQueen <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Sender: "TEI (Text Encoding Initiative) public discussion list"
From: Michael Sperberg-McQueen <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject: trip report (CETH, May 1996)
To: Multiple recipients of list TEI-L <TEI-L@UICVM.UIC.EDU>
-----------------------------------------------------------------------
(For those who prefer to read long pieces using WWW or SGML software,
tagged versions of this trip report may be retrieved from
http://www.uic.edu/~cmsmcq/trips/ceth9505.html and (if you have
Panorama or another SGML-aware browser) .../ceth9505.tei.)
Text Analysis Software Planning Meeting
Princeton, 17-19 May 1996
Trip Report
C. M. Sperberg-McQueen
-------------------------------------------------------------------------
1 The Next Generation
The humanistic researcher who wants computer assistance with research
on texts today faces a rich array of possibilities: concordance
programs of the batch and interactive persuasions, lemmatizers,
morphological analysers, full text retrieval systems, sophisticated
and simple hypertext delivery systems, collation programs, typesetting
software. Some of this embarrassment of riches runs in a client/server
environment, some runs standalone, some for DOS, some for Windows, the
Mac, Unix. Much of this software is well written, much is well
documented, much is available at reasonable cost or for free. And
sometimes these three categories even overlap.
Why, then, does there seem to be a consensus that we are facing a sort
of crisis of confidence in our software? Why do so many people in
humanities computing regard software as a Problem We Must Face Up To
Soon? Why am I one of them?
Because a generation is passing.
From the first uses of machine-readable text for humanistic research
in the late 1940s, to today, I count three generations. The first
generation included special-purpose, ad hoc programs written for
particular projects to apply to particular texts. The second
generation can be counted from early efforts to create reusable
libraries of text-processing routines, some of which eventually turned
into efforts to create general-purpose, reusable programs for use with
many texts. Naturally, these were batch programs. The Oxford
Concordance Program (OCP) and some modules of the Tustep system are
good examples. In the third generation, general-purpose programs
became interactive: Arras, the Tustep shell, Word Cruncher, Tact, and
other newer programs are all third-generation programs in this sense.
What these programs do is useful: produce concordances, allow
interactive searching, annotate texts with linguistic or other
information, and a multitude of other tasks. Why aren't people happier
about the current generation of software?
I think there are several reasons:
1. For many potential users, existing software still seems very
hard to learn. Not everyone thinks so, but those who find current
software easy are in a decided minority. User interfaces vary so
much between packages that learning a new package typically means
learning an entire new user interface: there is very little
transfer of training.
2. Current programs don't interoperate well, or at all. CLAWS or
other morphological analysers can tag your English text with
part-of-speech information, but if you've tagged your text with
Cocoa or SGML markup, you'll have to strip it out beforehand, to
avoid confusing CLAWS. If you want to keep the tagging, you'll have
to fold it back into the text after CLAWS gets through with it, as
the British National Corpus did. Other programs, of course, will
be confused in turn by the part-of-speech tagging, so you'll have
to strip that out, too, before applying other text analysis tools,
and then fold it back in again afterwards. (A small sketch of this
strip-and-fold-back round trip appears just after this list.)
3. Current programs are often closed systems, which cannot easily
be extended to deal with problems or analyses not originally
foreseen. You either do what the authors expected you to do, or you
are out of luck. The more effort has gone into a program's user
interface, the more likely it is to resist extension (this need not
be so, but it is commonly so); the more open a system is, i.e. the
more care its developers have put into allowing the user freedom to
undertake new tasks, the worse the reputation of its user interface
is likely to be. A comparison of OCP and Tustep is instructive
here: OCP has a very careful user interface design, and is
remarkably easy to learn, but it is a closed black box, and cannot
be extended except by defeating its designers' determined efforts
to keep you from seeing what's happening under the hood. It is a
closed system. Tustep, on the other hand, is remarkably open and
flexible, but has a reputation for complexity which is perhaps not
completely unrelated to that openness and flexibility.
4. Almost all current text analysis tools rely on what now seems
a hopelessly inadequate model of text structure. Text, for these
programs, is almost invariably nothing but a linear sequence of
words or letters. The most sophisticated of them may envisage it as
a linear sequence with tags showing the values of different
variables at different points in the sequence (as in Cocoa-style
tagging), or as an alternating sequence of text and processing
instructions (what is sometimes called the
one-damn-thing-after-another model of text structure). None of the
current generation of text-analysis tools support the significantly
richer structural model of SGML, though the experience of the TEI
has persuaded many people that that structural model represents a
radical breakthrough in our ability to model text successfully in
machine-readable form.
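(The round trip described in point 2 can be made concrete with a few
lines of code. The following is purely an illustration in Python: the
markup handling and the toy tagger stand in for real SGML processing
and for CLAWS, and all the function names are invented.)

    import re

    def strip_markup(text):
        # Anything in angle brackets counts as markup here; real SGML
        # handling is far more involved, but this is enough to show
        # the round trip.  Returns the tags (with the positions they
        # occupied) and the bare words.
        tags, words = [], []
        for token in re.findall(r'<[^>]*>|[^<\s]+', text):
            if token.startswith('<'):
                tags.append((len(words), token))
            else:
                words.append(token)
        return tags, words

    def toy_tagger(words):
        # Stand-in for CLAWS or any other tool that must see plain text.
        return [(w, 'NP' if w[0].isupper() else 'XX') for w in words]

    def fold_back(tags, tagged_words):
        # Re-interleave the remembered tags with the tagger's output.
        out, t = [], 0
        for i, (word, pos) in enumerate(tagged_words):
            while t < len(tags) and tags[t][0] == i:
                out.append(tags[t][1])
                t += 1
            out.append(word + '_' + pos)
        out.extend(tag for _, tag in tags[t:])
        return ' '.join(out)

    sample = '<p>The <hi>quick</hi> fox</p>'
    tags, words = strip_markup(sample)
    print(fold_back(tags, toy_tagger(words)))
    # prints: <p> The_NP <hi> quick_XX </hi> fox_XX </p>

Every program which does not understand the markup forces a detour of
this kind, and every detour is another chance to lose or corrupt
information.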
Different people, of course, assign different weights to these
problems. Some are most moved by the need for better user interfaces,
others by the need for tools which combine SGML-awareness with the
textual-analysis primitives of the current generation of software. But
for whatever reasons, many people seem to agree that the
textual-research community is in need of a new generation of
text-analysis tools.
And that is why now, on a sweltering afternoon, I am sitting in Newark
Airport awaiting my flight home after a two-day Text Analysis Software
Planning Meeting at Princeton sponsored by the Princeton / Rutgers
Center for Electronic Texts in the Humanities (CETH), and trying to
decide what to think about the meeting. I enjoyed myself immensely,
and am grateful to CETH for having put it on and for having invited
me, but I am not quite sure what to think. Was the meeting a rousing
success, because there was so much successful exchange of information?
Or was it a bit of a disappointment, because we did not leave with a
clear consensus on what the next generation of software should look
like or how to set about developing it? Perhaps it's expecting too
much to hope that a meeting like this will develop a concrete plan of
attack. Is the glass half full or half empty?
2 The CETH Text Analysis Software Planning Meeting
The meeting was organized by Susan Hockey at CETH, with assistance
from Willard McCarty and Malcolm Brown, along with (less intensively)
Allen Renear, Harold Short, and myself. An earlier meeting had
gathered a smaller number of participants to discuss the same general
topic; one conclusion of that earlier meeting was that further
meetings, with broader representation of interested parties, were
desirable. Participants came to this second meeting from North
America, Europe, and Japan, with interests primarily in language and
literature.
A couple of people asked me how this effort related to the Text Software
Initiative, announced by Nancy Ide and Jean Veronis at the Georgetown
ACH/ALLC '93 conference. There is, as far as I know, no relation, but
this meeting took pains to avoid at least one mistake for which the
TSI was criticized at Georgetown, by inviting so many people, from so
many places, to participate in the planning, and to ensure that any
resulting effort would have broad community support.
The organizers as a group believed, not unanimously but on the whole,
that it was dangerous to allow the conversations to veer too close to
the topic of implementation, and that it was important to keep the
participants focused on user requirements. This had the salutary
consequence of preventing a rapid descent into geek-talk, but another
result, unfortunately, was to rule out of order in advance many of the
most interesting topics which come up in planning a new generation of
text analysis software. It was also not possible to keep everyone's
mind off implementation issues, broadly defined: they turned up
repeatedly in discussions, unconvincingly disguised as user
requirements. But "users need good search facilities" and "users need
an open, modular architecture" are, despite the similarity of their
structure, statements about very different aspects of the system. As a
member of the organizing committee, I must accept my share of
responsibility for the decision, but in retrospect it does seem to me
to have been a mistake; a fuller, freer discussion of technical
questions would, I think, have allowed a much more convincing
treatment of them than was managed in the event.
3 Day 1: Overview of Current Tools
3.1 Pre-SGML Tools
We spent the first day in a useful overview of existing tools,
including Lexa (a large collection of DOS-based utility programs for
managing and studying language corpora), Monoconc (a Windows-based
concordance package with a strong emphasis on simplicity of
interface), OCP (an aging batch concordance program with some features
still missing from most of its livelier interactive brethren), Tact (a
DOS-based interactive concordance package), TextPack (a package of
programs for content analysis developed at ZUMA, the Zentrum für
Umfragen, Methoden, und Analyse, in Mannheim, originally for
mainframes but now for microcomputers), and Tustep (a collection of
interoperable programs developed in Tübingen for creating,
managing, manipulating, and typesetting electronic texts). The
programs in this first batch of tools were all created for
researchers, and almost all of them were finished or well under way
before SGML came, like a great wave, and changed forever the way many
people think about electronic representation of text.
All the talks were interesting in one way or another; perhaps the most
important points for thinking about the next generation of text
analysis tools were made by Wilhelm Ott in a paper originally given at
ALLC/ACH 92 in Oxford and redistributed to the participants. The key
features he identified for any such software are:
* modularity (the system should be a "collection of relatively
independent programs, each of which offers a well-defined subset of
basic operations for processing textual data")
* professionality (the modules should support serious research work;
this requires more than supporting office automation)
* integration (the modules should, together, handle all stages of a
project: data capture, analysis and processing, and presentation or
output)
* portability (programs and data should be system-independent, to
buffer users from rapid changes in hardware and to allow inter-site
collaboration)
Tustep exemplifies these principles in its way. Its individual modules
each perform a relatively specific task; all Tustep programs read and
write a single basic file structure, which means that they can be
combined, like Tinkertoys or Unix filters, in arbitrary ways to
perform work not explicitly foreseen by the developers. They perform
to a high standard, well beyond what is possible with common
off-the-shelf office-automation software. They include tools for all
the major tasks a user normally needs, including some duplicating
basic operating-system functions, which help ensure that projects can
change platforms in mid-project without having to relearn how to copy
and rename files, etc. And Tustep runs on a wide range of operating
platforms.
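(To make the principle concrete, here is a minimal sketch in Python,
not of Tustep itself -- its record format and module names are its
own -- but of the general idea: if every module reads and writes one
simple record format, the modules can be chained in any order the
user can think of.)

    # Three tiny "modules" sharing one line-oriented record format
    # (reference, tab, text).  The format and module names are
    # invented for illustration; they are not Tustep's.

    def read_records(lines):
        for line in lines:
            ref, _, text = line.rstrip('\n').partition('\t')
            yield ref, text

    def select(records, prefix):
        # Module 1: keep records whose reference starts with prefix.
        for ref, text in records:
            if ref.startswith(prefix):
                yield ref, text

    def normalize(records):
        # Module 2: fold case.
        for ref, text in records:
            yield ref, text.lower()

    def count_words(records):
        # Module 3: replace each text with its word count.
        for ref, text in records:
            yield ref, str(len(text.split()))

    source = ['Gen.1.1\tIn the beginning',
              'Exod.1.1\tThese are the names']
    pipeline = count_words(normalize(select(read_records(source), 'Gen')))
    print(['\t'.join(r) for r in pipeline])    # ['Gen.1.1\t3']

Nothing in any module knows which module comes next; that is what
makes recombination in ways the developers never foresaw possible.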
The news that Tustep is now bilingual (German/English) clearly took
some participants by surprise, and I suspect that the number of Tustep
sites in North America will rise as a result of this meeting.
Given the focus of the meeting on text analysis, it is perhaps not
surprising that most of the discussion ignored the issues raised by
the third point (integration). In practice, however, it requires
serious attention. If we wish to focus on users and their tasks, not
just on software and its functions, then our attention must inexorably
be drawn to activities of text preparation and presentation, and
cannot be restricted to some narrow notion of text analysis as opposed
to other activities. At the same time, insisting on supporting every
possible user task within a single suite of related programs can lead
to pointless duplication of effort and to a certain sense of being
closed off from the rest of the world. Some important functions,
e.g. data capture and interactive display, are well supported, at
least for SGML data, by existing commercial and non-commercial
software. Tools for text enrichment and analysis must cooperate
gracefully with these editors and browsers, but we don't need to
reimplement them all.
3.2 SGML Tools
The second half of the first morning was devoted to SGML-aware tools:
Dynatext, Explorer, Pat, and SARA. The first two are systems for the
publication of SGML documents; the third is an extremely powerful
search engine originally developed for the New Oxford English
Dictionary. The
last is an SGML-aware search and retrieval system developed in
connection with the British National Corpus (BNC), a 100-million-word
corpus of modern written and spoken British English. The main drawback
of the first three systems, apart from their commercial price -- often
quite reasonable by institutional standards, but out of the
range of individuals -- is that the two display systems are not geared
to scholarly querying, and that Pat, while geared to extremely
elaborate querying, is an engine without a user interface, and
experience shows it is not easy to find a good user interface for
query capabilities so extensive and powerful. SARA is very promising,
but hovers uncomfortably between being a general SGML tool and being
tailored exclusively for the BNC. The intention of the developers, I
am told, is to release it as a general tool, but when I ask questions
about specific SGML queries I'd like to be able to issue, the answer,
with depressing regularity, is "No, it can't do that, we didn't need
anything like that for the BNC." It looks promising, but as a general
SGML tool it isn't yet ready for prime time.
3.3 User Requirements
After lunch, we heard talks from representatives of various
specialized user communities discussing the requirements of those
communities: classics and Biblical studies (with a clear overview by
Winfried Bader of the German Bible Society), manuscript studies (with
examples from the autograph MSS of C.S. Peirce by Michael Neuman,
focusing in particular on the enormous problem of sequencing the
confused, disordered mass of his posthumous papers), literary
criticism (in the form of a letter to the participants from John
Burrows, who was unable to attend), East Asian material (for which
Shoichiro Hara provided illustrations of problems facing the National
Institute of Japanese Literature, some of them related to the
character set problem and some shared with Western manuscript
material), and documentary editors (David Chesnutt described the work
of preparing scholarly editions -- to my astonished chagrin, however,
most participants appear to have decided that `text preparation' and
`text analysis' did not have much in common, although in fact I think
each has more problems in common with the other than either can call
uniquely its own).
We concluded the afternoon (but not the day) in small groups
discussing who the potential users of text analysis software are and
what requirements they might have. The group I was in came to the
unsurprising conclusion that every discipline in the humanities, and
most disciplines outside the humanities, might use text analysis
tools, as might university administrators (only on occasion, however,
and only by accident) and people outside the university. The core
functions we identified were support for a rational model of text in
the form of support for SGML/TEI encoding, good search functions, and
a fully scriptable user interface. Character set support and good
style sheets and display also came up, but didn't get as many votes,
perhaps because some of us were able to persuade ourselves that
character set support was implicitly included in good TEI support, and
style sheets are implied logically by a fully configurable user
interface. This seems a rather meager result for two hours of
intensive discussion, but explicit confirmation of one's hunches by
the hunches of other people should probably not be valued too
lightly. More serious, perhaps, was that after the end of the session,
Wilhelm Ott and I each thought of critical aspects of functionality
left completely unmentioned: sorting, for one, and all manner of
interim processing, including linguistic or other analysis and
tagging. Later in the evening, it occurred to me that I had left
unmentioned what I think is the most important area of all for new
work: the systematic transformation of one SGML document into another
SGML document; this is known in the SGML industry as tree-to-tree
transformation. It includes the enrichment of the content with
annotation, but more generally also includes a lot of processing which
cannot usefully be described as annotation.
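(For concreteness, here is one small illustration of what tree-to-tree
transformation means, written in Python over an XML-flavored toy
example; real SGML differs in detail, and the annotator below is a
fake. One document tree goes in; a different, enriched tree comes
out.)

    import xml.etree.ElementTree as ET

    SOURCE = '<p><w>The</w> <w>whale</w> <w>sounded</w></p>'

    def guess_pos(word):
        # Stand-in annotator; a real module would consult a tagger or
        # a lexicon.
        return 'DET' if word.lower() == 'the' else 'UNK'

    def enrich(elem):
        # Build a new tree in which every <w> element carries a type
        # attribute derived from its content; everything else is
        # copied unchanged.
        new = ET.Element(elem.tag, dict(elem.attrib))
        new.text, new.tail = elem.text, elem.tail
        for child in elem:
            new.append(enrich(child))
        if elem.tag == 'w':
            new.set('type', guess_pos(elem.text or ''))
        return new

    result = enrich(ET.fromstring(SOURCE))
    print(ET.tostring(result, encoding='unicode'))
    # prints the same paragraph with a type attribute on each <w>

Annotation of this kind is only one instance; selection, re-ordering,
and restructuring of elements are transformations of exactly the same
shape.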
After dinner, the participants met once more (flagging slightly by
this time) to hear reports from the work groups.
4 Day 2: Implementation Issues
The second morning's program had the label Implementation Issues, but
in fact it was a heterogeneous series of presentations on
miscellaneous topics. Where I come from, implementation implies being
a bit more concrete and specific than any of these talks (fifteen
minutes each) was able to be.
In my own fifteen minutes, on SGML and TEI encoding, I had been asked
to address the questions:
* should the software we are talking about support TEI markup?
* if so, what challenges will arise for design and implementation?
* might SGML be used elsewhere in the system as well, e.g. for style
sheets or inter-module communication?
The first question had caused me to vacillate unhappily among
amusement, resignation, and truculence for several days, since I
devoutly hoped that for most participants this was not really an open
question, and I realized that if any participants still harbored
doubts about SGML and TEI, after the ten years of intensive discussion
they have received in the relevant conferences and journals, then a
fifteen minute talk from me was hardly likely to tip the balance. So I
asked the participants whether they had any specific concerns about
supporting TEI which we could usefully address. The questions they
asked seemed mostly informational in nature:
* Would TEI support imply SGML support in general? (Yes, at least in
some respects and possibly in most.)
* Would it be necessary to support all of the TEI equally well before
it was useful to support any of it at all? (No, I think support for
certain core constructs may plausibly come before support for some
of the more arcane bits.)
* Would full TEI support involve supporting TEI markup for
non-hierarchical phenomena, e.g. deletions that cross structural
boundaries? (Yes, I think that is one of the places where TEI
awareness is most needed.)
John Bradley expressed concern that writing software to support TEI
markup would leave users out in the cold if all they had were texts
without markup. Since we had spent most of the previous day hearing
about all the text analysis software already available for such texts,
I have to admit I do not see the force of this objection, even if
support for the TEI were held to imply non-support for other markup or
for texts not marked up at all, which it does not. The pressure of
time made it impossible to air this issue adequately, and it remained
unresolved through the entire meeting.
To be honest, of course, I think worrying excessively about legacy
data is a good way to make software development significantly harder
and longer and to water down the end product, and the only software
I'm interested in developing to work on non-marked-up text is software
to assist in tagging it. But I cannot stop other people from working
on whatever software they think interesting and useful, and would not
want to even if I could.
Otherwise, no one raised any serious objections to SGML support,
though it later became clear that Espen Ore, too, had reservations:
he would prefer that the next generation of software support an
abstract markup into which one might translate TEI, Cocoa markup, or
MECS (the markup used by the Wittgenstein Archives at the University
of Bergen), and which would thus be neutral among existing markup
schemes.
neutrality among pre-existing markup schemes, however, which makes me
think such an approach either redundant (if the TEI is felt to have
succeeded adequately in the task) or very tricky (if the TEI's
encoding-neutral notation doesn't do the job, how easy will it be to
make a better one?).
After me, Robin Cover gave a bird's-eye view of the current state of
SGML and related standards and their software; he views with some
optimism the prospects for a successful marriage of SGML with
object-oriented database management systems. He asked how far one
might get, toward the kinds of tools we need, by using standard
generic SGML software rather than TEI-specific programs; the answer,
he suggested, was "surprisingly far -- farther than any application
has ever gone yet -- and yet, not quite far enough." The key problem,
he suggested, was the inability of current generic SGML systems to
know very much about the semantics of the markup in a document. An
SGML ID/IDREF link might in fact represent a hyperlink, which should
be represented to a user as a hot button, but it might also represent
something rather different (as does, for example, the TEI LANG
attribute). The absence of any formal semantics from SGML and SGML
systems has some far-reaching consequences -- it may be, in fact, the
main reason for desiring TEI-aware, not just SGML-aware, text analysis
software.
Gary Simons then gave a presentation of Cellar (Computer Environment
for Linguistic, Literary, and Anthropological Research), which has
been developed at the Summer Institute of Linguistics (SIL) by a team
under his direction. To the syntactic specifications of SGML, Cellar
adds tools for conceptual modeling which make it significantly easier
to build smart applications which `understand' the markup in a
text. He showed examples of a Cellar application which understands TEI
markup for textual variation and uses it to present the variants in
several very different ways. He was characteristically soft-spoken
about the program, but anyone who cares about text analysis should be
very excited to learn that SIL plans to burn CD-ROMs for the Windows
version of Cellar within the next few months. (The Mac version, he
told me, has fallen prey -- temporarily, he hopes -- to a vendor's
withdrawal of support for the Mac version of a library used by
Cellar.) Cover's talk helped strengthen my growing conviction that the
TEI might do well to attempt a formal treatment of the semantics of
the TEI tag sets; Simons's description of Cellar made me wonder
whether Cellar's conceptual-modeling language might be a useful
vehicle for such a formal semantic specification.
John Bradley and Geoff Rockwell then gave a quick overview of their
ideas about a visual interface for text analysis software, which now
goes by the name of Eye-ConTact. Inspired by some visual-programming
tools for scientific visualization, Eye-ConTact involves a sort of GUI
tool for creating pipelines of filters (which, however, unlike Unix
pipes, can have more than one input or output stream); to make this
work, one needs to have (1) process modules to do things to the data,
(2) framework modules to manage the user interface and (based on how
the user connects pipeline stages) manage the process modules, (3)
data files and result files, and (4) `map' files, which show the
combinations of processes which resulted in a given result.
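(What follows is not Eye-ConTact, only a sketch in Python, with
invented stage names, of the underlying dataflow idea: process
modules are wired into a graph by a small framework module, and,
unlike Unix pipes, a stage may take more than one input.)

    def tokenize(text):            # process module: one in, one out
        return text.split()

    def frequency(tokens):         # one in, one out
        freq = {}
        for t in tokens:
            freq[t] = freq.get(t, 0) + 1
        return freq

    def compare(freq_a, freq_b):   # two inputs, one output
        return {w: (freq_a.get(w, 0), freq_b.get(w, 0))
                for w in set(freq_a) | set(freq_b)}

    def run(graph, sources):
        # Framework module: evaluate each named stage once, feeding it
        # the outputs of the stages it is wired to.  Stages must be
        # listed after their inputs (no cycle detection here).
        results = dict(sources)
        for name, func, inputs in graph:
            results[name] = func(*(results[i] for i in inputs))
        return results

    graph = [
        ('tok_a',  tokenize,  ['text_a']),
        ('tok_b',  tokenize,  ['text_b']),
        ('freq_a', frequency, ['tok_a']),
        ('freq_b', frequency, ['tok_b']),
        ('diff',   compare,   ['freq_a', 'freq_b']),
    ]
    results = run(graph, {'text_a': 'to be or not to be',
                          'text_b': 'to do or to die'})
    print(results['diff']['to'])   # (2, 2)

The graph itself, saved to disk, would serve roughly the purpose of
their `map' files: a record of which processes produced a given
result.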
I like the visual interface Bradley and Rockwell describe for
controlling the modular pieces of their system, but I think their most
important contribution was to make explicit a point which had arisen
obliquely in several discussions already: the next-generation software
system we are thinking about must not only comprise a number of
independent, interoperating modules with consistent interfaces, it
must also be open: it must allow single modules to be replaced by
other modules possibly developed by different programmers; it must be
possible to add new modules to the system, and to access data at any
and every module boundary.
I think this is important because it matches the reality of the user
requirements, and our funding situation. Research requirements cannot
be mapped out exhaustively in advance, because research involves
asking questions to which the answers are not yet known -- and from
time to time asking questions which themselves were not foreseen when
the research began. A research-oriented software system cannot,
therefore, be exhaustively complete, except in the most narrow
technical sense (if it has a programming language, it can be Turing
complete): newly discovered interests may require new processing
modules at any time. It is essential, therefore, that it be possible
to add new modules to a system -- possible not just for the original
developers, but for any and every user willing to take up a keyboard
and write the new modules. Extensibility is an absolutely essential
requirement for a really satisfactory system. And, in the current
funding climate, it's easier to imagine finding funds to make a new
module here and a new module there than funds to build a large new
system from scratch.
(It will be objected that writing new modules may require programming
skill. Yes, it may. Some will ask, what about the average humanist
who does not have programming skills? To which there is no answer but
to ask, what about the average humanist whose research leads into an
area which requires a good reading knowledge of Sanskrit?)
Tom Horton concluded the session with a talk about domain-specific
analysis, a new trend in software engineering which promises, he
thinks, substantial benefits for relatively circumscribed domains like
scholarly text analysis. By focusing on specific domains, this
approach makes it easier to create reusable resources: it focuses not
just on reusability of code, but also on reusable requirements
statements, user modeling, specifications, and the like. In an ideal
world, these are implemented by cleanly defined libraries supporting
well specified application-program interfaces, and it becomes easy to
implement a wide variety of software by combining these library
routines in various ways.
At this point, we broke into groups again, to discuss architectural
issues, searching capabilities, and user needs. The architectural
group began, plausibly enough, by deciding to decide what it might
mean to specify an architecture for the kind of system we had been
talking about: what needs to be specified, and at what level of
detail? This is, surely, a necessary first step. It would be nice to
be able to report that after it, we had taken another one, but after
we had reached something resembling agreement, it was time for
lunch. Our dedication to the cause fought with our desire to eat;
struggled; wavered; lost. We went to lunch.
For what it's worth, before knocking off we agreed that a general
document describing various components of the system would be needed
(what do we mean by user interface? Where does it start and end? Ditto
for MOdule-Management (MOM) modules, which direct the actions of the
other modules and child processes. Ditto for the other parts of the
large spaghetti-like diagram we drew on the flip chart.) About the
inter-module communications, we agreed only that a specification of
the architecture should describe how modules might send each other
messages (so that one window, controlled by one module, could be
updated in reaction to user actions taken in another window controlled
by a separate module), and what kinds of primitive search or other
operations might be undertaken. This last specification might take the
form of a grammar for a query language, which could be interpreted
either as the rules governing messages from a client to a query
server, or as a concise notation with which a user interface
describes, to a MOM module, what the user has just requested the MOM
tell the other modules to do.
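(Purely as an illustration of what such message-passing might look
like -- all module names and message fields below are invented, not
anything the group specified -- a few lines of Python:)

    class MOM:
        # MOdule-Management module: relays messages between the
        # modules registered with it.
        def __init__(self):
            self.modules = {}

        def register(self, name, handler):
            self.modules[name] = handler

        def send(self, target, message):
            return self.modules[target](message)

    def search_module(msg):
        # Pretend query server: answers a query with a list of hits.
        corpus = {'whale': ['Moby-Dick 1.1', 'Moby-Dick 32.4']}
        return corpus.get(msg['query'], [])

    def display_module(msg):
        print('Concordance window now shows:', ', '.join(msg['hits']))

    mom = MOM()
    mom.register('search', search_module)
    mom.register('display', display_module)

    # A click in one window becomes a query; the MOM passes the
    # result on to the module controlling another window.
    hits = mom.send('search', {'verb': 'find', 'query': 'whale'})
    mom.send('display', {'verb': 'show', 'hits': hits})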
The query language (John Bradley resisted the term, on the grounds
that it biases the mind toward a client/server approach, but we didn't
find any other short name for the thing) seemed to be potentially very
important, because a good non-proprietary query language could be used
to allow single clients to deal with multiple search servers, each
with their own query language. The client, or else a shim between the
network and the search server, would translate queries from the common
language into the server's own language, and package the results
appropriately before returning them. Antonio Zampolli pointed out
that such functionality was not a chimera; a European project will be
starting up in the next few months to define a common query language
for precisely that purpose: to allow single-client interrogation of
multiple query systems. (Since the meeting, Robin Cover has drawn my
attention to a project in Canada with a similar goal: the Canadian
Strategic Software Consortium, with a WWW home page at
http://www.cssc.ca/.)
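(A sketch of the shim idea, again in Python; the two server syntaxes
below are invented for illustration, and no real engine's query
language is being quoted.)

    COMMON = {'element': 'l', 'contains': 'love'}  # lines containing "love"

    def to_server_a(q):
        # Imaginary region-algebra style syntax.
        return 'region("%s") including "%s"' % (q['element'], q['contains'])

    def to_server_b(q):
        # Imaginary keyword style syntax.
        return 'FIND %s WITHIN <%s>' % (q['contains'], q['element'])

    def shim(common_query, translate, send):
        # Translate the common query, hand it to the server, and wrap
        # the reply in a uniform result structure for the client.
        return {'query': common_query,
                'hits': send(translate(common_query))}

    fake_server = lambda query_string: ['hit for [%s]' % query_string]

    print(shim(COMMON, to_server_a, fake_server))
    print(shim(COMMON, to_server_b, fake_server))

The client sees one result structure no matter which server answered;
that is the whole point of agreeing on a common language.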
5 Is the Glass Half Full or Half Empty?
It was about this time that I decided I had come to the meeting with a
fundamentally misconceived notion of what we should be aiming at, of
what would constitute perfect success of such a planning
meeting. While carefully keeping an open mind on the issue, I had
imagined that what one ought to be hoping for was the emergence of a
consensus among the participants that a new generation of
text-analysis software is indeed needed, some shared ideas about what
that new generation might look like, and an agreed-on plan for
organizing systematic technical work to design and promulgate an
architecture for interoperating modules, followed by systematic
efforts to develop modules to fit into that overall architecture.
There certainly seems to be a consensus that a new generation of tools
is needed -- though to be honest I am taking a short leap of faith
here, since there was no opportunity to put this proposition to an
up-or-down vote or discussion. There is also very strong agreement
among some participants about the overall framework within which the
next generation should be developed: an open, modular system designed
to work both in client/server and in stand-alone environments, within
which institutions could deploy Pat or the other search engine of
their choice, working on SGML-encoded text and exploiting SGML for
other aspects of the system as well (e.g. for communication between
client and server, for style sheets, and for work files and other
inter-module communications), with modules developed in parallel at
multiple sites and not centrally, based on bottom-up democratic
development rather than top-down planning, and governed by the
well-known motto "Rough consensus and running code." These principles
seem so obvious to some potential collaborators that they go almost
without saying. But not everyone agrees. Some participants clearly
have very different views about how the software should be developed,
or what sort of architecture should govern it. The divergence of views
may reflect simple lack of communication or confusion about what is
meant by various catch-phrases or shorthand references, but in part I
think the differences of opinion are real. Both causes, no doubt,
contributed to the difficulty experienced by the work groups in
arriving at anything beyond generalities (an architectural
specification must specify the interfaces between modules; modularity
is good; motherhood and apple pie are admirable). It was regrettable
(this is where the glass begins to look empty) that there was no good
opportunity in the meeting to air these differences and try to iron
them out.
I realized, about lunchtime on the second day, that I no longer felt a
systematic top-down definition of architecture was realistic, or even
necessarily desirable. If it delays experimentation with new modules,
it is emphatically undesirable. What is needed is a commitment to
cooperative work among developers in a chaotic environment of
experimentation and communication. If we were building a closed,
monolithic system, planning and prior agreement about everything would
be as desirable as they always are in software engineering. But the
one point on which everyone seems agreed is that we need an open,
extensible system, to work with texts we have not read yet, on
machines that have not been built yet, performing analyses we have not
invented yet. This is not a system for which we can plan the details
in advance; its architecture, if we insist on calling it that, will be
an emergent property of its development, not an a priori
specification. We are not building a building; blueprints will get us
nowhere. We are trying to cultivate a coral reef; all we can do is try
to give the polyps something to attach themselves to, and watch over
their growth.
In practice, I think this means that what is needed is regular
communication among developers writing software for textual analysis
who are willing to make a shared commitment to cooperation, reuse and
sharing of code, and interoperability among their programs. The goal
should be to grow a coral reef of cooperating programs, not to attempt
to decide in advance what scholars will need and how software should
meet those needs. Improvisation and social pressure to Do the Right
Thing are important, as are the programmer's cardinal virtues of
laziness, impatience, and hubris (which can, properly channeled and
supported by communication, lead to effective reuse and improvement of
modules). Not all developers will be willing or able to do this,
though I think enough are to make it worth while. Any funding agency
willing to fund a small implementors' round-table meeting once every
six to nine months would be performing a massive service for the
humanities. Even a concerted effort to schedule panels and exchange
experiences at the annual conferences of ACH and ALLC might bring
useful results.
This train of thought came to me late enough in the meeting that I
was not able to discuss it at any length with other participants;
these remarks should not be taken as representing the views of anyone
else at the meeting, let alone a consensus of the participants. But if
the point of the meeting was to consider how to go about creating the
next generation of text analysis tools, then for me at least it was a
rousing success, because now I have a clear notion of how I think we
should proceed, where I had only uncertain, conflicting ideas
before. The path is not the one I expected. But I now feel confident
that there is a path.
Is the glass half full or half empty? Well, the disappointing news is
that the glass I had in mind when I left for this meeting is neither:
it has a hole in the bottom. The good news is, there is another glass,
and it's half full now. And if we can get our hands on a pitcher, we
can fill it up to overflowing. As I sit here in the hot, airless
departure lounge of Newark Airport, I am very thirsty. An overflowing
glass would be welcome indeed.
For the promise of that overflowing glass, everyone interested in
humanities computing owes a debt to Susan Hockey and to CETH for
organizing and supporting this meeting.