SGML '94 Trip Report: Michael Sperberg-McQueen

Header

From @UTARLVM1.UTA.EDU:owner-tei-l@UICVM.UIC.EDU Fri Nov 18 20:58:05 1994

Date:         Fri, 18 Nov 1994 18:21:14 CST
Reply-To: "C. M. Sperberg-McQueen" <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Sender: Text Encoding Initiative public discussion list
              <TEI-L%UICVM.bitnet@UTARLVM1.UTA.EDU>
From: "C. M. Sperberg-McQueen" <U35395%UICVM.bitnet@UTARLVM1.UTA.EDU>
Organization: ACH/ACL/ALLC Text Encoding Initiative
Subject:      SGML '94 Trip Report
To: Multiple recipients of list TEI-L <TEI-L%UICVM.bitnet@UTARLVM1.UTA.EDU>

Report


                              Trip Report

                SGML '94 and SGML/Open Technical Meeting

             (Tyson's Corner, Virginia, 7-11 November 1994)


                         C. M. Sperberg-McQueen

                      November 18, 1994 (17:36:06)


   SGML '94 is the fifteenth or so in the series of annual SGML confer-
ences organized by the Graphic Communications Association (GCA).  The
gathering this year was held in Tyson's Corner, Virginia, just west of
Washington, D.C.  It continued the pattern of the last few years by
growing about 50% from the gathering of the preceding year; about 700
people attended, up from 450 or so last year.  The organizers had been
prepared for some growth, but not quite so much; the hotel staff, to
their credit, worked hard to handle the overflow, and I heard no com-
plaints from the participants.

   I always enjoy the GCA's SGML conference, both for the generally sol-
id technical content and for the chance to see old friends.  In some
respects, the latter advantage outweighed the former this year:  with so
many new attendees, it is natural for there to be less hard-core techni-
cal material, disappointing though that may be for the hard-core SGML
techies among us.  (Next year, I am told, the conference will feature a
"geek track" so labeled, which should attract a good selection of hard-
core technical talks.)  This year's conference was not, however, com-
pletely bereft of technical content:  several talks provided as much
solid food for thought as one might ask for.  Herewith a selective view
of the highlights.

   As usual, Yuri Rubinsky and Tommie Usdin (chair and co-chair, respec-
tively, of the conference) gave a presentation during the opening ses-
sion, on the SGML Year in Review.  High points I find in my notes are:
that the SGML Conformance test suite developed by the GCA is now avail-
able on the Internet (I didn't get the address); that Lynne Price is
chairing a committee on registration of SGML public identifiers; that
ISO DIS 10179, the Document Style Semantics and Specification Language
(DSSSL) is now being balloted for adoption as an international standard,
with a meeting set for February 1995 to resolve comments.  ISO 13673, on
SGML conformance testing, has been approved, and the National Institute
of Standards and Technology (NIST) is setting up a conformance testing
project.  (There was much muttering and gnashing of teeth over the way
NIST has handled this effort so far:  interested parties were not noti-
fied of the request for bids, and not everyone has total faith in NIST's
ability to handle the task without more consultation with the SGML com-
munity than they appear to be interested in undertaking.)  TEI P3 was
published.  (I'd like to report that at this point, the proceedings were
halted for a ten-minute spontaneous demonstration of the sort familiar
from American political conventions, where the mention of a candidate's
name sends the hall into a frenzy of celebration.  I'd like to, that is,
but on the whole, the audience took the announcement that TEI P3 is
done with a certain phlegmatic equanimity.)

   Other projects and initiatives also reached milestones this year:
the DocBook DTD is out in release 2.2, the Pinnacles DTD for semiconduc-
tor component data sheets is out in release 1.1, and the Air Transport
Association's DTD has been adopted and adapted for internal use by Luf-
thansa and United Airlines.  The news industry has formulated a so-
called "Universal Text Format", the acronym for which will sow lots of
confusion for people interested in both SGML and character sets, since
UTF is the acronym for a common compression scheme for the 32-bit char-
acter set ISO 10646.  The work of the International Committee for Acces-
sible Document Design (ICADD) has been extended, but my notes become
illegible here, so I can't tell you exactly how.

   The World Wide Web has become the best-known SGML application in the
world, though by no means the largest in terms of volume of data.  (Sev-
eral vendors said in public and private that they had at least six or
eight clients with more data each than all of the WWW put together.)
And the Web has gone commercial with a vengeance, the freely available
Mosaic developed at NCSA (the National Center for Supercomputing Appli-
cations, which also supplied this year's keynote speaker) now having
several commercial competitors.  The most important announcement in this
connection was, I personally believe, the one Yuri did not make, presum-
ably to avoid apparent conflict of interest.  Namely, that SoftQuad has
agreed to supply an SGML browser called Panorama, which can handle arbi-
trary SGML DTDs, to be bundled with copies of NCSA Mosaic.  (N.B. by
ARBITRARY DTDs I do not mean DTDs which themselves are capricious and
irrational, but DTDs chosen by the user's free will, according to the
user's own lights.)

   In other news of interest, Microsoft has now announced its SGML prod-
uct which ties in, naturally, to MS Word.  Avalanche also displayed bro-
chures for several packages intended to assist users in exploiting the
Microsoft materials.  And the conference was filled with the whispers of
organizations seeking to hire new people.

   Joseph Hardin of NCSA gave the keynote address, which described NCSA
and explained why a supercomputing center has paid so much attention to
issues which, on first glance, are not related to supercomputing, such
as visualization, data handling and data formats, software for long-
distance collaboration, and information systems such as the World Wide
Web.  All of these are supported at NCSA because NCSA sees its mission
as supporting what Hardin called the "computational science revolution",
and these seem to be useful in that context.  Hardin stressed the impor-
tance of standards:  URLs and URNs, HTML, HTTP, ANSI Z39.50 (a network-
based protocol for information retrieval), and the Common Client Inter-
face (CCI, recently announced by NCSA:  this protocol allows external
viewers like Panorama to ask the WWW client software which launched them
to perform WWW services, such as fetching items from the net, on their
behalf).

   Charles Goldfarb, newly independent after years at IBM, reviewed the
current status of Project YAO, a cooperative project of partners in Chi-
na, California, and Norway to write an SGML parser to make publicly
available in source form.  The parser will include an application pro-
gramming interface (API) for access to low-level parser events (such as
recognition of a start-tag, recognition of an end-tag, etc.), in a pat-
tern familiar from other parser interfaces.  Procedures for saving and
restoring the parse state will ultimately be provided (i.e. they are not
yet available), which can be used to implement incremental parsing.
Multiple concurrent parsing contexts are also supported, Goldfarb said,
though in answer to a question he explained that he did not mean the
parser supported, or would support, CONCUR.  The multiple contexts will
allow parsing with different SGML declarations, different DTDs and link
process definitions, and can thus be used to check the conformance of a
document instance to the architectural forms.
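
   As a rough illustration of what an event-style parser interface of
this general kind looks like, here is a toy sketch in Python.  It is
not the YAO API, which is not reproduced in this report:  the names
parse_events, run, START, END, and TEXT are invented for the example,
and the code assumes fully tagged input with no attributes, declara-
tions, or markup minimization.

```python
import re

# Hypothetical event kinds; not taken from the YAO design documents.
START, END, TEXT = "start-tag", "end-tag", "text"

# Matches <name> or </name>; attributes and minimization are not handled.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def parse_events(source):
    """Yield (kind, value) events for a fully tagged document string."""
    pos = 0
    for match in _TAG.finditer(source):
        if match.start() > pos:
            # Character data between the previous tag and this one.
            yield (TEXT, source[pos:match.start()])
        kind = END if match.group(1) else START
        yield (kind, match.group(2))
        pos = match.end()
    if pos < len(source):
        yield (TEXT, source[pos:])


def run(source, handlers):
    """Call the handler registered for each event kind, if there is one."""
    for kind, value in parse_events(source):
        handler = handlers.get(kind)
        if handler is not None:
            handler(value)


if __name__ == "__main__":
    def show(kind):
        return lambda value: print(kind, repr(value))

    run("<p>Wash <hi>sinks</hi>.</p>",
        {START: show(START), END: show(END), TEXT: show(TEXT)})
```

   The application sees only the stream of events and registers call-
backs for the ones it cares about; a real SGML parser would of course
also have to consult the DTD and the SGML declaration.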

   A variable-persistence cache will be provided, to allow rapid access
to parsed fragments; the cache will use a proprietary format.

   The low-level interface will be complemented by a high-level inter-
face to the "information objects" of the document in terms of its entity
and element structure.  This interface will also support references to
objects by means of HyTime location addressing.

   The Portable Object-oriented Entity Manager (POEM) is a separate
software project and may (if I understand things right) be incorporated
in other parsers, not just in the YAO parser.  POEM will provide a com-
plete buffer separating the entity structure of SGML from the file
structure of the operating system, allowing multiple entities to be
stored in a single file, and vice versa.
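
   The idea of separating entities from files can be sketched in a few
lines of Python.  This is only an illustration of the principle, under
the assumption that an entity is a sequence of byte ranges in one or
more files; the class EntityManager and its methods register and read
are invented here and are not POEM's interface.

```python
import os
import tempfile


class EntityManager:
    """Toy mapping from entity names to storage locations.

    Hypothetical illustration only; this is not POEM's interface.
    """

    def __init__(self):
        # entity name -> list of (path, offset, length) segments, in order
        self._segments = {}

    def register(self, name, path, offset=0, length=None):
        """Add one storage segment to the named entity.

        A length of None means "from offset to the end of the file".
        """
        self._segments.setdefault(name, []).append((path, offset, length))

    def read(self, name):
        """Return the entity's text by concatenating its segments."""
        parts = []
        for path, offset, length in self._segments[name]:
            with open(path, "rb") as handle:
                handle.seek(offset)
                data = handle.read() if length is None else handle.read(length)
            parts.append(data.decode("ascii"))
        return "".join(parts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        packed = os.path.join(folder, "packed.dat")
        first, second = b"<title>Trip Report</title>", b"<p>Body text.</p>"
        with open(packed, "wb") as out:
            out.write(first + second)      # two entities in one file
        manager = EntityManager()
        manager.register("title", packed, 0, len(first))
        manager.register("body", packed, len(first), len(second))
        print(manager.read("title"))
        print(manager.read("body"))
```

   Registering several segments under one name gives the converse
case, a single entity spread over more than one file.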

   Design documents and some but not all of the code are available for
review from ftp://ftp.ifi.uio.no/pub/SGML/YAO.

   The evening sessions at GCA SGML conferences have in recent years
been devoted to some thorny technical issue or other, leading to memorable
arguments over query languages, SGML transformation tools, and so on.
This year, the sessions were on tables and, ... and, ... well, there was
a second one, but I had to look it up in my program to remember that the
topic was visual display of structural information.  I confess that
instead of attending either of these, I went to dinner, with a group of
people who turned out all to be interested in style sheets, especially
for network distribution of SGML documents.  We promptly turned our-
selves into an informal cabal and plotted a strategy for addressing the
style sheet issue; there turned out to be a strong consensus among those
present that a standard style sheet for net-based browsers was both fea-
sible and desirable, that it can and should be formulated as a subset of
DSSSL, and that SGML Open should consider organizing, or at least spon-
soring, the technical work, and adopting the result.  Later reports said
that the tables session was very interesting, but achieved little con-
sensus.  The session on graphic display of structure seems to have
focused, not surprisingly, on methods of displaying trees onscreen.

   Later in the week, the cabal produced results, in the form of a
very preliminary proposal for a subset of DSSSL for use in network
browsers; SGML Open discussed this at some length at the end of the
week, and work continues under the leadership of Steve Pepper, who
should be contacted (at pepper@falch.no) for further information.

   On the second day of the conference, SGML Open had arranged an all-
day series of talks reviewing all the various components of an SGML sys-
tem.  The morning started with an able survey of DTD development and
other utilities by Debbie Lapeyre, and continued with equally useful
surveys of parsers and SGML transformers (Pam Gennusa), SGML editors
(Paul Grosso), tools for electronic delivery (Tim Bray), and programs
for layout and composition (Mark Walters).  Into this series the organ-
izers had also slipped a talk by myself, on SGML database and document
management systems.  I won't speak for my own talk, but the others in
the series were extremely informative, and had far too much content to
be paraphrased successfully here.  There is time and space only to
report Tim Bray's useful distinction between browsers for "live" (chang-
ing) data and browsers for "dead" (static) data, the latter being nota-
ble for their "Kill, Cook, Freeze" processing model.  He also distin-
guished between structure-oriented viewers (mostly SGML-aware) and
page-oriented viewers (mostly not SGML-aware).  Structure viewing is
better, but page viewing is cheaper.  Structure viewing is the technolo-
gy of the future; page viewing is that of the past.

   N.B.  Tim's talk was, as always, very good, but I should note that if
I quote him at some length and the others not at all, it is not because
the other talks were less informative, but because he had better sound
bites.  If plans are successful, all the presentations should appear in
written form in a book, which will, it is hoped, be useful for potential
adopters of SGML.

   The evening session on the second day was devoted to SGML and the
Internet, which turned out (unsurprisingly) to mean SGML and the World
Wide Web.  I sat near a number of participants in the Pinnacles group,
who turned out to share my strong feelings that the long-term future of
the World Wide Web depends upon its abandoning the current idea of sup-
porting only a single SGML DTD (namely HTML) and providing support for
any DTD the information distributor chooses.  The presentations were
aimed at non-technical listeners, and in particular Eric Severson of
Avalanche must be singled out as giving the most persuasive non-
technical discussion of network publishing I have ever heard, explaining
in business terms why WWW, and in particular HTML, is best treated as a
publishing medium and emphatically not as a format suitable for document
creation or maintenance.  As the evening wore on, however, with the
speakers failing ever more emphatically to mention the shortcomings of
HTML, the inevitable inadequacy of any single-DTD strategy for the net,
or any of the technical issues which must be addressed, the Pinnacles
group and I turned into a slightly noisy heckling section.  Fortunately,
before too long the meeting was opened up to comment, and we were able
to applaud vigorously those who, like John Bosak, Terry Allen, and Mur-
ray Maloney, expounded viewpoints we found more rational; this gave us a
much needed outlet; without it, I am afraid we might have begun misbe-
having really badly. As it was, I fear the fellow who spoke about HTML
may still bear some psychological scars; he was clearly prepared for a
non-technical audience, not for one in which 80% reported experience
authoring documents in HTML using ASCII editors.  He did the best one could
expect under the circumstances.

   The next morning was brightened by Dave Sklar's talk about transduc-
ing documents into SGML from other formats (ALCHEMY FOR THE MASSES).
With his usual mixture of sound technical information and manic humor,
Sklar observed that conversion of legacy data has moved beyond the ini-
tial state of technology, in which the user's choice is either to hire a
wizard, or become a wizard -- "enable thyself, or enable thy wallet", to
a more mature state in which the user has a third choice, namely to use
standard Do-It-Yourself-style toolkits, without having to become a wiz-
ard first.  These Do It Yourself tools generally work from the visual
appearance of the document, allow more or less simple mappings from
appearance to SGML tags, and provide a convenient user interface for
specifying the mappings.  They work in a surprising number of cases, but
-- surprise!  -- not in all.
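
   The core of such an appearance-based mapping can be sketched as fol-
lows.  This is a minimal Python illustration, not a description of any
vendor's tool:  the style names, the element names, and the functions
escape and convert are all invented, and the escaping assumes that the
target DTD declares the entities amp and lt.

```python
# Hypothetical style-to-element mapping; a real tool would let the user
# build and edit this table through its interface.
STYLE_TO_TAG = {
    "Heading 1": "head",
    "Body Text": "p",
    "Block Quote": "q",
}


def escape(text):
    """Protect markup delimiters (assumes amp and lt entities exist)."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def convert(paragraphs, mapping=STYLE_TO_TAG):
    """Turn (style, text) pairs into tagged lines.

    Returns (tagged_lines, unmapped_styles).  Paragraphs whose style has
    no mapping are left out of the output and reported, so that a person
    can decide what to do with them.
    """
    tagged, unmapped = [], []
    for style, text in paragraphs:
        tag = mapping.get(style)
        if tag is None:
            unmapped.append(style)
            continue
        tagged.append("<%s>%s</%s>" % (tag, escape(text), tag))
    return tagged, unmapped


if __name__ == "__main__":
    lines, leftovers = convert([
        ("Heading 1", "Trip Report"),
        ("Body Text", "Wash sinks."),
        ("Sidebar", "No mapping was defined for this style."),
    ])
    print("\n".join(lines))
    print("unmapped styles:", leftovers)
```

   The list of unmapped styles is the interesting part:  it is where
the cases that do not work show up.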

   The afternoon and evening of the third day were devoted to a vendor
exhibit session, on which I need not comment here, beyond observing that
there were a lot of vendors, showing a lot of very good software.  Any-
one who argues, nowadays, that there is any shortage of good SGML soft-
ware, is hallucinating or out of touch.  Editors, browsers, document
management systems, typesetting engines, software libraries, search and
retrieval engines; monolithic systems and open-system frameworks; high-
end and low-end software for high- and low-end hardware; I saw all of
these, and I did not actually visit more than a third of the booths.
There is always room for more software, of course, and there are niches
still to be found and explored.  But existing off-the-shelf SGML systems
can now do more than ever before.

   On the final morning, John McFadden of Exoterica gave a talk on the
SGML SHORTREF feature, in which he attempted valiantly to be provoca-
tive.  Unfortunately, for me at least he succeeded only in being provok-
ing and condescending.  He argued that SHORTREF was widely misunder-
stood, because most SGML users focus on relatively simple applications
where SHORTREF does not show to advantage.  The true strength of
SHORTREF shows, he said, in applications with much higher information
density.  Along the way, in the interests of provocation, he took sever-
al potshots at those who would like to simplify the formal syntax of
SGML by removing what they believe to be unnecessary complications,
bells, and whistles.  (Since I have often argued that the formal syntax
of SGML is unnecessarily complex, my negative reaction to his remarks
may reflect irrational pique over these potshots.  The reader must form
an independent opinion.)

   Unfortunately, McFadden never did make clear what he believes the
true strength of SHORTREF really is, or why he claims that without
SHORTREF, applications with high markup density are "impossible". I
asked him what problem he encountered in high-density markup which made
SHORTREF essential, but his only answer was to suggest that if I had to
ask, I probably had never seen data with a really high density of markup
and information.  I respectfully suggested that I had seen such data,
and repeated my question, but he answered only by inviting me to Ottawa,
where he would show me, he said, texts with really dense markup, as much
as one tag per word of content.  I did not take the time to observe that
in the TEI-encoded British National Corpus, every word is tagged with
its part of speech; that in sample encodings of the Dead Sea Scrolls it
is not uncommon for every character to require separate elements record-
ing the certainty of its reading and the percentage of the character
preserved on the papyrus; and that in the first draft of the TEI Guide-
lines, the two-word sentence

      Wash sinks.

is encoded with a lexical and syntactic analysis which runs to six pag-
es.  (I should note that the sentence is four-way ambiguous -- each sin-
gle analysis therefore runs only to a page and a half.  Possibly McFad-
den regards this as not "really" dense.)  It may be that a judicious use
of SHORTREF could make work with these examples simpler -- but on the
whole, working with an SGML parser and an SGML editor with decent style
sheets already makes work with these examples simple enough for me.  So
I continue to be mystified by McFadden's claim that work with densely
marked up text is "impossible" without SHORTREF.  It's not impossible at
all: we've been doing it for years.  I could not help but wonder whether
his view simply reflected a failure to exploit the capabilities of SGML
editors, in which case it is not SGML, but an insistence on using inade-
quate editors, which is causing the problem.  Unfortunately, by failing
to provide concrete examples and by limiting himself to vague and unin-
formative comments rather than specific analysis, McFadden made his talk
vacuous and lost an opportunity to encourage serious technical discus-
sion of a topic he appears to care about a lot.

   Fortunately, two later talks the same morning provided shining exam-
ples of how to encourage technical discussion.  Jean Paoli, of Grif,
spoke about SGML Objects and the issues involved in defining behaviors
for them.  And Makoto Murata, of Fuji Xerox, gave what I thought was the
most substantial technical paper of the conference.  Under the unpre-
possessing title FILE FORMAT FOR DOCUMENTS CONTAINING BOTH LOGICAL
STRUCTURES AND LAYOUT STRUCTURES, Murata described the formal problems
confronting any attempt to record both the logical (or:  a logical) and
the (or: a) physical structure of the same text.  Since these problems
have been a constant looming presence in the TEI, especially in the work
groups for textual criticism, manuscript transcription, and dictionar-
ies, and since the TEI was never able to devise a fully satisfactory
general solution to them, I was particularly interested in his summary.
(In this summary, I will, like Murata, speak of the logical and the physi-
cal view; the problems, however, also occur when more than one logical
view, or more than one physical view, are to be encoded.)  In brief, the
problems include:

*   duplication of data (e.g. in a running head, which appears once in
    the logical view and several times in the physical view)
*   removal of data (e.g. annotations in the logical view which are not
    present in the physical view)
*   addition of data (e.g. page numbers in the physical view, not
    present in the logical view)
*   need for explicit expression of the alignment between elements in
    the two views
*   reordering of data (e.g. migration of footnote or endnote text away
    from its point of attachment to the bottom of the page or to the end
    of the chapter or volume); this Murata calls DISTORTION.  Fine-
    grained alignments of parallel documents in one or more languages
    also exhibit this problem.

The optional SGML feature CONCUR was intended to enable the simultaneous
encoding of multiple views of the document (in particular, of both a
logical and a layout view), but CONCUR has only awkward methods for han-
dling duplication, suppression, and addition of data, and no methods at
all, that I know of, for handling alignment and distortion.  The stan-
dard is silent on whether parsers which support CONCUR must support
simultaneous parsing with more than one DTD, so such parsers may or may
not support explicit linkage between nodes in different document trees.

   Borrowing concepts from other work on document processing and docu-
ment formatting, Murata defined an algorithmic process for augmenting
the logical and physical trees of the document with specialized node
types, which enable him to handle duplication, addition, suppression,
and distortion without having to store any portion of the text more than
once.  The augmented trees have explicit links expressing the correspon-
dences of their nodes, and each tree can be reconstructed in a straight-
forward way, undoing, as necessary, the effects of addition, duplica-
tion, etc.  In many respects, the specialized node types introduced by
Murata resemble the PTR, LINK, and JOIN elements of the TEI encoding
scheme; I need to study his work further before knowing how far these
TEI element types can be used to exploit his insights in a TEI context.
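
   The general idea of storing each stretch of text once and letting
two trees point at it can be shown in a small Python sketch.  It must be
stressed that this is not Murata's algorithm or his node types, which
are not reproduced in this report; the names Node, leaves, STORE, LOGI-
CAL, and PHYSICAL, and the sample content, are invented purely for
illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    span: Optional[str] = None    # key into the shared text store
    added: Optional[str] = None   # text that exists in this view only


def leaves(node, store):
    """Rebuild one view's text, in order, from its tree and the store."""
    if node.span is not None:
        return [store[node.span]]
    if node.added is not None:
        return [node.added]
    result = []
    for child in node.children:
        result.extend(leaves(child, store))
    return result


# Each stretch of text is stored exactly once.
STORE = {"title": "Wash sinks", "before": "First part",
         "note": "(a note)", "after": "second part"}

LOGICAL = Node("chapter", [
    Node("title", span="title"),
    Node("p", [Node("text", span="before"),
               Node("note", span="note"),      # note at point of attachment
               Node("text", span="after")]),
])

PHYSICAL = Node("page", [
    Node("running-head", span="title"),        # duplication
    Node("heading", span="title"),
    Node("body", [Node("text", span="before"),
                  Node("text", span="after")]),
    Node("footnote", span="note"),             # reordering ("distortion")
    Node("folio", added="1"),                  # addition
])

if __name__ == "__main__":
    print(leaves(LOGICAL, STORE))
    print(leaves(PHYSICAL, STORE))
```

   Nodes in the two trees that carry the same span key are thereby
aligned with each other, and either view can be rebuilt without the
other.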

   Murata's work has, I think, critical implications for anyone con-
cerned with document formatting, with systematic encoding of text layout
or physical presentation, with multiple versions of a text (text dis-
placement, Murata's DISTORTION, is one of the hardest problems of text
criticism, not only in electronic form, but also in paper forms), or
with synchronization of multilingual corpora.  I warmly recommend its
close study.

   The concluding talk of the conference was Jean Pierre Gaspart's won-
derful injunction to keep PUSHING THE SGML PARADIGM.  Taking as his
starting point the aphorism "When I was seven, I received a 'hammer' and
suddenly everything looked like a 'nail'", Gaspart observed that in
some approaches, SGML becomes nothing more than a nail for someone
else's hammer:  an export mechanism for relational database management
systems -- which are not intrinsically well suited to the storage and
management of hierarchical data like SGML -- or an alternative notation
for LISP S-expressions -- which Gaspart objected to on the grounds that
S-expressions are trees, while SGML documents are not trees but graphs,
in which containment is just one of many possible relations between
nodes.

   Instead of making SGML a nail for some other technological hammer,
Gaspart proposed that full exploitation of the SGML paradigm will
involve making SGML the hammer:  that is, making SGML the central organ-
izing principle of software systems.  He gave several examples of this
type of organization.  In one, an object-oriented system for tracking
court cases and scheduling sessions in a court of appeal is implemented
by distributed programs which act upon SGML-encoded representations of
the case and its status, and interact with each other by sending SGML-
encoded messages.  The central task dispatcher is an SGML parser which
treats incoming messages as a stack of entities to be parsed, and which
fires appropriate subroutines as a side effect of parsing. Other appli-
cations include SGML-based systems for object-oriented data management,
bill processing at the Belgian telecommunications authority, and a
process-flow control system for the Belgian parliament.  He concluded by
suggesting that the occasion of the SGML formal review should be used to
improve the language, and made a few concrete suggestions (including the
preservation of CONCUR and SHORTREF).  His concluding thought gave a
memorable close to the conference, with which I will also close this
trip report:

    Clarity, Precision and Ease of use does not mean
    Confinement, Verbosity and Futility.