Note: This document was obtained from the Internet.

Trip Report

SGML '94 and SGML/Open Technical Meeting

(Tyson's Corner, Virginia, 7-11 November 1994)
M. Sperberg-McQueen
November 18, 1994 (17:36:06)

SGML '94 is the fifteenth or so in the series of annual SGML conferences organized by the Graphic Communications Association (GCA). The gathering this year was held in Tyson's Corner, Virginia, just south of Washington, D.C. It continued the pattern of the last few years by growing about 50% from the gathering of the preceding year; about 700 people attended, up from 450 or so last year. The organizers had been prepared for some growth, but not quite so much; the hotel staff, to their credit, worked hard to handle the overflow, and I heard no complaints from the participants.

I always enjoy the GCA's SGML conference, both for the generally solid technical content and for the chance to see old friends. In some respects, the latter advantage outweighed the former this year: with so many new attendees, it is natural for there to be less hard-core technical material, disappointing though that may be for the hard-core SGML techies among us. (Next year, I am told, the conference will feature a "geek track" so labeled, which should attract a good selection of hard-core technical talks.) This year's conference was not, however, completely bereft of technical content: several talks provided as much solid food for thought as one might ask for. Herewith a selective view of the highlights.

As usual, Yuri Rubinsky and Tommie Usdin (chair and co-chair, respectively, of the conference) gave a presentation during the opening session, on the SGML Year in Review. High points I find in my notes are: that the SGML Conformance test suite developed by the GCA is now available on the Internet (I didn't get the address); that Lynne Price is chairing a committee on registration of SGML public identifiers; that ISO DIS 10179, the Document Style Semantics and Specification Language (DSSSL), is now being balloted for adoption as an international standard, with a meeting set for February 1995 to resolve comments. ISO 13673, on SGML conformance testing, has been approved, and the National Institute of Standards and Technology (NIST) is setting up a conformance testing project. (There was much muttering and gnashing of teeth over the way NIST has handled this effort so far: interested parties were not notified of the request for bids, and not everyone has total faith in NIST's ability to handle the task without more consultation with the SGML community than they appear to be interested in undertaking.) TEI P3 was published. (I'd like to report that at this point, the proceedings were halted for a ten-minute spontaneous demonstration of the sort familiar from American political conventions, where the mention of a candidate's name sends the hall into a frenzy of celebration. I'd like to, that is, but on the whole, the audience took the announcement that TEI P3 is done with a certain phlegmatic equanimity.)

Other projects and initiatives also reached milestones this year: the DocBook DTD is out in release 2.2, the Pinnacles DTD for semiconductor component data sheets is out in release 1.1, and the Air Transport Association's DTD has been adopted and adapted for internal use by Lufthansa and United Airlines. The news industry has formulated a so-called "Universal Text Format", the acronym for which will sow lots of confusion for people interested in both SGML and character sets, since UTF is the acronym for a common compression scheme for the 32-bit character set ISO 10646. The work of the International Committee for Accessible Document Design (ICADD) has been extended, but my notes become illegible here, so I can't tell you exactly how.

The World Wide Web has become the best-known SGML application in the world, though by no means the largest in terms of volume of data. (Several vendors said in public and private that they had at least six or eight clients with more data each than all of the WWW put together.) And the Web has gone commercial with a vengeance, the freely available Mosaic developed at NCSA (the National Center for Supercomputing Applications, which also supplied this year's keynote speaker) now having several commercial competitors. The most important announcement in this connection was, I personally believe, the one Yuri did not make, presumably to avoid apparent conflict of interest. Namely, that SoftQuad has agreed to supply an SGML browser called Panorama, to be bundled in with copies of NCSA Mosaic, which can handle arbitrary SGML DTDs. (N.B. by ARBITRARY DTDs I do not mean DTDs which themselves are capricious and irrational, but DTDs chosen by the user's free will, according to the user's own lights.)

In other news of interest, Microsoft has now announced its SGML product which ties in, naturally, to MS Word. Avalanche also displayed brochures for several packages intended to assist users in exploiting the Microsoft materials. And the conference was filled with the whispers of organizations seeking to hire new people.

Joseph Hardin of NCSA gave the keynote address, which described NCSA and explained why a supercomputing center has paid so much attention to issues which, on first glance, are not related to supercomputing, such as visualization, data handling and data formats, software for long-distance collaboration, and information systems such as the World Wide Web. All of these are supported at NCSA because NCSA sees its mission as supporting what Hardin called the "computational science revolution", and these seem to be useful in that context. Hardin stressed the importance of standards: URLs and URNs, HTML, HTTP, ANSI Z39.50 (a network-based protocol for information retrieval), and the Common Client Interface (CCI, recently announced by NCSA: this protocol allows external viewers like Panorama to ask the WWW client software which launched them to perform WWW services, such as fetching items from the net, on their behalf).

Charles Goldfarb, newly independent after years at IBM, reviewed the current status of Project YAO, a cooperative project of partners in China, California, and Norway to write an SGML parser to make publicly available in source form. The parser will include an application programming interface (API) for access to low-level parser events (such as recognition of a start-tag, recognition of an end-tag, etc.), in a pattern familiar from other parser interfaces. Procedures for saving and restoring the parse state will ultimately be provided (i.e. they are not yet available), which can be used to implement incremental parsing. Multiple concurrent parsing contexts are also supported, Goldfarb said, though in answer to a question he explained that he did not mean the parser supported, or would support, CONCUR. The multiple contexts will allow parsing with different SGML declarations, different DTDs and link process definitions, and can thus be used to check the conformance of a document instance to the architectural forms. A variable-persistence cache will be provided, to allow rapid access to parsed fragments; the cache will use a proprietary format.
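As a rough illustration of the kind of event-level interface described above, the following Python sketch reports start-tag, end-tag, and data events through callbacks. It is emphatically not the YAO API, whose details are not given here: it uses the html.parser module of the Python standard library as a stand-in for an SGML parser, and the names EventLogger and parse_events are invented for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records low-level parse events in the order the parser reports them.

    Stand-in only: html.parser is not an SGML parser and is not YAO.
    """

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called when the parser recognizes a start-tag.
        self.events.append(("start-tag", tag))

    def handle_endtag(self, tag):
        # Called when the parser recognizes an end-tag.
        self.events.append(("end-tag", tag))

    def handle_data(self, data):
        # Called for character data; whitespace-only runs are skipped.
        if data.strip():
            self.events.append(("data", data.strip()))


def parse_events(text):
    """Parse text and return the list of (event, value) pairs."""
    parser = EventLogger()
    parser.feed(text)
    parser.close()
    return parser.events
```

An application built on such an interface does its work inside the callbacks; saving and restoring the parser's state between calls, as promised for YAO, is what would make incremental parsing possible, and is not attempted in this sketch.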

The low-level interface will be complemented by a high-level interface to the "information objects" of the document in terms of its entity and element structure. This interface will also support references to objects by means of HyTime location addressing.

The Portable Object-oriented Entity Manager (POEM) is a separate software project and may (if I understand things right) be incorporated in other parsers, not just in the YAO parser. POEM will provide a complete buffer separating the entity structure of SGML from the file structure of the operating system, allowing multiple entities to be stored in a single file, and vice versa. Design documents and some but not all of the code are available for review from ftp://ftp.ifi.uio.no/pub/SGML/YAO.
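The separation of entity structure from file structure can be pictured with a small Python sketch. It assumes, purely for illustration, that each entity is declared as a list of (storage object, offset, length) segments; the class name EntityManager and its methods are hypothetical and are not taken from the POEM design documents.

```python
class EntityManager:
    """Maps entity names to byte ranges in storage objects.

    Hypothetical sketch: several entities may share one storage object,
    and one entity may be spread over several storage objects.
    """

    def __init__(self, storage):
        # storage: dict mapping a storage identifier to its bytes
        # (an in-memory stand-in for files on disk).
        self._storage = storage
        self._entities = {}

    def declare(self, name, *segments):
        """Declare an entity as a sequence of (storage_id, offset, length)."""
        self._entities[name] = list(segments)

    def read(self, name):
        """Return the entity's replacement text, assembled from its segments."""
        parts = []
        for storage_id, offset, length in self._entities[name]:
            parts.append(self._storage[storage_id][offset:offset + length])
        return b"".join(parts).decode("utf-8")
```

A parser that asks only for entities by name never needs to know whether two of them happen to live in the same file, or whether one of them is split across two.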

The evening sessions at GCA SGML conferences have in recent years been devoted to some thorny technical issue or other, leading to memorable arguments over query languages, SGML transformation tools, and so on. This year, the sessions were on tables and, ... and, ... well, there was a second one, but I had to look it up in my program to remember that the topic was visual display of structural information. I confess that instead of attending either of these, I went to dinner, with a group of people who turned out all to be interested in style sheets, especially for network distribution of SGML documents. We promptly turned ourselves into an informal cabal and plotted a strategy for addressing the style sheet issue; there turned out to be a strong consensus among those present that a standard style sheet for net-based browsers was both feasible and desirable, that it can and should be formulated as a subset of DSSSL, and that SGML Open should consider organizing, or at least sponsoring, the technical work, and adopting the result. Later reports said that the tables session was very interesting, but achieved little consensus. The session on graphic display of structure seems to have focused, not surprisingly, on methods of displaying trees onscreen.

Later in the week, the cabal produced results, in the form of a very preliminary proposal for a subset of DSSSL for use in network browsers; SGML Open discussed this at some length at the end of the week, and work continues under the leadership of Steve Pepper, who should be contacted (at [email protected]) for further information.

On the second day of the conference, SGML Open had arranged an all-day series of talks reviewing all the various components of an SGML system. The morning started with an able survey of DTD development and other utilities by Debbie Lapeyre, and continued with equally useful surveys of parsers and SGML transformers (Pam Gennusa), SGML editors (Paul Grosso), tools for electronic delivery (Tim Bray), and programs for layout and composition (Mark Walters). Into this series the organizers had also slipped a talk by myself, on SGML database and document management systems. I won't speak for my own talk, but the others in the series were extremely informative, and had far too much content to be paraphrased successfully here. There is time and space only to report Tim Bray's useful distinction between browsers for "live" (changing) data and browsers for "dead" (static) data, the latter being notable for their "Kill, Cook, Freeze" processing model. He also distinguished between structure-oriented viewers (mostly SGML-aware) and page-oriented viewers (mostly not SGML-aware). Structure viewing is better, but page viewing is cheaper. Structure viewing is the technology of the future; page viewing is that of the past.

N.B. Tim's talk was, as always, very good, but I should note that if I quote him at some length and the others not at all, it is not because the other talks were less informative, but because he had better sound bites. If plans are successful, all the presentations should appear in written form in a book, which will, it is hoped, be useful for potential adopters of SGML.

The evening session on the second day was devoted to SGML and the Internet, which turned out (unsurprisingly) to mean SGML and the World Wide Web. I sat near a number of participants in the Pinnacles group, who turned out to share my strong feelings that the long-term future of the World Wide Web depends upon its abandoning the current idea of supporting only a single SGML DTD (namely HTML) and providing support for any DTD the information distributor chooses. The presentations were aimed at non-technical listeners, and in particular Eric Severson of Avalanche must be singled out as giving the most persuasive non-technical discussion of network publishing I have ever heard, explaining in business terms why WWW, and in particular HTML, is best treated as a publishing medium and emphatically not as a format suitable for document creation or maintenance. As the evening wore on, however, with the speakers failing ever more emphatically to mention the shortcomings of HTML, the inevitable inadequacy of any single-DTD strategy for the net, or any of the technical issues which must be addressed, the Pinnacles group and I turned into a slightly noisy heckling section. Fortunately, before too long the meeting was opened up to comment, and we were able to applaud vigorously those who, like John Bosak, Terry Allen, and Murray Maloney, expounded viewpoints we found more rational; this gave us a much-needed outlet; without it, I am afraid we might have begun misbehaving really badly. As it was, I fear the fellow who spoke about HTML may still bear some psychological scars; he was clearly prepared for a non-technical audience, not one of whom 80% reported experience authoring documents in HTML using ASCII editors. He did the best one could expect under the circumstances.

The next morning was brightened by Dave Sklar's talk about transducing documents into SGML from other formats (ALCHEMY FOR THE MASSES). With his usual mixture of sound technical information and manic humor, Sklar observed that conversion of legacy data has moved beyond the initial state of technology, in which the user's choice is either to hire a wizard, or become a wizard -- "enable thyself, or enable thy wallet" -- to a more mature state in which the user has a third choice, namely to use standard Do-It-Yourself-style toolkits, without having to become a wizard first. These Do-It-Yourself tools generally work from the visual appearance of the document, allow more or less simple mappings from appearance to SGML tags, and provide a convenient user interface for specifying the mappings. They work in a surprising number of cases, but -- surprise! -- not in all.

The afternoon and evening of the third day were devoted to a vendor exhibit session, on which I need not comment here, beyond observing that there were a lot of vendors, showing a lot of very good software. Anyone who argues, nowadays, that there is any shortage of good SGML software, is hallucinating or out of touch. Editors, browsers, document management systems, typesetting engines, software libraries, search and retrieval engines; monolithic systems and open-system frameworks; high-end and low-end software for high- and low-end hardware; I saw all of these, and I did not actually visit more than a third of the booths. There is always room for more software, of course, and there are niches still to be found and explored. But existing off-the-shelf SGML systems can now do more than ever before.

On the final morning, John McFadden of Exoterica gave a talk on the SGML SHORTREF feature, in which he attempted valiantly to be provocative. Unfortunately, for me at least he succeeded only in being provoking and condescending. He argued that SHORTREF was widely misunderstood, because most SGML users focus on relatively simple applications where SHORTREF does not show to advantage. The true strength of SHORTREF shows, he said, in applications with much higher information density. Along the way, in the interests of provocation, he took several potshots at those who would like to simplify the formal syntax of SGML by removing what they believe to be unnecessary complications, bells, and whistles. (Since I have often argued that the formal syntax of SGML is unnecessarily complex, my negative reaction to his remarks may reflect irrational pique over these potshots. The reader must form an independent opinion.)

Unfortunately, McFadden never did make clear what he believes the true strength of SHORTREF really is, or why he claims that without SHORTREF, applications with high markup density are "impossible". I asked him what problem he encountered in high-density markup which made SHORTREF essential, but his only answer was to suggest that if I had to ask, I probably had never seen data with a really high density of markup and information. I respectfully suggested that I had seen such data, and repeated my question, but he answered only by inviting me to Ottawa, where he would show me, he said, texts with really dense markup, as much as one tag per word of content. I did not take the time to observe that in the TEI-encoded British National Corpus, every word is tagged with its part of speech; that in sample encodings of the Dead Sea Scrolls it is not uncommon for every character to require separate elements recording the certainty of its reading and the percentage of the character preserved on the papyrus; and that in the first draft of the TEI Guidelines, the two-word sentence


      Wash sinks.

is encoded with a lexical and syntactic analysis which runs to six pages. (I should note that the sentence is four-way ambiguous -- each single analysis therefore runs only to a page and a half. Possibly McFadden regards this as not "really" dense.) It may be that a judicious use of SHORTREF could make work with these examples simpler -- but on the whole, working with an SGML parser and an SGML editor with decent style sheets already makes work with these examples simple enough for me. So I continue to be mystified by McFadden's claim that work with densely marked up text is "impossible" without SHORTREF. It's not impossible at all: we've been doing it for years. I could not help but wonder whether his view simply reflected a failure to exploit the capabilities of SGML editors, in which case it is not SGML, but an insistence on using inadequate editors, which is causing the problem. Unfortunately, by failing to provide concrete examples and by limiting himself to vague and uninformative comments rather than specific analysis, McFadden made his talk vacuous and lost an opportunity to encourage serious technical discussion of a topic he appears to care about a lot.

Fortunately, two later talks the same morning provided shining examples of how to encourage technical discussion. Jean Paoli, of Grif, spoke about SGML Objects and the issues involved in defining behaviors for them. And Makoto Murata, of Fuji Xerox, gave what I thought was the most substantial technical paper of the conference. Under the unprepossessing title FILE FORMAT FOR DOCUMENTS CONTAINING BOTH LOGICAL STRUCTURES AND LAYOUT STRUCTURES, Murata described the formal problems confronting any attempt to record both the logical (or: a logical) and the (or: a) physical structure of the same text. Since these problems have been a constant looming presence in the TEI, especially in the work groups for textual criticism, manuscript transcription, and dictionaries, and since the TEI was never able to devise a fully satisfactory general solution to them, I was particularly interested in his summary. (In this summary, I will, like Murata, speak of the logical and the physical view; the problems, however, also occur when more than one logical view, or more than one physical view, are to be encoded.) In brief, the problems include:

duplication of data (e.g. in a running head, which appears once in the logical view and several times in the physical view)
removal of data (e.g. annotations in the logical view which are not present in the physical view)
addition of data (e.g. page numbers in the physical view, not present in the logical view)
need for explicit expression of the alignment between elements in the two views
reordering of data (e.g. migration of footnote or endnote text away from its point of attachment to the bottom of the page or to the end of the chapter or volume); this Murata calls DISTORTION.

Fine-grained alignments of parallel documents in one or more languages also exhibit this problem.

The optional SGML feature CONCUR was intended to enable the simultaneous encoding of multiple views of the document (in particular, of both a logical and a layout view), but CONCUR has only awkward methods for handling duplication, suppression, and addition of data, and no methods at all, that I know of, for handling duplication and distortion. The standard is silent on whether parsers which support CONCUR must support simultaneous parsing with more than one DTD, so such parsers may or may not support explicit linkage between nodes in different document trees.

Borrowing concepts from other work on document processing and document formatting, Murata defined an algorithmic process for augmenting the logical and physical trees of the document with specialized node types, which enable him to handle duplication, addition, suppression, and distortion without having to store any portion of the text more than once. The augmented trees have explicit links expressing the correspondences of their nodes, and each tree can be reconstructed in a straightforward way, undoing, as necessary, the effects of addition, duplication, etc. In many respects, the specialized node types introduced by Murata resemble the PTR, LINK, and JOIN elements of the TEI encoding scheme; I need to study his work further before knowing how far these TEI element types can be used to exploit his insights in a TEI context.
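To make the problems concrete, here is a deliberately small Python sketch in which a logical view and a physical (paginated) view both point into a single text store, so that no text is stored twice. It does not reproduce Murata's algorithm or his node types, which are not spelled out in this report; the names Span, Generated, STORE, render, and align, and the sample text, are all invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A pointer into the shared text store; the text itself is never copied."""
    key: str


@dataclass(frozen=True)
class Generated:
    """Material present in one view only, such as a page number."""
    text: str


# Every piece of text is stored exactly once.
STORE = {
    "head": "Trip Report",
    "para1": "First paragraph.",
    "note": "A footnote to the first paragraph.",
    "para2": "Second paragraph.",
    "annot": "An editorial annotation.",
    "para3": "Third paragraph.",
}

# Logical view: the note sits at its point of attachment.
LOGICAL = [Span("head"), Span("para1"), Span("note"),
           Span("para2"), Span("annot"), Span("para3")]

# Physical view: running head duplicated, annotation removed,
# page numbers added, footnote moved to the foot of the page.
PHYSICAL = [
    [Span("head"), Span("para1"), Span("para2"), Span("note"), Generated("1")],
    [Span("head"), Span("para3"), Generated("2")],
]


def render(nodes):
    """Resolve a sequence of nodes to the text they stand for."""
    return [STORE[n.key] if isinstance(n, Span) else n.text for n in nodes]


def align(logical, pages):
    """For each logical node, list the (page, position) pairs where it appears."""
    return {
        node.key: [(p, i)
                   for p, page in enumerate(pages)
                   for i, item in enumerate(page)
                   if isinstance(item, Span) and item.key == node.key]
        for node in logical
    }
```

In the table returned by align, the running head maps to two places (duplication), the annotation to none (removal), and the footnote to a position after the paragraph that follows it in the logical view (distortion); the page numbers appear in the physical view alone (addition).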

Murata's work has, I think, critical implications for anyone concerned with document formatting, with systematic encoding of text layout or physical presentation, with multiple versions of a text (text displacement, Murata's DISTORTION, is one of the hardest problems of text criticism, not only in electronic form, but also in paper forms), or with synchronization of multilingual corpora. I warmly recommend its close study.

The concluding talk of the conference was Jean Pierre Gaspart's wonderful injunction to keep PUSHING THE SGML PARADIGM. Taking as his starting point the aphorism "When I was seven, I received a 'hammer' and suddenly everything looked like a 'nail'", Gaspart observed that in some approaches, SGML becomes nothing more than a nail for someone else's hammer: an export mechanism for relational database management systems -- which are not intrinsically well suited to the storage and management of hierarchical data like SGML -- or an alternative notation for LISP S-expressions -- which Gaspart objected to on the grounds that S-expressions are trees, while SGML documents are not trees but graphs, in which containment is just one of many possible relations between nodes.

Instead of making SGML a nail for some other technological hammer, Gaspart proposed that full exploitation of the SGML paradigm will involve making SGML the hammer: that is, making SGML the central organizing principle of software systems. He gave several examples of this type of organization. In one, an object-oriented system for tracking court cases and scheduling sessions in a court of appeal is implemented by distributed programs which act upon SGML-encoded representations of the case and its status, and interact with each other by sending SGML-encoded messages. The central task dispatcher is an SGML parser which treats incoming messages as a stack of entities to be parsed, and which fires appropriate subroutines as a side effect of parsing. Other applications include SGML-based systems for object-oriented data management, bill processing at the Belgian telecommunications authority, and a process-flow control system for the Belgian parliament. He concluded by suggesting that the occasion of the SGML formal review should be used to improve the language, and made a few concrete suggestions (including the preservation of CONCUR and SHORTREF). His concluding thought gave a memorable close to the conference, with which I will also close this trip report:

    Clarity, Precision and Ease of use does not mean
    Confinement, Verbosity and Futility.