SGML workshop at WWW'94 - a summary

The workshop on the use of SGML within WWW was held at the WWW '94 conference, on Wednesday, 25th May. There were 20 participants, including the convenor, Bert Bos. The call for participation is still on-line.

(For a report on the whole of the conference, seen through the eyes of Dan Connolly, see his trip report.)

OVERVIEW

The workshop concerned itself mainly with ways to publish SGML-coded documents through the Web. The focus was on textual material, such as books, magazines, and memos, but some of the proposed solutions are also applicable to other SGML-coded material, such as bibliographic databases, CAD drawings, etc.

Where the documents originally came from, and how or why they were marked up with SGML, was not addressed. However, in some cases the origin of the documents has a bearing on the desired way of presenting them on-line. For example, people studying medieval manuscripts want not only a transcript, but also very precise information about the position of words on the page, decorations between text lines, etc.

`DOWN-CONVERSION'

One approach that is already widely used is `down-conversion'. The document to be published is held in a very richly marked-up form, rich enough to support conversion to paper-based printing as well as to HTML.

Elsevier Publishers use this scheme. Their in-house DTD is a slight adaptation of the ISO Book DTD. Simple converters that they implemented themselves can produce various derived formats.

This seemed to be the preferred way of managing `normal' documents. The representatives of publishers that were present at the meeting all seemed to agree that this was the way to handle the bulk of the books and journals. Which DTD is best was not clear, but that may not be a real issue, since publishers never give away their richest sources anyway.

For the documents that are typically published in this way, it is not a problem that the output is poorer in mark-up than the original, although it was generally agreed that HTML is a little too limited; HTML+ would do fine, especially when HTML style sheets are available.
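
The down-conversion idea described in this section can be illustrated with a minimal sketch. It is not the converter of any publisher mentioned here: the element names in TAG_MAP, the sample document and the function name down_convert are all hypothetical, and Python's xml.etree.ElementTree parser stands in for a real SGML parser, so the input must be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a rich in-house DTD to plain HTML.
# Several distinct source elements collapse onto the same HTML tag,
# which is why the output is poorer in mark-up than the source.
TAG_MAP = {
    "article": "body",
    "title": "h1",
    "para": "p",
    "emph": "em",
    "foreign": "em",
}


def down_convert(source: str) -> str:
    """Rename every element to its HTML counterpart and drop attributes."""
    root = ET.fromstring(source)
    for elem in root.iter():
        # A KeyError here means the mapping has no rule for this element;
        # a real converter would need an explicit policy for that case.
        elem.tag = TAG_MAP[elem.tag]
        elem.attrib.clear()
    return ET.tostring(root, encoding="unicode")


RICH = (
    '<article><title>On Ink</title>'
    '<para>The <foreign lang="la">scriptorium</foreign> used '
    '<emph>iron-gall</emph> ink.</para></article>'
)

print(down_convert(RICH))
```

In this sketch both `foreign' and `emph' end up as EM, and the language attribute is lost, so the HTML cannot be converted back into the rich form. That is the sense in which the conversion goes `down'.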

FAITHFUL REPRODUCTION

Some documents, most notably old manuscripts and commercial PR material, need to be reproduced on screen as close to the original as possible.

The best way to achieve this, short of sending PostScript/PDF, is to use style sheets. Unfortunately, not much can be said about the way in which they will be defined, except that the apparent need for them is so great that it surely can't take long before they are available.

Style sheets need to be standardized, so that authors of documents can specify the style they desire and link it (via a URL) to the documents. The style mechanism needs to be powerful enough to support variations in hardware and font substitutions (though fonts might be linked-in via URLs as well).

Using in-lined images for part of the document can solve some of the problems. The FIG tag of HTML+ in particular can do a lot of good here, since it allows an image, its transcript and its hot-spots to be specified together.

The most extreme form of this would be to give the whole of the document as a single FIG. But there may be other formats that support this functionality more easily. There was a suggestion that Microcosm (Univ. of Southampton) could have the format for this.

PUBLISHING THE SOURCE ITSELF

Sometimes it is desirable to have the richest possible mark-up available to the client, for example when the client wants to process the document further with SGML tools. An example would be the publishing of documents in the TEI mark-up, that people might want to view with their browsers as well as save to file for local analysis.

There are different ways of doing this. The most general involve bypassing the normal browser and using an external viewer, possibly with another form of style sheet.

Overlaying HTML as an architectural form would allow the normal HTML browser and the HTML style sheets to be used again, provided the browser were made capable of parsing general SGML documents, including DTDs. General solutions in this direction seem to be some way off still. Maybe HTML-4.0, the successor to HTML+, will be a `META-DTD', but that is just speculation, of course.
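
As a loose illustration of the architectural-form idea, the sketch below derives an HTML view on the client side while leaving the rich source untouched. Everything in it is an assumption made for the example: the attribute name HTML, the TEI-like element names, the DTD_DEFAULTS table (which stands in for fixed attribute defaults that a real system would read from the DTD) and the function name html_view. Python's xml.etree.ElementTree again stands in for an SGML parser.

```python
import copy
import xml.etree.ElementTree as ET

ARCH_ATTR = "HTML"  # hypothetical attribute naming the HTML form of an element

# Stand-in for defaults that would normally be declared in the DTD.
DTD_DEFAULTS = {
    "text": "body",
    "head": "h1",
    "lg": "ul",
    "l": "li",
    "hi": "em",
}


def html_view(doc: ET.Element, dtd_defaults: dict) -> ET.Element:
    """Return a copy of doc in which each element is renamed to its HTML form."""
    view = copy.deepcopy(doc)  # the rich source itself is not modified
    for elem in view.iter():
        # An attribute in the instance overrides the default from the DTD.
        form = elem.attrib.pop(ARCH_ATTR, None) or dtd_defaults.get(elem.tag)
        if form is None:
            raise ValueError(f"no HTML form declared for <{elem.tag}>")
        elem.attrib.clear()
        elem.tag = form
    return view


SOURCE = (
    '<text><head>Song</head><lg>'
    '<l>First <hi HTML="b">line</hi></l><l>Second line</l>'
    '</lg></text>'
)

doc = ET.fromstring(SOURCE)
print(ET.tostring(html_view(doc, DTD_DEFAULTS), encoding="unicode"))
```

The difference from down-conversion is where the mapping lives: here it travels with the document (in the DTD or the instance), so the client can display the HTML view and still save the original mark-up for local analysis.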


Jun 2, 1994, Bert Bos <bert@let.rug.nl>