[Mirrored from: http://www.nic.surfnet.nl/surfnet/projects/premium/premium.eng/comment.html]

Premium



Comments on the descriptions of SGML software

The descriptions of the SGML products in this report are full of abbreviations and references. This keeps them short and easy to survey, but at first sight it makes them incomprehensible to outsiders (that is, to almost everyone). This section is intended to clear up those cryptic formulations and to explain why exactly these terms were chosen to judge the software by.

Though SGML has been an international standard (ISO 8879) since 1986, only in the past few years has there been an obvious interest in it. Apparently, SGML is an idea whose time has come. The stormy development of the World Wide Web (WWW) since 1993 has contributed considerably to that. It is to be expected that in the coming years, too, 'the Web' will be the most important driving force behind SGML.

What are the features of SGML (and of the WWW) that are now suddenly considered so important? The first is portability: because SGML is an international standard and because SGML documents are ordinary text files, an SGML file can be used on any computer, and will still be usable 20 or 30 years from now. This portability only became really important once everybody got a network connection.

Incidentally, SGML is not a perfect solution to the portability problem. An important aspect, for which a solution has been devised only recently, is the reproduction and transport of non-Western scripts. None of the examined programs can deal with them yet, but from 1996 onwards the new standard ISO 10646 (better known as Unicode) will probably take off. It is already possible to insert foreign characters by means of entities, but that is cumbersome: é becomes &eacute;, and Greek text becomes &agr;&bgr;&ggr; (alpha, beta, gamma). Some products can associate special fonts with such entities.
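
As an illustration of what software has to do with such entities, here is a minimal sketch in Python that replaces entity references by the corresponding Unicode characters. The table lists only the handful of entity names used above; the function name expand_entities is made up for this example and is not taken from any of the examined products.

```python
import re

# Entity names as used in the ISO 8879 public entity sets
# (Added Latin 1 and Greek Letters). Only the few names from the
# text are listed; a real table would be much longer.
ENTITY_TO_CHAR = {
    "eacute": "\u00e9",  # Latin small letter e with acute
    "agr": "\u03b1",     # Greek small letter alpha
    "bgr": "\u03b2",     # Greek small letter beta
    "ggr": "\u03b3",     # Greek small letter gamma
}

# An entity reference: ampersand, a name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text):
    """Replace known entity references; leave unknown ones untouched."""
    def lookup(match):
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(lookup, text)
```

Calling expand_entities("&agr;&bgr;&ggr;") yields the three Greek letters as ordinary characters. References that are not in the table are passed through unchanged, so nothing is lost silently.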

A second aspect of SGML is the separation of content, structure and presentation. In principle, SGML serves for the first two only. Before the age of the computer, presentation could not be separated from the other two. You could, of course, print a book in two different ways, but that was not usual. There may have been an abstract kind of information in the writer's or reader's mind which could do without a presentation, but that was not transferable.

The computer is able to deal with shapeless data; in fact it must be able to, because it cannot deal with pictures. It can print, but it cannot read what it has printed. A database is the clearest example: the data can be presented in many ways, but internally they have no shape. SGML is therefore sometimes compared to a database for text. There is, however, a problem: at the time, the makers of SGML postponed the design of the presentation component, until 1995 as we now know. This means that the content and structure of documents can be stored now, but if you also want to fix how they look to people, you have to fall back on non-standardized, proprietary methods.

This is the problem denoted by the term style sheets. A style sheet is an abstract description of a layout. When a style sheet is combined with ('applied to') a document, the result is a formatted text, that is, a text that people can read.
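
The idea of applying a style sheet can be sketched in a few lines of Python. This is only a toy under stated assumptions: the document is a flat list of (element, text) pairs, and the two style sheets, their rule names ("upper", "prefix") and the function apply_style are invented for the illustration. It is not DSSSL, nor the method of any product in this report.

```python
# A toy document: content and structure only, no layout.
DOCUMENT = [
    ("title", "Premium"),
    ("para", "SGML separates content from presentation."),
]

# Two hypothetical style sheets. Each maps an element name to layout rules.
SCREEN_STYLE = {
    "title": {"upper": True, "prefix": ""},
    "para": {"upper": False, "prefix": "    "},
}
OUTLINE_STYLE = {
    "title": {"upper": False, "prefix": "* "},
    "para": {"upper": False, "prefix": "  - "},
}


def apply_style(document, style):
    """Combine a document with a style sheet into formatted text."""
    lines = []
    for element, text in document:
        rule = style[element]
        if rule["upper"]:
            text = text.upper()
        lines.append(rule["prefix"] + text)
    return "\n".join(lines)
```

The same DOCUMENT gives two different presentations, depending on whether SCREEN_STYLE or OUTLINE_STYLE is applied. The stored document itself never changes.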

As mentioned, an ISO standard for style sheets appeared in 1995 under the name DSSSL (Document Style Semantics and Specification Language, ISO 10179). In the coming years DSSSL will probably be used by several software manufacturers, but there is a problem: whereas SGML combines perfectly well with the Web, DSSSL hardly does. Several parties, among them the WWW Consortium (W3C), are therefore looking for a solution for real-time, on-screen style sheets. By the end of 1995, it should be clear what the most important style languages on the WWW will be.

A third aspect of SGML is its flexibility. SGML is not really a format, but a recipe for making formats. And though we speak of an 'SGML file' for convenience, it is, strictly speaking, a file in a format that has been defined by means of SGML. HTML (HyperText Markup Language) is such a format. It is the most widely used format on the Web because it is simple and, moreover, supports hypertext (see below).

Of course, we do not want to create a new structure for every document. There are some widely used formats (officially: Document Type Definitions, or DTDs). Besides HTML, examples are TEI/TEI-Lite, CALS, Docbook and ISO 12083.

CALS comes from the American Department of Defense (DoD). It is complex and extensive, and unless you work for the DoD it is not to be recommended. One part of CALS is currently (summer 1995) the subject of some debate: the format for tables. The W3C wants to add a table format to HTML shortly, and people who already have large files in CALS format would like to keep the CALS tables. Hopefully this will not go through, for it is a terrible format, dating from a time when its makers knew neither what style sheets are nor how to publish electronically.

A bigger contrast than that between CALS and TEI is hardly imaginable. TEI is extensive, too, but anything but complex. TEI (Text Encoding Initiative) was the name of a group of researchers in linguistics and literature. In 1994 the group dissolved itself after finishing the TEI DTD, designated TEI-P3 to distinguish it from the earlier versions P1 and P2. The TEI DTD consists of several modules, which together can encode almost any kind of text that a humanities scholar will take in hand, from simple memos to text-critical editions, corpora for grammar research and even hypertext.

TEI is rapidly gaining popularity and is considered, also outside the academic world, to be the perfect example of a good DTD. Being an SGML format, TEI is automatically compatible with most SGML software, but several manufacturers have also built in special facilities for it. As a result, the hypertext facilities of TEI, for instance, can also be used in other DTDs.

Docbook was made especially for computer manuals; it is very suitable for these, but is of little importance to those who deal with texts of other kinds. ISO 12083 was developed by American publishers, but is also used outside the U.S.A. It is a rather neutral format, suitable for books, periodicals and articles. The format is not very detailed, because it is meant for printed matter after all, not for analysis or online consultation.

Hypertext has been mentioned before. Hypertext, too, is an idea whose time has come. Notably, it is an idea that already existed before the invention of printing, but that fell into oblivion because of the technical limitations of the book. Encyclopedias, with their many cross-references, resemble it. The computer has not only brought back hypertext, but has also improved it considerably: not only can references to anything and everything be made and maintained easily, the computer also makes the annoying turning of pages superfluous.

The makers of SGML have made a standard for hypertext as well: HyTime (Hypermedia/Time-based Structuring Language, ISO 10744, 1992). This one is not completely free of problems either, though some parts of it are useful. There are two important competitors: one is TEI (mentioned already), especially its extended pointer (xptr); the other is the method of the WWW. The latter is called the Uniform Resource Locator (URL) and is by far the most popular, especially because it has such a wide scope. Almost anything that is on the Internet can be pointed to with a URL. On the other hand, URLs are not very precise: you can point to a document, but usually not to a part of it. TEI's xptr can do so. Unfortunately, URLs and xptrs are not compatible with HyTime, at least not yet: a revision of HyTime has been promised for 1996. The addition of a new element to HyTime should enable HyTime applications to recognize URLs and xptrs as such.

Quite a lot of files are needed for an SGML system, so a good file management system is essential. Besides the documents themselves, there are various DTDs, shared files with entities (mostly special symbols), style sheets and often more.

For the sake of portability, many of the files have a formal public identifier (FPI), a name that does not depend on the computer on which the file happens to reside. This does mean, however, that the software has to translate the FPI into a concrete file name. Fortunately, more and more programs use the same method for that: a catalog file following the conventions laid down by SGML Open, an organization of software manufacturers. After all, most users will eventually own several SGML programs, and it would be nice if these could use each other's files.
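
The translation from FPI to file name can be sketched as follows, in Python. This is a simplified reading of the catalog idea, not a full implementation of the SGML Open conventions: it handles only PUBLIC entries in which both the identifier and the file name are quoted with double quotes. The file paths in the sample catalog and the function names parse_catalog and resolve are assumptions made for the example.

```python
import re

# Sample catalog. The two FPIs are real ones; the file paths on the
# right are hypothetical and would differ from one computer to another.
SAMPLE_CATALOG = '''
PUBLIC "-//IETF//DTD HTML 2.0//EN"                  "dtds/html2.dtd"
PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN"  "entities/isolat1.ent"
'''

# Simplification: only PUBLIC entries with two double-quoted strings.
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+"([^"]+)"')


def normalize(fpi):
    """Collapse runs of whitespace, so that layout differences do not matter."""
    return " ".join(fpi.split())


def parse_catalog(catalog_text):
    """Return a dictionary that maps each FPI to its file name."""
    return {
        normalize(fpi): filename
        for fpi, filename in PUBLIC_ENTRY.findall(catalog_text)
    }


def resolve(catalog, fpi):
    """Look up the file name for an FPI; None if the catalog has no entry."""
    return catalog.get(normalize(fpi))
```

With catalog = parse_catalog(SAMPLE_CATALOG), the call resolve(catalog, "-//IETF//DTD HTML 2.0//EN") returns the local path of the HTML DTD. A second program on the same computer can read the same catalog and find the same file.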

The use of FPIs makes the distribution of SGML files easier. If a file is identified by an FPI, the application can get that file from the most suitable location. Sometimes that will be a local disk, at other times the network. After all, every copy is the same, regardless of where it is stored.

On the PC (DOS and Windows), that situation has not been attained yet. That platform has to contend with a tradition of proprietary formats, prompted by limited memory and slow computers. There are still many small PCs in use and the tradition is disappearing only slowly.

Part of the software consists of transitional products, intended to convert documents in old formats (see below) to SGML. Many of the old formats were never intended for electronic publishing, and it takes a lot of trouble to distil the structural information from the visual information (bold type, italics, type size, etc.). As SGML is accepted by more manufacturers and users, the need for this conversion software will decrease.

But that need will not disappear completely, and the term 'transitional product' is therefore too narrow. For instance, books that are scanned will always have to be converted on the basis of visual information, for there is no other information.

The selected products

A selection has been made from the range of SGML products on offer, on the basis of the name of the product and of earlier experience. Some of these could not be supplied in time. This is the list, arranged according to category:

Two kinds of users (which, for that matter, often overlap in the academic world) have been assumed: readers and writers. In both cases they are end users. We will not go into the suitability of the software for publishers.

Initially, software for document management (for instance, for libraries or computer centres) was also to be looked into, on the basis of a client-server solution, preferably WWW. Unfortunately, two promising products, DynaWeb and OpenText (formerly PAT), could not be provided in time, at least not for a temporary license, but for a reasonable price. Efforts to test both systems after all are in progress.

Viewers for hypertext

MDI (Multiple Document Interface) is the name that Microsoft devised for a characteristic aspect of many programs under MS-Windows, namely that windows can appear within other windows. In other window systems the phenomenon does not exist, which is just as well.

Suppose a program wants to show two (or more) windows with information. Then there are three options: (1) one window is shown, and the user has the option to switch the contents of the window; (2) two windows are opened; (3) one window is divided into two compartments. The choice depends on several considerations: is the information needed often? Is it needed simultaneously? Is the user experienced (an experienced user will appreciate the simultaneous presentation without getting confused)? Is there a strong relation between the windows (in that case a divided window is preferred)?

Microsoft adds a fourth option to these: open three windows, one around the other two. This has only one (slight) advantage, namely that only one menu bar is needed. There is a host of drawbacks. The user is confronted with an extra window. The two inner windows look like ordinary windows but cannot be moved freely, for they have to remain within the outer window. For this reason, you want to make the outer one as large as possible, but all empty space in that window obstructs the view of other programs.

The differences can be seen clearly in the three programs DynaText, Explorer and Panorama. All three are for viewing hypertext. DynaText takes a middle position: it does use MDI when showing two independent documents, but it uses a window with two compartments for strongly related windows. The number of types of windows, apart from popups, therefore remains limited: there are 'collections' and 'books', and often one of each suffices.

Explorer has the worst interface of the three. It exclusively makes use of MDI and almost every action of the user results in another new window. Most of them have a strong relation with at least one of the others, but there is no visual link. Moreover, there are many different types of data which are difficult to distinguish from each other, like documents, tables of contents, lists of figures and footnotes.

Panorama is from the same manufacturer as Explorer, but is much more recent. It has a much simpler interface, without offering fewer options (on the contrary, Panorama can do more). Panorama does not use MDI at all. There is always only one window, which can be divided into two compartments.

Every product has to mature, of course, and several things, not only the interface, show that the makers of Panorama have learnt a lot from experience with Explorer and DynaText, as well as from other programs. Nevertheless, it would be good if software designers did not use MDI gratuitously, merely because it exists and has a formal-sounding name.

Old documents

Old documents ('legacy docs') are all documents in non-SGML formats that are considered valuable enough to be used within an SGML system as well. For instance, one may want to include old articles written in WP or LaTeX in HTML format.

There are several ways to deal with that. The first solution is to leave the documents as they are and hope that everyone has the software to view them. With most word processor formats that is a bad solution, for apart from the word processors themselves there are no programs that can read the files, because the files were never intended to be circulated in electronic form.

The second solution is to convert them by hand. The layout codes are replaced by SGML elements using the original word processor, after which the documents are saved as plain text. Naturally, people are reluctant to do this. The original authors have long since moved on to other articles and the administrator of the documents does not have the resources. This solution is feasible only in exceptional cases.

The third solution is to use a conversion program that can convert large numbers of files automatically, in batch. Unfortunately, most word processor formats contain too little information to do this completely correctly. The documents contain only layout; nowhere is it defined what the function of that layout is: is the text in italics a heading or an important word? Is the tab there only to indent the first line of the paragraph, or does it serve to build a table? Depending on the original format and on the original author's way of working, the conversion program has to be adjusted to a greater or lesser extent.

There are conversion programs that work semi-automatically, in an interactive setting where a user can (and must) intervene if the program cannot go on. WordPerfect's IntelliTag is an example of this. Others work with a configuration file in which the specific characteristics of the text to be converted can be defined (a kind of reversed style file). There are some public domain programs for conversion to HTML that work like this. Still others consist of a complete programming language, in which all the characteristics of the text to be converted have to be defined in program code. OmniMark is probably the best known of these.
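
The configuration-file approach can be illustrated with a small Python sketch. It assumes, for simplicity, that the old document has already been read into a list of (layout, text) runs; the layout names, the element names head and emph, and the function convert are all invented for this example and do not describe IntelliTag, OmniMark or any other product.

```python
# Hypothetical configuration: for this particular author, bold text is a
# heading and italic text is emphasis. Another author's documents may
# need a different table, which is exactly the adjustment described above.
CONFIG = {
    "bold": "head",
    "italic": "emph",
    "plain": None,  # no markup at all
}


def escape(text):
    """Protect characters that would otherwise be read as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def convert(runs, config):
    """Turn (layout, text) runs into text with SGML-style tags."""
    output = []
    for layout, text in runs:
        element = config.get(layout)  # unknown layout: treated as plain
        body = escape(text)
        if element:
            output.append("<%s>%s</%s>" % (element, body, element))
        else:
            output.append(body)
    return "".join(output)
```

The ambiguity remains, of course: the table only records a decision that a person has made about what bold and italic mean in a given set of documents. If an author used italics for both titles and emphasis, no single table will convert the files correctly.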

The fourth option is to keep the original documents, but have the database perform the conversion in real time. This means that a conversion takes place every time a document is retrieved. That conversion is again done with the help of a (reversed) style sheet.


© SURFnet bv / SURFnet Premium / redactie@surfnet.nl / last modified 8 March 1996