SGML Europe '96

[Mirrored from: http://www2.echo.lu/oii/en/sgml-96.html], by Martin Bryan

The SGML Europe '96 conference, held in Munich from 12th to 17th May 1996, had as its major focus the role of HyTime as a mechanism for managing the interconnection of components within compound document sets coded using both structured formats, such as those provided by SGML, and unstructured formats of the type provided by most word processors.

Highlights from the following presentations are covered in this report:

SGML document databases: requirements and capabilities, John Chesholm, CSW Informatics Ltd
Hybrid Distributed Databases (HDDB) and the Future of SGML, John McFadden, Software Exoterica
Harmonization of SGML and STEP, Hugh Tucker, Documenta ApS
SGML and STEP: Case studies from Sweden, Peter Bergstöm, EuroSTEP
A tables manifesto, Marcy Thompson, Passage Systems
SGML and the semantic representation of mathematics, Roy Pike, Clerk Maxwell Professor of Theoretical Physics, King's College, UK
The Corrigendum to HyTime, Charles Goldfarb
Why chose SGML for Automated Multimedia Publications?, Jean-Marc Bertinchamps, Electronic Data Processing SA
SGML production workshops, Christian Rozet, ACSE International
They say the WWW is broken, Kate Atherley, Incontext Inc.
The Internet: Where Does SGML Fit, Kent J. Summers, Electronic Book Technologies
Web Publishing: SGML's role in practice, Eric van Herwijnen, NICE Technologies
Managing multilingual document sets, Martin Bryan, The SGML Centre
Does SGML have a role to play in Intranet Publishing?, Laura Walker, Xsoft

In addition this report summarises the seminar on CApH Topic Navigation Maps held in conjunction with SGML Europe '96.

As part of his talk on SGML document databases: requirements and capabilities, John Chesholm of CSW Informatics Ltd compared a 'complete publishing package' approach, which can be set up with very little effort but provides only a limited degree of flexibility, with an approach based on integrating components with rich APIs, which takes more effort to set up initially but provides more flexibility in the longer term. He stressed the need for integrated workflow control mechanisms as part of any complete solution.

While most SGML databases will support any document type definition (DTD), there are often wide differences in the ease with which new DTDs, or changes to existing ones, can be accommodated. SGML databases should make it easy to distinguish between data management parameters, workflow parameters and SGML attributes.

Because databases assign 'global identifiers' to the objects they contain, it is normally easier to link elements stored in a database than to link the same elements when stored in separate files. Global identifiers can be used to validate cross references, and to ensure that there are no 'dangling links' that point to anchors that are not part of the current document set.
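
The check described above can be sketched in a few lines of Python. This is only an illustration, not a description of any product discussed in the talk: the dictionary layout, the `id` and `refs` keys and the function name `find_dangling_links` are all assumptions made for the example.

```python
def find_dangling_links(elements):
    """Return (source_id, target_id) pairs whose target is not in the set.

    `elements` is an assumed, simplified stand-in for database objects:
    each has a global identifier ("id") and a list of identifiers it
    cross-references ("refs").
    """
    known_ids = {element["id"] for element in elements}
    return [
        (element["id"], target)
        for element in elements
        for target in element.get("refs", [])
        if target not in known_ids
    ]


# Made-up document set: "tbl-9" is referenced but never defined.
elements = [
    {"id": "sec-1", "refs": ["fig-1", "sec-2"]},
    {"id": "sec-2", "refs": ["tbl-9"]},
    {"id": "fig-1", "refs": []},
]

print(find_dangling_links(elements))  # [('sec-2', 'tbl-9')]
```

Because every object carries an identifier that is unique across the whole store, the check is a simple set lookup; with separate files the same test would first require resolving which file each reference points into.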

John Chesholm postulated that it would be possible to maintain translations of documents as parts of a single document within a multilingual document database, the relevant language components being selected in response to user requests. Document databases should also be able to store variants of data without their having to be encoded as SGML marked sections.

John McFadden of Software Exoterica took as his subject Hybrid Distributed Databases (HDDB) and the Future of SGML. SGML was developed in response to the needs of publishers of traditional forms of information dissemination, based on the presentation of data on paper. HTML has attempted to transfer the same paradigm to the world of electronic data interchange. The interactivity that is possible through the web has changed many people's view of the role of information as a tool. In particular, the web has popularised the concept of full text searching.

Interactivity within documents stems from their content rather than from the way that content is presented. A good SGML DTD should concentrate on the relationships between content elements rather than on how the data they contain should be presented.

Topics control the contents of an information set. The facts contained in individual data elements are generally associated with a single topic. Navigation between elements is normally done through topics, either via contents lists or indexes of key terms. It is topic relationships that need to be managed dynamically, not the relationship between data and topics.

Corporate data modelling provides a method for formally identifying the relationships between the various types of information generated within a company or industry group. At present modelling normally stops at the document level. A good corporate data model should go down to topic level. Relational modelling of data should be extended to this level. Below this level the data should be encoded as a hierarchically structured SGML fragment, which can be treated as a binary large object (BLOB) within a database.

Hybrid distributed databases can use relational databases to manage the interrelationships between objects at the corporate data model level and an SGML-based object model for presenting data at lower levels. They treat documents as a set of SGML microdocuments, each of which provides a window/frame of information that can be controlled through its local document type definition.
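
A minimal sketch of this hybrid arrangement, using Python's standard `sqlite3` module, is shown below. It is an assumption-laden illustration rather than the design presented in the talk: the table names `topic` and `microdocument`, the column names, and the helper functions are all hypothetical. Relationships are modelled relationally down to topic level, and each SGML fragment below that level is stored as an opaque BLOB together with the name of its local document type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topic (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE microdocument (
    id       INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL REFERENCES topic(id),
    doctype  TEXT NOT NULL,   -- name of the fragment's local DTD
    sgml     BLOB NOT NULL    -- the fragment itself, uninterpreted
);
""")


def add_microdocument(conn, topic, doctype, sgml_text):
    """Store one SGML fragment under a topic, creating the topic if needed."""
    conn.execute("INSERT OR IGNORE INTO topic (name) VALUES (?)", (topic,))
    (topic_id,) = conn.execute(
        "SELECT id FROM topic WHERE name = ?", (topic,)
    ).fetchone()
    conn.execute(
        "INSERT INTO microdocument (topic_id, doctype, sgml) VALUES (?, ?, ?)",
        (topic_id, doctype, sgml_text.encode("utf-8")),
    )


def fragments_for_topic(conn, topic):
    """Return (doctype, sgml_text) pairs for a topic, in insertion order."""
    rows = conn.execute(
        "SELECT m.doctype, m.sgml FROM microdocument AS m "
        "JOIN topic AS t ON t.id = m.topic_id "
        "WHERE t.name = ? ORDER BY m.id",
        (topic,),
    )
    return [(doctype, blob.decode("utf-8")) for doctype, blob in rows]


# Made-up content for illustration only.
add_microdocument(conn, "brakes", "procedure", "<procedure><step>Check pads</step></procedure>")
add_microdocument(conn, "brakes", "warning", "<warning>Support the vehicle first</warning>")
print(fragments_for_topic(conn, "brakes"))
```

The relational side answers questions about which fragments belong to which topic; interpreting the content of a fragment is left to an SGML-aware component working from the recorded document type.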

Hugh Tucker of Documenta ApS discussed the role of ISO TC184/SC4/WG3/T14, which is responsible for preparing standards relating to Product Documentation, in a presentation on the Harmonization of SGML and STEP. STEP is used for the electronic interchange of product data. Typically this data is produced by CAD/CAM systems and related design tools. The interchange protocols for product data are expressed using a modelling language called EXPRESS, which provides functionality similar to that provided by SGML DTDs for managing documentation.

Product documentation is a component part of interchangeable product data. Until recently the only data type provided within STEP for interchange of product documentation was for a string coded using the ISO 8859 character set. EXPRESS is to be extended to allow structured data to be incorporated into a STEP interchange document as an SGML_string.

STEP provides facilities for the life-cycle control of data versions that can be used to manage SGML-encoded data sets. It has a Standard Data Access Interface (SDAI) that can be used to link EXPRESS-based applications to databases. The goal of the T14 group is to provide an information architecture to integrate product data and product documentation. This architecture should make it possible both to incorporate the latest product data within product documentation and to query the contents of product documentation within EXPRESS. HyTime's document querying functions provide a mechanism for describing queries into hierarchically structured SGML data through the SDAI.

STEP can be used to model the relationship between the set of SGML microdocuments that are to be presented to users as a 'document'. Peter Bergstöm of EuroSTEP provided three case studies showing where the STEP concepts for managing SGML data are being prototyped within Scandinavia. The Swedish SWEDCALS initiative has used STEP to develop a server that generates HTML 'home pages' that describe the relationships between the product documentation components of a STEP database.

The ASTRA corporation consists of a set of independently managed pharmaceutical companies. To allow these companies to share their data resources a corporate Core Information Model (CIM) has been defined. Individual companies can define Domain Information Models (DIM) that can be used to include data objects stored within the CIM database as components of information accessible through local databases.

Because pharmaceutical companies have to retain data for a long time it is vital to develop a 'corporate memory' that can outlast those generating the data. STEP provides a method for modelling this corporate memory. SGML provides an application independent representation for data that will allow information to be reusable over long time periods. By combining these two ISO standards the ASTRA OPUS project will be able to future-proof its data.

The third project on the use of STEP and SGML was the Hagglund COTT project for managing the documentation of their new combat vehicle. STEP is used to manage the relationships between the physical, functional and representation dimensions of information management. Data is created based on the physical characteristics of the vehicle. Fault-finding, however, depends on the functionality of the vehicle. STEP is used to map faults to data about a particular component stored in SGML format.

Marcy Thompson of Passage Systems Inc made a plea for the use of content-oriented tables. Existing table models, such as those provided for CALS and HTML, are based on a rectangular table format. This does not allow for the production of irregularly shaped tables, such as those required to represent periodic tables, triangular travel distance charts, circular astrological tables or the types of tables typically used in Asian languages.

Tables are really a physical representation of a set of logical relationships. In certain cases it is better to capture the data according to these logical relationships rather than simply capturing the physical relationships required to produce one particular representation of the table. Once the logical relationship has been properly captured the same data can be presented in a number of formats without recoding. This means that, for example, you can present the data in a form that is wider than it is deep for on-screen presentation and in a form that is deeper than it is wide for on-paper presentation.

Searching physically orientated tables is often impossible. You need to know the relationship between entries in the table before you can perform a logical search on their contents. By adopting a logically-based table structure you can provide both enhanced searching and flexible presentation.
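
The idea can be illustrated with a short Python sketch. It assumes the table content is captured as a list of records whose keys name the logical relationships; the keys `region`, `quarter` and `units`, the figures, and the functions `pivot` and `search` are invented for the example and do not come from the talk.

```python
def pivot(records, row_key, col_key, value_key):
    """Lay out logical records as a grid: one row per row_key value,
    one column per col_key value. Returns a list of rows, header first."""
    rows = sorted({record[row_key] for record in records})
    cols = sorted({record[col_key] for record in records})
    cell = {(r[row_key], r[col_key]): r[value_key] for r in records}
    header = [row_key] + cols
    body = [[row] + [cell.get((row, col), "") for col in cols] for row in rows]
    return [header] + body


def search(records, **criteria):
    """Logical search: return the records matching every key=value given."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


# Made-up data, captured once by logical relationship.
records = [
    {"region": "North", "quarter": "Q1", "units": 10},
    {"region": "North", "quarter": "Q2", "units": 12},
    {"region": "North", "quarter": "Q3", "units": 11},
    {"region": "South", "quarter": "Q1", "units": 7},
    {"region": "South", "quarter": "Q2", "units": 9},
    {"region": "South", "quarter": "Q3", "units": 8},
]

wide = pivot(records, "region", "quarter", "units")   # 3 rows, 4 columns
deep = pivot(records, "quarter", "region", "units")   # 4 rows, 3 columns
print(wide)
print(deep)
print(search(records, region="South", quarter="Q2"))
```

Both layouts are derived from the same records without recoding, and the search works on the named relationships rather than on row and column positions.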

Roy Pike, Clerk Maxwell Professor of Theoretical Physics, King's College UK, discussed SGML and the semantic representation of mathematics. Most existing maths notations are based on the way in which the formulae are presented to readers, rather than the sequence in which operations must be performed to process the equation. Mathematicians in many countries are currently discussing how best to describe formulae in a way that will allow them to be both computed by computers and correctly presented on screen.

Roy Pike has developed such a semantic representation for mathematics using SGML as his coding scheme. This representation, which is intended for inclusion as part of a planned extension to ISO 12083, consists of just two basic element types, one of which defines a mathematical value while the other defines functions. Sets of functions that are specific to clearly identifiable areas of research will be developed, as part of the ISO 12083 work, for areas such as Algebra (with and without arrays), Trigonometry, Calculus, Statistics, Functional Analysis, Quantum Mechanics and High Energy Physics. Software to support this basic framework has already been developed.
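
A rough Python analogue of a two-type representation is sketched below. It does not reproduce the SGML element types proposed for ISO 12083, which this report does not give; the class names `Value` and `Function`, the operation names in `OPS`, and the rendering templates are assumptions chosen to show how one structure can be both computed and presented.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Value:
    content: Union[int, float, str]  # a number, or a variable name


@dataclass(frozen=True)
class Function:
    name: str
    args: Tuple["Node", ...]


Node = Union[Value, Function]

# Hypothetical function set: (how to compute, how to present).
OPS = {
    "plus": (lambda a, b: a + b, "{0} + {1}"),
    "times": (lambda a, b: a * b, "{0} * {1}"),
    "power": (lambda a, b: a ** b, "{0}^{1}"),
}


def evaluate(node, env):
    """Compute the value of the tree, looking variables up in env."""
    if isinstance(node, Value):
        return env[node.content] if isinstance(node.content, str) else node.content
    compute, _ = OPS[node.name]
    return compute(*(evaluate(arg, env) for arg in node.args))


def render(node):
    """Produce a linear presentation; nested functions are parenthesised."""
    if isinstance(node, Value):
        return str(node.content)
    _, template = OPS[node.name]
    parts = [
        render(arg) if isinstance(arg, Value) else "(" + render(arg) + ")"
        for arg in node.args
    ]
    return template.format(*parts)


# x squared plus 3
expr = Function("plus", (Function("power", (Value("x"), Value(2))), Value(3)))
print(render(expr))              # (x^2) + 3
print(evaluate(expr, {"x": 4}))  # 19
```

The tree records the order in which operations are applied, so presentation becomes one of several things that can be derived from it rather than the thing that is stored.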

Charles Goldfarb, the editor for the SGML and HyTime standards, explained the role of The Corrigendum to HyTime to be published shortly. As many of the concepts of HyTime were found to be applicable to a wide range of SGML applications these have been extracted from the main body of the text and placed into a General Facilities annex that can be easily referenced from other standards.

As part of this exercise the SGML properties that are shared by standards such as SGML, DSSSL and HyTime have been separated out of Annex A of the HyTime standard, extended to cover optional features of SGML and placed into a separate annex that explains the role of the extended SGML grove descriptions that were developed to describe the interrelationships found in 'hypergroves' of structured documents.

As part of the work to integrate SGML with document exchange formats such as MIME the HyTime development team have also introduced a generalized technique for identifying files stored in any file store or data repository through a Formal System Identifier. As well as covering well known individual file identifiers, such as URLs and Unix and DOS path and file names, the new methodology will also allow users to identify files stored within compressed multidocument repositories, such as those found in TAR and ZIP file sets, as part of an SGML system identifier. Public domain software to support this functionality is already available as part of the well-known SP SGML document parsing application.

Jean-Marc Bertinchamps of Electronic Data Processing SA, in a talk entitled Why chose SGML for Automated Multimedia Publications?, showed how EuroStat was using Microsoft's SGML Author to allow the European Commission to generate tables of statistical data for publication in formattable electronic form on the web using HTML and in printable format using PDF and other page composition formats. Filters have been developed to convert Word and other word processing formats to SGML. SGML document type definitions have been developed to allow data stored in EC statistical databases to be integrated with text in documents produced on word processors. Once the data has been converted and stored in SGML format it is relatively easy to write filters to convert it to HTML, PDF or other output formats.

Christian Rozet of ACSE International extended the use of the filtering paradigm by discussing how you can set up SGML production workshops. By taking a production line view of document production you can break document management into small steps, many of which can be automated using filters. When filtering from an input format to SGML, manual intervention will typically be required to overcome inconsistencies in source documents. Filtering from SGML to output formats can be done without manual intervention. When filtering from one environment to another it is necessary to ensure that relevant management information is transferred with the files in a form that will allow it to be reconstituted at a later date, or maintained in a form that will allow it to be reattached to exported data if it is imported at a later date.
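
The production line view can be sketched as a chain of small filter functions. The Python below is a simplified illustration under stated assumptions: the `Job` record, the filter names, the 80-character title rule and the `<doc>`, `<title>` and `<para>` tags are all hypothetical and are not taken from the talk or from any real DTD.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    lines: List[str]                                        # source text
    metadata: Dict[str, str] = field(default_factory=dict)  # management information
    review_notes: List[str] = field(default_factory=list)   # reasons for manual work
    output: str = ""


def drop_blank_lines(job):
    job.lines = [line.strip() for line in job.lines if line.strip()]
    return job


def check_title(job):
    # Assumed convention: the first line is a short title.
    if not job.lines or len(job.lines[0]) > 80:
        job.review_notes.append("first line does not look like a title")
    return job


def escape(text):
    return text.replace("&", "&amp;").replace("<", "&lt;")


def to_markup(job):
    if job.review_notes:
        return job  # inconsistent source: leave for manual intervention
    title, *paragraphs = job.lines
    body = "".join(f"<para>{escape(p)}</para>" for p in paragraphs)
    job.output = f"<doc><title>{escape(title)}</title>{body}</doc>"
    return job


def run_pipeline(job, filters):
    """Apply each filter in turn; metadata travels with the job throughout."""
    for step in filters:
        job = step(job)
    return job


FILTERS = [drop_blank_lines, check_title, to_markup]

good = run_pipeline(Job(["  Report ", "", "A & B"], {"owner": "docs"}), FILTERS)
bad = run_pipeline(Job(["x" * 200, "body text"], {"owner": "docs"}), FILTERS)
print(good.output)
print(bad.review_notes)
```

Keeping the management information in the same record as the content means that every step hands it on unchanged, and a job that cannot be converted automatically carries its own explanation to whoever has to intervene.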

Kate Atherley of Incontext Inc. opened the discussion on the relationship between SGML and the Internet with a talk entitled They say the WWW is broken. The WWW is based on the concept of allowing universal readership and authorship of documents. It was designed in such a way that people were not forced to use a particular word processor or markup format to create or view documents. Users were expected to negotiate a suitable interchange protocol before transmission.

To assist interchange it was agreed that all platforms should be able to support ASCII and HTML as data formats. Unfortunately, there have recently been a number of manufacturer-defined extensions to HTML that have not been agreed with other suppliers. This has led to a state where documents coded using one manufacturer's version of HTML cannot be fully displayed by another manufacturer's products. A new version of HTML has been proposed to allow the most commonly used set of such extensions to be incorporated in a well defined version of HTML. Hopefully this will lead, over time, to less need for storing data in multiple HTML formats, one for each level and/or manufacturer.

Kent J. Summers from Electronic Book Technologies discussed The Internet: Where Does SGML Fit. The traditional publishing cycle is basically a set of one-to-many relationships that expands as it evolves from publisher to content to medium to customers. By contrast the Internet is based on a many-to-one paradigm. Many publishers make a lot of data available through a few mediums to one consumer.

The 'balkanization' of HTML has led users who do not want to restrict access to their data to a single type of browser to adopt a 'worst common denominator' approach to web publishing. The emptiness of HTML leads to a constant demand for extensions that will allow specific applications to be made available over the WWW.

There is a great deal of risk in trying to manage web sites in such a way that they can deal with the fragmentation that the fast evolution of HTML is causing. HTML is closer to RTF than to logically structured SGML applications. SGML is a better way to manage data. Data stored in SGML format can easily be converted into HTML for presentation in a WWW browser, or into other formats suitable for printing or delivery on CD-ROM.

Many developers of WWW sites fail to take the costs of generating content into account when costing the site. Content provision typically forms between 55% and 65% of the cost of setting up a WWW site.

Eric van Herwijnen of NICE Technologies, in a paper entitled Web Publishing: SGML's role in practice, pointed out that the average company spends 6-10% of its gross revenues on publishing information, 5-10% on reading published information and over 15% of its time on finding information.

Internet helper proxies provide a more efficient way of processing data published in formats other than HTML on the Internet than trying to process such files through a Java applet. The size of PDF files makes them an inefficient use of Internet bandwidth. A more efficient approach is to capture the data in SGML format, converting to HTML for presentation purposes if the user does not have a proxy capable of processing SGML.

Martin Bryan of The SGML Centre, in a paper entitled Managing multilingual document sets, illustrated why systems based on finding matching points in document structures are not suitable for managing multilingual data sets. By studying a set of related documents he showed that documents are not static, and cannot be related using fixed links. As well as being able to identify relationships between large blocks of text, which may already have been assigned unique identifiers, many documents reference specific parts of a sentence or paragraph. As these references are not to complete objects, methods are needed to reference strings within data. In a multilingual document set you need to be able to move between the various linguistic equivalents of these strings. HyTime and CApH Topic Maps provide a mechanism for doing this.

Laura Walker of Xsoft closed the session on the role of SGML on the Internet with a paper entitled Does SGML have a role to play in Intranet Publishing? Intranets are company-specific file stores that use WWW tools behind firewalls that prevent access by unauthorised people. They can be used for broadcasting company-specific information or for providing a method for virtual workgroup communications.

Intranets can use HTML forms to provide a company-wide GUI for capturing data to be loaded into company databases, and then use HTML as a way of providing database reports in an easily browsable form. In such environments SGML can be seen as defining the schemas used in company data repositories in a form that can be used to capture data over the WWW.

Creating and managing data is by far the largest cost factor in electronic publishing. Intranet technology today is about information delivery. Until we add management functionality to Intranets they will not be able to provide the control of documents over time that is a prerequisite of company-wide data repositories.

In his review of why Commercial Publishing Loves SGML, Dale Waldt of the Research Institute of America warned that this could be the first generation to leave no information behind it unless we can agree on standardized formats for data archives. Archiving requires meta-data to aid retrieval. SGML provides a good mechanism for adding meta-data to digital data. Cataloguing needs to be done by specialists who understand how archived data should be retrievable. HTML is not useful for this purpose as it concentrates on presentation rather than logical relationships. Mr. Waldt suggested that SGML's destiny is to preserve our cultural heritage.

CApH Topic Navigation Map Seminar

In conjunction with SGML Europe '96 the Graphic Communications Association held a seminar to explain the work that has been done by the Committee for the Application of HyTime (CApH) to generate SGML architectural forms to identify topics and the relationships between them. The seminar was conducted by Michel Biezunski of High Text in Paris.

The CApH Topic Navigation Map module originated as part of the Davenport SOFABED project to design a DTD for the electronic distribution of information related to components used in the manufacture of electronic circuits. The general applicability of this functionality made separate development sensible, so in 1993 a separate committee was set up, under the auspices of the Graphic Communications Association Research Institute, to prepare a set of SGML architectural forms for defining:

Topic Navigation Maps
Natural Language Relationships
Modification History
Access Policy
Multidimensional Table Representation
Geographical Information Systems.

To date only the work on topic navigation maps has been tackled in depth. Topic maps can be used to provide subsets of information databases that identify topics of specific interest to a particular user.

The seminar started by pointing out that indexes do not conform to the traditional concept of non-nesting information units because a single topic may be referenced in multiple ways within an index. For example, the following entries could occur in the same index:

Architectural forms
- definition as SGML general facility in an annex to HyTime
- use within HyTime
HyTime
- definition as SGML general facility in an annex to HyTime
- use of SGML architectural forms
SGML
- definition of architectural forms as SGML general facility in an annex to HyTime
- use within HyTime

HyTime provides facilities to integrate SGML documents with databases. HTML links typically fail to explain why a link has been created. HyTime independent links have to be assigned anchor roles that can be used to identify the role played by each end of a link. HyTime also provides mechanisms for controlling the ways in which links are traversed.

M. Biezunski explained the role of SGML architectural forms within HyTime, and went on to explain how these techniques are used to define HyTime independent links. He explained how system designers can associate an explanation of the role associated with each of the anchors pointed to by the link.

CApH topic navigation map 'semantic assignments' contain a title that is used to identify the subject covered by the topic. In addition they are assigned a short mnemonic that is used by programs to identify the topic. Each topic can be assigned to one or more 'semantic universes', each of which identifies a well-defined set of related topics.

Users can define the relationships between topics using elements that conform to a CApH topic relation SGML architectural form. These relationships can be used to select the most relevant set of associated topics where a topic points to many other topics within a semantic universe.
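
The concepts in the last two paragraphs can be modelled in a few lines of Python. This is a conceptual sketch only: it does not reproduce the CApH architectural forms, whose syntax is not given in this report, and the class names, relation names and sample topics are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Topic:
    mnemonic: str              # short name used by programs
    title: str                 # human-readable subject
    universes: FrozenSet[str]  # semantic universes the topic belongs to


@dataclass
class TopicMap:
    topics: Dict[str, Topic] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_topic(self, mnemonic, title, universes):
        self.topics[mnemonic] = Topic(mnemonic, title, frozenset(universes))

    def relate(self, source, relation_type, target):
        """Record a typed relationship between two known topics."""
        for mnemonic in (source, target):
            if mnemonic not in self.topics:
                raise KeyError(mnemonic)
        self.relations.append((source, relation_type, target))

    def related(self, source, universe, relation_type=None):
        """Targets related to source that lie within the given universe,
        optionally narrowed to one relation type."""
        return [
            target
            for src, rel, target in self.relations
            if src == source
            and universe in self.topics[target].universes
            and (relation_type is None or rel == relation_type)
        ]


# Made-up topics and relations.
tm = TopicMap()
tm.add_topic("sgml", "SGML", {"standards"})
tm.add_topic("hytime", "HyTime", {"standards", "hypermedia"})
tm.add_topic("archform", "Architectural forms", {"standards"})
tm.add_topic("munich", "Munich", {"places"})
tm.relate("hytime", "uses", "archform")
tm.relate("hytime", "extends", "sgml")
tm.relate("hytime", "presented-in", "munich")

print(tm.related("hytime", "standards"))             # ['archform', 'sgml']
print(tm.related("hytime", "standards", "extends"))  # ['sgml']
```

Restricting a query to one semantic universe, and optionally to one relation type, is what narrows a topic with many outgoing relationships down to the most relevant associated topics.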

Martin Bryan

File created: May 1996

Reproduction is authorized, except for commercial purposes, provided the source is acknowledged.

©ECSC-EC-EAEC, Brussels-Luxembourg, 1996