[This local archive copy mirrored from: http://www2.echo.lu/oii/en/SGML-97.html; see the canonical version of the document.]
The main theme behind SGML Europe 97 was "where will SGML and its related standards go in the next decade". The 500+ attendees were provided with a wealth of information on the way in which the SGML standard is drastically changing the way in which data is being managed by globally orientated companies.
The full proceedings of the 60+ presentations will be published later this year by the Graphic Communications Association, who organized the conference. This report covers the following talks/panels on the role and development of standards:
It was appropriate that the speaker who started this panel session should be Tim Bray of Textuality Inc, who, at the close of last year's conference, made a plea for a simpler form of SGML for use on the Internet. Since then he has become co-editor of the specification for the Extensible Markup Language (XML) for the newly formed SGML on the Web group of W3C. His main prediction was that "One year from today you will be able to send SGML files to a standard web browser - thanks to XML". As a corollary to this, he pointed out that the SGML community needed to start worrying about the effect the introduction of such a large audience will have on current industry players. Existing tools will have to become easier to use if they are to survive in the expanded world of XML.
Paula Leenheer from Shell pointed out that SGML is currently used where legislation requires well structured, easily found, data. HTML has introduced a wide audience to the benefits of marking up documents in a non-proprietary form. The need to upgrade systems to cope with year 2000 effects will provide companies with a natural point at which to introduce new techniques to staff.
Peter Bergström of EuroSTEP predicted that the next generation of tools will hide the SGML markup from users. SGML will become a hidden markup language, in the same way that RTF is currently hidden from word processing users. He also stressed the advantages for interchangeable data models, of the type defined for STEP, that can be used by different types of databases.
Mr. Bergström also predicted that stand-alone document management systems will become part of the main company database, and that management of hypertext documents will become a key factor. The importance of the HyTime standard will become increasingly recognized in this respect.
Walter Schmitt-Rennekamp of Lufthansa predicted that direct access to manufacturer databases will become increasingly important. Data models, such as AECMA's SPEC 2000 aircraft database schema, are having an increasing impact on the way technical documentors work. They are slowly switching from converting word processing documents into SGML prior to loading a database to authoring directly in SGML. Integration of documentation with other manufacturing data will become increasingly important.
Henry Thompson from the Department of Artificial Intelligence at the University of Edinburgh pointed out that the amount of on-line data is growing much faster than the speed at which computers can process it. Multiprocessor analysis of data, using the untapped power of user machines, will become increasingly important. Self-describing structured documents are required to make this possible. SGML should be thought of as being concerned with information management rather than document management.
Sharon Adler of the Inso Corporation stressed the fact that "one size does not fit all". Users need choices. Some will use HTML with Cascading Style Sheets (CSS), some XML and DSSSL-O and some SGML and DSSSL. Tools that allow progression from simple word processing type applications onto complex SGML-based information management roles are required.
Charles Goldfarb used his annual Inventor's Technical Keynote session to provide a longer-term view of the future of SGML - taking as his theme the first 40 years of the use of role-based generalized markup. The concept of generalized markup has now been with us for 30 years. When the SGML Europe conference was last held in Spain, in 1987, it would have been impossible to predict that SGML could be used to book a holiday in Majorca directly from a home in California 10 years later. The only safe prediction for 10 years from now is that SGML will no longer be used mainly for documentation; non-documentation uses of SGML are likely to become the norm.
Dr. Goldfarb introduced the WebSGML extensions to SGML that were approved at a meeting held, under the auspices of GCA, in the same venue the week before the conference. He also highlighted the importance of the SGML Extended Facilities Annex in the 2nd Edition of HyTime as an indication of the future of SGML.
Publishers, including those controlling broadcast distribution channels, want continuity, but they also want to restrict use of the web. One danger is that we will see the introduction of 'barbed-wire' fences to parts of the 'free range' World Wide Web (WWW) owned by commercially-orientated Internet Service Providers (ISPs) who wish to stop their users from having access to data they do not control. So-called 'free subscriptions' will be used to restrict the freedom of users to access all of the information provided on the WWW.
The current attempts to introduce 'push technology' can be seen as part of this "we know what's best for you" view of the world of information. It is important that the general public is made aware of the advantages of access to all of the information on the WWW.
SGML is increasingly becoming the middleware between applications. Its benefits need to be taught better. In particular the difference between style and structure needs to be better understood. Without style you cannot communicate structure. Thoughts can't travel without rendition, but the same set of structured data may need to be rendered in different ways for different audiences. By divorcing logical structure from presentation style, SGML allows information to be prepared in a form that allows its reuse by many different audiences.
Industrial data is data relating to product design, manufacturing, distribution and servicing. Typically documentation is produced at the end of the development/design process. The aim of STEP/SGML is to allow the capture of data as close to the point of decision making as possible.
Documentation has its own design/manufacture/distribute/service cycle, which needs to be managed in the same way as any other product.
The ISO 10303 Standards for the Exchange of Product Model Data (STEP) provide a system-independent neutral format for exchanging industrial data. SGML provides a system-independent neutral format for exchanging documentation. EXPRESS provides a language for describing databases using computer-interpretable descriptions known as entities. HyTime provides a mechanism for referencing industrial data from within documentation. Between them these four standards can provide a system-independent way of formally defining the information set required by a typical manufacturing company.
Integrating the four standards will be a multi-stage process. In the first stage STEP has tried to link together the separate databases used to store industrial data and documentation. A second stage should use HyTime links to interconnect data held in the separate databases. A third phase of development should allow industrial data and documentation to be held in the same database. Three EU-funded projects are currently studying the feasibility of integrated databases.
An SGML Application Profile is to be developed to describe 'generic documents' in a STEP database. This will allow the effectivity model of STEP to be applied to document production. For further information on this work see http://www.eccnet.com/step.
Murray Malone explained how WAI is helping to extend the addressing capabilities of the SGML envelope. WAI is the acronym applied to W3C's Web Accessibility Initiative, which is being backed by the White House, Microsoft, SoftQuad and the Yuri Rubinsky Insight Foundation, among others. The aim of WAI is to make information more accessible to those disadvantaged by physical disabilities. WAI aims to educate product developers, content providers and educators about the needs of this community. To help this a tool called BOBBY has been developed which can be used to evaluate the accessibility of existing Internet sites.
Jeff Suttor of Sun explained how WAI was helping to reduce the amount of rekeying currently needed to convert documents into Braille or speech. When SGML (or HTML) is combined with the ICADD specifications for typing data to be converted to other formats, conversion can become an automated process.
Practice has shown that once ICADD-compatibility has been adopted the associated data becomes much more easily reusable. The cost of subsequent conversions is much less if the elements of the document have been correctly classified during creation.
The more consistent the design of a document structure, the easier it is to convey the author's intentions to 'readers'. The use of features such as tables and, particularly, frames makes it much harder to convey information to blind or partially sighted readers. For partially sighted readers it becomes impossible to use larger type sizes without either horizontal scrolling or complex wrapping of columns.
Sun's Java Foundation Classes (JFCs) will include accessibility classes for HTML files that are based on the ICADD specification.
Bertrand Malese of Grif highlighted the problems of providing access to mathematics for the blind. He was particularly concerned with how blind and partially sighted children could be taught mathematics with the aid of computers. Experiments in this area have been set up as part of the EU-funded TIDE research programme.
Describing mathematics in terms of how it is presented on paper is not adequate for translating it to Braille or speech. Braille is not good at two dimensional representation of things like fractions. This problem is made worse by the fact that each country has its own version of Braille. Speaking formulas can also prove ambiguous. If formula descriptions are associated with the elements used to define the contents of formulae the descriptions can be used to provide guidance to the student, or to the person typing in the data.
Jon Bosak of Sun, the chair of W3C's SGML on the Web committee, explained how XML represented a profile of four international standards combined with a set of relevant Internet standards. Part 1 of XML uses a subset of SGML to provide a generalized way of marking up documents to be transmitted over the WWW. Part 2 combines the linking techniques defined in HyTime with a subset of the addressing facilities of the Text Encoding Initiative (TEI) to provide a generalized linking mechanism for the WWW. Part 3 will define a subset of the DSSSL style language suitable for on-line applications, while Part 4 will deal with the add-ons required to deal with Web manifests, name spaces and behaviour.
XML should not be seen as replacing HTML or SGML. It is augmenting HTML and helping users get up to speed with SGML. XML allows WWW users to exchange database information, develop intelligent agents and provide better client-side manipulation of data.
XML has already been proposed for a number of Internet applications. In particular Microsoft have used it as the basis for their Open Financial Exchange (OFX), Web Collections and Channel Definition Format (CDF) proposals.
Jon probably made the quote of the week when he said "XML-links will move WWW hypertext systems into the 1970's". At last you can address multiple objects in a web link. At last link sets can be sold separately from the data being linked.
Tim Bray stressed that XML is a subset of SGML that uses a fixed syntax. As XML has adopted the ISO 10646 Basic Multilingual Plane as its character set it can handle most languages of the world, but this requires that you know something about the way a document has been encoded before you process document instances. SGML processing instructions are used to ensure that this information can be found at the start of each XML document instance.
XML has been designed to allow documents to be processed without needing to refer to a formal SGML document type definition (DTD). (This speeds up the process of preparing files coming over the Internet for browsing.) To make this possible many of SGML's optional features have been discarded, and certain rules for ensuring the 'well-formedness' of transmitted documents have been specified. Errors must be reported: error recovery techniques are up to the application, but must conform to certain laid-down rules.
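The idea of checking a document without a DTD can be illustrated with a minimal sketch. It assumes a present-day XML parser (Python's standard xml.etree.ElementTree) rather than the 1997 draft discussed at the conference, and the function name is_well_formed is invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the text parses as XML; no DTD is supplied or needed."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        # The parser reports the error; recovery is left to the caller.
        return False
    return True


# Properly nested tags parse; a missing end tag is reported as an error.
print(is_well_formed("<memo><to>All staff</to></memo>"))
print(is_well_formed("<memo><to>All staff</memo>"))
```

The check says nothing about whether the document is valid against any particular document type; it only confirms that the markup is properly nested and terminated.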
Michel Vulpe of Infrastructures for Information Inc (Canada) started off the session on data modelling and SGML with a talk entitled SGML made SIMPL. SIMPL stands for Structured Information Manufacturing and Processing Language.
After explaining the benefits that HTML, XML and SGML have provided those involved in text distribution, Mr. Vulpe pointed out that SGML document structures provide a way of conveying additional information by defining relative positions. Most information in manufacturing is concerned with the relationships between components. An SGML DTD is a way of formalizing such relationships. It can be useful without large amounts of data being associated with it.
Relational database models (RDBMs) are typically normalized to third normal form before data is added to them. The structure of object-oriented databases (OODBMs) is defined using code. SGML provides an easy way of defining a structure (schema) that can be normalized for use in RDBMs or coded for use in OODBMs.
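As a rough illustration of deriving a relational structure from a document structure, the sketch below turns a toy content model into table definitions. Everything in it is an assumption made for the example: the element names, the dictionary representation of the content model, the one-table-per-element rule and the restriction that each element has a single parent. It is not a normalization procedure described by the speaker.

```python
import sqlite3

# Hypothetical content model: each element lists its permitted child elements.
CONTENT_MODEL = {
    "assembly": ["part"],
    "part": ["supplier"],
    "supplier": [],
}


def schema_statements(model):
    """Emit one CREATE TABLE per element, with a foreign key to its parent."""
    parents = {child: parent
               for parent, children in model.items()
               for child in children}
    statements = []
    for element in model:
        columns = ["id INTEGER PRIMARY KEY", "content TEXT"]
        parent = parents.get(element)
        if parent:
            # The containment relationship becomes a reference to the parent row.
            columns.append(f"{parent}_id INTEGER REFERENCES {parent}(id)")
        statements.append(f"CREATE TABLE {element} ({', '.join(columns)})")
    return statements


connection = sqlite3.connect(":memory:")
for statement in schema_statements(CONTENT_MODEL):
    connection.execute(statement)
```

The point of the sketch is only that the containment relationships captured in a DTD carry enough information to generate a schema before any data exists.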
SGML should not be seen on the desktop; it should be buried in the underlying software. There is no need to have the markup embedded in the data. What is needed is an associative model that can add structure to data that has not been marked up. Associative relationships between data need to be managed without altering the model. DTDs should be managed by the same people who are managing the database structure.
It was pointed out during the accompanying question session that most databases are networks rather than hierarchies. Some attendees felt that the SGML model was not rich enough to describe all relationships.
Frank Pieper from Mediaware BV (Holland) took Document structure independent data modelling as the title for his presentation. He stressed the need for a transformation process between the way in which data is stored in SGML documents and the way it is presented to users. DTDs should be designed with efficiency of storage and transformation in mind rather than concentrating on efficiency of presentation (as HTML does).
Different DTDs should be linked using well defined relationships. The same information object may need to be associated with different 'labels' in different DTDs.
While text-mode file storage allows start-up costs to be kept low, you will need to store any large volume of documents in a database before transforming them. DTD-based databases should be avoided, as changes in the DTD must be reflected in the database, and vice versa. SGML-oriented databases are suitable if you don't need much structural flexibility. For a more reusable solution, however, problem-domain-specific databases should be used to store data prior to transformation into documentation, etc.
Bruce Brown from Datalogics (US) proposed the adoption of a Bottoms-up paradigm shift. Due to SGML's lack of theoretical foundations, document analysis can be thought of as more of an art than a science. There is no equivalent of relational calculus for SGML, probably because SGML has not been taught in most computer science schools. More theoretical work on how to optimize DTDs needs to be undertaken.
Documents are no longer the relevant data storage units. Reusable data fragments are the key to the use of SGML in the 21st century. It does not make sense to make every element a reusable object. Each DTD should have a clearly defined set of reusable storage objects.
The processing of legacy data so that the same appearance can be obtained on output is not compatible with the need for business re-engineering for information re-use. SGML needs to become 'information centric' - it needs to move away from defining documents to defining reusable information fragments.
SGML content models should be built from the bottom up. Processes need to be associated with elements to allow data to always be current. On-demand printing of personalized information sets will become the norm in the 21st century. Just-in-time printing reduces the need for document warehousing and distribution as well as reducing waste.
Authoring of fragments will require a different approach. It is important that fragments have sufficient meta-data associated with them to track the authoring/editing process and accurately determine which fragments should be chosen for which virtual documents. Presentation rules can only be attached to the data once the virtual document has been built and the presentation technique has been selected.
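The role of fragment meta-data in assembling virtual documents can be sketched as follows. The field names (topic, audience, status), the sample fragments and the selection rule are all hypothetical; a real repository would define its own meta-data scheme.

```python
# Hypothetical fragment store: each fragment carries management meta-data.
FRAGMENTS = [
    {"id": "f1", "topic": "pump", "audience": "operator", "status": "approved",
     "text": "Switch off the pump before cleaning."},
    {"id": "f2", "topic": "pump", "audience": "engineer", "status": "approved",
     "text": "Torque the housing bolts evenly."},
    {"id": "f3", "topic": "pump", "audience": "operator", "status": "draft",
     "text": "Cleaning interval to be confirmed."},
    {"id": "f4", "topic": "valve", "audience": "operator", "status": "approved",
     "text": "Close the valve by hand."},
]


def select_fragments(fragments, topic, audience):
    """Pick approved fragments whose meta-data matches the requested view."""
    return [f for f in fragments
            if f["topic"] == topic
            and f["audience"] == audience
            and f["status"] == "approved"]


def build_virtual_document(fragments, topic, audience):
    """Assemble the selected fragments; presentation is applied afterwards."""
    return "\n".join(f["text"] for f in select_fragments(fragments, topic, audience))


print(build_virtual_document(FRAGMENTS, "pump", "operator"))
```

Note that no presentation rules appear in the fragments themselves: styling would be attached only after the virtual document has been built.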
Lars-Olaf Lindgren from Texcel (Sweden) took as his subject Information modelling for document management. Businesses need to reduce the time needed to get documentation to market to meet falling times between design and manufacture. The information model must be designed for ease of management rather than ease of presentation. The model must meet the needs of the information creators.
Creation, storage and output may require different structures. For example, translation requires that you know what is the smallest part of information that has been changed to ensure that the least possible amount of retranslation is done. This can require a different segmentation paradigm from that which is required to efficiently reuse stored data fragments in a number of virtual documents. Fragment check-out/locking rules for re-authoring, on the other hand, may require yet another type of segmentation.
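The translation case can be sketched as a comparison of fragment fingerprints between two revisions. The fragment identifiers, the choice of SHA-256 and the paragraph-level granularity are assumptions made for illustration; the talk did not prescribe a mechanism.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable digest of a fragment's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fragments_to_retranslate(old: dict, new: dict) -> list:
    """Return ids of fragments that are new or whose text has changed."""
    old_prints = {fid: fingerprint(text) for fid, text in old.items()}
    return sorted(fid for fid, text in new.items()
                  if old_prints.get(fid) != fingerprint(text))


# Hypothetical revisions, keyed by fragment id.
REVISION_1 = {"p1": "Open the cover.", "p2": "Remove the filter."}
REVISION_2 = {"p1": "Open the cover.", "p2": "Remove the old filter.",
              "p3": "Fit the new filter."}

print(fragments_to_retranslate(REVISION_1, REVISION_2))
```

The finer the segmentation, the less text is sent for retranslation, which is why the segmentation chosen for translation may differ from the one chosen for reuse or for check-out.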
Mr. Lindgren gave an interesting example of where you would need a different model for capture, storage and reuse. When creating a set of questions and answers for a training manual you need to ensure that every question has an answer, but when presenting the course you need to make sure that the answer is not presented with the question. In the database they can be stored either in separate tables, or as pairs of objects in the same table, depending on the type of database being used and the way in which the questions and answers will be used.
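The question-and-answer example can be sketched with one storage structure and separate rules for capture and presentation. The class and function names are invented for this illustration, and storing each pair as a single object is only one of the two storage options mentioned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuizItem:
    """Storage model: a question paired with its (possibly missing) answer."""
    question: str
    answer: Optional[str] = None


def unanswered(items):
    """Capture rule: list the questions that still lack an answer."""
    return [item.question for item in items if not item.answer]


def render_for_course(items):
    """Presentation rule: show questions only, never the answers."""
    return [item.question for item in items]


def render_answer_key(items):
    """A separate output carrying the answers, numbered to match."""
    return [f"{n}. {item.answer}" for n, item in enumerate(items, start=1)]


ITEMS = [
    QuizItem("What does DTD stand for?", "Document type definition"),
    QuizItem("Which standard defines STEP?"),
]
print(unanswered(ITEMS))
```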
The type of reuse will also affect data modelling. For web use you need a lot of small information units with good interconnections. For a printed document you want large blocks of continuously readable text. For a CD-ROM you might want to be able to display text, questions and answers in separate windows, at which time synchronization of data elements will be important.
Paula Angerstein from Texcel (US) started this session with a paper entitled Why you do (or don't) need HyTime in your document management system. When developed in the late 1980's HyTime was very much ahead of the times. Whilst originally designed for interconnecting the static data sets stored on CD-ROMs, it is ideally suited to the management of dynamic data sets.
HyTime is typically referenced in terms of its hyperlinking and data locating facilities. Few people have yet taken advantage of its time-based functions. Hyperlinks typically have behavioural semantics, so you need to be able to associate processes with links. For good link management you need to be able to define multiple links to a document without modifying the source data. Context-sensitive links are required if links are to be used in different information sets. Link end (anchor) resolution has to be independent of the data the anchor is associated with.
Link management today cannot be done in a standardized way. HTML has no methodology for managing its embedded links. The XML link specification is still not complete. General purpose HyTime engines are still not available. Specialist document management systems provide you with control of a restricted data set, but are not interoperable.
HyTime is defined as a set of SGML architectural forms that can be associated with any application-specific DTD. The architectural forms for links reference two or more link ends that can be resolved, at access time, to specific anchors. Query locations allow dynamic finding of link ends.
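The general idea of links held apart from the data, with link ends resolved to anchors only at access time, can be sketched as below. This is not HyTime syntax or a HyTime engine: the data structures, the document names and the phrase-search "query" are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Tuple

# The documents are never modified; links live in a separate link set.
DOCUMENTS = {
    "manual": "Check the pump. Replace the seal.",
    "parts": "Seal kit 12-A. Pump housing 7-B.",
}


@dataclass(frozen=True)
class LinkEnd:
    document: str
    phrase: str  # a simple query, evaluated when the link is traversed


@dataclass(frozen=True)
class Link:
    role: str
    ends: Tuple[LinkEnd, ...]  # two or more link ends


LINK_SET = [
    Link("part-for-task", (LinkEnd("manual", "Replace the seal"),
                           LinkEnd("parts", "Seal kit 12-A"))),
]


def resolve(end: LinkEnd) -> Tuple[str, int, int]:
    """Resolve a link end to (document, start, stop) offsets at access time."""
    text = DOCUMENTS[end.document]
    start = text.find(end.phrase)
    if start < 0:
        raise LookupError(f"no anchor for {end.phrase!r} in {end.document}")
    return end.document, start, start + len(end.phrase)


def anchors(link: Link):
    return [resolve(end) for end in link.ends]


print(anchors(LINK_SET[0]))
```

Because the anchors are computed on demand, the same link set keeps working if the text around the target phrases is edited, and several link sets can address one document without touching it.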
XML links have a class-identification attribute that acts as an architectural form, TEI-based query locators, a system-dependent mechanism for identifying files (URLs) and attributes for specifying behaviour-related processing.
Most SGML repositories use system-dependent repository identifiers that can be used to identify link ends in a locally efficient manner. These identifiers typically only address storage elements, not the entities they are contained within or their sub-elements or components of their data. They rarely allow context-sensitive links to be defined.
Link management should be devolved to specialists with suitable indexing skills. These specialists will need a point-and-click mechanism for defining links. They should not be concerned with how the selected data is identified internally. Link sets need version histories, and these need to be tied to the histories of the fragments that make up the indexed documents.
Patricia François from Aerospatiale and Phillippe Futtersach from Électricité de France explained why they had adopted a HyTime-based approach for their projects in a paper entitled Hypermedia databases. Starting as two separate projects, Aerospatiale and EDF came to realise that they needed to adopt the same approach for different reasons. Aerospatiale were concerned about longevity and how to reuse the data on a wide range of customer platforms. Both organizations had to meet regulatory controls, and had a complex set of relationships between data that change over time.
Formal object-oriented analysis (OOA) methods were used to define requirements. Modelling was done in a way that was independent of the database finally chosen (the French O2 object-oriented database). Navigation of content is a key factor for accessing multimedia objects. SGML needed to be supplemented by HyTime's linking facilities to provide the level of interconnection needed. HyTime functions were introduced incrementally on a base of plain SGML.
Separate editing and browsing suites were developed. A mixture of context-sensitive and full-text searching can be used to identify relevant fragments from the database. Java applets are used to capture queries, which can include options like stem searching.
Henry Thompson looked at ways of Using XML for stand-off markup. Stand-off markup is the separation of markup from the data that is marked up. It allows read-only data to be marked up. It also allows more than one structure to be applied to the same data without having to overload the source with markup or worry about boundary clashes.
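Stand-off markup can be sketched with character offsets into an unmodified text. The sketch does not use XML links as the talk did; the Span class, the layer names and the sample sentence are assumptions made purely to show two overlapping structures applied to the same read-only data.

```python
from dataclasses import dataclass

# Read-only source text: the annotation layers never modify it.
SOURCE = "The cat sat on the mat."


@dataclass(frozen=True)
class Span:
    """One stand-off annotation: a label applied to SOURCE[start:end]."""
    layer: str
    label: str
    start: int
    end: int

    def text(self, source: str) -> str:
        return source[self.start:self.end]


ANNOTATIONS = [
    Span("syntax", "noun-phrase", 0, 7),    # "The cat"
    Span("syntax", "verb-phrase", 8, 22),   # "sat on the mat"
    Span("prosody", "tone-unit", 4, 11),    # "cat sat", crossing both phrases
]


def spans_in_layer(layer: str):
    """Return (label, covered text) pairs for one annotation layer."""
    return [(s.label, s.text(SOURCE)) for s in ANNOTATIONS if s.layer == layer]


print(spans_in_layer("syntax"))
print(spans_in_layer("prosody"))
```

The prosody span crosses the boundary between the two syntax spans, which would be a nesting clash if both structures were embedded as tags in the text itself.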
The British National Bibliography contains some 2GB of data. Multiple types of linguistic analysis need to be performed on the whole database. The best way to do this is by adding hypertext links from the analysis structure to the data. The types of links required are not the normal "follow me" type. They need to be associated with behaviour for such things as inclusion, replacement and inverse replacement (selective mending).
XML links support multiple linking elements and light-weight TEI-based querying, with support for spans and user-determined link semantics. As such it provides a good starting point for stand-off linking. A free XML analysis tool is available at http://www.ltg.ed.ac.uk/software/xml.
Michel Biezunski of High Text (France) presented the background to ISO's Topic Navigation Map project, highlighting how it was a natural development out of the Davenport DOCBOOK/SOFABED project and the work on the Conventions for the Application of HyTime (CApH). He showed how topic navigation maps can link together sets of links as well as data related to a well defined topic. Finally he showed how a topic navigation map could be used to find your way around the CD-ROM version of the proceedings of the conference.
Martin Bryan of The SGML Centre (UK) then quickly walked through the recent changes to ISO/IEC CD 13250, which had been updated the preceding weekend to take account of the latest changes to HyTime and XML. Three new options were now available for linking together data relating to a particular topic. It has to be decided which of these is the best, or whether multiple methods should be allowed. The new HyLinks architectural form would allow all details of a topic to be stored as attributes of a single element. The new VarLink architectural form would allow XML-like links to multiple locations, while using the XML extended link architectural form you would need a separate locator element for each use of the topic.
It is hoped that the new Topic Relationship architectural form could be defined in such a way that it can be used to record details about any database schema. The STEP EXPRESS model for schema representation is to be investigated to see if it can provide the basis for a general-purpose SGML architectural form that the Topic Relationship form can be derived from.
Hasse Haitto from Synex Information AB (Sweden) looked at the ways in which SGML has been evolving over the last 10 years. He noted that the SGML standard itself had not changed; it was the way in which people were using the standard that had changed. From its initial base as a means of interchanging published documents it has now become a general purpose language for describing the inter-relationships between different pieces of data.
Modular DTDs designed to deal with documents that will be generated as a set of fragments that will be combined into virtual documents are becoming the norm. Reusable data fragments need to be associated with management meta-data that can be queried to identify the fragments required.
Entity managers are the key to the flexible use of SGML. Originally people used system-specific entity identifiers. The advent of a standardized form of catalog for mapping public identifiers to system-specific entity names has created a portable form of entity management. Entity managers that allow encryption/decryption and late-binding of delivered data will follow from the Formal System Identifiers added by the SGML Extended Facilities Annex during 1997.
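Catalog-based entity management can be sketched as a lookup from public identifiers to local names. The sketch assumes catalog entries of the form PUBLIC "public identifier" "system identifier" and handles only that entry type with double-quoted literals; the identifiers and file names shown are invented for illustration.

```python
import re

# Two entries in the style of a catalog file; the identifiers and
# file names are hypothetical.
CATALOG_TEXT = '''
PUBLIC "-//Example//DTD Manual//EN" "dtds/manual.dtd"
PUBLIC "-//Example//ENTITIES Symbols//EN" "ents/symbols.ent"
'''

# Matches: PUBLIC "public identifier" "system identifier"
ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+"([^"]+)"')


def load_catalog(text: str) -> dict:
    """Map each public identifier to its system-specific name."""
    return {public_id: system_id for public_id, system_id in ENTRY.findall(text)}


def resolve(catalog: dict, public_id: str) -> str:
    """Return the local system identifier; raises KeyError if unmapped."""
    return catalog[public_id]


catalog = load_catalog(CATALOG_TEXT)
print(resolve(catalog, "-//Example//DTD Manual//EN"))
```

Documents that refer to entities only by public identifier can then move between systems unchanged; only the catalog is replaced.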
Indirect addressing of the type seen in HyTime will become increasingly important to the SGML community. XML is a minimalist approach to the use of SGML. HyTime addressing is much more generalized than the URL-based XML form of addressing. Advanced applications will typically look to HyTime to provide the forms of indirection they need. Mr. Haitto quoted Joan Smith's declaration that "HyTime is the application of SGML that is destined to take information processing into the next millennium."
SGML subdocuments provide a natural level for information reuse as they are a natural unit at which to assign a different set of style semantics. SGML's link feature provides a natural way of associating meta-data with documents and subdocuments. These 'forgotten features' of SGML are now becoming much more appreciated than they were when they were first introduced a decade ago.
Groves are the most important of the recent extensions to our use of SGML. They provide a model for exchanging information between applications in a standardized way. As well as making structured document queries based on SDQL possible, groves also make it feasible to develop industry standard APIs, which will speed the development of new tool sets.
After reviewing the history of the review process, Dr. Goldfarb went on to detail some of the changes that have already been accepted for the next extension to the SGML standard. In addition to those already published as part of the SGML Extended Facilities annex to the HyTime standard and the Extended Naming Rules Technical Corrigendum published in 1996, the following additions will be defined in a new WebSGML Adaptations Technical Corrigendum to be approved during 1997:
Once these extensions have been fast-tracked to meet the needs of the XML community a full review of SGML will be undertaken. Among the extensions already agreed as part of this review are:
Lynne Price and David Peterson showed examples of how many of these new features could be used to improve the flexibility of existing SGML applications.
Those wishing to keep abreast of developments should keep an eye on http://www.SGMLsource.com/Goldfarb/8879rev.
The closing keynote was given by Eliot Kimber, who took as his theme the role of SGML in the 21st Century. SGML is not being 'threatened' by HTML, XML, Java or PDF. As people are getting more sophisticated in their requirements they are coming to realise that they need more of SGML.
It was a momentous year for the SGML community. First there was the publication of the DSSSL standard, closely followed by James' Awesome DSSSL Engine (JADE), the first public domain DSSSL processor for converting SGML into HTML, RTF, ... This was closely followed by the 2nd edition of HyTime, which incorporated the key features of DSSSL (groves, SDQL, etc) into the new SGML Extended Facilities annex alongside features from HyTime, such as property sets and lexical typing, which are useful to all SGML applications. The development of XML, including XML-links, has further clarified the way in which SGML applications should interact with one another.
One of the surprises of the year has been the way in which the WWW community have taken to XML. Despite being very incomplete the XML-link proposal has already been identified as meeting the needs of a large community of web users.
While SGML was designed in the age of 'computer dinosaurs' its concepts are still valid in today's slim-line client-server environment. What needs to happen is that the high-cost/high-benefit approach to existing SGML becomes a low-cost/high-benefit approach of the type propounded by XML. In particular start-up costs need to be lowered, and the learning curve needs to be reduced.
SGML groves will make it easier to build plug-and-play SGML tools based on a common API. SGML Open shows how vendors can work together to improve interchangeability and system interfaces. The main constraint on how fast development can occur will be the human endurance of those trying to improve the standard!
This information set on OII standards is maintained by Martin Bryan of The SGML Centre and Man-Sze Li of IC Focus on behalf of European Commission DGXIII/E.
File last updated: July 1997