SGML Europe 97
The Next Decade - Pushing the Envelope
Barcelona, May 13th-15th 1997
The main theme behind SGML Europe 97 was "where will
SGML and its related standards go in
the next decade". The 500+ attendees were provided with a wealth of
information on the way in which the SGML standard is drastically changing the
way in which data is being managed by globally orientated
companies.
The full proceedings of the 60+ presentations will be published later this
year by the Graphic Communications Association, who organized the conference.
This report covers a selection of the talks and panels on the role and development of standards.
It was appropriate that the speaker who started this panel session should be
Tim Bray of Textuality Inc, who, at the close of last year's conference,
made a plea for a simpler form of SGML for use on the Internet. Since then
he has become co-editor of the specification for the
Extensible Markup Language (XML) for the newly formed SGML on the Web
group of W3C. His main prediction was that "One year
from today you will be able to send SGML files to a standard web browser -
thanks to XML". As a corollary to this, he pointed out that the SGML community
needed to start worrying about the effect the introduction of such a large
audience will have on current industry players. Existing tools will have to
become easier to use if they are to survive in the expanded world of XML.
Paula Leenheer from Shell pointed out that SGML is currently used where
legislation requires well-structured, easily found data. HTML has
introduced a wide audience to the benefits of marking up documents in
a non-proprietary form. The need to upgrade systems to cope with year 2000
effects will provide companies with a natural point at which to introduce
new techniques to staff.
Peter Bergström of EuroSTEP predicted that the next generation of
tools will hide the SGML markup from users. SGML will become a hidden
markup language, in the same way that RTF
is currently hidden from word processing users. He also stressed the
advantages of interchangeable data models, of the type defined for
STEP, that can be used by different types of
database.
Mr. Bergström also predicted that stand-alone document management systems
will become part of the main company database, and that management of
hypertext documents will become a key factor. The importance of the
HyTime standard will become increasingly
recognized in this respect.
Walter Schmitt-Rennekamp of Lufthansa predicted that direct access to
manufacturer databases will become increasingly important. Data models,
such as AECMA's SPEC 2000 aircraft database schema, are having an
increasing impact on the way technical documenters work. They are
slowly switching from converting word processing documents into SGML prior
to loading a database to authoring directly in SGML. Integration of
documentation with other manufacturing data will become increasingly important.
Henry Thompson from the Department of Artificial Intelligence at the
University of Edinburgh pointed out that the amount of on-line data
is growing much faster than the speed at which computers can process it.
Multiprocessor analysis of data, using the untapped power of user
machines, will become increasingly important. Self-describing structured
documents are required to make this possible. SGML should be thought of as being
concerned with information management rather than document management.
Sharon Adler of the Inso Corporation stressed the fact that "one size
does not fit all". Users need choices. Some will use
HTML with Cascading Style Sheets (CSS),
some XML and DSSSL-O and some SGML
and DSSSL. Tools that allow progression
from simple word-processing applications to
complex SGML-based information management roles are required.
Charles Goldfarb used his annual Inventor's Technical Keynote session to
provide a longer-term view of the future of SGML - taking as his theme
the first 40 years of the use of role-based generalized markup. The concept
of generalized markup has now been with us for 30 years. When the SGML Europe
conference was last held in Spain, in 1987, it would have been impossible to
have predicted that SGML could be used to book a holiday in Majorca directly
from a home in California 10 years later. The only safe prediction for 10
years from now is that SGML will no longer be used mainly for documentation;
non-documentation uses of SGML are likely to become the norm.
Dr. Goldfarb introduced the WebSGML extensions to SGML that were approved
at a meeting held, under the auspices of GCA, in the same venue the
week before the conference. He also highlighted the importance of the
SGML Extended Facilities Annex in the 2nd Edition of HyTime as an
indication of the future of SGML.
Publishers, including those controlling broadcast distribution channels,
want continuity, but they also want to restrict use of the web. One
danger is that we will see the introduction of 'barbed-wire' fences
to parts of the 'free range' World Wide Web (WWW) owned by
commercially-orientated Internet Service Providers (ISPs) who wish
to stop their users from having access to data they do not control.
So-called 'free subscriptions' will be used to restrict the freedom
of users to access all of the information provided on the WWW.
The current attempts to introduce 'push technology' can be seen as part
of this "we know what's best for you" view of the world of information.
It is important that the general public is made aware of the advantages
of access to all of the information on the WWW.
SGML is increasingly becoming the middleware between applications.
Its benefits need to be taught better. In particular the difference between
style and structure needs to be better understood. Without style you
cannot communicate structure. Thoughts can't travel without rendition, but
the same set of structured data may need to be rendered in different ways
for different audiences. By divorcing logical structure from presentation
style, SGML allows information to be prepared in a form that allows its reuse
by many different audiences.
Industrial data is data relating to product design, manufacturing, distribution
and servicing. Typically documentation is produced at the end of the
development/design process. The aim of STEP/SGML is to allow the capture
of data as close to the point of decision making as possible.
Documentation has its own design/manufacture/distribute/service cycle,
which needs to be managed in the same way as any other product.
ISO 10303, the Standard for the Exchange of Product Model Data (STEP), provides
a system independent neutral format for exchanging industrial data. SGML
provides a system independent neutral format for exchanging documentation.
EXPRESS provides a language for describing databases using computer interpretable
descriptions known as entities. HyTime provides a mechanism of referencing
industrial data from within documentation. Between them these four standards
can provide a system independent way of formally defining the information set required
by a typical manufacturing company.
Integrating the four standards will be a multi-stage process. In the
first stage STEP has tried to link together the separate databases used
to store industrial data and documentation. A second stage should use HyTime
links to interconnect data held in the separate databases. A third phase
of development should allow industrial data and documentation to be held in
the same database. Three EU-funded projects are currently studying the
feasibility of integrated databases.
An SGML Application Profile is to be developed to describe 'generic documents'
in a STEP database. This will allow the effectivity model of STEP to be
applied to document production. For further information on this work,
see http://www.eccnet.com/step.
Murray Malone explained how WAI is helping to extend the addressing
capabilities of the SGML envelope. WAI is the acronym applied to W3C's
Web Accessibility Initiative,
which is being backed by the White House, Microsoft, SoftQuad and the
Yuri Rubinsky Insight Foundation, among others.
The aim of WAI is to make information more accessible to those disadvantaged
by physical disabilities. WAI aims to educate product developers, content
providers and educators about the needs of this community. To help this
a tool called BOBBY has been developed
which can be used to evaluate the accessibility of existing Internet sites.
Jeff Suttor of Sun explained how
WAI was helping to reduce the amount of rekeying currently needed to
convert documents into Braille or speech. When SGML (or HTML) markup is
combined with the ICADD specifications for tagging data for conversion to
other formats, conversion can become an automated process.
Practice has shown that once ICADD-compatibility has been adopted
the associated data becomes much more easily reusable. The cost of
subsequent conversions is much less if the elements of the document have been
correctly classified during creation.
The more consistent the design of a document structure the easier it is
to convey the author's intentions to 'readers'. The use of features such
as tables and, particularly, frames makes it much harder to convey
information to blind or partially sighted readers. For partially sighted readers
it becomes impossible to use larger type sizes without either horizontal
scrolling or complex wrapping of columns.
Sun's Java Foundation Classes (JFC) will include accessibility classes for
HTML files that are based on the ICADD specification.
Bertrand Malese of Grif highlighted the problems of providing access to
mathematics for the blind. He was particularly concerned with how blind
and partially sighted children could be taught mathematics with the aid
of computers. Experiments in this area have been set up as part of the
EU-funded TIDE research programme.
Describing mathematics in terms of how it is presented on paper is not
adequate for translating it to Braille or speech. Braille is not good at
two dimensional representation of things like fractions. This problem
is made worse by the fact that each country has its own version of Braille.
Speaking formulas can also prove ambiguous. If formula descriptions are
associated with the elements used to define the contents of formulae
the descriptions can be used to provide guidance to the student, or to
the person typing in the data.
Jon Bosak of Sun, the chair of W3C's SGML on the Web committee, explained
how XML represented a profile of four international standards combined with
a set of relevant Internet standards. Part 1 of XML uses a subset of
SGML to provide a generalized way of marking up documents to be transmitted
over the WWW. Part 2 combines the linking techniques defined in HyTime with a
subset of the addressing facilities of the Text
Encoding Initiative (TEI) to provide a generalized linking mechanism
for the WWW. Part 3 will define a subset of the DSSSL style language
suitable for on-line applications, while a later part will deal with the add-ons
required for Web manifests, name spaces and behaviour.
XML should not be seen as replacing HTML or SGML. It is augmenting HTML
and helping users get up to speed with SGML. XML allows WWW users to exchange
database information, develop intelligent agents and provide better client-side
manipulation of data.
XML has already been proposed for a number of Internet applications. In
particular Microsoft have used it as the basis for their Open Financial
Exchange (OFX), Web Collections and Channel Definition Format (CDF)
proposals.
Jon probably made the quote of the week when he said "XML-links will move
WWW hypertext systems into the 1970's". At last you can address multiple
objects in a web link. At last link sets can be sold separately from the
data being linked.
Tim Bray stressed that XML is a subset of SGML that uses a fixed syntax.
As XML has adopted the ISO 10646 Basic
Multilingual Plane as its character
set it can handle most languages of the world, but this requires that you
know something about the way a document has been encoded prior to processing
document instances. SGML processing instructions are used to ensure that
this information can be found at the start of each XML document instance.
XML has been designed to allow documents to be processed without needing
to refer to a formal SGML document type definition (DTD). (This speeds up
the process of preparing files coming over the Internet for browsing.)
To make this possible many of SGML's optional features have been discarded,
and certain rules for ensuring the 'well-formedness' of transmitted documents
have been specified. Errors must be reported: error recovery techniques are
up to the application, but must conform to certain laid-down rules.
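By way of illustration, a minimal well-formed document of the kind Mr. Bray described might look like the sketch below (element names are invented, and the declaration syntax shown is that which was later finalized in the XML specification):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The processing instruction above tells the parser how the
     document is encoded before any markup is read. No DTD is
     needed: the well-formedness rules alone (every start-tag has
     a matching end-tag, all attribute values are quoted) make
     the structure parsable as received. -->
<report date="1997-05-13">
  <title>SGML Europe 97</title>
  <para>A well-formed document, parsable without a DTD.</para>
</report>
```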
Michel Vulpe of Infrastructures for Information Inc (Canada) started off
the session on data modelling and SGML with a talk entitled SGML made
SIMPL. SIMPL stands for Structured Information Manufacturing and
Processing Language.
After explaining the benefits that HTML, XML and SGML have
provided those involved in text distribution, Mr. Vulpe pointed out that
SGML document structures provide a way of conveying additional information
by defining relative positions.
Most information in manufacturing is concerned with the relationships between
components. An SGML DTD is a way of formalizing such relationships. It can
be useful without large amounts of data being associated with it.
Relational database models (RDBMSs) are typically normalized to third normal form before
data is added to them. The structure of object-oriented databases (OODBMSs)
is defined using code. SGML provides an easy way of defining a structure (schema) that
can be normalized for use in RDBMSs or coded for use in OODBMSs.
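As a sketch of this point (the element and attribute names here are invented for illustration), a small DTD can capture component relationships in a way that maps onto either database model:

```xml
<!-- Hypothetical DTD fragment: an assembly contains one or more
     parts, and the content-model hierarchy itself expresses the
     relationships between the components. -->
<!ELEMENT assembly (name, part+)>
<!ELEMENT part     (name, supplier?)>
<!ELEMENT name     (#PCDATA)>
<!ELEMENT supplier (#PCDATA)>
<!ATTLIST part partno CDATA #REQUIRED>
```

The same schema normalizes naturally into two relational tables joined on `partno`, or codes directly into two object classes, without any document instances existing yet.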
SGML should not be seen on the desktop, it should be buried in the underlying
software. There is no need to have the markup embedded in the data. What is
needed is an associative model that can add structure to data that has not
been marked up. Associative relationships between data need to be managed
without altering the model. DTDs should be managed by
the same people who are managing the database structure.
It was pointed out during the accompanying question session that most databases
are networks rather than hierarchies. Some attendees felt that the SGML model
was not rich enough to describe all relationships.
Frank Pieper from Mediaware BV (Holland) took Document structure
independent data modelling as the title for his presentation.
He stressed the need for a transformation process between the way in which
data is stored in SGML documents and the way it is presented to users.
DTDs should be designed with efficiency of storage and transformation
in mind rather than concentrating on efficiency of presentation (as HTML does).
Different DTDs should be linked using well defined relationships.
The same information object may need to be associated with different 'labels'
in different DTDs.
While text mode file storage allows start-up costs to be kept low,
you will need to store any large volume of documents in a database before transforming them.
DTD-based databases should be avoided as changes in the DTD must be
reflected in the database, and vice versa. SGML-oriented databases are
suitable if you don't need much structure flexibility. But for a more
reusable solution problem-domain specific databases should be used to store
data prior to transformation into documentation, etc.
Bruce Brown from Datalogics (US) proposed the adoption of a Bottoms-up
paradigm shift. Due to SGML's lack of theoretical foundations,
document analysis can be thought of as more of an art than a science.
There is no equivalent of relational calculus for SGML, probably because
SGML has not been taught in most computer science schools. More theoretical
work on how to optimize DTDs needs to be undertaken.
Documents are no longer the relevant data storage units. Reusable data
fragments are the key to the use of SGML in the 21st century. It does
not make sense to make every element a reusable object. Each DTD should
have a clearly defined set of reusable storage objects.
The processing of legacy data so that the same appearance can be obtained
on output is not compatible with the need for business re-engineering for
information re-use. SGML needs to become 'information centric' - it needs
to move away from defining documents to defining reusable information fragments.
SGML content models should be built from the bottom up. Processes need to
be associated with elements to allow data to always be current. On-demand
printing of personalized information sets will become the norm in the
21st century. Just-in-time printing reduces the need for document warehousing
and distribution as well as reducing waste.
Authoring of fragments will require a different approach. It is important
that fragments have sufficient meta-data associated with them to track the
authoring/editing process and accurately determine which fragments should
be chosen for which virtual documents. Presentation rules can only be
attached to the data once the virtual document has been built and the
presentation technique has been selected.
Lars-Olaf Lindgren from Texcel (Sweden) took as his subject Information
modelling for document management. Businesses need to reduce the time
needed to get documentation to market to meet falling times between design
and manufacture. The information model must be designed for ease of management
rather than ease of presentation. The model must meet the needs of the
information creators.
Creation, storage and output may require different structures. For example,
translation requires that you know what is the smallest part of information
that has been changed to ensure that the least possible amount of retranslation
is done. This can require a different segmentation paradigm from that
which is required to efficiently reuse stored data fragments in a number
of virtual documents. Fragment check-out/locking rules for re-authoring, on
the other hand, may require yet another type of segmentation.
Mr. Lindgren gave an interesting example of where you would need a different
model for capture, storage and reuse. When creating a set of questions and
answers for a training manual you need to ensure that every question has
an answer, but when presenting the course you need to make sure that the
answer is not presented with the question. In the database they can be
stored either in separate tables, or as pairs of objects in the same table,
depending on the type of database being used and the way in which the
questions and answers will be used.
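Mr. Lindgren's example can be sketched as a content model (the element names are hypothetical): pairing question and answer in a single element guarantees at capture time that no question is left unanswered, while output processing remains free to suppress or relocate the answers.

```xml
<!-- Hypothetical capture model for a training manual. The parser
     enforces the question/answer pairing; a stylesheet decides
     whether, and where, the answer is rendered. -->
<!ELEMENT exercise (question, answer)>
<!ELEMENT question (#PCDATA)>
<!ELEMENT answer   (#PCDATA)>
```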
The type of reuse will also affect data modelling. For web use you need a
lot of small information units with good interconnections. For a printed
document you want large blocks of continuously readable text. For a CD-ROM
you might want to be able to display text, questions and answers in separate
windows, at which time synchronization of data elements will be important.
Paula Angerstein from Texcel (US) started this session with a paper entitled
Why you do (or don't) need HyTime in your document management system.
When developed in the late 1980's HyTime was very much ahead of its time.
Whilst originally designed for interconnecting the static data sets
stored on CD-ROMs, it is ideally suited to the management of dynamic data sets.
HyTime is typically referenced in terms of its hyperlinking and data locating
facilities. Few people have yet taken advantage of its time-based functions.
Hyperlinks typically have behavioural semantics, so you need to be able to
associate processes with links. For good link management you need to be
able to define multiple links to a document without modifying the source data.
Context-sensitive links are required if links are to be used in different
information sets. Link end (anchor) resolution has to be independent of
the data the anchor is associated with.
Link management today cannot be done in a standardized way. HTML has no methodology
for managing its embedded links. The XML link specification is still not
complete. General purpose HyTime engines are still not available. Specialist
document management systems provide you with control of a restricted data
set, but are not interoperable.
HyTime is defined as a set of SGML architectural forms that can be associated
with any application-specific DTD. The architectural forms for links
reference two or more link ends that can be resolved, at access time, to
specific anchors. Query locations allow dynamic finding of link ends.
XML links have a class-identification attribute that acts as an architectural
form, TEI-based query locators, a system-dependent mechanism for identifying
files (URLs) and attributes for specifying behaviour-related processing.
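A simple link in the XML style might be sketched as follows (this is based on the working draft as it stood at the time of the conference; the element name is invented and the attribute names were still subject to change):

```xml
<!-- Sketch of a draft-era XML link. The xml-link attribute acts
     as the class-identification (architectural form) attribute,
     href carries the system-dependent address, and show/actuate
     specify behaviour-related processing. -->
<xref xml-link="simple" href="manual.xml"
      show="replace" actuate="user">further details</xref>
```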
Most SGML repositories use system-dependent repository identifiers that can
be used to identify link ends in a locally efficient manner. These identifiers
typically only address storage elements, not the entities they are contained
within or their sub-elements or components of their data. They rarely
allow context-sensitive links to be defined.
Link management should be devolved to specialists with suitable indexing skills.
These specialists will need a point-and-click mechanism for defining links.
They should not be concerned with how the selected data is identified internally.
Link sets need version histories, and these need to be tied to the histories
of the fragments that make up the indexed documents.
Patricia François from Aerospatiale and Philippe Futtersach from
Électricité de France explained why they had adopted a
HyTime-based approach for their projects in a paper entitled Hypermedia
databases. Starting as two separate projects, Aerospatiale and EDF
came to realise that they needed to adopt the same approach for different
reasons. Aerospatiale were concerned about longevity and how to reuse the
data on a wide range of customer platforms. Both organizations had to meet
regulatory controls, and had a complex set of relationships between data
that change over time.
Formal object-oriented analysis (OOA) methods were used to define
requirements. Modelling was done in a way that was independent of
the database finally chosen (the French O2 object-oriented database).
Navigation of content is a key factor for accessing multimedia objects.
SGML needed to be supplemented by HyTime's linking facilities to
provide the level of interconnection needed. HyTime functions were
introduced incrementally on a base of plain SGML.
Separate editing and browsing suites were developed. A mixture of
context-sensitive and full-text searching can be used to identify
relevant fragments from the database. Java applets are used to
capture queries, which can include options like stem searching.
Henry Thompson looked at ways of Using XML for stand-off markup.
Stand-off markup is the separation of markup from the data that is marked up.
It allows read-only data to be marked up. It also allows more than one
structure to be applied to the same data without having to overload the source
with markup or worry about boundary clashes.
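A hypothetical fragment shows the principle (element names and addressing attributes invented for illustration): the annotations live in their own file and point into the source, so the source itself is never touched.

```xml
<!-- Hypothetical stand-off annotation file. The read-only source
     document is never modified; each analysis element addresses
     a span within it, so several incompatible structures can
     coexist over the same data without boundary clashes. -->
<analysis source="corpus.xml">
  <phrase type="np" from="w1" to="w3"/>
  <phrase type="vp" from="w4" to="w7"/>
</analysis>
```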
The British National Corpus contains some 2GB of data. Multiple types
of linguistic analysis need to be performed on the whole database.
The best way to do this is by adding hypertext links from the analysis
structure to the data. The types of links required are not the normal "follow
me" type. They need to be associated with behaviour for such things as
inclusion, replacement and inverse replacement (selective mending).
XML links support multiple linking elements and light-weight TEI-based
querying, with support for spans and user-determined link semantics.
As such it provides a good starting point for stand-off linking. A free
XML analysis tool is available at http://www.ltg.ed.ac.uk/software/xml.
Michel Biezunski of High Text (France) presented the background to ISO's
Topic Navigation Map project, highlighting how it was a natural development
out of the Davenport DOCBOOK/SOFABED project and the work on the
Conventions for the Application of HyTime (CApH). He showed how topic
navigation maps can link together sets of links as well as data related to
a well defined topic. Finally he showed how a topic navigation map could
be used to find your way around the CD-ROM version of the proceedings of the
conference.
Martin Bryan of The SGML Centre (UK) then quickly walked through the recent changes to ISO/IEC CD 13250,
which had been updated the preceding weekend to take account of the latest
changes to HyTime and XML. Three new options were now available for
linking together data relating to a particular topic. It has to be
decided which of these is the best, or whether multiple methods should be
allowed. The new HyLinks architectural form would allow all details of a
topic to be stored as attributes of a single element. The new VarLink
architectural form would allow XML-like links to multiple locations, while
using the XML extended link architectural form you would need a separate
locator element for each use of the topic.
It is hoped that the new Topic Relationship architectural form could be
defined in such a way that it can be used to record details about any
database schema. The STEP EXPRESS model for schema representation is to
be investigated to see if it can provide the basis for a general-purpose
SGML architectural form that the Topic Relationship form can be derived from.
Hasse Haitto from Synex Information AB (Sweden) looked at the ways in
SGML has been evolving over the last 10 years. He noted that the SGML
standard itself had not changed; it was the way in which people were
using the standard that had changed. From its initial base as a means
of interchanging published documents it has now become a general purpose
language for describing the inter-relationships between different pieces of
data.
Modular DTDs designed to deal with documents that will be generated as
a set of fragments to be combined into virtual documents are
becoming the norm. Reusable data fragments need to be associated with
management meta-data that can be queried to identify the fragments required.
Entity managers are the key to the flexible use of SGML. Originally people
used system-specific entity identifiers. The advent of a standardized form
of catalog for mapping public identifiers to system-specific entity names
has created a portable form of entity management. Entity managers that
allow encryption/decryption and late-binding of delivered data will follow
from the Formal System Identifiers added by the SGML Extended Facilities Annex
during 1997.
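The standardized catalog form mentioned above maps public identifiers to local system identifiers; the entries below are illustrative (the file names are invented, though the public identifiers follow real conventions):

```
PUBLIC "-//Davenport//DTD DocBook V3.0//EN"          "dtds/docbook/docbook.dtd"
PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN"   "ents/isolat1.ent"
```

A document can then reference entities portably by their public identifiers, with each site's entity manager resolving them to whatever local storage it uses.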
Indirected addressing of the type seen in HyTime will become increasingly
important to the SGML community. XML is a minimalist approach to the use of
SGML. HyTime addressing is much more generalized than the URL-based XML
form of addressing. Advanced applications will typically look to HyTime to
provide the forms of indirection they need. Mr. Haitto quoted Joan Smith's
declaration that "HyTime is the application of SGML that is destined to
take information processing into the next millennium."
SGML subdocuments provide a natural level for information reuse as they are
a natural unit at which to assign a different set of style semantics.
SGML's link feature provides a natural way of associating meta-data with
documents and subdocuments. These 'forgotten features' of SGML are
now becoming much more appreciated than they were when they were first
introduced a decade ago.
Groves are the most important of the recent extensions to our use of SGML.
They provide a model for exchanging information between applications in a
standardized way. As well as making structured document queries based on
SDQL possible, groves also make it feasible to develop industry standard APIs,
which will speed the development of new tool sets.
After reviewing the history of the review process, Dr. Goldfarb went on to
detail some of the changes that have already been accepted for the next
extension to the SGML standard. In addition to those already published as
part of the SGML Extended Facilities annex to the HyTime standard and the
Extended Naming Rules Technical Corrigendum published in 1996, the following
additions will be defined in a new WebSGML Adaptations Technical Corrigendum
to be approved during 1997:
- declared attribute values that can be shared by different attributes
- multiple attribute list declarations for a single element
- declaration of global attributes
- impliable document type definitions
- impliable element, attribute and entity declarations
- parsing without DTDs
- referencing of SGML declarations as external entities
- optional removal of capacity and quantity constraints
- unbundling of short tag options to allow control of individual options
- predefined (character set independent) entities for replacement of delimiters
- simplified white space handling
- optional end-tags for empty elements
- hexadecimal numeric character references for ISO 10646 character sets
- use of Internet domain names in formal public identifiers
- new conformance classes such as tag-valid and integrally stored.
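Two of the additions listed above can be sketched briefly (the element names are invented; the delimiter syntax shown follows the XML draft):

```xml
<!-- A hexadecimal numeric character reference addresses an
     ISO 10646 code point directly, independent of the document's
     declared character set: -->
<para>The reference &#xE9; denotes a small e with acute accent.</para>
<!-- An empty element can close itself, removing the need for a
     separate end-tag: -->
<graphic file="fig1.cgm"/>
```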
Once these extensions have been fast-tracked to meet the needs of the XML
community a full review of SGML will be undertaken. Among the extensions
already agreed as part of this review are:
- more granularity to the SGML Declaration to allow options to be
selected individually
- reference to an externally stored SGML declaration with facilities
to override individual options from the external set
- no presumption as to which alphabetic characters and digits make
up the default name set
- use of the letter B in short reference strings, with alternative
mechanism for identifying blank (whitespace) sequences
- more than one language qualifier to a formal public identifier
- new classes for public text identifiers, including removal of
restrictions on where attributes can be defined
- restrictions on where marked sections will be valid, and the
keywords that will be valid in each circumstance
- the addition of an OR option for marked section keywords
- references to unique IDs can be qualified by the name of the
document entity they are part of
- assignment of default attribute value for current attributes
- assignment of notations to internal entities as well as external ones
- omission of notation declarations for external entities when this
information can be provided by the entity manager
- requiring processing instructions to begin with a notation name
- requiring entity names to end with the REFC delimiter
- modularization of name spaces used by SGML.
Lynne Price and David Peterson showed examples of how many of these
new features could be used to improve the flexibility of existing SGML
applications.
Those wishing to keep abreast of developments should keep an eye on
http://www.SGMLsource.com/Goldfarb/8879rev.
The closing keynote was given by Eliot Kimber, who took as his theme
the role of SGML in the 21st Century. SGML is not being 'threatened' by
HTML, XML, Java or PDF. As people are getting more sophisticated in their
requirements they are coming to realise that they need more of SGML.
It was a momentous year for the SGML community. First there was the
publication of the DSSSL standard, closely followed by James' Awesome DSSSL
Engine (Jade), the first public domain DSSSL processor for converting
SGML into HTML, RTF and other formats. Close on its heels came the 2nd edition of
HyTime, which incorporated the key features of DSSSL (groves, SDQL, etc)
into the new SGML Extended Facilities annex alongside features from HyTime,
such as property sets and lexical typing, which are useful to all SGML
applications. The development of XML, including XML-links, has further
clarified the way in which SGML applications should interact with one another.
One of the surprises of the year has been the way in which the WWW community
have taken to XML. Despite being very incomplete the XML-link proposal
has already been identified as meeting the needs of a large community of
web users.
While SGML was designed in the age of 'computer dinosaurs' its concepts
are still valid in today's slim-line client-server environment. What needs
to happen is that the high-cost/high-benefit approach to existing SGML becomes
a low-cost/high-benefit approach of the type propounded by XML. In particular
start-up costs need to be lowered, and the learning curve needs to be reduced.
SGML groves will make it easier to build plug-and-play SGML tools
based on a common API. SGML Open shows how vendors can work together to
improve interchangeability and system interfaces. The main constraint on
how fast development can occur will be the human endurance of those trying
to improve the standard!
Martin Bryan