[Mirrored from: http://www.techapps.co.uk/iihb_sgml.html]
This chapter gives a short introduction to the Standard
Generalized Mark-up Language (SGML)[1]. SGML
is an ISO standard for defining document structures for the application
of mark-up schemes. It provides a consistent and precise manner
of applying mark-up for describing the component parts of a document,
enabling the exchange of revisable documents between different
computer systems.
The SGML standard grew out of development work on generic coding
and mark-up languages in the early 1970s. Various lines of research
merged into the ISO subcommittee Text Description and Processing
Languages, which produced the standard in 1986.
SGML is an open-ended definition reflecting the diversity of information
needs in industry. It does not directly define any types of content
data, and thus does not restrict the type of data contained in
a document. It is flexible enough to be able to describe any logically
structured set of information, whether it be a form, memo, letter,
report, book, encyclopaedia, dictionary or even a spreadsheet
or database.
SGML itself is not a mark-up scheme: it does not define mark-up
tags, nor does it provide a template for a particular type of document.
Rather, it is a way of describing any mark-up scheme. By using SGML,
many mark-up schemes can be developed, one for each document type or
class. This is both a strength and a weakness for SGML.
Its flexibility to specify document structures for any application
has led to its adoption by almost all industrial and institutional
areas and it is today by far the most widespread method of encoding
and interchanging structured documents. However, this has meant
that many organisations have developed their own ways of using
SGML (in the form of Document Type Definitions, explained in Section
6.2.1) which are then used as restrictive ad hoc standards.
This has led to a certain amount of confusion and duplication
of effort. However, today the situation is stabilising and consensus
between groups is producing more and more de facto and
de jure SGML standards.
Applications use SGML to define mark-up schemes. For example,
a group of publishers could define a mark-up scheme for describing
textbooks. Each publisher could then implement this textbook mark-up
scheme to fit their own system. As long as documents conformed
to this scheme, they could be interchanged at will. Any
author writing to conform to the scheme could submit articles
to any publisher. The article could then be automatically formatted
using the publisher's own style.
In SGML-based environments, document formatting is handled at
document presentation time, rather than at the time data is entered.
The Document Style Semantics and Specification Language (DSSSL)
can be used to interchange rules that control the presentation
of SGML-encoded data. (See Chapter 7.)
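The separation of structure from presentation described above can be
sketched in a few lines of Python. This is an illustration only: the
element names, the two house styles and the render function are
assumptions made for the sketch, and it does not use DSSSL.

```python
# A structured article as (element name, text) pairs.
# The element names are hypothetical, not taken from a published DTD.
ARTICLE = [
    ("title", "Cats and Dogs"),
    ("para", "Please keep all pets indoors tonight."),
]

# Two hypothetical house styles. Each maps an element name to a
# formatting rule that is applied only at presentation time.
STYLE_A = {
    "title": lambda text: text.upper(),
    "para": lambda text: "    " + text,
}
STYLE_B = {
    "title": lambda text: "== " + text + " ==",
    "para": lambda text: text,
}


def render(document, style):
    """Format the same structured document with a given house style."""
    return "\n".join(style[name](text) for name, text in document)


print(render(ARTICLE, STYLE_A))
print(render(ARTICLE, STYLE_B))
```

The article itself is never changed; only the style table differs
between the two publishers.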
Systems integrators exchanging large volumes of industrial information
among various hardware/software platforms often choose SGML for
encoding document information, particularly if their applications
need an internationally accepted and widely supported basis. SGML
is perhaps best known for its use in the publishing industry and
for its part in the US Department of Defense's CALS (Continuous
Acquisition and Life-cycle Support) program to create Interactive
Electronic Technical Manuals (IETMs). The TEI (Text Encoding
Initiative), a widely supported international effort by academic
researchers to standardise the encoding of literary and other texts,
has encoded gigabytes of data using SGML-defined formats. SGML applications
can provide the necessary structure for OII exchanges that support
communities of mutual interest, even when they involve many participants
and several human languages.
SGML's ability to express context-free document descriptions makes
it as well suited to describing multimedia documents as it is to
describing newspapers or textbooks. This suitability has led many CD-ROM,
electronic book, hypertext, and hypermedia vendors to use SGML
as their document preparation standard. Perhaps the largest set
of SGML-encoded files is the one used on the World Wide Web. HTML,
the hypertext mark-up language used in Web documents, is an application
of SGML.
A special advantage of SGML-encoded documents is the flexibility
that the coding gives them to be reused for different purposes,
and to interact with systems that didn't even exist when they
were created.
One of the reasons for the success of SGML is that it appears
to be rather simple. It extends the idea of marking up text, which
is a well known technology (see Annex), in a fashion that adds
several powerful features without appearing to add any complexity.
The ingenuity comes from the idea of providing a definition of
the mark-up as a separate declaration which is known as the Document
Type Definition. The key to understanding SGML is to understand
the concept of a DTD.
Before a DTD can be created it is necessary to analyse typical
examples of the documents to be encoded to determine the role
of each of the information elements to be marked up. The aim of
this document analysis is to identify nested sets of non-overlapping
information containers (elements), starting with an outermost
container element that is the published document. Each type of container
(element) is assigned a generic identifier which acts as a class
name. For each container class a set of rules is developed to
show what may be placed in the container and the order in which
contained objects must be entered. The relationships between the
information containers are then documented in a form suitable for
processing by a computer.
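As a rough illustration, the outcome of such a document analysis can
be recorded as a table of generic identifiers and checked for
consistency before a DTD is written. The Python sketch below is not
part of SGML: the memo element names, the RULES table and both helper
functions are assumptions made for the example.

```python
# Hypothetical result of analysing a set of memos: for each generic
# identifier, the ordered list of containers it may hold.
# "#text" stands for plain character data.
RULES = {
    "memo": ["from", "date", "subject", "para"],
    "from": ["#text"],
    "date": ["#text"],
    "subject": ["#text"],
    "para": ["#text"],
}


def undeclared(rules):
    """Containers that are referenced by a rule but never defined."""
    declared = set(rules) | {"#text"}
    used = {child for children in rules.values() for child in children}
    return sorted(used - declared)


def unreachable(rules, root):
    """Containers that can never occur inside the outermost element."""
    seen, todo = set(), [root]
    while todo:
        name = todo.pop()
        if name in seen or name not in rules:
            continue
        seen.add(name)
        todo.extend(rules[name])
    return sorted(set(rules) - seen)


print(undeclared(RULES))
print(unreachable(RULES, "memo"))
```

An empty result from both checks suggests that the containers nest
cleanly under the outermost element.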
SGML has powerful facilities for co-ordinating various documentary
materials. It can link files to form composite documents and identify
where illustrations are to be incorporated in the text. DTDs can include
provisions for graphics files, SPDL or PostScript printing, or special
processing codes for mathematics. Applications can also include special
character sets for national languages. Special constructs can be
included as well. For example, if a document is published as part of
an on-line database (in addition to being published as a book), it
could include a container element for user comments and questions.
SGML parsers are designed to read and interpret DTDs and to construct
the logical hierarchical (tree) structure described by the DTD.
All documents marked up using the mark-up scheme described by
the DTD must conform to this structure. An SGML parser validates
a document by parsing it into its parse tree and comparing
this parse tree to the structure defined by the DTD.
Thus, to parse a document, an SGML parser first interprets the DTD
and then reads the document, parsing it into its parse tree and
thereby checking it for compliance. An SGML parser can be called
a meta-parser, as it constructs a specific parser for each
type of document described by a DTD.
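The two steps described above (interpret the structure rules, then
parse and check a document) can be sketched in Python. This is a toy
model under stated assumptions, not a real SGML parser: it handles only
fully tagged elements, with no attributes, entities or omitted tags.
The MODELS table, which writes each content rule as a regular
expression over child names, and the function names are all invented
for the example.

```python
import re

# Hypothetical content rules: a regular expression over the
# space-terminated names of an element's children.
MODELS = {
    "memo": r"(from )(date )(subject )(para )+",
    "from": r"(#text )?",
    "date": r"(#text )?",
    "subject": r"(#text )?",
    "para": r"(#text )?",
}

TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>|([^<]+)")


def parse(text):
    """Build a nested (name, children) tree from fully tagged mark-up."""
    root = ("#root", [])
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            if data.strip():
                stack[-1][1].append(("#text", data.strip()))
        elif closing:
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"missing end tag for <{stack[-1][0]}>")
    if len(root[1]) != 1:
        raise ValueError("exactly one outermost element expected")
    return root[1][0]


def validate(node, models):
    """Compare a parse tree with the content rules; return error strings."""
    name, children = node
    if name == "#text":
        return []
    if name not in models:
        return [f"undeclared element <{name}>"]
    sequence = "".join(child[0] + " " for child in children)
    errors = []
    if re.fullmatch(models[name], sequence) is None:
        errors.append(f"content of <{name}> does not match its model")
    for child in children:
        errors.extend(validate(child, models))
    return errors


def build_parser(models):
    """The 'meta-parser' step: return a checker for one document type."""
    return lambda text: validate(parse(text), models)


check_memo = build_parser(MODELS)
print(check_memo("<memo><from>Martin Bryan</from><date>5th November</date>"
                 "<subject>Cats and Dogs</subject>"
                 "<para>Keep all cats and dogs indoors.</para></memo>"))
```

Passing a different rule table to build_parser yields a checker for a
different document type, which is the sense in which the parser is
generic.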
SGML provides that any conforming installation can interpret mark-up
schemes specified in a DTD. Any document encoded in an SGML mark-up
scheme can be interpreted by an SGML parser and its structure
faithfully reconstructed in the parse tree. SGML parsers can be
integrated with other software to provide access to the structure
of SGML documents and for checking documents for DTD compliance.
An SGML document normally consists of three items: the SGML Declaration,
the DTD and the text of the document itself.
To allow the computer to correctly identify each component of
a document - what is a tag, what is content, etc. - SGML uses
an SGML Declaration to tell the computer which codes it
should use to identify the start and end of mark-up sequences.
The SGML standard contains a default SGML Declaration that is
used when no other declaration is supplied by the document preparer.
(This default declaration does not, however, allow many of the
optional features of SGML to be used within associated documents.)
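To illustrate why the delimiters need to be declared, the Python sketch
below parameterises a very small tokeniser by three delimiter roles.
STAGO, ETAGO and TAGC are the names SGML gives to the start-tag open,
end-tag open and tag close delimiters, and the strings shown for them
are those of the reference concrete syntax. The tokenize function and
the bracket-based variant are assumptions made for the example.

```python
import re

# Delimiter roles with their strings in the reference concrete syntax.
REFERENCE_DELIMITERS = {"STAGO": "<", "ETAGO": "</", "TAGC": ">"}

NAME = r"[A-Za-z][A-Za-z0-9.-]*"


def tokenize(text, delims=REFERENCE_DELIMITERS):
    """Split text into ('start', name), ('end', name), ('data', text)."""
    stago = re.escape(delims["STAGO"])
    etago = re.escape(delims["ETAGO"])
    tagc = re.escape(delims["TAGC"])
    # The end-tag alternative comes first so "</" is not read as "<".
    pattern = re.compile(f"{etago}({NAME}){tagc}|{stago}({NAME}){tagc}")
    tokens, pos = [], 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            tokens.append(("data", text[pos:match.start()]))
        if match.group(1):
            tokens.append(("end", match.group(1)))
        else:
            tokens.append(("start", match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens


# A hypothetical variant that uses square brackets as delimiters.
BRACKETS = {"STAGO": "[", "ETAGO": "[/", "TAGC": "]"}

print(tokenize("<from>Martin Bryan</from>"))
print(tokenize("[from]Martin Bryan[/from]", BRACKETS))
```

Both calls yield the same sequence of tokens, because the declared
delimiters, not the characters themselves, decide what counts as
mark-up.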
The Document Type Definition (DTD) describes each element of the
document in a format that the computer can understand, and defines
the relationships between the entities, elements and attributes
that make up a document.
As SGML does not define a default set of mark-up tags, the specification
of suitable sets of tags is left to users or special interest
groups such as trade associations. This
is desirable as different groups of users may have their own preferred
terminology; what might be called a <CHAP> or <chapter>
by one group might be called <section> by another. Commonly
used examples of DTD define:
When creating documents, systems that understand SGML provide
users with lists of the tag names that are valid at each point
in a document, and will automatically add relevant delimiters
to each piece of mark-up. Where the data capture system does not
understand SGML, users must either map the word processor's coding
scheme to the relevant SGML tag set or manually enter the SGML
tags as part of the text.
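How an SGML-aware editor might work out which tags are valid at a given
point can be sketched as follows. The Python example assumes a simple
sequential content rule, with each item carrying an occurrence
indicator ("1" for exactly once here, and "?", "+" and "*" as in SGML
for optional, one or more, and zero or more). The MEMO rule and the
function names are hypothetical.

```python
# Hypothetical content rule for a memo: ordered (name, occurrence) pairs.
MEMO = [("to", "?"), ("from", "1"), ("date", "1"),
        ("subject", "?"), ("para", "+")]


def expand(model):
    """Rewrite 'x+' as 'x' followed by 'x*' to simplify matching."""
    items = []
    for name, occurrence in model:
        if occurrence == "+":
            items += [(name, "1"), (name, "*")]
        else:
            items.append((name, occurrence))
    return items


def closure(items, states):
    """Add every position reachable by skipping optional items."""
    result = set()
    for i in states:
        result.add(i)
        while i < len(items) and items[i][1] in ("?", "*"):
            i += 1
            result.add(i)
    return result


def valid_next(model, entered):
    """Names that may be entered next, given the children so far."""
    items = expand(model)
    states = closure(items, {0})
    for name in entered:
        moved = set()
        for i in states:
            if i < len(items) and items[i][0] == name:
                # A repeatable item keeps its position; others advance.
                moved.add(i if items[i][1] == "*" else i + 1)
        if not moved:
            raise ValueError(f"<{name}> is not valid at this point")
        states = closure(items, moved)
    return sorted({items[i][0] for i in states if i < len(items)})


print(valid_next(MEMO, []))
print(valid_next(MEMO, ["from", "date"]))
```

An editor could present the returned names as a menu and reject any
tag for which valid_next raises an error.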
Because SGML tag sets are based on the logical structure of documents,
they are somewhat easier to understand and remember than mark-up
languages based on physical layout. Typically an SGML-encoded memo
might be marked up as follows:
<from>Martin Bryan
<date>5th November
<subject>Cats and Dogs
<para>Please remember to keep
all cats and dogs indoors tonight.
Users who have access to the full power of SGML's short reference
feature (see Section 6.10) will be able to use intrinsic features
of the document to reduce the number of SGML mark-up tags needed
to encode the document.
© Technology Appraisals Limited 1996