What is SGML?

This chapter gives a short introduction to the Standard Generalized Mark-up Language (SGML) [1]. SGML is an ISO standard for defining document structures for the application of mark-up schemes. It provides a consistent and precise manner of applying mark-up for describing the component parts of a document, enabling the exchange of revisable documents between different computer systems.

The SGML standard grew out of development work on generic coding and mark-up languages in the early 1970s. Various lines of research merged into the ISO subcommittee Text Description and Processing Languages, which produced the standard in 1986.

SGML is an open-ended definition reflecting the diversity of information needs in industry. It does not directly define any types of content data, and thus does not restrict the type of data contained in a document. It is flexible enough to be able to describe any logically structured set of information, whether it be a form, memo, letter, report, book, encyclopaedia, dictionary or even a spreadsheet or database.

SGML itself is not a mark-up scheme - it does not define mark-up tags, nor does it provide a template for a particular type of document - rather, it denotes a way of describing any mark-up scheme. By using SGML, many mark-up schemes can be developed, one for each document type or class. This is both a strength and a weakness of SGML.

Its flexibility to specify document structures for any application has led to its adoption by almost all industrial and institutional areas, and it is today by far the most widespread method of encoding and interchanging structured documents. However, this has meant that many organisations have developed their own ways of using SGML (in the form of Document Type Definitions, explained in Section 6.2.1), which are then used as restrictive ad hoc standards. This has led to a certain amount of confusion and duplication of effort. Today the situation is stabilising, and consensus between groups is producing more and more de facto and de jure SGML standards.

Scope and Application

SGML provides a coherent and unambiguous syntax for describing whatever a user chooses to identify within a document. It is a meta language developed from the requirements that:

a wide variety and range of text processing systems be able to accept SGML documents
choice of character set and national language not be confined
choice of data stream or file organisation be unlimited
marked up elements coexist with other data
mark-up be human as well as machine understandable.

Applications use SGML to define mark-up schemes. For example, a group of publishers could define a mark-up scheme for describing textbooks. Each publisher could then implement this textbook mark-up scheme to fit their own system. As long as documents conformed to this scheme, they could be interchanged at will. Any author writing to conform to the scheme could submit articles to any publisher. The article could then be automatically formatted using the publisher's own style.

In SGML-based environments, document formatting is handled at document presentation time, rather than at the time data is entered. The Document Style Semantics and Specification Language (DSSSL) can be used to interchange rules that control the presentation of SGML-encoded data. (See Chapter 7.)
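The separation of structure from presentation described above can be illustrated with a minimal Python sketch. It is not DSSSL and not part of any SGML tool; the element names, the two "house styles" and the `render` function are all hypothetical, chosen only to show one structured article being formatted in two ways at presentation time.

```python
# Hypothetical structured article: (element name, text) pairs in document order.
article = [
    ("title", "What is SGML?"),
    ("para", "SGML separates structure from presentation."),
]

# Two hypothetical house styles, each mapping an element name to a formatter.
style_a = {
    "title": lambda text: text.upper(),
    "para": lambda text: "    " + text,
}
style_b = {
    "title": lambda text: "== " + text + " ==",
    "para": lambda text: text,
}

def render(document, style):
    """Apply a style to the document at presentation time."""
    return "\n".join(style[name](text) for name, text in document)

output_a = render(article, style_a)
output_b = render(article, style_b)
```

The article itself is never changed; only the style passed to `render` differs, which is the property that lets each publisher apply its own formatting to the same interchanged document.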

Systems integrators exchanging large volumes of industrial information among various hardware/software platforms often choose SGML for encoding document information, particularly if their applications need an internationally accepted and widely supported basis. SGML is perhaps best known for its use in the publishing industry and for its part in the US Department of Defense's CALS (Continuous Acquisition and Life-cycle Support) program to create Interactive Electronic Technical Manuals (IETMs). The TEI (Text Encoding Initiative), a widely supported international effort by academic researchers to standardise the encoding of literary and other texts, has encoded gigabytes of data using SGML-defined formats. SGML applications can provide the necessary structure for OII exchanges that support communities of mutual interest, even when they involve many participants and several human languages.

SGML's ability to express context-free document descriptions makes it as intrinsically well suited to describing multimedia documents as to describing newspapers or textbooks. Its suitability has led many CD-ROM, electronic book, hypertext, and hypermedia vendors to use SGML as their document preparation standard. Perhaps the largest set of SGML-encoded files is that used on the World Wide Web. HTML, the hypertext mark-up language used in Web documents, is an application of SGML.

A special advantage of SGML encoded documents is the flexibility that the coding gives them to be reused for different purposes, and to interact with systems that didn't even exist when they were created.

SGML Concepts

A document inherently contains two types of information: structure and content. The structure is determined by the mark-up (tags). The content is the text contained between the tags. Mark-up denotes and encodes a document's logical element structure (chapters, sections, paragraphs, etc.) and provides the structure information needed to manipulate the document. Using SGML mark-up, documents are structured hierarchically, like a tree, with the top level being, for example, the BOOK, which is divided into CHAPTERs, which are divided into SECTIONs, and so on.
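The tree structure described above can be sketched in Python as follows. The `Element` class and `outline` function are hypothetical names introduced purely for illustration; only the BOOK, CHAPTER and SECTION element names come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """One node of the document tree: a named container with optional text."""
    name: str
    children: list = field(default_factory=list)
    text: str = ""

def outline(element, depth=0):
    """Return the element names of the tree, indented by nesting depth."""
    lines = ["  " * depth + element.name]
    for child in element.children:
        lines.extend(outline(child, depth + 1))
    return lines

book = Element("BOOK", [
    Element("CHAPTER", [
        Element("SECTION", text="First section text."),
        Element("SECTION", text="Second section text."),
    ]),
    Element("CHAPTER", [Element("SECTION", text="Another section.")]),
])

tree_lines = outline(book)
```

Here the structure lives in the nesting of `Element` objects, while the content lives in the `text` of the leaf elements, mirroring the distinction between mark-up and content.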

One of the reasons for the success of SGML is that it appears to be rather simple. It extends the idea of marking up text, which is a well known technology (see Annex), in a fashion that adds several powerful features without appearing to add any complexity. The ingenuity comes from the idea of providing a definition of the mark-up as a separate declaration which is known as the Document Type Definition. The key to understanding SGML is to understand the concept of a DTD.

SGML Document Type Definitions (DTDs)

SGML is used to create programs, called DTDs, that specify how to mark up a document to indicate its structure. Each DTD is a descriptive program defining the structure of a particular class of documents. DTD "programs" written using SGML define the logical element structure of an application's documents. The element names are used as mark-up tags to surround the content for each structural element.

Before a DTD can be created it is necessary to analyse typical examples of the documents to be encoded to determine the role of each of the information elements to be marked up. The aim of this document analysis is to identify nested sets of non-overlapping information containers (elements), starting with an outermost container element that is the published document. Each type of container (element) is assigned a generic identifier which acts as a class name. For each container class a set of rules is developed to show what may be placed in the container and the order in which contained objects must be entered. The relationships between the information containers are then documented in a form suitable for processing by a computer.
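The idea of a rule stating what may be placed in a container, and in what order, can be sketched in Python. This is not DTD syntax: the `CONTENT_MODELS` table and `children_allowed` function are hypothetical, and each rule is written as a regular expression over the names of the child elements. In a real DTD the memo rule would be an element declaration with a content model along the lines of (to, from, date, para+).

```python
import re

# Hypothetical content models: each maps a generic identifier to a regular
# expression over the sequence of child element names, one "name " per child.
CONTENT_MODELS = {
    # a memo holds: to, then from, then date, then one or more paras
    "memo": re.compile(r"to from date (para )+"),
}

def children_allowed(parent, child_names):
    """Check that the children appear in an order the parent's rule permits."""
    sequence = "".join(name + " " for name in child_names)
    return CONTENT_MODELS[parent].fullmatch(sequence) is not None

in_order = children_allowed("memo", ["to", "from", "date", "para", "para"])
wrong_order = children_allowed("memo", ["from", "to", "date", "para"])
missing_para = children_allowed("memo", ["to", "from", "date"])
```

The first call satisfies the rule; the second fails because the order is wrong, and the third because a required element is missing.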

SGML has powerful facilities for co-ordinating various documentary materials. It can link files to form composite documents and identify where illustrations are to be incorporated in the text. DTDs can include provisions for graphics files, SPDL or PostScript printing, or special processing codes for mathematics. Applications can also include special character sets for national languages. Special constructs can also be included: for example, if a document is published as part of an on-line database (in addition to being published as a book), it could include a container element for user comments and questions.

SGML Parser

As a process-independent language, SGML lets you validate any document instance before beginning potentially costly processing. The programs that validate SGML documents are called SGML parsers.

SGML parsers are designed to read and interpret DTDs and to construct the logical hierarchical (tree) structure described by the DTD. All documents marked up using the mark-up scheme described by the DTD must conform to this structure. An SGML parser validates a document by parsing it into its parse tree and comparing this parse tree to the structure defined by the DTD.

Thus, parsing a document with an SGML parser requires that the DTD first be interpreted by the parser; a document is then read in, parsed into its parse tree and thereby checked for compliance. An SGML parser can be called a meta-parser, as it constructs a specific parser for each type of document described by a DTD.
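The two-step process described above (interpret the DTD first, then check a document against it) can be sketched in Python. This is a simplified illustration, not how any particular SGML parser is implemented: the DTD is reduced to a hypothetical table of regular expressions over child element names, documents are given as already-built trees of (name, children) tuples, and `make_validator` and `book_dtd` are invented names.

```python
import re

def make_validator(dtd):
    """Interpret a DTD-like table once and return a checker for documents."""
    compiled = {name: re.compile(model) for name, model in dtd.items()}

    def validate(node):
        name, children = node
        if name not in compiled:
            return False  # element type was never declared
        sequence = "".join(child[0] + " " for child in children)
        if compiled[name].fullmatch(sequence) is None:
            return False  # children missing, extra, or out of order
        return all(validate(child) for child in children)

    return validate

# Hypothetical DTD: one rule per element type; "" means no child elements.
book_dtd = {
    "book": r"(chapter )+",
    "chapter": r"title (section )+",
    "section": r"(para )*",
    "title": r"",
    "para": r"",
}

check_book = make_validator(book_dtd)  # the "meta-parser" step

good = ("book", [("chapter", [("title", []), ("section", [("para", [])])])])
bad = ("book", [("chapter", [("section", []), ("title", [])])])

good_ok = check_book(good)
bad_ok = check_book(bad)
```

Passing a different table to `make_validator` yields a checker for a different class of documents, which is the sense in which the parser is generic and the DTD makes it specific.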

SGML ensures that any conforming installation can interpret mark-up schemes specified in a DTD. Any document encoded in an SGML mark-up scheme can be interpreted by an SGML parser and its structure faithfully reconstructed in the parse tree. SGML parsers can be integrated with other software to provide access to the structure of SGML documents and to check documents for DTD compliance.

The Components of SGML

SGML denotes physically stored documents as a set of nested entities. Each SGML entity contains one or more information containers, or elements, of logically related data (e.g., paragraphs, sections, headings and chapters). Elements can be assigned attributes (properties) describing their contents.

An SGML document normally consists of three items: the SGML Declaration, the DTD and the text of the document itself.

To allow the computer to correctly identify each component of a document - what is a tag, what is content, etc. - SGML uses an SGML Declaration to tell the computer which codes it should use to identify the start and end of mark-up sequences. The SGML standard contains a default SGML Declaration that is used when no other declaration is supplied by the document preparer. (This default declaration does not, however, allow many of the optional features of SGML to be used within associated documents.)
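The role of the declaration in telling the computer what counts as mark-up can be sketched in Python. The delimiter role names STAGO (start-tag open), ETAGO (end-tag open) and TAGC (tag close) are those used by the SGML standard, and the default values shown are the familiar ones from its reference concrete syntax; the `tokenize` function and the alternative delimiter set are hypothetical, and the sketch ignores attributes, entities and every other SGML feature.

```python
# Delimiter roles as named in the SGML standard, with their familiar defaults.
REFERENCE_DELIMITERS = {"STAGO": "<", "ETAGO": "</", "TAGC": ">"}

def tokenize(text, delims=REFERENCE_DELIMITERS):
    """Split text into ("start", name), ("end", name) and ("data", text) tokens."""
    stago, etago, tagc = delims["STAGO"], delims["ETAGO"], delims["TAGC"]
    tokens = []
    pos = 0
    while pos < len(text):
        opener = None
        if text.startswith(etago, pos):  # test ETAGO first: it begins with STAGO
            kind, opener = "end", etago
        elif text.startswith(stago, pos):
            kind, opener = "start", stago
        if opener is not None:
            close = text.find(tagc, pos + len(opener))
            if close != -1:
                tokens.append((kind, text[pos + len(opener):close]))
                pos = close + len(tagc)
                continue
        # Otherwise collect data up to the next possible tag opener.
        candidates = [i for i in (text.find(stago, pos + 1),
                                  text.find(etago, pos + 1)) if i != -1]
        end = min(candidates) if candidates else len(text)
        tokens.append(("data", text[pos:end]))
        pos = end
    return tokens

default_tokens = tokenize("<para>Hello</para>")

# A hypothetical declaration that swaps in square-bracket delimiters.
bracket_delims = {"STAGO": "[", "ETAGO": "[/", "TAGC": "]"}
bracket_tokens = tokenize("[para]a<b[/para]", bracket_delims)
```

With the bracket delimiters in force, the "<" in the second example is ordinary data rather than the start of a tag, which is why the parser must know the declaration before it can separate mark-up from content.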

The Document Type Definition (DTD) describes each element of the document in a format that the computer can understand, and defines the relationships between the entities, elements and attributes that make up a document.

As SGML does not define a default set of mark-up tags, the specification of suitable sets of tags is left to users or special interest groups such as trade associations. This is desirable as different groups of users may have their own preferred terminology; what might be called a <CHAP> or <chapter> by one group might be called <section> by another. Commonly used DTDs define:

mark-up tags for encoding books and journal articles according to the rules specified by the American Association of Publishers (AAP) and the European Working Group on SGML
mark-up tags for encoding electronic manuals for computer systems (the Davenport DOCBOOK DTD)
mark-up tags for encoding aircraft and car maintenance manuals (the ATA100 and J2008 DTDs)
mark-up tags for electronically encoding literary works and artistic performances (the DTDs developed as part of the international Text Encoding Initiative).

How is SGML Used?

To mark up a document using a mark-up scheme that has been defined in a DTD, document creators need to understand:

how mark-up tags are distinguished from text, and
the order in which the various elements making up the document must be entered.

When creating documents, systems that understand SGML provide users with lists of the tag names that are valid at each point in a document, and will automatically add relevant delimiters to each piece of mark-up. Where the data capture system does not understand SGML, users must either map the word processor's coding scheme to the relevant SGML tag set or manually enter the SGML tags as part of the text.

Because SGML tag sets are based on the logical structure of documents, they are somewhat easier to understand and remember than physically-based layout mark-up languages. Typically an SGML-encoded memo might be marked up as follows:

<to>All staff
<from>Martin Bryan
<date>5th November
<para>Please remember to keep all cats and dogs indoors tonight.
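A memo of this kind, in which end tags are left out, can be read with a short Python sketch. It assumes that the memo's DTD permits end tags to be omitted, so that each start tag implicitly closes the previous element; it is not an SGML parser, and the `read_flat_elements` function is a hypothetical helper that handles only a flat run of start tags.

```python
import re

MEMO = """<to>All staff
<from>Martin Bryan
<date>5th November
<para>Please remember to keep all cats and dogs indoors tonight."""

def read_flat_elements(text):
    """Return (element name, content) pairs for a flat run of start tags."""
    # re.split with one capturing group yields:
    # [text before first tag, name1, content1, name2, content2, ...]
    parts = re.split(r"<([A-Za-z][A-Za-z0-9.-]*)>", text)
    names = parts[1::2]
    contents = [content.strip() for content in parts[2::2]]
    return list(zip(names, contents))

memo_fields = read_flat_elements(MEMO)
```

The element names recovered here (to, from, date, para) describe what each piece of the memo is, not how it should look.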

Users who have access to the full power of SGML's short reference feature (see Section 6.10) will be able to use intrinsic features of the document to reduce the number of SGML mark-up tags needed to encode the document.

Technology Appraisals Ltd
webmaster@techapps.co.uk
Phone +44 (0)181 893 3986
Fax +44 (0) 181 744 1149
82 Hampton Road, Twickenham
TW2 5QS UK