[Mirrored from: http://dlar.fcit.monash.edu.au/~dfoott/2800/lect08.html]

COT1800 Public Networks
Lecture 8 SGML

Standard Generalised Markup Language

`Found IT,' the Mouse replied rather crossly: `of course you know what "it" means.'

`I know what "it" means well enough, when I find a thing,' said the Duck: `it's generally a frog or a worm. The question is, what did the archbishop find?'

Banal Markup

All printed texts are encoded. There are conventions that are applied to printed text to indicate all sorts of functions and operations within the text. Often these relate to the speech origin of text, with marks to indicate breathing, pauses for effect, and so on.

Punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might be regarded as a kind of markup, the function of which is to help the human reader determine where words and concepts end or to identify broad structural features.

Document Markup

Editorial Markup to Typesetting

Over a period of time, a standard set of symbols was developed and used by copy editors to communicate with typesetters.

As typesetting functions became computerized, text formatting languages were written. A typesetter would convert the copy editor's markup into the appropriate markup for the text formatting language being used.

Word Processing

One of the earliest applications of computing was the setting of text. Because the rules could be codified, they lent themselves to computing. In particular, the typesetter's tasks, though they required considerable human skill, could be performed easily and quickly by machine.
Early text-setting or typesetting programs were the parents of word processors.
Some general features of word processors [as mark-up applications] include

Conversion Utilities

The Generic Coding Concept

Historically, electronic manuscripts contained control codes or macros that caused the document to be formatted in a particular way ('specific coding').

In contrast, generic coding, which began in the late 1960s, uses descriptive tags (for example, 'heading' rather than 'format-17').
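
The difference can be sketched in a few lines of Python. This is an illustration only: the sample manuscript, the tag names and both style tables are hypothetical, and the matching is far simpler than any real formatter. The point is that the manuscript says what each part is, and the formatting decision is made separately, so the same text can be set in more than one way.

```python
import re

# Hypothetical generic-coded manuscript: the tags say what each part IS,
# not how it should look.
MANUSCRIPT = "<heading>Banal Markup</heading><para>All printed texts are encoded.</para>"

# Two hypothetical style tables that map the same descriptive tags to
# different formatting rules.
PLAIN_STYLE = {"heading": lambda t: t.upper(), "para": lambda t: "  " + t}
MARKED_STYLE = {"heading": lambda t: "== " + t + " ==", "para": lambda t: t}

def render(manuscript, style):
    """Format each <tag>text</tag> element with the rule the style gives for that tag."""
    elements = re.findall(r"<(\w+)>(.*?)</\1>", manuscript)
    return [style[tag](text) for tag, text in elements]

plain = render(MANUSCRIPT, PLAIN_STYLE)
marked = render(MANUSCRIPT, MARKED_STYLE)
```

With specific coding, the formatting instruction itself ('format-17') would be embedded in the manuscript, so changing the look would mean editing the text.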

3 Important Stages

In September 1967, during a meeting at the Canadian Government Printing Office, William Tunnicliffe, chairman of the Graphic Communications Association (GCA) Composition Committee, emphasised separating the information content of documents from their format.

Also in the late 1960s, a New York book designer named Stanley Rice proposed the idea of a universal catalog of parameterized 'editorial structure' tags.

Norman Scharpf, director of the GCA, recognized the significance of these trends and established a generic coding project in the Composition Committee.

Charles Goldfarb

In 1969, Charles Goldfarb was leading an IBM research project on integrated law office information systems.

Goldfarb, with Edward Mosher and Raymond Lorie, invented the Generalized Markup Language (GML) as a means of allowing the text editing, formatting, and information retrieval subsystems to share documents.


GML (which, not coincidentally, comprises the initials of its three inventors) was based on the generic coding ideas of Rice and Tunnicliffe.

Instead of a simple tagging scheme, however, GML introduced the concept of a formally defined document type with an explicit nested element structure.
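
A rough sketch of what a formally defined document type with nested elements means in practice is given below, in Python. Everything in it is an assumption made for illustration: the 'memo' document type, its element names and the tuple representation are hypothetical and are not GML syntax. The check is also deliberately simplified: it tests only which elements may appear inside which, not their order or how often they occur.

```python
# Hypothetical document type: each element name maps to the set of children
# it may contain. "#text" stands for plain character data.
MEMO_TYPE = {
    "memo": {"to", "from", "body"},
    "to": {"#text"},
    "from": {"#text"},
    "body": {"para"},
    "para": {"#text"},
}

def is_valid(element, doc_type):
    """Return True if every element contains only children its type allows."""
    name, children = element
    allowed = doc_type.get(name)
    if allowed is None:  # element is not declared in the document type
        return False
    for child in children:
        if isinstance(child, str):  # character data
            if "#text" not in allowed:
                return False
        elif child[0] not in allowed or not is_valid(child, doc_type):
            return False
    return True

# An element is written as (name, [children]); children are strings or elements.
good = ("memo", [("to", ["Duck"]), ("from", ["Mouse"]),
                 ("body", [("para", ["Found it."])])])
bad = ("memo", [("para", ["Found it."])])  # 'para' is not allowed directly in 'memo'

good_ok = is_valid(good, MEMO_TYPE)
bad_ok = is_valid(bad, MEMO_TYPE)
```

A simple tagging scheme would accept both documents. A defined document type lets software reject the second one because its structure is wrong.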

Development of SGML as an International Standard

In 1978, the American National Standards Institute (ANSI) committee on Information Processing established the Computer Languages for the Processing of Text committee.

Goldfarb was asked to join the committee and eventually to lead a project for a text description language standard based on GML.

International Standard

Early Applications of SGML

3 Elements of SGML

You can break a typical document into three layers: structure, content, and style.

SGML separates these three aspects, but deals mainly with the relationship between structure [DTD] and content [tagged material].

Structure - the DTD [Document Type Definition]


SGML Declaration

Advantages of SGML


Many tags can be declared but typical are:

What it looks like

This should look fairly familiar to you. HTML, the markup used in WWW documents, is an application of SGML: it has its own limited DTD, which is in effect built into each Web browser, and its markup uses a limited set of tags.

<text> <tPage>
<dTitle type=main>PARADISE LOST</dTitle>
<dAuthor>John Milton</dAuthor>
<div type='book' n=B1>
<head>BOOK I.</head>
<l> Of Mans First Disobedience, and the Fruit
<l>Of that Forbidden Tree, whose mortal tast
<l>Brought Death into the World, and all our woe,
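
In the sample above the verse lines are opened with <l> but never closed. SGML permits end tags to be omitted when the DTD allows it, so software reading the text has to infer where each line ends. The Python sketch below is an assumption-laden illustration, not an SGML parser: it treats each <l> as running until the next tag or the end of the text, which is enough for this excerpt but not for markup in general.

```python
import re

# Excerpt copied from the sample; the <l> elements have no end tags.
SAMPLE = """<head>BOOK I.</head>
<l> Of Mans First Disobedience, and the Fruit
<l>Of that Forbidden Tree, whose mortal tast
<l>Brought Death into the World, and all our woe,"""

def verse_lines(text):
    """Collect the content of each <l> element, ending it at the next tag or end of text."""
    return [content.strip() for content in re.findall(r"<l>([^<]*)", text)]

lines = verse_lines(SAMPLE)
```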

A typical DTD segment

This is from the Text Encoding Initiative Document Type Definition, a specific set of DTDs which have been developed to handle the majority of literary modes [novels, poetry, drama, etc]. The intention is that the DTD need not be transmitted with a document: the header simply announces that the TEI DTD is used, and the TEI DTD itself is widely distributed.

The TEI DTDs allow the production of proper scholarly editions of works, complete with alternate readings, and allow the document to be presented as an authorised edition would appear. This moves electronic publishing towards matching printed editions: the material presented on screen [and, if necessary, printed] will look like the recognised print edition.

<!DOCTYPE ota SYSTEM "ota.dtd" [
<!ENTITY % OTAents system "unixlat0.dtd">
<!ENTITY file1 SYSTEM "plost.1827">
<ota n="1827" crdate="1993-02-02" update="1993-07-19">
<title>Milton's "Paradise Lost": electronic edition</title>
Public Domain TEI edition prepared at the Oxford Text Archive
Filesize uncompressed: 498 Kbytes.<pubStmt>