[Mirrored from: http://dlar.fcit.monash.edu.au/~dfoott/2800/lect08.html]

COT1800 Public Networks
Lecture 8 SGML

Standard Generalized Markup Language

`Found IT,' the Mouse replied rather crossly: `of course you know what "it" means.'

`I know what "it" means well enough, when I find a thing,' said the Duck: `it's generally a frog or a worm. The question is, what did the archbishop find?'

Banal Markup

All printed texts are encoded. There are conventions that are applied to printed text to indicate all sorts of functions and operations within the text. Often these relate to the speech origin of text, with marks to indicate breathing, pauses for effect, and so on.

Punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might be regarded as a kind of markup, the function of which is to help the human reader determine where words and concepts end or to identify broad structural features.

Document Markup

Document markup is
- the process of adding codes to a document to identify the structure or the format in which it is to appear.
- a communication form that has existed for many years. Until the computerization of the printing industry, markup was primarily done by a copy editor writing instructions on a manuscript for a typesetter to follow.
  These included instructions on the typeface and size to be used, whether bold or italics were required, and any underlining, spacing, hyphenation, indentation and so on.
  Every facet of the appearance of the printed work was represented in code to the typesetter.

Editorial Markup to Typesetting

Over a period of time, a standard set of symbols was developed and used by copy editors to communicate with typesetters.

As typesetting functions became computerized, text formatting languages were written. A typesetter would convert the copy editor's markup into the appropriate markup for the text formatting language being used.

Word Processing

One of the earliest applications of computing was the setting of text. Because the rules could be codified, they lent themselves to computing. In particular, the typesetter's tasks, though requiring considerable human skill, could be performed easily and quickly by machine.
Early text-setting and typesetting programs were the parents of word processors.
Some general features of word processors [as mark-up applications] include:

- Each word processing program had, and still has, its own [proprietary] method of markup.
  For a long time this meant that documents prepared with one word processing tool could not be read by others. Documents were often migrated in a "lowest common denominator" form, typically as ASCII text with the formatting lost, and then reformatted in the new environment.
- Most electronic devices which store text for later recall and output use some form of markup.
  Even when text is stored as unformatted ASCII, some markup elements are still retained: spaces, commas and other punctuation, line-feed and carriage-return characters (at least at the end of paragraphs), beginning- and end-of-document markers, etc.
- The markup may or may not be apparent to the user.
- The markup may be visible or hidden, entered by the user or automatically generated.
- In some cases document markup takes the form of alphanumeric text characters and in other cases it is stored as binary data.
- In most cases, some form of delimiter is used to indicate the beginning of the markup and, optionally, the end of the markup.
- Although the power of a word processing program to format a document is impressive, it is also a burden when one wants to switch from one platform [software or hardware] to another.
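
As a rough illustration of the delimiter idea, the sketch below separates markup from text in a string. It assumes, purely for the example, that markup begins with "<" and ends with ">"; real word processors use many different delimiters, often binary ones. The name split_markup is hypothetical.

```python
import re

# Assumption for this sketch: markup starts with "<" and ends with ">".
MARKUP = re.compile(r"<[^<>]*>")

def split_markup(source):
    """Return (markup_codes, plain_text) for a delimited string."""
    codes = MARKUP.findall(source)   # every delimited code, in order
    text = MARKUP.sub("", source)    # what is left once the codes are removed
    return codes, text

codes, text = split_markup("<bold>Found IT,</bold> the Mouse replied")
print(codes)
print(text)
```

Stripping the codes in this way is essentially the "lowest common denominator" migration: the text survives but the formatting is lost.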

Conversion Utilities

To move documents between 2 word processors, 2 converters are needed, but for 4 the number is 12 [the general formula is n*(n-1)].
If a common [intermediate] standard is accepted, the number of converters needed is 2n: one into and one out of the common language for each word processor.
Clearly there are significant advantages in the existence and use of a common base encoding system, even if only in terms of moving documents between platforms while retaining the inherent structure of the document.
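
The two formulas can be compared with a few lines of Python. This is only a sketch of the arithmetic above; the function names are made up for the example.

```python
def direct_converters(n):
    """Converters needed when each of n programs converts directly to every other."""
    return n * (n - 1)

def hub_converters(n):
    """Converters needed when each program converts to and from one common format."""
    return 2 * n

# Print n, the direct count, and the common-format count side by side.
for n in (2, 3, 4, 10):
    print(n, direct_converters(n), hub_converters(n))
```

The two counts are equal at n = 3, and the common format needs fewer converters for every n above that, with the gap widening quickly as n grows.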

The Generic Coding Concept

Historically, electronic manuscripts contained control codes or macros that caused the document to be formatted in a particular way ('specific coding').

In contrast, generic coding, which began in the late 1960s, uses descriptive tags (for example, 'heading', rather than 'format-17').
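
One way to picture the difference is a sketch in which the document carries only descriptive tags and a separate table decides how each tag is formatted. The tag names, the STYLE_SHEET table and its values are all assumptions made for this illustration, not part of any standard.

```python
# Hypothetical style sheet: maps descriptive (generic) tags to formatting.
STYLE_SHEET = {
    "heading": {"font": "Helvetica", "size": 18, "bold": True},
    "para": {"font": "Times", "size": 11, "bold": False},
}

def format_for(tag, style_sheet=STYLE_SHEET):
    """Look up the formatting that a descriptive tag should receive."""
    return style_sheet[tag]

# The document itself says only what each piece of text is, not how it looks.
document = [
    ("heading", "Document Markup"),
    ("para", "Document markup is the process of adding codes to a document."),
]

for tag, text in document:
    print(tag, format_for(tag), text)
```

Changing an entry in the table restyles every heading without touching the document, which is the separation that specific codes such as 'format-17' do not give.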

3 Important Stages

In September 1967, during a meeting at the Canadian Government Printing Office, William Tunnicliffe, chairman of the Graphic Communications Association (GCA) Composition Committee, emphasised the separation of the information content of documents from their format.

Also in the late 1960s, a New York book designer named Stanley Rice proposed the idea of a universal catalog of parameterized 'editorial structure' tags.

Norman Scharpf, director of the GCA, recognized the significance of these trends, and established a generic coding project in the Composition Committee.

Charles Goldfarb

In 1969, Charles Goldfarb was leading an IBM research project on integrated law office information systems.

Goldfarb, with Edward Mosher and Raymond Lorie, invented the Generalized Markup Language (GML) as a means of allowing the text editing, formatting, and information retrieval subsystems to share documents.

GML

GML (which, not coincidentally, comprises the initials of its three inventors) was based on the generic coding ideas of Rice and Tunnicliffe.

Instead of a simple tagging scheme, however, GML introduced the concept of a formally-defined document type with an explicit nested element structure.

Development of SGML as an International Standard

In 1978, the American National Standards Institute (ANSI) committee on Information Processing established the Computer Languages for the Processing of Text committee.

Goldfarb was asked to join the committee and eventually to lead a project for a text description language standard based on GML.

International Standard

In 1985, a draft proposal for an international standard was published.
The international SGML Users' Group was founded in the UK by Joan Smith, who became its first president.
A draft international standard was published in October 1985, and approved in 1986 (ISO 8879:1986).
Other national standard bodies have also adopted the standard, usually unaltered from the ISO standard. Australia is no exception.

Early Applications of SGML

Two early applications were developed with broad participation:
- the Electronic Manuscript Project of the Association of American Publishers (AAP),
- the documentation component of the Computer-aided Acquisition and Logistic Support (CALS) initiative of the US Department of Defense.

3 Elements of SGML

You can break a typical document into three layers: structure, content, and style.

SGML separates these three aspects, but deals mainly with the relationship between structure [DTD] and content [tagged material].

Structure - the DTD [Document Type Definition]

The DTD
- describes the structure of a document, much as a database schema describes the types of information it handles and the relationships between fields.
- provides a framework for the elements that constitute a document.
- specifies rules for the relationships between elements, e.g. "a chapter heading must be the first element after the start of a chapter".
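
The kind of rule quoted above can be sketched in Python as a check on the sequence of elements inside a chapter. The element names heading and para, and the model "one heading followed by one or more paragraphs", are assumptions made for this example; in SGML itself such a rule would be written as an element declaration in the DTD, along the lines of <!ELEMENT chapter - - (heading, para+)>, and checked by a validating parser.

```python
import re

# Hypothetical content model for a chapter: one heading, then one or more paras.
CHAPTER_MODEL = re.compile(r"heading( para)+")

def is_valid_chapter(children):
    """Check a list of child element names against the chapter content model."""
    return CHAPTER_MODEL.fullmatch(" ".join(children)) is not None

print(is_valid_chapter(["heading", "para", "para"]))
print(is_valid_chapter(["para", "heading"]))
```

The first call satisfies the model; the second does not, because the heading is not the first element of the chapter.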

Content

The information itself: content includes titles, paragraphs, lists, tables, graphics, and audio.
Identifying the content's position within the DTD structure is called "tagging."
Creating an SGML document involves inserting tags around content. These tags mark the beginning and end of each part of the structure.
There are, of course, authoring tools that enable tags to be "automatically" added to the text.
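
A minimal sketch of what such tagging amounts to is shown below: a start tag and a matching end tag are placed around each piece of content, and tagged pieces nest inside larger ones. The helper name tag and the element names used here are hypothetical.

```python
def tag(name, content):
    """Wrap content in a start tag and the matching end tag."""
    return "<{0}>{1}</{0}>".format(name, content)

# Tag the individual pieces, then nest them inside an enclosing element.
title = tag("title", "PARADISE LOST")
author = tag("author", "John Milton")
front = tag("front", title + author)
print(front)
```

An authoring tool does much the same thing on the author's behalf, choosing the tag from the position of the content in the structure.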

SGML Declaration

Declares the document to be an SGML document
Specifies if a particular syntax has been used
Defines character sets that are used

Advantages of SGML

Machine/platform independence
Portability over time
Directly appropriate for information retrieval and hypertext retrieval
International standard and character sets

What it looks like

This should look fairly familiar to you. HTML, the markup used in WWW documents, is an application of SGML: it has a limited DTD, which is in effect built into each Web browser, and its markup uses a limited set of tags.

<text>
<front>
<tPage>
<dTitle type=main>PARADISE LOST</dTitle>
<byLine>by
<dAuthor>John Milton</dAuthor>
</byLine>
</tPage>
</front>
<body>
<div type='book' n=B1>
<head>BOOK I.</head>
<l> Of Mans First Disobedience, and the Fruit
<l>Of that Forbidden Tree, whose mortal tast
<l>Brought Death into the World, and all our woe,

A typical DTD segment

This is from the Text Encoding Initiative Document Type Definition, a specific set of DTDs which have been developed to handle the majority of literary modes [novels, poetry, drama, etc.]. The intention is that it may not be necessary to transmit the DTD with the document if it is simply announced in the header that the TEI DTD is used, and the TEI DTD is widely distributed.
The TEI DTDs allow the production of proper scholarly editions of works, complete with alternate readings, and allow the document to be presented as an authorised edition would appear. This is moving electronic publishing into the realms of being able to match printed editions. The material presented on screen [and, if necessary, printed] will look like the recognised print edition.

<!DOCTYPE ota SYSTEM "ota.dtd" [
<!ENTITY % OTAents system "unixlat0.dtd">
%OTAents
<!ENTITY file1 SYSTEM "plost.1827">
]>
<ota n="1827" crdate="1993-02-02" update="1993-07-19">
<header>
<fileDesc>
<titlStmt>
<title>Milton's "Paradise Lost": electronic edition</title>
<edStmt>
Public Domain TEI edition prepared at the Oxford Text Archive
<extent>
Filesize uncompressed: 498 Kbytes.<pubStmt>
