[Mirrored from: http://dlar.fcit.monash.edu.au/~dfoott/2800/lect08.html]
COT1800 Public Networks
Lecture 8 SGML
Standard Generalised Markup Language
`Found IT,' the Mouse replied rather crossly: `of course
you know what "it" means.'
`I know what "it" means well enough, when I find
a thing,' said the Duck: `it's generally a frog or a worm. The
question is, what did the archbishop find?'
Banal Markup
All printed texts are encoded. There are conventions that are applied to printed text to indicate all sorts of functions and operations within the text. Often these relate to the speech origin of text, with marks to indicate breathing, pauses for effect, and so on.
Punctuation marks, use of capitalization, disposition of
letters around the page, even the spaces between words, might
be regarded as a kind of markup, the function of which is to help
the human reader determine where words and concepts end or to
identify broad structural features.
Document Markup
Document markup is
- the process of adding codes to a document to identify the
structure or the format in which it is to appear.
- a communication form that has existed for many years. Until the
computerization of the printing industry, markup was primarily
done by a copy editor writing instructions on a manuscript for
a typesetter to follow.
These included instructions on the font and size to be used, on whether bold or italics were required, and on underlining, spacing, hyphenation, indentation and so on.
Every facet of the appearance of the printed work was represented in code to the typesetter.
Editorial Markup to Typesetting
Over a period of time, a standard set of symbols was developed
and used by copy editors to communicate with typesetters.
As typesetting functions became computerized, text formatting
languages were written. A typesetter would convert the copy editor's
markup into the appropriate markup for the text formatting language
being used.
Word Processing
One of the earliest applications of computing was the setting of text. Because the rules could be codified, they lent themselves to computation. In particular the typesetter's tasks, though they required considerable human skill, could be performed easily and quickly by machine.
Early text-setting or typesetting programs were the parents of word processors.
Some general features of word processors [as mark-up applications] include
- Each word processing program had/has its own [proprietary] method of markup.
For a long time this meant that documents prepared with one word processing tool could not be read by others. Documents were often migrated in a "lowest common denominator" form, typically as ASCII text with the formatting lost, and then reformatted in the new environment.
- Most electronic devices which store text for later recall
and output use some form of markup.
Even when text is stored as unformatted ASCII code there are still some markup elements retained - spaces, commas and other punctuation, line-feed and carriage-return characters, at least at the end of paragraphs, beginning and end of document markers, etc.
- The markup may or may not be apparent to the user.
- The markup may be visible, hidden, entered by the user, or
automatically generated.
- In some cases document markup takes the form of alphanumeric
text characters and in other cases it is stored as binary data.
- In most cases, some form of delimiter is used to indicate
the beginning of the markup and optionally the end of the markup.
Although the power of a word processing program to format a document
is impressive, it is also a burden when one wants to switch from one platform [software or hardware] to another.
Conversion Utilities
- To move documents between two word processors, 2 convertors are
needed (one in each direction), but for 4 the number is 12 [the general formula is n*(n-1)]
- If a common [intermediate] standard is accepted, the number of convertors,
using the common language, is 2n (one into and one out of the common form for each word processor)
Clearly there are significant advantages in the existence and use of a common base encoding system, even if only in terms of moving documents between platforms while retaining the inherent structure of the document.
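The two counts can be checked with a few lines of Python. This is only a minimal sketch: the function names `pairwise_converters` and `hub_converters` are illustrative and not part of the lecture, and each convertor is assumed to work in one direction only.

```python
def pairwise_converters(n: int) -> int:
    """One-way convertors needed so that each of n formats converts directly to every other."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """One-way convertors needed when every format converts to and from one common format."""
    return 2 * n


if __name__ == "__main__":
    # Print n, the direct count and the common-format count side by side.
    for n in (2, 4, 10):
        print(n, pairwise_converters(n), hub_converters(n))
```

The two approaches cost the same at n = 3 (6 convertors each). Beyond that the common format wins quickly: for 10 word processors it is 20 convertors rather than 90.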
The Generic Coding Concept
Historically, electronic manuscripts contained control codes or
macros that caused the document to be formatted in a particular
way ('specific coding').
In contrast, generic coding, which began in the late 1960s, uses
descriptive tags (for example, 'heading', rather than 'format-17').
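The difference can be illustrated with a small Python sketch. Everything in it is hypothetical (the tag names, the two style tables and the `render` function are invented for illustration). The point is only that a generically coded document records what each piece of text is, while the formatting is looked up separately for each output device.

```python
# Hypothetical style sheets: each maps a descriptive (generic) tag to
# formatting instructions for one output device.
PRINT_STYLE = {"heading": "font=Times size=18 bold", "para": "font=Times size=10"}
SCREEN_STYLE = {"heading": "font=Helvetica size=24", "para": "font=Helvetica size=12"}

# The document itself records only what each piece of text *is*.
DOCUMENT = [
    ("heading", "Document Markup"),
    ("para", "All printed texts are encoded."),
]


def render(document, style):
    """Pair each piece of text with the formatting its generic tag maps to."""
    return [f"[{style[tag]}] {text}" for tag, text in document]


if __name__ == "__main__":
    # The same document is formatted twice without being edited.
    for line in render(DOCUMENT, PRINT_STYLE) + render(DOCUMENT, SCREEN_STYLE):
        print(line)
```

With specific coding, the formatting instructions would be embedded in the document itself, so retargeting it to another device would mean editing every code.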
3 Important Stages
William Tunnicliffe, chairman of the Graphic Communications Association
(GCA) Composition Committee, during a meeting at the Canadian
Government Printing Office in September 1967 emphasised the separation
of the information content of documents from their format.
Also in the late 1960s, a New York book designer named Stanley
Rice proposed the idea of a universal catalog of parameterized
'editorial structure' tags.
Norman Scharpf, director of the GCA, recognized the significance
of these trends, and established a generic coding project in the
Composition Committee.
Charles Goldfarb
In 1969, Charles Goldfarb was leading an IBM research project
on integrated law office information systems.
Goldfarb, with Edward Mosher and Raymond Lorie, invented the Generalized
Markup Language (GML) as a means of allowing the text editing,
formatting, and information retrieval subsystems to share documents.
GML
GML (which, not coincidentally, comprises the initials of its
three inventors) was based on the generic coding ideas of Rice
and Tunnicliffe.
Instead of a simple tagging scheme, however, GML introduced the
concept of a formally-defined document type with an explicit nested
element structure.
Development of SGML as an International Standard
In 1978, the American National Standards Institute (ANSI) committee
on Information Processing established the Computer Languages for
the Processing of Text committee.
Goldfarb was asked to join the committee and eventually to lead
a project for a text description language standard based on GML.
International Standard
- In 1985, a draft proposal for an international standard was
published.
- The international SGML Users' Group was founded in the
UK by Joan Smith, who became its first president.
- A draft international standard was published in October 1985,
and approved in 1986 (ISO 8879:1986).
Other national standard bodies have also adopted the standard, usually unaltered from the ISO standard. Australia is no exception.
Early Applications of SGML
Two early applications were developed with very broad participation:
- the Electronic Manuscript Project of the Association of American
Publishers (AAP),
- the documentation component of the Computer-aided Acquisition
and Logistic Support (CALS) initiative of the US Department of
Defense.
3 Elements of SGML
You can break a typical document into three layers: structure,
content, and style.
SGML separates these three aspects, but deals mainly with the
relationship between structure [DTD] and content [tagged material].
Structure - the DTD [Document Type Definition]
- describes the structure of a document, much like a database schema
describes the types of information it handles and the relationships
between fields.
- provides a framework for the elements that constitute a document.
- also specifies rules for the relationships between elements;
e.g. "a chapter heading must be the first element after the
start of a chapter"
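A rule of this kind can be imitated in Python. The sketch below is not how an SGML parser works. It assumes a hypothetical `chapter` element whose content must be one `heading` followed by one or more `para` elements, and it expresses that model as a regular expression over the names of the child elements. The names `CONTENT_MODELS` and `is_valid` are invented for the example.

```python
import re

# Hypothetical content models, written as regular expressions over the
# sequence of child element names (each name followed by one space).
# In DTD notation the rule below would read roughly:
#   <!ELEMENT chapter - - (heading, para+)>
CONTENT_MODELS = {
    "chapter": re.compile(r"heading (para )+"),
}


def is_valid(element, children):
    """Return True if the child element names match the element's content model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in children)
    return model.fullmatch(sequence) is not None


if __name__ == "__main__":
    print(is_valid("chapter", ["heading", "para", "para"]))  # heading comes first
    print(is_valid("chapter", ["para", "heading"]))          # heading is not first
```

The first call satisfies the rule and the second violates it, because the heading is not the first element of the chapter.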
Content
- The information itself: content includes titles, paragraphs,
lists, tables, graphics, and audio.
- Identifying the content's position within the DTD structure
is called "tagging."
- Creating an SGML document involves inserting tags around content.
These tags mark the beginning and end of each part of the structure.
There are, of course, authoring tools that enable tags to be added to the text "automatically".
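As a rough illustration of tagging, the Python sketch below wraps content in a start tag and an end tag. The helper `tag` is hypothetical, and the element names are borrowed from the Paradise Lost sample later in this lecture. A real authoring tool would also check the result against the DTD, which this sketch does not attempt.

```python
def tag(name, content):
    """Wrap content in a start tag and an end tag for the element `name`."""
    return f"<{name}>{content}</{name}>"


# Build a title page from the inside out: nested calls give nested elements.
title_page = tag(
    "tPage",
    tag("dTitle", "PARADISE LOST")
    + tag("byLine", "by " + tag("dAuthor", "John Milton")),
)

if __name__ == "__main__":
    print(title_page)
```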
SGML Declaration
- Declares the Document to be an SGML document
- Specifies if a particular syntax has been used
- Defines character sets that are used
Advantages of SGML
- Machine/platform independence
- Portability over time
- Directly appropriate for information retrieval and hypertext
retrieval
- International standard and character sets
Tags
Many tags can be declared, but typical ones are:
- <abstract>
- <author>
- <cit> for citation
- <hdref> for cross reference to heading
- <li> for a list of items
- <top1> for a top level topic etc
What it looks like
This should look fairly familiar to you. HTML, the markup used in WWW documents, is an application of SGML: it has a single, limited DTD, knowledge of which is built into each Web browser, and its markup uses a limited set of tags.
<text>
<front>
<tPage>
<dTitle type=main>PARADISE LOST</dTitle>
<byLine>by
<dAuthor>John Milton</dAuthor>
</byLine>
</tPage>
</front>
<body>
<div type='book' n=B1>
<head>BOOK I.</head>
<l> Of Mans First Disobedience, and the Fruit
<l>Of that Forbidden Tree, whose mortal tast
<l>Brought Death into the World, and all our woe,
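Because this markup uses the same angle-bracket tag syntax as HTML, Python's standard `html.parser` module can list the start tags in a fragment of it. This is only a sketch under loud assumptions: `html.parser` is not an SGML parser, it knows nothing about the DTD, and it lowercases element and attribute names. The class name `TagLister` is invented for the example.

```python
from html.parser import HTMLParser

# A fragment of the sample above, including unquoted attribute values
# and <l> elements whose end tags are left out.
SAMPLE = """<div type='book' n=B1>
<head>BOOK I.</head>
<l> Of Mans First Disobedience, and the Fruit
<l>Of that Forbidden Tree, whose mortal tast
"""


class TagLister(HTMLParser):
    """Collect every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.starts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.starts.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE)
lister.close()

if __name__ == "__main__":
    for name, attributes in lister.starts:
        print(name, attributes)
```

The run reports one `div` with the attributes `type` and `n`, one `head`, and two `l` elements.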
A typical DTD segment
This is from the Text Encoding Initiative Document Type Definition, a specific set of DTDs which has been developed to handle the majority of literary modes [novels, poetry, drama, etc]. The intention is that the DTD need not be transmitted with the document: the header simply announces that the TEI DTD is used, and the TEI DTD is widely distributed.
The TEI DTDs allow the production of proper scholarly editions of works, complete with alternate readings, and allow the document to be presented as an authorised edition would appear. This is moving electronic publishing into the realms of being able to match printed editions. The material presented on screen [and, if necessary, printed] will look like the recognised print edition.
<!DOCTYPE ota SYSTEM "ota.dtd" [
<!ENTITY % OTAents system "unixlat0.dtd">
%OTAents
<!ENTITY file1 SYSTEM "plost.1827">
]>
<ota n="1827" crdate="1993-02-02" update="1993-07-19">
<header>
<fileDesc>
<titlStmt>
<title>Milton's "Paradise Lost": electronic
edition</title>
<edStmt>
Public Domain TEI edition prepared at the Oxford Text
Archive
<extent>
Filesize uncompressed: 498 Kbytes.
<pubStmt>