[This local archive copy is from the official and canonical URL, http://www.softquad.com/sgmlinfo/primintr.html; please refer to the canonical source document if possible.]


Introduction to the SGML PRIMER

SoftQuad's Quick Reference Guide to the Essentials of the Standard: The SGML Needed for Reading a DTD and Marked-Up Documents and Discussing Them Reasonably.

Copyright (c) 1990, 1991, 1995 SoftQuad Inc. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means - electronic, mechanical, recording, or otherwise - without the prior written consent of the publisher, excepting brief quotes used in connection with reviews written specifically for inclusion in a magazine or newspaper.

This is just the introduction from the booklet. The body of the book is also available. Soft cover hard copies of this booklet are available from SoftQuad at a nominal cost.

SGML: The End-User's Eye View

This brief overview introduces many of the major concepts of SGML, each of which is covered separately in this booklet. Terms in boldface are words or phrases which are defined very precisely in the standard itself. Definitions or descriptions here are not a substitute for those in the official document, ISO 8879:1986.

The Standard Generalized Markup Language is necessarily sophisticated: It is providing a much-needed service by allowing the exchange of information at any level of complexity among software, hardware, storage and presentation systems (including database management and publishing applications) without regard to the manufacturer's name on the label. And it is doing all this with the authority of an International Standard. At the same time, SGML's strength is that it reflects the way people work today while nudging us all gently towards concepts about information handling.

Two Ways of Looking at the Document's Structure

In general, people who create the component pieces of electronic documents are actually thinking of them in two ways at once:

Two Kinds of Content

In any communication, two levels of information are being passed: what we think of as the content, and other, subtler information, about that content. That other information -- boldface in a book, underlining on a hand-written memo, shouting in a face-to-face conversation -- may be thought of as markup. Its job is to express information (in general reflecting hierarchical structure) that is useful to a human or computer for processing the content.

SGML makes exactly the same distinction, dividing what is contained in a document into content, or data (made up, naturally, of data characters, which are the letters of the alphabet, the numbers, punctuation, and so on) and markup (made up of markup characters, which, by an important coincidence, are also letters, numbers, punctuation characters).

Markup is not a new idea. Traditionally, designers marked up raw manuscripts with instructions to a typesetter who did whatever was required to make titles appear big, bold and centered, to make paragraphs a certain width with an indent, and so forth. Those instructions would appear as a string of gibberish, meaningful only to the machine being used to set the type. Often they would contain "control-codes" that could baffle and halt anyone else's typesetting system.

At the same time, those instructions, embedded in the flow of text, guaranteed the long-term uselessness of the information. If it was to be revised and republished, it had to go to the same typesetting system, probably outdated by the time of the revision. If someone wanted to change the design, it meant someone (generally someone else) having to go into the files to edit every occurrence of whatever cryptic instructions made the title 36-point bold Times Roman centered.

Using computerized global search-and-replace techniques couldn't work because the same instructions might appear in a variety of places that were not logically related. If you wanted to turn every foreign language term from italic to bold, you would accidentally but automatically convert all the italic book titles and emphasized text too.

(We've written this section in the past tense, but, in fact, procedural markup -- whereby an operator uses cryptic machine-dependent instructions to tell a system to perform an action such as switch fonts, embolden, center -- is still the prevalent technique today. Not for long. Descriptive markup, its opposite, identifies the elements within the document instance which make up its logical structure and solves many of the problems listed above.)
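To see why descriptive markup survives a design change where procedural markup does not, consider a minimal Python sketch. The control codes (\fI, \fB, \fR) and the element names (title, foreign, emph) are invented for illustration and belong to no particular typesetting system or DTD.

```python
import re

# Procedural markup: hypothetical codes \fI (italic on) and \fR (back to roman).
procedural = (
    r"She read \fIWar and Peace\fR with \fIjoie de vivre\fR "
    r"and was \fIvery\fR pleased."
)

# A global replace of italic by bold changes every italic span,
# because the codes say nothing about why the text is italic.
procedural_bold = procedural.replace(r"\fI", r"\fB")

# Descriptive markup: hypothetical element names say what each span is.
descriptive = (
    "She read <title>War and Peace</title> with "
    "<foreign>joie de vivre</foreign> and was <emph>very</emph> pleased."
)


def restyle(text, style_for):
    """Render descriptive markup using one procedural code per element name."""
    def render(match):
        name, content = match.group(1), match.group(2)
        return style_for[name] + content + r"\fR"
    return re.sub(r"<(\w+)>(.*?)</\1>", render, text)


# Only the foreign-language term becomes bold; titles and emphasis stay italic.
styles = {"title": r"\fI", "foreign": r"\fB", "emph": r"\fI"}
restyled = restyle(descriptive, styles)
```

Changing the design now means editing one entry in the styles table, not every occurrence of a code in the text.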

Two Improvements on Old Style Markup

SGML begins by defining a character set, generally based on the ASCII standard characters, which can be sent, safely, to any system. Peculiar and special characters (bullets and boxes, math symbols and so forth) are turned into ASCII representations -- entity references -- that get converted by the receiving system into whatever it needs to reproduce those characters. This means no peculiar "control" or "alternate" characters are used.
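As a rough illustration of how a receiving system might turn entity references into its own characters, here is a small Python sketch. The two lookup tables, the function name and the entity name "unknown" are assumptions made for the example; a real SGML system takes its entity definitions from declarations, not from a hard-coded table.

```python
import re

# Hypothetical per-system tables: each receiving system maps an entity
# name to whatever it needs to reproduce that character.
SCREEN = {"bullet": "\u2022"}
PLAIN_ASCII = {"bullet": "*"}

# An ampersand, a name, then a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, table):
    """Replace each &name; with the system's own rendering.

    Names missing from the table are left untouched so nothing is lost.
    """
    return ENTITY_REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)


source = "&bullet; Apples &unknown;"
on_screen = expand_entities(source, SCREEN)
on_terminal = expand_entities(source, PLAIN_ASCII)
```

The text that travels between systems contains only ordinary characters; each system decides locally what the name stands for.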

The second improvement came about as the creators of SGML realized that the places where markup traditionally had to be inserted in a document matched the elements of its logical structure. For example: text size changed because a title had begun; a typeface changed because an emphasized term appeared; a horizontal line was drawn to set off a table or chart.

SGML then went the next step and said "All markup will be logical, and instead of cryptic codes, element names (lodged inside 'tags') can be inserted into text to indicate the beginnings and ends of logical objects."

From the user's point of view, then, we know that markup will be mixed in with the data and that all of it will be represented using standard characters which are available consistently on all (or nearly all) computers.

Separating the Wheat from the Other Wheat

Clearly it's crucial to distinguish between the two types of content. This is done in SGML by inserting delimiter characters which let software recognize that certain characters should be read in TAG mode (and perhaps specific actions taken or translations made into typesetting languages) and others in CON (for content) mode and passed over to the application for processing.

Characters used as delimiters must be carefully chosen: They shouldn't show up too often in regular content. ISO 8879 describes a base set which includes open and close angle brackets to set off start-tags (the < > characters with an element name inside) and an ampersand followed by a name followed by a semicolon to form entity references, which stand in for entities such as graphic images or special characters (&bullet; for instance).
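The way delimiters let software tell markup from data can be sketched in a few lines of Python. This is a simplification under stated assumptions: it knows only the delimiters mentioned above, ignores attributes and every other SGML construct, and the function and token names are our own.

```python
import re

# End-tags, start-tags and entity references, recognized by their delimiters.
TOKEN = re.compile(
    r"</([A-Za-z][\w.-]*)>"     # </name>
    r"|<([A-Za-z][\w.-]*)>"     # <name>
    r"|&([A-Za-z][\w.-]*);"     # &name;
)


def tokenize(text):
    """Split text into (kind, value) pairs; whatever is not markup is data."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("data", text[pos:m.start()]))
        end_name, start_name, entity_name = m.groups()
        if end_name is not None:
            tokens.append(("end-tag", end_name))
        elif start_name is not None:
            tokens.append(("start-tag", start_name))
        else:
            tokens.append(("entity-ref", entity_name))
        pos = m.end()
    if pos < len(text):
        tokens.append(("data", text[pos:]))
    return tokens


tokens = tokenize("<para>&bullet; A pirate ship</para>")
```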

What Does It All Mean? How Does It All Work?

This is not madness. All databases have to have an internal representation that indicates where the "name" field (for example) ends and the "address" field begins. Each wordprocessing or desktop publishing software product has some internal markup language that initiates centering or emboldening and so on. The tricky part was in coming up with an approach to markup that would allow interchange among all of them.

In the earliest days of the committee work that led to the creation of SGML, the central topic was "generic coding", the development of a system of universal, machine-independent codes whereby, for instance, <P> would always mean paragraph, and <H1> would always indicate a first-level heading. The intention was to specify a set of tags that would work for a very large number of documents.

The principle of generic coding is sound, but the project was a bit overwhelming: There are simply too many types of documents with too many different kinds of elements in them.

And a second problem appeared: What about mistakes? Is there some way that the computer can help ensure that element names are keyed correctly? Can it help with the more difficult task of checking that users keyed in the codes in the right places?

Interestingly, there was one answer to both problems, and it came from the world of computer programming.

Many computer languages provide a programmer with a set of "primitives", basic operations that can be put together in a header file to define a specific set of commands which the program itself will use.

The committee members adopted exactly this approach. SGML turned out not to be a set of standardized codes, but instead, a language that could be used to create a document type definition (commonly referred to as a DTD) that defines precisely those elements (and other constructs -- we'll get to them) needed for one document or for a group of similarly structured documents.

The element definitions -- formally called element declarations -- have two critical functions: They indicate the "official" name of an element, which will appear inside delimiters as a tag (<chapter> for example); and they describe what each element may contain, the content model.

A chapter might be described as starting with a chapter title which would be followed by any number of paragraphs, perhaps interspersed with headings. The element declaration for this example would be:

<!ELEMENT   chapter   (chptitle, (para | heading)+) >

SGML provides the syntax for this declaration. Any SGML system would recognize it as an element declaration because it begins with <!ELEMENT, and the > ends the declaration. It would read the comma as meaning "followed by", the vertical bar as "or" (as in "paragraphs or headings") and the plus sign as "one or more". The parentheses provide grouping, just as they do in elementary school arithmetic.

The next step is to declare the contents of the chapter's subelements, "chptitle", "para" and "heading". Because they have the same content model, we can declare them together:

<!ELEMENT   (chptitle | para | heading)   (#PCDATA) >

The SGML reserved name, PCDATA, is recognized by the system as meaning that "chptitle", "para" and "heading" don't have any subelements of their own. Rather, they contain what is termed parsed character data -- the actual letters, numbers, punctuation and special characters that make up content.
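One way to get a feel for the chapter content model is to read it as a pattern over the names of a chapter's subelements. The Python sketch below does exactly that with a regular expression. It is an illustration, not how a real parser is built, and the function name is an assumption.

```python
import re

# The content model (chptitle, (para | heading)+) read as a pattern over
# subelement names: "," is sequence, "|" is choice, "+" is one or more.
# Each name is followed by a space so that names cannot run together.
CHAPTER_MODEL = re.compile(r"chptitle (?:(?:para|heading) )+")


def chapter_is_valid(children):
    """Check the names of a chapter's subelements against the content model."""
    sequence = "".join(name + " " for name in children)
    return CHAPTER_MODEL.fullmatch(sequence) is not None


ok = chapter_is_valid(["chptitle", "para", "heading", "para"])
title_only = chapter_is_valid(["chptitle"])          # no para or heading
wrong_order = chapter_is_valid(["para", "chptitle"])  # title must come first
```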

At this point the user would create the document, based on the relationships and using the markup declared in the DTD, set off with appropriate delimiters from the character data:

<chapter><chptitle>My Summer Vacation</chptitle>

<para>It was a dark night, not stormy at all, no hint 
   of a storm, really ....</para>

<para>A pirate ship appeared on the horizon...</para>
   ...</chapter>

As you may have guessed, tags that begin with the open-angle-slash delimiter, shown here as </, are end-tags. The content of the chapter title is fully contained between its start- and end-tags.

In addition to declaring the element names and allowed contents, the DTD may also include a list of entities, the objects described earlier which represent machine-independent coding for the bullets, special characters or external files which each system will have its own way of incorporating on screen or on paper.

Sometimes there's not enough information in an element name to allow it to be used in accordance with some individual requirements. Perhaps our fictional user wants the chapter's opening paragraph to be classified as top secret. An attribute for a paragraph might be defined as follows:

<!ATTLIST para secrecy (topsec|public) "public">

The "public" in quotes represents the default value. All paragraphs in which the user doesn't specify topsec or public will have the value "public" anyway.
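The effect of a declared default can be sketched in Python as follows. The dictionary is a hypothetical in-memory form of the attribute declaration above, and the function name is our own; the point is only that an unspecified attribute still ends up with a value.

```python
# A hypothetical in-memory form of
#   <!ATTLIST para secrecy (topsec|public) "public">
PARA_ATTLIST = {
    "secrecy": {"allowed": ("topsec", "public"), "default": "public"},
}


def effective_attributes(specified, attlist=PARA_ATTLIST):
    """Combine the attributes keyed on a start-tag with declared defaults."""
    undeclared = set(specified) - set(attlist)
    if undeclared:
        raise ValueError(f"undeclared attributes: {sorted(undeclared)}")
    result = {}
    for name, decl in attlist.items():
        value = specified.get(name, decl["default"])
        if value not in decl["allowed"]:
            raise ValueError(f"{name}={value!r} is not one of {decl['allowed']}")
        result[name] = value
    return result


secret = effective_attributes({"secrecy": "topsec"})
ordinary = effective_attributes({})  # nothing specified: the default applies
```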

<para secrecy=topsec>It was a dark night, not stormy at 
   all, no hint of a storm, really ...

<para>A pirate ship appeared on the horizon ...</>

Notice that in some circumstances markup minimization can be used to save some keystrokes. (As examples, the first paragraph's end-tag has been omitted since the next paragraph's start-tag implies the end of the first paragraph element. In addition, the element name "para" has been omitted from the </>.) There are various minimization techniques available for end-tags as well as for start-tags and attribute specifications.
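The two minimizations just described can be undone mechanically. The Python sketch below handles only those two cases, under stated assumptions: an element declared as (#PCDATA) ends as soon as another start-tag appears, and </> closes whichever element is currently open. A real parser works from the DTD and the SGML declaration and covers many more cases; the names used here are our own.

```python
import re

# Matches <name ...> start-tags, </name> end-tags and the empty end-tag </>.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)?([^<>]*)>")

# Elements declared as (#PCDATA) in the example DTD: no subelements allowed.
PCDATA_ONLY = {"chptitle", "para", "heading"}


def expand_minimization(text):
    """Restore omitted end-tags and fill in the name of empty end-tags."""
    out, open_elements, pos = [], [], 0
    for m in TAG.finditer(text):
        out.append(text[pos:m.start()])       # character data, unchanged
        pos = m.end()
        slash, name, _rest = m.groups()
        if slash:                             # an end-tag
            if not open_elements:
                raise ValueError("end-tag found but no element is open")
            name = name or open_elements[-1]  # </> closes the current element
            if name not in open_elements:
                raise ValueError(f"end-tag for {name!r}, which is not open")
            while True:                       # close anything left open inside
                closed = open_elements.pop()
                out.append(f"</{closed}>")
                if closed == name:
                    break
        else:                                 # a start-tag
            if name is None:
                raise ValueError("empty start-tags are not handled here")
            if open_elements and open_elements[-1] in PCDATA_ONLY:
                # The open element cannot contain this one, so it must end.
                out.append(f"</{open_elements.pop()}>")
            open_elements.append(name)
            out.append(m.group(0))
    out.append(text[pos:])
    return "".join(out)


minimized = "<para secrecy=topsec>It was a dark night ...\n<para>A pirate ship ...</>"
expanded = expand_minimization(minimized)
```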

In an SGML Document, There are No Surprises

Picture the many levels at which life is made easier. You tell someone you're sending an SGML document. He or she knows:

Every step has been smooth. Because the system is SGML, each component establishes the values and parameters for the following one. The only markup that appears has been declared in the DTD. The syntax of the DTD has been indicated by the SGML declaration. And the standard defines that.

The real benefit of this flow is that computers can follow it to check whether documents follow the rules designed for them. SGML (in spite of being human-readable) is a computer language and is very precise. This means that a computer program -- a validating SGML parser -- can read the SGML declaration and learn its rules, then read the DTD and learn the rules of the markup, and then determine whether the document instance meets those rules.

Once I Send Information Out to be Processed, What Happens?

This is validation. Automatic. By a machine. And as far as ensuring that the content you're sending to a database or to the typesetter won't hiccup or burp, it can't be beat.

The parser's job is to read in SGML and separate the data from the markup. It recognizes when markup has been minimized and will expand that. If your content includes references to the spreadsheet for Chapter Two and the graphic of the organization chart for Chapter Six, it will instruct the system how to find those entities. If the graphic is in some special data content notation produced by a drawing program, the parser will arrange to have the image brought in (in this case to be published). If your content includes special directions for your publishing system in its own internal language -- SGML calls these processing instructions -- they will be passed right through to the application. If you've used the SGML marked section construct and indicated that some parts of your document are not to appear in this published version, the parser will know not to send them on. If you're using the SGML comment declaration construct to pass notes and messages back and forth among the writers and editors, the parser will know not to send them on to the receiving application either. All this and more.
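A few of these duties can be imitated with a short Python sketch. It assumes the delimiters of the reference concrete syntax (<!-- ... --> for a comment declaration, <![ IGNORE [ ... ]]> for an ignored marked section, <? ... > for a processing instruction), does not handle nesting or any other construct, and uses function and variable names of our own choosing.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
IGNORED_SECTION = re.compile(r"<!\[\s*IGNORE\s*\[.*?\]\]>", re.DOTALL)
PROCESSING_INSTRUCTION = re.compile(r"<\?(.*?)>", re.DOTALL)


def prepare_for_application(text):
    """Drop what the application must not see; report what passes through.

    Returns the cleaned text and the list of processing instructions that
    remain in it for the application to act on.
    """
    text = COMMENT.sub("", text)            # editors' notes stay behind
    text = IGNORED_SECTION.sub("", text)    # parts excluded from this version
    instructions = PROCESSING_INSTRUCTION.findall(text)
    return text, instructions


document = (
    "<para>Visible<!-- note to editor --> text<?newpage>"
    "<![ IGNORE [ draft only ]]>.</para>"
)
cleaned, instructions = prepare_for_application(document)
```

The instruction name "newpage" is made up; what a processing instruction means is entirely up to the system that receives it.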

All this and more, and most important, invisibly. This list -- and it is only part of what a parser and an SGML system do -- represents actions you can count on, without human involvement (except, of course, to clean up human errors, a process made considerably easier than it might be by the rules, established in the DTD and enforced by the sending application).

What Do You Mean 'Enforced'?

A new generation of software is appearing and will continue to appear, software that lives and breathes SGML, that takes advantage of the DTD to guide users in building documents that are well-structured; that takes advantage of the structure to give users functionality that we never had before. (A quick plug in passing: Our company's SoftQuad Author/Editor is an example of just such a product.)

Soon, if you're a user, much of what you've learned in this overview will become second nature to you, the complications masked behind intuitive interfaces but with SGML's powerful and flexible constructs at your fingertips.

Your spreadsheet software will have an option to export SGML files, as will your wordprocessor and hypertext software. This new generation of software will work directly with DTDs and offer you a logic-driven, structure-based, objects-and-attributes-based approach to information handling far richer than the templates and style sheets and index card interfaces of today.

"Exchange of information at all levels of complexity" is a lofty ambition for any standard. But a standard designed as a language for building applications will succeed.

The task of the rest of this booklet, then, is to introduce you to the constructs of the language that are used in building applications, to give you some flavor of what is possible with SGML.