[Archive copy from: http://www.personal.u-net.com/~sgml/sgml.htm]

An Introduction to the Standard Generalized Markup Language (SGML)

by Martin Bryan of The SGML Centre

This file gives a very brief overview of the most commonly used parts of ISO's Standard Generalized Markup Language. For a full description of SGML refer to the author's textbook on the subject, SGML - An author's guide to the Standard Generalized Markup Language (SGML), which is published by Addison Wesley (ISBN 0 201 17535 5).

What is SGML?

SGML is the Standard Generalized Markup Language defined in ISO standard 8879:1986. SGML takes the concept of descriptive markup beyond the level of other markup languages. By defining the role of each piece of text in a formal model, users of programs based on SGML can check that each element of text is used in the correct place. SGML allows computers to check, for example, that users do not accidentally enter a third-level heading without first having entered a second-level heading.

Once a formal model has been defined for a particular type of document it becomes possible to off-load a large part of the document markup task to the computer. By giving the computer sufficient clues to determine where it is within the model, it is possible to set up a system to automatically add appropriate markup to a file.

SGML also allows users to:

link files together to form composite documents
identify where illustrations are to be incorporated into text files
create different versions of a document in a single file
add editorial comments to a file
provide information to supporting programs.

When used in conjunction with specially written data retrieval and document formatting programs, these techniques allow integrated document production systems to be developed.

It is important to note, however, that SGML is not:

a predefined set of tags that can be used to mark up documents
a standardized template for producing particular types of documents.

SGML was not designed to be a standardized way of coding text: in fact it is impossible to devise a single coding scheme that would suit all languages and all applications. Instead SGML is a formal language that can be used to pass information about the component parts of a document to another computer system. SGML is flexible enough to be able to describe any logical text structure, whether it be a form, memo, letter, report, book, encyclopedia, dictionary or database.

The components of SGML

SGML is based on the concept of a document being composed of a series of entities. (`Entity' is the English spelling of the French word `entité', the Teutonic equivalent of which is `thing'. Those familiar with modern programming techniques will probably be more comfortable using the word `object'. All these terms are synonymous.) Each entity can contain one or more logical elements. Each of these elements can have certain attributes (properties) that describe the way in which it is to be processed. SGML provides a way of describing the relationships between these entities, elements and attributes, and tells the computer how it can recognize the component parts of a document.

SGML differs from other markup languages in that it does not simply indicate where a change of appearance occurs, or where a new element starts. SGML sets out to clearly identify the boundaries of every part of a document, whether it be a new chapter, a piece of boilerplate text, or a reference to another publication. But SGML does not presume that it will be told where everything starts and ends. Instead it provides rules that allow the computer to recognize where the various elements of a text entity start and end. By careful use of these rules the amount of coding that needs to be entered by a human operator can be reduced to a bare minimum.

To allow the computer to do as much of the work as possible, SGML requires users to provide a model of the document being produced. This model, called a Document Type Definition (DTD), describes each element of the document in a form that the computer can understand. The DTD shows how the various elements that make up a document relate to one another.

To allow the computer to correctly identify where each part of a document starts and ends, SGML requires that the user declares, in an SGML Declaration, how the computer is to identify markup, and what codes have been used to identify and delimit markup sequences.

How is SGML used?

To use a markup tag set that has already been defined by a trade association or similar body, users need to know how markup tags are delimited from normal text and in which order the various elements should be used. Systems that understand SGML can provide users with lists of the elements that are valid at each point in the document, and will automatically add the required delimiters to the name to produce a markup tag. Where the data capture system does not understand SGML, users must either map their local coding scheme onto the relevant SGML tag set or enter the SGML tags manually for later validation.

Because SGML tag sets are based on the logical structure of the document they are somewhat easier to understand, and remember, than physically based markup schemes. Typically a memo might be coded as:

<to>All staff
<from>Martin Bryan
<date>5th November
<subject>Cats and Dogs
<text>Please remember to keep all cats and dogs indoors tonight.

When processed by an SGML document analyzer (often referred to by the technical term `parser') this file will be fully coded to identify the start and end of each element in the file. When this is done the file will probably have the following form:

<memo><to>
All staff
</to><from>
Martin Bryan
</from><date>
5th November
</date><subject>
Cats and Dogs
</subject><text><para>
Please remember to keep all cats and dogs indoors tonight.
</para></text></memo>

At first sight this file is somewhat daunting, but in this special form the file is ideal for a computer to follow, and therefore to process. The start and end of each component of the file have been clearly identified by a start-tag (e.g. <to>) and an end-tag (e.g. </to>). The tags have been placed on their own lines to make them easier to find, a trick made possible by the special rules that SGML uses to process carriage returns at the start and end of text elements.
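
The step from the minimized memo to the fully tagged form can be sketched in Python. This is only an illustration under stated assumptions: a real SGML parser derives the omitted tags from the DTD, whereas the function normalize_memo and the HEADER_FIELDS table below are hypothetical and hard-code the memo structure. The sketch handles a single paragraph and no explicitly entered end-tags.

```python
import re

# Hypothetical, hand-coded view of the memo structure (not read from a DTD).
HEADER_FIELDS = {"to", "from", "date", "subject"}


def normalize_memo(source: str) -> str:
    """Insert the start- and end-tags that the minimized memo leaves out."""
    # Each token is a start-tag name plus the text that follows it.
    tokens = re.findall(r"<(\w+)>([^<]*)", source)
    out = ["<memo>"]  # the <memo> start-tag is omissible, so it is inferred
    for name, text in tokens:
        text = text.strip()
        if name in HEADER_FIELDS:
            # The end-tag is implied by the start of the next element.
            out.append(f"<{name}>\n{text}\n</{name}>")
        elif name == "text":
            # The first <para> start-tag may be omitted, so it is inferred here.
            out.append(f"<text><para>\n{text}\n</para></text>")
        else:
            raise ValueError(f"element <{name}> is not part of this memo sketch")
    out.append("</memo>")
    return "".join(out)


minimized = (
    "<to>All staff\n"
    "<from>Martin Bryan\n"
    "<date>5th November\n"
    "<subject>Cats and Dogs\n"
    "<text>Please remember to keep all cats and dogs indoors tonight.\n"
)
print(normalize_memo(minimized))
```

Run on the minimized memo, the sketch prints the same layout as the fully tagged example, with each tag pair wrapped around its text on separate lines.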

Notice that at this point nothing has been said about the format of the final document. From the neutral format provided by SGML you can either choose to print the text onto a pre-printed form, or to generate a new form, positioning each element of the document where needed. This process is not part of the SGML document analysis; it is a separate function carried out by another program. (ISO are currently developing a Document Style Semantics and Specification Language (DSSSL) that will allow the format required for each element in an SGML-coded file to be specified.)

Defining your own tag sets

To create tag sets users must define a Document Type Definition that formally identifies the relationships between the various elements that form their documents. For a simple memo the DTD might take the form:

<!DOCTYPE memo [
<!ELEMENT memo O O ((to & from & date & subject?), text) >
<!ELEMENT text - O (para+) >
<!ELEMENT para O O (#PCDATA) >
<!ELEMENT (to, from, date, subject) - O (#PCDATA) >
]>

This model tells the computer that a memo consists of a group of elements, <to>, <from>, <date> and, optionally, <subject>, that can occur in any order, which must be followed by the text of the memo. The <text> element of the memo is itself made up of a number of repeated paragraphs, at least one of which must be present (this is indicated by the + immediately after para). In this simplified example a paragraph has been defined as a leaf node that can contain parsed character data (#PCDATA), i.e. data that has been checked to ensure that it contains no unrecognized markup strings. In a similar way the <to>, <from>, <date> and <subject> elements have been declared to be leaf nodes in the document structure tree.
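
To make the content model concrete, the following Python sketch checks a sequence of child element names against ((to & from & date & subject?), text). The function check_memo_children is a hypothetical helper written for this one model; it is not a general content-model engine and does not reflect how any particular SGML parser is implemented.

```python
HEADER_NAMES = {"to", "from", "date", "subject"}


def check_memo_children(children: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the sequence fits
    the model ((to & from & date & subject?), text)."""
    problems = []
    if children and children[-1] == "text":
        header = children[:-1]
    else:
        problems.append("memo must end with a text element")
        header = children
    # to, from and date are required once each, in any order (the & connector).
    for name in ("to", "from", "date"):
        count = header.count(name)
        if count != 1:
            problems.append(f"{name} must occur exactly once, found {count}")
    # subject? means zero or one occurrence.
    if header.count("subject") > 1:
        problems.append("subject may occur at most once")
    for name in header:
        if name not in HEADER_NAMES:
            problems.append(f"{name} is not allowed before text")
    return problems


print(check_memo_children(["date", "to", "from", "text"]))
print(check_memo_children(["to", "from", "text"]))
```

The first call uses a different order from the declaration and passes, because the & connector allows any order; the second reports the missing date element.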

The Os and hyphens inserted between the name of each element and its model show where markup tags can be omitted. There are always two entries. The first shows whether the start-tag of the element can be omitted; the second shows whether the end-tag can be omitted. An O indicates that omission is permitted; a hyphen indicates that the tag must always be present. In the above examples none of the end-tags is compulsory (though they will always be recognized if entered) while the start-tags may only be omitted for the <para> and <memo> elements. It should be noted, however, that only the first paragraph in the text element can have its start-tag omitted. The presence of any subsequent paragraphs must be clearly identified by the presence of a <para> start-tag.

Where the position of an element in the model is variable the element can be defined as an exception to the model. For example, to allow figures, and references to figures, to occur anywhere in the text, but not in the heading, the model definition for the <text> element could be modified to read:

<!ELEMENT text - O (para+) +(figure, figref) >

where <figure> is defined as:

<!ELEMENT figure  - - (graphic, caption) -(figref) >
<!ELEMENT graphic - O EMPTY >
<!ELEMENT caption - O (#PCDATA) >

Here <graphic> is an empty element that acts as a placeholder for the graphical part of the figure while <caption> identifies the text associated with the illustration. Both <figure> and <figref> can occur anywhere within the <text> element, but <figref> has been excluded from within figures.
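
The interplay of inclusions (+) and exclusions (-) can be sketched in Python as a lookup over the currently open elements. The tables INCLUSIONS and EXCLUSIONS and the function allowed_by_exceptions are hypothetical names, transcribed by hand from the declarations above rather than read from a DTD. The sketch assumes that an exclusion on any open element overrides an inclusion.

```python
# Hand-transcribed from: <!ELEMENT text ... +(figure, figref) >
INCLUSIONS = {"text": {"figure", "figref"}}
# Hand-transcribed from: <!ELEMENT figure ... -(figref) >
EXCLUSIONS = {"figure": {"figref"}}


def allowed_by_exceptions(name: str, open_elements: list[str]) -> bool:
    """True if `name` may appear here by virtue of an inclusion exception.

    `open_elements` lists the elements that are currently open,
    outermost first.
    """
    excluded = any(name in EXCLUSIONS.get(e, set()) for e in open_elements)
    included = any(name in INCLUSIONS.get(e, set()) for e in open_elements)
    # An exclusion on any open element wins over an inclusion.
    return included and not excluded


print(allowed_by_exceptions("figref", ["memo", "text", "para"]))
print(allowed_by_exceptions("figref", ["memo", "text", "figure", "caption"]))
```

The first call asks about a figure reference inside a paragraph of the text, the second about one inside a figure caption, where the exclusion applies.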

Defining the attributes of elements

Where elements can have variable forms, or need to be linked together, they can be given suitable attributes. For example, it might be decided that the <subject> field of a memo could optionally be printed in bold or italics. A suitable attribute list declaration might, in this case, be:

<!ATTLIST subject font (bold|italic|normal) "normal" >

This tells the computer that the <subject> start-tag can be amended to read <subject font=bold> or <subject font=italic> if a variant font is required. If no such change is requested the program is to use the default value to make the tag read <subject font="normal">. If the short tags option is available the entries can be reduced further by omitting the attribute name, provided the specified value is unique among the values declared for the element concerned. This means that when the computer sees <subject bold> or <subject italic> it will automatically add the font= prefix.
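
A minimal Python sketch of this defaulting and short-tag expansion follows. The ATTLIST dictionary is a hypothetical in-memory form of the declaration above, and resolve_attributes is an illustrative helper, not part of any SGML toolkit.

```python
# Hypothetical in-memory form of:
#   <!ATTLIST subject font (bold|italic|normal) "normal" >
ATTLIST = {
    "subject": {
        "font": {"values": ("bold", "italic", "normal"), "default": "normal"},
    },
}


def resolve_attributes(element: str, specified: list[str]) -> dict[str, str]:
    """Expand tokens such as 'font=bold' or the short form 'bold',
    then fill in defaults for anything not specified."""
    declared = ATTLIST.get(element, {})
    resolved = {name: spec["default"] for name, spec in declared.items()}
    for token in specified:
        if "=" in token:
            name, value = token.split("=", 1)
            value = value.strip('"')
        else:
            # Short form: find the one attribute whose value list holds the token.
            value = token
            owners = [n for n, spec in declared.items() if value in spec["values"]]
            if len(owners) != 1:
                raise ValueError(f"cannot infer an attribute name for {value!r}")
            name = owners[0]
        if name not in declared or value not in declared[name]["values"]:
            raise ValueError(f"{name}={value} is not declared for <{element}>")
        resolved[name] = value
    return resolved


print(resolve_attributes("subject", []))
print(resolve_attributes("subject", ["bold"]))
```

The first call corresponds to a plain <subject> tag and the second to <subject bold>.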

One especially important type of attribute is the unique identifier. Because it is unique it can be used to provide a cross reference between two points in the document. For example, you can ensure that a unique identifier is assigned to each figure by adding an attribute list declaration of the following form to the DTD:

<!ATTLIST figure id ID #REQUIRED >

This tells the computer that every <figure> element must be entered with a unique identifier within the start-tag, e.g. as <figure id="fig1"> rather than just <figure>.

Unique identifiers can be referred to within the text by use of attributes that form identifier references. Typically a figure reference element might be defined as:

<!ELEMENT figref - O (#PCDATA) >
<!ATTLIST figref refid IDREF #CONREF >

The special #CONREF keyword allows users to either ask the computer to generate a cross reference, by entering the relevant unique identifier as part of the start-tag (e.g. <figref refid="fig1">), or to enter the relevant text themselves (e.g. <figref>(see Figure 1 on page 13)</figref>).
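
The checks implied by the ID and IDREF declarations can be sketched in Python. The function check_cross_references is hypothetical and uses simple pattern matching on quoted attribute values instead of real SGML parsing, so it is a rough illustration only.

```python
import re


def check_cross_references(document: str) -> list[str]:
    """Report duplicate ids, figures without an id, and dangling refids."""
    problems = []
    ids = re.findall(r'<figure\s+id="([^"]+)"', document)
    seen = set()
    for value in ids:
        if value in seen:
            problems.append(f"duplicate identifier: {value}")
        seen.add(value)
    # id is #REQUIRED, so every <figure> start-tag should carry one.
    missing = len(re.findall(r"<figure\b", document)) - len(ids)
    if missing:
        problems.append(f"{missing} figure start-tag(s) lack a required id")
    # A <figref> without refid supplies its own text (#CONREF) and is skipped.
    for ref in re.findall(r'<figref\s+refid="([^"]+)"', document):
        if ref not in seen:
            problems.append(f"unresolved reference: {ref}")
    return problems


sample = (
    '<figure id="fig1"><graphic><caption>A cat</figure>'
    '<figref refid="fig1"> <figref refid="fig2"> '
    "<figref>(see Figure 1 on page 13)</figref>"
)
print(check_cross_references(sample))
```

In the sample, the reference to fig2 has no matching figure, so it is the only problem reported.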

Incorporating standard and non-standard text elements

SGML also contains techniques for adding standard (boilerplate) text to a file, and for handling characters that are outside the standard character set, but which are available on certain output devices.

Commonly used text can be declared within the DTD as a text entity. A typical text entity definition could take the form:

<!ENTITY company "The SGML Centre" >

Once such a declaration has been made in the DTD users can use an entity reference of the form &company; in place of the full sequence. An advantage of using this technique is that, should the name of the company referred to by the mnemonic change later, only the entry in the DTD needs to be changed as the entity reference will automatically call in the latest definition.
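
The replacement of entity references can be sketched in Python as a lookup and substitution. The ENTITIES table and the expand_entities function are hypothetical names; a real SGML parser also handles nested references, character references and external entities, all of which this sketch ignores.

```python
import re

# Hypothetical in-memory form of: <!ENTITY company "The SGML Centre" >
ENTITIES = {"company": "The SGML Centre"}


def expand_entities(text: str, entities: dict[str, str]) -> str:
    """Replace each &name; reference with its declared replacement text."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in entities:
            raise KeyError(f"undeclared entity: {name}")
        return entities[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", replace, text)


print(expand_entities("Welcome to &company;.", ENTITIES))
```

Changing the single entry in the table changes every expansion, which mirrors the maintenance advantage described above.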

Text stored in another file can also be incorporated into a file using entity references. In this case the entity declaration identifies the file containing the text to be recalled from disc, e.g.:

<!ENTITY appendix SYSTEM "C:\book4\appendix.doc" >

and the entity reference (&appendix;) shows where the file is to be added to the main text stream.

Where non-standard characters are required special system-dependent entities can be declared to show how the characters can be generated. A typical entry might read:

<!ENTITY plusmn SDATA "&#27;A" >

When the string &plusmn; is encountered in the text the computer will replace it by the code whose decimal value is 27 (Escape) followed by the letter A.

Illustrations, tables and other special elements

SGML provides a number of techniques for handling non-standard document elements. Where the coding scheme of an element of the file, such as an illustration, differs from that used for normal text, the contents of the element can be treated as an entity with a special notation, e.g.:

<!ENTITY fig1 SYSTEM "c:\book12\figures\fig1" NDATA pstscrpt >

Alternatively details of the relevant notation can be defined as an attribute of an element, e.g.:

<!ATTLIST graphic type NOTATION (tex|eqn|pstscrpt) "pstscrpt" >

In both these situations a notation declaration is required to tell the program what to do with the non-SGML data. Typically this takes the form of a call to a program, e.g.:

<!NOTATION pstscrpt SYSTEM "c:\dos\pstscrpt.bat" >

Sometimes it will prove necessary to call in documents that are coded using a different DTD from the main file. Provided such files are stored with their DTDs they can be incorporated into the file as a subdocument entity by using an entity declaration of the form:

<!ENTITY table1 SYSTEM "c:\tables\table34.may" SUBDOC>

Where tables are created and output on a line-by-line basis they can be flagged as a special type of character data by use of a CDATA keyword, e.g.:

<!ENTITY table2 SYSTEM "c:\tables\table69.jun" CDATA tabs15 >

where tabs15 is a notation name showing the way in which the externally stored table has been coded.

NOTE: In many SGML-based systems the identification of entities and their incorporation into files will be a background process that will not be seen by end-users.

Using SGML coded text

While SGML coded text is clearly delimited, and may contain instructions for formatting the text, it will be necessary to pass the SGML-coded file(s) to one or more applications before outputting a document or adding it to a database. The SGML LINK option can be used to control links with applications, though it should be pointed out that many SGML programs do not support this optional feature.

Where the receiving application, such as a database, has a clearly defined structure the tagging scheme used for data input can be automatically linked (mapped) to the data structure required by the application, the computer checking that all required elements of the structure are present before passing the file onward for processing.

Where the receiving application has no clearly defined structure, implicit or simple links can be used as controls on the processing of individual elements or complete documents respectively. Alternatively external routines can be used to convert the clearly defined boundaries of SGML elements and entities into the coding required by the receiving application.

Data stored using non-SGML notations will need appropriate application software to process it, but the SGML-coded file will correctly identify where each piece of such data belongs in the completed document.

NOTE: Where cross-references are page dependent it will be up to the text formatter to generate any necessary page references.

SGML-coded files are, by their nature, ideal for storing in databases. Because such files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file.

By storing data in the clearly defined format provided by SGML you can ensure that your data will be transferable to a wide range of hardware and software environments. New techniques in programming and processing data will not affect the logical structure of your document's message. If more detail needs to be added to the file all you need to do is to update the model and then add new markup tags where required in the document instance. If a completely new style is required then the existing document model can be linked to the new one to provide automatic updating of document structures.
