[This local archive copy is from the official and canonical URL http://www.sq.com/resources/content_sgml_primer.html [related version?] previously http://www.softquad.com/sgmlinfo/primbody.html; please refer to the canonical source document if possible.]
SoftQuad's Quick Reference Guide to the Essentials of the Standard: The SGML Needed for Reading a DTD and Marked-Up Documents and Discussing Them Reasonably.
Copyright (c) 1990, 1991, 1995 SoftQuad Inc. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means - electronic, mechanical, recording, or otherwise - without the prior written consent of the publisher, excepting brief quotes used in connection with reviews written specifically for inclusion in a magazine or newspaper.
This book offers an introduction to SGML markup -- the stuff that adds 'intelligence about itself' to content -- and to Document Type Definitions (DTDs), the sets of declarations and strategies that application designers use to describe the structures of types of documents -- memos, for example, or technical manuals, or corporate financial reports.
Begin with The End-User's Eye View to get the whole picture.
While parts of this Primer were created using SGML, the sample DTD was created only to illustrate the vocabulary and syntax of markup declarations.
Treat this document as real basic. Its job, in any situation, whether conference or cocktail party, is to help you recognize and quickly disarm SGMLese.
Four basic typographic conventions are used to explain and illustrate SGML constructs throughout this book. Sample document instances appear either as ASCII text or in the on-screen form used by SoftQuad's SGML-sensitive wordprocessor, Author/Editor.
The fundamental principle of any computer language -- and of any standard -- should always be No Surprises. SGML lets people analyze the storage, retrieval and processing requirements of collections of related or similar documents (classes of documents), and create all the mechanisms necessary to describe the structures of those documents using formal markup declarations.
Simply put, markup declarations create or establish the markup used in the document to set off structures clearly and unambiguously.
SGML declarations are generally of the form:
The first parameter always establishes a name that may henceforward be used to indicate, reference or represent the contents of the other parameters. (It's no different than saying "Let 'x' have a certain value or meaning, so that I can now use 'x' in my work.")
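As a rough sketch of that general form: a declaration opens with <!, states its keyword, gives the name, adds any further parameters, and closes with >. The first line below is schematic only (KEYWORD, name and parameters are our placeholders, not SGML syntax); the second fills the pattern in with the entity declaration used later in this booklet.

```sgml
<!KEYWORD  name  parameters ... >
<!ENTITY   tsp   "The SGML Primer" >
```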
The keyword can be any one of:
DOCTYPE which assigns the name in the first parameter to a set of declarations. These declarations may be right there enclosed in square brackets or in another file identified in the next parameter (or in a combination of these places). This set comprises the document type declaration subset, and may include any of the other kinds of markup declarations.
ELEMENT to declare an object (element type) within the logical structure of the document; here the first parameter is declared to have as content everything in a following parameter, the content model. (For example: a "chapter" is declared to have content which consists of "title followed by any number of paragraphs".)
ATTLIST to associate an element type with a set of characteristics that may be applied to one specific instance of that element. (For example: An element "figure" is associated with a unique identifier so that references to it can calculate its page number or show it in a separate window.)
ENTITY to allow a short string of text to stand for a longer string, or to point to a file stored externally to the current file. (For example: Let "tsp" be used as shorthand for "The SGML Primer")
NOTATION to associate the first parameter, which names a "content notation" for non-SGML data (CGM for graphics, for example; SCORE or "MIDI Files" format for music or whatever), with the second parameter, which instructs the system in how to handle such notation.
SHORTREF to name a set of associations between short strings of characters (or a single character) and markup.
USEMAP to activate the set of SHORTREFS named in the first parameter with the ELEMENT(s) named in the second one. (SHORTREF and USEMAP declarations are not covered in this booklet. These constructs are most useful when you attempt to use SGML with typewriters or old-fashioned word-processors or in retrofitting, the turning of old electronic files into SGML-encoded text. SoftQuad contends that software tools designed specifically for creating or editing SGML can avoid them.)
The exception to the declaration structures described here is the comment declaration which instead of a keyword begins and ends with two hyphens.
In the following section, a fictional DTD that describes this booklet and a sample document instance created with this DTD illustrate how markup declarations are written and how they work together to define the structure of a document such as this booklet.
The DTD begins with the DOCTYPE declaration which assigns the name booklet to the set of markup declarations, the document type declaration subset, which follows. Since certain kinds of entities must be declared before they can be used, it is customary to list all ENTITY declarations together at the beginning of the DTD.
Each type of declaration is illustrated in a declaration diagram and explained in greater detail in the following sections.
<!-- Comment declarations can help clarify a DTD. This DTD, for example, strings together
some of the declarations that describe this booklet. -->
<!DOCTYPE booklet [
<!ENTITY tsp "The SGML Primer" -- the title we'll use -- >
<!ENTITY % declars "element|attlist|entity|notation" >
<!ELEMENT booklet - - (title, (text|dtd|instance|diagram)*) >
<!NOTATION pict SYSTEM "pictView" >
<!ELEMENT dtd (%declars;) -- parameter entity shorthand -- >
<!ATTLIST dtd type (silly|serious) serious >
<!ELEMENT (title|instance|%declars;) (#PCDATA) >
<!ELEMENT diagram EMPTY >
<!ATTLIST diagram graphic NOTATION (pict|cgm) #REQUIRED >
<!ELEMENT text (para)* >
<!ELEMENT para (#PCDATA|quote|emph)* >
<!ELEMENT (quote|emph) (#PCDATA)>
]>
The sample document instance which follows contains the text of the document incorporating the markup as specified in the Booklet DTD.
Why the term "document instance"? Because we are referring to a particular document which is one instance of the many possible documents that could be created in accordance with the Booklet DTD.
Markup within the document instance is called descriptive markup. Descriptive markup identifies the elements within the document instance which make up its logical structure.
Note that the tag names for elements in the document instance below are, naturally, the names declared as markup in the DTD. This illustration is of the SoftQuad Author/Editor screen where tags are chosen from a menu and represented as icons to ensure they can never be out of context or typed incorrectly.
The element declaration above defines the element type to be booklet and the content model to be a title followed by the subelements: text or dtd or instance or diagram. The | says "or". The * says that the subelements can appear zero or more times. Thus the subelements can occur and recur in any order -- diagram could precede text; dtd follow instance; text can be repeated many times interspersed with diagram or dtd subelements -- which is already what happens in this booklet.
Within the element declaration, the first parameter indicates the element name referred to in SGML as the element type. Element names consist of one name start character followed by zero or more name characters, up to 8 characters in total. Name start characters are a-z and A-Z. Name characters are a-z, A-Z, 0-9, period, and hyphen. Names are not case sensitive.
The element type parameter may consist of one element, or a group of elements.
SGML's optional markup minimization features allow markup in the document instance to be significantly reduced by a variety of techniques including shortening or omitting tags when the parser can infer them from the content model of the current open element.
Two extra characters, entered between the name and the content of the element declaration, define whether or not a tag can be omitted. The first character represents the start-tag; the second character represents the end-tag.
These minimization symbols are:
O (the letter, not the number zero) to indicate that the tag may be omitted under certain clearly defined circumstances; (It is possible for the start-tag to be omissible on some but not all occurrences of a particular element.)
- (a hyphen) to indicate that the tag is required.
<!ELEMENT flood - - (water+, rainbow) >
<!ELEMENT rainbow - O EMPTY >
<!ELEMENT water O O (rain & (lightng, thunder))>
<!-- FLOOD has required start- and end-tags. WATER sometimes needs
neither: The start-tag on the first WATER can be omitted because WATER
is required in a FLOOD. Each new WATER start-tag will imply the previous WATER end-tag
until the last one where the RAINBOW start-tag also forces the WATER end-tag.
Because start-tags can be omitted only where the element is required,
the parser will not infer a WATER start-tag from the start-tag of
its three sub-elements.-->
<!-- Since RAINBOW has no content (it's just
a placeholder for terrific visuals and maybe some
orchestral music), it can't have an end-tag. -->
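To see what these minimization symbols buy you, here is a sketch of the same FLOOD encoded two ways. It assumes that OMITTAG is enabled, that RAIN, LIGHTNG and THUNDER are declared elsewhere with required tags and #PCDATA content (the declarations above do not show them), and that ". . ." stands for ordinary text.

```sgml
<!-- Every tag typed out -->
<flood>
<water><rain>. . .</rain><lightng>. . .</lightng><thunder>. . .</thunder></water>
<water><lightng>. . .</lightng><thunder>. . .</thunder><rain>. . .</rain></water>
<rainbow>
</flood>

<!-- The same document, minimized -->
<flood>
<rain>. . .</rain><lightng>. . .</lightng><thunder>. . .</thunder>
<water><lightng>. . .</lightng><thunder>. . .</thunder><rain>. . .</rain>
<rainbow>
</flood>
```

In the minimized version the first WATER start-tag is implied because a WATER is required at that point. The second WATER start-tag has to be typed, and it closes the first WATER. The RAINBOW start-tag closes the second. RAINBOW itself, being EMPTY, has no end-tag in either version.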
Element content is described in a content model. A content model contains one model group, which is made up of any or all of:
elements specified as allowable or required in this declaration;
elements permitted in elements in which the current element is allowable, and specified as "flowing through" to all sub-elements; (These are called inclusions or, more formally, inclusion exceptions.)
raw data (signified by the keyword #PCDATA).
? optional -- element or model group appears once or not at all.
* optional and repeatable -- element or model group may appear zero or more times (that is, not at all or any number of times).
+ required and repeatable -- element or model group appears once or more.
, sequential -- (e.g. a, b, . . . a must be followed by b; b is followed by . . .) Note: in a,b?, a may be followed by b.
& "and" -- the elements on either side of the ampersand must occur, but may appear in any order (e.g. a & b means a and b; or b and a).
| "or" -- any one of the elements or model groups must appear (e.g. a | b means either a or b).
<!-- Public document type definition for floods
before God's promise to Noah -->
<!ELEMENT flood (water+) >
<!ELEMENT water EMPTY >
<!-- The first word in this declaration tells us we're
defining an ELEMENT, a structural object which may contain
other objects, information (generally in the form of letters
and numbers), or nothing. The + is an occurrence indicator
which says that "one or more" of these must appear in the
element being defined. -->
<!-- The element FLOOD, therefore, may contain any "number"
of WATERs and may, in fact, go on forever. WATER, in turn,
must be defined. Here we're indicating that WATER has no
subelements and contains no information. It is an
EMPTY element. -->
<!-- Notice there are no minimization symbols. Unless "omittag"
is expressly requested in the SGML declaration, the symbols may
be left out. -->
<!-- DTD for floods after God's promise to Noah -->
<!ELEMENT flood (water+, rainbow, (dove|goose)?) >
<!ELEMENT (rainbow|dove|goose) EMPTY >
<!ELEMENT water (rain & wind & (lightng, thunder)) >
<!ELEMENT (rain|wind|lightng|thunder) (#PCDATA) >
<!-- When God promised that never again would a flood go
on forever, the DTD had to be re-written. Now SGML
connectors appeared on the face of the earth. -->
<!-- The comma indicates "sequence", that WATER must be
followed by a RAINBOW element, which means the flood is
incomplete without one. The question mark says the bird
following the RAINBOW is "optional".-->
<!-- In the name group (DOVE|GOOSE) the vertical bar
means "or". Only one or the other may appear. In the
content model for the element WATER, the ampersand
says "all these must occur, in any order"
(LIGHTNG-followed-by-THUNDER always appear in that
order, before, between, or after WIND and RAIN). -->
An SGML parser does not process anything in a DTD declaration that is placed between the two pairs of hyphens which make up a comment open delimiter and a comment close delimiter. Accordingly, comments provide a way to pass valuable information "behind the scenes".
Comments may stand alone in the DTD as the only content of a declaration or they may appear almost anywhere inside other markup declarations. They allow developers to add notes in the text as to why certain decisions were made and can give guidance to others who have to use the DTD.
You can have as many comments as you want within a single markup declaration but each comment must be enclosed within pairs of hyphens.
You should not put comment delimiters (--) inside a comment.
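A short sketch of the three placements just described. The element names MEMO, TO, FROM and BODY are hypothetical, chosen only for illustration.

```sgml
<!-- A comment declaration standing alone in the DTD. -->
<!ELEMENT memo - - (to, from, body) -- a comment inside another declaration -- >
<!-- first comment -- -- second comment in the same declaration -->
```

In the last line each comment is closed by its own pair of hyphens before the next one opens, which is why a stray pair of hyphens in the middle of a comment would end it early.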
Comments may also appear inside a document instance. Within the comments, authors and editors, a team of writers, clients -- anyone interested in the document -- can converse or leave messages for themselves or others. The parser knows to ignore these; they will never be output in final form.
Software can determine whether to treat comments specially during input. In the following Author/Editor example, the comment appears boxed on the screen, but is converted to SGML syntax to send to a publishing system, where SGML requires that it be ignored. It does not appear in the output.
A declaration that begins with the word ATTLIST allows you to qualify an element by declaring a list of its attributes. The DTD for the following menu avoids the declaration of a new element for every kind of dressing on the salad. The attribute declaration is attached to an element and is composed of one or more attribute definitions. Every attribute definition is composed of a name, a declared value and a default value.
<!-- Lunch at the steakhouse -->
<!ELEMENT lunch (meal)+ -- one meal per person -->
<!ELEMENT meal (appetiz?, steak, dessert?, custname, whopays) +(drink) >
<!-- The plus sign after the content model followed by
one or more elements within parentheses declares an
"inclusion". An inclusion indicates that the elements
can appear anywhere in the element to which they are
attached and in any of its subelements. You can have
one or more DRINK elements any time during MEAL. -->
<!ELEMENT appetiz (soup | salad) >
<!ELEMENT soup EMPTY --soup of the day -->
<!ELEMENT salad EMPTY>
<!ATTLIST salad kind NAME #REQUIRED
                dressing (french | 1000isl | bluechse) #REQUIRED>
<!-- The declared value for the attribute DRESSING is
called a "name token group", a series of values separated
by a vertical bar (|). They represent the only possible
values for the attribute. The declared value NAME requires
a value usually comprising up to 8 letters and numbers.
REQUIRED means that one value must be specified for the
attribute. -->
<!ELEMENT steak EMPTY>
<!ATTLIST steak cook (rare|medrare|medium) "medrare"
                side (potato|fries|rice) "fries">
<!--The value between quotes is used to force a default
value in case no value is specified for the attribute. -->
<!ELEMENT dessert (cake|applepie)>
<!ELEMENT (cake|applepie) EMPTY>
<!ATTLIST cake kind CDATA #REQUIRED>
<!-- CDATA means any letters, numbers, punctuation,
spaces, or other special characters. This gives you
the possibility of creating long and unique values
for an attribute. This means you can ask for any
cake you want; if the system doesn't recognize the
name, you may not get it. -->
<!ATTLIST applepie hot (hot|warm|cool) #IMPLIED
                   icecream (yes|no) #IMPLIED>
<!-- IMPLIED is an equivalent for optional. The
APPLEPIE comes the way the waiter or waitress prefers
unless you specify otherwise. -->
<!ELEMENT drink (water|beer|cola)>
<!ELEMENT (water|beer|cola) EMPTY>
<!ATTLIST water kind (tap) #FIXED tap>
<!-- The declared value for WATER is fixed by the
system. You can order any kind of fancy water you
want, but all they've got here is tap water. -->
<!ATTLIST beer number NUMBER #REQUIRED>
<!ATTLIST cola type (regular | diet) #CURRENT>
<!-- Beer is ordered by NUMBER from the beer list.
The Default Value CURRENT says that the first time
the element COLA appears, a type must be specified.
That value will be used for the next occurrences
unless you specify another. In other words, you
don't have to re-specify the type of COLA every
time you ask for a refill.-->
<!ELEMENT (custname | whopays) (#PCDATA) >
<!ATTLIST custname account ID #IMPLIED>
<!ATTLIST whopays charge IDREF #REQUIRED>
<!-- The customer's name is not critical data within
the content of the element CUSTNAME: The cashier could
make a typo or use a nickname and the computer would
have difficulty tracking the charge. Instead a unique
identifier is requested - but only if the customer has
an account. However, WHOPAYS has a required IDREF. The
system knows that WHOPAYS may have the same value for
many people's meals (when someone buys for the whole
table) but each CUSTNAME will have a unique ID. The
system will check that the value of the IDREF will be
a legitimate ID. The example below illustrates only
one of several MEALS in the LUNCH. A separate MEAL in
the same LUNCH must include a CUSTNAME with an
ACCOUNT ID of "ALEXF" for the CHARGE IDREF to work -->
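As a sketch only, one MEAL marked up against the declarations above might look like the following. The customer names, the cake and the beer number are invented for illustration, and because the DTD gives no minimization symbols we assume every tag except those of EMPTY elements is typed in full.

```sgml
<lunch>
<meal>
<appetiz><salad kind="caesar" dressing="bluechse"></appetiz>
<steak cook="rare" side="potato">
<drink><cola type="diet"></drink>
<dessert><cake kind="triple chocolate"></dessert>
<drink><beer number="12"></drink>
<custname>Bobby B.</custname>
<whopays charge="alexf">Alex F.</whopays>
</meal>
<!-- . . . a separate MEAL supplies <custname account="alexf"> . . . -->
</lunch>
```

The two DRINK elements turn up between other courses because DRINK is an inclusion. This CUSTNAME carries no ACCOUNT because that attribute is IMPLIED, while the CHARGE on WHOPAYS only resolves if some other CUSTNAME in the LUNCH has the ID "alexf".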
The first parameter of any attribute list declaration names the element or elements being associated with the list. The second parameter is a name or names for the attribute or set of attributes to be associated with the element.
The third parameter -- the declared attribute value -- may be a name token group, the actual values anticipated ((potato|fries|rice), for example), or a keyword that specifies the sort of value that is needed:
CDATA zero or more valid SGML characters
ENTITY currently declared general entity name
ENTITIES list of ENTITY names
ID unique name
IDREF unique name reference value
IDREFS list of unique name reference values
NAME string of 1-8 characters (8 is a "default" limit); starting with a-z or A-Z, followed by a-z, A-Z, 0-9, hyphen, period
NAMES list of NAME values; each string separated by one or more spaces, tabs or returns ("separators")
NMTOKEN same as NAME except that it can also start with 0-9, hyphen, period.
NMTOKENS list of NMTOKEN values; each separated by a separator+
NOTATION a notation name that identifies the data content notation of the element's content
NUMBER a string of 1-8 characters consisting of the digits 0-9
NUMBERS list of NUMBER values; each separated by a separator+
NUTOKEN string of 1-8 characters beginning with 0-9 followed by a-z, A-Z, 0-9, hyphen, period
NUTOKENS list of NUTOKEN values; each separated by a separator+
An attribute's default value may be a LITERAL STRING, the actual characters needed (fries for example) or a keyword. Most common keywords are #REQUIRED, which is obvious; and #IMPLIED, which effectively means optional.
The declaration above tells the parser: When you come across the entity reference &tsp; in the document instance, substitute everything between the quote marks including blank spaces. In this case, the substitution would result in the title of our booklet, The SGML Primer.
When it comes right down to it, entities are things. An entity may be a short string of characters declared as standing for a longer string of characters; it may be a way of referencing a whole other marked-up file from within the current file; it may be a placeholder for a graphic or other non-SGML data that will be inserted when the document is being viewed or printed; or it may be a special symbol or character which doesn't appear on your terminal or which may have different identifiers on different systems.
Entities Automate Global Search and Replace: Because entities enable the easy substitution of one thing for another, they are extremely useful for something volatile, such as product name during the development stage.
Entities Aid Standardization: Replacement text is defined in only one place, the entity declaration. Entities are particularly useful for complicated text, scientific terms, or expressions which need fancy formatting.
Entities may be declared either in the DTD or in the instance -- strictly speaking, in that part of a file which is still part of the DTD, the declaration subset. (This allows for the addition of local declarations to the DTD, which may be shared by a wide group of users.)
Entities are referenced in the document when the declared entity is used, in context, surrounded by markup delimiters (& and ; respectively).
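A minimal sketch of a declaration and a matching reference, reusing the tsp entity and the PARA element from the Booklet DTD. The sentence itself is made up.

```sgml
<!ENTITY tsp "The SGML Primer" >

<para>Welcome to &tsp;, the booklet you are holding.</para>
```

After the parser substitutes the replacement text, the paragraph reads "Welcome to The SGML Primer, the booklet you are holding."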
Let's start with the following declaration:
<!ENTITY copyr SYSTEM "c:\boilertxt\copyr.sgm" SUBDOC>
When &copyr; is referenced in the document instance, the keyword SYSTEM alerts the system that the text to be included can be found in a file on the C drive. The keyword SUBDOC indicates the referenced external entity is tagged SGML text with its own DTD. This means that the main DTD doesn't have to include element (and other) declarations for constructs which appear only in the incorporated SUBDOC file. However, there's an interesting alternative: If the SUBDOC parameter isn't there, the file will be parsed just as if it had been typed directly into the document in which the entity reference appears.
The following morsel of document incorporates a boilerplate copyright notice which may be used over and over in books by many authors: (The SMALLBOOK DTD has been declared elsewhere.)
<!DOCTYPE smallbook [
<!ENTITY copyr SYSTEM "c:\boilertxt\copyr.sgm">
<!ENTITY author "Matthew Markup">
<!ENTITY date "1995">
]>
<smallbook><frontm><ti> . . . </ti>
<au>&author;</au>
<pubfm> &copyr;</pubfm>
</frontm> . . .
The file copyr.sgm has no DTD of its own -- all its elements must be declared in the SMALLBOOK DTD. It looks something like this:
. . . . Copyright &date; by &author; . . .
When the book is parsed, the entity reference &author; in both files is replaced by the name "Matthew Markup". The author's name not only appears in the book's front matter after the title, but the boilerplate text in the copyright notice is being automatically updated without ever being edited! (It is output as "Copyright 1995 by Matthew Markup.")
Accordingly, the file copyr.sgm may be referenced by many books, each of which declares its own replacement value for &date; and &author;.
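For instance, a second book might reuse the same boilerplate file with different local values. This is a sketch: the author name and date here are invented, and everything else follows the SMALLBOOK example.

```sgml
<!DOCTYPE smallbook [
<!ENTITY copyr SYSTEM "c:\boilertxt\copyr.sgm">
<!ENTITY author "Dana Delimiter">
<!ENTITY date "1996">
]>
```

With these declarations, the untouched copyr.sgm would come out as "Copyright 1996 by Dana Delimiter."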
Parameter entities are used only within markup declarations. They are identified by the % sign after the keyword ENTITY.
<!ENTITY % declars "element | attlist | entity | notation" >
<!ELEMENT dtd (%declars;) >
<!ELEMENT (title | instance | %declars;) (#PCDATA) >
In the example above, the first declaration tells the parser: Let declars stand for everything between the pair of double quotes. Accordingly, when you come across %declars; in the DTD (that is, when you run into the entity with its start and end delimiters), act as if everything between the literals -- in this case, the four subelements element, attlist, entity, and notation -- had been freshly typed in. (Note that these elements are being used as examples only. Their names are the same as the SGML declarations they illustrate.)
There are two reasons to be particularly vigilant with parameter entities. They can be nested, so a replacement may have to occur within the replacement text itself; and you can get lazy. Application designers will use a parameter entity within a number of content models, and while it may be completely appropriate for many situations, unless the replacement text is examined in each of these new contexts, inappropriate structures may get built.
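A small sketch of nesting, using hypothetical entity names (m.ph and m.inl) and the PARA, QUOTE and EMPH elements from the Booklet DTD. The inner entity is declared first so that it can be referenced inside the replacement text of the outer one.

```sgml
<!ENTITY % m.ph  "quote | emph" >
<!ENTITY % m.inl "#PCDATA | %m.ph;" >
<!ELEMENT para - - (%m.inl;)* >
```

The parser ends up treating the content model as (#PCDATA | quote | emph)*. If m.ph were later widened to include, say, a footnote element, every content model built on m.inl would quietly accept footnotes too, which is exactly the sort of side effect worth checking for.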
Note that parameter entities must be declared before they are referenced. It's difficult to resist including the following example. The theory and practice of SGML discourages the use of any kind of element structure which counts elements. However, this example does, if only to show off enthusiastically the use of parameter entities:
<!-- Public DTD for Noah's flood -->
<!ENTITY % m.week "water, water, water, water, water" -- five working days of rain -->
<!ENTITY % m.wkend "water, water" -- a two-day weekend of rain -->
<!ELEMENT flood - - (%m.week;, %m.wkend;, %m.week;, %m.wkend;, %m.week;, %m.wkend;, %m.week;, %m.wkend;, %m.week;) >
<!ELEMENT water - O EMPTY >
<!-- As you can see, this DTD takes advantage of
parameter entities to declare %m.week; and %m.wkend;
which can now be used throughout the DTD in place
of the longer constructs that appear within quotes
in the declarations. The main feature of this
approach is that when certain structures need to be
changed, this may be done in just one place rather
than each spot where the parameter entity is used.
If suddenly there were a three-day weekend, only
the entity declarations would have to be changed.
The parameter entity construct becomes increasingly
useful as you work with more complex DTDs. -->
In processing an SGML file for output, some data may require special treatment. A notation declaration is used to specify any special techniques that may be required when processing the document. Typical examples would be mathematical formulae or graphics files.
The first step in using notation constructs is to identify types of data content notation that may be required, and declare them as available. The keyword NOTATION establishes the legitimacy of whatever parameter follows as a content notation that the system will now recognize. The keyword SYSTEM associates instructions or meaning with the new notation.
<!NOTATION tex SYSTEM "/usr/bin/tex" >
<!NOTATION eqn SYSTEM "/usr/bin/eqn" >
The second step is to determine whether an element encoded in such a notation consists of SGML data characters or non-SGML data.
Broadly speaking, SGML data is made up of the characters you type, the same characters used for the markup and content, but, in all likelihood, with special meaning in the notation. (In the first example following, "over" has meaning to the math processor EQN, but not in SGML.)
Non-SGML data, on the other hand, could be anything, an image produced with graphic software, a digital music recording, a clip of video. Naturally, it's unreasonable to expect to pass contents on to an SGML parser that may cause it to hiccup or crash. Accordingly, you cannot embed non-SGML data in an SGML document. It must reside in external files which can be referenced using entities. (An element with SGML data content, however, could either be embedded in an SGML document directly -- within an appropriate element -- or stored as a separate entity and referenced.)
Building on the notation declarations in the previous section, here an element (MATH) is declared with a data content notation of either TEX or EQN:
<!ELEMENT math (#PCDATA) >
<!ATTLIST math type NOTATION (tex | eqn) #REQUIRED>
In the document instance, the attribute identifies which notation to expect for this particular instance of the element. The markup effectively says "Deal with this element's contents using a special process stored as /usr/bin/eqn."
<math type="eqn">(3 over 4) over 10</math>
A second example demonstrates the use of a notation for non-SGML data in an external entity.
<!NOTATION pict SYSTEM "pictView">
<!ENTITY sysmod SYSTEM "/usr/gfx/sysmodel" NDATA pict>
Here the entity declaration first creates an entity name. One keyword says the entity is on the system (and a parameter says where the file is located). A second keyword NDATA associates the entity with a notation -- one could use any declared notation at this point -- and its parameter says which. By itself, the content notation will have no inherent or obvious meaning to the system. The entity declaration is simply the standardized means to say, for example, "This drawing of the SGML System Model is stored in PICT". (Where entities need to have meaning beyond a specific system, or a machine-independent means of identification, SGML has a formal public identifier construct to accomplish this.)
The last step is to create some sort of pointer in the document where the entity is to be pulled in. The reference sysmod points to an entity declaration which associates the identifiers of the System Model graphic file with the notation for PICT. That, in turn, points to a notation declaration which establishes PICT as being usable on this system. Then a software application takes over to actually render the PICT image.
<!ELEMENT artwork CDATA > <!ATTLIST artwork filenm ENTITY #REQUIRED>
These declarations set the stage for the pointer in the document instance: <artwork filenm="sysmod">
Instead of using an attribute, the entity reference &sysmod; can be placed in the document wherever it needs to be called. The advantage of using the element/attribute combination is that other information -- scaling, cropping, positioning details, for example -- can also be specified in the attribute list.
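A sketch of what such a richer attribute list might look like. The attribute names SCALE and ALIGN are hypothetical, and for simplicity ARTWORK is declared EMPTY here, which is our assumption rather than the declaration shown earlier.

```sgml
<!ELEMENT artwork - O EMPTY >
<!ATTLIST artwork filenm ENTITY              #REQUIRED
                  scale  NUMBER              #IMPLIED
                  align  (left|center|right) center >

<artwork filenm="sysmod" scale="50" align="left">
```

What a value such as scale="50" actually means is up to the application that renders the picture. SGML only checks that it is a NUMBER.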
Marked sections allow for the creation of content which is to be processed in special ways -- sometimes thrown away, sometimes hiding markup from the parser. In the example below, two versions of an arithmetic text book -- the students' and the teacher's -- are being produced from one file. The marked section construct lets you mark up text for each and lets the two versions share common text.
<equation> 2 + 2 = <![ IGNORE [ 4 ]]> </equation>
The parameter entity construct can be used with marked section declarations. First the appropriate parameter entities must be declared:
<!ENTITY % teacher "IGNORE">
<!ENTITY % student "INCLUDE">
Now, in the document instance, every piece of "teacher only" content is marked as such.
<equation> 2 + 2 = <![ %teacher; [ 4 ]]> </equation>
When it comes time to print the teacher's edition, the word IGNORE is changed to INCLUDE in the entity declaration and the parser does the rest.
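A sketch of that switch, assuming the declarations and the marked section shown above. Only one of the two declarations would be present in any given run; if both appeared, the first would take effect.

```sgml
<!-- Students' edition: the answer is skipped -->
<!ENTITY % teacher "IGNORE" >

<!-- Teacher's edition: the same marked section is now parsed -->
<!ENTITY % teacher "INCLUDE" >
```

With the second declaration in force, the parser treats <![ %teacher; [ 4 ]]> as if it read <![ INCLUDE [ 4 ]]>, so the 4 becomes part of the equation's content. The document instance itself is never edited.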