SGML Declarations

Declaration of a Concrete Syntax

The SGML standard defines the SGML language indirectly, using what it calls "delimiter roles" rather than specific characters when defining the syntax of the language. It does not use a more direct method like assigning specific characters to represent markup functions (as is done in the C programming language). As an example, a typical end tag in an SGML document, "</elemname>", contains two delimiters and a name. "</" is the delimiter which indicates the start of an end tag; its role is called "etago". ">" is the delimiter which indicates the end of a tag; its role is called "tagc". Whenever the standard refers to a delimiter, it refers to it by role name, not by a specific character assignment to that role. Within the concrete syntax definition, we assign characters or character strings to delimiter roles, thus inventing a new concrete syntax but preserving the basic character of SGML.

In addition to defining delimiters, a concrete syntax also describes:

bit combinations that should not be used in documents ("shunned" characters),
bit combinations used for the function characters (SPACE, record end, record start, etc.),
naming rules (for example, which characters can be used in names),
reserved names used within SGML (for example, PCDATA in a content model) and
various quantities (for example, the maximum length of names).

Reference Concrete Syntax

In the SGML standard, one concrete syntax has been defined. It is called the reference concrete syntax. It performs two functions:

It provides a syntax for encoding SGML specifications that would otherwise not be well defined, such as SGML declarations and system declarations.
It defines a baseline of support that is required of all conforming SGML systems. All conforming SGML systems must be able to support the reference concrete syntax.

When the reference concrete syntax is discussed, there is often a misconception. Because the ISO 646 character set was used to define the characters that are used in the syntax, it is often inferred that anything encoded using the reference concrete syntax must be encoded using that character set. As we shall see, that is not the case. The syntax defines what characters or character strings are associated with the various syntax roles but characters are defined by their meaning, not their bit combination. Therefore, a document may use the reference concrete syntax when encoded in EBCDIC, ASCII or any other character set which contains all the significant SGML characters used in the concrete syntax.

How is a syntax defined?

Before defining any parts of a syntax, a character set, the syntax-reference character set, must be defined and its role understood.

A syntax can be used in markup only after the roles in the syntax are assigned to characters. Remember that the term "character" refers not to the bit combination used to represent a character but to the meaning of the character. Since it is easier to specify bit combinations in the form of character numbers or in character strings, we use them as an intermediate step in the definition. So when defining a syntax, we define roles in terms of bit combinations that the system maps to the character intended. We must tell the system how to map these bit combinations to characters, just as we did when defining the document character set. This is done using the syntax-reference character set. Unlike the document character set however, this mapping is used only within the scope of the syntax definition. For example, assume that in the syntax-reference character set, the bit combination B'01001100' (76 decimal) is mapped to the character "less-than sign". Using ISO 646 as the base set, this would be done by specifying:

BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
        :
        76 1 60 -- Less-Than Sign --
        :

Now, everywhere that "less-than sign" is to have special meaning in the syntax, the character number 76 or the character reference "&#76;" should be used (depending on the part of the syntax being specified, as we shall see). This character is commonly mapped to the STAGO delimiter role. This specification looks like this:

DELIM GENERAL SGMLREF
      STAGO "&#76;" -- "<" --

The syntax-reference character set provides a means for mapping syntax roles to characters (not bit combinations!) and is not used anywhere else.

Why was the syntax-reference character set necessary? Why wasn't the document character set used? As we shall see shortly, some of the roles defined in a syntax are function or control characters, not graphic characters. Using the bit combinations of these control characters directly to define the SGML function roles they fulfill is not reliable. Instead, these function characters are defined by specifying decimal digits (like the character string "76"). If this representation referenced the bit combination for a control character as defined in the document character set, the syntax would be bound to the document character set and could not be translated to another character set without modifying the syntax definition itself. By using the syntax-reference character set instead, and using character numbers or numeric character references for all role definitions, the syntax definition becomes independent of the document character set and can be translated freely. This significant advantage outweighs the additional complexity of the approach.

The fact that there is a syntax-reference character set leads to a constraint on the document character set: there must be one and only one bit combination in the document character set for every character used in the syntax role definitions or in the naming rules. If this were not true, one would either not be able to use a delimiter or name character (if there were no bit combination in the document character set) or it could lead to a very confusing syntax (if multiple bit combinations mapped to a single character).

Now that we understand how the syntax-reference character set is used, we can learn how to define a concrete syntax.

Defining the Concrete Syntax

"Concrete syntax" is the SGML term used to refer to the mappings of characters to delimiter roles, control characters that have special meaning, naming rules, reserved names and quantities used in the markup of the document. In other words, the concrete syntax describes what characters are used to start and end tags, entity references and all the other special, SGML-defined, character sequences found in markup. All of these items may be changed. In fact, the use of this capability can define a syntax which looks quite different from anything one normally thinks of as SGML data. This is a very powerful facility that can be very useful if used wisely.

To begin with, the start of the syntax definition is identified with the keyword "SYNTAX". There are two alternatives for what follows. Either a public syntax may be referenced using a public identifier or a full syntax specification may follow. The reference to a public concrete syntax is used if there is already a defined syntax which describes the syntax to be used within the document. The most commonly used such public syntax is the reference concrete syntax. If a public syntax is identified, the keyword SWITCHES is allowed. The purpose of this keyword is to identify a pair of character numbers which have been switched. This allows reuse of public syntaxes if there are only trivial differences. Here is an example of such a specification:

SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN" SWITCHES 10 13

In this example, the syntax is declared to be the reference concrete syntax with the line feed and carriage return character meanings reversed (which may have some use on Unix-based systems).

If a public syntax is not referenced, a full syntax definition specification must follow. A syntax definition consists of the following parts:

Shunned character number identification
Syntax-reference character set
Function character identification
Naming rules
Delimiter set
Reserved name use
Quantity set

Each of these parts will be described in the sections that follow.
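As an orientation, the seven parts can be seen together in one place. The sketch below is my transcription of the reference concrete syntax written out as a full syntax definition; it is believed to match the figure in the standard, but it should be checked against ISO 8879 before being relied on.

```sgml
SYNTAX
-- 1. Shunned character number identification --
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
         19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
-- 2. Syntax-reference character set --
BASESET  "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET  0 128 0
-- 3. Function character identification --
FUNCTION RE          13
         RS          10
         SPACE       32
         TAB SEPCHAR  9
-- 4. Naming rules --
NAMING   LCNMSTRT ""
         UCNMSTRT ""
         LCNMCHAR "-."
         UCNMCHAR "-."
         NAMECASE GENERAL YES
                  ENTITY  NO
-- 5. Delimiter set --
DELIM    GENERAL  SGMLREF
         SHORTREF SGMLREF
-- 6. Reserved name use --
NAMES    SGMLREF
-- 7. Quantity set --
QUANTITY SGMLREF
```

Because every delimiter, reserved name and quantity is left at its SGMLREF value, this definition should be equivalent to referencing the public identifier of the reference concrete syntax.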

Shunned Character Number Identification

The Shunned Character Number Identification portion of the syntax definition specifies the characters that the designer of the syntax determines should not be used in any document character set using the syntax. This is done because the indicated bit combinations may cause problems for processing systems using the data. This is not an absolute prohibition, however. Shunned bit combinations may be used to represent markup or minimum data characters (RS, RE, SPACE, lower case letter, upper case letter, digit or special) but should not be used for other ordinary data characters. The reason for these exceptions is that markup characters will not be passed to the application by the parser and therefore will not cause problems. A keyword that may be specified here is "CONTROLS". This keyword indicates that any character number that is used as a control character in the system character set is a shunned character as well.

The standard indicates that shunned characters should not be translated when translating from one character set to another. This particular note has always troubled me because within EBCDIC, the tab character is encoded differently than in ASCII and both bit combinations are typically indicated as shunned characters. It is an indication to me that care must be exercised when specifying characters as shunned.

The following example shows a typical specification for this section:

SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
         19 20 21 22 23 24 25 26 27 28 29 30 31 127 255

This example defines two graphic characters as shunned characters, 127 and 255. While these are graphics in some character sets, these two character encodings may cause problems to some systems and were therefore identified as shunned. In most cases, this is excessive. If any bit combination is not allowed in a character set, that bit combination may be defined as a non-SGML character in the document character set and therefore be prohibited from occurring in the document.

Note that the bit combinations are identified using character numbers (decimal number representations of the bit combination).
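The alternative mentioned above, declaring a bit combination as a non-SGML character, is done in the document character set rather than in the syntax. The following is a sketch of such a document character set, assuming an ISO 646 base set; the particular ranges marked UNUSED are illustrative choices, not a recommendation.

```sgml
CHARSET
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0  9 UNUSED   -- 0 through 8: non-SGML characters --
          9  2 9        -- 9 and 10: tab and line feed --
         11  2 UNUSED   -- 11 and 12: non-SGML characters --
         13  1 13       -- 13: carriage return --
         14 18 UNUSED   -- 14 through 31: non-SGML characters --
         32 95 32       -- 32 through 126: graphic characters --
        127  1 UNUSED   -- 127: non-SGML character --
```

A bit combination marked UNUSED may not occur anywhere in the document, which is a stronger statement than shunning it.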

Syntax-Reference Character Set

After the shunned character number identification, the next section defines the syntax-reference character set. The function of this character set is to define the character set used when specifying character numbers or when using bit combinations to represent numbers, as was described above in detail.

The specification of the syntax-reference character set is identical to the specification of the document character set which was discussed in the first article. It uses the BASESET specification to identify a known character set. The DESCSET specification is then used to define the syntax-reference character set in terms of the base character set.
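Each DESCSET entry has three parts: the first character number being described, the number of consecutive characters covered, and the base set character number where the run starts. The fragment below is a sketch, not a complete character set; it assumes a syntax-reference character set whose numbering follows EBCDIC-style positions for three characters, described in terms of an ISO 646 base set.

```sgml
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET  64 1 32   -- number 64 means SPACE (base number 32) --
         76 1 60   -- number 76 means less-than sign (base number 60) --
        110 1 62   -- number 110 means greater-than sign (base number 62) --
```

With these entries in effect, writing 76 or "&#76;" elsewhere in the syntax definition refers to the less-than sign, whatever bit combination the document character set uses for it.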

Function Character Identification

The function character identification section identifies the character numbers in the syntax-reference character set that represent several important function characters: RS, RE and SPACE.

In addition to these functions, other special function characters may be defined. To do so, you must define a name for the function and assign a function class and a character number. There are five function classes, defined as follows:

FUNCHAR: Identifies characters that may have some significance to the system but have no SGML function defined. One reason for using this is to allow the function character to be entered using a named character reference, "&#funchar;" where funchar is the name of the function character as identified here, rather than entering the bit combination directly.
MSOCHAR: Identifies characters that inhibit the recognition of markup in the data which follows. The MSICHAR function character is used to restore markup recognition.
MSICHAR: Identifies characters that restore markup recognition when it was suppressed by the use of a character defined as an MSOCHAR character.
MSSCHAR: Identifies characters that inhibit the recognition of markup for the character which immediately follows the function character in the same entity.
SEPCHAR: Identifies characters that are allowed as separators (like RE, RS and SPACE) and will be replaced by SPACE in all contexts in which RE is replaced by SPACE. One commonly defined such function character is TAB.

Defining Character Sets For Use in the Far East

The MSSCHAR, MSICHAR and MSOCHAR function classes may be used to define function characters that inhibit the interpretation of markup. This is desirable in instances where characters are encoded with escapes since the bit combinations that follow the escape may be erroneously interpreted as markup. IBM's Double Byte Character Sets on mainframe and workstation computers are examples of where these function classes must be used to inhibit the erroneous detection of markup characters in double byte encoded data. Care must be used when defining such functions: while markup recognition is inhibited, entity references and all other markup are not recognized. Note that even if markup recognition is inhibited, all bit combinations in an SGML document (even those found in the second byte of a double byte character) must be SGML characters as defined in the document character set definition or the parser will issue non-SGML character error messages.

Example

Here is an example of the identification of function characters:

FUNCTION RE          13
         RS          21
         SPACE       64
         TAB SEPCHAR  5
         SO  MSOCHAR 14
         SI  MSICHAR 15

In this example, character number 13 (carriage return) has been defined as RE (record end), character number 21 (line feed) has been defined as RS (record start) and character number 64 (space) has been defined as SPACE. In addition, character number 5 (tab) has been defined as the function TAB with the function class SEPCHAR. These represent what I would expect to see in an EBCDIC character set definition. The SO (shift out) function is defined as an MSOCHAR function class and is the EBCDIC SO character, which escapes to double byte encoding in mixed (single and double byte encoded) EBCDIC data. The SI (shift in) function is defined as an MSICHAR function class and is the EBCDIC SI character which returns to single byte encoding in mixed data.

Naming Rules

The Naming Rules definitions allow you to specify characters (in addition to alphabetic characters and digits) to be used in names and as name start characters. (Name characters are characters that are allowed in anything which requires an SGML name, like an element name, entity name, attribute name, attribute value in a token list, etc. A name start character is a special subset of name characters that are the only characters allowed to start a name.) It also allows the specification of uppercasing rules for the added name characters. The categories that may be defined are:

LCNMSTRT

Lowercase name start characters

These characters are considered lowercase and may be used as the first character in a name. The order of the characters in the parameter literal argument must correspond to the uppercase name start definition found in UCNMSTRT.

UCNMSTRT

Uppercase name start characters

These characters are considered uppercase and may be used as the first character in a name. These characters must correspond to those found in LCNMSTRT. Characters may occur multiple times in this definition to allow different lowercase characters to map to the same uppercase character. For letters which do not distinguish between lower and uppercase, the same character is used in both LCNMSTRT and UCNMSTRT.

LCNMCHAR

Lowercase name characters

These characters are considered lowercase and may be used within a name. The order of the characters in the parameter literal argument corresponds to the uppercase name character definition found in UCNMCHAR.

UCNMCHAR

Uppercase name characters

These characters are considered uppercase and may be used in names. They correspond to the characters found in LCNMCHAR in the same way that UCNMSTRT characters corresponded to LCNMSTRT characters.

NAMECASE

Determines the extent of uppercase substitution during markup processing. It allows you to determine whether case may be used to differentiate entity names, element names, attribute names, etc. There are two different specifications possible for NAMECASE:

GENERAL: Determines whether all names, name tokens, number tokens and delimiter strings (besides entity names and references) will be folded to uppercase.
ENTITY: Determines whether all entity names and entity references will be folded to uppercase.

You must specify either a "YES" or a "NO" for both.

Here is an example of a typical naming rules section:

NAMING LCNMSTRT ""
       UCNMSTRT ""
       LCNMCHAR "" -- ".-" --
       UCNMCHAR "" -- ".-" --
       NAMECASE GENERAL YES
                ENTITY  NO

In this example, no name start characters have been added to the defined minimum set (upper and lowercase letters). That is why the LCNMSTRT and UCNMSTRT have null character strings as arguments. The minimum set of name start characters is defined in production 53 of the standard. The name characters (which at a minimum contain the name start characters and the digits, 0 through 9) have two additional characters added, the period and the hyphen. Case is not an issue for these characters so they are specified using the same character numbers in both LCNMCHAR and UCNMCHAR. The NAMECASE specification indicates that all names except entity names (in declarations and references) will be folded to uppercase.
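The effect of this NAMECASE setting can be sketched with a few declarations. The element type "para" and the entity names "copy" and "COPY" are hypothetical, chosen only for illustration.

```sgml
<!-- GENERAL YES: element names are folded to uppercase, so this
     declaration defines the element type PARA -->
<!ELEMENT para - O (#PCDATA)>
<!-- a start tag written <para>, <Para> or <PARA> refers to that element -->

<!-- ENTITY NO: entity names are not folded, so these declarations
     define two different entities -->
<!ENTITY copy "copyright">
<!ENTITY COPY "COPYRIGHT NOTICE">
```

Under these rules the references "&copy;" and "&COPY;" resolve to different replacement text, while the case used in tags makes no difference.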

Notice that the LCNMSTRT and the others take a character string called a parameter literal, not character numbers. This allows these characters to be specified directly, which is much easier than using the character number. For this to be valid, however, the bit combination used by the character in the literal must be the same in the document character set (the bit combinations used to encode the document) and the syntax-reference character set. If they are not, the character string specification is misleading since the bit combination will be interpreted as if it were in the syntax-reference character set, not the document character set (and the two may be quite different). A method for avoiding this problem is to use numeric character references in the literal strings to specify the bit combination used in the syntax reference character set. This is shown in the above example. Notice that if this method is used, the syntax definition specification may be translated to a different character encoding without affecting the assignment of characters to roles in the syntax.
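As a sketch of the numeric character reference method, assume that the syntax-reference character set is ISO 646, where the hyphen is character number 45 and the period is character number 46. Under that assumption (which may not match the character set intended in the example above), the naming rules could be written as:

```sgml
NAMING LCNMSTRT ""
       UCNMSTRT ""
       LCNMCHAR "&#45;&#46;" -- "-." --
       UCNMCHAR "&#45;&#46;" -- "-." --
       NAMECASE GENERAL YES
                ENTITY  NO
```

The comments carry the readable form of each literal, so the declaration stays understandable even though the literals themselves contain only numbers.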

Delimiter Set

The delimiter set definitions assign delimiter roles to specific characters or character strings. There are two classes of delimiters defined:

General delimiters defining the characters used to represent the delimiter roles in the document
Short reference delimiters defining the character strings that are available for use in short reference maps for mapping these strings to entity references.

General delimiter definitions

The keyword GENERAL introduces the section where general delimiters may be defined. A list of all the general delimiters is found in Figure 3 of the SGML standard (in clause 9.6.1). All SGML delimiters may be changed. If a delimiter role is not reassigned in the SGML declaration, it keeps the string assigned to it in the reference concrete syntax. The GENERAL keyword must be followed by the keyword SGMLREF to remind us of this fact.

Short Reference Delimiter Definitions

The SHORTREF keyword introduces the section where the short reference delimiter strings are defined. Short reference delimiters are character strings that the SGML parser can map to entity references. Notice that for short references to be recognized in a document, a short reference mapping must also be identified that maps the short reference delimiter to an entity reference. This mapping is done in the DTD, not in the SGML declaration. There is one general rule about the recognition of delimiters that is especially important in short reference string definition: when one delimiter is longer than another but is equal for the length of the shorter delimiter, the longer delimiter will be recognized if it is matched. It does not matter whether the longer delimiter is mapped; it need only be defined. This can lead to unexpected results if not understood. Care must be exercised when defining short reference delimiters.

Following the SHORTREF keyword, one of two keywords is specified. The keyword NONE indicates that no delimiter strings are defined as short reference strings except those defined in the list which follows. If no strings are listed, then no short reference strings are defined and the short reference facilities may not be used in the DTD. If the keyword SGMLREF is used, the delimiters defined are in addition to those found in the reference concrete syntax.
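The DTD side of the mechanism described above can be sketched as follows. The fragment assumes the reference concrete syntax, in which the record end "&#RE;" is one of the defined short reference delimiters; the names "poem", "line", "poemmap" and "line-end" are hypothetical.

```sgml
<!-- Entity whose replacement is the end tag of a line element -->
<!ENTITY   line-end ENDTAG "line">
<!-- Map the record-end short reference delimiter to that entity -->
<!SHORTREF poemmap "&#RE;" line-end>
<!-- Make the map current whenever a poem element is open -->
<!USEMAP   poemmap poem>
```

With these declarations, each record end inside a poem would be treated as a reference to "line-end", so the author need not type the end tag of each line.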

Other rules for defining delimiters

When changing delimiter definitions you must keep a few rules in mind:

A delimiter must differ from any other delimiter which may be recognized in the same mode. This prevents the declaration from defining an ambiguous condition.
The use of a name start character or a digit in a delimiter string is discouraged. This argues against using "<e" as the end tag open delimiter, since this would make it impossible to have any elements whose name begins with an "e".
Delimiter strings must be less than NAMELEN characters in length.

Example

Here is an example of a delimiter set definition:

DELIM GENERAL SGMLREF
      ETAGO "" -- ":/" --
      PIO   "" -- ";" --
      REFC  "" -- "." --
      STAGO "" -- ":" --
      TAGC  "" -- "." --
      SHORTREF NONE "B " -- Used to trap trailing blanks. --

The above example redefines several of the delimiter roles to use a more IBM GML-like set of delimiters. Notice that the ETAGO is not ":e" as would be expected for GML starter set compatibility. This could be specified as ":e" but this would be contrary to the recommendation found in the SGML standard that name start characters not be defined as part of delimiter strings and was therefore not used.

The short reference delimiter string specified allows the trapping of trailing blanks on input lines. The DTD will determine how this is used, if at all.
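To give a feel for the result, here is a sketch of how a document fragment might look under a syntax that uses the delimiter strings given in the comments of the example. It assumes that the period is not also declared as a name character in that syntax (otherwise it could not end a name cleanly), and the element type "p" and the entity "sgml" are hypothetical.

```sgml
:p.This paragraph is tagged with GML-like delimiters and refers to the &sgml. entity.:/p.
```

Here ":" plays the STAGO role, "." plays both the TAGC and REFC roles, and ":/" plays the ETAGO role. The same fragment in the reference concrete syntax would use "<p>", "&sgml;" and "</p>".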

Reserved Name Use

This portion of the syntax declaration allows you to substitute a name for many of the reserved names used in the reference concrete syntax. None of the reserved names used in the SGML declaration may be replaced since the SGML declaration is always written in the reference concrete syntax. All names defined here must follow all the rules for names (for example, they must be composed of name characters) in the declared concrete syntax.

Examples of these reserved names are:

ANY: Used in element content models
ATTLIST: Used in attribute declarations, data attribute declarations and link attribute declarations
CDATA: Used in entity declarations, element content models, attribute declarations, data attribute declarations, link attribute declarations and marked section keyword status areas.

There are, of course, many others. A complete list may be found in the ISO WG8 paper number N1035, which is included in The SGML Handbook, by Dr. Goldfarb, the editor of the SGML standard.

Once again notice that the reserved names used in the SGML system declaration and SGML declaration cannot be modified.

The reference concrete syntax reserved name is used for any name not replaced by this definition. Here is an example of a specification:

NAMES SGMLREF O "OMIT"

In the example, the reserved name "O" (that indicates that either a start or end tag is omissible in an element declaration) is replaced by the name "OMIT". The keyword SGMLREF is required and is a reminder that unspecified reserved names default to their values in the reference concrete syntax. Note that the comments made previously about using numeric character references still apply; they weren't used here for simplicity's sake.
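The effect of the replacement shows up in the DTD. As a sketch, an element declaration that allows the end tag to be omitted would be written with the new name in a document using this syntax; the element type "para" is hypothetical.

```sgml
<!-- Reference concrete syntax: O marks the end tag as omissible -->
<!ELEMENT para - O (#PCDATA)>

<!-- Variant syntax with O replaced by OMIT -->
<!ELEMENT para - OMIT (#PCDATA)>
```

Only one of the two forms is valid in any given document, depending on which syntax its SGML declaration specifies.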

Quantity Set

The quantity set describes a number of limits on properties of SGML-defined objects. The effect of these limits depends on the particular quantity involved. Care should be used when changing these since creating a variant syntax may make it difficult for some SGML systems to process documents created with that syntax. The best means of guaranteeing portability between different SGML systems and applications is to use the reference concrete syntax as much as possible.

A complete list of quantities defined by SGML is found in Figure 6 in clause 13.4.8.

Here is an example of a quantity set specification:

QUANTITY SGMLREF ATTCNT 64 GRPGTCNT 128 NAMELEN 32

In the example, the maximum number of attributes which may be defined in an ATTLIST declaration has been set to 64. The maximum number of content tokens which may occur at all levels of a content model (this includes element names and delimiters) has been set to 128. Finally, the maximum length of names has been raised to 32 characters. Here again, the keyword SGMLREF is required and is a reminder that unspecified quantities default to their value in the reference concrete syntax.
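For comparison, the reference quantity set gives these three quantities the values 40, 96 and 8 respectively. The sketch below shows a declaration that depends on the raised NAMELEN; the element type name is hypothetical.

```sgml
<!-- The name below is 15 characters long: too long for the reference
     NAMELEN of 8, but within the NAMELEN of 32 set in the example -->
<!ELEMENT bibliographyref - O EMPTY>
```

A DTD containing such a name can only be processed by systems that support the variant quantity set, which is the portability cost mentioned above.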

Conclusion

There are a number of reasons you might want to define a concrete syntax: perhaps longer names must be allowed, larger content models must be accommodated or naming characters added. In any case, SGML provides a very flexible facility for making these changes and communicating these changes to consumers of the data.


Wayne L. Wohler, Dept G82/025Z, Publishing Solutions Development, IBM Corporation, PO Box 1900, Boulder, Colorado 80301-9191
Internet: wohler@vnet.ibm.com
IBMMAIL: USIB29WX@IBMMAIL
Phone: 1-303-924-5943