The SGML standard defines the SGML language indirectly, using what it calls "delimiter roles" rather than specific characters when defining the syntax of the language. It does not use a more direct method like assigning specific characters to represent markup functions (as is done in the C programming language). As an example, a typical end tag in an SGML document, "</elemname>", contains two delimiters and a name. "</" is the delimiter which indicates the start of an end tag; its role is called "etago". ">" is the delimiter which indicates the end of a tag; its role is called "tagc". Whenever the standard refers to a delimiter, it refers to it by role name, not by a specific character assignment. Within the concrete syntax definition, we assign characters or character strings to delimiter roles, thus inventing a new concrete syntax but preserving the basic character of SGML.
In addition to defining delimiters, a concrete syntax can also describe control characters that have special meaning, naming rules, reserved names and the quantities used in markup.
In the SGML standard, one concrete syntax has been defined. It is called the reference concrete syntax. It performs two functions:
When the reference concrete syntax is discussed, there is often a misconception. Because the ISO 646 character set was used to define the characters that are used in the syntax, it is often inferred that anything encoded using the reference concrete syntax must be encoded using that character set. As we shall see, that is not the case. The syntax defines what characters or character strings are associated with the various syntax roles but characters are defined by their meaning, not their bit combination. Therefore, a document may use the reference concrete syntax when encoded in EBCDIC, ASCII or any other character set which contains all the significant SGML characters used in the concrete syntax.
Before defining any parts of a syntax, a character set, the syntax-reference character set, must be defined and its role understood.
A syntax can be used in markup only after the roles in the syntax are assigned to characters. Remember that the term "character" refers not to the bit combination used to represent a character but to the meaning of the character. Since it is easier to specify bit combinations in the form of character numbers or in character strings, we use them as an intermediate step in the definition. So when defining a syntax, we define roles in terms of bit combinations that the system maps to the character intended. We must tell the system how to map these bit combinations to characters, just as we did when defining the document character set. This is done using the syntax-reference character set. Unlike the document character set however, this mapping is used only within the scope of the syntax definition. For example, assume that in the syntax-reference character set, the bit combination B'01001100' (76 decimal) is mapped to the character "less-than sign". Using ISO 646 as the base set, this would be done by specifying:
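As a minimal sketch, the fragment below describes just that one position. The public identifier is the one ISO 8879 uses for the ISO 646 International Reference Version, where the less-than sign is character number 60. A complete DESCSET would go on to describe every other character number as well.

```sgml
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
-- described number 76, a run of 1 character, base set number 60 (less-than sign) --
DESCSET 76 1 60
```

Each DESCSET entry is read as three numbers: the first character number being described, how many consecutive numbers the entry covers, and the base set number the run starts from.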
The fact that there is a syntax-reference character set leads to a constraint on the document character set: there must be one and only one bit combination in the document character set for every character used in the syntax role definitions or in the naming rules. If this were not true, one would either not be able to use a delimiter or name character (if there were no bit combination in the document character set) or it could lead to a very confusing syntax (if multiple bit combinations mapped to a single character).
Now that we understand how the syntax-reference character set is used, we can learn how to define a concrete syntax.
"Concrete syntax" is the SGML term used to refer to the mappings of characters to delimiter roles, control characters that have special meaning, naming rules, reserved names and quantities used in the markup of the document. In other words, the concrete syntax describes what characters are used to start and end tags, entity references and all the other special, SGML-defined, character sequences found in markup. All of these items may be changed. In fact, the use of this capability can define a syntax which looks quite different from anything one normally thinks of as SGML data. This is a very powerful facility that can be very useful if used wisely.
To begin with, the start of the syntax definition is identified with the keyword "SYNTAX". There are two alternatives for what follows. Either a public syntax may be referenced using a public identifier or a full syntax specification may follow. The reference to a public concrete syntax is used if there is already a defined syntax which describes the syntax to be used within the document. The most commonly used such public syntax is the reference concrete syntax. If a public syntax is identified, the keyword SWITCHES is allowed. The purpose of this keyword is to identify a pair of character numbers which have been switched. This allows reuse of public syntaxes if there are only trivial differences. Here is an example of such a specification:
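A sketch of what a specification of this kind might look like, assuming ISO 646 numbering (10 for line feed, 13 for carriage return) and the public identifier that ISO 8879 assigns to the reference concrete syntax. Listing the pair in both directions is an assumption made here so that the exchange is explicit.

```sgml
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
       -- each pair: a character number in the public syntax, then its replacement --
       SWITCHES 10 13
                13 10
```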
In this example, the syntax is declared to be the reference concrete syntax with the line feed and carriage return character meanings reversed (which may have some use on Unix-based systems).
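If a public syntax is not referenced, a full syntax definition specification must follow. A syntax definition consists of the following parts, in this order: the shunned character number identification, the syntax-reference character set, the function character identification, the naming rules, the delimiter set, the reserved name use and the quantity set.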
The Shunned Character Number Identification portion of the syntax definition specifies the characters that the designer of the syntax determines should not be used in any document character set using the syntax. This is done because the indicated bit combinations may cause problems for processing systems using the data. This is not an absolute prohibition, however. Shunned bit combinations may be used to represent markup or minimum data characters (RS, RE, SPACE, lower case letter, upper case letter, digit or special) but should not be used for other ordinary data characters. The reason for these exceptions is that markup characters will not be passed to the application by the parser and therefore will not cause problems. A keyword that may be specified here is "CONTROLS". This keyword indicates that any character number that is used as a control character in the system character set is a shunned character as well.
The standard indicates that shunned characters should not be translated when translating from one character set to another. This particular note has always troubled me because within EBCDIC, the tab character is encoded differently than in ASCII and both bit combinations are typically indicated as shunned characters. It is an indication to me that care must be exercised when specifying characters as shunned.
The following example shows a typical specification for this section:
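One typical specification of this kind is the one found in the reference concrete syntax itself, quoted here as an illustration. The numbers are the ISO 646 control character positions plus 127 and 255.

```sgml
-- CONTROLS additionally shuns the control characters of the system character set --
SHUNCHAR CONTROLS
         0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
        16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
        127 255
```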
Note that the bit combinations are identified using character numbers (decimal number representations of the bit combination).
After the shunned character number identification, the next section defines the syntax-reference character set. The function of this character set is to define the character set used when specifying character numbers or when using bit combinations to represent numbers, as was described above in detail.
The specification of the syntax-reference character set is identical to the specification of the document character set which was discussed in the first article. It uses the BASESET specification to identify a known character set. The DESCSET specification is then used to define the syntax-reference character set in terms of the base character set.
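As an illustration, the reference concrete syntax declares its syntax-reference character set as shown below: all 128 positions of the ISO 646 International Reference Version, each mapped to itself.

```sgml
BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
-- starting at number 0, 128 consecutive numbers map to base set numbers 0 onward --
DESCSET 0 128 0
```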
The function character identification section identifies the character numbers in the syntax-reference character set that represent several important function characters: RS, RE and SPACE.
In addition to these functions, other special function characters may be defined. To do so, you must define a name for the function and assign a function class and a character number. There are five function classes, defined as follows:
FUNCHAR: an inert function character, which has no additional SGML function.
SEPCHAR: a separator character, which is allowed as a separator in markup and is replaced by SPACE wherever RE would be replaced by SPACE.
MSOCHAR: a markup-scan-out character, which suppresses markup recognition until an MSICHAR character or the end of the entity occurs.
MSICHAR: a markup-scan-in character, which restores markup recognition that was suppressed by an MSOCHAR character.
MSSCHAR: a markup-scan-suppress character, which suppresses markup recognition for the character that immediately follows it.
The MSSCHAR, MSICHAR and MSOCHAR function classes may be used to define function characters that inhibit the interpretation of markup. This is desirable in instances where characters are encoded with escapes since the bit combinations that follow the escape may be erroneously interpreted as markup. IBM's Double Byte Character Sets on mainframe and workstation computers are examples of where these function classes must be used to inhibit the erroneous detection of markup characters in double byte encoded data. Care must be used when defining such functions: because markup recognition is inhibited, entity references and all other markup are not recognized. Note that even if markup recognition is inhibited, all bit combinations in an SGML document (even those found in the second byte of a double byte character) must be SGML characters as defined in the document character set definition or the parser will issue non-SGML character error messages.
Here is an example of the identification of function characters:
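A sketch of such an identification, assuming an EBCDIC syntax-reference character set. The character numbers for SO and SI (14 and 15) are the usual EBCDIC positions and are supplied here as assumptions.

```sgml
FUNCTION RE            13
         RS            21
         SPACE         64
         TAB  SEPCHAR   5
         -- 14 and 15 are the EBCDIC SO and SI positions, assumed for this sketch --
         SO   MSOCHAR  14
         SI   MSICHAR  15
```

RE, RS and SPACE take only a character number. Each added function takes a name, a function class and a character number, in that order.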
In this example, character number 13 (carriage return) has been defined as RE (record end), character number 21 (line feed) has been defined as RS (record start) and character number 64 (space) has been defined as SPACE. In addition, character number 5 (tab) has been defined as the function TAB with the function class SEPCHAR. These represent what I would expect to see in an EBCDIC character set definition. The SO (shift out) function is defined as an MSOCHAR function class and is the EBCDIC SO character, which escapes to double byte encoding in mixed (single and double byte encoded) EBCDIC data. The SI (shift in) function is defined as an MSICHAR function class and is the EBCDIC SI character which returns to single byte encoding in mixed data.
The Naming Rules definitions allow you to specify characters (in addition to alphabetic characters and digits) to be used in names and as name start characters. (Name characters are characters that are allowed in anything which requires an SGML name, like an element name, entity name, attribute name, attribute value in a token list, etc. A name start character is a special subset of name characters that are the only characters allowed to start a name.) It also allows the specification of uppercasing rules for the added name characters. The categories that may be defined are:
LCNMSTRT: These characters are considered lowercase and may be used as the first character in a name. The order of the characters in the parameter literal argument must correspond to the uppercase name start definition found in UCNMSTRT.
UCNMSTRT: These characters are considered uppercase and may be used as the first character in a name. These characters must correspond to those found in LCNMSTRT. Characters may occur multiple times in this definition to allow different lowercase characters to map to the same uppercase character. For letters which do not distinguish between lower and uppercase, the same character is used in both LCNMSTRT and UCNMSTRT.
LCNMCHAR: These characters are considered lowercase and may be used within a name. The order of the characters in the parameter literal argument corresponds to the uppercase name character definition found in UCNMCHAR.
UCNMCHAR: These characters are considered uppercase and may be used in names. They correspond to the characters found in LCNMCHAR in the same way that UCNMSTRT characters correspond to LCNMSTRT characters.
NAMECASE: This specifies whether names are folded to uppercase, separately for general names (GENERAL) and for entity names (ENTITY). You must specify either "YES" or "NO" for both.
Here is an example of a typical naming rules section:
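A sketch of a naming rules section along these lines, assuming an ISO 646 syntax-reference character set, in which the hyphen is character number 45 and the period is 46:

```sgml
NAMING LCNMSTRT ""
       UCNMSTRT ""
       -- &#45; is the hyphen and &#46; the period, assuming ISO 646 numbering --
       LCNMCHAR "&#45;&#46;"
       UCNMCHAR "&#45;&#46;"
       NAMECASE GENERAL YES
                ENTITY  NO
```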
In this example, no name start characters have been added to the defined minimum set (upper and lowercase letters). That is why the LCNMSTRT and UCNMSTRT have null character strings as arguments. The minimum set of name start characters is defined in production 53 of the standard. The name characters (which at a minimum contain the name start characters and the digits, 0 through 9) have two additional characters added, the period and the hyphen. Case is not an issue for these characters so they are specified using the same character numbers in both LCNMCHAR and UCNMCHAR. The NAMECASE specification indicates that all names except entity names (in declarations and references) will be folded to uppercase.
Notice that the LCNMSTRT and the others take a character string called a parameter literal, not character numbers. This allows these characters to be specified directly, which is much easier than using the character number. For this to be valid, however, the bit combination used by the character in the literal must be the same in the document character set (the bit combinations used to encode the document) and the syntax-reference character set. If they are not, the character string specification is misleading since the bit combination will be interpreted as if it were in the syntax-reference character set, not the document character set (and the two may be quite different). A method for avoiding this problem is to use numeric character references in the literal strings to specify the bit combination used in the syntax reference character set. This is shown in the above example. Notice that if this method is used, the syntax definition specification may be translated to a different character encoding without affecting the assignment of characters to roles in the syntax.
The delimiter set definitions assign delimiter roles to specific characters or character strings. There are two classes of delimiters defined: general delimiters and short reference delimiters.
The keyword GENERAL introduces the section where general delimiters may be defined. A list of all the general delimiters is found in Figure 3 of the SGML standard (in clause 9.6.1). All SGML delimiters may be changed. If a delimiter is not changed in the SGML declaration, it keeps the string assigned to it in the SGML reference concrete syntax. The GENERAL keyword must be followed by the keyword SGMLREF to remind us of this fact.
The SHORTREF keyword introduces the section where the short reference delimiter strings are defined. Short reference delimiters are character strings that the SGML parser can map to entity references. Notice that for short references to be recognized in a document, a short reference mapping must also be identified that maps the short reference delimiter to an entity reference. This mapping is done in the DTD, not in the SGML declaration. There is one general rule about the recognition of delimiters that is especially important in short reference string definition: when one delimiter is longer than another but is equal for the length of the shorter delimiter, the longer delimiter will be recognized if it is matched. It does not matter whether a delimiter is mapped; it need only be defined to take part in this matching. This can lead to unexpected results if not understood. Care must be exercised when defining short reference delimiters.
Following the SHORTREF keyword, one of two keywords is specified. The keyword NONE indicates that no delimiter strings are defined as short reference strings except those in the list which follows. If NONE is used and no strings are listed, then no short reference strings are defined and the short reference facilities may not be used in the DTD. If the keyword SGMLREF is used, the delimiters listed are in addition to those found in the reference concrete syntax.
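For reference, a delimiter set that changes nothing can be written as in the sketch below, which is how the reference concrete syntax itself declares its delimiters.

```sgml
-- every general and short reference delimiter keeps its reference value --
DELIM GENERAL  SGMLREF
      SHORTREF SGMLREF
```

Replacing the second line with SHORTREF NONE, and listing no strings after it, would instead leave the syntax with no short reference delimiters at all.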
When changing delimiter definitions you must keep a few rules in mind:
Here is an example of a delimiter set definition:
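A sketch of what a GML-flavoured delimiter set could look like. The particular strings are assumptions chosen for illustration: ":" for STAGO, "::" for ETAGO and "." for TAGC, which in turn presumes that the period has not been added as a name character in the naming rules. The single short reference string, "B&#RE;", is the one the reference delimiter set uses for trailing blanks followed by a record end.

```sgml
-- hypothetical assignments, chosen only to illustrate the form --
DELIM GENERAL  SGMLREF
               STAGO ":"
               ETAGO "::"
               TAGC  "."
      SHORTREF NONE
               "B&#RE;"
```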
The above example redefines several of the delimiter roles to use a more IBM GML-like set of delimiters. Notice that the ETAGO is not ":e" as would be expected for GML starter set compatibility. This could be specified as ":e", but that would be contrary to the recommendation found in the SGML standard that name start characters not be defined as part of delimiter strings, and it was therefore not used.
The short reference delimiter string specified allows the trapping of trailing blanks on input lines. The DTD will determine how this is used, if at all.
The reserved name use portion of the syntax definition, introduced by the keyword NAMES, allows you to substitute a name for many of the reserved names used in the reference concrete syntax. None of the reserved names used in the SGML declaration may be replaced since the SGML declaration is always written in the reference concrete syntax. All names defined here must follow all the rules for names (for example, they must be composed of name characters) in the declared concrete syntax.
Examples of these reserved names are:
Once again notice that the reserved names used in the SGML system declaration and SGML declaration cannot be modified.
The reference concrete syntax reserved name is used for any name not replaced by this definition. Here is an example of a specification:
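A minimal sketch of such a specification, in which each replacement is written as the reference reserved name followed by the name that replaces it:

```sgml
-- "OMIT" replaces the reference reserved name "O" --
NAMES SGMLREF
      O OMIT
```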
In the example, the reserved name "O" (that indicates that either a start or end tag is omissible in an element declaration) is replaced by the name "OMIT". The keyword SGMLREF is required and is a reminder that unspecified reserved names default to their values in the reference concrete syntax. Note that the comments made previously about using numeric character references still apply; they weren't used here for simplicity's sake.
The quantity set describes a number of limits on properties of SGML-defined objects. The effect of these limits depends on the particular quantity involved. Care should be used when changing these since creating a variant syntax may make it difficult for some SGML systems to process documents created with that syntax. The best means of guaranteeing portability between different SGML systems and applications is to use the reference concrete syntax as much as possible.
A complete list of quantities defined by SGML is found in Figure 6 in clause 13.4.8.
Here is an example of a quantity set specification:
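A sketch of a quantity set of this kind. The quantity names are an assumption: they are the ISO 8879 quantities that best match the limits described next, namely ATTCNT, GRPGTCNT and NAMELEN.

```sgml
QUANTITY SGMLREF
         ATTCNT    64   -- attribute names and name tokens in one attribute definition list --
         GRPGTCNT 128   -- content tokens at all levels of a content model --
         NAMELEN   32   -- length of a name --
```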
In the example, the maximum number of attributes which may be defined in an ATTLIST declaration has been set to 64. The maximum number of content tokens which may occur at all levels of a content model (this includes element names and delimiters) has been set to 128. Finally, the maximum length of names has been raised to 32 characters. Here again, the keyword SGMLREF is required and is a reminder that unspecified quantities default to their value in the reference concrete syntax.
There are a number of reasons you might want to define a concrete syntax: perhaps longer names must be allowed, larger content models must be accommodated or naming characters added. In any case, SGML provides a very flexible facility for making these changes and communicating these changes to consumers of the data.