
Document Character Sets

The first time you face the problem of different character sets can be a rude awakening. Until you have faced a screen full of very odd-looking characters that made up your document on another system, you probably had not given much thought to how a computer represents that document. You may have thought an "A" is an "A" is an "A". If only it were so. As it turns out, an "A" is an "A" only because the system you are using interprets a particular bit combination in computer memory as an "A". It may have seemed that when you hit the shift lock and the key under your left little finger (unless you're in France or many other places, but that's another story) the "A" is displayed and that is all there is to it. What really happens is that the keystroke is converted into a bit combination in computer memory which, by convention, is displayed as an "A". As long as the bit combinations are interpreted the same way, there are no problems.

Now imagine this memory is interpreted by another program that has a different convention for representing characters, including the letter "A". You may get something, but it won't be what you intended. While this is not usually a problem on a single system, it is most certainly a problem when transporting data between systems. There are two commonly used conventions for representing text in computers: ASCII (American Standard Code for Information Interchange), which is typically used on personal computers and workstations, and EBCDIC (Extended Binary-Coded Decimal Interchange Code), used on many "mainframe" systems. Within both of these systems there are many permutations, usually depending on the country where the data is created. It is an unpleasant fact of life that different conventions for representing characters are used on different computer systems.
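The point can be tried out in a few lines of Python. This is only a sketch: the codec names `cp037` (one of the many EBCDIC variants) and `latin-1` (an ASCII-derived 8-bit set) are choices made for this illustration, not conventions the article prescribes.

```python
# One bit combination, two conventions.
# Assumption: Python's built-in "cp037" codec stands in for "EBCDIC" here;
# real EBCDIC systems use many national variants.
bit_combination = bytes([0b11000001])  # 193 in decimal

as_ebcdic = bit_combination.decode("cp037")    # read with the EBCDIC convention
as_latin1 = bit_combination.decode("latin-1")  # read with an ASCII-derived convention

print(repr(as_ebcdic))  # the capital letter A
print(repr(as_latin1))  # a different character entirely

# One character, two bit combinations.
ascii_a = "A".encode("ascii")[0]
ebcdic_a = "A".encode("cp037")[0]
print(ascii_a, ebcdic_a)
```

The bytes never change; only the convention used to read them does.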

The most obvious solution to this problem is to simply use the same coded representation for the same characters on all systems. In fact, there is an effort underway to propose just such a solution. This effort is worthy of an article in its own right, so I won't cover it here beyond saying that ISO 10646 is a multi-byte standard for representing all characters and ideographs used throughout the world. Another standard called Unicode is designed to accomplish the same goal in two bytes (as opposed to 1 to 4 bytes in ISO 10646). Unicode will be a subset of ISO 10646 and both have been accorded much deserved attention. Unfortunately, these standards won't help us until computers, operating systems and programs have been implemented to exploit this capability. Until then, we are faced with a world in which characters are commonly represented with a maximum of 8 bits. So why is this a problem? There are simply not enough unique bit combinations to represent all the characters that need to be represented, especially in the publishing business.

Several solutions have been used to overcome this problem:

  1. Assign, by convention, mappings of bit combinations to characters, to meet a particular purpose. There are a multitude of such mappings defined. The permutations of ASCII and EBCDIC are two examples.
  2. Define special bit combinations which indicate that following bit combinations have different meanings. These special bit combinations are called escapes. Escapes can then allow changing to a mapping which contains the characters which are needed.
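The second approach can be sketched with a toy decoder. Everything here is hypothetical: the escape value, the selector bytes and the two tiny mappings are invented for illustration and do not reproduce ISO 2022 or any other real escape scheme.

```python
ESC = 0x1B  # the "escape" bit combination in this toy scheme

# Two hypothetical mappings; the byte that follows ESC selects one of them.
MAPPINGS = {
    ord("A"): {0x61: "a", 0x62: "b"},            # a "Latin" mapping
    ord("G"): {0x61: "\u03b1", 0x62: "\u03b2"},  # a "Greek" mapping (alpha, beta)
}

def decode(data: bytes, initial: int = ord("A")) -> str:
    """Decode bytes, switching mappings whenever an escape is seen."""
    current = MAPPINGS[initial]
    out = []
    stream = iter(data)
    for byte in stream:
        if byte == ESC:
            # The next bit combination names the mapping to switch to.
            current = MAPPINGS[next(stream)]
        else:
            out.append(current[byte])
    return "".join(out)

text = decode(b"ab\x1bGab\x1bAa")
print(text)
```

The same two bit combinations (0x61, 0x62) yield Latin letters before the escape and Greek letters after it, which is how escapes stretch a small code space over a larger repertoire.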

Bit combinations:

I have mentioned the term "bit combinations" several times without defining it. A bit combination is a collection of bits which, by convention, represents a character in a computer. Often these combinations use 8 bits (a byte), but for our purposes they need not do so.


To understand how SGML addresses this problem, it is important to divorce the character and its meaning from its representation on a particular system. In my discussion here, I have done that by referring to the representation as a "bit combination" and the meaning of a character as a "character". This concept will become particularly important later when we discuss defining syntaxes, but for the moment, when you think of characters, keep the meaning of a character separate from its representation on a given system.

Defining Character Sets

To support the definition of a document character set, SGML must allow the association of the bit combination used in the document for a particular character to that character's meaning. Unfortunately, it is verbose to define character meanings directly (although this can be done using glyph identifiers or other text descriptions) so SGML allows them to be defined in terms of another known or standard character set. Let me illustrate this with an example. Suppose I'm writing my document on an EBCDIC system which represents the capital letter A using the bit combination B'11000001' which is 193 in decimal. Let's also assume that I wish to define this character in terms of ISO 646, which is a well-known 7 bit character encoding standard. The character "capital A" is encoded in this standard using the bit combination B'01000001' which is 65 in decimal. The SGML declaration allows this association to be made with the following syntax:

BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
    :
    193 1 65  -- Map 1 document character starting at 193 (EBCDIC A)
                 to base set character 65 (ISO 646 A) --
    :

"BASESET" introduces the public identifier description of the character set being used as a reference or "base". The base set is typically a standard, registered or at least well-known character set. The description must be a form that the expected recipients will understand. It is convenient if the base set contains most, if not all, of the characters the document character set will use. It is also convenient if the base set closely resembles the document character set since this reduces the amount of specification that is required, as we shall see. After the base set specification, "DESCSET" introduces the descriptions or mappings from the document character set bit combinations to the "base" set's bit combinations.

The facilities for specifying this mapping are quite flexible. The user may use whatever base set(s) is understood by their system. If a character does not exist in the base set chosen, a literal may be used to describe the character meaning. The syntax also allows the user to map multiple characters in the document character set to the base character set with a single declaration. In the above example, only one character was mapped, hence the "1" used in the example.

Each mapping found in the DESCSET section consists of three entries:

  1. The document character set character number being mapped. This is the character number of the character being defined in the document's character set.
  2. The number of characters being defined. This parameter allows the SGML declaration writer to associate contiguous characters in the document character set and the base character set in one specification.
  3. The base set character number which corresponds to the document character set number of the same entry. This field may also be the character string "UNUSED" which indicates that the document character set character number(s) are not used in the document. This is known as a non-SGML character. This field may also be the literal which describes the character which exists in the document character set but does not exist in the base set.
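The three-entry structure can be modelled in a short Python sketch. The function name `expand_descset` and the tuple representation are assumptions made for this illustration; an SGML parser is free to store the information any way it likes.

```python
# A DESCSET entry as described above: (document number, count, target).
# The target is a base set character number, the keyword "UNUSED",
# or any other string, taken here as a literal describing the character.

def expand_descset(entries):
    """Expand entries into {document number: base number, literal, or None}."""
    mapping = {}
    for start, count, target in entries:
        for offset in range(count):
            doc_number = start + offset
            if target == "UNUSED":
                mapping[doc_number] = None             # non-SGML character
            elif isinstance(target, int):
                mapping[doc_number] = target + offset  # contiguous run
            else:
                mapping[doc_number] = target           # literal description
    return mapping

# "193 9 65" maps nine contiguous document characters in one entry.
mapping = expand_descset([(193, 9, 65), (74, 1, "cent sign"), (250, 6, "UNUSED")])
print(mapping[193], mapping[201], mapping[74], mapping[250])
```

The count is what makes the notation compact: one entry with a count of 9 replaces nine single-character entries.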

Now when creating this mapping, one must define the mapping for all the characters which have bit combinations in the document character set. This set of characters is called a "character repertoire". So a character set is a mapping of each character in a character repertoire to a bit combination. Here is an example which defines a subset of EBCDIC in terms of ISO 646:

BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET   0   5 UNUSED
          5   1 9            -- HT --
          6   7 UNUSED
         13   1 13           -- CR (RE) --
         14  23 UNUSED
         37   1 10           -- LF (RS) --
         38  26 UNUSED
         64   1 32           -- SPACE --
         65   9 UNUSED
         74   1 "cent sign"
         75   1 46           -- period --
         76   1 60           -- less than sign --
         77   1 40           -- left paren --
         78   1 43           -- plus sign --
         79   1 124          -- vertical bar --
         80   1 38           -- ampersand --
         81  48 UNUSED
        129   9 97           -- a to i --
        138   7 UNUSED
        145   9 106          -- j to r --
        154   8 UNUSED
        162   8 115          -- s to z --
        170  23 UNUSED
        193   9 65           -- A to I --
        202   7 UNUSED
        209   9 74           -- J to R --
        218   8 UNUSED
        226   8 83           -- S to Z --
        234   6 UNUSED
        240  10 48           -- 0 to 9 --
        250   6 UNUSED
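The requirement that every bit combination be accounted for is easy to check mechanically. The sketch below transcribes the entries above into Python tuples and verifies that they cover 0 through 255 exactly once; the helper names `find_gaps` and `to_base_text` are hypothetical, chosen for this illustration.

```python
# The DESCSET entries above as (document number, count, target) tuples.
DESCSET = [
    (0, 5, "UNUSED"), (5, 1, 9), (6, 7, "UNUSED"), (13, 1, 13),
    (14, 23, "UNUSED"), (37, 1, 10), (38, 26, "UNUSED"), (64, 1, 32),
    (65, 9, "UNUSED"), (74, 1, "cent sign"), (75, 1, 46), (76, 1, 60),
    (77, 1, 40), (78, 1, 43), (79, 1, 124), (80, 1, 38),
    (81, 48, "UNUSED"), (129, 9, 97), (138, 7, "UNUSED"), (145, 9, 106),
    (154, 8, "UNUSED"), (162, 8, 115), (170, 23, "UNUSED"), (193, 9, 65),
    (202, 7, "UNUSED"), (209, 9, 74), (218, 8, "UNUSED"), (226, 8, 83),
    (234, 6, "UNUSED"), (240, 10, 48), (250, 6, "UNUSED"),
]

def find_gaps(entries, size=256):
    """Return (undefined numbers, numbers defined more than once)."""
    seen = [0] * size
    for start, count, _target in entries:
        for number in range(start, start + count):
            seen[number] += 1
    missing = [n for n, hits in enumerate(seen) if hits == 0]
    duplicated = [n for n, hits in enumerate(seen) if hits > 1]
    return missing, duplicated

def to_base_text(entries, data: bytes) -> str:
    """Translate document bytes to base set (ISO 646) characters."""
    out = []
    for byte in data:
        for start, count, target in entries:
            if start <= byte < start + count:
                if not isinstance(target, int):
                    raise ValueError(f"{byte} has no base set number ({target})")
                out.append(chr(target + byte - start))
                break
        else:
            raise ValueError(f"{byte} is not described by any entry")
    return "".join(out)

missing, duplicated = find_gaps(DESCSET)
print(missing, duplicated)
print(to_base_text(DESCSET, bytes([226, 199, 212, 211])))
```

Both lists come back empty for this table, and the four EBCDIC bytes in the last line translate to the letters S, G, M and L.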

There are several points to note here:

There will be instances when the characters used in a document are defined in multiple, different character sets. A common instance of this is the use of ISO 646 for the most common Latin characters and the so-called "right-half-plane" of another character set standard, ISO 8859 for accented and other special characters. (See Figure 8 of the SGML standard.) This is done by respecifying the BASESET parameter with the new base character set name and then continuing with the DESCSET mapping of document bit combinations. This respecification of BASESET/DESCSET may be done as many times as necessary.
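A rough Python model of a document character set built from two base sets follows. The short base set names, the function name `resolve`, and the numbering of the right half-plane as a 96-character set starting at 32 are assumptions of this sketch, not text taken from the standard.

```python
# Each section pairs a base set name with its own DESCSET entries.
# Assumptions: abbreviated base set names, and a right half-plane
# numbered from 32 within its own base set.
SECTIONS = [
    ("ISO 646 IRV", [(0, 128, 0)]),
    ("ISO 8859-1 right half", [(128, 32, "UNUSED"), (160, 96, 32)]),
]

def resolve(doc_number):
    """Return (base set name, base number), or None for a non-SGML character."""
    for base_name, entries in SECTIONS:
        for start, count, target in entries:
            if start <= doc_number < start + count:
                if target == "UNUSED":
                    return None
                return base_name, target + doc_number - start
    raise KeyError(f"document character {doc_number} is not described")

print(resolve(65))   # defined through the first base set
print(resolve(233))  # defined through the second base set
print(resolve(130))  # declared UNUSED
```

Each document character number is defined by exactly one section, so the sections together still form a single mapping.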


A Simple Case:

The most common case will probably be that the document character set matches exactly a known, standard character set and the specification of the document character set becomes trivial; in the case of ISO 646 it is:

BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0" DESCSET 0 128 0

Even this simple specification is useful since it specifies something that would have to otherwise be assumed or discovered.


SGML Declaration Syntax

Character sets are defined and used in two places in an SGML declaration. To use these specifications to define the document character set in an SGML declaration, they must follow the keyword "CHARSET" at the beginning of the SGML declaration. The other place character sets are defined is in the SYNTAX section.

System Character Set

Within the SGML standard, there is the concept of the "system character set". This is the character set that is used to represent characters within a given SGML system. For the system to process a document, the document character set must be the same as the system character set. If they are different, one or the other must be changed and like the mountain and Muhammed, this can be accomplished in one of two ways. Either the document can be converted to the system character set for processing or the system can be configured to process the document character set.

The first alternative is a fairly straightforward process of translating each character in the document to its equivalent in the system character set. The SGML syntax for defining character sets could make a robust basis for a general document translation tool but I have not seen such a tool on the market. There is a problem with performing this translation if there is no equivalent for a character in the document character set in the system character set. This is always a problem and is not a problem SGML has solved. One approach to addressing this problem in an SGML translation tool might be to use character entity references, assuming that the appropriate entities are defined in the application being used; in any case, it is problematic. This is a powerful argument for keeping the number of characters represented by bit combinations small and ensuring they are a fairly common set of characters, using entity references for all other characters. (This requires that the DTD define a large standard set of entity definitions to work well. ISO Technical Report 9573, part 13 defines a large number of such entities.) If the document includes an SGML declaration, translating the document into the system character set must be followed by changing the document character set definition in that declaration, since the document character set has changed.
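The entity-reference fallback described above might look like the following Python sketch. Both tables are hypothetical fragments invented for illustration, and the entity name `cent` is assumed to be declared by the DTD in use.

```python
# Hypothetical translation tables, for illustration only.
DOC_TO_SYSTEM = {193: 65, 194: 66, 195: 67}  # document number -> system number
DOC_TO_ENTITY = {74: "cent"}                 # document number -> entity name

def translate(data: bytes) -> str:
    """Translate document bytes, falling back to entity references."""
    out = []
    for byte in data:
        if byte in DOC_TO_SYSTEM:
            # A direct equivalent exists in the system character set.
            out.append(chr(DOC_TO_SYSTEM[byte]))
        elif byte in DOC_TO_ENTITY:
            # No equivalent: emit an entity reference instead.
            out.append(f"&{DOC_TO_ENTITY[byte]};")
        else:
            raise ValueError(f"no translation for document character {byte}")
    return "".join(out)

print(translate(bytes([193, 74, 194])))
```

The sketch raises an error for anything it cannot translate, which reflects the article's point that characters with no equivalent remain an unsolved problem.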

The idea of changing the system to process the document character set is interesting in that it points out an inherent limitation of the self-defining nature of an SGML declaration. The SGML declaration is part of an SGML document and as such is written using the document character set. But the document character set is defined within the SGML declaration. This seems to make this portion of the declaration redundant, since the document cannot be read unless the reader knows the character encoding used. How can one read the SGML declaration without knowing, a priori, what the document character set is? You can't. Neither can the SGML system. This characteristic of SGML declarations is the reason for the enigmatic note in section 13.1 of the standard which states, among other things, that there are two basic approaches to communicating the document character set definition:

  1. by an identifying name or number
  2. or using a human-readable copy of the SGML declaration

While on the face of it this seems to be a severe limitation, the only alternative would be to define a character set standard for SGML declarations, which would make them impossible to create, edit or even read on some systems and would divorce them from the documents they are intended to be a part of.

Given this a priori knowledge of the document character set, there is no reason that an SGML system could not process a complete document, even if it was encoded using a "foreign" character set, although handling the data after having done so is problematic.

Using Character Sets From the Far East

SGML does not place any inherent limits on the size of the bit combinations used to represent characters. In that sense it is already prepared for Unicode. It does have a requirement that all bit combinations must be of equal size (that is, the same number of bits in length). I bring this up as an introduction to the discussion of character sets from the Far East because it is in the Far East that the size of character sets makes 8-bit characters untenable.

The definition of a large character set to contain all the characters used in the Far East is coming, as I said earlier, with ISO 10646 and Unicode. Of course while SGML as a standard is prepared for these events, most of the systems I am aware of are not currently capable of handling full double byte character sets. When such systems are available and they process an SGML declaration fully, it will be possible to define applications which use ideographs as keywords, delimiters, names, in the SGML syntax, as well as use ideographs in data. This is all within the scope of today's standard. But in the meantime, there are several capabilities of SGML which allow the processing of documents from the Far East without full double-byte enablement. These capabilities are part of the syntax definition which will be discussed in the next article.

Conclusion

In this article we have seen how SGML supports the definition of different characters for use in an SGML document. This allows you to modify the character set used within a document to match what your system can process or to communicate to your system what character set the document uses. It will also provide the foundation needed to define the syntax used in the SGML document.



Wayne L. Wohler, Dept G82/025Z, Publishing Solutions Development, IBM Corporation, PO Box 1900, Boulder, Colorado 80301-9191
Internet: wohler@vnet.ibm.com
IBMMAIL: USIB29WX@IBMMAIL
Phone: 1-303-924-5943