
Draft Unicode Technical Report #22

Character Mapping Tables

Version 2.1
Authors Mark Davis (mark.davis@us.ibm.com, home)
Date 2000-08-31
This Version http://www.unicode.org/unicode/reports/tr22/tr22-2.1
Previous Version http://www.unicode.org/unicode/reports/tr22/tr22-2
Latest Version http://www.unicode.org/unicode/reports/tr22/

Summary

This document specifies an XML format for the interchange of mapping data for character encodings. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode.

Status

This document has been approved by the Unicode Technical Committee for public review as a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.


1 Introduction

The ability to seamlessly handle multiple character encodings is crucial in today's world, where a server may need to handle many different client character encodings covering many different markets. No matter how characters are represented, servers need to be able to process them appropriately. Unicode provides a common model and representation of characters for all the languages of the world. Because of this, Unicode is being adopted by more and more systems as the internal storage and processing code. Rather than trying to maintain data in literally hundreds of different encodings, a program can translate the source data into Unicode on entry, process it as required, and translate it into a target character set on request.

Even where Unicode is not used as a process code, it is often used as a pivot encoding. Rather than requiring ten thousand tables to map each of a hundred character encodings to one another, data can be converted first to Unicode and then into the eventual target encoding. This requires only a hundred tables, rather than ten thousand.
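As a minimal sketch of the pivot approach, the hypothetical Java helper below decodes bytes from a source encoding into a Java String (which is Unicode internally) and re-encodes the result in a target encoding. The class and method names are illustrative only, and the conversions come from the charsets built into the Java runtime, not from tables in the format described in this report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PivotSketch {
    /** Converts bytes between two encodings by way of Unicode (a Java String). */
    static byte[] convert(byte[] input, Charset source, Charset target) {
        String pivot = new String(input, source); // source encoding -> Unicode
        return pivot.getBytes(target);            // Unicode -> target encoding
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 }; // U+00E9 in ISO-8859-1
        byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Only one decoder and one encoder per encoding are needed; no table maps the source encoding directly to the target.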

Whether or not Unicode is used, it is ever more vital to maintain the consistency of data across conversions between different character encodings. Because of the fluidity of data in a networked world, it is easy for it to be converted from, say, CP930 on a Windows platform, sent to a UNIX server as UTF-8, processed, and converted back to CP930 for representation on another client machine. This requires implementations to have identical mappings for different character encodings, no matter what platform they are working on. It also requires them to use the same name for the same encoding, and different names for different encodings. This is difficult to do unless there is a standard specification for the mappings, so that it can be precisely determined what the encoding actually maps to.

This technical report provides such a standard specification for the interchange of mapping data for character encodings. By using this specification, implementations can be assured of providing precisely the same mappings as other implementations on different platforms.

This report is in the initial stages of development; feedback is welcome.

1.1 Illegal and Unassigned Codes

Client software needs to distinguish the different types of mismatches that can occur when converting data between different character encodings. These fall into the following categories:

  1. The sequence is illegal. There are two variants of this.
    The first is where the sequence is incomplete. The second is where the sequence is complete, but explicitly illegal.
  2. The source sequence represents a valid code point, but is unassigned (also known as undefined). This sequence may be given an assignment in some future (evolved) version of the character encoding.
  3. The source sequence is assigned, but unmappable: there is no corresponding code point in the target encoding to accurately represent the source sequence.

In the case of illegal source sequences, a conversion routine will typically provide the following options:

Note: There is an important difference between the case where a sequence represents a real REPLACEMENT CHARACTER in a legacy encoding, as opposed to just being unassigned, and thereby mapped to REPLACEMENT CHARACTER (using an API substitution option).

Note: An API may choose to signal an illegal sequence in a legacy character set by mapping it to one of the explicit NOT A CHARACTER code points in Unicode (any of the form xxFFFE or xxFFFF). However, this mechanism runs the risk of these values being transmitted in Unicode text (which is non-conformant), and should be used with caution.

Unassigned sequences can be handled with any of the above options, plus some additional ones. They should always be treated as a single code point: for example, 0xA3BF is treated as a single code point when mapping into Unicode from CP950. Especially because unassigned characters may actually come from a more recent version of the character encoding, it is often important to preserve round-trip mappings if possible. This can be done with additional options:

For unmappable sequences, all of the above options and one additional option may be available:

It is very important that systems be able to distinguish between the fallback mappings and regular mappings. Systems like XML require the use of hex escape sequences to preserve round-trip integrity; use of fallback characters in that case corrupts the data.

Because illegal values represent some corruption of the data stream, conversion routines may be directed to handle them in a different way than by replacement characters. For example, a routine might map unassigned characters to a substitution character, but throw an exception on illegal values.
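The policy just described can be sketched with the standard Java charset API, which lets a caller choose separate actions for malformed (illegal) input and for unmappable characters. This is only an illustration under stated assumptions: the class and method names are hypothetical, the built-in UTF-8 and ISO-8859-1 charsets stand in for arbitrary legacy encodings, and the Java API does not model the unassigned category separately.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ErrorPolicySketch {
    /** Decodes UTF-8 bytes, throwing an exception on illegal (malformed) input. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Encodes to ISO-8859-1, substituting for characters that have no mapping. */
    static byte[] encodeWithSubstitution(String text) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .encode(CharBuffer.wrap(text));
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // U+20AC has no mapping in ISO-8859-1, so it is substituted.
        System.out.println(encodeWithSubstitution("a\u20AC").length);
        try {
            decodeStrict(new byte[] { (byte) 0xFF }); // 0xFF never occurs in UTF-8
        } catch (CharacterCodingException e) {
            System.out.println("illegal input reported: " + e);
        }
    }
}
```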

1.2 Completeness

It is important that a mapping file be a complete description. From the data in the file, it should be possible to tell for any sequence of bytes whether that sequence is assigned, unassigned, or illegal. It should also be possible to tell if characters need to be rearranged to be in Unicode standard order (visual order, combining marks after base forms, etc).

1.3 Canonical Equivalence

The Unicode Standard has two equivalent ways of representing composite characters such as â: as a single precomposed character, or as a base character followed by a combining mark. UTR #15: Unicode Normalization Forms defines two normalized forms that provide unique representations of such data. The standard format for a character encoding specification is to map to sequences of Unicode characters in Normalization Form C. However, this does not guarantee that the result of conversion into Unicode will be normalized, since individual characters in the source encoding may separately map to an unnormalized sequence.

For example, suppose the source encoding maps 0x83 to 0x030A in Unicode (combining ring above), and 0x61 to 0x0061 (a). Then the sequence <0x61,0x83> will map to <0x0061,0x030A> in Unicode, which is not in Normalization Form C.

This problem will only arise when the source encoding has separate characters that, in the proper context, would not be present in normalized text. If a process wishes to guarantee that the result is in a particular Unicode normalization form, then it should normalize after conversion. Information is provided below that can determine whether this step is required.
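The normalization step can be sketched in Java with the standard java.text.Normalizer class. The class and method names below are hypothetical; the input is the <0x0061,0x030A> sequence from the example above, written as a string literal rather than produced by an actual converter.

```java
import java.text.Normalizer;

class NormalizeAfterConversion {
    /** Returns the text in Normalization Form C, normalizing only if needed. */
    static String toNfc(String converted) {
        if (Normalizer.isNormalized(converted, Normalizer.Form.NFC)) {
            return converted;
        }
        return Normalizer.normalize(converted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String converted = "a\u030A"; // a + COMBINING RING ABOVE
        String normalized = toNfc(converted);
        // The two code units compose into the single character U+00E5.
        System.out.println(normalized.length());
    }
}
```

A converter that knows from the header that its table always yields Normalization Form C could skip this call entirely.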

2 XML Formats

A character mapping specification file starts with the following lines. There is a difference between the encoding of the XML file, and the encoding of the mapping data. The encoding of the file can be any valid XML encoding. Only the ASCII repertoire of characters is required in the specification of the mapping data, but comments may be in other character encodings. The example below happens to use UTF-8.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE characterMapping
  SYSTEM "http://www.unicode.org/unicode/reports/tr22/CharacterMapping.dtd">

Note: In the rest of this specification, short attribute and element names are used just to conserve space where there may be a large number of items, or for consistency with other elements that may have a large number of items.

2.1 Header

A mapping file begins with a header: the characterMapping root element and its attributes. Here is an (artificial) example:

<characterMapping
 name="windows-1252-2000"
 description="Sun variant of CP942 for Japanese"
 tableVersion="2"
 contact="mark@unicode.org"
 registrationAuthority="Microsoft"
 copyright="Microsoft"
 bidiOrder="logical"
 combiningOrder="after"
 normalization="C"
>

characterMapping (required) is the root. It contains a number of attributes:

name (required) is a canonical name which uniquely identifies this mapping table from all others. This name has the form: <source>-<name_on_source>-<version>, such as "iso-8859-1999".

<source> Name of standards authority, government, vendor, or product
<name_on_source> Most common name used on source.
<version> Version number, typically the year the encoding was introduced, a product version, or simply an incremented number.

All three fields must be present, except in the case of Unicode encodings themselves, which do not need a version field. Fields are limited to ASCII letters, digits and "_". Any other characters should be converted to "_" or letters as necessary for uniqueness. The name value is not case-sensitive. It must be unique; if two mapping tables differ in how they map any characters, in the specification of illegal characters, in their bidi ordering, in their combining character ordering, etc., then they must have a different name (or different version: see below).
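The naming rules can be sketched as a small Java checker. This is an illustration, not part of the specification: the class and method names are hypothetical, and the sketch insists on all three fields, so it does not cover the exception for Unicode encodings that omit the version.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class MappingNameSketch {
    // Each field may contain only ASCII letters, digits, and "_".
    private static final Pattern FIELD = Pattern.compile("[A-Za-z0-9_]+");

    /** Replaces every character outside the permitted set with "_". */
    static String sanitizeField(String raw) {
        return raw.replaceAll("[^A-Za-z0-9_]", "_");
    }

    /** Checks the <source>-<name_on_source>-<version> shape with all three fields. */
    static boolean isWellFormed(String name) {
        String[] fields = name.split("-", -1);
        if (fields.length != 3) {
            return false;
        }
        for (String field : fields) {
            if (!FIELD.matcher(field).matches()) {
                return false;
            }
        }
        return true;
    }

    /** Names are not case-sensitive, so lookups should use a case-folded key. */
    static String lookupKey(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("windows-1252-2000"));
        System.out.println(sanitizeField("x3.4"));
    }
}
```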

Note: The name was chosen so that the resulting string can be used as a filename on most systems.

Note: These names are not meant to compete with the IANA character set registry, which is the most useful collection of cross-platform names available. We foresee registration of many of these mappings in the future with IANA since, unfortunately, the IANA names are not sufficiently precise. For example, many character sets advertise themselves as being "Shift-JIS", but actually have different mappings to and from Unicode on different platforms.

description (optional) is a string which describes the mapping enough to distinguish it from other similar mappings. This string must be limited to the Unicode range 0x0020 - 0x007E and should be in English. The string normally contains the set of mappings, the script, language, or locale for which it is intended, and optionally the variation. For instance, "Windows Japanese JIS-1990", "EBCDIC Latin 1 with Euro", "PC Greek".

tableVersion (optional) is the version of the data, a small integer normally starting at one. Any time the data is modified, the value must be increased. If only additions are made, then the same name can be retained; if not, then a new name must be used. Additions change mappings from "unassigned" to "assigned". Any change in the validity of character sequences requires a new name.

contact (optional) is the person to contact in case errors are found in the data. This must be an e-mail address or URL.

registrationAuthority (optional) is the organization responsible for the encoding.

registrationName (optional) is a string that provides the name and version of the mapping, as known to that authority.

copyright (optional) provides the copyright information. While this can be provided in comments, use of a field allows copyright propagation when converting to a binary form of the table. (Typically the right to use the information is granted, but not the right to erase the copyright or pretend that you created the information.)

bidiOrder (optional) specifies whether the character encoding is to be interpreted in one of two orders: "visual" or "logical". Unicode is strictly logical order. Application of the Unicode Bidirectional Algorithm is required to map to a visual-order character encoding; application of a reverse bidirectional algorithm is required to map back to Unicode. The default value for this attribute is "logical". It is only relevant for character encodings for the Middle East (Arabic and Hebrew). For more information, see UTR #9: The Bidirectional Algorithm.

combiningOrder (optional) specifies the order of combining marks: either "before" or "after". Some character encodings, typically those for bibliographic use, store combining marks before base characters. Unicode stores them uniformly after base characters. The default value for this attribute is "after". This is only relevant for character encodings with combining marks.

normalization (optional) specifies whether the result of conversion into Unicode using this mapping will be automatically in Normalization Form C or D. The possible values are "neither", "C", "D", "CD". While this information can be derived from an analysis of the assignment statements (see UTR #15: Unicode Normalization Forms), providing the information in the header is a useful validity check, and saves processing. Most mapping specifications will have the value "C". Character encodings that contain neither composite characters nor combining marks (such as 7-bit ASCII) will have the value "CD".

2.2 History

 <history supercedes="CP501" derivedFrom="CP500">
  <modified version="2" date="1999-09-25">
   Added Euro.
  </modified>
  <modified version="1" date="1997-01-01">
   Made out of whole cloth for illustration.
  </modified>
 </history>

history (optional) provides information about the changes to the file and relations to other encodings.

modified provides information about the changes to the file, coordinated with the version. The latest version should be first.

2.3 Names

[Ed note: this should be in a separate section, but is left here in this draft, pending further reorganization.]

The mapping names table is a separate XML file that provides an index of names for character mapping tables. For each character mapping table, it provides display names, aliases, and fallbacks.

mappingNames (required) is the root. It contains any number of mapping elements.

mapping (optional) marks an element that contains any number of display, alias, and fallback elements. It has one required attribute, name. This provides the mapping table name in the canonical format, e.g. "us-ascii-1968".

display (optional) provides names in different languages, suitable for user menus. It has two required attributes, the language (actually locale) and the name in that language.

<display xml:lang="en" name="Western Europe (Latin-1, 8859-1)"/>

alias (optional) provides common aliases for the canonical names. It has one required attribute, which is name. This provides the alias name, which should be all lowercase. The preferredBy attribute is optional. It is a space-delimited list of environments where that particular alias is preferred, e.g. preferredBy="IANA IBM".

<alias name="iso-8859-1" preferredBy="MIME"/>

Note that because aliases reflect current practice, the same alias may be applied to different mappings.

fallback (optional) indicates an alternate mapping to use if the specified mapping is not installed. Fallbacks are not exact. They may have an optional rank attribute. When multiple fallbacks are available, the one with the largest rank should be chosen.

Here is an example of a mapping element.

<mapping name="us-ascii-1968">
  <display xml:lang="en" name="US (ASCII)"/>
  <alias name="us-ascii" preferredBy="MIME"/>
  <alias name="ansi_x3.4-1968"/>
  <alias name="iso-ir-6"/>
  <alias name="ansi_x3.4-1986"/>
  <alias name="iso_646.irv:1991"/>
  <alias name="ascii"/>
  <alias name="iso646-us"/>
  <alias name="us"/>
  <alias name="ibm367"/>
  <alias name="cp367"/>
  <alias name="csASCII"/>
</mapping>

A sample file is provided in CharacterMappingNames.xml.
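As a rough sketch of how a names table might be consumed, the Java code below builds an alias index from a mappingNames document with the JDK's DOM parser. The class and method names are hypothetical. Because the same alias may be applied to different mappings, each alias maps to a list of canonical names, and keys are case-folded on the assumption that alias lookup, like name lookup, is not case-sensitive.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class AliasIndexSketch {
    /** Maps each case-folded alias to the canonical names that declare it. */
    static Map<String, List<String>> aliasIndex(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, List<String>> index = new HashMap<>();
        NodeList mappings = doc.getElementsByTagName("mapping");
        for (int i = 0; i < mappings.getLength(); i++) {
            Element mapping = (Element) mappings.item(i);
            String canonical = mapping.getAttribute("name");
            NodeList aliases = mapping.getElementsByTagName("alias");
            for (int j = 0; j < aliases.getLength(); j++) {
                String alias = ((Element) aliases.item(j)).getAttribute("name");
                index.computeIfAbsent(alias.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                     .add(canonical);
            }
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mappingNames><mapping name=\"us-ascii-1968\">"
                + "<alias name=\"ascii\"/><alias name=\"csASCII\"/>"
                + "</mapping></mappingNames>";
        System.out.println(aliasIndex(xml).get("csascii"));
    }
}
```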

2.4 Imports

 <import source="http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/CP852.XML"/>

It is possible to supply just the differences between one table and a base table. This is done with the import element, which is optional. If this is used, then any further data simply overrides the data in the base table. The value of the source attribute is a valid URL pointing to a valid character encoding table.
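The override behavior can be pictured as laying one table over another. The Java sketch below is a simplification under stated assumptions: the class and method names are hypothetical, each table is reduced to a map from a byte-sequence string (such as "80") to a Unicode code point, and cases where the importing table marks a sequence as unassigned are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ImportOverlaySketch {
    /** Returns the base table's assignments with the importing table's entries laid over them. */
    static Map<String, Integer> resolve(Map<String, Integer> base,
                                        Map<String, Integer> overrides) {
        Map<String, Integer> merged = new LinkedHashMap<>(base);
        merged.putAll(overrides); // data in the importing file overrides the base table
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> base = new LinkedHashMap<>();
        base.put("41", 0x0041);
        base.put("80", 0x00C7);
        Map<String, Integer> overrides = new LinkedHashMap<>();
        overrides.put("80", 0x20AC); // illustrative difference only
        System.out.println(Integer.toHexString(resolve(base, overrides).get("80")));
    }
}
```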

2.5 Validity Specification

As discussed above, it is important to be able to distinguish when characters are unassigned vs. when they are invalid. Valid and invalid sequences are specified by the validity element. Here is an example of what this might look like, for the validity specification for Microsoft's SJIS ("windows-932-2000"):

<validity ID="MS-SJIS">
  <state type="FIRST" next="VALID" s="00" e="80" />
  <state type="FIRST" next="VALID" s="A0" e="DF" />
  <state type="FIRST" next="VALID" s="FD" e="FF" />
  <state type="FIRST" next="LAST" s="81" e="9F" />
  <state type="FIRST" next="LAST" s="E0" e="FC" />
  <state type="LAST" next="VALID" s="40" e="7E" />
  <state type="LAST" next="VALID" s="80" e="FC" />
</validity>

The subelements are states. Their attributes are:

All values referring to code units are in hexadecimal. In the table above, the first three lines say that the single bytes 00 through 80, A0 through DF, and FD through FF are each valid on their own. The next two lines say that the bytes in the ranges 81 through 9F and E0 through FC are legal if they are followed by a byte of type="LAST". The final two lines say that a byte of type="LAST" must be in the range 40 through 7E or 80 through FC. More detailed samples of complex validity specifications are given in Samples.

The validity specification is interpreted by setting the current state to FIRST, and using the following process.

The following is a sample of how this could be implemented in Java. It would be very similar in C or C++ except that type would be an output parameter and not an array, and the mask with 0xFF is unnecessary if byte is a typedef for unsigned char.

Sample Validity Checking
/**
 * Checks a byte stream for validity.
 *
 * @param source   the bytes to check
 * @param position index of the first byte to check
 * @param limit    index just past the last available byte
 * @param type     output: type[0] is set to VALID, INVALID, or PARTIAL.
 *                 PARTIAL occurs at the end of a buffer, and indicates that a new
 *                 buffer needs to be loaded. If there are no more bytes, it is
 *                 equivalent to INVALID.
 * @return the number of bytes examined, up to <b>and including</b> the final byte
 *         that completed the sequence or caused the problem
 */
public int check(byte[] source, int position, int limit, byte[] type) {
  int p = position;
  byte state = FIRST;
  try {
    while (p < limit) {
      state = stateMap[state][source[p++] & 0xFF]; // mask in Java
      if (state < FIRST) { // VALID and INVALID are negative values
        type[0] = state;
        return p - position;
      }
    }
  } catch (ArrayIndexOutOfBoundsException e) {
    // fall through: running off the end of an array is reported as PARTIAL
  }
  type[0] = (state < FIRST) ? state : PARTIAL;
  return p - position;
}
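The check method above indexes a stateMap table that is not shown. One way such a table might be built from state elements is sketched below, using the MS-SJIS ranges from the example. The class, its methods, and the numeric values chosen for VALID, INVALID, and PARTIAL are assumptions made for illustration; the only constraints taken from the text are that FIRST is the starting state and that VALID and INVALID are negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ValiditySketch {
    // Assumed values: terminal states are negative, FIRST is row 0.
    static final byte VALID = -1, INVALID = -2, PARTIAL = -3, FIRST = 0;

    private final List<String> names = new ArrayList<>();
    private final List<byte[]> rows = new ArrayList<>();

    ValiditySketch() {
        stateIndex("FIRST"); // guarantees that FIRST is state 0
    }

    /** Returns the row index for a state name, creating an all-INVALID row if new. */
    private byte stateIndex(String name) {
        if (name.equals("VALID")) {
            return VALID;
        }
        int i = names.indexOf(name);
        if (i < 0) {
            byte[] row = new byte[256];
            Arrays.fill(row, INVALID);
            names.add(name);
            rows.add(row);
            i = names.size() - 1;
        }
        return (byte) i;
    }

    /** Mirrors one state element: bytes s..e in state "type" lead to state "next". */
    ValiditySketch state(String type, String next, int s, int e) {
        byte from = stateIndex(type);
        byte to = stateIndex(next);
        for (int b = s; b <= e; b++) {
            rows.get(from)[b] = to;
        }
        return this;
    }

    /** Classifies the first sequence in source as VALID, INVALID, or PARTIAL. */
    byte classify(byte[] source) {
        byte state = FIRST;
        for (byte b : source) {
            state = rows.get(state)[b & 0xFF];
            if (state < FIRST) {
                return state;
            }
        }
        return PARTIAL; // ran out of bytes in the middle of a sequence
    }

    /** The MS-SJIS validity example, transcribed row by row. */
    static ValiditySketch msSjis() {
        return new ValiditySketch()
                .state("FIRST", "VALID", 0x00, 0x80)
                .state("FIRST", "VALID", 0xA0, 0xDF)
                .state("FIRST", "VALID", 0xFD, 0xFF)
                .state("FIRST", "LAST", 0x81, 0x9F)
                .state("FIRST", "LAST", 0xE0, 0xFC)
                .state("LAST", "VALID", 0x40, 0x7E)
                .state("LAST", "VALID", 0x80, 0xFC);
    }
}
```

Any byte not covered by a state element for the current state stays INVALID, which matches the idea that the specification lists only what is legal.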

Error Conditions

The following describes conditions under which a validity specification is invalid.

2.6 Assignments

The main part of the table provides the assignments of mappings between byte sequences and Unicode characters. Here is an example:

 <assignments sub="A3">
  <!--Unassignments-->
  <a b="AA"/>
  <a b="AB"/>

  <!--Main mappings-->
  <a b="A1" u="FF61" c="｡" />
  <a b="A2" u="FF62" c="｢" />
  <a b="A3" u="FF63" c="｣" />
  <a b="A4" u="FF64" c="､" />
  <a b="81 41" u="3001" c="、" />
  <a b="81 42" u="3002" c="。" />
  <a b="81 43" u="FF0C" c="，" />
  <a b="81 44" u="FF0E" c="．" />

  <!--Fallbacks-->
  <fub u="00A1" b="21" ru="0021" c="¡" rc="!" />
  <fub u="00A2" b="81 91" ru="FFE0" c="¢" rc="￠" />
  <fub u="00A3" b="81 92" ru="FFE1" c="£" rc="￡" />
  <fub u="00A5" b="5C" ru="005C" c="¥" rc="\" />
  <fub u="00A6" b="7C" ru="007C" c="¦" rc="|" />
  <fub u="00A9" b="63" ru="0063" c="©" rc="c" />

 </assignments>

assignments contains a list of any number of a, fub, or fbu elements. It has one optional attribute sub, that specifies the replacement character used in the legacy character encoding. (U+FFFD REPLACEMENT CHARACTER is used in Unicode.) The value is a sequence of bytes, as described under b below. The default is the ASCII control value SUB = "1A".

a specifies a mapping from byte sequences to Unicode and back. It has the following attributes:

The element fub specifies a fallback mapping from Unicode to bytes, to be used if an API requests a "best effort". It has the same attributes as a, plus two additional optional attributes. These are provided for readability, and are not required.

The element fbu specifies a fallback mapping from bytes to Unicode, to be used if an API requests a "best effort". Normally this element is not required (or desired): byte sequences with no Unicode equivalent should be assigned to private use characters (E000..F8FF, F0000..FFFFD, 100000..10FFFD). This element has the same attributes as a.
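To illustrate why round-trip assignments and fallbacks should be kept apart, the Java sketch below stores them in separate maps and consults the fallbacks only when the caller asks for a best effort. All names are hypothetical, byte sequences are represented as the hex strings used in the b attribute, and for brevity toUnicode substitutes U+FFFD for anything not assigned, without distinguishing unassigned from illegal sequences.

```java
import java.util.HashMap;
import java.util.Map;

class AssignmentSketch {
    private final Map<String, Integer> bytesToUnicode = new HashMap<>();
    private final Map<Integer, String> unicodeToBytes = new HashMap<>();
    private final Map<Integer, String> fallbackToBytes = new HashMap<>();
    private final String sub; // replacement byte sequence in the legacy encoding

    AssignmentSketch(String sub) {
        this.sub = sub;
    }

    /** Mirrors an "a" element: a round-trip mapping in both directions. */
    void a(String b, int u) {
        bytesToUnicode.put(b, u);
        unicodeToBytes.put(u, b);
    }

    /** Mirrors a "fub" element: a one-way fallback from Unicode to bytes. */
    void fub(int u, String b) {
        fallbackToBytes.put(u, b);
    }

    /** Bytes to Unicode; U+FFFD REPLACEMENT CHARACTER when nothing is assigned. */
    int toUnicode(String b) {
        return bytesToUnicode.getOrDefault(b, 0xFFFD);
    }

    /** Unicode to bytes; fallbacks are used only when bestEffort is true. */
    String toBytes(int u, boolean bestEffort) {
        String exact = unicodeToBytes.get(u);
        if (exact != null) {
            return exact;
        }
        if (bestEffort && fallbackToBytes.containsKey(u)) {
            return fallbackToBytes.get(u);
        }
        return sub;
    }

    public static void main(String[] args) {
        AssignmentSketch table = new AssignmentSketch("1A");
        table.a("A1", 0xFF61);
        table.fub(0x00A9, "63");
        System.out.println(table.toBytes(0x00A9, false)); // substitution
        System.out.println(table.toBytes(0x00A9, true));  // fallback
    }
}
```

A caller that must preserve round-trip integrity, such as one that emits escape sequences instead, would pass bestEffort as false and handle the substitution case itself.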

Error Conditions

3 Samples

The following provide samples that illustrate features of the format.

3.1 Full Sample

A sample of a mapping table constructed programmatically is provided at http://oss.software.ibm.com/icu/charset. You can view it directly with Internet Explorer, which will interpret the XML.

3.2 UTF-8 Sample

While in practice a mapping file is never required for UTF-8, since it is algorithmically derived, it is instructive to see the use of the validity element in a complicated example. As a reminder, here first are the valid ranges for UTF-8:

Figure 2: UTF-8 Boundaries

Unicode Code Points   UTF-8 Code Units
00                    00
7F                    7F
80                    C2 80
7FF                   DF BF
800                   E0 A0 80
FFFF                  EF BF BF
010000                F0 90 80 80
10FFFF                F4 8F BF BF
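The boundary values in Figure 2 can be reproduced with the UTF-8 encoder built into the Java runtime. The helper below is a hypothetical convenience for checking the figure, not part of the mapping format.

```java
import java.nio.charset.StandardCharsets;

class Utf8BoundarySketch {
    /** Returns the UTF-8 code units of one code point as space-separated hex. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] boundaries = { 0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF };
        for (int cp : boundaries) {
            System.out.printf("%06X  %s%n", cp, utf8Hex(cp));
        }
    }
}
```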

3.2.1 Partial Validity Checks

Here is a simple version of the UTF-8 validity specification, with the shortest-form bounds checking and exact limit bounds checking omitted. This specification only checks the bounds for the first byte, and that there are the appropriate number (0, 1, 2, or 3) of following bytes in the right ranges. The single byte form does not need to be explicitly set; it is simply any single byte that neither is illegal nor requires additional bytes.

<validity>

 <!--Validity specification for UTF-8, partial boundary checks-->
 <state type="FIRST" next="VALID" s="00" e="7F"/>

 <!-- 2 byte form -->
 <state type="FIRST" s="C0" e="DF" next="final" />
 <state type="final" s="80" e="BF" next="VALID" />

 <!-- 3 byte form -->
 <state type="FIRST" s="E0" e="EF" next="prefinal" />
 <state type="prefinal" s="80" e="BF" next="final" />

 <!-- 4 byte form -->
 <state type="FIRST" s="F0" e="F4" next="preprefinal" />
 <state type="preprefinal" s="80" e="BF" next="prefinal" />

</validity>

3.2.2 Full Validity Checks

The following provides the full validity specification for UTF-8, as shown in Figure 2: UTF-8 Boundaries.

<validity>

 <!--Validity specification for UTF-8, full boundary checks-->
 <state type="FIRST" next="VALID" s="00" e="7F"/>

 <!-- 2 byte form -->
 <state type="FIRST" s="C2" e="DF" next="final" />
 <state type="final" s="80" e="BF" next="VALID"/>

 <!-- 3 byte form; Low range is special-->
 <state type="FIRST" s="E0"        next="prefinalLow" />
 <state type="prefinalLow" s="A0" e="BF" next="final" />

 <!-- 3 byte form, Normal -->
 <state type="FIRST" s="E1" e="EF" next="prefinal"  />
 <state type="prefinal"  s="80" e="BF" next="final" />

 <!-- 4 byte form, Low range is special -->
 <state type="FIRST" s="F0"        next="preprefinalLow" />
 <state type="preprefinalLow" s="90" e="BF" next="prefinal"/>

 <!-- 4 byte form, Normal -->
 <state type="FIRST" s="F1" e="F3" next="preprefinal"   />
 <state type="preprefinal" s="80" e="BF" next="prefinal" />

 <!-- 4 byte form, High range is special-->
 <state type="FIRST" s="F4"        next="preprefinalHigh" />
 <state type="preprefinalHigh" s="80" e="8F" next="prefinal"/>

</validity>
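For comparison, the same full boundary checks can be written directly in Java. The sketch below is a hand transcription of the ranges in the specification above, with hypothetical names; it adds no restrictions beyond those ranges. It shows the difference from the partial checks: a sequence such as C0 80 has a plausible shape but is rejected here because it is not in shortest form.

```java
class Utf8FullValiditySketch {
    private static boolean inRange(byte b, int lo, int hi) {
        int v = b & 0xFF;
        return v >= lo && v <= hi;
    }

    /** True if the array consists entirely of sequences allowed by the full checks. */
    static boolean isValid(byte[] s) {
        int i = 0;
        while (i < s.length) {
            int first = s[i] & 0xFF;
            int trail;                // number of bytes that must follow
            int lo = 0x80, hi = 0xBF; // bounds for the first following byte
            if (first <= 0x7F) {
                trail = 0;
            } else if (first >= 0xC2 && first <= 0xDF) {
                trail = 1;
            } else if (first == 0xE0) {
                trail = 2; lo = 0xA0; // low range is special
            } else if (first >= 0xE1 && first <= 0xEF) {
                trail = 2;
            } else if (first == 0xF0) {
                trail = 3; lo = 0x90; // low range is special
            } else if (first >= 0xF1 && first <= 0xF3) {
                trail = 3;
            } else if (first == 0xF4) {
                trail = 3; hi = 0x8F; // high range is special
            } else {
                return false;         // 80..C1 and F5..FF cannot start a sequence
            }
            if (i + trail >= s.length) {
                return false;         // truncated sequence
            }
            for (int k = 1; k <= trail; k++) {
                boolean ok = (k == 1) ? inRange(s[i + k], lo, hi)
                                      : inRange(s[i + k], 0x80, 0xBF);
                if (!ok) {
                    return false;
                }
            }
            i += trail + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[] { (byte) 0xC0, (byte) 0x80 }));
        System.out.println(isValid(new byte[] { (byte) 0xC2, (byte) 0x80 }));
    }
}
```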

Modification History

1.1 draft 1

The aliases and display names have been moved into a separate, centralized table. A sample is also provided. The syntax of the fallback assignments and of the validity specification has been simplified, and some of the identifiers changed for clarity. Pointers are provided to sample tables.

Acknowledgments

Thanks to Kent Karlsson, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, and Ken Whistler for their feedback on the document.


Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.