[December 04, 2000] Mark Davis posted an announcement for the publication of the Unicode Character Mapping Markup Language (CharMapML) as a full Technical Report. Reference: Unicode Technical Report #22, by Mark Davis (with contributions from Kent Karlsson, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, Markus Scherer, Peter Constable, Martin Duerst, Martin Hoskin, and Ken Whistler). This Unicode technical report "specifies an XML format for the interchange of mapping data for character encodings. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode, and a description of alias tables for the interchange of mapping table names." The Unicode Technical Committee "intends to continue development of this TR to also encompass complex mappings such as 2022 and glyph-based mappings."
Background: "The ability to seamlessly handle multiple character encodings is crucial in today's world, where a server may need to handle many different client character encodings covering many different markets. No matter how characters are represented, servers need to be able to process them appropriately. Unicode provides a common model and representation of characters for all the languages of the world. Because of this, Unicode is being adopted by more and more systems as the internal storage processing code. Rather than trying to maintain data in literally hundreds of different encodings, a program can translate the source data into Unicode on entry, process it as required, and translate it into a target character set on request. Even where Unicode is not used as a process code, it is often used as a pivot encoding. Data can be converted first to Unicode and then into the eventual target encoding. This requires only a hundred tables, rather than ten thousand.
"Whether or not Unicode is used, it is ever more vital to maintain the consistency of data across conversions between different character encodings. Because of the fluidity of data in a networked world, it is easy for it to be converted from, say, CP930 on a Windows platform, sent to a UNIX server as UTF-8, processed, and converted back to CP930 for representation on another client machine. This requires implementations to have identical mappings for a character encoding, no matter what platform they are working on. It also requires them to use the same name for the same encoding, and different names for different encodings. This is difficult to do unless there is a standard specification for the mappings so that it can be precisely determined what the encoding actually maps to. This technical report provides such a standard specification for the interchange of mapping data for character encodings. By using this specification, implementations can be assured of providing precisely the same mappings as other implementations on different platforms."
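The pivot approach described in the background text can be illustrated with a minimal Python sketch. It is not taken from the technical report: it relies on Python's built-in codec tables rather than CharMapML files, and the helper names (`convert_via_unicode`, `tables_needed`, `same_encoding`) are hypothetical. Because CP930 is not among Python's standard codecs, the example assumes CP1252 as the client encoding instead.

```python
import codecs


def convert_via_unicode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between two encodings using Unicode as the pivot."""
    text = data.decode(source)   # source encoding -> Unicode
    return text.encode(target)   # Unicode -> target encoding


def tables_needed(n: int) -> dict:
    """Rough table counts for n encodings: one per encoding with a pivot,
    about n * n if every pair of encodings had its own table."""
    return {"pivot": n, "pairwise": n * n}


def same_encoding(name_a: str, name_b: str) -> bool:
    """Check whether two names resolve to the same codec.
    This uses Python's own alias table, not a CharMapML alias file."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


# Round trip: client encoding -> UTF-8 on the server -> client encoding.
original = "café €".encode("cp1252")
on_server = convert_via_unicode(original, "cp1252", "utf-8")
returned = convert_via_unicode(on_server, "utf-8", "cp1252")
round_trip_ok = returned == original
```

The round trip preserves the data only because both conversions use the same CP1252 mapping table; two platforms with slightly different tables under the same name would not give that guarantee, which is the problem the report addresses. With `tables_needed(100)` the sketch reproduces the "hundred tables, rather than ten thousand" figure from the quoted text.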
The report references several related data files, including DTDs and example documents.
References:
Character Mapping Markup Language (CharMapML). Full Technical Report. Reference: Unicode Technical Report #22, by Mark Davis (with contributions from Kent Karlsson, Ken Borgendale, Bertrand Damiba, Mark Leisher, Tony Graham, Markus Scherer, Peter Constable, Martin Duerst, Martin Hoskin, and Ken Whistler). 2000-12-01. Latest version: http://www.unicode.org/unicode/reports/tr22/.
DTD file for the Character Mapping Data format (CharacterMapping.dtd)
DTD file for the Character Mapping Alias format (CharacterMappingAliases.dtd)
Sample mapping file (SampleMappings.xml)
Sample alias file (SampleAliases.xml)
Sample alias file #2 (SampleAliases2.xml)
See also: "XML and Unicode."