The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: July 08, 2005.

New Unicode Consortium Technical Report on Unicode Security Considerations.


Unicode Technical Report #36 on Unicode Security Considerations "describes some of the security considerations that programmers, system analysts, standards developers, and users should take into account [when using the Unicode Standard], and provides specific recommendations to reduce the risk of problems."

Unicode is the basis for XML: legal XML characters "are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646," and all XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1. The Extensible Markup Language (XML) 1.0 Third Edition specification, a W3C Recommendation, normatively references associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes) to provide "all the information necessary to understand XML Version 1.0 and construct computer programs to process it."
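As a rough illustration of what "legal XML characters" means in practice, the following Python sketch checks code points against the Char production of XML 1.0 (tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF). The function names are hypothetical and chosen only for this example; a real XML parser performs this check internally.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the Basic Multilingual Plane
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal_char(text: str):
    """Return the index of the first character XML 1.0 forbids, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None
```

A caller might run `first_illegal_char` over text before serializing it, since control characters such as U+0001 cannot appear in an XML 1.0 document even as character references.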

The Unicode Standard is "a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. It is a fundamental component of all modern software and information technology protocols. In addition, it supports classical and historical texts of many written languages. It provides a uniform, universal architecture and encoding (with over 96,000 characters currently encoded) and is the basis for processing, storage, and seamless data interchange of text data worldwide. Unicode is required by modern standards such as XML, Java, C#, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, IDN, etc., and is the official way to implement ISO/IEC 10646."

A number of security issues have arisen in connection with visual spoofing, and this threat provides the basis for UTR #36 from the Unicode Consortium. The new Unicode Security Considerations Technical Report "provides an initial step towards reducing the risk of such problems while preserving the ability to have internationalized domain names for all the modern languages of the world."

Security issues identified and addressed in the report include Internationalized Domain Names, Mixed-Script Spoofing, Single-Script Spoofing, Inadequate Rendering Support, Bidirectional Text Spoofing, Syntax Spoofing, and Numeric Spoofs.

In many ways, according to the TR introduction, "the use of Unicode makes programs much more robust and secure. When systems used a hodge-podge of different charsets for representing characters, there were security and corruption problems that resulted from differences between those charsets, or from the way in which programs converted to and from them. But because Unicode contains such a large number of characters, and because it incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks."

Visual spoofs "depend on the use of visually confusable strings: two different strings of Unicode characters whose appearance in common fonts in small sizes at screen resolutions is sufficiently close that people easily mistake one for the other." The report notes that the visual spoofing problem "is not new to Unicode: it was possible to spoof even with ASCII characters alone. For example, '' uses a capital I instead of an L, and a now infamous example involves '... Not only was '' very convincing, but the scam artist even goes one step further. He or she is apparently emailing PayPal customers, saying they have a large payment waiting for them in their account. The message then offers up a link, urging the recipient to claim the funds. But the URL that is displayed for the unwitting victim uses a capital 'i' (I), which looks just like a lowercase 'L' (l), in many computer fonts."
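The ASCII-only case described above can be sketched in a few lines of Python. The mapping table here is a deliberately tiny, hypothetical one written for this example (it is not taken from the Technical Report), and the label strings are illustrative: the idea is simply to fold characters that look alike in many fonts onto one representative, then compare the folded forms.

```python
# Hypothetical, deliberately tiny table of ASCII look-alikes.
# A production system would need a far more complete confusables list.
CONFUSABLE_FOLD = str.maketrans({"I": "l", "1": "l", "0": "o"})


def skeleton(label: str) -> str:
    """Fold look-alike characters first, then lowercase the result."""
    return label.translate(CONFUSABLE_FOLD).lower()


def looks_like(candidate: str, trusted: str) -> bool:
    """True if candidate is a different name that folds to the same skeleton."""
    different = candidate.lower() != trusted.lower()
    return different and skeleton(candidate) == skeleton(trusted)
```

With this sketch, `looks_like("paypaI", "paypal")` is true because the capital I folds onto lowercase l, while an unrelated label does not match. The fold has to run before lowercasing; otherwise the capital I would become a lowercase i and the match would be lost.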

The Technical Report's lead example of visual spoofing based upon Unicode involves the character 'c' in a citibank URL. Visual spoofing exploits a similarity in visual appearance which fools a user and causes him or her to take unsafe actions:

"... the user gets an email notification about an apparent problem in their citibank account. Security-savvy users realize that it might be a spoof; the HTML email might be presenting the URL visually, but might be hiding the real URL. They realize that even what shows up in the status bar might be a lie, since clever Javascript or ActiveX can work around that. And users may have these turned on unless they know to turn them off. They click on the link, and carefully examine the browser's address box to make sure that it is actually going to [the citibank URL]. They see that it is, and use their password. But what they saw was wrong: it is actually going to a spoof site with a fake [citibank URL], using the Cyrillic letter that looks precisely like a 'c'. They use the site without suspecting, and the password ends up compromised."
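The scenario above can be reproduced with Python's standard `unicodedata` module. The sketch below compares a Latin-only label with one whose first letter is U+0441 (CYRILLIC SMALL LETTER ES), and then applies a crude mixed-script check. Python's standard library does not expose the Unicode Script property, so the check assumes that the first word of each character's name ("LATIN", "CYRILLIC", and so on) is a usable stand-in for its script; that is a heuristic for illustration only, and the helper names are hypothetical.

```python
import unicodedata


def describe(text: str) -> list:
    """List each character as 'U+XXXX NAME' so look-alikes become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in text]


def scripts_in(label: str) -> set:
    """Approximate the scripts in a label from character names (heuristic)."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # Assumption: the first word of the name identifies the script.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    return len(scripts_in(label)) > 1


genuine = "citibank"        # all Latin letters
spoofed = "\u0441itibank"   # first letter is Cyrillic U+0441

print(genuine == spoofed)       # False, although the two render alike
print(describe(spoofed)[0])     # U+0441 CYRILLIC SMALL LETTER ES
print(is_mixed_script(spoofed)) # True
```

A check of this kind only catches mixed-script labels; a label written entirely in one script that happens to resemble another (single-script spoofing) would pass it.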

The authors of the Unicode Security Considerations Technical Report envision that the document "should grow over time, adding additional sections as needed. Initially, it is organized into two sections: visual security issues and non-visual security issues. Each section presents background information on the kinds of problems that can occur, then lists specific recommendations for reducing the risk of such problems." Revision 4 of UTR #36, underway as of 2005-07-08, contains modifications towards making the document a draft Unicode Technical Standard (UTS).

Reportedly, the growing controversy about security risks in internationalized domain names has given rise to debate, and some "tension between two extremes: those who want to withdraw back to ASCII and see security as a reason (or an excuse) to do so, and those who see no reason to distinguish among any characters, even where that can cause significant security risks." The topic of IDN/spoofing is to be discussed at a July 2005 ICANN meeting in Luxembourg.

Bibliographic Information

Unicode Security Considerations. By Mark Davis (IBM) and Michel Suignard (Microsoft). Unicode Technical Report #36. Revision 3 date: 2005-07-07.

About Unicode and XML

"The Unicode Standard defines the universal character set. Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its fourth major version, Unicode contains a large number of characters covering most of the currently used scripts in the world. It also contains additional characters for interoperability with older character encodings, and characters with control-like functions included primarily for reasons of providing unambiguous interpretation of plain text. Unicode provides specifications for use of all of these characters.

For document and data interchange, the Internet and the World Wide Web are more and more making use of marked-up text such as HTML and XML. In many instances, markup provides the same, or essentially similar features to those provided by format characters in the Unicode Standard for use in plain text. Another special character category provided by Unicode are compatibility characters. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. Formatting characters are discussed in [Unicode Standard] chapters 2 and 3, compatibility characters in chapter 4.

Issues resulting from canonical equivalences and Normalization as well as the interaction of character encoding and methods of escaping characters in markup are discussed in the Character Model for the World Wide Web.

The issues of using Unicode characters with marked-up text depend to some degree on the rules of the markup language in question and the set of elements it contains. In a narrow sense, this document concerns itself only with XML, and to some extent HTML. However, much of the general information presented here should be useful in a broader context, including some page layout languages..." [from the Introduction in Unicode in XML and Other Markup Languages]
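The canonical equivalence and escaping issues mentioned in the passage above can be shown with a small Python sketch. It uses the standard `unicodedata.normalize` and `html.unescape` functions; the helper name `canonically_equal` is hypothetical. The precomposed character U+00E9 and the sequence U+0065 U+0301 display identically but compare as different strings until both are normalized to the same form.

```python
import html
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                   # False: different code points
print(canonically_equal(composed, decomposed))  # True after normalization

# Numeric character references are another route to the same text:
# the escaped form has to be resolved before any comparison is meaningful.
from_markup = html.unescape("&#x65;&#x301;")
print(canonically_equal(from_markup, composed))  # True
```

The design point is that comparison should happen after both unescaping and normalization; skipping either step lets two visually identical strings be treated as distinct.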

Other XML-Related Unicode Technical Reports

  • Character Encoding Model. Unicode Technical Report #17. 2004-09-09. "This report describes a model for the structure of character encodings. The Unicode Character Encoding Model places the Unicode Standard in the context of other character encodings of all types, as well as existing models such as the character architecture promoted by the Internet Architecture Board (IAB) for use on the internet, or the Character Data Representation Architecture (CDRA) defined by IBM for organizing and cataloging its own vendor-specific array of character encodings... The mapping from a sequence of members of an abstract character repertoire to a serialized sequence of bytes is called a Character Map (CM). A simple character map thus implicitly includes a CCS, a CEF, and a CES, mapping from abstract characters to code units to bytes. A compound character map includes a compound CES, and thus includes more than one CCS and CEF. In that case, the abstract character repertoire for the character map is the union of the repertoires covered by the coded character sets involved. UTR-22 'Character Mapping Markup Language' defines an XML specification for representing the details of Character Maps."

  • Unicode in XML and Other Markup Languages. Unicode Technical Report #20. W3C Note 13 June 2003. Technical Report published jointly by the Unicode Technical Committee and by the W3C Internationalization Working Group/Interest Group in the context of the W3C Internationalization Activity. See the W3C version.

  • Character Mapping Markup Language (CharMapML). Unicode Technical Standard #22. "This document specifies an XML format for the interchange of mapping data for character encodings, and describes some of the issues connected with the use of character conversion. It provides a complete description for such mappings in terms of a defined mapping to and from Unicode, and a description of alias tables for the interchange of mapping table names..." See the local reference.

  • The Unicode Character Property Model. Unicode Technical Report #23. 2004-07-12. This report presents a conceptual model of character properties defined in the Unicode Standard.

  • Unicode Support for Mathematics. Unicode Technical Report #25. "Starting with version 3.2, Unicode includes virtually all of the standard characters used in mathematics. This set supports a variety of math applications on computers, including document presentation languages like TeX, math markup languages like W3C MathML and OpenMath, internal representations of mathematics in systems like Mathematica, Maple, and MathCAD, computer programs, and plain text. This technical report describes the Unicode mathematics character groups and gives some of their imputed default math properties..."

  • Locale Data Markup Language (LDML). Unicode Technical Standard #35. Version 1.3. 2005-06-02. This document describes an XML format (vocabulary) for the exchange of structured locale data. A locale [in this document] "is an ID that refers to a set of user preferences that tend to be shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for timezones, languages, countries, and scripts. They can also include text boundaries (character, word, line, and sentence), text transformations (including transliterations), and support for other services... There are many different equally valid ways in which data can be judged to be 'correct' for a particular locale. The goal for the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale. This document describes one of those pieces, an XML format for the communication of locale data. With it, for example, collation rules can be exchanged, allowing two implementations to exchange a specification of collation. Using the same specification, the two implementations will achieve the same results in comparing strings..." See "Unicode Consortium Hosts the Common Locale Data Repository (CLDR) Project" and "Unicode Releases Common Locale Data Repository, Version 1.3."
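The layering described in the Character Encoding Model entry above (abstract character to code point, code point to code units, code units to bytes) can be traced in Python for a single character. This is a minimal sketch: the function name `encoding_layers` and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are assumptions made for the example, and the CCS/CEF/CES labels in the comments are an informal reading of the report's terms.

```python
def encoding_layers(ch: str) -> dict:
    """Show one character at each layer from code point down to bytes."""
    utf16_be = ch.encode("utf-16-be")
    return {
        # Coded character set: abstract character -> code point
        "code_point": ord(ch),
        # Encoding form: code point -> 16-bit code units (UTF-16)
        "utf16_code_units": [
            int.from_bytes(utf16_be[i:i + 2], "big")
            for i in range(0, len(utf16_be), 2)
        ],
        # Encoding scheme: code units -> serialized bytes, where byte order matters
        "utf16_be_bytes": utf16_be,
        "utf16_le_bytes": ch.encode("utf-16-le"),
        # UTF-8 uses 8-bit code units, so form and scheme coincide
        "utf8_bytes": ch.encode("utf-8"),
    }


layers = encoding_layers("\U0001D11E")
print(hex(layers["code_point"]))                     # 0x1d11e
print([hex(u) for u in layers["utf16_code_units"]])  # ['0xd834', '0xdd1e']
print(layers["utf8_bytes"].hex())                    # f09d849e
```

The same code point yields two UTF-16 code units (a surrogate pair) and four UTF-8 bytes, which is why the report treats the encoding form and the byte serialization as separate layers.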

About the Unicode Standard Version 4.1

In March 2005, the Unicode Consortium announced the release of Version 4.1.0 of the Unicode Standard. This new version adds 1,273 new characters, including those necessary to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts (New Tai Lue, Buginese, Glagolitic, Coptic, Tifinagh, Syloti Nagri, Old Persian, Kharoshthi). There are additions for Biblical Hebrew and editorial marks for Biblical Text annotation. Unicode 4.1.0 adds two new Unicode Standard Annexes: UAX #31 (Identifier and Pattern Syntax) and UAX #34 (Unicode Named Character Sequences). Significant additions and changes have been made to the Unicode Character Database properties, which determine the behavior of characters in modern software. The release of Unicode 4.1 [was to be] soon followed by a new release of the Unicode Collation Algorithm, for language-sensitive sorting, searching, and matching; by Unicode Regular Expressions, setting the standard for handling Unicode characters in regular expressions; and by a new draft of Unicode Security Considerations.

About the Unicode Consortium

"The Unicode Consortium is a non-profit organization originally founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The Unicode Consortium actively develops standards in the area of internationalization including defining the behavior and relationships between Unicode characters. The Consortium cooperates with W3C and ISO and has liaison status "C" with ISO/IEC/JTC 1/SC2/WG2, which is responsible for refining the specification and expanding the character set of ISO/IEC 10646."

Full members of the Unicode Consortium (the highest level) are: Adobe Systems, L'Agence intergouvernementale de la Francophonie, Apple Computer, Government of India — Ministry of Information Technology, Government of Pakistan — National Language Authority, HP, IBM, Justsystem, Microsoft, Monotype Imaging, Oracle, RLG, SAP, Sun Microsystems, and Sybase. In addition, there are about 100 Supporting, Associate, Liaison, and Individual members.

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards

Robin Cover, Editor